When building an application that simply serves a bunch of users, you can make a lot of assumptions. You might assume that they can share a certain amount of data, interact in certain ways, and use certain features of the application. However, if your clients are businesses or other organizations, you’ll quickly find that those organizations want to have their own little areas in your application and only want their users to interact with each other. Further, they are going to want assurances that their sensitive data is not going to be shared. It gets even more fun if any part of the system is exposed to your client’s clients.
Changing a system to support multiple organizations sounds simple, especially to the business people who aren’t having to deal with the consequences. You’ll hear the word “just” a lot when this happens. This episode is intended to make sure that you can appropriately advise stakeholders so that they understand that the word “just” shouldn’t be in their vocabulary. It’s also true that you may want to set up a multi-tenant system – if so, this episode includes a lot of information about things you should consider when doing so.
Properly designing systems for multiple clients is a lot more complex than simply adding multiple clients to a system. A lot of considerations come into play as more clients start using a system. If you don’t plan well enough, you’ll find yourself spending a lot of time and effort reacting to problems caused by having multiple clients in a single system. Marketing people and “vision people” never consider this stuff before deciding to go multi-tenant, so you’ll have to bring this up yourself if you are tasked with making a system multi-tenant.
09:05 What is multi-tenancy?
Wikipedia defines multi-tenancy as a software architecture in which a single instance of software runs on a server and handles multiple organizations. A tenant is defined as a group of users who share common access with specific privileges to the software instance. Within a multi-tenant architecture, a software application is designed to provide every tenant a dedicated share of the instance. All these definitions come from Wikipedia.
Multi-tenancy offers a few benefits. It can be cheaper to host than a multi-instance system. It can save money via economy of scale. You can data-mine across customers to find out about industry trends.
10:10 How to handle permissions
You are probably going to have a way for your clients to manage security for their own team members. This means that you’ll need all the typical infrastructure that is used to handle application permissions currently, but you’ll need to have a restricted version for clients. If there are financial consequences for adding/removing users, you’ll need to have business rules around that.
Most organizations are going to want both groups and the ability to use the groupings to secure assets. However, small organizations are going to want to secure things at the user level, and then change as they grow. This complicates the picture a lot, as it constitutes a change from “who owns this?” to “who can access this and in what way?”.
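The shift from “who owns this?” to “who can access this and in what way?” can be sketched as a small access check. This is only an illustration: the Asset class, the "user:" and "group:" prefixes, and the action names are all made up for the example, and a real system would keep grants in a database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    owner: str  # hypothetical rule: the owner can always do anything
    # Maps a principal ("user:bob" or "group:finance") to allowed actions.
    grants: dict = field(default_factory=dict)


def can_access(asset, user, user_groups, action):
    """Return True if the user may perform the action on the asset."""
    if user == asset.owner:
        return True
    # A user is checked both directly and through each group they belong to.
    principals = [f"user:{user}"] + [f"group:{g}" for g in user_groups]
    return any(action in asset.grants.get(p, set()) for p in principals)


report = Asset("q3-report", owner="alice")
report.grants["user:bob"] = {"read"}                # small org: per-user grant
report.grants["group:finance"] = {"read", "write"}  # larger org: group grant
```

A small organization can start with only per-user grants and add group grants as it grows, without the check itself changing.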
Organizations may want to share their assets with other organizations. This might include users at other organizations. They don’t want the stuff from the other organization rolling up in their reports (or vice versa). This can mean that users aren’t necessarily “owned” entirely by an organization.
They may want single sign-on with their own domain, website, etc. This gets really nasty really quickly when you have multiple ways to do this. It’s also fun when there is a major security issue and only half your clients have mitigated it. You also have to decide what happens when some other system that sign-on depends on doesn’t work.
You’d better believe you are audit trailing all this stuff. Audit trails will need to be visible to the tenants. Clients may be legally bound to make sure that unauthorized parties can’t get to their data. You’re on the hook if you said it was safe and it wasn’t, so you’ll want these even if the clients don’t. Their lawyers will. You need to be able to quickly separate audit trails by tenant as well, in case they are subpoenaed.
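One way to make audit trails easy to separate by tenant is to stamp every entry with a tenant identifier and index on it. The sketch below uses SQLite from the Python standard library purely for illustration; the table layout and the record and trail_for helpers are assumptions, not something from the episode.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY,
        tenant_id TEXT NOT NULL,
        actor TEXT NOT NULL,
        action TEXT NOT NULL,
        target TEXT NOT NULL,
        at TEXT NOT NULL
    )
    """
)
# Indexing on tenant_id keeps per-tenant extraction fast as the log grows.
conn.execute("CREATE INDEX idx_audit_tenant ON audit_log (tenant_id, at)")


def record(tenant_id, actor, action, target):
    """Append one audit entry for a tenant."""
    conn.execute(
        "INSERT INTO audit_log (tenant_id, actor, action, target, at) "
        "VALUES (?, ?, ?, ?, ?)",
        (tenant_id, actor, action, target,
         datetime.now(timezone.utc).isoformat()),
    )


def trail_for(tenant_id):
    """Return only the entries that belong to one tenant, oldest first."""
    rows = conn.execute(
        "SELECT actor, action, target FROM audit_log "
        "WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return rows.fetchall()


record("acme", "alice", "read", "invoice/17")
record("globex", "bob", "update", "user/4")
record("acme", "carol", "grant", "group/finance")
```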
17:10 An organization is a group of people at a point in time, or a period of time, not forever.
Organizations are constantly in flux. Companies are bought, sold, merged, dissolved, etc. all the time. You should be considering from the outset how you are going to handle things like mergers and acquisitions of your clients, because they will happen.
This doesn’t sound too bad, until you realize that a lot of mergers and acquisitions don’t happen all at once. They may want their acquired company to essentially be a sub-company, at least for a time. This means you now have a hierarchy in the mix, which makes security and permissions really interesting. You’re also going to have to have some means of dealing with situations where everybody with admin permission in one company got fired and you need to hand off to the other company.
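A rough sketch of what a hierarchy does to permission checks: an admin of a parent company can also administer the companies below it, which gives you a hand-off path when the acquired company has no admins left. The organization names and the rule that admin rights flow downward are assumptions made for the example; your clients may want something different.

```python
# Hypothetical data: each organization points at its parent, if any.
parents = {"acquired-co": "parent-co", "parent-co": None, "other-co": None}
admins = {"parent-co": {"pat"}, "acquired-co": set(), "other-co": {"olga"}}


def ancestors(org):
    """Yield the organization and every parent above it, guarding against cycles."""
    seen = set()
    while org is not None and org not in seen:
        seen.add(org)
        yield org
        org = parents.get(org)


def is_admin(user, org):
    """A user administers an org if they are an admin of it or of any ancestor."""
    return any(user in admins.get(o, set()) for o in ancestors(org))
```

Here "acquired-co" has no admins of its own, but "pat" from the parent company can still step in, while "olga" from an unrelated tenant cannot.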
Sometimes companies will sell off parts of themselves for various reasons. This may mean that you need to think about how historical reporting needs to change. This can also mean that some of the users may have membership in multiple organizations as the companies separate.
20:40 Some organizations (or people using their stuff, or who hate them) mis-behave.
Believe it or not, the internet can be a dangerous place for software to live. If your software handles any sensitive data at all, or looks like it might, somebody will be trying to break in. If you are using anything off-the-shelf in your stack, there are probably script-based attackers out there looking for vulnerabilities as well.
When an organization is attempting to use your software, they will screw up at some point. This can be as simple as a developer messing up a timer and flooding your system with requests. This is always a possibility when it takes more computing power to service a request than it requires to make a request. The above almost always applies to computing tasks done over a network – otherwise you’d do it yourself locally and save the latency.
Your customers will eventually hire someone who is a bad actor or suffer a breach. You should never assume a system is secure, especially someone else’s system. Someone working for one of your customers may try to illegally obtain data from another customer. Industrial espionage is a very real thing. Further, someone with bad intentions may well get a job with one of your good customers and simply abuse the system.
If your customer’s clients are going to be interacting with your system, they are going to eventually want to customize it. This may be as simple as adding logos and changing colors to match their own system. It can also mean branding for your customer’s clients. It may mean further customization to tie in with their processes, through web hooks, APIs, and reporting systems.
If customizations could potentially be extensive, this means you need to think about how your clients will roll out changes. They probably don’t want to “edit in production” either. If they are outsourcing the work, they may have regulatory compliance issues around interaction with live data (think of a bank outsourcing some design and integration work outside the country). They probably also need the ability to roll back their changes.
Their clients may want staged rollouts of changes to their own environments as well. It’s unlikely that all of your customer’s clients will be ok with the same outage window. This may mean that you need to support staged rollouts of customizations and/or feature flags per end client.
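Feature flags per end client can be sketched as a lookup where the most specific setting wins. The flag name, tenant names, and the in-memory FLAGS table below are hypothetical; a real system would load these from configuration or a database.

```python
# Keys are (flag, tenant, end_client); None means "applies to everyone at this level".
FLAGS = {
    ("new-checkout", None, None): False,           # global default: off
    ("new-checkout", "acme", None): True,          # on for the tenant "acme"
    ("new-checkout", "acme", "acme-retail"): False,  # held back for one end client
}


def flag_enabled(flag, tenant, end_client=None):
    """Resolve a flag from most specific scope to least specific."""
    for scope in ((tenant, end_client), (tenant, None), (None, None)):
        value = FLAGS.get((flag, *scope))
        if value is not None:
            return value
    return False  # unknown flags are treated as off
```

This lets one of your customer’s clients stay on the old behavior until their own outage window, while everyone else under the same customer moves ahead.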
27:15 Deployments/Releases, and outage windows.
You’ll also run into “fun” if you think you can take the whole system down while upgrading. If you have very many clients at all, this means you are either postponing an update until they can all tolerate it, forcing them all to tolerate it, or never updating. You are almost certainly going to want to set up mechanisms for rolling out changes while the live system is still running.
You also need to carefully consider what happens when different parts of the system are on different versions of the software. What happens in the middle of long-running jobs when the software/database/api version changes in the middle? You’ll probably want a message queueing architecture that allows you to keep long-running jobs from being started during a rollout.
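As a minimal sketch of that idea, the queue below holds back long-running jobs while a rollout is in progress and keeps handing out short ones. The JobQueue class and its rollout_in_progress switch are invented for the example; a real message queue would do this with separate queues or paused consumers.

```python
from collections import deque


class JobQueue:
    def __init__(self):
        self.pending = deque()
        self.rollout_in_progress = False  # flipped by the deployment process

    def submit(self, name, long_running=False):
        self.pending.append((name, long_running))

    def next_job(self):
        """Return the next job that is allowed to start now, or None."""
        for i, (name, long_running) in enumerate(self.pending):
            if long_running and self.rollout_in_progress:
                continue  # leave it queued until the rollout finishes
            del self.pending[i]
            return name
        return None
```

Jobs that were already running when the rollout started still need their own handling; this only stops new ones from starting.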
Version inequality tolerance is not one-way. You also need to consider what happens if a rollout fails and you have to roll back changes. You don’t want to lose data committed by a new version during a failed rollout. You can’t just restore a backed up version of the database if financially significant events occurred during the rollout.
You’re going to want to ignore spurious errors during a rollout, except when you don’t want to ignore them. You’ll have an elevated error reporting rate during many rollouts. You’ll want to ignore them if they aren’t relevant. However, you can’t just ignore them all, because the errors might point to a problem that means you need to roll back.
31:55 Data sharding.
Eventually, you’ll have a reason to break your data store into multiple data stores. This can occur because of the sheer volume of data, geographic distances between users, or even regulatory issues. This might also occur when you have clients who are simply (rightfully) paranoid.
Data sharding brings interesting issues into the mix when staged upgrades are occurring. Your assumptions about database schemas can really cause problems. You may have to break a single rollout into separate rollouts if the new version is incompatible with the old. For instance, you may want to add a new, non-nullable field to a central table in your database. You first add the field and make it nullable. Then you have to get data into every row. Then you can make it non-nullable. This gets interesting if the required data is client-provided, expensive/time-consuming to compute, etc. This also means that systems directly interacting with the databases must have graceful fallback if version incompatibility occurs.
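The three-step column change above can be sketched with SQLite from the Python standard library. The account table, the region column, and the "unknown" placeholder value are assumptions for the example; in practice the backfill data may have to come from the client or from an expensive computation. SQLite cannot change an existing column to NOT NULL in place, so the last step rebuilds the table, where other databases would use an ALTER statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO account (name) VALUES (?)",
    [("acme",), ("globex",), ("initech",)],
)

# Rollout 1: add the column as nullable so the old software version keeps working.
conn.execute("ALTER TABLE account ADD COLUMN region TEXT")


# Rollout 2: get data into every row, in small batches to limit lock time.
def backfill(batch_size=2):
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM account WHERE region IS NULL LIMIT ?", (batch_size,)
        )]
        if not ids:
            return
        conn.executemany(
            "UPDATE account SET region = ? WHERE id = ?",
            [("unknown", i) for i in ids],  # placeholder value for the sketch
        )
        conn.commit()


backfill()

# Rollout 3: enforce NOT NULL once every row has a value.
conn.executescript(
    """
    CREATE TABLE account_new (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        region TEXT NOT NULL
    );
    INSERT INTO account_new (id, name, region)
        SELECT id, name, region FROM account;
    DROP TABLE account;
    ALTER TABLE account_new RENAME TO account;
    """
)
```

With sharded data, each of these steps has to finish on every shard before the next one starts anywhere that depends on it.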
Some database changes may take a lot of time. Things like large index rebuilds, data migrations, etc. can take hours. Backups and restores can also take hours. You will find that many database migration operations are more complex than whatever required the migration.
36:45 Insider Security Threats
Because you are hosting access to your clients’ data, you also have to mitigate risks from inside your organization. This means audit trails and logging that cannot be disabled by most people. If the data is particularly sensitive, you’re going to want notifications to happen if something particularly sensitive is accessed. You are also going to want very stringent security checks for any data access. Developers shouldn’t have access to production as a regular business operation; it should require extreme circumstances.
Audit trails will have to be maintained for a very long time. Clients lie sometimes, especially when they find they have liability for a breach. Your system will need to be able to prove that no one inappropriately accessed a record, even if it was years ago. Audit trails are often huge, so you may be moving them to another system. You need to consider what happens to audit trails of particular tables when those tables change over time, especially when data is removed from those tables.
41:00 APIs, Webhooks, and bulk loading
If your clients are of any size, they are probably going to want to customize their interaction with your system. This means that you will need API endpoints for clients to use for querying your system. This can also mean webhooks for situations where the client wants to be notified that something happened in the system. They may also need means for bulk uploading/editing of data, especially during the early onboarding phases.
If your clients are using APIs or bulk operations, you need to be cautious of how they use your system. If a developer is anticipating how a system will be used, they tend to only anticipate how their code would use it. The load profile created by a client may be drastically different. You will want caching, rate limiting, etc. to limit the damage an API user can do to your running system.
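One common way to do per-client rate limiting is a token bucket per tenant, sketched below. The rate and burst numbers are arbitrary, and the current time is passed in explicitly to keep the example easy to test; a real implementation would typically use time.monotonic() and shared storage rather than a dictionary in one process.

```python
class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # tokens added per second
        self.burst = burst            # maximum tokens that can accumulate
        self.tokens = float(burst)    # start full so a new client is not blocked
        self.updated = 0.0

    def allow(self, now):
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # one bucket per tenant, so one client cannot starve the others


def allow_request(tenant, now, rate_per_sec=1.0, burst=2):
    bucket = buckets.setdefault(tenant, TokenBucket(rate_per_sec, burst))
    return bucket.allow(now)
```

A request that is refused would normally get an HTTP 429 response so that the client’s developer can see what went wrong.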
Webhooks have other issues. You’ll want to be careful how you call a client’s webhook(s) so that a slow system at their end doesn’t hurt throughput of calls to all your other clients. You need to handle errors and system outages at the client end with some sort of back-off and retry mechanism. If a client’s webhook fails for long enough, you have to stop trying. You need to be able to surface information about webhook failures to the client as well.
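The back-off, retry, and give-up behavior can be sketched as follows. The send and sleep functions are passed in as parameters, so nothing here is tied to a particular HTTP library; the delay values and attempt limit are arbitrary assumptions.

```python
def retry_delays(base=1.0, factor=2.0, max_delay=60.0, max_attempts=5):
    """Yield the wait in seconds before each retry, growing up to max_delay."""
    delay = base
    for _ in range(max_attempts):
        yield min(delay, max_delay)
        delay *= factor


def deliver(send, payload, sleep, **backoff):
    """Try once, then retry with back-off. Return True if any attempt succeeds."""
    if send(payload):
        return True
    for delay in retry_delays(**backoff):
        sleep(delay)
        if send(payload):
            return True
    # Out of attempts: stop trying and let the caller surface the failure
    # to the client.
    return False
```

To keep one slow client from hurting everyone else, calls like this would run on a per-client queue or worker with a short timeout, which the sketch leaves out.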
45:00 Reporting and data archival/export
Clients are also going to expect to be able to make business decisions based on their data that you are storing. This usually means that they are going to want to slice and dice the data themselves. If you try to predict what they are going to want, you’re going to be constantly adjusting it.
Reporting will need to happen in another system, full stop. Because of the types of data manipulation and the like that will be required in such a system, you don’t want clients doing it in the live system that is used by all your clients. You probably also don’t need up-to-the-minute data in your reporting system, so you can move it over at a slower rate than you would in the main system.
Clients are also going to want to be able to pull their data out of the system. Sometimes this is because they are leaving. It can also be because they want to perform analysis on the data on their own system, possibly by combining it with data from other systems. You need to make this easy to do, so that they don’t abuse your APIs to get this data.
48:15 Leaving Multi-tenancy
As some clients get larger, they are not going to want to be in a multi-tenant environment, for various reasons. Particularly large companies may simply want their data segregated from everyone else’s, to reduce risks in the event of a breach. There may also be regulatory or contractual obligations that require them to separate their systems from everyone else’s. Finally, they may just be trying to keep some measure of control over how system updates impact their business.
You should have a plan in place for what happens when a client leaves your system. This would include things like how long the system keeps their data, how they can download their data, and whether they have an option for hosting separately and how that happens.
How do clients come back in after leaving? A client may leave your multi-tenant environment, work on a dedicated system for a few years, and then decide to go back to the multi-tenant system due to cost or other considerations. Never assume that a client’s decision is permanent.
50:25 Third party service fun
Be careful how you use third party services from your multi-tenant system. Your clients can cause problems for everyone else. For instance, if you use a third party Email Service Provider (ESP), you need to consider what happens when a client starts sending a lot of spam. Similarly, with a multi-tenant system, you may find that your cost goes up significantly with third party services as you move from a “starter” plan to an “enterprise” plan. Rate limiting can also be an issue. If one client has a big job processing through a third party, it could slow down the processing of work for other clients.
You also need to be able to track utilization of 3rd party services. If your account is charged based on volume used, pricing needs to reflect this. While pricing discussions are not part of your job (probably), at some point management will come to you wanting to find out who is using third party services and how frequently. You should build this kind of tracking in from the start.
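Usage tracking can start out very small. The sketch below counts units per tenant and service in memory; the service names and the idea of a "unit" (an email, an API call) are assumptions, and a real system would persist these counts.

```python
from collections import Counter

usage = Counter()  # keyed by (tenant, service)


def record_usage(tenant, service, units=1):
    """Call this wherever the application hands work to a third party."""
    usage[(tenant, service)] += units


def usage_by_tenant(service):
    """Answer the question management will ask: who is using this, and how much?"""
    return {t: n for (t, s), n in usage.items() if s == service}


record_usage("acme", "email", 500)
record_usage("globex", "email", 20)
record_usage("acme", "geocoding", 3)
record_usage("acme", "email", 250)
```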
Third party services can also cause issues with your system. If the service is down, for instance, it may impact your application. You should always be able to turn third party services off quickly and your application should be written so that it can run with reduced functionality until they come back on. You should also be careful about the data that you send to third party services, in case they get breached.
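Here is a minimal sketch of an off switch with reduced functionality, using email as the example. The disabled_services set, the outbox list, and the send callable are all hypothetical stand-ins; the point is only that the application keeps working and holds the message for later instead of failing.

```python
disabled_services = set()  # an operator adds "email" here to turn the service off
outbox = []                # messages held until the provider is usable again


def send_email(send, message):
    """Send through the provider unless it is switched off or failing."""
    if "email" in disabled_services:
        outbox.append(message)
        return "queued"
    try:
        send(message)
        return "sent"
    except Exception:
        # Provider outage: degrade gracefully rather than breaking the request.
        outbox.append(message)
        return "queued"
```

When the provider comes back, a separate job would drain the outbox.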
Lucie Labs has built an IoT wristband for concertgoers and organizers. For attendees, the band pulses with the rhythm of the music and their movement, enhancing the concert experience. This is great for raves or dance parties with lots of EDM or techno. For the producers and organizers, it collects and sends data about the crowd and how they are reacting to the concert. This could mean changes in how concerts are put together or even new forms of art created from the concert data. There are lots of possibilities with this technology.
Tricks of the Trade
Regular nature probably isn’t going to eat you. Human nature absolutely will, so you’d better take it into account.
There are some clicks throughout the episode. We’re working on preventing them and improving the overall audio quality.