Deployment Resilience

Deployment is probably one of the most painful parts of the whole software process. Deployment considerations have always been a can of worms, and they have shaped the way software is designed, developed, and used to a degree that is hard to overestimate. In fact, the success of modern web applications is often attributed to their “relatively simpler” deployment model when compared to traditional desktop and mobile applications. That seems to hold even before you account for cross-platform support. Deployment is often one of the first places where an application runs into serious problems with load, security, configuration, logging, monitoring, and backward compatibility.

Making your deployment process resilient is not easy. Not only do you have to potentially make changes to a running application while people are using it, but you have to make sure that you can quickly recover from mistakes while trying to make things fast. A resilient deployment process can best be characterized as one where the users have no idea that anything is even going on with the system during most deployments. Additionally, a resilient deployment system does not have notably degraded performance during a deployment, can be rolled back to a previous state with as little work as possible, and can easily be verified for correctness. It’s a tall order, and is likely to be neglected by the people in charge in favor of rolling out yet another feature. But your deployment process is as much a part of your software as any other feature – after all, if the users are thwarted while trying to do their work, that new feature isn’t going to matter.

Deployments can be one of the most challenging parts of writing software. To do them well, you have to walk a tightrope between consistency in application behavior and avoiding application downtime. Depending on what your software does, you will find that you err on one side or the other. However, regardless of which position is better for your organization, getting your deployments right and making them resilient will make it much easier to deliver software to your end users with a minimum amount of difficulty.

Episode Breakdown

Limit the surface area of deployments

Your source control architecture should not determine how your deployments are structured. You should deploy the minimum amount of stuff and avoid redeploying things that have already been deployed. Besides making the deployment faster, it also limits the number of things that can go wrong, and makes it faster to roll back to a previous state.

Towards this end, you should also be limiting what is being built upstream of a deployment (if your deployment starts when a build is completed). This makes rebuilds and subsequent redeployments quicker, making recovery easier if you have to “roll forward” (which we’d almost never recommend). When you reduce the amount of stuff you deploy, it also makes troubleshooting and sanity checks easier, because there are fewer variables to consider. This allows you to validate a deployed build for correctness more quickly.
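
One way to keep the deployed surface small is to work out which components a change actually touches and build and deploy only those. The sketch below assumes a repository where each deployable component lives under its own directory; the directory names and the COMPONENT_ROOTS mapping are hypothetical and only there for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from source directories to deployable components.
COMPONENT_ROOTS = {
    "services/api": "api",
    "services/worker": "worker",
    "web": "frontend",
}


def components_to_deploy(changed_files):
    """Return the set of components touched by the given changed file paths."""
    affected = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        for root, component in COMPONENT_ROOTS.items():
            root_parts = PurePosixPath(root).parts
            # Compare whole path segments so "webhooks/" does not match "web".
            if parts[:len(root_parts)] == root_parts:
                affected.add(component)
    return affected
```

Files that belong to no component (a top-level README, for example) produce an empty set, which a pipeline could treat as "nothing to deploy".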

Deploy in parallel and cut over when ready

In a production application, especially one that other people rely on, you should do what you can to limit outages due to maintenance windows. However, if you are deploying over the top of an existing deployment, downtime is unavoidable for the duration of the deployment. While speeding up your deployments can certainly help, it’s unlikely that you can speed them up enough to completely avoid all problems. As a result, it’s often better to deploy to a fresh environment and then redirect the existing traffic to it when complete.

This approach also means that the old environment stays untouched while you verify that the new environment doesn’t have issues. While this doesn’t solve things like database migration rollbacks, it does help you focus on the things that truly require manual intervention. If your application has a long “spin-up” cycle, this also allows you to get the new environment warmed up (possibly as part of the deployment) before sending traffic to it. This can keep users from being inconvenienced by a slow system during the spin up phase of a deployment. This also has the nice side effect that it forces you to design your upgrade path for backward compatibility, because you’ll break the system during deploy if you don’t.
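
The parallel-deploy approach described above can be sketched roughly as follows. This is a minimal illustration, not a real load balancer integration: the Router class, the cut_over function, and the is_healthy and warm_up callbacks are all hypothetical stand-ins for whatever your infrastructure provides.

```python
import time


class Router:
    """Stand-in for a load balancer or DNS entry pointing at one environment."""

    def __init__(self, active):
        self.active = active


def cut_over(router, new_env, is_healthy, warm_up, attempts=5, delay_seconds=0.0):
    """Warm up new_env, then switch traffic to it once it reports healthy.

    Returns the previously active environment so it can be kept for rollback.
    Raises RuntimeError, leaving traffic untouched, if new_env never gets healthy.
    """
    warm_up(new_env)  # e.g. prime caches before any user traffic arrives
    for _ in range(attempts):
        if is_healthy(new_env):
            previous = router.active
            router.active = new_env
            return previous
        time.sleep(delay_seconds)
    raise RuntimeError(f"{new_env} did not become healthy; traffic left on {router.active}")
```

Because the old environment is returned rather than torn down, rolling back is just pointing the router at it again.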

Limit load during application spinup

It’s common during application spin up to do a lot of things to improve run time performance, or just to get the application ready to go. This can include actions like pre-loading caches, migrating databases to the latest version, wiring up dependency injection, dynamically building images and other static assets, and sometimes even generating code (ugh). While some things simply have to get done while an application is starting up, the more work you can delay, the less risk it constitutes to deployment. Remember that from the time a deployment begins until it completes (with all old assets cleaned up), the predictability of system state is lower than normal. That’s risk.

Additionally, you’ll want to make sure that the spinup of an application doesn’t create an undue amount of load for other parts of a system. If your system creates excessive load during spin up, it makes it far more risky to deploy during periods of high load. This can put you in a bad spot when you discover that you need to deploy to fix something just before a period of high utilization.
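
One simple way to delay startup work is to load expensive resources the first time they are needed rather than when the process starts. The sketch below uses Python's functools.cached_property for that; the ReportService class and its load_templates callback are hypothetical names used only for illustration.

```python
import functools


class ReportService:
    """Defers an expensive load until first use instead of doing it at startup."""

    def __init__(self, load_templates):
        # Store the loader; nothing expensive happens during construction.
        self._load_templates = load_templates

    @functools.cached_property
    def templates(self):
        # Runs once, on first access; the result is cached on the instance.
        return self._load_templates()
```

The trade-off is that the first request to touch the resource pays the cost, so this fits best for work that is not needed by every request.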

Mind your database migrations

It’s common to do database migrations during application spin up. This can be a really bad idea for a number of reasons. First, if you are spinning up multiple instances, it can be hard to coordinate. Database migrations can also require an excessive amount of time to complete, especially if you are building indexes or moving a lot of data around. You may run into failures due to time limits in your migration framework or you may get hurt by excessively long application spin ups if you do this.

Development and QA versions of databases often have limited amounts of data, lower load, and lack the kind of “data weirdness” that plagues every long-running production application. As a result, the things that can go wrong during a migration are often not found until that migration goes to production. If a migration alters a lot of data, changes indexes, or adds additional constraints, it can also mean that database index statistics will be out of date for a while. This can absolutely nail you on system performance, especially when combined with load spikes during application spin up.
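
One alternative to migrating during spin up is to run migrations as an explicit, idempotent step before any application instance starts. The sketch below assumes that approach and uses SQLite from the standard library purely for illustration; the MIGRATIONS list, the schema_migrations table name, and the apply_pending function are all hypothetical.

```python
import sqlite3

# Hypothetical ordered list of (version, SQL) migrations.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]


def apply_pending(conn, migrations=MIGRATIONS):
    """Apply migrations not yet recorded; return the versions that were run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    ran = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # already applied by an earlier run
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        # Commit after each migration so a later failure leaves earlier ones recorded.
        conn.commit()
        ran.append(version)
    return ran
```

Running the step a second time is a no-op, which makes it safe to retry. It does not by itself stop two deploy jobs from racing each other; that would need a lock provided by your database or pipeline.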

Front-load CDN changes

If your application has a fair bit of scale, odds are good that you are using a content delivery network (CDN) to deliver static assets such as images, JavaScript files, CSS files, and the like to clients. Doing this takes load off your application server and improves application performance. If you are refreshing a lot of data in the CDN when deploying your application, you’ll want to make sure that this happens before anyone has occasion to access that data, as you will break your application if the data isn’t there.

While for many static assets, this can be resolved simply by putting CDN content out in parallel, it does mean that you want to make sure this has happened before the new version goes live. If you are doing mass updates to CDN content that require code to execute, then you’ll probably want to do that work out of band and gate it with a feature flag that can be toggled when the work completes. This way it doesn’t hold up the deployment and doesn’t break the system in the interim.
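
A common way to publish CDN content in parallel is to give each asset a name derived from its content, so old and new versions can coexist, and then check that every new asset is present before going live. The sketch below assumes that scheme; fingerprinted_name, missing_from_cdn, and the cdn_has callback are hypothetical names, and a real check would call your CDN provider's API.

```python
import hashlib


def fingerprinted_name(filename, content):
    """Insert a short content hash into the filename, e.g. app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension: append the hash at the end.
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{ext}"


def missing_from_cdn(assets, cdn_has):
    """Return fingerprinted names that the CDN does not yet serve.

    assets maps filename -> bytes; cdn_has is a callable taking a name.
    """
    names = (fingerprinted_name(name, content) for name, content in assets.items())
    return [name for name in names if not cdn_has(name)]
```

A deployment could refuse to cut over until missing_from_cdn returns an empty list.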

Use feature flags to limit risk

If you are changing application functionality (which we assume you are doing, since you are deploying), you should strongly consider rolling out changes behind a feature flag and slowly turning them on for your users. This allows you to make sure that new features work well in production (including under production load) before changes are inflicted on the entire userbase. It also makes it easier to “roll back”, because that potentially only requires changing a feature flag rather than rolling back a deployment.

If your feature flags are configurable per user (or group), this also means that longer-running post-deployment processes can be configured to occur before the feature flag is switched over. This is an alternative to deploying the migration code, migrating, and then deploying an update that switches everyone over at once. Feature flags can also be used to inform other parts of the system that certain functionality is currently offline. This can be used to display warnings of degraded performance, or even to keep certain operations from starting during a deployment. Used sparingly, it can significantly improve deployment resilience.
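
A gradual rollout like the one described above is often implemented by hashing the user into a stable bucket and comparing it against a percentage. The sketch below is one minimal way to do that; the is_enabled function, the flag name, and the 100-bucket scheme are assumptions for illustration rather than any particular feature flag product's API.

```python
import hashlib


def is_enabled(flag, user_id, rollout_percent):
    """Return True if the flag is on for this user at the given rollout percentage.

    The same user always lands in the same bucket for a given flag, so
    raising the percentage only ever adds users and never removes them.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < rollout_percent
```

Setting the percentage back to 0 turns the feature off for everyone without a redeploy, which is the cheap "roll back" path mentioned above.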

Cache bust and pre-rebuild cache before switchover

If data (or the way it is structured) changes between application versions, you’ll want to make sure that your updated version cache-busts. In other words, it needs a new cache key for anything that has changed. Depending on how expensive it is to regenerate the data, this could be as simple as having the application version be used when composing the cache key.
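
As a minimal sketch of composing the cache key from the application version, something like the following would do. The APP_VERSION value, the cache_key function, and the key layout are hypothetical.

```python
APP_VERSION = "2.4.0"  # hypothetical; typically injected at build time


def cache_key(namespace, identifier, version=APP_VERSION):
    """Build a cache key that changes whenever the application version changes."""
    return f"{namespace}:v{version}:{identifier}"
```

Because the new version reads and writes under different keys, it never sees entries written in the old format, and the old entries simply expire on their own.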

In other cases, where it is more expensive to build the data that is in cache, your “cache” may more closely resemble a document database, and you probably want the version of that document to vary independently from the version of the application. In many applications, the buildup of cache happens during application startup. While this can be “good enough”, you should be careful that this only happens for cached data that rarely changes and has a long lifetime; otherwise a slow spinup could mean that things have fallen out of cache by the time they are needed.

Have appropriate monitoring and notifications

While it’s great to have a robust, resilient deployment system, you still need to have a mechanism for knowing when something goes wrong. This requires things like application performance monitoring and notifications when errors occur. You also need to make sure that these notification channels aren’t noisy from spurious errors. If they are, whoever is monitoring them will get notification fatigue and you won’t know when something really goes wrong. Make sure that the metadata on your logging and notifications is sufficient to let someone quickly filter this data. Better yet, have filtered views of the data in place so that it is possible to troubleshoot without “digging” into another system.
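
One way to cut down on notification noise is to suppress repeats of the same error within a time window. The sketch below assumes that approach; the NotificationThrottle class and the idea of an error_key (for example, service name plus exception type) are hypothetical, and most monitoring tools offer their own equivalent.

```python
import time


class NotificationThrottle:
    """Allows one notification per error key per time window."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_sent = {}

    def should_send(self, error_key):
        """Return True if a notification for error_key should go out now."""
        now = self._clock()
        last = self._last_sent.get(error_key)
        if last is not None and now - last < self.window_seconds:
            return False  # same error was already reported recently
        self._last_sent[error_key] = now
        return True
```

Distinct errors still get through immediately, so a new failure introduced by a deployment is not hidden behind a flood of known ones.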

Tricks of the Trade

Make a plan for complex things. Having a plan for the complex things in life will help you to get what you want done and focus on what is important to you.
