Database migrations are a complicated topic. While easy solutions exist for simple migrations, trivial implementations can fail spectacularly in a real production environment, especially if you aren’t running the migration during a period when no one else is on the system. While we often think about our databases as simply data stores that are owned by our application, in many environments, that is not entirely true. In addition to our applications, our databases may be accessed by ETL processes that are loading data, reporting processes that are moving (and reshaping) data to place into another system for easier reporting, backup processes so that we don’t lose everything, auditing processes so that we meet compliance and security requirements, and even other applications.
While it is helpful during development to consider that your database is owned by your application, when you are dealing with devops and production considerations, you may be better off thinking about it as more like a microservice for data storage. Phrased this way, you can make more reasonable decisions about how to alter the database without taking down other parts of the system. Additionally, you may find that approaches that were perfectly acceptable when the database was owned by your application, are completely unworkable when it is an independent and critical part of a larger system. While the shortest distance between two points may be a straight line, a straight line that causes system disruption is not a path worth walking at all.
Whether you are doing automated database migrations or manual ones, there are issues you need to keep in mind when you are planning out just how a migration needs to occur. In particular, you will find that you are constantly working around things that may impact system uptime, responsiveness, and even possibly cause errors. Further, if you are an application developer who isn’t constantly messing around with the database, a lot of the issues that can occur will surprise you.
Database migrations are a fact of life in modern development environments. While modern tools have made them easier, there are still things that can cause you problems. However, if you are prepared for the kind of things that might go wrong during a production migration, you can often avoid them. It’s not that database stuff is difficult – it’s just that we often don’t have to think about it too much during the development process (or we don’t want to, because we want to iterate quickly). Being careful about these things can greatly reduce the risks of taking code to production, which means you can do it more often, on smaller, less risky updates.
Make sure you have a safety net.
If you don’t have automated, regularly tested backup procedures and provable ability to restore, you need to handle that first. You should be testing this stuff in some sort of staging environment. While migrations might run well on a developer machine, you should also make sure that they can run from whatever tool is used in your CI pipeline (if any) or in a database scripting environment.
Test your rollbacks as well, along with the functioning of the application AFTER a rollback. Make sure your database migrations are stored in source control, so that they can be correlated with the code that requires them. Should you need to roll back to an earlier version (or just look at application history for some reason), this will be critical.
Keep timeouts in mind
If an automated tool (such as your CI pipeline) will be running your database migrations, you need to be well aware of what kind of timeouts you’ll be dealing with. This is particularly important if your staging environment isn’t similar in size to your production environment.
A good way to estimate this is to see how long something takes in your staging environment, and then compare table row counts to your production environment. Multiply by the ratio between the two. That will tell your possible best case scenario. Also be aware that your production environment may be under load for various reasons when you run your migration, so its performance may be worse than staging. Alternatively, it may be on a more powerful server and may be better.
Make sure you know how your database locks rows during updates. You may need to change some operations, especially if you are doing this in a running production environment, just to keep your migration script from being chosen as a deadlock victim. Try to do the work inside the database instead of inside your ORM. It’s usually faster.
Watch out for after hours processes
If your migration is particularly intensive, your team may opt to schedule it to run after hours or during a period of low activity. While this limits the amount of stuff that can go wrong, it doesn’t adequately protect you from everything that can go wrong. Be especially careful of time periods where large, time-critical bulk processing (such as payment processing) is going on. Try to make sure that your migrations aren’t scheduled during this period.
Also be aware of backups. Depending on how backups are conducted in your organization, they can result in the database not being available or being under extreme load that causes timeouts. Try to avoid doing large migrations around the first or fifteenth of the month, around the end of a quarter, or just before the end of the company’s fiscal year. You don’t want to break something for some executive who is trying to do financial reporting.
Look for data loss when changing columns
Your database migration framework may put a lot of layers of abstraction between you and the database. Be very certain that you understand what it is doing when it changes columns in a table. Some frameworks (and underlying platforms) are bad about dropping columns before changing their datatype, leading to data loss.
If you find that your migration framework does drop columns in this case, it’s best to add a new column of the old type, copy your existing data into it, then do your change, then move the data back in. It may be wise to keep the old version of the data around for a while, for a variety of reasons, rather than just deleting it. You might also find that when you do this, that you need to disable triggers (that may delete/update records), indexes (that may want to rebuild) and constraints (that may fail as you change data) and then restore them afterward.
Watch out for index rebuilds
If, in addition to changing database schema during a migration, you are also updating large amounts of data, index-related issues may cause you a lot of issues. Simply changing millions of rows in a heavily indexed table can take a tremendous amount of server resources. Remember that unless you are operating during an outage window (maybe even then), that you (or the migration process) are probably not the only users of the server. Depending on who the other users are, it may be really important to avoid causing too many performance issues.
This also means that you need to be careful about how you drop and create indexes during a migration. For instance, if you are simply adding a field to an index, you might be better off creating a brand new index with the old fields and the new one, instead of dropping and creating the existing one. Then you can delete the old one later. You may also want to run a command to rebuild index statistics on the relevant tables if you’ve done something that might impact one of them.
Coordinate with app deploys
When you have an application running in production that is dependent on the old version of a table, you need to be careful about how you update to the new version. You may need to break the migration into multiple stages. A good example of this is the case that occurs when you to rename a column. Your first migration might make a new column and copy the existing data in (while possibly also adding a trigger so that updates to the old column go to the new). Then you update the app, then you deploy a migration that drops the old column (and the trigger).
The idea here is to avoid blowing up a running application while users are on the system, and then transparently switch over to the new system. You may also run into a situation where multiple versions of the application are running concurrently (perhaps due to a staged roll-out over using a load balancer), so you need to make sure that any relevant combination of database versions and app versions is still stable.
Try to have a fast rollback instead of recovering from a backup
If you completely screw up a migration, you might be tempted to simply restore the database. If it is small enough and there aren’t many people using the system, you might even get away with this. However, in a larger environment, this is a really bad idea.
Database restores can be extremely slow and the application may not be able to do anything while they are running. This is effectively unscheduled application downtime and can really irritate your users (which translates into irritating your boss). Most database migration frameworks, at least since the days of ruby on rails, have two sections. One section is for “Up” migrations, which move up in version, and the other is for “Down” migrations, which roll back to a previous version. Both should get equal attention during testing, so that you can rely on the “Down” migration if things go poorly.
This also means that the “Up” migration should probably never destroy data that you would hope to restore with a “Down” migration. Instead, you should keep the data until it is no longer needed, and then have a separate migration that deletes it in the “Up” migration.
Pay attention to triggers
Like indexes, triggers can also cause you a lot of problems during migration. Not only can they place extra load on the system, but they can even fail, especially if your update is something they weren’t written to expect. You might want to disable triggers during some updates, however, if they are used for audit trails, you may not be able to. These triggers may end up generating a lot of noise and/or errors. This can be even more interesting if the triggers are used to fire off other business logic.
Make sure any triggers that exist in production have their counterparts in the staging environment. If you are touching a table they are on, make sure you know why they are there, and what’s downstream of them. You’ll need to include those things in your testing plan.
Monitor performance after the update
After a migration (successful or rolled back), make sure that you are adequately monitoring the database for errors and performance issues. While you might not THINK that you did something that damaged performance, even fairly simple changes can cause issues. For instance, if you changed the size of a column that happened to be used in an index, at the very least you may need to update index statistics and you may actually need to drop and rebuild the index in some cases.
If you’ve added or altered large amounts of data, or if the application changed its access patterns as part of an update, you may need to delete (or update) existing indexes, or even add new ones. While you can try to avoid as much of this as possible through performance testing and monitoring in your staging environment, these situations are complex enough that some invariably slip through to production. Since you can’t perfectly mitigate them in a technological sense, you should mitigate them in a political sense by detecting and fixing them before they become a problem.
Do bulk data operations out of band if needed
While you can do many bulk operations inside of a database migration, you may not want to do so in some cases. For instance, if you are transforming a lot of data and projecting it into a new table structure with an expensive transformation, you may want to do it in a delayed manner. This may also need to occur if the bulk data update requires information from outside the database.
A good example of this might be a case where your database contains references to urls for some reason (user avatars) and you are moving the whole set of records over to a completely different cloud provider. While a bulk update might work, provided that it’s quick enough, you may want to schedule a job to do it out of band so that you can move things in small batches. This can also be useful in cases where you need to load a large amount of data from files – this operation should really not be part of a migration, but the migration may be needed to stage things so that it can work.
Tricks of the Trade
We all need help from time to time on a project, especially when doing something for the first time or when it’s new. Be respectful of the time you take from your coworkers, even leads. They have their own work to get done and are spending that time helping you. Don’t ask them to do things that you can do or to tell you where something is located when you haven’t looked in the code. They will know and will be less likely to help you in the future. Instead come prepared with a list of what you have tried, the results, any error messages you’ve received, and what you found on Google or Stack Overflow when you looked up the errors. If you are not familiar with a tool or technology, look up a tutorial on how to use it before stating that you can’t do something or expecting someone else to figure it out for you. Basically, take responsibility for yourself and your tasks. Be respectful of the time of those helping you and only seek help when you’ve exhausted all avenues on your own.