Database Resilience

Your application’s data is probably one of the most valuable and longest-lived pieces of your application. When your application goes to the big bit bucket in the sky, its data will likely live on, perhaps being migrated into another system that is, itself, waiting for its own eventual obsolescence. Such valuable data is unlikely to remain the sole property of your application for very long, no matter much you try to keep other people in your company out of it. As a result, if you want your application to be stable, one of the main sources of problems you’ll encounter is your own database.

Further, because databases hold such critical data and because they are the “furthest in” component in most application models, it’s much harder to scale them effectively and safely. A small issue in a database can ripple outward, potentially crippling multiple applications. Besides the ordinary worries of things like network issues and hardware failures, the very way that databases tend to get used in modern apps can cause scaling issues. Making things worse, many database stability and resiliency issues don’t make themselves apparent until the system is already under load, making it difficult to find problems before they occur. And many developers don’t do a good job of making sure that the their code doesn’t cause problems at the database level. Finally, the causes of many database issues is transient and involves the interactions of multiple disparate systems, and often includes timing considerations.

However, there is some hope in all this. There are some realistic things you can do that will improve the stability of your database and interactions with it. These things will not only make it more likely that the database remains in a working state during your application’s lifecycle, but they also make it more likely that your application (which depends on its database) will not be as easily broken by another client of the same database. Most of these suggestions can be simply summarized as “do less stuff on the main database server”, but like most summaries, it’s not enough to act on.

Databases are the most brittle part of many applications. Not only do they contain the most critical data for a system, but load management strategies for databases are always more complex than simply making another copy of data somewhere. A broken database can also break a lot of things – even if you think your application is the only one that will use your database, you’ll probably find that this statement doesn’t hold true over time. The database is often the easiest, but not the most sustainable, place to hook into the workings of an existing application, especially if one wants to avoid asking the team about it. As a result, having a resilient and robust database implementation is critical to application stability over the long term. Database resiliency covers a lot more areas besides “just” system administration tasks and often has implications that reach into code as well.

Episode Breakdown

Keep schemas backward-compatible with a couple of versions of deployed applications

While it can be tempting to do things like deleting columns in a single state, it can wreak havoc with application stability. You’ll want to update the applications that use a column to no longer do so, WELL BEFORE deleting the column. While columns are a simple example of this, other parts of your database also matter in this regard. For instance, deleting or changing an index can have similar implications. While nothing will explicitly break, removing an index that an application needs can significantly degrade database performance and is best avoided until the index is no longer in use.

You should also be cautious when adding columns (especially required ones), changing column types, or renaming anything. The “old” version of an application that accesses a database should be able to continue working until a new one is deployed and verified. This can get hairy quickly, if multiple applications are accessing the same database, because one application may need a change, while other applications are not yet ready to deal with a change.

Be redundant everywhere

It goes without saying that a single point of failure is simply an invitation for failure to drop by. This is often especially true of database systems used in applications. For many applications, if the database server goes down, there is no application.

At a minimum, you should have a failover system available for your database and have data replicated to it if you are doing anything critical in production. You should not rely on restoring from a backup unless you are comfortable with hours of downtime. This also means redundant storage and other infrastructure that supports the database if you are hosting it yourself. While you probably are better off in many cases hosting your database on the cloud, that’s meaningless if a router failure makes your database unavailable.

Limit connection and transaction lifetimes and use realistic timeouts

In general, database connections should be as short-lived as possible (but no shorter, obviously). Long-lived connections that aren’t doing anything are a waste of resources and a potential source of problems. You should also have the same approach regarding database transactions. When you leave a transaction open for a long time while processing a lot of data, it is going continue to eat up resources over time. This can also cause you to keep tables locked far longer than intended (depending on your database of choice).

Shorter connection timeouts are also helpful. If your database is accessed by multiple applications, a shorter timeout will ensure that pathological applications will feel the pain of the issues they cause, rather than other, uninvolved connections. It can be tempting to simply increase timeout length to avoid timeouts, but this often causes even more problems in other parts of the system. It can be done short term if you need to do so, but otherwise you should consider breaking up the work into logical chunks, handling it in a batch process during a time of low utilization, or doing the work on a system other than your OLTP database.

Limit data transmission by paging, etc.

With modern tooling (especially tools that generate sql from another language), it’s very easy for an unwitting developer to pull back a lot more data than they intended to. Such mistakes tend not to show up until the code is in production, resulting in unexpected spikes in load (and timeouts for others). This can happen via pulling back too many rows or via pulling back too many columns. While processes within the database (such mass data updates) may arguably need access to all the data from a table or set of tables at once, this is unlikely to be true for anything that is pulling data across a wire. Even large data exports can be paged – if they can’t be, then you are insane to be running them from a production system.

If you are pulling a lot of data, it becomes a “chunky” call. These often have a lot of overhead going across the wire and can chew up a lot of memory on whatever is using them. However, be aware that the other extreme, so-called “chatty” calls are also a problem. There is usually a “sweet spot” between pulling everything back and pulling one record at a time. If performance becomes enough of an issue, you may need to actually load test and tune this. In practice, you can often get by for a good while with just guesswork.

Limit upstream load via queueing and by limiting concurrent connections

When processes get overloaded with work, the first thing to break is usually the database. Because of both licensing costs and the inherently central role that a database plays in most modern applications, the database is usually the worst-scaling part of any system. As a result, if you want your database to remain up, even if the system is under load, you’re going to need to mitigate this issue. Services that are calling into the database need to have their incoming workload managed by a message queue where possible, to keep a load spike from spreading to the database (and from there, the rest of the system).

While you may (or may not) be able to get upstream systems to limit their incoming workload with a queue, you may also want to limit how many concurrent connections you will allow those systems to have. While the error messages that can result from this are not nice, they do keep a poorly configured application from bringing your database down with too many connections.

Essentially then, your client applications can either implement the “nice” thing that keeps them from overloading the system, or they can get the “nasty” thing that breaks their app if it isn’t a good citizen. While you’ll get a lot of resistance on this one at first, it will eventually shape the behavior of your system’s users so that they don’t break the database as often in production.

Keep reporting concerns out of your OLTP system

When you first build a system, the database at the core of the system is the logical place from which to pull reporting data. However, as a system matures and scales, it becomes more and more critical to stop putting this functionality in the main online transaction processing system. Reports are unlike most normal system load in that they can often be of arbitrary complexity and often join quite a few tables. Reports also tend to become more complex and tend to be over ever-increasing sets of data as time goes on.

Reports are also more likely to be run when the system is already under load. Peak time for reporting is also peak time for the things you are reporting on. Reports also tend to be run in a “bursty” fashion. Certain days of the week, days of the month, and days of the year will see large increases in report requests and those can often correspond with large bursts of traffic to the application, especially if your app touches financial considerations.

Use caching where you can

One thing that can really help with database resilience is to actually limit the use of the database. Caching the results of common queries relieves the load on the database for frequently used and relatively static data. This caching should be multi-layered, possibly using the user’s machine, the web server cache, and a caching service to handle different things. For instance, the user machine might cache the results of user-specific queries against the system, while the web server might cache the result of frequently changing, but user-agnostic data. The cache server might store larger, more complex chunks of data that are more expensive to create.

Bear in mind that if you cache a lot of things in a non-durable cache, that you may need to rebuild them after deployments or as load increases first thing in the morning. If not managed well, this can load spike your database, even though your intent was to avoid a load spike. From an architecture perspective, you should also consider whether you need up-to-the-minute updates on frequently used data. Even allowing for a slight delay can significantly reduce your system load. After all, a read replica of a database is very cache-like (but costs cash, as it were).

Tricks of the Trade

Many of the points we talked about have to do with limiting or keeping an eye on the amount of something. This concept applies to not only data transactions but most things in our lives. Moderation is the key to success is many areas. A lot of times you can go to an extreme on either side and that will be bad. For example with weight loss, eat too little and you risk undernourishment and even developing eating disorders but too much and you have a plethora of health issues. The same applies to exercise, politics, and many other areas. Think of these like a road with large ditches on either side, go too far one way or the other and you are off the road in the ditch.

Tagged with: , , ,