Database Network Performance

Database performance issues are the scourge of any large application, especially as load increases. They can be hard to understand, happen at times that aren’t particularly predictable, and difficult to replicate in a development or QA environment. This can make them very difficult to mitigate. The problem is compounded by the fact that many database performance issues are actually side effects of so-called “proper” code reuse and abstraction. As such, fixing them in a codebase can quickly become a problem on a team; many people have read books on the “right” way to do things, but don’t understand when those rules don’t apply.

Given this, the process of troubleshooting database network performance issues in your application is really the process of piercing through a pile of abstractions to whatever depth is necessary to see what is going on. Really shallow bugs may be as obvious as a query that brings back thousands of records and doesn’t need to. Deep bugs may force you to have to dig deeply into how your ORM builds queries and how data is actually transmitted over the wire. Fixing these bugs may be as simple as caching frequently used data for a few minutes instead of constantly retrieving it from the database, while more complex issues may require major reworking of your codebase. The latter can be especially fun, as it often has to be accomplished piecemeal while still allowing feature development to continue.

Database networking issues can often be fairly counter-intuitive to solve. While we like to believe that we can know what our code is doing by simply reading it, the truth is that we live atop a huge pile of abstractions that we probably didn’t create. And it gets even more interesting when our own abstractions work against us, or when our coworkers (or ourselves in six months) slightly misinterpret an abstraction and create unintended side effects in application performance. Worse still, there are operations and infrastructure issues that can manifest in slow application performance, while appearing to be problems with our application.

While database network performance issues are complex, many of them can be traced down to incorrect usage of underlying infrastructure. In essence, these are physics problems manifesting as programming problems. Many of these problems occur as a result of abstraction hiding what’s really going on under the hood. Once you get past the abstractions of your application and see what is really happening at the network and database level, the source of the problem is usually pretty easy to find. Solving them can be more difficult, however, as doing so often requires rethinking the way that you use abstraction within your application and may require that you adopt different coding patterns entirely.

Episode Breakdown

Select N+1

This occurs frequently when you load a top level record and then individually load records underneath that record. This is common when (mis)using an ORM. In many ORMs, child objects can be retrieved transparently as required (lazy loading). While convenient, this approach doesn’t scale and results in counter-intuitive behavior.

To avoid this, it is generally best to avoid lazy loading, as well as loading child records within the body of a loop. To fix this problem, you need to load the relevant child records before you need them, preferably along with the parent record.

Filtering data on the wrong end of the wire.

This frequently occurs when filtering criteria are complex or when an ORM exhibits counter-intuitive behavior that means that a filtering statement is not applied at the database.

There isn’t a great way to avoid this at the team level. Instead, you will want to monitor your system to look for sudden changes in performance.

Fixing this problem is going to vary depending on the framework you are using. In many cases, you may need to revert to using raw sql (parameterized please) in order to get the kind of performance you need.

Failing to cache (repeated invocation).

This frequently occurs with fairly static data that is used all over the system. A good example of this in a multi-tenant system is things like organizational settings. In many cases where an application is requesting the same data over and over, the data tends not to change very much. Even if it can be retrieved with a fairly simple query, it results in unnecessary network traffic and database usage.

These sorts of calls tend to live in middleware or in utility code that is widely used across the system. As long as your codebase is reasonably clean, these should be relatively easy to find. The best way to fix this is by using some sort of caching server, such as redis, along with mechanisms to make sure that you invalidate this cache in the event of a change.

Too many columns

In many cases, especially when using an ORM, you may be pulling back too many columns from the database. This tends to happen when you map database tables into types, because these tend to be implemented as “SELECT *” by default. Another non-obvious cause of this is deep loading an object graph. In many cases, this will be accomplished by the underlying framework using joins. If you aren’t careful, this results in a cartesian product.

In general, you are going to want to be very careful to only load the data you actually need for a particular operation when you need it. When you try to reuse your database access logic, you MUST be very careful and make sure that each use case needs exactly the same data in the same shape. If you are using a CQRS (command query responsibility segregation), you may want to use projections from your tables for the query side, and only deep load object graphs to the degree required on the command side.

Chatty calls

When a system is over-abstracted, you may find that you are making lots of small calls to the database. While you do want to limit how much you bring back in a single call, there is still overhead for each call you make. This is a pretty common result of overly abstracting your code.

Another common cause for this one is lazy-loading occurring in a loop. In this case, it may not be immediately obvious that a database call is occurring. This sort of thing is why many developers consider lazy loading to be evil and disable it by default.

In general, the way to avoid this is to load all the data you need before you start doing work on it, and to load ONLY the data you need. This doesn’t play nicely with code reuse as practiced by many.

Chunky calls

The opposite of the previous point is a situation where you are bringing back a large number of rows. This can be the result of being over-eager when loading the data you need, or when loading a bunch of extra data that you don’t need.

A common cause of this is deep loading an object graph where the root object has large numbers of children. This can also occur when joining several wide tables or when returning large columns of binary or text data.

To avoid this, be careful about how you structure objects in your ORM. If your tables have large columns, you may want to be careful about mapping them into commonly used objects when they are not needed. In many cases, you may be better off retrieving projections of several tables in a flat structure instead of an object graph if any of the objects have big columns or if you need to do aggregation on child records. This keeps the size of your network payloads smaller.

Failure to use connection pooling or cycling through connections too quickly.

Another common cause of poor database network performance is poor management of database connections. In particular, this occurs when you open a new database connection for each call, rather than reusing a connection that is already open. A common reason for this is actually a symptom of poor use of a repository pattern. In this case, the repository is responsible for opening a database connection when required and it doesn’t have a way of knowing whether a connection is already open or not.

In SQL server, this will show in the profiler with lots of calls to sp_reset_connection if you are using connection pooling, but you may also see connections actually being opened and closed quickly if not. Both are bad signs if it happens too much. To fix this, manage database connection lifetime in your IOC container and scope it to a reasonable scope (like a single request in a web app), then inject it into the things that need it.

Failure to de-allocate connections

This is the opposite of the previous problem. In this case, database connections are not de-allocated (or closed) in a timely fashion. This leads to issues such as connection pool starvation. A common cause of this is the use of languages that automatically manage memory. While this is a huge advantage for developers, there are many things (such as database connections) that you really want to de-allocate in a predictable manner.

Another common cause of this is objects that “own” connections and have long lifetimes. You need to be careful how many of these you have, because if they are created in an unbounded manner, you may find that you are creating an excessive number of database connections.

Like the previous example, the best way to handle this is to use DI/IOC and manage connection lifetime using policy.

Antivirus

Another issues (that exists outside of your code) and may cause database network performance issues is poorly configured (or written…) antivirus software. Oftentimes, such software will make interactions with a database much slower than they would be otherwise.

This situation can be detected if your connection is otherwise fast enough and your database server responds slowly when queried for a small amount of data when not under load. Since you’ve eliminated the other likely candidates for a performance issue, it’s probably something other than code.

While antivirus is critical on consumer machines and other machines with less controlled network connections, in a typical data center scenario, the connections are more tightly controlled. This doesn’t mean antivirus isn’t still important, just that it can be configured in a less disruptive way when other infrastructure is also protecting the system. While this is more of an operations thing, as a developer you’ll find that you are often the person who discovers operations issues and you need to know how to prove that the problem isn’t with your code.

Poor cache performance

Sometimes you’ll find that even though you are following previous recommendations and using a caching layer for common operations, that the caching layer is slow, or doesn’t contain what you need for some reason. A common reason for this is a poorly configured caching server that doesn’t have enough RAM or disk space, causing things to fall out of the cache more quickly than might otherwise be expected.

Cache servers may also become a problem due to antivirus, firewall, and network configuration issues, or because they are located too far away from the servers that need them to be of use. This can also be an issue if the data stored in cache is not structured appropriately. For instance, if you need a small object with none of the rest of the object graph loaded, it doesn’t help you as much if the only thing in cache is a fully populated object graph.

High latency or low bandwidth

You may also find that network infrastructure issues are to blame for poor application performance due to high latency or low bandwidth. Bandwidth is a measure of how much data can be transferred from one point in a network to another in a period of time, while latency measures how long it takes for data to go to and endpoint and return.

You want higher bandwidth and lower latency. Because your application is generally interacting with services on other machines, high latency will generally give you worse problems more quickly than low bandwidth. This is one of the reasons why chatty calls are a problem – they are especially vulnerable to high latency, especially in scenarios where a sequence of calls occurs in a synchronous fashion.

Low bandwidth can become a problem when your application is transferring a large amount of data between two points (such as scenarios where you are streaming, moving large files around, etc.). High latency is typically also the reason why web pages with a lot of css, images, and javascript files are slow, since browsers will typically limit the number of concurrent downloads from the same website, essentially forcing you to pay attention to latency as a web developer. Bundling resources helps, because bandwidth issues are easier to fix (typically) than latency issues.

Tricks of the Trade

Connectivity issues aren’t just for computer networks. The concept of networking or growing your network is to reach out and meet people either in your industry or in a field you are interested in. This can be done at conferences, schools, local meetups, or even non tech events. Sometimes you can have trouble connecting to other people or are not in an area where there are a lot of others. This can be due to a lot of causes including our own inability to communicate, others not being receptive, or not having the opportunity. Recently, during the quarantine, we all experienced a massive networking outage when it came to in person networking. As such many of us found ways to connect and even meet new people without being in person. If you’re having trouble connecting with other developers think about it like you would a network connection and start trouble shooting the problem. Don’t be afraid to try out novel solutions either, if you are not one for going to conferences…when they start back up…check one out.

Tagged with: , , , ,