Schrödinger’s Cache
Generally speaking, repeating the same actions under the same conditions should create the same results (or close enough that the results aren’t a surprise, at least). This tendency is at the root of all scientific pursuits, our development of technology, our understanding of economics (such as it is), and forms a core belief in a big chunk of the population. This is especially true for software developers, because we literally rely on functions at a deep level. In fact, the mathematical definition of a function is “a relation from a set of inputs to a set of possible outputs where each input is related to exactly one output” (mathinsight.org). When examined, this statement simply says that doing the same thing should produce the same result with the same inputs, albeit with a bit more mathematical rigor.
However, if you’ve ever been burned by caching, you’re more likely to describe the experience as something closer to Schrödinger’s cat. This is a thought experiment posed by Erwin Schrödinger during discussions with Albert Einstein while expressing his objections to the Copenhagen interpretation of quantum mechanics. While a full discussion of this principle is beyond the scope of this podcast, the gist of it, per Schrödinger, is that if you seal a cat in a box with something that might kill it (in his example, a radioactive atom rigged to release poison), then until you open the box and look, you have to model the cat as being both “alive” and “dead” at the same time. While a somewhat gruesome notion, if you’ve ever dealt with cache invalidation of unknown length, this probably sounds familiar.
So, if things are behaving in a way that suggests Schrödinger’s Cache might be your problem, you need a reasonable way to think about your troubleshooting efforts. Troubleshooting cache issues without a plan is not only unlikely to succeed; you may also find it difficult to prove that you actually fixed an intermittent issue. In effect, you need to figure out how to let the cat out of the box so that it is in a provable state.
Dealing with caching issues is very frustrating. One of the biggest difficulties is that caching problems often manifest unpredictably, frequently driven by load or user behavior. Because of this, troubleshooting them usually requires that you first prove caching is a factor in the problem before attempting to correct it. Once you have shown that the cache is involved, you can manipulate it to make the problem easier to reproduce, which will greatly aid your troubleshooting and debugging.
Episode Breakdown
Basics of caching
A cache is a higher-speed (or lower-cost) data storage layer that stores a subset of data, typically transient in nature, so that future requests for that data are faster and cheaper and create less load on the system. When a cache is in use, an incoming request will be examined and a cache key generated. That key is used to look up the result in the cache. If the result is there, it is returned. If it isn’t, it is retrieved from the underlying source, stored in the cache, and then returned.
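As a rough sketch of that lookup flow, here is a minimal cache-aside example. The dict stands in for whatever cache store you actually use, and load_user and the key format are hypothetical placeholders.

```python
# Minimal cache-aside sketch: build a key, check the cache, otherwise load,
# store, and return. A plain dict stands in for the real cache store.
cache = {}

def load_user(user_id):
    # Stand-in for the slow or expensive lookup (database, remote API, etc.).
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"          # generate a cache key from the request
    if key in cache:                 # cache hit: return the stored result
        return cache[key]
    result = load_user(user_id)      # cache miss: fetch from the source of truth
    cache[key] = result              # store it so the next request is cheap
    return result
```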
Items in cache will have a cache window, that is, a period of time that elapses before the item is removed from cache. Accesses of the data after it has been invalidated will cause it to be reloaded and re-cached. Some caches use sliding expiration, meaning the item remains in cache for a certain amount of time since it was last accessed. This lets the cache perform better under load, while possibly reducing accuracy. Most caching systems also let you manually invalidate an item in cache, which is handy if the thing you are caching sometimes changes more frequently than your cache window would suggest. Most systems will have multiple caching levels, for different reasons, and users will also have caching that you may not be able to completely control.
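To make those terms concrete, here is a rough in-memory sketch of absolute versus sliding expiration plus manual invalidation. Real cache servers implement these policies for you; this is only an illustration of the behavior.

```python
# Toy cache illustrating a fixed cache window, sliding expiration, and
# manual invalidation using an in-memory dict.
import time

class TtlCache:
    def __init__(self, ttl_seconds, sliding=False):
        self.ttl = ttl_seconds
        self.sliding = sliding
        self.entries = {}                    # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:        # cache window has elapsed
            del self.entries[key]
            return None
        if self.sliding:                     # sliding expiration: reset window on access
            self.entries[key] = (value, time.time() + self.ttl)
        return value

    def set(self, key, value):
        self.entries[key] = (value, time.time() + self.ttl)

    def invalidate(self, key):               # manual eviction for volatile data
        self.entries.pop(key, None)
```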
Note that these tips for troubleshooting cache issues won’t tell you the state of Schrödinger’s cache in your environment, but they are ways of peering into the box.
Try turning off all caches you control.
This is probably the most drastic thing you can do. If you can test in an environment that has a provably equivalent setup and similar load to the environment where the problem occurred, the caching problem will sometimes become very apparent. However, it’s unlikely that you have a copy of your production environment hanging around that can handle similar load without caching; your production environment might not even be able to do that. Remember, you are only trying to suggest correlation, not prove causation.
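One hedged way to make “turn the cache off” testable is a single kill switch that forces every lookup straight to the source of truth. The CACHE_ENABLED variable, the cache object, and load_from_source below are assumptions for illustration, not a particular framework’s API.

```python
# A configuration flag that bypasses the cache entirely when disabled, so the
# same code path can be tested with and without caching.
import os

CACHE_ENABLED = os.environ.get("CACHE_ENABLED", "true").lower() == "true"

def get_value(cache, key, load_from_source):
    if not CACHE_ENABLED:
        return load_from_source(key)     # bypass the cache entirely
    value = cache.get(key)
    if value is None:
        value = load_from_source(key)    # cache miss: load and store
        cache.set(key, value)
    return value
```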
Try adjusting cache windows
Another thing that might help is to lengthen the amount of time data is cached. For instance, if a particular type of aggregate root in your system is cached for 5 minutes, bump it up to 10 and see whether the reported frequency of issues changes. Note that if an issue appears to go away when you do this, that doesn’t mean a cache window adjustment will fix it; this suggests correlation, not causation. If the data is volatile and your system doesn’t do manual cache eviction, be careful how the cached data is used, as you can introduce other bugs that will be just as hard to troubleshoot as this one. Also beware that you don’t have full control over cache eviction: server restarts, resource contention, bugs in cache software, or even other instances of your application can mean things get evicted before you think they should be. If possible, try to determine whether the data you are retrieving came from cache. The results may surprise you.
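Here is a small sketch of two of those ideas: making the cache window configurable so it can be lengthened without a code change, and recording whether each result came from cache. The setting name and log fields are assumptions, not a specific library’s API.

```python
# Configurable TTL plus hit/miss logging so you can see where data came from.
import logging, os, time

CACHE_TTL_SECONDS = int(os.environ.get("CACHE_TTL_SECONDS", "300"))  # 5 min default; bump to 600 to test
log = logging.getLogger("cache")
cache = {}

def get_with_ttl(key, load_from_source):
    entry = cache.get(key)
    if entry is not None and entry["expires_at"] > time.time():
        log.info("cache hit key=%s", key)    # make hits visible in your logs
        return entry["value"]
    log.info("cache miss key=%s", key)       # misses too, so you can compare rates
    value = load_from_source(key)
    cache[key] = {"value": value, "expires_at": time.time() + CACHE_TTL_SECONDS}
    return value
```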
Try adjusting cache invalidation rules
If your app is using hard cache timeouts, try switching to a sliding cache timeout. If an issue’s frequency decreases roughly in proportion to the amount you changed the cache timeout, that tends to indicate the problem is related to the cache. Further, if issues appear a predictable amount of time after the change, and that time corresponds with the cache timeout, it can point you toward other issues: either the cache being stale, or problems handling load near the cache boundary (that is, when the system has to reload data and repopulate the cache, it may handle the load poorly in the interim).
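A small, self-contained sketch of the difference between the two timeout styles, to make the “issues cluster near the cache boundary” idea concrete:

```python
# Hard vs. sliding cache timeouts, reduced to their expiry conditions.
def expired_hard(stored_at, ttl, now):
    # Hard timeout: the entry dies a fixed time after it was stored,
    # no matter how often it is read.
    return now >= stored_at + ttl

def expired_sliding(last_read_at, ttl, now):
    # Sliding timeout: the entry dies a fixed time after it was last read,
    # so hot keys effectively never expire while traffic keeps touching them.
    return now >= last_read_at + ttl
```

With a sliding timeout, stale-data symptoms tend to shift from “every TTL” to “whenever traffic dips long enough for the entry to expire,” which is itself a useful diagnostic signal.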
Try modulating load to look for resource contention leading to cache invalidation.
Another option is to leave your caching settings the same, but put a lot of load on the system (in a test environment, obviously). What you are looking for here are problems that result from things being removed from cache more frequently than you would expect. If this approach drastically increases the number of problems, there is a good chance you are seeing a load-based phenomenon caused by premature cache eviction. If it drastically decreases the number of problems, that might indicate your app’s cache invalidation strategy causes it to hang on to resources for too long, or fails to free resources that should be freed.
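A hedged sketch of modulating load in a test environment: hammer one endpoint with a configurable number of concurrent requests and count failures, so you can compare error rates at low versus high concurrency. The URL is a placeholder, and a real load-testing tool would give you far better numbers.

```python
# Crude concurrency sweep: same request volume, different levels of parallelism.
import concurrent.futures
import urllib.request

def hit(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def run_load(url, requests=500, concurrency=50):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(hit, [url] * requests))
    failures = results.count(False)
    print(f"concurrency={concurrency} failures={failures}/{requests}")

# run_load("http://test-environment.example/api/widgets", concurrency=5)
# run_load("http://test-environment.example/api/widgets", concurrency=100)
```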
Try cache-busting techniques
You might also look into cache-busting techniques. While these vary, the essential idea is to do something so that cache keys from one part or version of the system are different than those in another part. Be careful about where you do this, as it is easy to introduce inconsistency and weird behavior. Essentially, this verifies that your assumptions about which systems are putting things into cache are correct.
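One minimal sketch of cache busting is versioning the key itself: bumping the version (or deploy identifier) guarantees that one version of the system can’t read entries written by another. APP_CACHE_VERSION here is an assumed setting for illustration, not a standard.

```python
# Version-prefixed cache keys: entries written by different versions never collide.
APP_CACHE_VERSION = "2024-06-01.1"   # e.g. a build number or deploy timestamp

def make_key(entity_type, entity_id):
    return f"{APP_CACHE_VERSION}:{entity_type}:{entity_id}"

# "2024-06-01.1:user:42" and "2024-06-02.1:user:42" are distinct keys, so a new
# deploy starts with a cold cache instead of reading another version's entries.
print(make_key("user", 42))
```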
Look for cache window mismatches
If two chunks of data are similar (say they came out of the same tables) but are shaped differently, they should probably inhabit cache for similar amounts of time. If two pieces of data are related in some way (including by aggregation), then you have to be careful about cache durations. Essentially, what can happen is that one piece of data is kept longer than the other, and periodic application errors occur because the code assumes they are kept for the same amount of time.
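One hedged way to avoid that kind of mismatch is to derive every TTL for data built from the same underlying tables from a single shared constant, instead of scattering literal durations around the codebase. The names below are purely illustrative.

```python
# Single source of truth for the cache window of related, same-source data.
ORDER_DATA_TTL_SECONDS = 300   # all order-shaped data shares this window

CACHE_TTLS = {
    "order_summary": ORDER_DATA_TTL_SECONDS,
    "order_detail": ORDER_DATA_TTL_SECONDS,
    "order_totals_by_customer": ORDER_DATA_TTL_SECONDS,  # aggregate of the same rows
}

def ttl_for(kind):
    return CACHE_TTLS[kind]
```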
Look for cache usage mismatches
Also look (in code) for ways that some parts of your application, or any supporting pieces, circumvent caching, either for reads or writes. Sometimes it’s not a cache issue per se, but rather an issue of some part of the system not using the cache properly or consistently. The result of this behavior will still be sporadic, and it may put things into cache that emerge to cause problems later. This can also be fun when you have multiple levels of cache involved in the system, such as browser cache versus web server cache versus a dedicated cache server.
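One way to reduce usage mismatches, sketched below, is to route all reads and writes for a given entity through a single accessor so no code path quietly bypasses the cache. The cache and db objects are placeholders for your actual cache client and data access layer.

```python
# Single accessor for an entity: reads populate the cache, writes keep it consistent.
class UserStore:
    def __init__(self, cache, db):
        self.cache = cache
        self.db = db

    def get(self, user_id):
        key = f"user:{user_id}"
        user = self.cache.get(key)
        if user is None:
            user = self.db.load_user(user_id)    # cache miss: read the source of truth
            self.cache.set(key, user)
        return user

    def save(self, user):
        self.db.save_user(user)                       # write to the source of truth...
        self.cache.set(f"user:{user['id']}", user)    # ...and keep the cache in step
```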
Look for bad cache key generation
Make sure that cache key generation is consistent across your app for the same type of return payload, and that there isn’t overlap in cache key generation. If you are manually generating cache keys and do a poor job of it (for instance, using a person’s name as their key), you could have collisions, which can explain at least some weird data issues. It’s less likely now than it used to be, but be on guard for it anyway. Essentially, if there is a collision, one set of data can overwrite another.
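A tiny sketch of the collision problem described above: keys built from a person’s name collide, while keys built from a type prefix plus a stable unique id do not.

```python
# Name-based keys collide; type-prefixed id-based keys do not.
def bad_key(person):
    return person["name"]                     # "John Smith" overwrites "John Smith"

def good_key(person):
    return f"person:{person['id']}"           # type prefix + stable id: no overlap

a = {"id": 1, "name": "John Smith"}
b = {"id": 2, "name": "John Smith"}

assert bad_key(a) == bad_key(b)               # collision: second write clobbers the first
assert good_key(a) != good_key(b)             # distinct keys, no overwrite
```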
Make sure cache invalidation windows align with the volatility of the things being cached, or that cache entries are evicted when the underlying data changes.
It’s also possible that data has changed since it entered the cache. In some cases this is acceptable, while in others the cache needs to be kept up to date. The volatility of the data being cached should be considered whenever you use caching; however, volatility itself may vary, depending on application usage patterns. If the data is cached for a short time relative to how often it actually changes, you may be missing performance gains that could make your application perform better. On the other hand, if you are caching data for longer than it is likely to remain unchanged, you risk data inconsistency unless you evict it from cache when it changes.
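A minimal sketch of evicting on change instead of relying solely on the window: when volatile data is updated, delete (or overwrite) its cache entry immediately. The cache and db objects are placeholders for your own infrastructure.

```python
# Write the source of truth first, then evict the stale cache entry so readers
# can't see data that no longer matches the database.
def update_product_price(cache, db, product_id, new_price):
    db.update_price(product_id, new_price)     # change the underlying data
    cache.invalidate(f"product:{product_id}")  # evict immediately, don't wait for the window
```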
Tricks of the Trade
Meetings are the bane of many developers. Many times they are not productive, or they could have been handled via email or other asynchronous communication channels. It’s true there are times for meetings, such as troubleshooting between developers or emergency situations. One thing that helps prevent random meetings from being scheduled is to block out development time on your work calendar instead of leaving it open. As tempting as it might be, you can’t take up your entire work day with “development time”. You’ll need to leave some spaces open for meetings, but this way you have some control over them. Also, when people want to schedule something during normal development time, they’ll have to ask you about it, since that time will be blocked.