Breaking The Relational Model
The rule: Normalize until it hurts; denormalize until it works. That’s what you typically get taught early on. While normalizing a database is a solid practice and something you should do when designing a system, it can jump up and bite you once the system comes under load. You probably won’t deal with most of these issues early on (and shouldn’t), but the time will eventually come when some degree of denormalization is necessary to keep performance up to snuff.
If you start suggesting that a system be denormalized, expect to have to show that doing so improves performance significantly. There are good reasons that systems aren’t denormalized by default, and people will typically balk at doing so. However, there are circumstances where it’s worth considering. After all, CPU cycles on a database server are often some of the most expensive ones out there; go check your SQL Server or Oracle licensing fees for proof.
Rules are made to be followed in general, but sometimes you have to bend them just to get by. This is especially true in computer science, where many “rules” are more like “guidelines for most cases” at best.
Episode Breakdown
When additional join tables will hurt performance.
Joins are not cheap, and they certainly aren’t free. Complex joins are especially bad (for instance, a function call in a join condition). This is especially true when you are joining across a lot of tables and aggregating a lot of results on a regular basis for data that doesn’t change frequently.
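As a rough sketch (using Python’s built-in sqlite3 module as a stand-in for whatever database you actually run, with hypothetical customers and orders tables), a join-heavy aggregate that runs constantly over slow-changing data can be pre-computed into a summary table and refreshed on a schedule:

```python
# Minimal sketch: replace a frequently run join + aggregate with a summary
# table that is refreshed periodically. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);

    -- Denormalized target: readers query this single table instead of joining.
    CREATE TABLE region_sales_summary (region TEXT PRIMARY KEY, total_sales REAL);
""")

def refresh_summary(conn):
    """Rebuild the pre-computed aggregate; run on a schedule since the data rarely changes."""
    with conn:
        conn.execute("DELETE FROM region_sales_summary")
        conn.execute("""
            INSERT INTO region_sales_summary (region, total_sales)
            SELECT c.region, COALESCE(SUM(o.total), 0)
            FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
            GROUP BY c.region
        """)

refresh_summary(conn)
```

Reads then become a single-table lookup; the cost is accepting slightly stale numbers between refreshes.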
When you need to keep historical database records.
Consider a system that manages invoices and products. You don’t want to mess up historical data on last year’s invoices when a product price changes today. Historical tables are also typically used mostly for reads, so you want to cut down the number of joins to make those reads faster. Tables with historical data can often be more heavily indexed than the transactional tables used in online transaction processing, where writes have to be fast.
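One common way to protect history, sketched here with hypothetical products and invoice_lines tables, is to copy the price and name onto the invoice line at the time of sale rather than joining back to the products table later:

```python
# Minimal sketch: snapshot product data onto the invoice line so later price
# changes cannot rewrite last year's invoices. Schema names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE invoice_lines (
        id           INTEGER PRIMARY KEY,
        invoice_id   INTEGER NOT NULL,
        product_id   INTEGER,           -- kept for traceability, not for pricing
        product_name TEXT NOT NULL,     -- denormalized snapshot
        unit_price   REAL NOT NULL,     -- denormalized snapshot
        quantity     INTEGER NOT NULL
    );
""")

def add_line(conn, invoice_id, product_id, quantity):
    """Copy the current name and price into the line item at the time of sale."""
    name, price = conn.execute(
        "SELECT name, price FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    conn.execute(
        "INSERT INTO invoice_lines (invoice_id, product_id, product_name, unit_price, quantity) "
        "VALUES (?, ?, ?, ?, ?)",
        (invoice_id, product_id, name, price, quantity),
    )

conn.execute("INSERT INTO products VALUES (1, 'Widget', 9.99)")
add_line(conn, invoice_id=1001, product_id=1, quantity=3)
```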
When data needs to be structured for reporting purposes.
Reporting systems generally have a lot of data duplication simply so that more interesting trends can be spotted with the data at hand. You’ll also have very different indexing requirements for reporting tables, and the way the tables are accessed is very different as well: they tend to be read-heavy and need different levels of locking depending on what you’re doing.
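A reporting table might look something like this hypothetical sales_fact sketch: descriptive columns are duplicated so analysts can slice the data without joins, and it carries indexes a write-heavy transactional table couldn’t afford:

```python
# Minimal sketch of a read-optimized reporting table (hypothetical names).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (
        sale_date        TEXT,
        region           TEXT,
        product_category TEXT,   -- duplicated descriptive data to avoid joins
        product_name     TEXT,
        units            INTEGER,
        revenue          REAL
    );
    -- Heavy indexing for common reporting queries; far more than a transactional table would tolerate.
    CREATE INDEX idx_fact_date     ON sales_fact (sale_date);
    CREATE INDEX idx_fact_region   ON sales_fact (region, sale_date);
    CREATE INDEX idx_fact_category ON sales_fact (product_category, sale_date);
""")
```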
When the data itself is hierarchical in structure.
Hierarchical data is difficult to represent efficiently in a relational model. This is especially true when you need to rearrange a tree, or determine if something is an ancestor of something else, especially if the structure has an arbitrary depth. It’s also very difficult to write logic for tree structures stored in a relational model such that you don’t accidentally create cycles in the data.
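For a sense of what “difficult” looks like, here’s a sketch of an ancestor check against a hypothetical adjacency-list categories table; it works via a recursive CTE, but rearranging subtrees and guarding against accidental cycles is still left to you:

```python
# Minimal sketch: ancestor lookup over an adjacency list (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO categories VALUES (1, NULL, 'root'), (2, 1, 'books'), (3, 2, 'sci-fi');
""")

def is_ancestor(conn, ancestor_id, node_id):
    """Walk up the tree with a recursive CTE; workable, but awkward compared to a tree-native store."""
    row = conn.execute("""
        WITH RECURSIVE ancestors(id) AS (
            SELECT parent_id FROM categories WHERE id = ?
            UNION
            SELECT c.parent_id FROM categories c JOIN ancestors a ON c.id = a.id
        )
        SELECT 1 FROM ancestors WHERE id = ? LIMIT 1
    """, (node_id, ancestor_id)).fetchone()
    return row is not None

print(is_ancestor(conn, 1, 3))  # True: 'root' is an ancestor of 'sci-fi'
```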
When the data is a graph structure.
It’s even more fun when it’s a graph, because cycles are actually possible. While storing nodes and edges is easy enough, doing things like finding the shortest path will be difficult. You’ll also have a difficult time proving that any two nodes are connected without a serious risk of major performance problems.
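A sketch of the problem, assuming a hypothetical edges table: storing the graph is trivial, but a connectivity check ends up as graph traversal in application code (or deeply nested recursive SQL), with the visited set doing the cycle protection the database won’t:

```python
# Minimal sketch: reachability over an edges table via breadth-first search.
from collections import deque
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 1), (3, 4);  -- note the 3 -> 1 cycle
""")

def connected(conn, start, target):
    """BFS in application code; the visited set guards against cycles."""
    visited, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for (nxt,) in conn.execute("SELECT dst FROM edges WHERE src = ?", (node,)):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return False

print(connected(conn, 1, 4))  # True, but every hop is another query against the database
```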
When you have very large records.
Relational databases have issues with large records, whether you query them or not. They aren’t really designed for holding things like images and huge chunks of text. This is especially true if you want to index and search through this content, as the search capabilities of a relational database are not really built for that purpose. If you are storing large records with big chunks of binary in them, you will likely want to implement custom search logic that goes beyond what your database can do. Good luck with that.
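One common workaround, sketched here with a hypothetical documents table and blob directory, is to keep the large content outside the database entirely and store only a pointer and a checksum:

```python
# Minimal sketch: store big blobs on disk (or object storage) and keep only
# metadata in the database. Paths and table names are hypothetical.
import hashlib, pathlib, sqlite3, tempfile

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id             INTEGER PRIMARY KEY,
        title          TEXT,
        content_path   TEXT,   -- pointer to the file outside the database
        content_sha256 TEXT    -- integrity check without loading the blob
    )
""")

BLOB_DIR = pathlib.Path(tempfile.gettempdir()) / "blob_store"  # hypothetical location

def store_document(conn, title, data: bytes):
    """Write the blob to the file store and record only its location and hash."""
    BLOB_DIR.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(data).hexdigest()
    path = BLOB_DIR / digest
    path.write_bytes(data)
    conn.execute(
        "INSERT INTO documents (title, content_path, content_sha256) VALUES (?, ?, ?)",
        (title, str(path), digest),
    )

store_document(conn, "launch plan", b"... a very large document ...")
```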
When you have very sparse records in which most fields are never queried.
Sometimes you get records that potentially have hundreds of columns, most of which are never queried at the database level. You’ll often see this for things like settings values per user or tenant. If you aren’t querying many of these fields directly, and are instead retrieving them as a chunk, it may be worth consolidating them into a single field. Sparsely populated tables tend to waste a lot of space and can be extremely wasteful to index.
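A sketch of that consolidation, assuming a hypothetical user_settings table: hundreds of mostly-NULL columns collapse into one JSON field that the application reads and writes as a chunk:

```python
# Minimal sketch: sparse, rarely-queried fields folded into a single JSON column.
import json, sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_settings (user_id INTEGER PRIMARY KEY, settings TEXT)")

def save_settings(conn, user_id, settings: dict):
    """Store the whole settings blob; the database never queries individual keys."""
    conn.execute(
        "INSERT OR REPLACE INTO user_settings (user_id, settings) VALUES (?, ?)",
        (user_id, json.dumps(settings)),
    )

def load_settings(conn, user_id) -> dict:
    row = conn.execute(
        "SELECT settings FROM user_settings WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}

save_settings(conn, 42, {"theme": "dark", "items_per_page": 50})
print(load_settings(conn, 42))
```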
When you have records of variable schema in which most of the data is never queried.
If users are creating records of their own design in your system, good luck making a relational model for that. You might be tempted to go to some kind of EAV (entity-attribute-value) setup for this. It can work for a while. However, at some point it fails to scale and you either have to get away from the relational model or bend it considerably.
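For reference, the EAV shape usually looks something like this (hypothetical entities and entity_values tables): flexible, but every full record read becomes a pivot, which is part of why it stops scaling:

```python
# Minimal sketch of a classic EAV layout: one row per attribute.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, entity_type TEXT);
    CREATE TABLE entity_values (
        entity_id INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attribute)
    );
""")

def load_entity(conn, entity_id) -> dict:
    """Reassemble a user-defined record from its attribute rows."""
    return dict(conn.execute(
        "SELECT attribute, value FROM entity_values WHERE entity_id = ?",
        (entity_id,),
    ).fetchall())

conn.execute("INSERT INTO entities VALUES (1, 'custom_form')")
conn.executemany(
    "INSERT INTO entity_values VALUES (1, ?, ?)",
    [("color", "blue"), ("size", "XL")],
)
print(load_entity(conn, 1))  # {'color': 'blue', 'size': 'XL'}
```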
When your database is extremely read-heavy on computed data.
If your database is getting hit with a lot of reads compared to writes, and having to do a lot of calculations, you can substantially lower the load by essentially caching the calculations in a table. This is also helpful if the aggregate data doesn’t have to be completely up to date; many reports don’t need real-time data but are forced into it by the database structure. If you do need it updated in real time, you may be better off updating the cached value when the underlying data changes, rather than recalculating it each time you need it.
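Here’s a sketch of updating the cached value at write time, assuming hypothetical page_views and page_view_counts tables: the raw event and the cached count change in the same transaction, so readers never recompute anything:

```python
# Minimal sketch: maintain a cached aggregate at write time instead of
# recalculating it on every read. Table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (page_id INTEGER, viewed_at TEXT);
    CREATE TABLE page_view_counts (page_id INTEGER PRIMARY KEY, view_count INTEGER NOT NULL);
""")

def record_view(conn, page_id, viewed_at):
    """Write the raw event and bump the cached count in one transaction."""
    with conn:
        conn.execute("INSERT INTO page_views VALUES (?, ?)", (page_id, viewed_at))
        conn.execute(
            "INSERT OR IGNORE INTO page_view_counts (page_id, view_count) VALUES (?, 0)",
            (page_id,),
        )
        conn.execute(
            "UPDATE page_view_counts SET view_count = view_count + 1 WHERE page_id = ?",
            (page_id,),
        )

record_view(conn, 42, "2020-05-01")
record_view(conn, 42, "2020-05-02")
print(conn.execute("SELECT view_count FROM page_view_counts WHERE page_id = 42").fetchone())  # (2,)
```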
When you have multitable constraints.
Sometimes you have constraints on your data as understood by the business that end up crossing tables. If these can’t be expressed in things like foreign keys, you may want to consolidate. This can yield a lot of benefits. Many database systems have far better constraint capabilities when all the relevant pieces are in a single table. This can also make it easier to deal with things like triggers if you have to use them.
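As a small sketch of the payoff (hypothetical orders table): once the relevant columns live in one table, a rule that would otherwise need trigger logic across tables becomes a plain CHECK constraint:

```python
# Minimal sketch: a cross-table business rule reduced to a single-table CHECK.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        total    REAL NOT NULL,
        discount REAL NOT NULL DEFAULT 0,
        CHECK (discount >= 0 AND discount <= total)  -- enforced by the database itself
    )
""")

conn.execute("INSERT INTO orders (total, discount) VALUES (100.0, 10.0)")    # fine
# conn.execute("INSERT INTO orders (total, discount) VALUES (100.0, 150.0)") # raises sqlite3.IntegrityError
```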
When two types of environments require the same table.
Sometimes you have a table that is getting heavy use for reporting purposes while also getting heavy use in day-to-day transaction processing. This is a problem because you need to limit indexes for fast writes, but you tend to need more indexes for fast querying. As a result, you may want to duplicate the table and have a different setup for each case.
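A sketch of that duplication, with hypothetical orders and orders_reporting tables: the transactional copy stays lean for writes, the reporting copy carries the indexes, and a periodic job keeps them in sync:

```python
# Minimal sketch: same data, two tables tuned for different workloads.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Transactional copy: minimal indexing so writes stay fast.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, placed_at TEXT, total REAL);

    -- Reporting copy: indexed for the query patterns analysts actually use.
    CREATE TABLE orders_reporting (id INTEGER PRIMARY KEY, customer_id INTEGER, placed_at TEXT, total REAL);
    CREATE INDEX idx_rep_placed_at ON orders_reporting (placed_at);
    CREATE INDEX idx_rep_customer  ON orders_reporting (customer_id, placed_at);
""")

def sync_reporting_copy(conn):
    """Naive full refresh; a real system might stream changes instead."""
    with conn:
        conn.execute("DELETE FROM orders_reporting")
        conn.execute("INSERT INTO orders_reporting SELECT * FROM orders")

sync_reporting_copy(conn)
```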
Book Club
Remote Work – The Complete Guide
Will Gant
Chapter 7 – Creating good habits for remote work. Your habits determine how well remote work is going to work for you. Bad habits will cause almost certain failure. While you don’t need the structure of the office, you do have to have some structure so that you can be effective.
Tricks of the Trade
We are not computers or robots. Rules exist for a reason; typically they were created to protect ourselves or others. Most of the time they are there because someone did something that made the rule necessary, so there is an intent behind them. It takes understanding and discernment, but we are able to reason about the underlying purpose of a rule. Once we know that purpose, we know how to apply the rule, how to bend it, and how and when it may be necessary to break it.