Normalize Your Database
Database normalization is the process of organizing data in a relational database. It was first proposed by Edgar F. Codd in his 1970 paper “A Relational Model of Data for Large Shared Data Banks,” the same paper in which he introduced the notion of a relational database. Normalization reduces the amount of space a database consumes and ensures that data is logically stored. It does so by eliminating redundant data and ensuring data dependencies make sense.
The normal forms are a series of guidelines that communicate the level of normalization a database has reached. There are six, listed in order from first through sixth, along with a few specially named forms. In practice you will not likely see anything beyond the third normal form. Each form builds on the previous ones, so to be in a higher normal form a table must meet the criteria of all of the lower normal forms.
Database normalization can help you better organize your database. Even if you aren’t a DBA or back-end developer, understanding the basics of how relational databases are organized can provide insights when working with other developers or existing code. Normalization may not always be the best idea depending on your business needs, and it is possible to hurt performance by over-optimizing one aspect of data retrieval and manipulation.
Episode Breakdown
14:08 Objectives of Normalization
“Codd specifically complained about SQL.”
The reason for first normal form is to allow data to be queried and manipulated via a “universal data sub-language” (for example, SQL). Codd listed four objectives for normalization beyond the first normal form in his paper “Further Normalization of the Data Base Relational Model.”
To free the collection of relations from undesirable insertion, update and deletion dependencies.
To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs.
To make the relational model more informative to users.
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
16:27 Reasons To Normalize
The first reason you’d want to normalize is to minimize redesign when adding to the schema. Normalized schemas allow new types of data to be added without changing existing tables and structures.
“I totally just called Bill Gates ‘Gatesy’!”
Update errors or anomalies occur when the same information appears in multiple rows, so a change to one logical fact requires changes to many rows. For example, if each order stores the customer’s information (name, address, company, etc.), then when that customer changes address you have to update every one of their order rows, as in the sketch below.
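A rough sketch of the problem, using hypothetical Orders and Customers tables (names and columns are illustrative, not from the episode):

```sql
-- Flattened design (hypothetical): customer details repeat on every order row.
CREATE TABLE Orders_Flat (
    OrderId         INT PRIMARY KEY,
    OrderDate       DATE,
    CustomerName    VARCHAR(100),
    CustomerAddress VARCHAR(200),
    CustomerCompany VARCHAR(100)
);

-- An address change must touch every row for that customer;
-- a missed row leaves the data inconsistent.
UPDATE Orders_Flat
SET CustomerAddress = '42 New Street'
WHERE CustomerName = 'Jane Doe';

-- Normalized design: customer details live in exactly one place.
CREATE TABLE Customers (
    CustomerId INT PRIMARY KEY,
    Name       VARCHAR(100),
    Address    VARCHAR(200),
    Company    VARCHAR(100)
);

CREATE TABLE Orders (
    OrderId    INT PRIMARY KEY,
    OrderDate  DATE,
    CustomerId INT REFERENCES Customers(CustomerId)
);

-- Now the same change is a single-row update.
UPDATE Customers SET Address = '42 New Street' WHERE CustomerId = 1;
```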
Insertion anomalies occur because flattened tables may not allow partial information to be inserted, which can leave you unable to store some information at all, for example a new customer who has not yet placed an order.
Deletion anomalies are the reverse: when deleting data you may have to remove more than intended, because removing one piece of information also removes unrelated information stored in the same row.
21:00 First Normal Form (1NF)
“They’re descriptions of different levels of normalization.”
The first normal form starts the normalization of data by requiring no repeating elements or groups of elements. Remove duplicate columns from the same table. Create separate tables for each group of related data. Identify each set of related data with a primary key.
The values in each field of the database must be atomic: they cannot be broken down into individual parts. This can be complicated because the atomicity of a value may be business dependent. A person’s name can be broken into first, middle, and last; a string can have sub-strings; a fixed-point number can be split into an integer part and a fractional part; and so on.
Chris Date updated Codd’s definition to say that a table is in first normal form if it is isomorphic to a relation. There’s no top-to-bottom ordering to the rows. Each row has a unique primary key, so it is not like a spreadsheet where subsequent rows can be associated with each other. There’s no left-to-right ordering to the columns. There are no duplicate rows. Every row-and-column intersection contains exactly one value from the applicable domain. All columns are regular: rows have no hidden components such as row IDs, object IDs, or hidden timestamps. This breaks the data down into functional units of information.
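A minimal sketch of first normal form, assuming a hypothetical customer table with a repeating group of phone columns:

```sql
-- Violates 1NF: a repeating group of columns holding the same kind of value.
CREATE TABLE Customer_Bad (
    CustomerId INT PRIMARY KEY,
    Name       VARCHAR(100),
    Phone1     VARCHAR(20),
    Phone2     VARCHAR(20),
    Phone3     VARCHAR(20)
);

-- 1NF: each phone number is one atomic value in its own row,
-- identified by a key and tied back to the customer.
CREATE TABLE Customer (
    CustomerId INT PRIMARY KEY,
    Name       VARCHAR(100)
);

CREATE TABLE CustomerPhone (
    CustomerId INT REFERENCES Customer(CustomerId),
    Phone      VARCHAR(20),
    PRIMARY KEY (CustomerId, Phone)
);
```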
28:05 Second Normal Form (2NF)
The second normal form continues to remove duplicate information by not allowing partial dependencies. Separate data that applies to multiple rows into different tables. Use foreign keys to create relationships between tables.
All columns that are not the primary key are dependent on the primary key. The primary key is a unique identifier for each row in a table; to find a particular bit of information in a row you must know its primary key.
Each column should describe what the primary key identifies. This groups data into logical units, with each table holding its own associated information. You keep customer information separate from product information and join them based on their keys, as sketched below.
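A minimal sketch of second normal form, assuming a hypothetical order-line table with a composite key where the product name depends on only part of that key:

```sql
-- Violates 2NF: ProductName depends only on ProductId,
-- not on the whole composite key (OrderId, ProductId).
CREATE TABLE OrderLine_Bad (
    OrderId     INT,
    ProductId   INT,
    ProductName VARCHAR(100),
    Quantity    INT,
    PRIMARY KEY (OrderId, ProductId)
);

-- 2NF: move the partially dependent column into its own table
-- and relate the two through a foreign key.
CREATE TABLE Product (
    ProductId   INT PRIMARY KEY,
    ProductName VARCHAR(100)
);

CREATE TABLE OrderLine (
    OrderId   INT,
    ProductId INT REFERENCES Product(ProductId),
    Quantity  INT,
    PRIMARY KEY (OrderId, ProductId)
);
```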
30:40 Third Normal Form (3NF)
The third normal form adds to the second by not allowing transitive dependencies: every non-key column must depend directly on the primary key and on nothing else. If a piece of data can exist without the primary key then it doesn’t belong in that table. Remove dependencies between non-key columns.
Dictionary definition for Transitive: “if it applies between successive members of a sequence, it must also apply between any two members taken in order”
“Otherwise you don’t really have a primary key.”
Basically, the relationship between elements has to be the same across the table and back to the primary key. What this means is that all the columns are dependent only on the primary key and not on any other columns in the table, as the sketch below shows.
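A minimal sketch of third normal form, assuming a hypothetical employee table where the department name depends on the department ID rather than directly on the primary key:

```sql
-- Violates 3NF: DepartmentName depends on DepartmentId, a non-key column,
-- rather than directly on EmployeeId (a transitive dependency).
CREATE TABLE Employee_Bad (
    EmployeeId     INT PRIMARY KEY,
    Name           VARCHAR(100),
    DepartmentId   INT,
    DepartmentName VARCHAR(100)
);

-- 3NF: the transitively dependent column moves to its own table.
CREATE TABLE Department (
    DepartmentId   INT PRIMARY KEY,
    DepartmentName VARCHAR(100)
);

CREATE TABLE Employee (
    EmployeeId   INT PRIMARY KEY,
    Name         VARCHAR(100),
    DepartmentId INT REFERENCES Department(DepartmentId)
);
```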
33:04 Boyce-Codd Normal Form (BCNF or 3.5NF)
Boyce-Codd normal form adds to third normal form by requiring that all determinants are candidate keys: every dependency must either be trivial or have a determinant that is a super key. A trivial functional dependency is one where the dependent columns are a subset of the determinant. A super key is a column or set of columns that uniquely identifies every row in the table, and a candidate key is a minimal super key, the smallest set of columns that still identifies each row. When you reach this form, all redundancy based on functional dependency has been removed.
Informally expressed as: “Each attribute must represent a fact about the key, the whole key, and nothing but the key.”
This form was created by Raymond Boyce and Edgar Codd in 1974 to address certain anomalies not covered by 3NF. Most third normal form tables will meet the criteria for Boyce-Codd: a 3NF table with no overlapping candidate keys will qualify, while a table with multiple overlapping candidate keys may not, depending on its functional dependencies.
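The classic illustration, with hypothetical table names: a table that is in 3NF but not BCNF because a determinant is not a candidate key.

```sql
-- In 3NF but not BCNF: (StudentId, Subject) is the key, but each tutor
-- teaches exactly one subject, so Tutor determines Subject even though
-- Tutor is not a candidate key.
CREATE TABLE Enrollment_Bad (
    StudentId INT,
    Subject   VARCHAR(50),
    Tutor     VARCHAR(100),
    PRIMARY KEY (StudentId, Subject)
);

-- BCNF: split the table so that every determinant is a candidate key.
CREATE TABLE TutorSubject (
    Tutor   VARCHAR(100) PRIMARY KEY,
    Subject VARCHAR(50)
);

CREATE TABLE Enrollment (
    StudentId INT,
    Tutor     VARCHAR(100) REFERENCES TutorSubject(Tutor),
    PRIMARY KEY (StudentId, Tutor)
);
```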
35:55 Beyond Third Normal Form
“I remember thinking in class, ‘When am I ever gonna use this’ and it turns out…never!”
The previous normal forms are concerned with functional dependencies; fourth normal form (4NF) focuses on multi-valued dependencies. A dependency is when information in a table determines other information in the same table. A multi-valued dependency occurs when one or more rows in a table imply the presence of more rows in the same table. For any non-trivial multi-valued dependency, the determining columns must form a super key, as in the sketch below.
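A minimal sketch of a multi-valued dependency, assuming a hypothetical table of employee skills and languages, two facts that vary independently for the same employee:

```sql
-- Violates 4NF: skills and languages are independent multi-valued facts,
-- so every skill/language combination has to be stored as its own row.
CREATE TABLE EmployeeSkillLanguage_Bad (
    EmployeeId INT,
    Skill      VARCHAR(50),
    Language   VARCHAR(50),
    PRIMARY KEY (EmployeeId, Skill, Language)
);

-- 4NF: keep each independent multi-valued fact in its own table.
CREATE TABLE EmployeeSkill (
    EmployeeId INT,
    Skill      VARCHAR(50),
    PRIMARY KEY (EmployeeId, Skill)
);

CREATE TABLE EmployeeLanguage (
    EmployeeId INT,
    Language   VARCHAR(50),
    PRIMARY KEY (EmployeeId, Language)
);
```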
Fifth Normal Form (5NF) isolates semantically related relationships. A table is in fifth normal form if and only if all non-trivial join dependencies are implied by the candidate keys, meaning the table cannot be non-trivially decomposed into smaller tables and rejoined except along its candidate keys. It’s rare for a 4NF table not to meet the criteria for 5NF.
A table is in sixth normal form (6NF) if and only if all of its join dependencies are trivial. The idea is to break the data down into irreducible components, which becomes important when recording historical data or changes over time. It is used in some data warehouses where the benefits of less storage space outweigh the retrieval cost of having so many tables.
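A rough sketch of the 6NF idea applied to historical data, with hypothetical tables: each time-varying attribute gets its own table keyed by the entity plus the period the value was valid, so changes can be recorded per attribute without repeating values that did not change.

```sql
-- One table per time-varying attribute (illustrative temporal decomposition).
CREATE TABLE CustomerName_History (
    CustomerId INT,
    ValidFrom  DATE,
    ValidTo    DATE,
    Name       VARCHAR(100),
    PRIMARY KEY (CustomerId, ValidFrom)
);

CREATE TABLE CustomerAddress_History (
    CustomerId INT,
    ValidFrom  DATE,
    ValidTo    DATE,
    Address    VARCHAR(200),
    PRIMARY KEY (CustomerId, ValidFrom)
);
```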
IoTease: Product
Velocitip Uno App
This is an app created by Full Flight Technology, a company that built a system for measuring ballistic information from an “electronic field tip.” While the ballistic system is fascinating, its price puts it pretty much exclusively in the range of serious competition archers. The Uno app, however, will help even amateur archers like me sight in their bows. To use it, just sight in at one distance and provide the app with certain information. It will then customize the pin selection for any setup so you don’t have to sight in at different distances. All you have to do is match the display on your phone screen.
Tricks of the Trade
Be aware of trade-offs. You may end up working with a person who wants to over-normalize. This may not be the best approach in an OLTP database, as you may need to flatten data to speed up retrieval. Joins have a cost that you need to take into account, and there’s a trade-off between read speed and write speed in a database. When you want to get pedantic, be aware that there are trade-offs for what you are pushing.
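A rough illustration of that trade-off, reusing the hypothetical customer, order, and product tables from the sketches above: the normalized query pays for its joins on every read, while a deliberately denormalized reporting table reads quickly but must be kept in sync on every write.

```sql
-- Normalized read: no duplicated data, but every report pays for the joins.
SELECT c.Name, o.OrderId, o.OrderDate, p.ProductName, ol.Quantity
FROM Orders o
JOIN Customers c  ON c.CustomerId = o.CustomerId
JOIN OrderLine ol ON ol.OrderId   = o.OrderId
JOIN Product p    ON p.ProductId  = ol.ProductId;

-- Deliberately denormalized reporting table: faster to read,
-- but writes now have to keep two copies of the data consistent.
CREATE TABLE OrderReport (
    OrderId      INT,
    OrderDate    DATE,
    CustomerName VARCHAR(100),
    ProductName  VARCHAR(100),
    Quantity     INT
);
```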