Design a Better SQL Database With Database Normalization
18 Apr 2014This is the first post in a series of posts about doing data science with SQL.
In this post, I will go over the benefits of laying out data in a relational database like SQL. By way of a simple example, I will then go over the basics of how to design a robust database and the concept of database normalization. These topics are essential in being able to design and interact effectively with databases.
A Simple Example Recipes Database
We will introduce these topics through a simple example. Suppose that we wanted to store information about recipes in a cookbook.
The fundamental building block of SQL databases are two-dimensional tables. This may seem like a limited design. But as you will see, this limitation will become incredibly powerful.
For our cookbook, each recipe will have a name. So we can begin by
creating a table of recipe names. Our recipes
table will be:
recipe_id | recipe_name |
---|---|
0 | Tacos |
1 | Tomato Soup |
2 | Grilled Cheese |
We associate a unique ID with each recipe so that we can connect rows in this table to rows in other tables (more on this soon).
Next, we need a table listing all the ingredients.
To make later examples more interesting,
we will also also assume that each ingredient
has a price. Our ingredients
table is:
ingredient_id | ingredient_name | ingredient_price |
---|---|---|
0 | Beef | 5 |
1 | Lettuce | 1 |
2 | Tomatoes | 2 |
3 | Taco Shell | 2 |
4 | Cheese | 3 |
5 | Milk | 1 |
6 | Bread | 2 |
Again, the ingredient IDs are to connect rows in this table to other
tables. Finally, we need some way of listing what ingredients are
in each recipe. Although we might naturally want to put this
information into the recipe
table, it is advantageous to use a third
table to store this information.
Although it might seem cumbersome at first, a straightforward way
to store this information is to make another table listing all
(recipe_id
, ingredient_id
) pairs. I will discuss
in the next section why this is advantageous.
To make the later examples more interesting, we will also
list an amount of each ingredient required in each recipe.
Our recipe_ingredients
table is:
recipe_id | ingredient_id | amount |
---|---|---|
0 | 0 | 1 |
0 | 1 | 2 |
0 | 2 | 2 |
0 | 3 | 3 |
0 | 4 | 1 |
1 | 2 | 2 |
1 | 5 | 1 |
2 | 4 | 1 |
2 | 6 | 2 |
Of course, this example is very simplified. We could imagine that a real database, like the one curated by yummly.com, would have a lot more information (more tables, more columns, and more rows). But this example should be sufficient to allow for interesting examples later.
Before we move on, I will mention two bits of terminology common to databases. A schema is the structure of the tables in the database. For our example, our three tables make up the scheme of our recipes database.
A query in SQL is a command which retrieves data from a database. For example, we might want to query the number of recipes in our database.
How to Normalize Your Database
At this point, you should be confused about why we laid out the
database across three tables and why we need all these strange IDs.
And finally, you might be wondering why each recipe-ingredient pair
has to take up an entire row in the recipe_ingredients
table.
This leads us to the concept of database normalization. Database normalization is the process of designing a database so that every piece of information shows up in only one place in the database. This is the most important concept for designing effective databases.
In principle, we could directly cram all three tables into one larger table:
recipe_name | ingredient_name | amount | price |
---|---|---|---|
Tacos | Beef | 1 | 5 |
Tacos | Lettuce | 2 | 1 |
… | … | … | … |
This would be a denormalized table.
But although it is easier to read, it is very fragile. For example,
at some point in the future somebody might decide that they wanted
the tacos to be made of Chicken instead of beef. In the process of
updating the table, they might accidentally change the name Tacos
to Taco
in only one of the rows:
recipe_name | ingredient_name | amount | price |
---|---|---|---|
Taco | Beef | 1 | 5 |
Tacos | Lettuce | 2 | 1 |
… | … | … | … |
A this point, our table would no longer be self consistent and this could break all sorts of downstream queries. Similar problems could happen if our code crashed after updating half of the rows. In addition, this schema also allows for the same ingredient to have different prices in different recipes which shouldn’t be possible.
Finding and fixing these sorts of bugs is costly, time consuming, and frustrating. The benefit of having small tables linked by IDs is that the IDs can be assumed to never change since they are only used inside the database. On the other hand, important information like names only show up in one place, avoiding inconsistencies and allowing for easier and less-error-prone modifications.
Another benefit of database normalization is that it scales nicely
to adding new kinds of information. For example, suppose we wanted
to store additional information about the steps needed to build the
recipe. We could just create a new table which linked back to the
recipes by their ID. The recipe_instructions
table might look
like:
recipe_id | step | step_description |
---|---|---|
0 | 0 | Put the Taco Shells in the oven |
0 | 1 | Cook the beef in a pan on medium |
Being able to add new data without changing the existing tables is much less error prone and potentially costly.
Finally, if the extra data only existed for a limited number of rows, this would be another win because we wouldn’t have to store empty rows for all the missing data.
Why Do Tables Have to be Flat?
The limitation that SQL tables have to be flat may at first seem very strange. Although it should be clear that there is probably some way to flattening out any data structure to fit into flat tables, it is probably unclear why this is better than just allowing nested data.
Scripting languages like ruby, python, or perl make it especially easy to nest data using dictionaries. In python, a common design for our recipes data might look like:
recipes = {
"Tacos": ("Beef", "Lettuce", "Tomatoes",
"Taco Shell", "Cheese"),
"Tomato Soup": ("Tomatoes", "Milk"),
"Grilled Cheese": ("Cheese", "Bread")
}
Using this dictionary, it would be easy to find all the ingredients for a particular recipe:
soup_ingredients = recipes["Tomato Soup"]
But the downside of this is that it would be expensive to ask other questions about the data. For example, finding all the recipes with a given ingredient would require looking through each recipe:
tomato_recipes = [i for i in recipes.keys() \
if "Tomatoes" in recipes[i]]
We could build a different data structure to make this other query easier:
recipes = {
"Beef": ("Tacos"),
"Lettuce": ("Tacos"),
"Tomatoes": ("Tacos", "Tomato Soup"),
"Taco Shell": ("Tacos"),
"Cheese": ("Tacos", "Grilled Cheese"),
"Milk": ("Tomato Soup")
"Bread": ("Grilled Cheese")
}
But this structure would make the first query more difficult!
Although it seems strange at first, flattening out the tables ensures that our data layout is not biased towards (or against) any particular kind of query. Instead, the designers of SQL optimized their database to perform all kinds of different queries against flat tables efficiently. There is a good chance that what we will want to query tomorrow is different from today, so this is a huge advantage.
As a final note, some newer databases like MongoDB natively support storing nested JSON-like data structures. Despite the caveats mentioned above, in real-world situations sometimes this is a necessity.
In the next post in this series of posts, I will explain how to install MySQL on your local machine to test out SQL commands.
If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!