Database normalization
Last updated
Last updated
Database normalization is a process used to design tables in a relational database in order to minimize the duplication of information and to ensure good data integrity. This means that data is kept complete, accurate, consistent, and in context. Data integrity is key to ensuring the data is actually useful to its owner.
If the same information is stored in different places in multiple tables, the risk of making the database inconsistent increases significantly. If a person's address is stored in multiple tables, and the address is changed in just one table, it is no longer possible to know which address is correct just by looking at the data. The database has lost data integrity.
It is common to refer to four levels (forms) of normalization in database modeling, levels 1 to 4. There are different requirements that must be met in order to obtain a certain level. In addition, all requirements of the lower levels must be met as well.
The following example presents a simple dataset and the steps to achieve each of the normalization levels in detail, including how it can be represented in Appfarm Create.
It is worth noting that having a data model of a higher normalization level often leads to more tables (object classes in Appfarm Create). This may have an impact on performance due to the need to join different tables to find complete information. With Appfarm Create, this means more data sources with deep data bindings and complex filters. Therefore, in systems with very complex data structures, there may be an advantage in adopting a lower degree of normalization.
This example features a dataset containing customers who have bought different items from different suppliers. The customers have also subscribed to different newsletters from the different suppliers.
In order to create a data model that supports this dataset in the best possible way, we will start by looking at the first level of normalization, the first normal form.
To obtain a data model that is in the first normal form the following requirements must be met:
Each cell must be single-valued.
Entries in a column must be of the same type.
Rows must be uniquely identified.
The example dataset violates the first requirement of the 1NF by having several values in a single cell in the columns Item and Newsletter. The dataset also violates the second requirement by having the values Unknown
and No number
in the columns Supplier and Supplier Phone. The third requirement is also not met since there is no unique identifier for each row.
The dataset below has been altered in order to satisfy the requirements for the 1NF. There are no longer cells containing more than one value, each column has data that is of the same type (Unknown
and No number
are replaced with proper values), and each row is now uniquely identified with the addition of the Cust _id column.
The following images show what the data model for this dataset would look like in Appfarm Create. It is a very simple data model, consisting of only one object class: Sales Records. However, this data model has several issues which can be address through further normalization.
To obtain a data model that is in the second normal form the following requirement must be met:
All attributes (non-ID columns) must be dependent on the key.
If you look at the non-ID columns in the dataset, are there any columns that are not dependent on the Cust _id column? For example, does the customer ID determine the price of the product? In other words, does the price depend on who buys the product? No. It does not matter who buys the product, the price will remain the same, so the dataset does not meet the requirements of the 2NF.
In order to fulfil the requirements of the 2NF, the columns that are not dependent on the key column must be separated. The Price, Item, Supplier, and Supplier Phone columns can be moved into a new table. Additionally, another new table known as a junction table, will hold the transactions.
The data model in Appfarm Create would have three object classes, with the Sales Records object class referencing both the Customer and Product object classes.
To obtain a data model that is in the third normal form the following requirement must be met:
All fields (columns) can be determined only by the key (_id) in the table and no other column.
Looking at the dataset, you can see that Supplier and Supplier Phone always go together. If you know the supplier, you know the supplier's phone number. If you added more items from Microsoft and Sony the phone numbers will keep repeating, meaning that if you want to change the supplier's phone number you have to change it in many places instead of just one.
To address this issue, you can create a new table to hold the supplier information. The Items table can then be updated to store a reference to a supplier instead of storing supplier information.
The data model in Appfarm Create now features four object classes, with the Product object class referencing the new Supplier object class.
To obtain a data model that is on the fourth normal form the following requirement must be met:
No multi-valued dependencies.
Looking at the Customer table in the dataset, you can see that all columns are dependent on the key (_id) column. However, there is still a problem. The customer Evan Wilson is in the Customer table twice with the same address since he has subscribed to two different newsletters. If he changes his address we would have to change it in two different places, and if he unsubscribes to the Dell news newsletter, should we just replace the value with null
? What if Alan Smith in Hanover Park also subscribes to Dell news? Then his name and address will occur twice in the Customer table as well.
To address this issue the Customer table must be normalized further. You can alter the Customer table to just hold the _id, Name, and Shipping Address so that names and addresses aren't stored multiple times per user. Then, you can create a Newsletter table that holds the name of the newsletter and a reference to its supplier. This means a supplier can have multiple newsletters. Finally, you can create a new junction table that holds all the customer subscriptions to newsletters by storing references to both the customers and the newsletters.
The final data model in Appfarm Create consists of six object classes with the addition of Newsletter and Newsletter subscription object classes.
You can see that some properties are marked as Required. This means that an object of this class cannot exist without a required value set. For example, each supplier must have a name. In addition, the Phone property of the supplier class is marked as Unique which means that a phone number can only exist once. This ensures that no suppliers will have the same phone number.
By normalizing the data model each customer only appears once at the same time as a customer can buy several products and subscribe to different newsletters. The same is true for products and suppliers. Each product appears only once, but can be bought multiple times. Each supplier appears only once but can have both multiple products and newsletters.