Database Normalization and Normal Forms

Database Normalization:

If you are doing database work from some time, you possibly have heard about Normalization before. Normalization is basically a process that is applied on databases to make sure of minimum redundancy (1. No duplicate data) and more consistency (2. Correctness of Data) of the data stored in the database.

Normalization is applied and measured by normalization levels. Normalization levels (Normal forms) are defined by Database experts. It means a database can be told in any one of normal form if it satisfies all the required criteria defined by that normal form.

Normalization, if applied correctly, can drastically improve the database performance. It reduces space used by data in the databases and it also helps in reducing update conflicts because of data redundancy.

Normal Forms:

Normal forms are basically normalization levels, used to measure the redundancy and consistency level of the databases. The higher the normal form means less data redundancy and more data consistency or more correctness of data. Normalization starts with 1NF (First Normal Form) and go further with 2NF (Second Normal Form), 3NF (Third Normal Form), BCNF (Boyce-Codd Normal Form or 3.5NF), 4NF (Forth Normal Form), 5NF and higher. 5NF and higher normal forms are rare seen and sometimes not practically possible, so we will not cover these normal forms in this post.

Normal forms are cumulative in nature; it means that, to be in particular normal form, database has to fulfill all the required criteria of all the lower normal forms. For example to be in 3NF database has to satisfy all the criteria of 2NF and 1NF.

To explain normal form we will normalize below Student table. This table has student data like his teachers, classes and class-rooms details. Rooms are assigned to Teachers on the class basis. For Example, Teacher Richard is assigned room Number 21 for class A, B and C.

Student Name

Student Address

Teacher Name

Room Number

Class1

Class2

Class3

John

Addr-A

Richard

21

A

B

D

David

Addr-B

Fillip

32

E

F

C

David

Addr-B

George

25

H

I

J

Fig.-1: Student Table

 

First Normal Form (1NF):

To be in first Normal form table has to satisfy following criteria.

1.       No duplicate columns in single table.

2.      Each column has single value (i.e. there is no group of values or array).

3.      Each row should be identified by unique key (A Primary Key).

In our example table, we have three columns named Class1, Class2, and Class3 for the same data. This violated first rule of 1NF.

One Alternative of this is to have one column named Classes having all class names in comma separated way (See Fig.-2). But this violates the second rule of 1NF that each column should have single value.

Student Name

Student Address

Teacher Name

Room Number

Classes

John

Addr-A

Richard

21

A, B, D

David

Addr-B

Fillip

32

E, F, C

David

Addr-B

George

25

H, I, J

Fig.-2: Column Classes has multiple values of same type.

Because every student can have multiple classes, we should have one row for each class of every student (See Fig.-3).

Student Name

Student Address

Teacher Name

Room Number

Class

John

Addr-A

Richard

21

A

John

Addr-A

Richard

21

B

John

Addr-A

Richard

21

D

David

Addr-B

Fillip

32

E

David

Addr-B

Fillip

32

F

David

Addr-B

Fillip

32

C

David

Addr-B

George

25

H

David

Addr-B

George

25

I

David

Addr-B

George

25

J

Fig.-3: Table has multiple rows of each student for every class

Now this table satisfies first two rules (No duplicate columns and no group of values of same type in a column) of 1NF. Before going further let’s understand the concept of Candidate Key and Primary Key.

Candidate Key: Any combination of columns (Single or multiple columns) of a table, that can be used to uniquely identify each row of a table, is the candidate key for that table. A table can have one or multiple candidate keys.

Primary Key: From the list of Candidate keys user can choose any one as a primary key, which will be used to uniquely identify rows of that table. A table can have only one primary key.

Now that we know the concept of Primary key and Candidate keys, we can apply it in our example.

We can set Student Name and Class column combination as Primary Key (See Fig.-4).

image

Fig.4: Student table having Primary Key set on Student Name and Class combination

Now Student table of Fig.-4 is in First Normal Form (1NF).

 

Second Normal Form (2NF):

To be in Second Normal Form Table has to satisfy following criteria.

1.       Table should be in First Normal Form.

2.      There should be no partial dependency between non-key attributes and composite key. A partial dependency occurs when a non-key attribute is dependent on only a part of the (composite) key.

3.      There should be no subset of data that comes with multiple rows in a table. Place that data into different table.

4.      Define relationships among the new tables and the base table using foreign key.

Here point 3 and 4 are conceptually same as point 2. If a table has partial dependency of any non-key column then it will always have some set of data that is repeated for multiple rows. We need to identify those set of data and place them into separate table and define proper relationships among those.

We can see in Fig-4 that Student Address column not fully functionally dependent on Primary key (Student Name, Class). We should keep Student Name and Student Address columns in Student table and place all other columns in separate table and name in Student-Teacher-Class table. 

Foreign Key: Foreign Key is set on a column which references a Primary Key on another table. It establishes referential integrity on a table. Foreign key make sure that no foreign key value exists in a referencing table, if it does not have corresponding value in referenced table.

We generated a unique Student ID column in Student table and set it as primary key. Then we created foreign key relationship with Student-Teacher-Class table using Student ID column.

We also create Primary Key of Student ID and Class column combination on Student-Teacher-Class table.

image

Fig-5: Student and Student-Teacher-Class table with Foreign Key Relationship

Now at this time Tables are in Second normal form.

Third Normal Form (3NF):

To be in Third Normal Form Table has to satisfy following criteria.

1.       Table should be in Second Normal Form.

2.      No non-key column should depend on other non-key column or has no transitive functional dependency.

To convert our tables into 3NF, we need to remove all non-key columns from a table which are not dependent upon primary key. In our example, Room Number (non-key) column on table Student-Teacher-Class is fully functionally depends on Teacher Name (non-key) and not on Student ID and Class (Primary Key) columns.

We need to Place Teacher Name and Room Number column in different table and create relationship with this table (See Fig.-6).

image

Fig.-6: Student-Teacher-Class table is decomposed and Teacher table created.

In Fig.-6 Student-Teacher-Class table is decomposed and Teacher table is created. We created Teacher ID as Primary key in Teacher table and define foreign key relationship of Teacher table with Student-Teacher-Class table using Teacher ID column. We set Student ID, Class and Teacher ID combination as primary key in Student-Teacher-Class table.

Now tables in Fig.6 are in Third Normal Form (3NF).

 

Boyce-Codd Normal Form (BCNF or 3.5NF):

To be in Boyce-Codd Normal Form Table has to satisfy following criteria.

1.       Table should be in Third Normal Form.

2.      Every determinant must be a candidate key.

Determinant Definition: Set of columns, used to identify other column values in the same row of a table.

In Student-Teacher-Class table we have two determinants. One is Class column which determines Student ID and Teacher ID. Second is Teacher ID column which determines Student ID.

Here Class column is a candidate key because each value of class uniquely identifies rows. But Teacher ID column is not candidate key because this column has duplicate values in different rows.

It is easy to check the anomalies in table Student-Teacher-Class. For Example, If a Teacher Richard (Teacher ID: 1) is leaving the school than the table will lose information that Student John (Student ID: 101) is attending classes A, B and D.

Of course the solution here is to decompose the table into two tables using Class column to relate these two tables (See Fig.-7).

image

Fig.-7: Student-Teacher-Class table is decomposed into Student-Class table and Teacher-Class table

Now as we can see in Fig.-7, every determinant of all tables is candidate key, hence we now have our tables are in Boyce-Codd Normal Form.

Fourth Normal Form (4NF):

To be in Fourth Normal Form table has to satisfy following criteria.

1.     Table should be in Third Normal Form.

2.     It should not have any Multi-Valued dependencies.

Multivalued dependencies occur when the presence of one or more rows in a table implies the presence of one or more other rows in that same table.

Examples:

For example, imagine a car company that manufactures many models of car, but always makes both red and blue colors of each model. If you have a table that contains the model name, color and year of each car the company manufactures, there is a multivalued dependency in that table. If there is a row for a certain model name and year in blue, there must also be a similar row corresponding to the red version of that same car.

In our example we don’t need to further normalize our tables.

Do i always need to normalize a database ?

Generally, Database Normalization is good for databases, but it is not always required. In some cases we intentionally need to violate normalization rules to enhance the performance of Database. Sometimes we have the requirement to duplicate same data in more than one tables, so that we can retrieve data easily using lesser number of Joins and where conditions.

For Example, In Data warehouse like environment where data modification operations are less happening and Reporting operations are frequently happening, we intentionally need to duplicate data in multiple tables, so that we can retrieve it easily.

 

I hope you will like this post. Please write your views on this topic in the comment section.

Advertisements

About Ashish Jain

I am Ashish Jain, a software engineer by profession. My goal of creating this blog is to share my knowledge of SQL server with all other SQL enthusiasts and also to learn from them.
This entry was posted in SQL Fundamentals and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s