2023-04-06

Normalization in Database

Normalization in Database

Normalization in the context of databases refers to the process of designing the structure of a relational database to reduce data redundancy and improve data integrity. The primary objectives of normalization include:

  • Eliminating Data Redundancy
    Normalization aims to ensure that each data element is stored in only one place. This is achieved by logically dividing the database into tables and establishing relationships between them.

  • Minimizing Data Modification Issues
    When data is stored in multiple places, keeping it consistent during modifications can be challenging. Normalization mitigates this issue by ensuring each piece of data resides in one place, thereby making data modification more manageable.

  • Simplifying Queries
    Well-organized and non-redundant data in a normalized database makes it easier to perform queries. This leads to enhanced performance, especially with large databases.

  • Maintaining Data Integrity
    Normalization helps maintain the accuracy and consistency of data throughout its lifecycle by reducing redundancy and establishing clear relationships among tables.

  • Accommodating Future Data
    A properly normalized database is more adaptable to future changes. As business requirements evolve, it becomes easier to modify the database if it is well-normalized.

Normal Forms

In normalization, normal forms are a set of rules that reduce data redundancy and improve logical consistency. The progression from the first normal form (1NF) to the fifth normal form (5NF) involves increasingly stringent rules.

First Normal Form (1NF)

A table is in the first normal form if it satisfies the following conditions:

  • It has a primary key.
  • All attributes contain atomic values (values that cannot be further decomposed).
  • Entries in a column are of the same kind.

Example

Consider a table that records the courses taken by students:

Student_ID Name Courses
1 Alice Math, History
2 Bob Science
3 Carol Math, Science, History

This table is not in 1NF because the Courses column contains non-atomic values. To make it 1NF compliant, we can break the data into separate rows:

Student_ID Name Course
1 Alice Math
1 Alice History
2 Bob Science
3 Carol Math
3 Carol Science
3 Carol History

Second Normal Form (2NF)

A table is in the second normal form if:

  • It is in 1NF.
  • It has no partial dependencies (i.e., no non-prime attribute is dependent on a proper subset of any candidate key).

Example

Consider a table where students are assigned to different classrooms in a school:

Student_ID Class_ID Class_Room Subject
1 A 101 Math
2 B 102 Science

Here, both Student_ID and Class_ID together form the primary key. The Class_Room depends only on a part of the primary key (Class_ID), which is a partial dependency. To make this table 2NF compliant, we can break it into two tables:

  • Student Table
Student_ID Class_ID Subject
1 A Math
2 B Science
  • Class Table
Class_ID Class_Room
A 101
B 102

Third Normal Form (3NF)

A table is in the third normal form if:

  • It is in 2NF.
  • It has no transitive dependencies (i.e., a non-prime attribute is not dependent on another non-prime attribute).

Example

Consider a table that records items and their prices in different currencies:

Item_ID Item_Name Currency Exchange_Rate Price_in_USD
1 Book EUR 1.2 20
2 Pen EUR 1.2 2

Here, Price_in_USD depends on Exchange_Rate, which is not part of the primary key. This is a transitive dependency. To make this table 3NF compliant, we can break it into two tables:

  • Item Table
Item_ID Item_Name Currency Price_in_USD
1 Book EUR 20
2 Pen EUR 2
  • Currency Table
Currency Exchange_Rate
EUR 1.2

Fourth Normal Form (4NF)

A table is in the Fourth Normal Form if:

  • It is in 3NF.
  • It has no multi-valued dependencies (an attribute is dependent on another attribute but not on the key).

Example

Consider a table that records the skills of employees in different departments:

Employee_ID Department Skill
1 HR Writing
1 HR Speaking
2 IT Coding
2 IT Testing

Here, Skill is independent of Department, but both are dependent on Employee_ID. This is a multi-valued dependency. To make this table 4NF compliant, we can break it into two tables:

  • Employee Table
Employee_ID Department
1 HR
2 IT
  • Skills Table
Employee_ID Skill
1 Writing
1 Speaking
2 Coding
2 Testing

Fifth Normal Form (5NF)

A table is in the Fifth Normal Form or Project-Join Normal Form if:

= It is in 4NF.
= It cannot be decomposed into any smaller tables without losing data or generating spurious data.

This normal form deals with cases where a table can be recreated by joining multiple smaller tables. For most practical purposes, reaching 4NF is often sufficient, and 5NF is used in more complex or specialized scenarios.

Denormalization

Denormalization is the process of strategically incorporating redundancy into a normalized database to optimize read performance. Although normalization is essential for maintaining data integrity and avoiding anomalies, it can sometimes lead to performance issues due to the need for multiple table joins in queries. This is where denormalization comes into play. Here are some situations when denormalization might be beneficial:

  • Read-Heavy Workloads
    If your database primarily serves read operations and query performance is crucial, denormalization can reduce the need for table joins and make queries faster.

  • Reporting and Analytics
    In cases where the database is used for reporting or analytics, denormalization can structure the data in a way that is more conducive to complex queries and aggregations.

  • Simplifying Data Retrieval for Applications
    Denormalization can make data retrieval more straightforward for client applications by reducing the complexity of SQL queries.

  • Scaling and Resource Limitations
    When working with limited resources, denormalization can sometimes be used as a way to achieve better performance without relying on more powerful hardware.

Balancing Normalization and Denormalization

It's important to recognize that denormalization is not an all-or-nothing approach. Often, the best database design involves a balance between normalization and denormalization. This is known as a hybrid approach. In a hybrid approach, certain portions of the database are denormalized to optimize for read performance, while other portions remain highly normalized to maintain data integrity. The decision on where to denormalize should be based on careful consideration of the specific requirements and constraints of your application and system.

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!