Database Normalization




Normalization is the process that assures an efficient organization of data inside a database. The process is not very complicated but it can drastically improve the performance of a DBMS, especially in the case of a poor design. The two main goals of normalization are (1) to eliminate redundant data like what is happening when the same data is stored in more than one place, and (2) to ensure that data dependencies make sense, meaning that only related data is stored in a certain data table. Both of these goals are worth to follow as they ensure a reduce in the amount of data space used and the logical consistence of the stored data. Logical consistence means avoiding anomalies like the insertion anomalies, update anomalies, and deletion anomalies.

Figure 1 ( adapted from Aslam Memon's CS232 Database Lecture ). Normalization Process
NORMAL FORMS
The database community has developed a series of guidelines that should be followed in the database design. It is important to mention that these forms are just guidelines and nothing more, meaning that there are cases where either business rules or performance issues may lead to a design which is not conformant with the normal forms, but yields better overall performance.
. First normal form ( 1NF ): any multi valued attributes (repeating groups) have been removed, so there is a single value, which may be null at the intersection of each row and column of a table. Celko [1] observed that the structured query language SQL imposes a table to be in first normal form. The exceptions to this are vendor specific extensions like the introduction of the array data type, which are not compliant with the standard.



. Second normal form ( 2NF ): implies that the table is in 1NF + any partial functional dependencies have been removed; this happens in any of the following situations: (a) every non key attribute is determined by entire set of primary key attributes, (b) the primary key consists of only one attribute or (c) no non key attributes exist in the relation (thus all of the attributes in the relation are components of the primary key)
. Third normal form ( 3 NF ): is reached if the table is in 2NF + any transitive dependencies have been removed, so all non key columns are determined by the key, the whole key, and nothing but the key-Celko[1]
. Fourth normal form ( 4NF ): has to meet all the requirements of the third normal form + a relation is in 4NF if it has no multi-valued dependencies. If 4NF is satisfied, also the table is in BCNF.
. Boyce-Codd normal form ( BCNF ): this NF deals with tables that have alternate keys. A table is in Boyce - Codd NF if it is in 3rd NF and it remains in 3rd NF for any alternate key used as the primary key as well.
. Fifth normal form ( 5NF ): a table is in 5th NF if it is in 4th NF and there are no pair wise cyclical dependencies in the primary key comprised of three or more component fields.
Relations in 3NF are sufficient for most practical database applications, however 3NF does not guarantee that all anomalies have been removed so that is the reason we presented the 4NF, BCNF and 5NF forms, for completeness of the theoretical information on the subject.
Epocrates database design has been also the subject of a normalization process and most of the tables and relationships are conformant with the third normal form.