Normalizing the Database
Some ways of organizing data are better than others. Some are more logical. Some are simpler. Some are better at preventing inconsistencies when you start using the database.
A host of problems — called modification anomalies — can plague a database if you don’t structure the database correctly. To prevent these problems, you can normalize the database structure. Normalization generally entails split-ting one database table into two simpler tables.
Modification anomalies are so named because they are generated by the addi-tion of, change to, or deletion of data from a database table.
To illustrate how modification anomalies can occur, consider the table shown in Figure.
SALES
|
|||||
Customer_ID
|
Product
|
Price
|
|||
1001
|
Laundry detergent
|
12
|
|||
1007
|
Toothpaste
|
3
|
|||
This SALES
|
|||||
table
|
1010
|
Chlorine
bleach
|
4
|
||
leads to
|
|||||
modification
|
|||||
1024
|
Toothpaste
|
3
|
Your company sells household cleaning and personal-care products, and you charge all customers the same price for each product. The SALES table keeps track of everything for you. Now assume that customer 1001 moves out of the area and no longer is a customer. You don’t care what he’s bought in the past, because he’s not going to buy again. You want to delete his row from the table. If you do so, however, you don’t just lose the fact that cus-tomer 1001 has bought laundry detergent; you also lose the fact that laundry detergent costs $12. This situation is called a deletion anomaly.
In deleting one fact (that customer 1001 bought laundry detergent), you inadvertently delete another fact (that laundry detergent costs $12).
You can use the same table to illustrate an insertion anomaly. For example, say that you want to add stick deodorant to your product line at a price of $2. You can’t add this data to the SALES table until a customer buys stick deodorant.
The problem with the SALES table in the figure is that this table deals with more than one thing. The table covers which products customers buy and what the products cost. You need to split the SALES table into two tables, each one dealing with only one theme or idea, as shown in Figure
CUST_PURCH
|
PROD_PRICE
|
|||||||||
Customer_ID
|
Product
|
Product
|
Price
|
|||||||
1001
|
Laundry
detergent
|
Laundry
detergent
|
12
|
|||||||
1007
|
Toothpaste
|
Toothpaste
|
3
|
|||||||
The SALES
|
||||||||||
1010
|
Chlorine
bleach
|
Chlorine
bleach
|
4
|
|||||||
table is split
|
||||||||||
into two
|
||||||||||
1024
|
Toothpaste
|
|||||||||
tables.
|
||||||||||
CUST_PURCH, which deals with the single idea of customer purchases PROD_PRICE, which deals with the single idea of product pricing
You can now delete the row for customer 1001 from CUST_PURCH without losing the fact that laundry detergent costs $12. That fact is now stored in PROD_PRICE. You can also add stick deodorant to PROD_PRICE, whether anyone has bought the product or not. Purchase information is stored else-where, in the CUST_PURCH table.
The process of breaking up a table into multiple tables, each of which has a single theme, is called normalization. A normalization operation that solves one problem may not affect others. You may need to perform several succes-sive normalization operations to reduce each resulting table to a single theme. Each database table should deal with one — and only one — main theme. Sometimes, determining that a table really deals with two or more themes is difficult.
You can classify tables according to the types of modification anomalies to which they’re subject. In E. F. Codd’s 1970 paper, the first to describe the rela-tional model, Codd identified three sources of modification anomalies and defined first, second, and third normal forms (1NF, 2NF, 3NF) as remedies to those types of anomalies. In the ensuing years, Codd and others discovered additional types of anomalies and specified new normal forms to deal with them. The Boyce-Codd normal form (BCNF), the fourth normal form (4NF), and the fifth normal form (5NF) each afforded a higher degree of protection against modification anomalies. Not until 1981, however, did a paper, written by Ronald Fagin, describe domain/key normal form (DK/NF). Using this last normal form enables you to guarantee that a table is free of modification anomalies.
The normal forms are nested in the sense that a table that’s in 2NF is automati-cally also in 1NF. Similarly, a table in 3NF is automatically in 2NF, and so on. For most practical applications, putting a database in 3NF is sufficient to ensure a high degree of integrity. To be absolutely sure of its integrity, you must put the database into DK/NF.
After you normalize a database as much as possible, you may want to make selected denormalizations to improve performance. If you do, be aware of the types of anomalies that may now become possible.
First normal form
To be in first normal form (1NF), a table must have the following qualities:
The table is two-dimensional, with rows and columns.
Each row contains data that pertains to some thing or portion of a thing.
Each column contains data for a single attribute of the thing it’s describing.
Each cell (intersection of a row and a column) of the table must have only a single value.
Entries in any column must all be of the same kind. If, for example, the entry in one row of a column contains an employee name, all the other rows must contain employee names in that column, too.
Each column must have a unique name.
No two rows may be identical (that is, each row must be unique).
The order of the columns and the order of the rows is not significant.
A table (relation) in first normal form is immune to some kinds of modifi-cation anomalies but is still subject to others. The SALES table shown in Figure is in first normal form, and as discussed previously, the table is subject to deletion and insertion anomalies. First normal form may prove useful in some applications but unreliable in others.
Second normal form
To appreciate second normal form, you must understand the idea of functional dependency. A functional dependency is a relationship between or among attri-butes. One attribute is functionally dependent on another if the value of the second attribute determines the value of the first attribute. If you know the value of the second attribute, you can determine the value of the first attribute.
Suppose, for example, that a table has attributes (columns) StandardCharge, NumberOfTests, and TotalCharge, which relate through the following equation:
TotalCharge = StandardCharge * NumberOfTests
TotalCharge is functionally dependent on both StandardCharge and NumberOfTests. If you know the values of StandardCharge and NumberOfTests, you can determine the value of TotalCharge.
Every table in first normal form must have a unique primary key. That key may consist of one or more than one column. A key consisting of more than one column is called a composite key. To be in second normal form (2NF), all non-key attributes (columns) must depend on the entire key. Thus, every relation that is in 1NF with a single attribute key is automatically in second normal form. If a relation has a composite key, all non-key attributes must depend on all components of the key. If you have a table where some non-key attri-butes don’t depend on all components of the key, break the table up into two or more tables so that, in each of the new tables, all non-key attributes depend on all components of the primary key.
Sound confusing? Look at an example to clarify matters. Consider a table like the SALES table back in Figure. Instead of recording only a single pur-chase for each customer, you add a row every time a customer buys an item for the first time. An additional difference is that “charter” customers (those with CustomerID values of 1001 to 1009) get a discount from the normal price. Figure shows some of this table’s rows.
SALES_TRACK
|
|||||||
Customer_ID
|
Product
|
Price
|
|||||
1001
|
Laundry
detergent
|
11.00
|
|||||
In the
|
|||||||
1007
|
Toothpaste
|
2.70
|
|||||
SALES_
|
|||||||
TRACK
|
|||||||
table, the
|
1010
|
Chlorine
bleach
|
4.00
|
||||
Customer_
|
|||||||
ID and
|
1024
|
Toothpaste
|
3.00
|
||||
Product
|
|||||||
columns
|
1010
|
Laundry detergent
|
12.00
|
||||
constitute a
|
|||||||
composite
|
1001
|
Toothpaste
|
2.70
|
||||
key.
|
|||||||
In Figure, CustomerID does not uniquely identify a row. In two rows, CustomerID is 1001. In two other rows, CustomerID is 1010. The combina-tion of the CustomerID column and the Product column uniquely identifies a row. These two columns together are a composite key.
If not for the fact that some customers qualify for a discount and others don’t, the table wouldn’t be in second normal form, because Price (a non-key attribute) would depend only on part of the key (Product). Because some customers do qualify for a discount, Price depends on both CustomerID and Product, and the table is in second normal form.
Third normal form
Tables in second normal form are subject to some types of modification anomalies. These anomalies come from transitive dependencies.
A transitive dependency occurs when one attribute depends on a second attribute, which depends on a third attribute. Deletions in a table with such a dependency can cause unwanted information loss. A relation in third normal form is a relation in second normal form with no transitive dependencies.
Look again at the SALES table in Figure, which you know is in first normal form. As long as you constrain entries to permit only one row for each CustomerID, you have a single-attribute primary key, and the table is in second normal form. However, the table is still subject to anomalies. What if customer 1010 is unhappy with the chlorine bleach, for example, and returns
the item for a refund? You want to remove the third row from the table, which records the fact that customer 1010 bought chlorine bleach. You have a prob-lem: If you remove that row, you also lose the fact that chlorine bleach has a price of $4. This situation is an example of a transitive dependency. Price depends on Product, which, in turn, depends on the primary key CustomerID.
Breaking the SALES table into two tables solves the transitive dependency problem. The two tables shown in Figure, CUST_PURCH and PROD_PRICE, make up a database that’s in third normal form.
Domain-key normal form (DK/NF)
After a database is in third normal form, you’ve eliminated most, but not all, chances of modification anomalies. Normal forms beyond the third are defined to squash those few remaining bugs. Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF) are examples of such forms. Each form eliminates a possible modification anomaly but doesn’t guarantee prevention of all possible modification anomalies. Domain-key normal form (DK/NF), however, provides such a guarantee.
A relation is in domain-key normal form (DK/NF) if every constraint on the relation is a logical consequence of the definition of keys and domains. A con-straint in this definition is any rule that’s precise enough that you can evalu-ate whether or not it’s true. A key is a unique identifier of a row in a table. A domain is the set of permitted values of an attribute.
Look again at the database in Figure, which is in 1NF, to see what you must do to put that database in DK/NF.
Table: SALES (CustomerID, Product, Price)
Key: CustomerID
Constraints: 1. CustomerID determines Product
2. Product determines Price
3. CustomerID must be an integer > 1,000
To enforce Constraint 3 (that CustomerID must be an integer greater than 1,000), you can simply define the domain for CustomerID to incorporate this constraint. That makes the constraint a logical consequence of the domain of the CustomerID column. Product depends on CustomerID, and CustomerID is a key, so you have no problem with Constraint 1, which is a logical conse-quence of the definition of the key. Constraint 2 is a problem. Price depends on (is a logical consequence of) Product, and Product isn’t a key. The solu-tion is to divide the SALES table into two tables. One table uses CustomerID as a key, and the other uses Product as a key. This setup is what you have in Figure. The database in Figure, besides being in 3NF, is also in DK/NF.
Design your databases so they’re in DK/NF if possible. If you do so, enforcing key and domain restrictions causes all constraints to be met. Modification anomalies aren’t possible. If a database’s structure is designed so that you can’t put it into domain-key normal form, you must build the constraints into the application program that uses the database. The database doesn’t guar-antee that the constraints will be met.
Abnormal form
Sometimes being abnormal pays off. You can get carried away with normal-ization and go too far. You can break up a database into so many tables that the entire thing becomes unwieldy and inefficient. Performance can plummet. Often, the optimal structure is somewhat denormalized. In fact, practical data-bases are almost never normalized all the way to DK/NF. You want to normal-ize the databases you design as much as possible, however, to eliminate the possibility of data corruption that results from modification anomalies.
After you normalize the database as far as you can, make some retrievals. If performance isn’t satisfactory, examine your design to see whether selective denormalization would improve performance without sacrificing integrity. By carefully adding redundancy in strategic locations and denormalizing, you can arrive at a database that’s both efficient and safe from anomalies.
No comments:
Post a Comment