
Maintaining Integrity
A database is valuable only if you’re reasonably sure that the data it contains is correct. In medical, aircraft, and spacecraft databases, for example, incor-rect data can lead to loss of life. Incorrect data in other applications may have less severe consequences but can still prove damaging. The database designer must make sure that incorrect data never enters the database.
Some problems can’t be stopped at the database level. The application pro-grammer must intercept these problems before they can damage the data-base. Everyone responsible for dealing with the database in any way must remain conscious of the threats to data integrity and take appropriate action to nullify those threats.
Databases can experience several distinctly different kinds of integrity — and a number of problems that can affect integrity. In the following sections, I dis-cuss three types of integrity: entity, domain, and referential. I also look at some of the problems that can threaten database integrity.
Entity integrity
Every table in a database corresponds to an entity in the real world. That entity may be physical or conceptual, but in some sense, the entity’s exis-tence is independent of the database. A table has entity integrity if the table is
entirely consistent with the entity that it models. To have entity integrity, a table must have a primary key. The primary key uniquely identifies each row in the table. Without a primary key, you can’t be sure that the row retrieved is the one you want.
To maintain entity integrity, you need to specify that the column or group of columns that comprise the primary key are NOT NULL. In addition, you must constrain the primary key to be UNIQUE. Some SQL implementations enable you to add such a constraint to the table definition. With other implementa-tions, you must apply the constraint later, after you specify how to add, change, or delete data from the table. The best way to ensure that your pri-mary key is both NOT NULL and UNIQUE is to give the key the PRIMARY KEY constraint when you create the table, as shown in the following example:
CREATE TABLE CLIENT (
ClientName CHARACTER (30) PRIMARY KEY,
Address1 CHARACTER (30),
Address2 CHARACTER (30),
City CHARACTER (25),
State CHARACTER (2),
PostalCode CHARACTER (10),
Phone CHARACTER (13),
Fax CHARACTER (13),
ContactPerson CHARACTER (30)
) ;
An alternative is to use NOT NULL in combination with UNIQUE, as shown in the following example:
CREATE TABLE CLIENT (
ClientName CHARACTER (30) NOT NULL,
Address1 CHARACTER (30),
Address2 CHARACTER (30),
City CHARACTER (25),
State CHARACTER (2),
PostalCode CHARACTER (10),
Phone CHARACTER (13),
Fax CHARACTER (13),
ContactPerson CHARACTER (30),
UNIQUE (ClientName) ) ;
Domain integrity
You usually can’t guarantee that a particular data item in a database is cor-rect, but you can determine whether a data item is valid. Many data items have a limited number of possible values. If you make an entry that is not one of the possible values, that entry must be an error. The United States, for example, has 50 states plus the District of Columbia, Puerto Rico, and a few possessions. Each of these areas has a two-character code that the U.S. Postal Service recognizes. If your database has a State column, you can enforce domain integrity by requiring that any entry into that column be one of the rec-ognized two-character codes. If an operator enters a code that’s not on the list of valid codes, that entry breaches domain integrity. If you test for domain integrity, you can refuse to accept any operation that causes such a breach.
Domain integrity concerns arise if you add new data to a table by using either the INSERT or the UPDATE statements. You can specify a domain for a column by using a CREATE DOMAIN statement before you use that column in a CREATE TABLE statement, as shown in the following example:
CREATE DOMAIN LeagueDom CHAR (8)
CHECK (LEAGUE IN (‘American’, ‘National’));
CREATE TABLE TEAM (
TeamName CHARACTER (20) NOT NULL,
League LeagueDom NOT NULL
) ;
The domain of the League column includes only two valid values: American and National. Your DBMS doesn’t enable you to commit an entry or update to the TEAM table unless the League column of the row you’re adding has a value of either ‘American’ or ‘National’.
Referential integrity
Even if every table in your system has entity integrity and domain integrity, you may still have a problem because of inconsistencies in the way one table relates to another. In most well-designed databases, every table contains at least one column that refers to a column in another table in the database. These references are important for maintaining the overall integrity of the database. The same references, however, make update anomalies possible.
Update anomalies are problems that can occur after you update the data in a row of a database table.
The relationships among tables are generally not bi-directional. One table is usually dependent on the other. Say, for example, that you have a database with a CLIENT table and an ORDERS table. You may conceivably enter a client into the CLIENT table before she makes any orders. You can’t, however, enter an order into the ORDERS table unless you already have an entry in the CLIENT table for the client who’s making that order. The ORDERS table is dependent on the CLIENT table. This kind of arrangement is often called a parent-child relationship, where CLIENT is the parent table and ORDERS is the child table. The child is dependent on the parent. Generally, the primary key of the parent table is a column (or group of columns) that appears in the child table. Within the child table, that same column (or group) is a foreign key. A foreign key may contain nulls and need not be unique.
Update anomalies arise in several ways. A client moves away, for example, and you want to delete her from your database. If she has already made some orders, which you recorded in the ORDERS table, deleting her from the CLIENT table could present a problem. You’d have records in the ORDERS (child) table for which you have no corresponding records in the CLIENT (parent) table. Similar problems can arise if you add a record to a child table without making a corresponding addition to the parent table. The corresponding foreign keys in all child tables must reflect any changes to the primary key of a row in a parent table; otherwise, an update anomaly results.
You can eliminate most referential integrity problems by carefully controlling the update process. In some cases, you need to cascade deletions from a parent table to its children. To cascade a deletion, when you delete a row from a parent table, you also delete all the rows in its child tables that have foreign keys that match the primary key of the deleted row in the parent table. Take a look at the following example:
CREATE TABLE CLIENT (
ClientName CHARACTER (30) PRIMARY KEY,
Address1 CHARACTER (30),
Address2 CHARACTER (30),
City CHARACTER (25) NOT NULL,
State CHARACTER (2),
PostalCode CHARACTER (10),
Phone CHARACTER (13),
Fax CHARACTER (13),
ContactPerson CHARACTER (30)
) ;
CREATE TABLE TESTS (
TestName CHARACTER (30) PRIMARY KEY,
StandardCharge CHARACTER (30)
) ;
CREATE TABLE EMPLOYEE (
EmployeeName CHARACTER (30) PRIMARY KEY,
ADDRESS1 CHARACTER (30),
Address2 CHARACTER (30),
City CHARACTER (25),
State CHARACTER (2),
PostalCode CHARACTER (10),
HomePhone CHARACTER (13),
OfficeExtension CHARACTER (4),
HireDate DATE,
JobClassification CHARACTER (10),
HourSalComm
) ;
CHARACTER (1)
CREATE TABLE ORDERS (
OrderNumber INTEGER
ClientName CHARACTER (30),
TestOrdered CHARACTER (30),
Salesperson CHARACTER (30),
OrderDate DATE,
PRIMARY KEY,
CONSTRAINT NameFK FOREIGN KEY (ClientName)
REFERENCES CLIENT (ClientName)
ON DELETE CASCADE,
CONSTRAINT TestFK FOREIGN KEY (TestOrdered)
REFERENCES TESTS (TestName)
ON DELETE CASCADE,
CONSTRAINT SalesFK FOREIGN KEY (Salesperson)
REFERENCES EMPLOYEE (EmployeeName)
ON DELETE CASCADE
) ;
The constraint NameFK names ClientName as a foreign key that references the ClientName column in the CLIENT table. If you delete a row in the CLIENT table, you also automatically delete all rows in the ORDERS table that have the same value in the ClientName column as those in the ClientName column of the CLIENT table. The deletion cascades down from the CLIENT table to the ORDERS table. The same is true for the foreign keys in the ORDERS table that refer to the primary keys of the TESTS and EMPLOYEE tables.
You may not want to cascade a deletion. Instead, you may want to change the child table’s foreign key to a NULL value. Consider the following variant of the previous example:
CREATE TABLE ORDERS (
OrderNumber INTEGER
ClientName CHARACTER (30),
TestOrdered CHARACTER (30),
SalesPerson CHARACTER (30),
OrderDate DATE,
PRIMARY KEY,
CONSTRAINT NameFK FOREIGN KEY (ClientName)
REFERENCES CLIENT (ClientName), CONSTRAINT TestFK FOREIGN KEY (TestOrdered)
REFERENCES TESTS (TestName),
CONSTRAINT SalesFK FOREIGN KEY (Salesperson)
REFERENCES EMPLOYEE (EmployeeName)
ON DELETE SET NULL
) ;
The constraint SalesFK names the Salesperson column as a foreign key that references the EmployeeName column of the EMPLOYEE table. If a sales-person leaves the company, you delete her row in the EMPLOYEE table. New
salespeople are eventually assigned to her accounts, but for now, deleting her name from the EMPLOYEE table causes all of her orders in the ORDER table to receive a null value in the Salesperson column.
Another way to keep inconsistent data out of a database is to refuse to permit an addition to a child table until a corresponding row exists in its parent table. Yet another possibility is to refuse to permit changes to a table’s primary key. If you refuse to permit rows in a child table without a corresponding row in a parent table, you prevent the occurrence of “orphan” rows in the child table. This refusal helps maintain consistency across tables. If you refuse to permit changes to a table’s primary key, you don’t need to worry about updating for-eign keys in other tables that depend on that primary key.
Potential problem areas
Data integrity is subject to assault from a variety of quarters. Some of these problems arise only in multitable databases, whereas others can happen even in databases that contain only a single table. You want to recognize and mini-mize all these potential threats.
Bad input data
The source documents or data files that you use to populate your database may contain bad data. This data may be a corrupted version of the correct data, or it may not be the data you want. Range checks tell you whether the data has domain integrity. This type of check catches some problems but not all. Field values that are within the acceptable range, but are nonetheless incorrect, aren’t identified as problems.
Operator error
Your source data may be correct, but the data entry operator incorrectly transcribes the data. This type of error can lead to the same kinds of prob-lems as bad input data. Some of the solutions are the same, too. Range checks help, but they’re not foolproof. Another solution is to have a second operator independently validate all the data. This approach is costly, because indepen-dent validation takes twice the number of people and twice the time. But in some cases where data integrity is critical, the extra effort and expense may prove worthwhile.
Mechanical failure
If you experience a mechanical failure, such as a disk crash, the data in the table may be destroyed. Good backups are your main defense against this problem.
Malice
Consider the possibility that someone may want to intentionally corrupt your data. Your first line of defense is to deny database access to anyone who may have a malicious intent, and restrict everyone else’s access only to what they need. Your second defense is to maintain data backups in a safe place. Periodically reevaluate the security features of your installation. Being just a little paranoid doesn’t hurt.
Data redundancy
Data redundancy is a big problem with the hierarchical database model, but the problem can plague relational databases, too. Not only does such redun-dancy waste storage space and slow down processing, but it can also lead to serious data corruption. If you store the same data item in two different tables in a database, the item in one of those tables may change, while the corresponding item in the other table remains the same. This situation gener-ates a discrepancy, and you may have no way of determining which version is correct. A good idea is to hold data redundancy to a minimum. A certain amount of redundancy is necessary for the primary key of one table to serve as a foreign key in another. Try to avoid any redundancy beyond that.
After you eliminate most redundancy from a database design, you may find that performance is now unacceptable. Operators often purposefully use redundancy to speed up processing. In the previous example, the ORDERS table contains only the client’s name to identify the source of each order. If you prepare an order, you must join the ORDERS table with the CLIENT table to get the client’s address. If this joining of tables makes the program that prints orders run too slowly, you may decide to store the client’s address redundantly in the ORDERS table. This redundancy offers the advantage of printing the orders faster but at the expense of slowing down and complicat-ing any updating of the client’s address.
A common practice is to initially design a database with little redundancy and with high degrees of normalization and then, after finding that important applications run slowly, to selectively add redundancy and denormalize. The key word here is selectively. The redundancy that you add back in has a spe-cific purpose, and because you’re acutely aware of both the redundancy and the hazard it represents, you take appropriate measures to ensure that the redundancy doesn’t cause more problems than it solves.
Exceeding the capacity of your DBMS
A database system might work properly for years and then intermittently start experiencing errors, which become progressively more serious. This may be a sign that you are approaching one of the system’s capacity limits. There are limits to the number of rows that a table may have. There are also
limits on columns, constraints, and other things. Check the current size and content of your database against the specifications of your DBMS. If you’re near the limit in any area, consider upgrading to a higher capacity system. Or, you may want to archive older data that is no longer active and then delete it from your database.
No comments:
Post a Comment