Saturday, October 31, 2009

8.10 Using ALTER TABLE to Normalize a Table




I l@ve RuBoard










8.10 Using ALTER TABLE to Normalize a Table




8.10.1 Problem



You have a table
that's not in normal form.





8.10.2 Solution



ALTER TABLE can help you
normalize it.





8.10.3 Discussion



The preceding sections describe how
to use ALTER TABLE in fairly
general terms. This section shows a more concrete application of the
statement by demonstrating how ALTER
TABLE can help you redesign tables that you
discover to contain unnecessarily redundant data and therefore to be
in non-normal form.



Suppose you have a client_billing table for
recording billable items that is defined as follows:



CREATE TABLE client_billing
(
id INT UNSIGNED NOT NULL, # client ID number
name CHAR(20) NOT NULL, # client name
address CHAR(20) NOT NULL, # client adddress
date DATE NOT NULL, # billable item date
minutes INT NOT NULL, # number of billable minutes
description CHAR(60) NOT NULL # what was done
);


When you have just one row of information per client, the table looks
perfectly fine:



+----+------+---------------+------------+---------+------------------+
| id | name | address | date | minutes | description |
+----+------+---------------+------------+---------+------------------+
| 21 | John | 46 North Ave. | 2001-07-15 | 48 | consult by phone |
| 43 | Toby | 123 Elm St. | 2001-07-13 | 12 | office visit |
+----+------+---------------+------------+---------+------------------+


But when
you add more records and have multiple rows per client, it becomes
apparent that some of the information is redundant. In particular,
names and addresses are stored in every record, even though you
really need that information only once for any given client:



+----+------+---------------+------------+---------+------------------+
| id | name | address | date | minutes | description |
+----+------+---------------+------------+---------+------------------+
| 21 | John | 46 North Ave. | 2001-07-15 | 48 | consult by phone |
| 21 | John | 46 North Ave. | 2001-07-19 | 120 | court appearance |
| 43 | Toby | 123 Elm St. | 2001-07-13 | 12 | office visit |
| 43 | Toby | 123 Elm St. | 2001-07-14 | 60 | draft proposal |
| 43 | Toby | 123 Elm St. | 2001-07-16 | 180 | present proposal |
+----+------+---------------+------------+---------+------------------+


The
way to fix the problem is to split the information into two tables
and associate the records in them using the id
values:




  • One table (client_info) holds the information that
    is unique to each client, one row per client: the ID, name, and
    address.


  • The other table (bill_item) holds information
    about each billable item: date, number of minutes, and the
    description of what was done. Each row also should include the client
    ID number so that the item can be associated with the proper
    client_info table record.



In other words, the client information from the
client_billing table can be separated as follows
into the client_info and
bill_item tables:



client_info table:



+----+------+---------------+
| id | name | address |
+----+------+---------------+
| 21 | John | 46 North Ave. |
| 43 | Toby | 123 Elm St. |
+----+------+---------------+


bill_item table:



+----+------------+---------+------------------+
| id | date | minutes | description |
+----+------------+---------+------------------+
| 21 | 2001-07-15 | 48 | consult by phone |
| 21 | 2001-07-19 | 120 | court appearance |
| 43 | 2001-07-13 | 12 | office visit |
| 43 | 2001-07-14 | 60 | draft proposal |
| 43 | 2001-07-16 | 180 | present proposal |
+----+------------+---------+------------------+


To accomplish this redesign, first create the
client_info and bill_item
tables using the following statements, where each
table's columns are defined the same way as the
corresponding columns from the original
client_billing table:



CREATE TABLE client_info
(
id INT UNSIGNED NOT NULL, # client ID number
name CHAR(20) NOT NULL, # client name
address CHAR(20) NOT NULL # client adddress
);
CREATE TABLE bill_item
(
id INT UNSIGNED NOT NULL, # client ID number
date DATE NOT NULL, # billable item date
minutes INT NOT NULL, # number of billable minutes
description CHAR(60) NOT NULL # what was done
);


Next, use INSERT INTO ...
SELECT to copy the appropriate columns from the
client_billing table into the two new tables. For
the client_info table, the client information can
be copied like this:



INSERT INTO client_info (id,name,address)
SELECT id,name,address FROM client_billing;


Copying information to the bill_item table is
similar:



INSERT INTO bill_item (id,date,minutes,description)
SELECT id,date,minutes,description FROM client_billing;


Records in the two new tables are
associated based on id values, so
it's a good idea to index that column in each table
to allow them to be related efficiently. However,
that's not simply a matter of issuing an
ALTER TABLE
tbl_name ADD
INDEX (id) statement for each
table. For one thing, the client_info table at
this point has multiple records per client that should be collapsed
down to a single record for each client. That means we should create
a PRIMARY KEY or
UNIQUE index on the id column,
using the IGNORE keyword to tell MySQL to discard
duplicate records. In addition, queries on the
bill_item table are likely often to be date-based,
so we can include the date column in the index.
ALTER TABLE statements to
create these indexes look like this:



ALTER IGNORE TABLE client_info ADD PRIMARY KEY (id);
ALTER TABLE bill_item ADD INDEX (id, date);


After performing the procedure just described, the
client_billing table no longer is needed and can
be removed:



DROP TABLE client_billing;


With billing records stored in multiple tables, queries to retrieve
information become somewhat more complex, but relating tables is what
relational databases are good at, after all. For example, to show the
client name and address from the client_info table
along with a sum of the billable minutes listed per client in the
bill_item table, the query looks like this:



mysql> SELECT client_info.id, client_info.name, client_info.address,
-> SUM(bill_item.minutes) AS 'total minutes'
-> FROM client_info, bill_item
-> WHERE client_info.id = bill_item.id
-> GROUP BY client_info.id;
+----+------+---------------+---------------+
| id | name | address | total minutes |
+----+------+---------------+---------------+
| 21 | John | 46 North Ave. | 168 |
| 43 | Toby | 123 Elm St. | 252 |
+----+------+---------------+---------------+


Multiple-table queries are discussed further in Chapter 12.



The preceding example illustrates how to normalize a table for which
data values are repeated across rows. Another
kind of non-normal form occurs when you have multiple columns
within rows that record the same kind of
information. For example, if you're running a study
that involves administering two tests to subjects and recording the
date and score for each test, you might use a table that has the
following structure:



CREATE TABLE test_subject
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name CHAR(20) NOT NULL, # subject name
date1 DATE, # date and score of first test
result1 INT,
date2 DATE, # score of second test
result2 INT,
PRIMARY KEY (id)
);


Information in the table looks like this:



+----+--------+------------+---------+------------+---------+
| id | name | date1 | result1 | date2 | result2 |
+----+--------+------------+---------+------------+---------+
| 1 | Fred | 2001-07-13 | 78 | 2001-07-14 | 85 |
| 2 | Barry | 2001-07-12 | 79 | 2001-07-14 | 82 |
| 3 | Portia | 2001-07-16 | 82 | 2001-07-18 | 95 |
+----+--------+------------+---------+------------+---------+


When the information is contained in a table of this form, if you
want to do something such as computing each
subject's average score, you can do so with the
following query:



SELECT id, name, (result1 + result2) / 2 FROM test_subject;


If you decide you need to add a third test, and a fourth, you can add
columns to the table easily:



ALTER TABLE test_subject
ADD date3 DATE, ADD result3 INT,
ADD date4 DATE, ADD result4 INT;


But the query to compute average scores becomes more complex:



SELECT id, name, (result1 + result2 + result3 + result4) / 4
FROM test_subject;


In fact, if scores can be NULL to signify that a
subject hasn't taken a test yet, the query is much
messier than that, because you must add only the
non-NULL values to avoid getting a
NULL sum. This ballooning complexity is a clue
that the table design could be improved. The problem is that the
structure of the table is dependent on the number of tests you plan
to give. If you change that number, you must change the table. A way
around this problem is to use two tables, structured so that they
need not be changed if you decide to administer additional tests:




  • The first table records information unique to each subject.


  • The second table records test results, one row per test. Each row can
    be associated with the proper subject using the subject ID number.



This is very similar to the approach used in the preceding example,
where we used two tables to hold client billing information. However,
instead of creating two new tables and then removing the original
table, we'll use a different procedure that
preserves the original table, albeit in modified form:




  • Create a table test_result to hold test dates and
    scores, as well as the subject ID number. To allow scores to be
    ordered, we can also add a test number column. (The date might serve
    for ordering, unless it's possible for subjects to
    take multiple tests on the same day. An explicit test number does not
    have that problem.)


  • Copy the subject ID, date, and score information from
    test_subject to test_result.


  • Remove the date and score columns from
    test_subject, leaving only subject IDs and names.



Begin by creating a table to hold test results:



CREATE TABLE test_result
(
id INT UNSIGNED NOT NULL,
test_num INT NOT NULL,
date DATE,
result INT
);


Copy the test information from test_subject to
test_result. It's necessary to do
this separately for each test, because each result is copied from a
different set of columns:



INSERT INTO test_result (id,test_num,date,result)
SELECT id,1,date1,result1 FROM test_subject WHERE result1 IS NOT NULL;

INSERT INTO test_result (id,test_num,date,result)
SELECT id,2,date2,result2 FROM test_subject WHERE result2 IS NOT NULL;


Each INSERT INTO ...
SELECT statement specifies the
test_num value
"manually," because the appropriate
value is not present in the test_subject table and
cannot be derived directly from its contents. The
WHERE clause serves to copy only rows with
non-NULL test results. This handles the
possibility that your test_subject records might
be incomplete, such that a NULL score indicates a
subject hasn't taken a given test. (If the
test_subject table has no such missing test
results, the WHERE clauses are unnecessary and can
be omitted.)



The test_result table is populated now and can be
indexed. Note that although the id column is a
unique-valued column in the test_subject table, it
cannot be so in test_result because there will be
multiple rows per subject. However, it's still
possible to create a unique index by using both id
and test_num, assuming each test is given to each
subject only once:



ALTER TABLE test_result ADD PRIMARY KEY (id, test_num);


The test result columns in test_subject are now
unneeded and can be removed:



ALTER TABLE test_subject DROP date1, DROP result1, DROP date2, DROP result2;


The advantage of using a two-table layout like this is that queries
to perform operations such as score-averaging become independent of
the number of tests administered:



SELECT id, AVG(result) FROM test_result GROUP BY id;


To show subject names, too, join the test_result
table to the test_subject table:



SELECT test_result.id, test_subject.name, AVG(test_result.result)
FROM test_subject, test_result
WHERE test_subject.id = test_result.id
GROUP BY test_result.id;


You also can easily identify subjects who have not completed all the
tests. For example, if you've presented a total of
four tests, you can find subjects who don't have
four scores in the test_result table like this:



SELECT test_subject.id, test_subject.name, COUNT(test_result.result) AS count
FROM test_subject LEFT JOIN test_result ON test_subject.id = test_result.id
GROUP BY test_subject.id
HAVING count < 4;


This query uses LEFT JOIN to
make sure that a count for every subject is generated, even those to
whom no tests have yet been administered. (A regular join will fail
to identify a subject who has a record in the
test_subject table but no records yet in the
test_result table, because it will find no match
between the tables.) LEFT JOIN
is discussed in Chapter 12.










    I l@ve RuBoard



    No comments:

    Post a Comment