Tuesday, October 27, 2009

10.35 Using Temporary Tables for Data Transformation




I l@ve RuBoard










10.35 Using Temporary Tables for Data Transformation




10.35.1 Problem



You want to
preprocess input data for MySQL, but you don't have
access to external utilities for doing so.





10.35.2 Solution



Load the data into a temporary table, reformat it using SQL
statements, then copy the records into the final destination table.





10.35.3 Discussion



To work with information that must be checked or transformed before
it's ready to be added to a table,
it's sometimes helpful to load a datafile into a
temporary table first for validation purposes. (It's
generally easier to work with a dataset that is isolated into its own
table rather than combined with other records.) After you have made
sure that the temporary table's contents are
satisfactory, copy its rows to the main table and then drop it. (Note
that the use of "temporary" in this
context doesn't necessarily imply that you must use
the keyword TEMPORARY when creating the
table.[6] If you process the
table in multiple phases over the course of several server
connections, you'll need to create a
non-TEMPORARY table, then drop it explicitly when
you're done with it.)


[6] CREATE
TEMPORARY TABLE is discussed in
Recipe 3.25.



The following example illustrates how to use a temporary table to
solve a common problem: loading data into a table when the values do
not have the format required by the table structure. Suppose you have
a table main that contains three columns,
name, date, and
value, where date is a
DATE column requiring values in ISO format
(CCYY-MM-DD). Suppose also that
you're given a datafile
newdata.txt to be imported into the table, but
the contents look like this:



name1   01/01/99    38
name2 12/31/00 40
name3 02/28/01 42
name4 01/02/03 44


Here the dates are in MM/DD/YY format and
must be converted to ISO format to be stored as
DATE values in MySQL. One way to do this would be
to run the file through the cvt_date.pl script
shown earlier in the chapter:



% cvt_date.pl --iformat=us --add-century newdata.txt >tmp


Then you can load the tmp file into the
main table. But this task also can be accomplished
entirely in MySQL with no external utilities by importing the data
into a temporary table and using SQL to perform the reformatting
operations. Here's how:




  1. Create an empty table in which to load the test data. The following
    statements create the table tmp as an empty copy
    of main and add a cdate column
    to hold the dates from the datafile as character strings:

    mysql> CREATE TABLE tmp SELECT * FROM main WHERE 1 < 0;
    mysql> ALTER TABLE tmp ADD cdate CHAR(8);

  2. Load the datafile into the temporary table, storing the date values
    in the cdate column rather than in
    date:

    mysql> LOAD DATA LOCAL INFILE 'newdata.txt' INTO TABLE tmp (name,cdate,value);

  3. Transform the cdate values from
    MM/DD/YY format to
    YY-MM-DD format and store the results in
    the date column:

    mysql> UPDATE tmp
    -> SET date = CONCAT(RIGHT(cdate,2),'-',LEFT(cdate,2),'-',MID(cdate,4,2));

    MySQL will convert the two-digit years to four-digit years
    automatically, so the original MM/DD/YY
    values in the cdate column end up in the
    date column as ISO values in
    CCYY-MM-DD format. The following query
    shows what the original cdate values and the
    transformed date values look like after the
    UPDATE statement has been performed:


    mysql> SELECT cdate, date FROM tmp;
    +----------+------------+
    | cdate | date |
    +----------+------------+
    | 01/01/99 | 1999-01-01 |
    | 12/31/00 | 2000-12-31 |
    | 02/28/01 | 2001-02-28 |
    | 01/02/03 | 2003-01-02 |
    +----------+------------+

  4. Finally, copy the records from tmp to
    main (using the transformed
    date values rather than the original
    cdate values) and drop the temporary table:

    mysql> INSERT INTO main (name, date, value)
    -> SELECT name, date, value FROM tmp;
    mysql> DROP TABLE tmp;


This procedure assumes that MySQL's automatic
conversion of two-digit years to four digits produces the correct
century values. This means that the year part of the values must
correspond to years in the range from 1970 to 2069. If
that's not true, you'd need to
convert the year values some other way. (See Recipe 10.30.)



The procedure also assumes that the cdate values
are always exactly eight characters so that LEFT(
)
, MID( ), and RIGHT(
)
can be used to extract the pieces. If this assumption is
invalid, you'd have to modify the conversion
procedure. One possibility would be to use SUBSTRING_INDEX(
)
to break apart the
strings at the / delimiters:



mysql> UPDATE tmp
-> SET date =
-> CONCAT(SUBSTRING_INDEX(cdate,'/',-1),'-',
-> SUBSTRING_INDEX(cdate,'/',1),'-',
-> SUBSTRING_INDEX(SUBSTRING_INDEX(cdate,'/',2),'/',-1));


Another application for post-import processing is name splitting. If
you import values consisting of a first name, a space, and a last
name into a column full_name, you can reformat the
column into separate first_name and
last_name columns with these statements:



UPDATE tbl_name SET first_name = SUBSTRING_INDEX(full_name,' ',1);
UPDATE tbl_name SET last_name = SUBSTRING_INDEX(full_name,' ',-1);


However, this task can easily become more difficult if any of the
names have middle initials, or trailing words like
Jr. or Sr. If
that's the case, you're probably
better off preprocessing the names prior to import, using a pattern
matching utility that's smarter about breaking full
names into components.










    I l@ve RuBoard



    No comments:

    Post a Comment