Sunday, February 25, 2007

Load data; Fast

It's hardware that makes a machine fast. It's software that makes a fast machine slow --Craig Bruce

One of the most common activity in a data warehouse environment is loading data into the database. The volume of data to be loaded is large compared to a transaction processing system. Tens or hundreds of megabytes of data may be need to be loaded in a data warehouse environment from one file. Compare this to one transaction, which is probably less than a kilobyte, in the transaction processing environment. The large throughput required for data warehouse data loads is achieved by:
  1. using database utilities built specifically for the purpose, as opposed to the common SQL INSERT command; and
  2. aggressive parallelism: task, pipeline, and partition
Most databases provide a utility to bulk load a large dataset. The utility works faster by bypassing the database overhead associated with the SQL engine, transaction management, online backup, and index maintenance. For MySQL, the utility is LOAD DATA INFILE. Let us use the sales transactions data load example shown in the data profiling post to see how this works.

The sales_trans.dat file contains pipe delimited data:
1234|1-Regular|1|0 698615 00307 5|192|...
1234|1-Regular|2|6 57950 83781 3|121|...
...
The load_sales_sierra.sql file contains the LOAD DATA specifications:
LOAD DATA INFILE '/infiles/sierra/sales_trans.dat'
INTO TABLE staging.sales_transaction
FIELDS TERMINATED BY '|'
(trans_no, trans_typ, line_no, sku, prod_cat, prod_sub_cat, prod_typ, prod_sub_type, cust_id, metro_area, store_id, store_zip, register_typ, clerk_id, promo_typ, coupon_cd, placement_cd, unit_amt, price_dollar, line_tot_dollar, tax_dollar, basket_dollar, payment_typ, timestamp);
QUIT
MySQL security sidenote - the user 'dwload' needs the 'FILE' privilege to load data:
mysql> GRANT FILE ON *.* TO dwload;
The data file can now be loaded into the sales_transaction table of the staging database:
unix> mysql -u dwload -p**** < load_sales_sierra.sql
Since we are using the MyISAM engine, we do not have to worry about the transaction management and online backup bogging down the load performance (MyISAM does not support these). We do need to ensure that indexes, that have been created on this table to run the data profiling queries as well as the subsequent ETL processes to populate the data warehouse, are not being updated while data is being loaded. It is ensured that the indexes are updated once in bulk after the load completes, by disabling the indexes before the data load:
mysql> ALTER TABLE sales_transaction DISABLE KEYS;
Indexes are then enabled (created in bulk) once the load completes:
mysql> ALTER TABLE sales_transaction ENABLE KEYS;
Index rebuild is the most time consuming part of the load process. It is helped by ensuring that the system variables: myisam_max_sort_file_size, and myisam_sort_buffer_size are set appropriately.

Now we know how to load large data volumes fast into the database tables, in future posts we will see how the various parallelisms - task, pipeline, and partition - help us in making the load even faster, provided we have enough redundant hardware available.