One of the most common activity in a data warehouse environment is loading data into the database. The volume of data to be loaded is large compared to a transaction processing system. Tens or hundreds of megabytes of data may be need to be loaded in a data warehouse environment from one file. Compare this to one transaction, which is probably less than a kilobyte, in the transaction processing environment. The large throughput required for data warehouse data loads is achieved by:
- using database utilities built specifically for the purpose, as opposed to the common SQL INSERT command; and
- aggressive parallelism: task, pipeline, and partition
The sales_trans.dat file contains pipe delimited data:
1234|1-Regular|1|0 698615 00307 5|192|...The load_sales_sierra.sql file contains the LOAD DATA specifications:
1234|1-Regular|2|6 57950 83781 3|121|...
...
LOAD DATA INFILE '/infiles/sierra/sales_trans.dat'MySQL security sidenote - the user 'dwload' needs the 'FILE' privilege to load data:
INTO TABLE staging.sales_transaction
FIELDS TERMINATED BY '|'
(trans_no, trans_typ, line_no, sku, prod_cat, prod_sub_cat, prod_typ, prod_sub_type, cust_id, metro_area, store_id, store_zip, register_typ, clerk_id, promo_typ, coupon_cd, placement_cd, unit_amt, price_dollar, line_tot_dollar, tax_dollar, basket_dollar, payment_typ, timestamp);
QUIT
mysql> GRANT FILE ON *.* TO dwload;The data file can now be loaded into the sales_transaction table of the staging database:
unix> mysql -u dwload -p**** < load_sales_sierra.sqlSince we are using the MyISAM engine, we do not have to worry about the transaction management and online backup bogging down the load performance (MyISAM does not support these). We do need to ensure that indexes, that have been created on this table to run the data profiling queries as well as the subsequent ETL processes to populate the data warehouse, are not being updated while data is being loaded. It is ensured that the indexes are updated once in bulk after the load completes, by disabling the indexes before the data load:
mysql> ALTER TABLE sales_transaction DISABLE KEYS;Indexes are then enabled (created in bulk) once the load completes:
mysql> ALTER TABLE sales_transaction ENABLE KEYS;Index rebuild is the most time consuming part of the load process. It is helped by ensuring that the system variables: myisam_max_sort_file_size, and myisam_sort_buffer_size are set appropriately.
Now we know how to load large data volumes fast into the database tables, in future posts we will see how the various parallelisms - task, pipeline, and partition - help us in making the load even faster, provided we have enough redundant hardware available.