Sunday, June 24, 2007

Nested fact joins

One cannot have everything the way he would like it --Mark Twain

In the absence of other advanced techniques for star join optimization, nested fact join with dimensions while ensuring that the indexes are used for each dimension join delivers reasonable performance.

The following indexes are created on the star schema example shown:
  1. Primary key unique index on surrogate key for each dimension. One index each is created on: calendar (calendar_key), transaction_details (transaction_detail_key), product (product_key), sales_organization (sales_org_key)
  2. One index is created for each foreign key on the fact table: sales_transaction (calendar_key), sales_transaction (transaction_detail_key), sales_transaction (product_key), sales_organization (sales_org_key)
  3. In addition some indexes may be created on the dimension attributes that are frequently queried: transaction_details (transaction_type), product (category), sales_organization (store), etc.
Consider a typical query:
mysql> SELECT transaction_type, product.category, store, sum(dollar_retail_amount - dollar_discount_amount) AS dollar_sales
-----> FROM sales_transaction, calendar, transaction_details, product, sales_organization
-----> WHERE sales_transaction.calendar_key = calendar.calendar_key
-----> AND
sales_transaction.transaction_detail_key = transaction_details.transaction_detail_key
----->
AND sales_transaction.product_key = product.product_key
-----> AND sales_transaction.sales_org_key = sales_organization.sales_org_key
-----> AND calendar.month = 200702
-----> AND transaction_type = '1-Regular'
-----> AND product.category in ('Apparel', 'Personal Care')
-----> AND store in ('Sierra', 'Tundra')
-----> GROUP BY transaction_type, product.category, store;
As long as the optimization rules for a star schema are followed, the explain plan determined by the optimizer will look very similar to this:
.+---+-------------+---------------------+--------+-----------------------------------------------------------------------------------+------------------------+------------------------------------------------+----------------------------------------------+
.|id | select_type | table | type | possible_keys | key | ref | Extra |
.+---+-------------+---------------------+--------+-----------------------------------------------------------------------------------+------------------------+------------------------------------------------+----------------------------------------------+
1| 1 | SIMPLE | transaction_details | range | PRIMARY,trans_type_idx |
trans_type_idx | NULL | Using where; Using temporary; Using filesort |
2| 1 | SIMPLE | sales_transaction | ref | s_trans_cal_key_idx,s_trans_detail_key_idx,s_trans_prod_key_idx,s_trans_s_org_idx |
s_trans_detail_key_idx | dwh.transaction_details.transaction_detail_key | Using where |
3| 1 | SIMPLE | product | eq_ref | PRIMARY,prod_category_idx | PRIMARY | dwh.sales_transaction.product_key | Using where |
4| 1 | SIMPLE | sales_organization | eq_ref | PRIMARY,sales_org_store_idx | PRIMARY | dwh.sales_transaction.sales_org_key | Using where |
5| 1 | SIMPLE | calendar | eq_ref |
PRIMARY | PRIMARY | dwh.sales_transaction.calendar_key | Using where |
.+---+-------------+---------------------+--------+-----------------------------------------------------------------------------------+------------------------+------------------------------------------------+----------------------------------------------+
The following steps are executed in this explain plan:
  1. Filter dimension (transaction_details) rows using the index in the filtered attribute in the where clause
  2. Filter fact rows by joining the fact to the dimension rows filtered in step 1, the fact foreign key index is used for the join
  3. Filter the fact rows further by joining it with the next dimension (product), the dimension primary key index is used for the join
  4. Filter the fact rows further by joining it with the next dimension (sales_organization), the dimension primary key index is used for the join
  5. Filter the fact rows further by joining it with the next dimension (calendar), the dimension primary key index is used for the join
This is as good as it gets in the absence of bitmap indexes. The large fact table is joined iteratively, again and again, with the small dimensions using appropriate indexes to quickly bring down the number of rows to be retrieved for the final result set. Note that the index creation rule described above is very simple and straightforward as no concatenated indexes are required on the fact foreign keys. The optimizer plan works well irrespective of whether some or all the dimensions are used for restricting fact rows.