Sunday, June 24, 2007

Cartesian product of dimensions

I am indeed amazed when I consider how weak my mind is and how prone to error --Rene Descartes

This is not a recommended approach; only explained here so that you can identify it when it presents itself and avoid it.

The following indexes are created on the star schema example shown before:
  1. Concatenated index is created on the fact table: sales_transaction (transaction_detail_key, product_key, sales_org_key, calendar_key) - note the difference, compared to the nested join approach where multiple indexes on each foreign key were created
  2. In addition some indexes may be created on the dimension attributes that are frequently queried: transaction_details (transaction_type), product (SKU), sales_organization (clerk), etc.
The nested fact join approach joins the large fact table again and again with the small dimension. Even though the result of each join results in smaller and smaller data set for subsequent joins, the large fact table does get operated on several times. One way to avoid multiple large fact table operations is to first get all possible values for all the dimensions in the query (Cartesian product), and then join the result only once to the fact table:
mysql> SELECT transaction_type, category, store, sum(dollar_retail_amount - dollar_discount_amount) AS dollar_sales
-----> FROM sales_transaction fact,
-----> (SELECT calendar_key, transaction_detail_key, product_key, sales_org_key,
-----> transaction_type, category, store
-----> FROM calendar, transaction_details, product, sales_organization
-----> WHERE calendar.date = '2007-02-01'
-----> AND transaction_type = '1-Regular'
-----> AND SKU in ('
0698615003075', '657950837813')
-----> AND clerk in ('123', '456')) dim
-----> WHERE fact.calendar_key = dim.calendar_key
-----> AND fact.transaction_detail_key = dim.transaction_detail_key
-----> AND fact.product_key = dim.product_key
-----> AND fact.sales_org_key = dim.sales_org_key;
Note the absence of any join condition in the sub-query that returns dimension data for joining with the large fact table. This is what results in the Cartesian product. The explain plan determined by the optimizer will look very similar to this:
+----+-------------+---------------------+-------+------------------------+------------------------+----------------------------------------------------------------------------------+---------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+----+-------------+---------------------+-------+------------------------+------------------------+----------------------------------------------------------------------------------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL |
NULL | NULL | Using temporary; Using filesort |
| 1 | PRIMARY | fact | ref | sales_transaction_idx |
sales_transaction_idx | dim.transaction_detail_key, dim.product_key, dim.sales_org_key, dim.calendar_key | Using where |
| 2 | DERIVED | calendar | ALL | NULL | NULL | NULL | Using where |
| 2 | DERIVED | transaction_details | range | transaction_detail_key | transaction_detail_key | NULL | Using where |
| 2 | DERIVED | product | range | product_key | product_key | NULL | Using where |
| 2 | DERIVED | sales_organization | range | sales_org_key |
sales_org_key | NULL | Using where |
+----+-------------+---------------------+-------+------------------------+------------------------+----------------------------------------------------------------------------------+---------------------------------+
This optimizer plan will work well only when the number of rows returned by each dimension is very, very small. The number of rows returned by the Cartesian product increases rapidly as the number of rows returned by each dimension increases. If each of the four dimensions in this query returns 100 rows, the Cartesian product will contain 100 million rows and, this approach becomes very, very bad.

Also, this approach requires concatenated index that will not work when the required dimension at the leading edge of the index is not used to constrain the query. To avoid this scenario, multiple concatenated indexes will have to be created resulting in index maintenance problems as well.

Before Oracle had gained enough bitmap index experience, this approach was supported with use of the STAR hint in the SQL queries and without resorting to use of sub-query as shown above. This is no longer recommended by Oracle. DB2 will also perform this optimization when required indexes are present, and the number of rows returned by each dimension are small enough, to make it worthwhile. All in all, this approach is generally avoided.