Sunday, June 10, 2007

Star schema optimizations

We are all in the gutter, but some of us are looking at the stars --Oscar Wilde

Dimensional modeling, explained by Ralph Kimball, provides industry specific techniques to ensure that:
  1. The number of rows in the dimension tables is small. The row size can be large, row size is the sum of bytes stored for all the attributes in a row;
  2. The row size of the fact table is small. The number of rows can be large, use range partitioning for fact tables with very large row counts.
The star schema thus created is easily understood, and easily queried by the data warehouse user into the trend reports. Using the data attributes discussed in the post on data profiling, here is an example star schema:

Transaction detail
(junk dimension)

Transaction detail key

Transaction type
Payment type
Coupon code
Placement code
Update date

50K rows, type 1







Product

Product key

SKU
Category
Sub-category
Color
Style
Effective date
Effective end date

100K rows, type 2

one



one



many
many






Sales Transaction

Calendar key

Transaction detail key
Product key
Sales Org key
Transaction number*
Transaction line*
Dollar retail amount
Dollar discount amount
Dollar tax amount
Cost of goods sold
Insert date

10,000K rows per partition
1 partition = 1 month

* degenerate dimension





many
many


one



one

Sales Organization

Sales Org key

Clerk
Manager
Store
Region
Effective date
Effective end date

50K rows, type 2







Calendar

Calendar key

Calendar Date
Day
Week
Weekend
Holiday
Month
Month_end
Quarter
Year
Update date

10K rows, type 1

The foreign keys on the fact table - Calendar key, Transaction detail key, Product key, and Sales Org key - are the dimension surrogate keys. Since the fact table is 100 times as big as the largest dimension, the dimension keys have a relatively small number of distinct values compared to the volume of rows in the fact table. In other words the dimension keys have a low cardinality on the fact table; there are a relatively small number of values for the dimension keys that repeat again and again on the fact table. The queries to extract data from the star schema use star joins for optimal performance. The star join performs well only when dimension row count can be minimized. While dimensional techniques, such as splitting a dimension into mini dimensions, are very potent in taming the large dimensions, there may be genuine requirements for having large dimensions. The customer-account dimension, especially when made type 2 to deal with legal and compliance needs, puts a damper on any star schema. As the dimension row count increases, the star schema starts to become unmanageable. The table below shows the optimizations used to make the star schema perform to its full potential:

Star join decision tableQuery selectivity;
number of dimension rows returned:
SmallLarge
Dimension
size:
Small,
< 1% of fact row count
Bitmap indexes
(Commercial DBs only)
Dimension cartesian product: not recommended.
Nested fact joins
Aggregates
Medium,
≈ 1% of fact row count
Nested fact joins
Large,
> 1% of fact row count
Are your sure this is needed?
Read "the book", again, before answering yes
Consider non-dimensional model
for the specific dimension and the requirement

Note that snowflakes are not a recommended optimization technique.