Sunday, November 11, 2007

Multi-dimensional vs. aggregate awareness

Ambition is the last refuge of failure --Oscar Wilde

The star schemas along with aggregates support the full scope of the data warehouse, the multi-dimensional database (MDB) can be used to support a limited scope where query performance is paramount:


Star schema aggregate awareness
MDB

Database
Oracle Materialized Views,
DB2 Materialized Query Tables
Essbase
vs.
Performance
Good with aggregates
Extreme
Scalability
Unlimited with bitmap indexes and partitioning
≈ 1 TB

Star schema trying to approach MDB performance

The star schema does not force you to think in terms of hierarchies, but there are implicit hierarchies built into the dimensions. The aggregations done by eliminating attributes in a dimension are actually done on a specific hierarchy path:
Calendar
Product
Year
Quarter
Month
Week
Category
Sub category
Product name

Also, keep in mind that there may be multiple hierarchies in a dimension:
Customer Account
Customer Account
Country
State
MSA
City
Zip
Geocode
Segmentation
Net-worth category
Income category

If you define all possible hierarchies that you are interested in, and you aggregate by eliminating all the possible hierarchy attributes in the star schema one at a time, you end up with a database which looks conceptually similar to an MDB. The advantage is performance. Since all possible aggregates are pre-calculated in an MDB, the use sum function and group by clause that are common in a star schema are not needed. The disadvantage is the scalability. Since all the aggregates are pre-calculated, the size of aggregates explodes quickly as additional dimensions or hierarchies are added to the model.

MDB trying to approach star schema scalability

To overcome the scalability issues, Essbase added a new database engine called ASO (vs. the original BSO), in which not ALL the aggregates have to be pre-calculated. Just like a star schema aggregates, you can choose the hierarchies on which the aggregates should be pre-calculated. The remaining aggregates are calculated at runtime using sum function and group by clause. So you can expect better scalability at the cost of query performance.