Tuesday, February 26, 2008

Popularity contest

The only thing worse than being talked about is not being talked about --Oscar Wilde

The table below shows the popularity of the data warehousing tools as determined by the number of the Google search results. There are only a few surprises, and you can easily arrive at your own conclusions. Just keep in mind that popular does not necessarily translate into good1. Also shown as a treemap.

Search phraseResult count
"SQL Server" BI2,200,000|||||||||||||||||||||||||||
Excel BI771,000|||||||||
Oracle BI403,000||||
"Business Objects" BI375,000||||
SAS BI343,000||||
Cognos BI265,000|||
Teradata BI225,000||
Hyperion BI87,600|
Microstrategy BI68,100
MySQL Database12,500,000||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Oracle Database10,400,000||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
"SQL Server" Database3,590,000||||||||||||||||||||||||||||||||||||||||||||
DB2 Database611,000|||||||
Teradata Database73,200
Oracle ETL3,130,000||||||||||||||||||||||||||||||||||||||
"SQL Server" ETL1,560,000|||||||||||||||||||
Teradata ETL178,000||
SAS ETL165,000||
Informatica ETL84,600|
"Ab Initio" ETL53,300
Datastage ETL27,000


1 For that matter even truth does not necessarily translate into good, but that is a whole another story.

Thursday, December 27, 2007

Shattered stars

The limits of my language means the limits of my world --Ludwig Wittgenstein

A properly designed star schema, along with appropriate performance optimizations, provides good performance and limitless scalability for the largest of the data warehouses. The main criteria for a properly designed star schema is that the dimension row counts are much smaller relative to the fact row counts. But there are dimensions, such as the customer dimension, that can be very large. These are called the monster dimensions, and stretch the star schema to the breaking point. These dimensions must be handled with caution, the bothersome data should be taken outside the data warehouse where appropriate.

The customer dimension can easily exceed 100 million rows for a large business. The situation is worse when the customer attributes need to be tracked at the account level, and the customer dimension morphs into the customer-account dimension. Further the changes to the dimension may need to be tracked over time, resulting in type-2 rows being added at regular intervals.

There are several techniques to deal with monster dimensions. You start by re-arranging the dimension attributes into several mini dimensions. Thus you get the demographics dimension, the customer-account status dimension, etc., in addition to the core customer-account dimension that has the basic customer information such as the name, address, etc. Then you convince the users that change tracking is not needed on the core customer-account dimension for the trend reporting and the longer term decision making. Thus the type-2 change rows are eliminated altogether, the changed type-1 dimension attributes are simply overwritten and historical changes are lost. Even without the type-2 rows, the core customer-account dimension is still quite big. You might be able to convince the users that the core customer-account attributes are not needed at all, and eliminate it altogether. If this approach satisfies the needs of your business, congratulations! But, the users may need access to and may want to track the changes to the core customer-account attributes as well.

At this point you need to clearly identify, why this information is needed and delineate the trend reporting requirements from the operational as well as statistical data-mining requirements. If the requirement is to track the customer-account changes to respond to the customer inquiries, it is really an operational requirement and, an operational application using normalized data structures should be built to support it. If the requirement is to track address changes for fraud detection, it is a data mining requirement and, a flat data structure should be built to support it. Either way, stuffing this data forcibly into a star schema will not yield satisfactory results.

Wednesday, December 26, 2007

Range partitioning large database tables

Divide and Conquer --Gaius Julius Caesar

Table partitioning in data warehouses is used to divide a large database table into smaller physical tables called partitions1 that are easier to manage. The partitions still retain a single logical view for the user query execution. Range partitioning by date facilitates historical ETL processing and trend reporting:
  1. ETL: Only current partition is impacted for load and index rebuild, the downtime is minimized as new data is rolled in while old data is rolled out
  2. Querying and Reporting: Only relevant partitions are read based on the query's where clause, there is a significant improvement in response times
ETL

In a properly designed star schema, the fact table is the only large table. The daily load of the fact table is a resource intensive process and can take several hours. Rebuilding the indexes (after the load) on a large table can sometimes takes longer than the load itself. Range partitioning provides and easy way out of this bind. In the star schema example, the sales transaction fact is partitioned by months:
Oracle> CREATE TABLE sales_transaction (
calendar_key NUMBER(8),
transaction_detail_key NUMBER(5),
product_key NUMBER(6),
sales_org_key NUMBER(5),
transaction_number VARCHAR2(20),
transaction_line NUMBER(4),
dollar_retail_amount NUMBER(10,2),
dollar_discount_amount NUMBER(10,2),
dollar_tax_amount NUMBER(10,2),
cost_of_goods_sold NUMBER(10,2),
insert_date
DATE
) NOLOGGING PARTITION BY RANGE(calendar_key) (
PARTITION sales_jan2007 VALUES LESS THAN(20070201),
PARTITION sales_feb2007 VALUES LESS THAN(20070301),
PARTITION sales_mar2007 VALUES LESS THAN(20070401),
PARTITION sales_apr2007 VALUES LESS THAN(20070501),
PARTITION sales_may2007 VALUES LESS THAN(20070601),
PARTITION sales_jun2007 VALUES LESS THAN(20070701)
);
A few months are shown here, in reality the months span several years. The indexes created on this table are also partitioned by month, as signified by the LOCAL clause:
Oracle> CREATE BITMAP INDEX sales_trans_calendar_key_bix ON sales_transaction (calendar_key) NOLOGGING LOCAL;
Oracle> CREATE BITMAP INDEX sales_trans_trans_det_key_bix ON sales_transaction (transaction_detail_key) NOLOGGING LOCAL;
Oracle> CREATE BITMAP INDEX sales_trans_product_key_bix ON sales_transaction (product_key) NOLOGGING LOCAL;
Oracle> CREATE BITMAP INDEX sales_trans_sales_org_key_bix ON sales_transaction (sales_org_key) NOLOGGING LOCAL;
Duration, several hours: To load the data for the month of July 2007, first create a new table. The data load for the new month is then done to this independent table. After the load is complete, the indexes can be created (The LOCAL clause is not specified as this is not a partitioned table):
Oracle> CREATE TABLE sales_transaction_load (
calendar_key NUMBER(8),
transaction_detail_key NUMBER(5),
product_key NUMBER(6),
sales_org_key NUMBER(5),
transaction_number VARCHAR2(20),
transaction_line NUMBER(4),
dollar_retail_amount NUMBER(10,2),
dollar_discount_amount NUMBER(10,2),
dollar_tax_amount NUMBER(10,2),
cost_of_goods_sold NUMBER(10,2),
insert_date
DATE
) NOLOGGING;

Oracle> CREATE BITMAP INDEX sales_trans_load_cal_key_bix ON sales_transaction_load (calendar_key) NOLOGGING;
Oracle> CREATE BITMAP INDEX sales_trans_load_t_det_key_bix ON sales_transaction_load (transaction_detail_key) NOLOGGING;
Oracle> CREATE BITMAP INDEX sales_trans_load_prod_key_bix ON sales_transaction_load (product_key) NOLOGGING;
Oracle> CREATE BITMAP INDEX sales_trans_load_s_org_key_bix ON sales_transaction_load (sales_org_key) NOLOGGING;
Duration: sub-second: The loaded data and pre-built indexes are now appended to the main partitioned table:
Oracle> ALTER TABLE sales_transaction ADD PARTITION sales_jul2007 VALUES LESS THAN (20070801);

Oracle> ALTER TABLE sales_transaction EXCHANGE PARTITION
sales_jul2007
WITH TABLE sales_transaction_load INCLUDING INDEXES WITHOUT VALIDATION;
Duration, sub-second: When a new month added, an old month probably needs to be deleted as well. The data in the old month partition is swapped into an empty independent table for archival processing, and then deleted from the main table:
Oracle> CREATE TABLE sales_transaction_archive (
calendar_key NUMBER(8),
transaction_detail_key NUMBER(5),
product_key NUMBER(6),
sales_org_key NUMBER(5),
transaction_number VARCHAR2(20),
transaction_line NUMBER(4),
dollar_retail_amount NUMBER(10,2),
dollar_discount_amount NUMBER(10,2),
dollar_tax_amount NUMBER(10,2),
cost_of_goods_sold NUMBER(10,2),
insert_date
DATE
) NOLOGGING;

Oracle> CREATE BITMAP INDEX sales_trans_arch_cal_key_bix ON sales_transaction_archive (calendar_key) NOLOGGING;
Oracle> CREATE BITMAP INDEX sales_trans_arch_t_det_key_bix ON sales_transaction_archive (transaction_detail_key) NOLOGGING;
Oracle> CREATE BITMAP INDEX sales_trans_arch_prod_key_bix ON sales_transaction_archive (product_key) NOLOGGING;
Oracle> CREATE BITMAP INDEX sales_trans_arch_s_org_key_bix ON sales_transaction_archive (sales_org_key) NOLOGGING;

Oracle> ALTER TABLE sales_transaction EXCHANGE PARTITION sales_jan2007
WITH TABLE sales_transaction_archive
INCLUDING INDEXES WITHOUT VALIDATION;
Oracle> ALTER TABLE sales_transaction DROP PARTITION sales_jan2007;

Duration, several hours: The sales_transaction_archive table can now be archived or backed-up to offline storage.

As shown above, the resource and time intensive load, index build, and archive processing is all done on smaller tables. The addition or deletion of data on the partitioned table is an extremely fast process with negligible downtime for the main table that is being queried by the users.

Note that constraints such as primary keys and associated non-partitioned indexes (GLOBAL indexes in Oracle-speak) are not used in data warehousing. All the constraint checks are performed by the ETL process. Also, database logging (known as hot backup or online backup) is not required for ETL batch processing as signified by the NOLOGGING clause in the DDL statements shown above.

Querying and Reporting

Consider a typical query on the star schema:
Oracle> SELECT SUM(dollar_retail_amount), SUM(cost_of_goods_sold)
FROM sales_transaction, calendar, transaction_detail
WHERE sales_transaction.calendar_key = calendar.calendar_key
AND sales_transaction.transaction_detail_key = transaction_detail.transaction_detail_key
AND calendar.quarter = '2007Q2'
AND transaction_detail.transaction_type = '1-Regular';
The sales_transaction fact table has 10 million rows per month and half a billion rows over 4 years of history. This query needs to read data only for 3 partitions related to the second quarter of 2007. The explain plan2 shows the partition pruning on the fact table, as seen by the Pstart and Pstop columns, to indicate that only a fraction of the overall table and index data is being accessed to return the results of the query:
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Pstart| Pstop |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | SORT AGGREGATE | | | |
| 2 | HASH JOIN | | | |
| 3 | TABLE ACCESS FULL | CALENDAR | | |
| 4 | HASH JOIN | | | |
| 5 | TABLE ACCESS FULL | TRANSACTION_DETAIL | | |
| 6 | PARTITION RANGE SUBQUERY | |KEY(SQ)|KEY(SQ)|
| 7 | TABLE ACCESS BY LOCAL INDEX ROWID| SALES_TRANSACTION |KEY(SQ)|KEY(SQ)|
| 8 | BITMAP CONVERSION TO ROWIDS | | | |
| 9 | BITMAP AND | | | |
| 10 | BITMAP MERGE | | | |
| 11 | BITMAP KEY ITERATION | | | |
| 12 | BUFFER SORT | | | |
| 13 | TABLE ACCESS FULL | TRANSACTION_DETAIL | | |
| 14 | BITMAP INDEX RANGE SCAN | SALES_TRANS_TRANS_DET_KEY_BIX |KEY(SQ)|KEY(SQ)|
| 15 | BITMAP MERGE | | | |
| 16 | BITMAP KEY ITERATION | | | |
| 17 | BUFFER SORT | | | |
| 18 | TABLE ACCESS FULL | CALENDAR | | |
| 19 | BITMAP INDEX RANGE SCAN | SALES_TRANS_CALENDAR_KEY_BIX |KEY(SQ)|KEY(SQ)|
----------------------------------------------------------------------------------------------
Note that the query above did not refer to the partition key (sales_transaction.calendar_key) directly:
WHERE sales_transaction.calendar_key between (20070401 and 20070630)
...but through the join with the calendar dimension. The database optimizer, is therefore, only able to perform partion pruning at runtime (as opposed to query parse time). This is called dynamic partition pruning, and is required to support the typical data warehouse trend queries.


1 In Oracle, a partition is a table partition. To confuse the terminology, DB2 has traditionally used the term partition to define the hash partitioned node of an MPP cluster. Now with newer releases, DB2 also has table level range partitions in addition to node level hash partitions.

2 You also need to know the use of bitmap indexes in a star transformation to fully understand this explain plan.

Sunday, November 11, 2007

Multi-dimensional vs. aggregate awareness

Ambition is the last refuge of failure --Oscar Wilde

The star schemas along with aggregates support the full scope of the data warehouse, the multi-dimensional database (MDB) can be used to support a limited scope where query performance is paramount:


Star schema aggregate awareness
MDB

Database
Oracle Materialized Views,
DB2 Materialized Query Tables
Essbase
vs.
Performance
Good with aggregates
Extreme
Scalability
Unlimited with bitmap indexes and partitioning
≈ 1 TB

Star schema trying to approach MDB performance

The star schema does not force you to think in terms of hierarchies, but there are implicit hierarchies built into the dimensions. The aggregations done by eliminating attributes in a dimension are actually done on a specific hierarchy path:
Calendar
Product
Year
Quarter
Month
Week
Category
Sub category
Product name

Also, keep in mind that there may be multiple hierarchies in a dimension:
Customer Account
Customer Account
Country
State
MSA
City
Zip
Geocode
Segmentation
Net-worth category
Income category

If you define all possible hierarchies that you are interested in, and you aggregate by eliminating all the possible hierarchy attributes in the star schema one at a time, you end up with a database which looks conceptually similar to an MDB. The advantage is performance. Since all possible aggregates are pre-calculated in an MDB, the use sum function and group by clause that are common in a star schema are not needed. The disadvantage is the scalability. Since all the aggregates are pre-calculated, the size of aggregates explodes quickly as additional dimensions or hierarchies are added to the model.

MDB trying to approach star schema scalability

To overcome the scalability issues, Essbase added a new database engine called ASO (vs. the original BSO), in which not ALL the aggregates have to be pre-calculated. Just like a star schema aggregates, you can choose the hierarchies on which the aggregates should be pre-calculated. The remaining aggregates are calculated at runtime using sum function and group by clause. So you can expect better scalability at the cost of query performance.

Aggregating transactions

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail --Abraham Maslow

We looked at the significant performance gains achieved by aggregating the balances fact. The case for aggregations is not as clear-cut with the transactions fact.

Sparse transactions

The key difference between balances and transactions for aggregations is that the transactions are sparse.

With balances, you need to maintain a balance for every customer, every product, every day irrespective of whether the balance changes or not. Thus, aggregation on balances always results in significant and known reduction in data volume.

You will NOT have a transaction for every customer, every product, every day. The transaction volume is small relative to balances:
30 million Accounts
90 million Balances per day
2 million transactions per day

Aggregating transactions does not help much

Let us continue with the in the transactions fact example:

Account (customer)
30 million records





Product

\

/

Account customer
demographics
mini dimension
-
Summarized Transactions
2 million records per day
-
Organization
(branch,
financial adviser)

/

\

Account status
mini dimension





Calendar

When the transactions are aggregated by eliminating the customer dimension, the chances are much higher that there will be a transaction for every day, every product, every organization unit, and every account status, etc. The volume of aggregated transactions fact table, therefore, ends up being very similar to the aggregated balances table that was derived after eliminating the customer dimension:

Account customer
demographics
mini dimension





Product

\

/



Aggregated Transactions
1 million records per day
-
Organization
(branch,
financial adviser)

/

\

Account status
mini dimension





Calendar

You do not achieve the kind of performance improvement here that was achieved for the aggregated balances fact. The dimensionality of the balances and transactions being same at this point, they can be stored together in the same table. All the types of aggregations including the reduction in time periods stored can now be applied to this fact.

Monday, October 29, 2007

Transactions vs. Balances

Relationship between balance sheet and profit and loss account: The profit and loss (P&L) account summarises a business' trading transactions... The balance sheet, by comparison, provides a financial snapshot at a given moment --businesslink.gov.uk

The data warehouse strives to integrate the information across diverse business processes, product lines, and sales channels. The data warehouse does not re-invent the wheel, the calculations already performed by the operational systems are never repeated at the data warehouse.

The majority of facts in a star schema fall into 1 of the 3 categories:
  1. Transactions: retail sales transactions, financial services transactions - interest payment, loan disbursement, loan payment, etc.
  2. Balances: that are impacted by the transactions, product inventory, customer account balances, etc.
  3. Summarized impact of transactions on balances over a period of time: total sales of a product for the day or week, total interest payment to a customer for the month, etc.
The transactions are stored on the operational system where they originate. The data warehouse sources the transactions from the operational system.

The balances can be derived from the transactions, but are also calculated and stored on the operational system. The data warehouse sources the balances from the operational system.

The summarized impact of transactions on balances can be derived from the transactions as well. The summarized transactions are derived by eliminating the transaction detail dimension. The various transaction types are grouped together into a limited set, and helps in general understanding of data. The detailed transaction amounts:
Transaction detail.transaction type
Transactions.amount
Investment interest
100
Margin interest
50
Passive interest
20
etc.
...

are converted to summarized amounts limited by a set of categories:
Aggregated Transactions.interest amount
Aggregated Transactions.principal amount
etc.
170
250
...

Once the transaction detail dimension has been eliminated from the star:
Account (customer)
30 million records





Product

\

/

Account customer
demographics
mini dimension
-
Transactions
3 million records per day
-
Organization
(branch,
financial adviser)

/
|
\

Account status
mini dimension


Transaction detail


Calendar

the dimensions on the transactions fact are same as that on the balances:

Account (customer)
30 million records





Product

\

/

Account customer
demographics
mini dimension
-
Summarized Transactions
2 million records per day
-
Organization
(branch,
financial adviser)

/

\

Account status
mini dimension





Calendar

It is now possible to keep the balances and summarized transactions on the same fact table to make the subsequent aggregations consistent across balances and transactions.

In some cases, for example retails sales, the calculations for summarizing transactions into balance impact are simple and the operational system does not store the calculated results. In other cases, for example financial services, the calculations are quite complex and the operational system stores the calculated results along with the transactions themselves. The data warehouse sources the summarized transaction amounts from the operational system where they are available.

Sunday, September 16, 2007

Aggregate awareness

The glut of information was dulling awareness, not aiding it --Jerry Mander

The aggregate tables enhance reporting performance1 by reducing the amount of data that needs to be queried for widely used reports. Aggregate awareness at the database or the query tool makes the use of aggregate tables transparent to the user of the data warehouse.

Consider the example star schema for a financial services company as a simple case for aggregations:

Account (customer)
30 million records





Product

\

/

Account customer
demographics
mini dimension
-
Balances
90 million records per day!
-
Organization
(branch,
financial adviser)

/

\

Account status
mini dimension





Calendar

The balance fact is huge: assuming that there are an average of 3 financial instruments per account, you have 90 million balance records inserted everyday for the 30 million accounts!

The balances may be checking account balances, investment instrument positions, or the loan principal amounts depending on the business you are involved in.

Limiting the balance fact volume by reducing the number of time periods stored

Before you even consider aggregations, consider changing the granularity of the balance fact from daily to weekly / monthly / quarterly / yearly for historical data. For trend reporting done at the data warehouse, weekly or monthly snapshots for historical trends should be sufficient in most situations.

For example, only month-end balance amounts may be stored for data older than 2 months. This query provides the required month end fact balances for the new fact:
SQL> SELECT demo_id, status_id, product_id, org_id, balances.calendar_id, balance_amount,
SQL> FROM balances, calendar

SQL> WHERE balances.calendar_id = calendar.calendar_id
SQL> AND calendar.month_end = 'Y'
;
This query shows the fact records that are preserved for historical data. Effectively, you delete the records not returned by this query. The actual implementation will probably insert the records returned by the query into a separate partition and drop the original partitions with daily data, in the range partitioned fact table.

Aggregations on the balance fact by eliminating the customer dimension

Keeping the balances at the customer level adds significant volume to the fact. A large number of queries that do not contain any customer dimension attributes will benefit if a new fact is created without the customer dimension. This new aggregated fact is derived from the original detailed fact shown above by running the aggregation query:
SQL> SELECT demo_id, status_id, product_id, org_id, calendar_id, sum(amount)
SQL> FROM balances

SQL> GROUP BY
demo_id, status_id, product_id, org_id, calendar_id;

Account customer
demographics
mini dimension





Product

\

/



Aggregated Balances
1 million records per day
-
Organization
(branch,
financial adviser)

/

\

Account status
mini dimension





Calendar

Note that the account-customer dimension has been eliminated. All the other dimensions of the detailed star are used in this aggregated star schema.

Other aggregations on the balance fact

The change in time period granularity and elimination of the customer fact provides the biggest performance benefit, and will be sufficient in most situations. If required, several other aggregated facts can be created by eliminating other dimensions, or more than one dimensions simultaneously. Aggregation through elimination of relevant dimensions is recommended for it is easy to implement and understand. It will suffice when larger dimensions were appropriately broken into mini dimensions in the first place.

Do NOT create aggregations by eliminating the dimensional attributes (creating new dimensions)

It is also possible to create aggregated facts by eliminating some attributes from the dimensions rather than the complete dimensions. For example, detailed product attributes may be eliminated from the product dimension leaving only the higher level product categories. Similarly, detailed branch and financial adviser attributes may be eliminated from organization dimension leaving only the higher level regional attributes. This will require creation of new dimensions with only the new limited set of attributes. The new aggregated fact will now be linked to these new dimensions. All the queries that do not use the detailed product or organization attributes can now be executed against this new aggregated fact for better performance.

Account customer
demographics
mini dimension





Product category

\

/





More Aggregated Balances
0.2 million records per day
-
Organization
regions

/

\

Account status
mini dimension





Calendar

Note that this aggregated star uses a new balance fact as well as new dimensions created specifically to support this level of aggregation. While this will certainly improve performance for some queries, the cost of aggregate maintenance has to be balanced against the marginal benefit of performance improvement. This is not recommended as a general approach to aggregations. There might be a genuine need for extreme query performance that can be achieved by this type of aggressive aggregation, and appropriate products such as the multidimensional databases should be considered in that situation instead of implementing this requirement as aggregations on the relational star schemas.

Do create aggregations by eliminating the dimensional attributes

There is one situation, already discussed above where this approach is recommended. The reduction in fact volume by reducing the number of time periods stored is actually a special case of this aggregation approach.

When attributes from a dimension such as the product are removed, you need to create a new dimension. For example consider the original product dimension:
Product Key
Product Name
Class
Capitalization
Domestic
Style
...
1
Great Index Equity
Large
Y
Core
...
2
Strong Value
EquityLargeYValue...
3
Tremendous Value
EquityLargeYValue...
4
Grand Value
EquityLargeYValue...
5
Mungo Small
EquitySmall
Y
Value
...
6
Mini small
Equity
Small
Y
Value
...
7
Super Value
EquityLarge
Y
Value
...
8
Global Hero
EquityLarge
N
Value
...
9
International Intrigue
EquitySmall
N
Growth
...
...
...
...
...
...
...
...

A new dimension has to be created when some attributes, for example the individual product level details, are eliminated:
Product Cat Key
Class
Capitalization
Domestic
Style
...
1
Equity
Large
Y
Core
...
2
Equity
Large
Y
Value
...
2
Equity Small
Y
Value
...
4
Equity Large
N
Value
...
5
Equity Small
N
Growth
...
...
...
...
...
...
...

Compare this to the calendar dimension where the (shaded) month-end record is used to represent the month:
Calendar Key
Date
Day
Month
Month End
Year
...
20071126
26-NOV-2007
Monday
11
N
2007
...
20071127
27-NOV-2007 Tuesday
11
N
2007...
20071128
28-NOV-2007 Wednesday
11
N
2007...
20071129
29-NOV-2007 Thursday
11
N
2007...
20071130
30-NOV-2007 Friday
11
Y
2007...
20071201
01-DEC-2007 Saturday
12
N
2007...
...
...
...
...
...
...
...

The new month-level calendar dimension is not created, the usage of month-end records to represent the month is equivalent to creating a new calendar dimension:
This
dimension
is not

required
and not
created
Calendar Month Key
Month
Year
...
1
9
2007
...
2
10
2007
...
3
11
2007
...
4
12
2007
...
...
...
...
...

Since the original calendar dimension is used, the same fact can contain both daily balances for recent history and only weekly or monthly balances for older history. Mixing of different grains in this fashion is not possible with other dimensions. (Addition of a new attribute, 'Aggregate Level', to the dimensions do make it possible to mix the different dimension and fact aggregations in the same tables of the star. This is not recommended, and not discussed here).

Aggregate awareness

This query will run much faster if the aggregated fact, shown above, created by eliminating the customer dimension is used instead of the detailed fact used here:
SQL> SELECT organization.branch, sum(balance)
SQL> FROM balances, organization, calendar

SQL> WHERE balances.calendar_id = calendar.calendar_id
SQL> AND balances.org_id = organization.org_id
SQL> AND calendar.month = 11
SQL> AND calendar.year =
2007;
Aggregation awareness is the ability of the database or the reporting tool to do this substitution automatically:
SQL> SELECT organization.branch, sum(balance)
SQL> FROM aggregated_balances, organization, calendar

SQL> WHERE aggregated_balances.calendar_id = calendar.calendar_id
SQL> AND aggregated_balances.org_id = organization.org_id
SQL> AND calendar.month = 11
SQL> AND calendar.year =
2007;
The query user does not know anything about all the aggregated tables that have been created. The query user writes the first query and the second query is executed in the background with an order of magnitude improvement in the performance.


1 While aggregations are very effective for balance facts, that may not be the case for transactions.