Tuesday, September 4, 2007

What is data warehousing

I don't think its fair that I'm living for something I can't even define --Ani DiFranco

To define data warehousing is to put boundaries around it. To say out loud what it is NOT, in addition to saying what it is. The over-ambitious tool vendors and the over-achieving managers in your company will never get cornered into doing this for the fear that it will put a limit on how many future expansion opportunities they can choose from. But this is important because data warehousing is different from other IT work in several aspects. The challenges faced, both organizational as well as technical are different. And, the solutions are different. The approaches discussed here have been proven to work well, but only when applied appropriately. Before you start, the challenge is to define what you are even going to (not) do. So, what is data warehousing?

Business performance trend reporting

Data warehousing provides business performance reporting for longer term decision making. Longer term decision making is based on trends, the popular ones being sales revenue trends and profitability trends. Historical data is stored to enable long term trends.

Data warehouse copies the data from the operational systems, but data warehousing does NOT provide operational reporting for immediate decision making. For example, data warehousing does not provide a current-to-the-last-minute view of the customer that can be used by the call center rep to service the customer call.

Integrated data

The data warehouse integrates data from diverse operational systems. Multiple operational systems exist for different business processes, for different product lines, for different sales channels, or simply because they were inherited after the merger of the parent companies. The business processes are aligned with the organization structure: marketing, sales, servicing/manufacturing, purchasing, etc. The data warehouse subject areas in turn are aligned to the business processes. The products from all the operational systems are integrated within the subject area to provide a consistent view of all the relevant data elements. Further, the subject areas are linked to one another to provide a seamless view of the customer/product/sales channel across all the business process.

The data warehouse is NOT a collection of data that cannot be trivially pulled into a business performance report. It must be possible to pick a business performance metric, for example sales revenue, and compare it across all the customer/product/sales channel data pulled in from diverse operational systems. And it must be possible to do this without performing any calculations or derivations on this data.

User oriented data model

The data model for data warehouse is exposed to the business users through the reporting tool. The star schema optimizations are implemented to ensure easy understanding by the non-technical users. The star schema also provides good query performance required for the data warehouse.

The Entity-Relational model eliminates redundancy and provides good update performance required for the operational systems. The ER model is NOT recommended for the data warehouse.

Stale data

The data is sourced from operational systems and populated in the data warehouse in a batch mode. There is significant data transformation needed to achieve the required integration and consistency across various products, diverse source systems, and business processes. It is just not cost-effective to perform this function more than once a day. In extreme cases once a week, or even once a month. Always remember, to provide meaningful business performance reporting, it is considerably more important to have an integrated and consistent view of data rather than the most current data.

The data warehouse is NOT real time, and does not contain current-to-the-last-minute view of data. This is a good test to ensure that data warehouse is being used for business performance reporting and NOT for operational reporting. The operational systems should be used for operational reporting.

Exceptions

While you may end up storing some real time data and generating some operational reports for political expediency, make sure you understand what the real goal of your data warehouse is. And make sure that this diversion does not take over the real goal. At a minimum make sure that you are using appropriate approaches required to deal with the requirements that you are trying to satisfy.