Data warehousing involves considerable amounts of both accidental and essential complexity. Although it is a satisfying pursuit when it succeeds, data warehousing is also decidedly prone to failure. You need to learn to manage the essential complexity and eliminate the accidental.
Data warehousing deals with accidental complexity resulting from the myriad choices available for implementing the technology solution. This complexity can be, and should be, eliminated. Be careful about believing the claims made with jargon-laced², hyped-up, cool-sounding fad terms, and be realistic about what the technology can achieve. Simple and proven solutions exist for 100 TB data warehouses, and should be leveraged. SMP provides a simple and increasingly cost-effective path for ETL and database scalability. Reporting scalability is easily provided by a cluster of load-balanced servers. Use bulk loads and range partitioning along with task, pipeline, and partition parallelism to tame the ETL processing windows. Effective use of star schemas, bitmap indexes, range partitioning, and aggregations on the database provides good performance and unlimited scalability for resource-intensive queries. Consider extraordinary solutions only for extreme requirements: MPP only for extreme scalability, or multi-dimensional databases only for extreme query performance. Finally, make sure that you are NOT using the data warehouse to fill in deficiencies of the operational source systems; fix the operational systems instead.
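As a rough illustration of how range partitioning and partition parallelism fit together, the Python sketch below splits a small set of fact rows into monthly ranges and processes each range independently. Everything in it is an assumption made for illustration: the sample rows, the function names (`range_partition`, `load_partition`, `run_etl`), and the use of a thread pool as a stand-in for the parallelism that a database or ETL tool would provide. A real load would hand each partition to the database's bulk loader rather than sum it in memory.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Hypothetical fact rows: (sale_date, amount).
ROWS = [
    (date(2024, 1, 5), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2024, 2, 3), 75.0),
    (date(2024, 3, 14), 20.0),
    (date(2024, 3, 30), 5.0),
]


def partition_key(row):
    """Map a row to its range partition: one partition per calendar month."""
    sale_date, _amount = row
    return (sale_date.year, sale_date.month)


def range_partition(rows):
    """Group rows by month so each range can be processed on its own."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)


def load_partition(item):
    """Stand-in for a bulk load of one partition.

    Here it only computes the monthly total, which doubles as a
    pre-computed aggregate for reporting queries.
    """
    key, rows = item
    return key, sum(amount for _date, amount in rows)


def run_etl(rows, workers=4):
    """Partition parallelism: process every monthly range concurrently."""
    partitions = range_partition(rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(load_partition, partitions.items()))


if __name__ == "__main__":
    for month, total in sorted(run_etl(ROWS).items()):
        print(month, total)
```

Because each month is loaded independently, a late-arriving correction touches only one partition, and the processing window shrinks roughly with the number of partitions that can run side by side.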
Data warehousing deals with essential complexity resulting from the tedious task of understanding business data and integrating it across diverse sources, organizations, and business processes to provide business performance reporting over time. The integration is based on rules that need to be agreed upon by people from diverse groups across multiple organizations. This complexity cannot be reduced; it can only be made manageable by building the data warehouse iteratively. Also, make sure that you are not fighting the users by providing them with reporting tools they love to hate. Keep in mind that Excel remains the most popular tool for data manipulation and reporting.
1 The Mythical Man-Month
2 The following terms have been much abused with overloaded definitions to the point where they can be used to simultaneously mean anything, everything, and nothing in particular. Totally like whatever, you know? Listed in alphabetical order, these terms will not be used here:
- Architecture
- Enterprise
- Metadata
- Strategy