The RockOil Jeans Co introduced in the post on the trend reports, wants to analyze sales trends. The various stores including the company stores as well as the big department stores have agreed to provide the daily sales data from the respective transaction processing systems at the end of each day. "Data profiling" helps us understand this data before the pretty business performance trend reports can be created for the data warehouse user. For data profiling we create quick and dirty reports:
- that are a diluted version of the final trend reports that will ultimately be produced; and
- with simple SQL aggregation queries, using SUM and GROUP BY, on the data loaded into the database called "staging area"
Some sources will provide data in files that are a straight dump from the transaction processing system that stores data in a normalized structure spread across several tables. Therefore the data arrives in several files. More often than not, the data you get from the source will be denormalized where all the related attributes are bunched together in one file. This is a file that the source organization has been sending already to other organizations to fulfill data integration needs. You will have to reuse what is already in place even though it may not contain everything that you want. Here are the sample layouts:
| Denormalized (Store: Sierra) | Normalized (Store: Tundra) | |
Sales Transaction
| Sales Header
| |
Sales Detail
| Sales Payment
| |
The source data files are loaded to the staging area that is a component of the overall data warehouse environment. After the load has been completed we have a table in the database with a structure very similar to the denormalized file structure shown above. At this point, profiling data is a simple matter of running the SQL aggregation queries. The various sales amounts provided on the raw data are aggregated by each of the sales categorization attribute, or by a logical group of sales categorization attributes. Here are a couple of reports created from result of these queries:
Report 1: Sales Amounts for the Sierra Store by Transaction Type for Feb 10, 2007
| Transaction Type | Unit Amount | Unit Amount * Dollar Price | Tax Amount | Total Line Dollar Amount |
| 1-Regular | 15,580 | 438,200 | 30,674 | 445,430 |
| 2-Return | 779 | 19,885 | 1,392 | 20,200 |
| 31 | 750 | 53 | 0 | |
| Total | 16,390 | 458,835 | 32,118 | 465,630 |
This report provides some understanding of the data while raising additional questions:
- There are at least 3 types of transactions. Does the third type represent voided transactions? Even if we never get a good answer, we do know that it is small enough to be ignored for business performance trending purposes
- The line dollar amount does not add up to [(unit amount * dollar_price) + tax]. Is the dollar price representing the retail price, whereas the total dollar amount is adjusted for discounts? Adding promotion type and coupon code to this report may provide more insight
Report 2: Sales Amounts for the Sierra Store Product Category and Sub Category for Feb 10, 2007
| Product Category | Sub Category | Unit Amount | Unit Amount * Dollar Price | Tax Amount | Total Line Dollar Amount |
| 121 | Jeans | 6,100 | 325,000 | 22,750 | 340,000 |
| 121 | Belts | 1,200 | 45,000 | 3,150 | 35,000 |
| 121 | Socks | 140 | 2,500 | 160 | 2,500 |
| ... many more | |||||
| 186 | After-shave | 5,400 | 65,000 | 4,500 | 69,000 |
| ... more | |||||
| Total | 16,390 | 458,835 | 32,118 | 465,630 |
- Does category 121 represents apparel, and 186 is personal care? We can get a product file from the store, or use our internal product file to confirm this
- Is the socks sales data accurate? It seems too low
- Is the belts pricing data accurate? It seems to be too heavily discounted. Adding promotions and internal product pricing data to this report may provide better insight
The more time we spend in this stage, and more involvement we get from the data warehouse user, the more successful the final implementation will be. Note that we used sales revenue to drive the analysis here. Since this is the metric that the user is finally interested in, we stand a good chance of securing interest and participation. But none of the reports are created to analyze trends, and provide category summaries for a single day only. It is possible to create rudimentary trend reports once we have several days of data loaded into the staging area, but it will be cumbersome. We have to complete the data warehouse implementation before we can do a full blown trend analysis. As data warehouse gets implemented, the data becomes available in easily queried "star schema" structures. The data profiling on this data is now simplified and should be continued, to maintain the sanity of the data. The ongoing data profiling aids in further understanding of data.