Wednesday, February 14, 2007

Data profiling

USA Today has come out with a new survey - apparently, three out of every four people make up 75% of the population --David Letterman

The RockOil Jeans Co introduced in the post on the trend reports, wants to analyze sales trends. The various stores including the company stores as well as the big department stores have agreed to provide the daily sales data from the respective transaction processing systems at the end of each day. "Data profiling" helps us understand this data before the pretty business performance trend reports can be created for the data warehouse user. For data profiling we create quick and dirty reports:
  1. that are a diluted version of the final trend reports that will ultimately be produced; and
  2. with simple SQL aggregation queries, using SUM and GROUP BY, on the data loaded into the database called "staging area"
The sales data arrives in text files from the stores along with significant amount of sales categorization data. The categorization data tells you the promotion, shelf location, discount, store location, time, customer demographics, sales basket, etc. associated with the sale. This categorization data called "dimensional" data, is not standardized or "conformed" as called in data warehousing terminology: each store will have its own way of defining these categories, all the stores will not have all the same categories defined, and there will be errors or gaps in the data. Note that this categorization data is not critical for completing the sales transaction and is needed only to analyze the business performance. More categorization data such as product category, style, color, texture, etc. are already available to us internally, and hopefully more standardized than the store data. Although I will not bet on it.

Some sources will provide data in files that are a straight dump from the transaction processing system that stores data in a normalized structure spread across several tables. Therefore the data arrives in several files. More often than not, the data you get from the source will be denormalized where all the related attributes are bunched together in one file. This is a file that the source organization has been sending already to other organizations to fulfill data integration needs. You will have to reuse what is already in place even though it may not contain everything that you want. Here are the sample layouts:

Denormalized (Store: Sierra)
Normalized (Store: Tundra)
Sales Transaction
  • Transaction number
  • Transaction type
  • Line number
  • SKU
  • Product category
  • Product sub-category
  • Product type
  • Product sub-type
  • Customer id
  • Metropolitan area
  • Store id
  • Store location (zip code)
  • Register type
  • Clerk id
  • Promotion type
  • Coupon code
  • Placement code
  • Unit amount
  • Dollar price
  • Total line dollar amount
  • Tax amount
  • Total basket dollar amount
  • Payment type
  • Timestamp
Sales Header
  • Transaction number
  • Transaction type
  • Customer id (frequent shopper program)
  • Metropolitan area
  • Store id
  • Store location (zip code)
  • Register type
  • Clerk id
Sales Detail
  • Transaction number
  • Line number
  • SKU
  • Product category
  • Product sub-category
  • Product type
  • Product sub-type
  • Coupon code
  • Placement code
  • Unit amount
  • Dollar retail price
  • Dollar discount
  • Dollar amount
  • Tax Amount
Sales Payment
  • Transaction number
  • Total basket amount
  • Payment type
  • Credit card number (encrypted)
  • Timestamp

The source data files are loaded to the staging area that is a component of the overall data warehouse environment. After the load has been completed we have a table in the database with a structure very similar to the denormalized file structure shown above. At this point, profiling data is a simple matter of running the SQL aggregation queries. The various sales amounts provided on the raw data are aggregated by each of the sales categorization attribute, or by a logical group of sales categorization attributes. Here are a couple of reports created from result of these queries:

Report 1: Sales Amounts for the Sierra Store by Transaction Type for Feb 10, 2007
Transaction Type
Unit Amount
Unit Amount * Dollar Price
Tax Amount
Total Line Dollar Amount
1-Regular
15,580
438,200
30,674
445,430
2-Return
779
19,885
1,392
20,200

31
750
53
0
Total
16,390
458,835
32,118
465,630

This report provides some understanding of the data while raising additional questions:
  • There are at least 3 types of transactions. Does the third type represent voided transactions? Even if we never get a good answer, we do know that it is small enough to be ignored for business performance trending purposes
  • The line dollar amount does not add up to [(unit amount * dollar_price) + tax]. Is the dollar price representing the retail price, whereas the total dollar amount is adjusted for discounts? Adding promotion type and coupon code to this report may provide more insight
Report 2: Sales Amounts for the Sierra Store Product Category and Sub Category for Feb 10, 2007
Product Category
Sub Category
Unit Amount
Unit Amount * Dollar Price
Tax Amount
Total Line Dollar Amount
121
Jeans
6,100
325,000
22,750
340,000
121
Belts
1,200
45,000
3,150
35,000
121
Socks
140
2,500
160
2,500
... many more





186
After-shave
5,400
65,000
4,500
69,000
... more





Total

16,390
458,835
32,118
465,630
The questions raised by this report are:
  • Does category 121 represents apparel, and 186 is personal care? We can get a product file from the store, or use our internal product file to confirm this
  • Is the socks sales data accurate? It seems too low
  • Is the belts pricing data accurate? It seems to be too heavily discounted. Adding promotions and internal product pricing data to this report may provide better insight
After the first iteration of reports has been created, we gain better insight into data. We also have several additional questions. We might need to go back to the people in source organization to get answers to many of these questions. We also need active participation and feedback from the data warehouse user on data quality and prioritization of categorization data that is more relevant. Prioritization of categories will provide us with guidance on the important data attributes, and additional details may be needed on some of the categories to make the final trend reports meaningful. Critical data quality issues identified will have to be fixed to make the final trend reports meaningful. Several iterations of these reports with different category groupings and code translations might have to be created. Code translation converts encoded data values into meaningful descriptions for the data warehouse user, for example 121 is apparel in the report shown above.

The more time we spend in this stage, and more involvement we get from the data warehouse user, the more successful the final implementation will be. Note that we used sales revenue to drive the analysis here. Since this is the metric that the user is finally interested in, we stand a good chance of securing interest and participation. But none of the reports are created to analyze trends, and provide category summaries for a single day only. It is possible to create rudimentary trend reports once we have several days of data loaded into the staging area, but it will be cumbersome. We have to complete the data warehouse implementation before we can do a full blown trend analysis. As data warehouse gets implemented, the data becomes available in easily queried "star schema" structures. The data profiling on this data is now simplified and should be continued, to maintain the sanity of the data. The ongoing data profiling aids in further understanding of data.