My first Publication Agile-Data-Warehouse-Design-eBook | Page 154

Modeling Star Schemas 133 Unique Values and Frequency Another vital property of each candidate source is the number of unique values that it contains, and the frequency of each value. The SQL for counting unique values and calculating percentage unique is: Check source columns for uniqueness to identify candidate SELECT COUNT(DISTINCT [SourceColumn]), COUNT(DISTINCT [SourceColumn])/COUNT(*) * 100 FROM [SourceTable] keys, and hierarchy levels A source column with 100% unique values may be a good candidate for a business key while progressively lower percentage uniqueness can suggest that a set of columns represent a viable hierarchy. The SQL for ranking each value in a column by its frequency is: SELECT [SourceColumn], COUNT(*) FROM [SourceTable] GROUP BY [SourceColumn] ORDER BY 2 Source column value frequency can be graphed to help you spot columns that have no informational content in spite of not being Null. For example: Graph source column values by frequency Columns where values are (almost) all the same (equal to the default) Empty or spaces only strings: the logical equivalent of Null Favorite dates for lazy data entry staff such as “1/1/01” to discover poor quality content Data profiling requires full table scans, making some of the queries involved very resource intensive. You should avoid profiling a live operational system directly, because transactional performance can be adversely effected. This is clearly not the ideal first impression that any data warehouse team wants to make on opera- tional support! Instead use snapshots (off-line copies) of the candidate data sources held on your own server or wait until after-hours. Data Ranges and Lengths The third category of simple data profiling tests identify source data ranges by querying the minimum, maximum and average values for numeric columns, the earliest and latest dates for datetime columns, and the shortest and longest strings for character columns. As well as helping you to define data types and set date ranges for the data warehouse these queries help you to spot outliers that often represent errors. If source data is reliably time-stamped, try grouping your data profiling queries by the Month, Quarter, or Year that the data was inserted/updated to see how data quality changes over time. The worst quality issues may be older than the histori- cal scope of the warehouse—if you’re lucky. Query data ranges to help define data types and spot erroneous outlier values