My first Publication Agile-Data-Warehouse-Design-eBook | Page 154
Modeling Star Schemas
133
Unique Values and Frequency
Another vital property of each candidate source is the number of unique values
that it contains, and the frequency of each value. The SQL for counting unique
values and calculating percentage unique is:
Check source
columns for
uniqueness to
identify candidate
SELECT COUNT(DISTINCT [SourceColumn]), COUNT(DISTINCT
[SourceColumn])/COUNT(*) * 100 FROM [SourceTable]
keys, and hierarchy
levels
A source column with 100% unique values may be a good candidate for a business
key while progressively lower percentage uniqueness can suggest that a set of
columns represent a viable hierarchy. The SQL for ranking each value in a column
by its frequency is:
SELECT [SourceColumn], COUNT(*) FROM [SourceTable]
GROUP BY [SourceColumn] ORDER BY 2
Source column value frequency can be graphed to help you spot columns that have
no informational content in spite of not being Null. For example:
Graph source
column values
by frequency
Columns where values are (almost) all the same (equal to the default)
Empty or spaces only strings: the logical equivalent of Null
Favorite dates for lazy data entry staff such as “1/1/01”
to discover poor
quality content
Data profiling requires full table scans, making some of the queries involved very
resource intensive. You should avoid profiling a live operational system directly,
because transactional performance can be adversely effected. This is clearly not
the ideal first impression that any data warehouse team wants to make on opera-
tional support! Instead use snapshots (off-line copies) of the candidate data
sources held on your own server or wait until after-hours.
Data Ranges and Lengths
The third category of simple data profiling tests identify source data ranges by
querying the minimum, maximum and average values for numeric columns, the
earliest and latest dates for datetime columns, and the shortest and longest strings
for character columns. As well as helping you to define data types and set date
ranges for the data warehouse these queries help you to spot outliers that often
represent errors.
If source data is reliably time-stamped, try grouping your data profiling queries by
the Month, Quarter, or Year that the data was inserted/updated to see how data
quality changes over time. The worst quality issues may be older than the histori-
cal scope of the warehouse—if you’re lucky.
Query data ranges
to help define data
types and spot
erroneous outlier
values