My first Publication Agile-Data-Warehouse-Design-eBook | Page 153
132
Chapter 5
Master data
management
systems help
dimensional
conformance
If you are extremely fortunate, you may have a Master Data Management (MDM)
system that can help you identify the sources to profile for the most important
conformed dimensions. MDM captures, cleanses, and synchronizes reference data
across operational systems and can provide the cross-referenced business keys that
ETL needs to conform multiple sources.
Profile early, before the data warehouse model exists or is updated, then any data
quality issues you discover must be inherent in the established system-of-record
not a problem with the newcomer database or ETL process used to build it. Do it
the other way round and see who stakeholders (subconsciously) blame.
Data Profiling Techniques
Dedicated data
profiling tools are
incredibly powerful
but useful profiling
can be done using
SQL scripts and BI
tools
Data profiling has become common practice, and sophisticated profiling tools that
can graphically visualize data sources are readily available as standalone applica-
tions and as modules of data modeling and ETL tools. But even without specialized
tools, useful data profiling can be performed with simple SQL scripts, BI tools, and
spreadsheets. A full discussion of data profiling techniques is beyond the scope of
this book, but here are three basic checks for quickly assessing whether a data
source is fit for purpose in the data warehouse:
Missing Values
The first, best test
for any source is
to count missing
values and calculate
the percentage
missing
Nothing (literally) illustrates the value of data profiling more than the early discov-
ery of missing data that the stakeholders have deemed mandatory. Profile for
missing values by counting the occurrence of Nulls or blanks in each candidate
source column/field and calculating the percentage missing. Knowing how often
the source data is Null is essential for any column—but especially for columns that
have been identified as mandatory (MD) by the stakeholders. The SQL for count-
ing Null values in a column is:
SELECT COUNT(*) FROM [SourceTable] WHERE [SourceColumn]
IS NULL
/*For character columns you should add:*/
or [SourceColumn] = ‘’
When you are working with non-database sources, such as flat files, you can map
the source data as external tables, or perform basic ETL, with minimal transforma-
tions, to move it into DBMS tables, so that it can be profiled using SQL queries
and BI tools.