My first Publication Agile-Data-Warehouse-Design-eBook | Page 155
134
Chapter 5
Automating Your Own Data Profiling Checks
Use SQL scripts
to generate data
profiling queries
that write their
results to a table
If you don’t have a data profiling tool but you do have hundreds or thousands of
source columns to check, you can use SQL-generated SQL to create data profiling
tests that write their results to a table for analysis and presentation with BI tools.
For example, the following SQL generates a set of INSERT statements that count
Nulls for all columns in a schema, and write the results to a PROFILING
_RESULTS table:
SELECT
'INSERT INTO PROFILING_RESULTS(TABLE_NAME, COLUMN_NAME,
MISSING_COUNT) SELECT '''
|| Table_Name
|| ''', '''
|| Column_Name
|| ''', COUNT(*) FROM '
|| Table_Name
|| ' WHERE '
|| Column_Name
|| ' IS NULL;'
FROM SYS.All_Tab_Columns
WHERE …
Search online for
ready-made
profiling scripts
Search online for “SQL data profiling script” and you should be able to find ready-
made scripts that you can adapt for your database platform that will create all the
tests recommended above and more and store the results in table form.
For in-depth coverage of data profiling, data quality measurement, and ETL
techniques for continuously addressing data quality read:
Data Quality: The Accuracy Dimension, Jack E. Olsen
(Morgan Kaufmann, 2003)
The Data Warehouse ETL Toolkit, Ralph Kimball, Joe Caserta
(Wiley, 2004) Chapter 4, pages 113–147
No Source Yet: Proactive DW/BI Design
Proactive DW/BI
designers have to
cope without stable
data sources to
profile (yet)
Use no source as
an opportunity to
define the perfect
BI data source
What if there is no source to profile? This might not be a disaster, just a timing
issue that needs to be anticipated. When agile BI systems are developed in parallel
with new operational systems, a proactive data warehouse design can preempt
operational system development or the installation of a packaged solution. Initially,
there will be no indicative source data, possibly not even a source data model. Even
when there is a well documented data model, as in the case of a packaged source, it
can provide little useful information until the system has been configured and real
data migrated to it.
ETL development is especially challenging when source data definitions are still
fluid (non-existent), but this does present an opportunity for the agile data ware-
house team to negotiate a better “data deal.” The BEAM ✲ model can be used to
provide a detailed specification of business intelligence data requirements to the