My first Publication Agile-Data-Warehouse-Design-eBook | Page 153

132 Chapter 5 Master data management systems help dimensional conformance If you are extremely fortunate, you may have a Master Data Management (MDM) system that can help you identify the sources to profile for the most important conformed dimensions. MDM captures, cleanses, and synchronizes reference data across operational systems and can provide the cross-referenced business keys that ETL needs to conform multiple sources. Profile early, before the data warehouse model exists or is updated, then any data quality issues you discover must be inherent in the established system-of-record not a problem with the newcomer database or ETL process used to build it. Do it the other way round and see who stakeholders (subconsciously) blame. Data Profiling Techniques Dedicated data profiling tools are incredibly powerful but useful profiling can be done using SQL scripts and BI tools Data profiling has become common practice, and sophisticated profiling tools that can graphically visualize data sources are readily available as standalone applica- tions and as modules of data modeling and ETL tools. But even without specialized tools, useful data profiling can be performed with simple SQL scripts, BI tools, and spreadsheets. A full discussion of data profiling techniques is beyond the scope of this book, but here are three basic checks for quickly assessing whether a data source is fit for purpose in the data warehouse: Missing Values The first, best test for any source is to count missing values and calculate the percentage missing Nothing (literally) illustrates the value of data profiling more than the early discov- ery of missing data that the stakeholders have deemed mandatory. Profile for missing values by counting the occurrence of Nulls or blanks in each candidate source column/field and calculating the percentage missing. Knowing how often the source data is Null is essential for any column—but especially for columns that have been identified as mandatory (MD) by the stakeholders. The SQL for count- ing Null values in a column is: SELECT COUNT(*) FROM [SourceTable] WHERE [SourceColumn] IS NULL /*For character columns you should add:*/ or [SourceColumn] = ‘’ When you are working with non-database sources, such as flat files, you can map the source data as external tables, or perform basic ETL, with minimal transforma- tions, to move it into DBMS tables, so that it can be profiled using SQL queries and BI tools.