My first Publication Agile-Data-Warehouse-Design-eBook | Page 152

Modeling Star Schemas Agile data profiling is a form of test-driven (or test-first) design (TDD). Profiling the source data provides you with metrics that can be used to test the fit of a data warehouse model and the content of a data warehouse database before you develop your SQL DDL (data definition language) and ETL code. When profiling isn’t possible yet, a proactive DW/BI design can be viewed as an advanced test specifica- tion for the new operational system; a test that asks “can the system supply this data required for BI to this specification?” 131 Think of agile data profiling as a form of test-driven design Identifying Candidate Data Sources Data warehouses should be sourced from the most current and accurate data available, which, in practice, means identifying the system-of-record (SoR) for each fact and dimension. The system-of-record is the authoritative source for a particu- lar type of data, such as The Payroll System for employee salary data. Data should be extracted directly from the system-of-record rather than downstream copies— from the original image of the data, rather than a “photocopy of a photocopy”—to reduce latency, system dependencies and data quality issues. The only exceptions should be where downstream systems explicitly improve data quality or unlock proprietary data formats. Find the original For business events and the facts they provide, finding the system-of-record that creates and maintains the original transactions is relatively straightforward as there is often only one source system for a specific event type. For example, the claims processing system would be the obvious and possibly only choice for sourcing claim submission events. But where should claiming customers or insurance products be sourced from? Identifying the system-of-record for dimensions like these can be far more challenging. Events/facts often Conformed dimensions are common to multiple business processes, which may themselves be implemented using a mixture of purchased enterprise software packages and bespoke in-house applications. It is not uncommon for several operational systems to independently maintain common reference data (some- times called master data), such as Customers, Products, and Employees: the most valuable candidate conformed dimensions. You may need to profile systems that are outside of your present prioritized event scope to find the best source for a conformed dimension and spot any conflicts that would hamper conformance and reuse in the future. There may be no single best source for a dimension! Conformed If you are fortunate, one system will have been declared as the system-of-record for each conformed dimension. But even then, facts (events) from other systems may use alternate business keys and carry additional dimensional attributes. If so, conforming ETL processes will need to match the keys from the systems-of-record and these other sources to create the “perfect” set of conformed dimensional attributes for the next sprint and ultimately to be able to load facts from these sources in the future. Conforming ETL system-of-record (SoR). Avoid downstream copies that introduce latency and data quality issues have unique sources dimensions can have multiple sources that should be profiled to identify conformance conflicts processes may have to merge sources to obtain all the necessary keys and attributes