My first Publication Agile-Data-Warehouse-Design-eBook | Page 152
Modeling Star Schemas
Agile data profiling is a form of test-driven (or test-first) design (TDD). Profiling
the source data provides you with metrics that can be used to test the fit of a data
warehouse model and the content of a data warehouse database before you develop
your SQL DDL (data definition language) and ETL code. When profiling isn’t
possible yet, a proactive DW/BI design can be viewed as an advanced test specifica-
tion for the new operational system; a test that asks “can the system supply this
data required for BI to this specification?”
131
Think of agile data
profiling as a form of
test-driven design
Identifying Candidate Data Sources
Data warehouses should be sourced from the most current and accurate data
available, which, in practice, means identifying the system-of-record (SoR) for each
fact and dimension. The system-of-record is the authoritative source for a particu-
lar type of data, such as The Payroll System for employee salary data. Data should
be extracted directly from the system-of-record rather than downstream copies—
from the original image of the data, rather than a “photocopy of a photocopy”—to
reduce latency, system dependencies and data quality issues. The only exceptions
should be where downstream systems explicitly improve data quality or unlock
proprietary data formats. Find the original
For business events and the facts they provide, finding the system-of-record that
creates and maintains the original transactions is relatively straightforward as there
is often only one source system for a specific event type. For example, the claims
processing system would be the obvious and possibly only choice for sourcing
claim submission events. But where should claiming customers or insurance
products be sourced from? Identifying the system-of-record for dimensions like
these can be far more challenging. Events/facts often
Conformed dimensions are common to multiple business processes, which may
themselves be implemented using a mixture of purchased enterprise software
packages and bespoke in-house applications. It is not uncommon for several
operational systems to independently maintain common reference data (some-
times called master data), such as Customers, Products, and Employees: the most
valuable candidate conformed dimensions. You may need to profile systems that
are outside of your present prioritized event scope to find the best source for a
conformed dimension and spot any conflicts that would hamper conformance and
reuse in the future. There may be no single best source for a dimension! Conformed
If you are fortunate, one system will have been declared as the system-of-record for
each conformed dimension. But even then, facts (events) from other systems may
use alternate business keys and carry additional dimensional attributes. If so,
conforming ETL processes will need to match the keys from the systems-of-record
and these other sources to create the “perfect” set of conformed dimensional
attributes for the next sprint and ultimately to be able to load facts from these
sources in the future. Conforming ETL
system-of-record
(SoR). Avoid
downstream copies
that introduce
latency and data
quality issues
have unique
sources
dimensions can
have multiple
sources that should
be profiled to
identify
conformance
conflicts
processes may
have to merge
sources to obtain all
the necessary keys
and attributes