My first Publication Agile-Data-Warehouse-Design-eBook | Page 167
146
Chapter 5
RI prevents bad
keys getting into
good fact tables
DBMS constraints
can enforce RI but
ETL SK lookups can
often do this more
efficiently
Referential integrity means that every foreign key has a matching primary key.
Without RI checks, facts could be loaded into fact tables with corrupt dimension
foreign keys that do not match any existing dimensional values. If this happens,
any query that uses these bad keys will fail to include those facts because they will
not join to the appropriate dimensions. If these “bent needles” find there way into
the giant haystack of fact tables the (SQL “NOT IN”) queries needed to find them
are prohibitively expensive.
RI can be enforced by defining foreign key constraints in the database. However, in
practice, DBMSs can be too slow at loading data warehousing quantities of data
with RI switched on. In contrast, ETL processes are optimized for performing the
type of in-memory lookups required to check foreign keys against primary keys—
this is exactly what ETL does when translating business keys into surrogate keys
prior to loading fact tables. Effectively, the surrogate key processing provides “free”
procedural referential integrity, which allows DBMS RI checking to be safely
disabled.
DBMS query optimizers often benefit from having fact table RI constraints defined.
You can retain these optimization clues but still boost ETL performance by setting
the constraints to unenforced (you might call this “trust me I know what my ETL
process is doing” mode). This tells the optimizer what it needs to know about the
relationships between facts and dimensions to speed up queries, but avoids
unnecessary insert and update checks that would slow down ETL.
Enable DBMS-enforced RI for fact tables during ETL development and initial data
take-on to provide “belt and braces” data integrity assurance and test ETL surro-
gate key lookups. If no DBMS RI errors are raised, the ETL processes are assign-
ing valid surrogate keys to facts and the additional DBMS checks are
unnecessary. You can then disable the DBMS RI (drop the constraints or set them
to unenforced), if it is having an adverse effect on load times.
Slowly Changing Dimensions
Change stories
match the behavior
of slowly changing
dimensions
When you model dimensions with stakeholders you ask them how each dimen-
sional attribute should handle change. You record their answers as change stories
which illustrate the behavior of fixed value (FV) attributes that can only be cor-
rected and have no valid history, current value (CV) attributes with no required
history and historic value (HV) attributes that preserve their history. Figure 5-9
shows two Employee change stories for James Bond. The historical values of his
MARITAL STATUS and CITY (his HV attributes) have been carefully tracked on
each of his stories while his CV and FV attributes DEPARTMENT and DATE OF
BIRTH show only the current or corrected single values for all his stories. This
example data matches exactly how slowly changing dimensions (SCDs) are imple-
mented by ETL processes in a dimensional data warehouse using surrogate keys.