My first Publication Agile-Data-Warehouse-Design-eBook | Page 169
148
Chapter 5
Composite keys
involving effective
dates require
complex joins to
fact tables
With a composite key involving effective date this becomes a far more difficult to
optimize complex (or theta) join:
Employee.Employee_ID = Expenses_Fact.Employee_ID and
Expenses_Fact.Expense_Date
between Employee.Effective_date and Employee.End_date
Without the between join on the dates, all of Bond’s expenses would be joined to
each historical version of him, triple counting his total based on the three versions
of Bond in Figure 5-9. If the above join looks complex, imagine now that
EMPLOYEE isn’t the only HV dimension that must be joined to the facts, each
join would be just as complex. This would not be a viable query strategy against
typical data warehousing quantities of facts.
A Type 2 SCD
surrogate key
partitions history
by using a simple
equi-join
Instead, Type 2 SCDs use a surrogate key as an efficient minimal primary key that
uniquely identifies each historical version of a dimension member. Figure 5-9
shows the surrogate key EMPLOYEE KEY being added to the dimension. This
would become a foreign key in all employee related fact tables. For Bond, his
earliest expense claims and sales transactions would have an EMPLOYEE KEY of
1010 while his most recent will be have 2120. A Type 2 SCD surrogate key guaran-
tees that efficient equi-joins will automatically join historical facts to the correct
historical descriptions and the most recent facts to current descriptions. They also
have the effect of making reports “repeatable”; for example, Bond’s 1968 expenses
will always be reported as incurred by a single man never a widower.
Surrogate keys should be hidden from stakeholders wherever possible. BI tools
use them as the mechanical way of joining facts and dimensions only. Surrogate
keys are never to be used for sorting, grouping or filtering data. For example, you
cannot rely on the highest surrogate key for an employee being the most recent
version. A late-arriving employee change will be assigned a higher sequence
number surrogate key than the current version. BI tools also have to be careful to
count distinct employee IDs rather than employee version rows and show distinct
lists of Employees so that so that stakeholders don’t see “Bond, Bond, Bond, …”
Current Values or Historical Values? Why Not Both?
Define attributes as
HV if you think their
full history will be
needed in the
future. Refactoring
“late arriving” history
is expensive
Treat CV as a
reporting default
rather than an ETL
instruction
You should carefully consider current value only (CV) dimension definitions. If
you suspect that historical values might be needed at some point you should define
these attributes as HV and design your dimensions and ETL processes accordingly
to record historical values from the outset. Refactoring an attribute as HV and
adding its “late arriving” dimensional history to an established warehouse is
complex and expensive, and can involve updating the foreign keys of hundreds of
millions of existing facts.
As discussed in Chapter 3, you should often treat CV codes added to the model by
stakeholders as reporting directives rather than storage decisions. CV tells you that
the stakeholders would prefer their reports to initially default to current values
(because that is what they are used to). When their analysis requirements become