My first Publication Agile-Data-Warehouse-Design-eBook | Page 169

148 Chapter 5 Composite keys involving effective dates require complex joins to fact tables With a composite key involving effective date this becomes a far more difficult to optimize complex (or theta) join: Employee.Employee_ID = Expenses_Fact.Employee_ID and Expenses_Fact.Expense_Date between Employee.Effective_date and Employee.End_date Without the between join on the dates, all of Bond’s expenses would be joined to each historical version of him, triple counting his total based on the three versions of Bond in Figure 5-9. If the above join looks complex, imagine now that EMPLOYEE isn’t the only HV dimension that must be joined to the facts, each join would be just as complex. This would not be a viable query strategy against typical data warehousing quantities of facts. A Type 2 SCD surrogate key partitions history by using a simple equi-join Instead, Type 2 SCDs use a surrogate key as an efficient minimal primary key that uniquely identifies each historical version of a dimension member. Figure 5-9 shows the surrogate key EMPLOYEE KEY being added to the dimension. This would become a foreign key in all employee related fact tables. For Bond, his earliest expense claims and sales transactions would have an EMPLOYEE KEY of 1010 while his most recent will be have 2120. A Type 2 SCD surrogate key guaran- tees that efficient equi-joins will automatically join historical facts to the correct historical descriptions and the most recent facts to current descriptions. They also have the effect of making reports “repeatable”; for example, Bond’s 1968 expenses will always be reported as incurred by a single man never a widower. Surrogate keys should be hidden from stakeholders wherever possible. BI tools use them as the mechanical way of joining facts and dimensions only. Surrogate keys are never to be used for sorting, grouping or filtering data. For example, you cannot rely on the highest surrogate key for an employee being the most recent version. A late-arriving employee change will be assigned a higher sequence number surrogate key than the current version. BI tools also have to be careful to count distinct employee IDs rather than employee version rows and show distinct lists of Employees so that so that stakeholders don’t see “Bond, Bond, Bond, …” Current Values or Historical Values? Why Not Both? Define attributes as HV if you think their full history will be needed in the future. Refactoring “late arriving” history is expensive Treat CV as a reporting default rather than an ETL instruction You should carefully consider current value only (CV) dimension definitions. If you suspect that historical values might be needed at some point you should define these attributes as HV and design your dimensions and ETL processes accordingly to record historical values from the outset. Refactoring an attribute as HV and adding its “late arriving” dimensional history to an established warehouse is complex and expensive, and can involve updating the foreign keys of hundreds of millions of existing facts. As discussed in Chapter 3, you should often treat CV codes added to the model by stakeholders as reporting directives rather than storage decisions. CV tells you that the stakeholders would prefer their reports to initially default to current values (because that is what they are used to). When their analysis requirements become