My first Publication Agile-Data-Warehouse-Design-eBook | Page 301
Dimensional Design Patterns for Cause and Effect
The STEP IN PURCHASE dimension role lets BI users analyze page visit
sequences that lead to product purchases and ones that don’t: page views that don’t
lead to a purchase would have a STEP IN PURCHASE KEY that points to the "Not
Applicable" row 0 in STEP.
281
Events that are not
part of a sequence
use STEP row 0
Consequences
STEP dimensions are relatively simple to populate from spreadsheets, but they
grow surprisingly quickly as the maximum number of steps increases. The formula
for calculating the number of rows needed for n total steps is: n (n+1)/2 . There-
fore, 200 steps = 20,100 rows, and 1,000 steps would be more than half a million
rows! If 200 steps are more than adequate for 99% of all sequences, pre-populate
your STEP dimension accordingly, and create special step number records greater
than 200 if/when they are needed. These records would use special STEP KEY
values (e.g. the negative step number) and would contain the STEP NUMBER but
have missing values for the other attributes to denote that they are steps in "excep-
tionally long" sequences. Often exceptionally long sequences are the result of ETL
processing errors or poorly defined business rules that fail to spot the end of a
normal sequence.
Although designing and creating a STEP dimension is straightforward, attaching it
to the facts can require significant additional ETL processing. The events that
belong to the same sequence have to be identified by an appropriate business rule
(for example, all the page visits from the same IP address that are no more than 10
minutes apart) and counted in a first pass of the data; only then can the correct
STEP KEYs be assigned to each fact row in a second pass.
STEP dimensions
grow in size quickly.
You should set a
maximum number
of steps for the
majority of
sequences
STEP dimensions
require additional
ETL processing to
make two passes of
the data
Overloading facts with STEP information and other richly descriptive why and how
dimensions takes significant additional effort from the ETL team. You should
make sure you take them to lunch—on a regular basis.
Audit Dimension Pattern
Problem/Requirement
No treatment of how and why would be complete without covering the perplexed
stakeholder’s questions:
How did this data get into our data warehouse?
Why are the figures so high/low?
Too often the answers to these questions are locked away in an ETL tool metadata
repository—inaccessible to BI users who need this information the most.
Stakeholders want to
query data lineage