My first Publication Agile-Data-Warehouse-Design-eBook | Page 276
256
Chapter 8
Queries that attempt
to directly join fact
tables using single
SQL select clauses
can overstate the
facts!
How Many
Report 3: 2011 Employee Analysis, in Figure 8-18, shows the results of the previous
query—but first take a look at the two smaller reports that preceeded it. Report 1
shows that employee James Bond has received three salary payments totalling
£160,000. Report 2 shows that he has been absent 6 days. Now look at Report 3. It
shows that James earned £320,000 and was absent 18 days. Something is clearly not
right here: his salary has doubled and his absences have tripled!
Figure 8-18
Overstating the
facts
Joining across a
M:M relationship
causes over-
counting because
SQL joins first, then
aggregates the “too
many rows” created
by the join
This over-counting is known as the “many to one to many problem”, “fan trap” or
“chasm trap”. It occurs when the tables being joined have a M:M relationship. SQL
has to evaluate the WHERE clause, which performs the joins ahead of the GROUP
BY clause, so in the example the many Bond salaries (3) are joined to the many
bond absences (2) creating too many rows, which are then summed up. Even if the
fact tables have a 1:M relationship, any facts from the 1 side of the relationship will
be overcounted. This is an insidious problem because the aggregation that’s
inherent in most BI queries will hide the “too many rows”. The only totally safe
join between fact tables is when there is a 1:1 relationship. This is very rare and
hard to guarantee. Even then performance can be poor when millions of facts are
joined.
Solution
Multiple fact tables
should be accessed
using drill-across
queries that issue
multi-pass SQL
BI applications can avoid the M:M problem by performing drill-across queries.
Drilling across means lining up measures from different business processes using
conformed row headers. A drill-across query does this by issuing multi-pass SQL:
sending separate SELECT statements to each star schema. These separate queries
aggregate the facts to the same conformed row-header level before they are merged
to produce a single answer set. Drilling across would provide the correct answer to
Report 3 by running a query to summarize salaries by Employee ID and another to
summarize absences by Employee ID and then merging (full outer join) the two
correctly aggregated answer sets.