My first Publication Agile-Data-Warehouse-Design-eBook | Page 276

256 Chapter 8 Queries that attempt to directly join fact tables using single SQL select clauses can overstate the facts! How Many Report 3: 2011 Employee Analysis, in Figure 8-18, shows the results of the previous query—but first take a look at the two smaller reports that preceeded it. Report 1 shows that employee James Bond has received three salary payments totalling £160,000. Report 2 shows that he has been absent 6 days. Now look at Report 3. It shows that James earned £320,000 and was absent 18 days. Something is clearly not right here: his salary has doubled and his absences have tripled! Figure 8-18 Overstating the facts Joining across a M:M relationship causes over- counting because SQL joins first, then aggregates the “too many rows” created by the join This over-counting is known as the “many to one to many problem”, “fan trap” or “chasm trap”. It occurs when the tables being joined have a M:M relationship. SQL has to evaluate the WHERE clause, which performs the joins ahead of the GROUP BY clause, so in the example the many Bond salaries (3) are joined to the many bond absences (2) creating too many rows, which are then summed up. Even if the fact tables have a 1:M relationship, any facts from the 1 side of the relationship will be overcounted. This is an insidious problem because the aggregation that’s inherent in most BI queries will hide the “too many rows”. The only totally safe join between fact tables is when there is a 1:1 relationship. This is very rare and hard to guarantee. Even then performance can be poor when millions of facts are joined. Solution Multiple fact tables should be accessed using drill-across queries that issue multi-pass SQL BI applications can avoid the M:M problem by performing drill-across queries. Drilling across means lining up measures from different business processes using conformed row headers. A drill-across query does this by issuing multi-pass SQL: sending separate SELECT statements to each star schema. These separate queries aggregate the facts to the same conformed row-header level before they are merged to produce a single answer set. Drilling across would provide the correct answer to Report 3 by running a query to summarize salaries by Employee ID and another to summarize absences by Employee ID and then merging (full outer join) the two correctly aggregated answer sets.