My first Publication Agile-Data-Warehouse-Design-eBook | Page 270
250
Chapter 8
How Many
Move free text comments and lengthy sets of degenerate flags into their own
physical dimensions and replace them with short foreign keys (See Chapter 9).
Don’t store a large number of facts that can easily be calculated intra-record;
e.g., don’t store all the durations that can be calculated from a smaller number
of milestone timestamps.
Limit fact history to
only the data that is
useful for BI
The next thing to consider is the length of each fact table. You should try to limit
history to what the BI users really need. Don’t use fact tables as an expensive
archival strategy. If the auditors need more history than the BI users, they should
get that from the operational system of record, not the data warehouse. Regulatory
requirements are not analytical requirements, so don’t automatically load 20 years
of transactional history just because it exists. If the business has changed substan-
tially in that time how far can queries go back and make valid comparisons? Also,
the further back you go the harder it becomes to load the data because data quality
challenges tend to increase.
When modeling business events with stakeholders, ask for event stories describ-
ing the earliest when details that BI users will need to work with.
The most interesting data is the most recent. If you have years of history to load,
start with the current year and work backwards—partitioning can help to do this
efficiently. Don’t bother loading the oldest data until stakeholders ask for it.
Indexing
Create query
indexes on foreign
keys to support “star
join optimization”
More query indexes
can improve BI
performance but
slowdown ETL
Accumulating and
period-to-date
snapshots also
need an ETL update
index
After you have done all that you can to control the size of a fact table the next issue
to consider is how to index it for query performance. Here you should seek your
DBMS vendor’s advice on defining some form of “star join index.” This generally
involves creating a bitmap index on each dimensional foreign key—but techniques
vary by DBMS and by version, with new data warehousing index strategies being
added all the time (we hope).
Whatever indexing techniques you use, there is inevitably a trade-off between
query performance and ETL processing time. Your priorities should be heavily
biased towards query performance—but BI users can only query what you can
manage to load in the available time—so index thoughtfully!
In addition to query indexes, accumulating snapshots and period-to-date (PTD)
periodic snapshots need an ETL index to support efficient updates. This will be an
OLTP-style unique index using GD columns such as the ORDER ID degenerate
dimension in CUSTOMER ORDERS. Transaction fact tables and most periodic
snapshots are insert-only so they do not require a unique index, as long as ETL
processes can guarantee fact uniqueness.