My first Publication Agile-Data-Warehouse-Design-eBook | Page 166
Modeling Star Schemas
145
Adding surrogate keys to dimensions in addition to their business keys slightly
increases the size of the dimensions, but this is insignificant. In general do not
worry about the size of dimensions. Although dimension rows can be long with
dozens of descriptive attributes, dimensions typically only contain hundreds or
thousands of rows in total, whereas fact tables can record millions of rows per day.
If you want to see where you should concentrate on saving storage take a look at
the Figure 5-8 “scale” diagram of a star schema. In a dimensional data warehouse,
fact tables along with their indexes and their staging tables account for 80%–90% of
the storage requirements. Don’t worry about
Of course not every dimension is small. Customer dimensions that contain indi-
vidual consumers can contain tens or hundreds of millions of rows. These need to
be treated with the same respect as fact tables and will benefit too from a primary
key index based on a compact 4-byte integer rather than a longer “smarter’ alpha-
numeric Customer ID that contains a check digit. Chapter 6 covers specific tech-
niques for handling very large dimensions (VLDs), also known as “monster
dimensions”. …except customer
the size of
dimension tables…
dimensions – they
can be big!
Figure 5-8
Scale diagram of a
star schema by
space used
For fixed length surrogate keys, a 4-byte integer is suitable for most dimension
populations. If you are expecting a crowd (more than 2.1 billion whos or whats) or
have specialized calendar dimension requirements (discussed in Chapter 7), you
should use an 8-byte integer surrogate key.
Improve Query Performance
Because integer surrogate keys reduce the size of fact tables and indexes, they help
to fit more records into each read operation. This in turn leads to improved star
join processing and dramatic improvements in query performance.
Enforce Referential Integrity Efficiently
Surrogate keys do add an extra layer of complexity to ETL. It is true that fact
records can be loaded quicker if business keys are not replaced by surrogates. But
this overhead is more than outweighed by the benefits listed above and a byproduct
of surrogate keys is efficient referential integrity (RI).
Integer SKs join
faster than date and
character keys
SK lookups slow
down fact loads but
enforce referential
integrity (RI)