tured. While this depends in part on technical imple-
mentation of a metadata infrastructure as described
in the earlier “Design Physical Storage” section, data
governance also means that business processes
determine the key metadata to be required. Similarly,
data quality requirements related to concepts such as
completeness, accuracy, consistency and standardization
are in essence business policy decisions that must be
made first, before the results of those decisions are
built into the technical systems and processes that
actually carry out these requirements.
The technologies used to implement data governance
policies in a data lake are typically not individual
products or services. A better approach is to expect
that observance of data governance requirements will
need to be embedded throughout the entire data lake
infrastructure and its tools.
Enable Metadata Cataloging and Search
Key Considerations
Any data lake design should incorporate a metadata
storage strategy so that business users can search,
locate and learn about the datasets that
are available in the lake. While traditional data ware-
housing stores a fixed and static set of meaningful data
definitions and characteristics within the relational
storage layer, data lake storage is intended to flexibly
support the application of schema at read time. How-
ever, this means a separate storage layer is required to
house cataloging metadata that represents technical
16 | THE DOPPLER | SUMMER 2017
and business meaning. While organizations some-
times simply accumulate contents in a data lake with-
out a metadata layer, this is a recipe certain to create
an unmanageable data swamp instead of a useful data
lake. There is a wide range of approaches and solutions
to ensure that appropriate metadata is created
and maintained. Here are some important principles
and patterns to keep in mind.
Enforce a metadata requirement - The best way to
ensure that appropriate metadata is created is to
enforce its creation. Ensure that every method through
which data arrives in the core data lake layer enforces
the metadata creation requirement, and that any new
data ingestion routine specifies how the metadata
creation requirement will be enforced.
Automate metadata creation - As with nearly everything
in the cloud, automation is the key to consistency
and accuracy. Wherever possible, design for metadata
to be created automatically by extracting it from the
source material.
Prioritize cloud-native solutions - Wherever possible,
use cloud-native automation frameworks to capture,
store and access metadata within your data lake.
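
The first two principles above can be illustrated with a minimal sketch. It is not a reference implementation: the required field names (owner, description, source_system), the in-memory dictionaries standing in for lake storage and the catalog, and the assumption that incoming data is a UTF-8 CSV file are all hypothetical choices made for illustration. The ingestion function rejects any dataset that arrives without the required business metadata, and derives the technical metadata automatically from the data itself.

```python
import csv
import hashlib
import io
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical required business fields. In practice this list is a
# governance decision made by the business, not by the code.
REQUIRED_BUSINESS_FIELDS = ("owner", "description", "source_system")


class MissingMetadataError(ValueError):
    """Raised when an ingestion request lacks required business metadata."""


@dataclass
class CatalogEntry:
    key: str
    business: dict   # supplied by the data producer
    technical: dict  # derived automatically from the payload


def extract_technical_metadata(payload: bytes) -> dict:
    """Derive technical metadata from the source material (assumed UTF-8 CSV)."""
    rows = list(csv.reader(io.StringIO(payload.decode("utf-8"))))
    return {
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "columns": rows[0] if rows else [],
        "row_count": max(len(rows) - 1, 0),  # exclude the header row
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def ingest(key: str, payload: bytes, business_metadata: dict,
           storage: dict, catalog: dict) -> CatalogEntry:
    """Store a dataset only if its required metadata is present."""
    missing = [f for f in REQUIRED_BUSINESS_FIELDS
               if not business_metadata.get(f)]
    if missing:
        # Enforce the requirement: nothing is written without metadata.
        raise MissingMetadataError(f"{key}: missing metadata fields {missing}")

    entry = CatalogEntry(key, dict(business_metadata),
                         extract_technical_metadata(payload))
    storage[key] = payload  # stand-in for the data lake storage layer
    catalog[key] = entry    # stand-in for the separate metadata layer
    return entry
```

Calling ingest with a CSV payload and a complete set of business fields stores the data and returns a catalog entry whose technical section was filled in without manual effort. Calling it with an incomplete set raises MissingMetadataError before anything is written, which is the behavior the enforcement principle asks for.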
The core attributes that are typically cataloged for a
data source are listed in the table on the following page.
An AWS-Based Solution Idea
An example of a simple solution has been suggested
by AWS, which involves triggering an AWS Lambda