The Doppler Quarterly Summer 2017

tured. While this depends in part on technical imple- mentation of a metadata infrastructure as described in the earlier “Design Physical Storage” section, data governance also means that business processes determine the key metadata to be required. Similarly, data quality requirements related to concepts such as completeness, accuracy, consistency and standard- ization are in essence business policy decisions that must first be made, before baking the results of those decisions into the technical systems and processes that actually carry out these requirements. The technologies used to implement data governance policies in a data lake implementation are typically not individual products or services. The better approach is to expect the need to embed the obser- vance of data governance requirements into the entire data lake infrastructure and tools. Enable Metadata Cataloging and Search Key Considerations Any data lake design should incorporate a metadata storage strategy to enable the business users to be able to search, locate and learn about the datasets that are available in the lake. While traditional data ware- housing stores a fixed and static set of meaningful data definitions and characteristics within the relational storage layer, data lake storage is intended to flexibly support the application of schema at read time. How- ever, this means a separate storage layer is required to house cataloging metadata that represents technical 16 | THE DOPPLER | SUMMER 2017 and business meaning. While organizations some- times simply accumulate contents in a data lake with- out a metadata layer, this is a recipe certain to create an unmanageable data swamp instead of a useful data lake. There are a wide range of approaches and solu- tions to ensure that appropriate metadata is created and maintained. Here are some important principles and patterns to keep in mind. Enforce a metadata requirement - The best way to ensure that appropriate metadata is created is to enforce its creation. Ensure that all methods through which data arrives in the core data lake layer enforce the metadata creation requirement, and that any new data ingestion routines must specify how the meta- data creation requirement will be enforced. Automate metadata creation - Like nearly every- thing on the cloud, automation is the key to consis- tency and accuracy. Wherever possible, design for automatic metadata creation extracted from source material. Prioritize cloud-native solutions - Wherever possible, use cloud-native automation frameworks to capture, store and access metadata within your data lake. The core attributes that are typically cataloged for a data source are listed in the table on the following page. An AWS-Based Solution Idea An example of a simple solution has been suggested by AWS, which involves triggering an AWS Lambda

The Doppler Quarterly Summer 2017 | Page 18