Figure 3: An AWS Suggested Architecture for Data Lake Metadata Storage
function when a data object is created on S3, which stores the object's attributes in a DynamoDB table. The resultant DynamoDB-based data catalog can be indexed by Elasticsearch, allowing business users to perform full-text searches.
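The cataloging step described above can be sketched as a Lambda handler: an S3 "ObjectCreated" notification triggers the function, which extracts the object's attributes and writes them to DynamoDB. The table name `DataLakeCatalog` and the attribute names are illustrative assumptions, not details from the article.

```python
def object_attributes(event):
    """Extract catalog attributes from an S3 event notification record."""
    rec = event["Records"][0]
    obj = rec["s3"]["object"]
    return {
        "bucket": rec["s3"]["bucket"]["name"],
        "key": obj["key"],
        "size_bytes": obj["size"],
        "event_time": rec["eventTime"],
    }

def lambda_handler(event, context):
    # boto3 is imported lazily so object_attributes stays usable offline.
    import boto3
    table = boto3.resource("dynamodb").Table("DataLakeCatalog")  # assumed table name
    table.put_item(Item=object_attributes(event))  # store attributes in the catalog
```

In practice the handler is wired to the bucket through an S3 event notification, so every new object lands in the catalog automatically.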
AWS Glue, a product soon to be released, provides a set of automated tools to support data-source cataloging. AWS Glue can crawl data sources and construct a data catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, Parquet, and more. As such, it shows promise for enterprise implementations.
We recommend that clients make data cataloging a central requirement for a data lake implementation .
Access and Mine the Lake
Schema on Read
‘Schema on write’ is the tried-and-tested pattern of cleansing, transforming and adding a logical schema to data before it is stored in a ‘structured’ relational database. However, as noted previously, data lakes are built on the opposite pattern of ‘schema on read’, which prevents the primary data store from being locked into a predetermined schema. Data is stored in raw or only lightly processed form, and each analysis tool can impose on the dataset the business meaning appropriate to its analysis context. This approach has many benefits, including allowing different tools to access the same data for different purposes.
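Schema on read can be illustrated in miniature with plain Python rather than a full analysis engine: the same raw JSON records are given two different read-time schemas, one per analysis context. The records and field names are invented for the example.

```python
import json

# Raw, lightly processed records as they might sit in the lake.
RAW = "\n".join([
    '{"user": "a1", "amount": "19.99", "ts": "2017-06-01"}',
    '{"user": "b2", "amount": "5.00", "ts": "2017-06-02"}',
])

def finance_view(raw):
    # A finance analysis imposes typed monetary amounts at read time.
    return [{"user": r["user"], "amount": float(r["amount"])}
            for r in map(json.loads, raw.splitlines())]

def activity_view(raw):
    # A usage analysis reads the same bytes but only needs user/date pairs.
    return [(r["user"], r["ts"]) for r in map(json.loads, raw.splitlines())]
```

Neither view changes the stored data; each schema lives in the reading tool, which is exactly what keeps the primary store unconstrained.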
Data Processing
Once you have the raw layer of immutable data in the lake , you will need to create multiple layers of processed data to enable various use cases in the organization . These are examples of the structured storage described earlier . Typical operations required to create these structured data stores will involve :
• Combining different datasets
• Denormalization
• Cleansing, deduplication, householding
• Deriving computed data fields
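The operations above can be illustrated in miniature with plain Python (at scale, a distributed engine would perform them); the datasets and fields are invented for the example.

```python
# Raw-layer inputs, including a duplicate customer record.
customers = [
    {"cust_id": 1, "name": "Ann Lee", "household": "lee-42"},
    {"cust_id": 1, "name": "Ann Lee", "household": "lee-42"},  # duplicate
    {"cust_id": 2, "name": "Bo Kim", "household": "kim-7"},
]
orders = [{"cust_id": 1, "total": 40.0}, {"cust_id": 2, "total": 10.0}]

# Cleansing / deduplication: keep one record per customer id.
unique = list({c["cust_id"]: c for c in customers}.values())

# Combining datasets + denormalization: fold customer attributes into each order.
by_id = {c["cust_id"]: c for c in unique}
denorm = [{**by_id[o["cust_id"]], **o} for o in orders]

# Deriving a computed data field (8% tax rate is an assumption).
for row in denorm:
    row["total_with_tax"] = round(row["total"] * 1.08, 2)

# Householding: group customers under a shared household key.
households = {}
for c in unique:
    households.setdefault(c["household"], []).append(c["name"])
```

Each step yields a new structured layer while the raw inputs remain immutable.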
Apache Spark has become the leading choice for processing the raw data layer into these value-added, structured data layers.
Data Warehousing
For some specialized use cases (think high-performance data warehouses), you may need to run SQL queries on petabytes of data and return complex analytical results very quickly. In those cases, you may need to ingest a portion of your data from your
18 | THE DOPPLER | SUMMER 2017