Building a data lake means integrating a set of complex technologies that work together to provide access to diverse data sets. The following key functional areas should be included in every data lake deployment:
Data Processing – The ability of the data lake to connect seamlessly to other systems, provide clean data mappings, and move data around in an automated, highly reliable manner.
• Streaming – The capability to analyze and make decisions on data while it is in flight.
• Rules / Matching – The ability to execute pattern matching against data for operations such as de-identification or deduplication (see the pipeline sketch after this list).
• ETL – An Extract-Transform-Load engine is key to integrating with existing RDBMS and EDW platforms.
• Governance – Governance should be implemented consistently at the edge of the data lake to ensure compliance and adherence to corporate policies.
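To make the streaming and rules/matching capabilities concrete, here is a minimal Python sketch of an in-flight pipeline that applies a rule-based de-identification step (masking SSN-like values) and hash-based deduplication to a stream of records. The record layout, the masking pattern, and all function names are illustrative assumptions, not features of any particular product.

```python
import hashlib
import re

# Illustrative rule: mask anything that looks like a US SSN (hypothetical pattern).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def deidentify(record: dict) -> dict:
    """Apply the pattern-matching rule to scrub sensitive string values."""
    return {k: SSN_PATTERN.sub("XXX-XX-XXXX", v) if isinstance(v, str) else v
            for k, v in record.items()}

def deduplicate(records):
    """Drop records whose content hash has already been seen."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(repr(sorted(record.items())).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record

def stream_pipeline(source):
    """Chain the rules over an in-flight record stream; a plain generator
    stands in here for a real streaming engine."""
    for record in deduplicate(source):
        yield deidentify(record)

if __name__ == "__main__":
    incoming = [
        {"name": "Ann", "ssn": "123-45-6789"},
        {"name": "Ann", "ssn": "123-45-6789"},  # duplicate, dropped
        {"name": "Bob", "ssn": "987-65-4321"},
    ]
    for clean in stream_pipeline(iter(incoming)):
        print(clean)
```

In a real deployment the same rules would run inside a streaming or ETL engine rather than a plain generator, but the shape of the pipeline is the same.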
Data Storage & Retrieval – These are functional areas to enable developers to query data in standard formats , using standard APIs from the data lake .
• Batch – High throughput , high latency processing for data that is being analyzed , not commonly used for interactive workloads .
• Analytical – Commonly used for interactive workloads where the queries change over time .
• In-memory – Used to support very low latency queries that support interactive usage or other low latency needs .
• Search / Index – These support the ability to locate information and relationships quickly .
• OLTP – Targeted to support transactional systems commonly found within business units and operations teams .
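As a rough illustration of these retrieval styles behind a standard API, the sketch below uses Python's built-in sqlite3 module as a stand-in for a lake query engine; a real deployment would point the same SQL at an engine such as Hive or Impala. The table, columns, and data are hypothetical.

```python
import sqlite3

# In-memory database standing in for a SQL engine over the lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "login", "2016-07-01"),
     ("u1", "purchase", "2016-07-02"),
     ("u2", "login", "2016-07-02")],
)

# An index supports quickly locating information and relationships (Search / Index).
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# Analytical-style query: ad hoc aggregation that changes as questions change.
for row in conn.execute(
        "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY 2 DESC"):
    print(row)

# Point lookup typical of OLTP-style access, served by the index above.
print(conn.execute(
    "SELECT * FROM events WHERE user_id = ?", ("u1",)).fetchall())
conn.close()
```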
Storage – There are two primary types:
• Object – An object store is a key component of a data lake, storing non-relational data as well as historical copies of information for later analysis.
• Long Term – Long-term storage, commonly a tier of the object store, is necessary for archiving data that may not be used regularly but must remain accessible; it is commonly used to satisfy compliance policies and legal hold rules (see the sketch after this list).
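Both storage types can be sketched with boto3 against an S3-compatible object store: an object write into the lake, plus a lifecycle rule that transitions data to an archive tier for long-term retention. The bucket name, key prefix, and 90-day retention period are assumptions, and the snippet presumes AWS credentials and an existing bucket.

```python
import boto3

BUCKET = "example-data-lake"  # hypothetical bucket name

s3 = boto3.client("s3")

# Object storage: land non-relational / historical data as immutable objects.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/events/2016/07/01/events.json",
    Body=b'{"user_id": "u1", "action": "login"}',
)

# Long-term storage: a lifecycle rule moving rarely used data to an archive
# tier (Glacier here) while keeping it accessible for compliance needs.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-after-90-days",  # assumed retention policy
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```

Legal-hold and retention-lock features vary by provider, so the lifecycle rule here stands in for whatever archival policy the organization's compliance rules require.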