
• Immutable Data for All Work – All work done in a data lake should be executed on immutable data; this ensures that rogue processes or analyses can be removed without affecting data quality for future analysis.
• De-identifying Data – Many organizations deal with sensitive data, including healthcare, financial or personal information. A data lake creates a unique risk by allowing many individuals to access data previously stored in silos. All data put into a data lake and made accessible to a wide audience should be de-identified to ensure that personal privacy is protected; a minimal sketch of one common approach follows this list. Many data lakes have separate areas for de-identified and identifiable data, with each section accessible only to the appropriate staff.
• Source of Record – A data lake will pull data from multiple sources, as well as feed analytical results back to operational systems. This requires that organizations carefully track the Source of Record for each data type and understand how that information moves between systems, and how it is referenced, to ensure data integrity.
• Relationship Mapping – As organizations have grown their silos of data over many years, relationships in the data have become complex. A successful data lake must ensure that data elements are properly mapped, so that reporting can span systems, time frames and business units.
• Metadata Catalog – To ensure that all data lake users can effectively locate required data, a metadata catalog should be deployed to provide information about data sets, relationships, data quality and historical information, including past analyses and results; see the catalog sketch after this list.
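The de-identification step mentioned above can be as simple as replacing direct identifiers with salted hashes before data lands in the widely accessible zone. The sketch below is illustrative only: the field names and the salt are assumptions, and a production pipeline would pull its salt or tokenization keys from a secrets manager and drive the PII field list from the metadata catalog rather than hard-coding it.

```python
import hashlib

# Illustrative assumptions: a real deployment would manage the salt in a
# secrets store and define PII fields per data set in the metadata catalog.
SALT = "replace-with-a-managed-secret"
PII_FIELDS = {"patient_name", "ssn", "email"}

def de_identify(record: dict) -> dict:
    """Return a copy of the record with PII fields replaced by salted hashes.

    Hashing (rather than deleting) the values keeps a stable pseudonym, so
    records can still be joined across data sets without exposing the
    original identifier.
    """
    clean = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            clean[field] = hashlib.sha256(
                (SALT + str(value)).encode("utf-8")
            ).hexdigest()
        else:
            clean[field] = value
    return clean

if __name__ == "__main__":
    raw = {"patient_name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "A10"}
    print(de_identify(raw))
```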
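As a companion illustration, the following minimal, in-memory catalog sketch shows the kind of information a catalog entry can carry: the source of record, schema, mapped relationships, quality checks and lineage, which also ties back to the Source of Record and Relationship Mapping practices. The DatasetEntry and MetadataCatalog names, their fields and the sample "claims_2016" data set are assumptions for illustration, not a reference to any specific catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    """One catalog record describing a data set in the lake (illustrative)."""
    name: str
    source_of_record: str           # system that owns the authoritative copy
    schema: dict                    # column name -> type
    related_datasets: list = field(default_factory=list)  # mapped relationships
    quality_checks: dict = field(default_factory=dict)    # check name -> last result
    lineage: list = field(default_factory=list)           # past analyses and loads
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class MetadataCatalog:
    """Minimal in-memory catalog: register data sets and search by keyword."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        self._entries[entry.name] = entry

    def search(self, keyword: str):
        kw = keyword.lower()
        return [e for e in self._entries.values()
                if kw in e.name.lower() or kw in e.source_of_record.lower()]

catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    name="claims_2016",
    source_of_record="claims_system",
    schema={"claim_id": "string", "amount": "decimal", "member_id": "string"},
    related_datasets=["members_deidentified"],
    quality_checks={"row_count_nonzero": "passed"},
    lineage=["loaded from claims_system nightly export"],
))
print([e.name for e in catalog.search("claims")])
```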
Data Security
A key component of all data lake implementations is a strong set of security controls, backed by organizational governance policies. Because of the disparate data sets brought together in a data lake, and the variety of users accessing the data through both structured and ad hoc methods, the governance and security controls must be clear, automated and able to respond actively to business needs and outside threats.
Figure 9 outlines three best practices for securing data as it is integrated into a data lake:
• Security Context – All security context, including access controls, tagging and ownership, should be carried with data when it moves between systems. This ensures that as data is imported or exported, the policies that travel with it remain consistent across systems.
• Identities – Identities should be consistent across all systems; inevitably, data will be replicated to provide for performance needs, and consistent