Train machine learning algorithms to predict groupings and label new data using those predictions.
Guess unknowns based on trained relationships
Identify key entity relationships
Figure 4: Machine Learning Process
Machine learning is a series of iterative steps that use a known, analyzed data set to train specific models for later execution against unknown data sets. Figure 4 shows the typical steps data scientists follow to train these models. Once trained, the models can be used in conjunction with a variety of analytical tools, including R, SAS and open source tools written in Python.
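The train-then-apply pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the process from Figure 4: it assumes a simple nearest-centroid classifier, and the customer segments, feature names and values are hypothetical, hand-made data. Part of the known data set trains the model, the remainder is held out to check it, and the trained model then labels records it has never seen.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Fit a nearest-centroid model: one mean point per label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(model, features):
    """Label a point with the class whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))


def accuracy(model, samples):
    """Fraction of labeled samples the model classifies correctly."""
    hits = sum(predict(model, f) == label for f, label in samples)
    return hits / len(samples)


# Hypothetical known data: (monthly_visits, avg_order_value) -> segment
known = [
    ((1.0, 20.0), "occasional"),
    ((2.0, 25.0), "occasional"),
    ((1.5, 18.0), "occasional"),
    ((9.0, 80.0), "frequent"),
    ((10.0, 95.0), "frequent"),
    ((8.5, 90.0), "frequent"),
]

# Split the known data: train on one part, validate on the rest.
training, holdout = known[::2], known[1::2]
model = train(training)
print("holdout accuracy:", accuracy(model, holdout))

# Apply the trained model to new, unlabeled records.
for record in [(9.5, 85.0), (1.2, 22.0)]:
    print(record, "->", predict(model, record))
```

In practice the train and validate steps repeat, with different features, algorithms or parameters, until the holdout accuracy is acceptable. Only then is the model run against unknown data.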
Cloud-based data lakes add the value of platform-provided machine learning capabilities. Vendors, including AWS and Google, provide a rich set of pre-trained models for immediate use against data sets, as well as the ability to train custom models against proprietary data sets. Both AWS and Google have deployed variants of the machine learning technologies they have used and refined internally over many years.
Architecture
The technical architecture for a data lake must match the dominant use cases run on the platform. When designing the data lake solution, the key design factors are:
• Use Cases – Early identification of the use cases and workloads for the data lake allows proper prioritization of analysis engines, scalability considerations and data integration points.
• Operational Aspects – The data lake architecture should factor in the tools needed for monitoring and response, as well as which technologies will keep the system maintainable by your organization's IT team.
• Scalability & Performance – As your organization grows and evolves, use of the data lake will expand. Early technology decisions should favor choices that can scale without needing replacement.
These three considerations translate into several key design elements for the data lake:
• Data Access & Retrieval – Cloud providers make available a multitude
38 | THE DOPPLER | SUMMER 2016