Figure 1: Data Lake Storage Layers
and data on the cloud, while business takes
responsibility for exploring and mining it.

Design Physical Storage

The foundation of any data lake design and
implementation is physical storage. The core storage layer
is used for the primary data assets. Typically it will
contain raw and/or lightly processed data. The key
considerations when evaluating technologies for
cloud-based data lake storage are the following
principles and requirements:

Exceptional scalability - Because an enterprise data
lake is usually intended to be the centralized data
store for an entire division or the company at large, it
must be capable of significant scaling without
running into fixed arbitrary capacity limits.

High durability - As a primary repository of critical
enterprise data, a very high durability of the core
storage layer allows for excellent data robustness without
resorting to extreme high-availability designs.

Support for unstructured, semi-structured and
structured data - One of the primary design
considerations of a data lake is the capability to store data of
all types in a single repository.

Independence from fixed schema - The ability to
apply schema upon read, as needed for each
consumption purpose, can only be accomplished if the
underlying core storage layer does not dictate a fixed schema.

Separation from compute resources - The most
significant philosophical and practical advantage of
cloud-based data lakes as compared to “legacy” big
data storage on Hadoop is the ability to decouple
storage from compute, enabling independent scaling
of each.

Given the requirements, object-based stores have
become the de facto choice for core data lake storage.
AWS, Google and Azure all offer object storage
technologies.
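
To make the schema-on-read requirement concrete, here is a minimal Python sketch. It assumes a plain dictionary as a stand-in for an object store; the keys, field names and the read_with_schema helper are all hypothetical and do not correspond to any vendor API. The point it illustrates is that the stored bytes carry no schema, and each consumer supplies its own at read time.

```python
import json

# Hypothetical stand-in for an object store: keys map to raw bytes,
# and nothing about the store dictates a schema.
raw_store = {
    "raw/events/0001.json": b'{"user": "a17", "amount": "19.99", "ts": "2017-06-01T10:00:00"}',
    "raw/events/0002.json": b'{"user": "b42", "amount": "5.00", "ts": "2017-06-02T11:30:00", "coupon": "SUMMER"}',
}


def read_with_schema(store, prefix, schema):
    """Apply a schema (field name -> converter) at read time.

    The stored bytes are never modified; fields missing from a record
    come back as None.
    """
    rows = []
    for key in sorted(store):
        if key.startswith(prefix):
            record = json.loads(store[key])
            rows.append({
                field: convert(record[field]) if field in record else None
                for field, convert in schema.items()
            })
    return rows


# Two consumers read the same objects with different schemas.
billing_schema = {"user": str, "amount": float}
marketing_schema = {"user": str, "coupon": str}

billing_rows = read_with_schema(raw_store, "raw/events/", billing_schema)
marketing_rows = read_with_schema(raw_store, "raw/events/", marketing_schema)
```

Because neither schema is stored with the data, a third consumer could later read the same objects with yet another set of fields without any migration of the raw files.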
The point of the core storage is to centralize data of
all types, with little to no schema structure imposed
upon it. However, a data lake will typically have addi-
tional “layers” on top of the core storage. This allows
the retention of the raw data as essentially immutable,
while the additional layers will usually have some
structure added to them in order to assist in effective
data consumption such as reporting and analysis.
Figure 1 represents additional layers being added on
top of the raw storage layer.
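
The layering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference design: the raw layer is modeled as a read-only mapping, and the key prefixes, field names and the build_reporting_layer helper are hypothetical. The structured layer is derived as new objects, and the raw objects are left untouched.

```python
import json
from types import MappingProxyType

# Hypothetical raw layer: a read-only view standing in for immutable objects.
raw_layer = MappingProxyType({
    "raw/orders/a.json": '{"order_id": 1, "country": "US", "total": "12.50"}',
    "raw/orders/b.json": '{"order_id": 2, "country": "DE", "total": "8.00"}',
    "raw/orders/c.json": '{"order_id": 3, "country": "US", "total": "3.25"}',
})


def build_reporting_layer(raw):
    """Derive a structured layer (sales total per country) from the raw layer.

    Only new objects are produced; the raw layer is read, never written.
    """
    totals = {}
    for key in sorted(raw):
        order = json.loads(raw[key])
        country = order["country"]
        totals[country] = totals.get(country, 0.0) + float(order["total"])
    return {
        f"reporting/sales_by_country/{country}.json": json.dumps(
            {"country": country, "total": total}
        )
        for country, total in totals.items()
    }


reporting_layer = build_reporting_layer(raw_layer)
```

If the reporting logic changes, the derived layer can simply be rebuilt from the raw objects, which is the practical benefit of keeping the raw layer immutable.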
A specific example of this would be the addition of a layer
defined by a Hive metastore. In a layer such as this, the
files in the object store are partitioned into “directo-
ries” and files clustered by Hive are arranged within to
SUMMER 2017 | THE DOPPLER | 13