fectly suitable to create multiple copies of the same data set with different underlying storage structures (partitions, folders) and file formats (e.g. ORC vs. Parquet).

Design Security

Like every cloud-based deployment, security for an enterprise data lake is a critical priority, and one that must be designed in from the beginning. Further, it can only be successful if the security for the data lake is deployed and managed within the framework of the enterprise’s overall security infrastructure and controls. Broadly, there are three primary domains of security relevant to a data lake deployment:
• Encryption
• Network Level Security
• Access Control

Encryption - Virtually every enterprise-level organization requires encryption for stored data, if not universally, then at least for most classifications of data other than that which is publicly available. All leading cloud providers support encryption on their primary object store technologies (such as AWS S3), either by default or as an option. Likewise, the technologies used for other storage layers, such as derivative data stores for consumption, typically offer encryption as well.

Encryption key management is also an important consideration, with requirements typically dictated by the enterprise’s overall security controls. Options include keys created and managed by the cloud provider, customer-generated keys managed by the cloud provider, and keys fully created and managed by the customer on-premises.

The final related consideration is encryption in transit. This covers data moving over the network between devices and services. In most situations, this is easily configured with either built-in options for each service, or by using standard TLS/SSL with associated certificates.

Network Level Security - Another important layer of security resides at the network level. Cloud-native constructs such as security groups, as well as traditional methods including network ACLs and CIDR block restrictions, all play a part in implementing a robust “defense-in-depth” strategy, by walling off large swaths of inappropriate access paths at the network level. This implementation should also be consistent with an enterprise’s overall security framework.

Access Control - This focuses on Authentication (who are you?) and Authorization (what are you allowed to do?). Virtually every enterprise will have standard authentication and user directory technologies already in place; Active Directory, for example. And every leading cloud provider supports methods for mapping the corporate identity infrastructure onto the permissions infrastructure of the cloud provider’s resources and services. While the plumbing involved can be complex, the roles associated with the access management infrastructure of the cloud provider (such as IAM on AWS) are assumable by authenticated users, enabling fine-grained permissions control over authorized operations. The same is usually true for third-party products that run in the cloud, such as reporting and BI tools. LDAP and/or Active Directory are typically supported for authentication, and the tools’ internal authorization and roles can be correlated with and driven by the authenticated users’ identities.

Establish Governance

Typically, data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise. It relies on both business policies and technical practices. As with the other aspects of a cloud deployment described here, data governance for an enterprise data lake needs to be driven by, and consistent with, overarching practices and policies for the organization at large.

In traditional data warehouse infrastructures, control over database contents is typically aligned with the business data, and separated into silos by business unit or system function. However, deriving the benefits of centralizing an organization’s data correspondingly requires a centralized view of data governance.

Even if the enterprise is not fully mature in its data governance practices, it is critically important that at least a minimum set of controls is enforced such that data cannot enter the lake without important metadata (“data about the data”) being defined and cap-
SUMMER 2017 | THE DOPPLER | 15
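As a rough illustration of the layering described under Design Security, the following Python sketch combines a network-level CIDR allow-list with a role-based authorization check, so that a request is permitted only if both layers agree. It is a minimal sketch under stated assumptions, not a rendering of any cloud provider's actual mechanism: the CIDR ranges, role names, zone names, and every function and class name are hypothetical, and authentication is assumed to have already happened upstream (for example, against a corporate directory).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical allow-list of CIDR blocks permitted to reach the data lake.
ALLOWED_NETWORKS = [ip_network("10.0.0.0/16"), ip_network("192.168.10.0/24")]

# Hypothetical mapping from roles to the (action, zone) pairs they may perform.
ROLE_PERMISSIONS = {
    "analyst": {("read", "curated")},
    "engineer": {("read", "raw"), ("write", "raw"), ("read", "curated")},
}


@dataclass(frozen=True)
class Request:
    source_ip: str
    role: str    # role assumed by an already-authenticated user
    action: str
    zone: str


def network_allows(source_ip: str) -> bool:
    """Network layer: accept only addresses inside an allowed CIDR block."""
    addr = ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


def role_allows(role: str, action: str, zone: str) -> bool:
    """Authorization layer: check the operation against the role's permissions."""
    return (action, zone) in ROLE_PERMISSIONS.get(role, set())


def is_permitted(req: Request) -> bool:
    """Defense in depth: failing either layer denies the request."""
    return network_allows(req.source_ip) and role_allows(req.role, req.action, req.zone)
```

In this sketch an "engineer" request arriving from an address outside the allow-list is rejected even though the role itself would permit the operation, which is the point of walling off access paths at the network level before authorization is ever consulted.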