platform providers Alibaba, Huawei, Oracle and Tencent all offer Kubernetes-as-a-Service
products. More than 50 vendors ship Kubernetes distributions certified by the Cloud
Native Computing Foundation (CNCF).
But the story goes beyond all that. It ties to the ways data scientists are dealing with
data itself.
In the past, the data science discussion focused on which big data architecture you
were running. The way you managed data depended on where the data was and how you
worked with large-scale data platforms like Hadoop or Spark. Now, with the help of
containers, data science is becoming less reliant on the state of the underlying data.
As long as users can get to the data efficiently, it does not matter where it is. This
gives data scientists more freedom to build models, blend resources and analyze data.
Consequently, data scientists are looking for new ways to leverage their data.
Previously, most of it was stored in data lakes. Now, users are looking at ways to
create hybrid data lakes, where some data is stored in on-premises Hadoop clusters, and
other data sets are stored in the cloud. Using containers to store data models and
Kubernetes to manage their delivery, data scientists have more flexibility to process
and analyze data.
Containers offer other benefits as well. They achieve isolation with less overhead than
either virtual machines (VMs) or physical servers, allowing four to six times as many
server application instances as traditional VMs on hardware of the same size. Plus, once
IT has a container image, data scientists across the organization can use that image to
create new environments as needed. An IT team managing the work of hundreds of data
scientists can use containers to ease the development of data science environments and
models covering a wide range of tools and languages.
Using Kubernetes for Data Projects
Big data relies on a number of projects and services to get where it needs to go. YARN
handles scheduling, and ZooKeeper provides consistency and service discovery. These
programs work well in on-premises environments, but they have not advanced at the pace
of the technologies in cloud environments. In contrast, scheduling, consistency, service
discovery and infrastructure management features in Kubernetes were all designed as part
of the core platform from day one. Kubernetes offers plug-ins for each function and
supports other ways of scheduling, giving data scientists a much wider variety of
options.
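
As one illustration of what "other ways of scheduling" can look like in practice, the
minimal Python sketch below builds a Pod manifest that asks to be placed by a
non-default scheduler through the Pod's `schedulerName` field. The Pod name, image and
scheduler name are hypothetical placeholders, and the sketch assumes a scheduler with
that name is already running in the cluster.

```python
import json


def build_pod_manifest(name, image, scheduler_name=None):
    """Return a Kubernetes Pod manifest as a plain dict.

    If scheduler_name is given, the Pod requests placement by that scheduler
    instead of the cluster's default one.
    """
    spec = {"containers": [{"name": name, "image": image}]}
    if scheduler_name is not None:
        # spec.schedulerName is the Pod field that selects a non-default scheduler.
        spec["schedulerName"] = scheduler_name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical names, for illustration only.
manifest = build_pod_manifest(
    name="model-training",
    image="example.com/data-science/trainer:1.0",
    scheduler_name="batch-scheduler",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since
Kubernetes accepts manifests in JSON as well as YAML. Leaving out `scheduler_name`
produces a Pod that falls back to the default scheduler.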
48 | THE DOPPLER |
WINTER 2019