The Doppler Quarterly Winter 2019

platform providers Alibaba, Huawei, Oracle and Tencent are all offering Kubernetes as a Service offerings. More than 50 vendors are shipping Kubernetes distributions certified by the Cloud Native Computing Foundation (CNCF).

But the story goes beyond all that. It ties to the ways data scientists are dealing with data itself. In the past, the data science discussion focused on which big data architecture you were running. The way you managed data depended on where the data was and how you worked with big data platforms like Hadoop or Spark. Now, with the help of containers, data science is becoming less reliant on the state of the underlying data. As long as users can get to the data efficiently, it does not matter where it is. This gives data scientists more freedom to build models, blend resources and analyze data.

Consequently, data scientists are looking for new ways to leverage their data. Previously, most of it was stored in data lakes. Now, users are looking at ways to create hybrid data lakes, where some data is stored in on-premises Hadoop clusters, and other data sets are stored in the cloud. Using containers to store data models and Kubernetes to manage their delivery, data scientists have more flexibility to process and analyze data.

Containers offer other benefits as well. They achieve isolation with less overhead than either virtual machines (VMs) or physical servers, enabling four to six times as many server application instances as traditional VMs on the same size hardware. Plus, once IT has an image of a container, data scientists across the organization can use that image to create new environments as needed.
The IT team managing the work of hundreds of data scientists can use containers to ease the development of data science environments and models covering a wide range of tools and languages.

Using Kubernetes for Data Projects

Big data relies on a number of projects and services to get where it needs to go. YARN handles scheduling, and ZooKeeper enables consistency and discovery. These programs work well in on-premises environments, but they have not advanced at the pace of the technologies in cloud environments. In contrast, scheduling, consistency, service discovery and infrastructure management features in Kubernetes were all designed as part of the core platform from day one. Kubernetes offers plug-ins for each function and supports other ways of scheduling, giving data scientists a much wider variety of options.
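
To make the scheduling point concrete, here is a minimal sketch of how a containerized data science task might be handed to Kubernetes as a batch Job. It is an illustration under stated assumptions, not a recommended setup: the job name, the image path, the command and the "custom-scheduler" name are all hypothetical. The sketch only builds the manifest as plain data and prints it as JSON, which kubectl accepts alongside YAML.

```python
import json


def build_job_manifest(name, image, command, scheduler_name=None):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    All argument values are supplied by the caller; nothing here
    contacts a cluster.
    """
    pod_spec = {
        # Batch-style work should not be restarted in place forever.
        "restartPolicy": "Never",
        "containers": [
            {"name": name, "image": image, "command": list(command)}
        ],
    }
    # schedulerName is optional. When it is omitted, Kubernetes uses
    # its default scheduler; setting it hands the pod to another one.
    if scheduler_name is not None:
        pod_spec["schedulerName"] = scheduler_name
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {"backoffLimit": 2, "template": {"spec": pod_spec}},
    }


# Hypothetical values, for illustration only.
manifest = build_job_manifest(
    name="train-model",
    image="registry.example.com/data-science/train:1.0",
    command=["python", "train.py"],
    scheduler_name="custom-scheduler",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f against it would submit the Job, assuming a cluster and the named image actually exist. Because the same image reference can be reused in any number of manifests, this also shows how one container image can back many environments.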
