The Doppler Quarterly Fall 2017

This article focuses on the business value of a big data warehouse using Apache Hive, and provides pointers to the architecture, design and implementation best practices needed to build such a system.

Big Data Warehousing

Data Warehousing is Dead? Or Long Live Data Warehousing?

Every large organization has an enormous amount of historical data tied up in relational databases in the form of data warehouses and data marts. These data warehouses are the workhorses behind business intelligence reporting and analytics. Even when an organization is embarking on a big data journey in the cloud, recreating the entire data warehouse on the cloud may not be advisable. Most organizations are instead creating use case-specific slices of the legacy data warehouse where relational, data warehouse-like denormalized structures are required, and then enabling those slices on the cloud through additional technologies. If your organization has people well trained in SQL (and it probably does!), you will want to consider Apache Hive.

Apache Hive to the Rescue

Apache Hive, initially developed by Facebook, is a popular big data warehouse solution. It provides a SQL interface to query data stored in the Hadoop Distributed File System (HDFS) or in Amazon S3. S3 is reached through EMRFS (Elastic MapReduce File System), an HDFS-like abstraction layer implemented by AWS.

Apache Hive on EMR Clusters

Amazon Elastic MapReduce (EMR) provides a cluster-based managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Apache Hive runs on Amazon EMR clusters and interacts with data stored in Amazon S3. A typical EMR cluster has a master node, one or more core nodes and optional task nodes, along with a set of software solutions capable of distributed parallel processing of data at scale.
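The cluster layout described above (one master node, one or more core nodes, optional task nodes) can be sketched as the request dictionary that the AWS SDK for Python (boto3) accepts in its EMR `run_job_flow` call. This is a minimal sketch rather than a recommended configuration: the cluster name, release label, instance types and node counts are illustrative assumptions, and the snippet only builds the request without contacting AWS.

```python
def build_emr_request(core_count=2, task_count=0):
    """Build a dict shaped like the keyword arguments of boto3's EMR run_job_flow.

    All concrete values (name, release label, instance types) are assumptions
    chosen for illustration only.
    """
    if core_count < 1:
        raise ValueError("this sketch follows the layout above: at least one core node")

    # One master node plus the core nodes described in the text.
    groups = [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "m4.large", "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": "m4.large", "InstanceCount": core_count},
    ]

    # Task nodes are optional, so the group is added only when requested.
    if task_count > 0:
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m4.large", "InstanceCount": task_count}
        )

    return {
        "Name": "hive-warehouse-demo",        # hypothetical cluster name
        "ReleaseLabel": "emr-5.8.0",          # assumed EMR release
        "Applications": [{"Name": "Hive"}],   # install Hive on the cluster
        "Instances": {
            "InstanceGroups": groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(core_count=2, task_count=3)
for group in request["Instances"]["InstanceGroups"]:
    print(group["InstanceRole"], group["InstanceCount"])
```

Passing the result to `boto3.client("emr").run_job_flow(**request)` would additionally require AWS credentials and the default EMR roles to exist in the account.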
Figure 1: Hive on AWS EMR Cluster. [Diagram: an Amazon EMR cluster with a master node and four slave nodes, a standalone agent, Amazon RDS, and Amazon S3 holding Bucket 1 (CSV files), Bucket 2 (XML files) and Bucket 3 (JSON files).]
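Hive's SQL interface over S3 data, such as the CSV files in Bucket 1 of Figure 1, usually starts with an external table whose `LOCATION` points at the bucket. The following is a small sketch that generates such a HiveQL statement. The helper name, table name, column list and bucket path are all hypothetical, and the generated statement would still need to be run in Hive on the cluster.

```python
def external_table_ddl(table, columns, s3_location, delimiter=","):
    """Return a HiveQL CREATE EXTERNAL TABLE statement for delimited text files.

    `columns` is a list of (name, hive_type) pairs. Every value passed in the
    example below is an illustrative assumption, not taken from the article.
    """
    # One backtick-quoted column definition per line.
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"{cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


ddl = external_table_ddl(
    "sales_csv",                                   # hypothetical table name
    [("order_id", "BIGINT"), ("region", "STRING"), ("amount", "DOUBLE")],
    "s3://example-bucket-1/sales/",                # hypothetical bucket path
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the table metadata and leaves the files in S3 untouched, which suits data that other tools also read.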
