This article focuses on the business value of a big data
warehouse built on Apache Hive, and points to the architecture,
design, and implementation best practices needed to build
such a system.
Big Data Warehousing
Data Warehousing is Dead? Or Long Live Data Warehousing?
Every large organization has an enormous amount of
historical data tied up in relational databases in the
form of data warehouses and data marts. These data
warehouses are the workhorses behind business
intelligence reporting and analytics. Even when an
organization is embarking on a big data journey in the
cloud, recreating the entire data warehouse in the cloud
may not be advisable. Instead, most organizations carve
out use case-specific slices of the legacy data warehouse
wherever warehouse-like denormalized structures are
required, and then enable those slices in the cloud
through additional technologies. If your organization has
staff who are well trained in SQL (and it probably does!),
you will want to consider Apache Hive.
Apache Hive to the Rescue
Apache Hive, initially developed by Facebook, is a
popular big data warehouse solution. It provides a
SQL interface to query data stored in the Hadoop
Distributed File System (HDFS) or in Amazon S3, the AWS
object storage service, which is accessed through an
HDFS-like abstraction layer called EMRFS (Elastic
MapReduce File System).
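
To make the SQL interface described above more concrete, here is a minimal Python sketch that assembles a HiveQL `CREATE EXTERNAL TABLE` statement pointing at delimited files in S3. The table name `sales_csv`, its columns, and the bucket path `s3://example-bucket-1/sales/` are hypothetical and are not taken from the article; the sketch only builds the statement text and does not connect to Hive.

```python
def build_external_table_ddl(table, columns, s3_location, delimiter=","):
    """Return a HiveQL statement mapping delimited files under s3_location to a table.

    `columns` is a list of (name, hive_type) pairs. All names used below are
    illustrative assumptions, not values from the article.
    """
    column_list = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_list}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table over CSV files stored in an S3 bucket.
ddl = build_external_table_ddl(
    "sales_csv",
    [("order_id", "BIGINT"), ("region", "STRING"), ("amount", "DOUBLE")],
    "s3://example-bucket-1/sales/",
)

# Once the table exists, analysts can query it with ordinary SQL.
query = "SELECT region, SUM(amount) AS total FROM sales_csv GROUP BY region"

print(ddl)
print(query)
```

Because the table is declared `EXTERNAL`, dropping it in Hive removes only the table definition and leaves the underlying files in place, which suits data that other tools also read.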
Apache Hive on EMR Clusters
Amazon Elastic MapReduce (EMR) provides a cluster-based
managed Hadoop framework that makes it easy, fast, and
cost-effective to process vast amounts of data across
dynamically scalable Amazon EC2 instances. Apache Hive
runs on Amazon EMR clusters and interacts with data stored
in Amazon S3. A typical EMR cluster will have a master node,
one or more core nodes, and optional task nodes, with a set
of software solutions capable of distributed parallel
processing of data at scale.
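
The cluster layout described above (one master node, one or more core nodes, optional task nodes) can be sketched as a request dictionary in the shape accepted by the boto3 EMR client's `run_job_flow` call. This is only an illustration under stated assumptions: the cluster name, the `m4.large` instance type, the `emr-5.8.0` release label, and the default role names are placeholders, and the code builds the request without contacting AWS.

```python
def build_emr_cluster_request(name, core_count=2, task_count=0,
                              instance_type="m4.large",
                              release_label="emr-5.8.0"):
    """Build keyword arguments for an EMR cluster with Hive installed.

    Instance type, release label, and role names are assumed defaults
    for illustration, not values taken from the article.
    """
    if core_count < 1:
        raise ValueError("an EMR cluster of this shape needs at least one core node")

    groups = [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    # Task nodes are optional, so the group is added only when requested.
    if task_count > 0:
        groups.append({"Name": "Task", "InstanceRole": "TASK",
                       "InstanceType": instance_type, "InstanceCount": task_count})

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_cluster_request("hive-warehouse", core_count=2, task_count=1)
roles = [group["InstanceRole"] for group in request["Instances"]["InstanceGroups"]]
print(roles)

# With AWS credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to review node counts and instance types before any cluster is launched.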
[Figure 1: Hive on AWS EMR Cluster. Diagram labels: Amazon EMR Cluster (Master Node, four Slave Nodes); Standalone Agent; Amazon RDS; Amazon S3 (Bucket 1: CSV Files, Bucket 2: XML Files, Bucket 3: JSON Files).]
FALL 2017 | THE DOPPLER | 25