The Doppler Quarterly Fall 2017 | Page 30

Apache Ranger Centralized Security and Audit Framework

Apache Ranger offers a centralized security framework to manage fine-grained access control over Hadoop, Hive and related components such as HBase. Using the Apache Ranger administration console, users can manage policies that control access to a Hive database, table or column for a particular set of users and/or groups. For deeper control of the environment, Apache Ranger also provides audit tracking and policy analytics.

In Apache Ranger, policy control consists of two major parts:

• Specification of the resources to which the policy applies (such as Hive databases, tables and columns)
• Specification of the conditions, such as users/groups, access types and custom conditions, under which access should be allowed

Figure 4: Apache Ranger Architecture (components shown: Enterprise Users, Security Admins, Enterprise Directory Services, Ranger User Sync Server, Policy Admin Server, Policy DB, Audit Store, and plugins for HDFS, Hive, HBase, Knox and Storm)

Hive Performance Optimization

We have already discussed three important elements of an Apache Hive implementation that need to be considered carefully to get optimal performance from Apache Hive:

• Make sure Tez is installed on the EMR cluster and used as the Hive execution engine
• Partition the data to avoid table scans
• Use ORC as the underlying storage file format

In this section, we want to discuss Tez in more detail, and mention three more performance levers that can significantly improve query performance in Hive: Vectorized Query Execution, the Cost-Based Optimizer and Live Long and Process (LLAP).

Apache Hive on Tez

In Urdu the word ‘Tez’ means fast, swift, intelligent. Apache Tez has become the new paradigm for Hive execution by enabling sub-second query performance that was not possible in the ‘MapReduce’ world. MapReduce is still supported for Hive execution, but Tez is now the default engine when running Hive jobs in Hadoop.

As mentioned before, Tez reduces disk IO by avoiding expensive shuffles and sorts, while leveraging more efficient map-side joins. For a typical execution pattern, data flows from node to node of an execution graph (like Apache Spark, Tez represents the computation as a directed acyclic graph); a reducer’s intermediate data is passed to the next reducer without any disk writes. Consequently, Apache Tez benefits from more memory (heap size of HiveServer) and from tuning of memory parameters.
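The performance levers discussed in this section can be sketched as a small Hive session script. The following Python snippet is only an illustration under stated assumptions: the property names (hive.execution.engine, hive.vectorized.execution.enabled, hive.cbo.enable) are standard Hive configuration properties, but the helper functions, the page_views table and its columns are hypothetical and not taken from any particular deployment. LLAP is left out because it also depends on running LLAP daemons on the cluster.

```python
from typing import Dict, List

# Standard Hive configuration properties; the values are illustrative
# and should be validated against the Hive version on the cluster.
PERFORMANCE_SETTINGS: Dict[str, str] = {
    "hive.execution.engine": "tez",               # run queries on Tez instead of MapReduce
    "hive.vectorized.execution.enabled": "true",  # Vectorized Query Execution
    "hive.cbo.enable": "true",                    # Cost-Based Optimizer
}


def session_settings(settings: Dict[str, str]) -> List[str]:
    """Render each property as a HiveQL SET statement (hypothetical helper)."""
    return [f"SET {key}={value};" for key, value in settings.items()]


def create_partitioned_orc_table(
    table: str, columns: Dict[str, str], partitions: Dict[str, str]
) -> str:
    """Build a CREATE TABLE statement for a partitioned, ORC-backed table."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partitions.items())
    return f"CREATE TABLE {table} ({cols}) PARTITIONED BY ({parts}) STORED AS ORC;"


if __name__ == "__main__":
    # Hypothetical table: page views partitioned by date so that queries
    # filtering on view_date read only the matching partitions.
    script = session_settings(PERFORMANCE_SETTINGS) + [
        create_partitioned_orc_table(
            "page_views",
            {"user_id": "BIGINT", "url": "STRING"},
            {"view_date": "STRING"},
        )
    ]
    print("\n".join(script))
```

The generated statements could be pasted into a Hive session or saved to a script file. Generating them from one dictionary keeps the settings in a single place, which makes it easier to compare query times with individual levers switched on or off.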