
2014 IEEE 6th International Conference on Cloud Computing Technology and Science

Simulating Hive Cluster for Deployment Planning, Evaluation and Optimization

Kebing Wang, Zhaojuan Bian, Qian Chen, Ren Wang*, Gen Xu
Software and Service Group, Intel Corporation, Shanghai, China

*Intel Labs, Intel Corporation, Portland, Oregon, US

{Kebing.Wang, Bianny.Bian, Charles.Chen, Ren.Wang, Gen.Xu}@intel.com

Abstract—In the era of big data, Hive has quickly gained popularity for its superior capability to manage and analyze very large datasets, both structured and unstructured, residing in distributed storage systems. However, great opportunity comes with great challenges: Hive query performance is impacted by many factors, which makes capacity planning and tuning for a Hive cluster extremely difficult. These factors include the system software stack (Hive, the MapReduce framework, JVM and OS), the cluster hardware configuration (processor, memory, storage, and network), and the Hive data models and distributions. Current planning methods are mostly trial-and-error or based on very high-level estimation. These approaches are far from efficient and accurate, especially with increasing software stack complexity, hardware diversity, and the unavoidable data skew in distributed database systems.

In this paper, we propose a Hive simulation framework based on CSMethod, which simulates the whole Hive query execution life cycle, including query plan generation and MapReduce task execution. The framework is validated using typical query operations with varying hardware, software and workload parameters, showing high accuracy and fast simulation speed. We also demonstrate the application of this framework with two real-world use cases: helping customers perform capacity planning and estimating business query response times before system provisioning.

Keywords—Hive query simulation; cluster simulation; performance modeling; big data; data center capacity planning

I. Introduction

Hive [1] is an open-source data warehouse solution, built on top of Hadoop [2], to manage and analyze large structured and unstructured datasets. Hive has recently become a widely popular SQL interface for batch processing and ETL (Extract, Transform, Load), and is used by many cloud companies such as Yahoo and Facebook.

Hive supports a SQL-like declarative language, HiveQL, which is compiled into MapReduce jobs executed on Hadoop. The most important components of the Hive architecture are the driver, the MetaStore and the Execution Engine. The Hive driver receives queries, parses them into query blocks and expressions, and optimizes them to generate a MapReduce plan. The MetaStore stores the metadata of tables and partitions in a relational database, such as MySQL [3] or Derby [4]. When the driver generates and optimizes an execution plan, it refers to the metadata stored in the MetaStore. The Execution Engine executes the plan generated by the compiler and produces the result.

The query performance of a data warehouse on a specific cluster hardware configuration is decided by two main factors: data flow and data content. Data flow describes how the data is processed, while data content is the specific data to be processed. In Hive, a query is translated into a MapReduce execution plan, which is a MapReduce Directed Acyclic Graph (DAG). A MapReduce DAG consists of a set of MapReduce jobs as nodes, with directed edges between jobs indicating the data flow. Data content is described through metadata, including the table schema, table size, partition information, and column statistics.

The specific data content has a significant impact on the MapReduce job plan (data flow), since the metadata is consulted when the Hive driver generates and optimizes the execution plan. Moreover, if data skew [5] exists in the data content, computing skew will potentially be introduced into the MapReduce jobs. This often results in an unbalanced overhead distribution among map/reduce tasks, which leads to severe performance degradation. Therefore, the complex interaction between data flow and data content in Hive needs to be taken into consideration for simulation and modeling.

Currently, Hive cluster design and deployment decisions are based on performance prediction of the queries [6, 7]. A good performance prediction is essential for decision making and should consider both hardware diversity and software stack complexity, especially the interactive nature between data flow and data content. Current experience-based performance prediction methods fail to meet these criteria. Simulation-based modeling for cluster analysis is in general more reliable for this purpose.

Among many proposed simulation methods [8, 9, 10, 11], CSMethod [12] is a fast and accurate cluster simulator for traditional MapReduce jobs, which employs a layered and configurable architecture to simulate big data clusters on standard client computers (desktops or laptops). With CSMethod, the computing and communication behavior of the software stack is abstracted and simulated at the functional level.



Software functions are then dynamically mapped onto hardware components. The timing of hardware components (storage, network, memory and CPU) is modeled according to the payload and activities as perceived by the software. A low-overhead discrete-event simulation engine enables fast simulation speed and good scalability. In short, CSMethod accepts MapReduce jobs and the software stack and hardware component configuration of a big data cluster, then simulates the execution of the MapReduce jobs on the cluster and delivers the results. A high-level CSMethod overview is provided in Section 2.
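The following minimal Python sketch illustrates the kind of low-overhead discrete-event loop this refers to. It is a generic illustration, not CSMethod's implementation; the class and method names are ours, and the timing numbers are arbitrary.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Event:
    time: float
    seq: int
    action: Callable = field(compare=False)

class DiscreteEventEngine:
    """Minimal discrete-event loop: events are processed in timestamp order."""
    def __init__(self):
        self._queue = []
        self._seq = 0
        self.now = 0.0

    def schedule(self, delay, action):
        # Schedule 'action' to run 'delay' time units from the current time.
        heapq.heappush(self._queue, Event(self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        while self._queue:
            event = heapq.heappop(self._queue)
            self.now = event.time
            event.action(self)

# Toy usage: a map task reads 128 MB from a 100 MB/s disk, then computes for 2 s.
engine = DiscreteEventEngine()

def finish(eng):
    print(f"map task finished at t={eng.now:.2f}s")

def start_map_task(eng):
    read_time = 128 / 100.0
    eng.schedule(read_time, lambda e: e.schedule(2.0, finish))

engine.schedule(0.0, start_map_task)
engine.run()
```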

However, CSMethod cannot be directly used for Hive query performance simulation. One major reason is that CSMethod is designed for traditional unstructured MapReduce workloads and does not reflect the frequent interaction between data flow and data content, which is a unique feature of Hive. To fill this gap, this paper proposes a framework based on CSMethod to simulate the Hive query execution process for fast and accurate performance prediction. The proposed simulation framework accepts as inputs the HiveQL, the metadata of tables/partitions, and the software stack and hardware component configurations of the target cluster. The output of the framework is a simulation report, including the response time of the Hive query and a detailed bottleneck analysis on the target cluster.

We employ CSMethod as the MapReduce execution engine, and extend it with a frontend translation engine that addresses the unique interactive nature between data flow and data content in Hive. In our framework, CSMethod directly receives the software stack and hardware component configurations of the target cluster from user input; MapReduce jobs, however, are no longer taken directly from user input. They are generated internally by our translation engine based on the HiveQL and metadata supplied by the user. As shown by extensive experiments, our Hive simulation framework is able to efficiently and accurately simulate the interactions between data flow and data content.

Moreover, the previous CSMethod does not support data skew, and assumes all map tasks and reduce tasks have the same input size. However, in the real world, the key distributions of tables are commonly unbalanced, which leads to data skew. In our simulation framework, we improve CSMethod to also support data skew in the map and reduce phases. We make the following key contributions in this paper:

• We present a Hive simulation framework that can handle both structured and unstructured data with frequent interaction. By introducing a frontend translation engine on top of CSMethod, the data flow (MapReduce job plan) is generated internally by the framework. The process accurately reflects the interaction between data flow and data content.

• We enhance CSMethod to accurately simulate the impact of data skew, which is predominant in real-world workloads. Our simulation framework is not intended to solve the data skew problem, but to help understand the performance impact of data skew via detailed simulation.

• We implement and validate the Hive simulation framework with extensive micro-benchmarks, showing high accuracy (average error rate of ~6%) with efficient simulation speed. Moreover, we use real-world use cases to showcase the capability of the framework to influence decision making on system upgrades.

The rest of the paper is organized as follows. Section 2 presents the proposed Hive simulation framework in detail, along with an overview of CSMethod. Section 3 elaborates on the experimental setup for model evaluation and illustrates simulation accuracy for the aggregation and join queries of HiBench [13]. In Section 4, two real-world use cases are used as examples to show the framework's ability to help customers make system upgrade decisions and estimate business query response times before actually enabling the queries on real hardware clusters. Section 5 reviews the related work. Lastly, the summary and future work are presented in Section 6.

II. Hive Simulation Framework

In this section, we introduce the proposed Hive simulation framework in detail.

A. Overview of the framework

Fig. 1 illustrates the simulation framework architecture. As shown in Fig.1, the inputs of the Hive simulation framework include: HiveQL, metadata of tables/partitions, a software stack description, and the hardware component configurations of the target cluster. The output of the framework is a simulation report, including the response time of the Hive query and a detailed bottleneck analysis. As the execution engine of the Hive simulation framework, CSMethod executes the MapReduce job simulation and obtains the response time of the query. However, due to the interactive nature of Hive, CSMethod cannot take user input directly. The Hive simulation framework therefore provides a frontend translation engine that is tailored to address the frequent interaction between data flow and data content. In other words, in our proposed framework, CSMethod only directly receives the software stack and hardware component configurations of the target cluster from user input, while MapReduce job plans are not taken from user input but are generated internally and automatically by our translation engine, based on the HiveQL and metadata provided by the user.

Fig.1 Illustration of Hive simulation framework

The basic idea of our translation engine is to use the explain command of Hive to translate HiveQL into a MapReduce job plan. Specifically, the explain command faithfully translates queries into a MapReduce job plan without executing it. This approach guarantees that the simulated MapReduce job plan is identical to the real plan to be executed. In general, when the Hive driver translates and optimizes the MapReduce job plan, it refers to the metadata of tables/partitions stored in MetaStore, and automatically updates the metadata through the MetaStore API when operating on tables and records, saving Hive users from having to directly operate on MetaStore.
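A minimal sketch of this idea is shown below, assuming a local Hive CLI is available on the PATH. Prefixing a query with EXPLAIN makes Hive produce the execution plan without running the query; the helper name and the sample column names (following the UserVisits schema of [14]) are illustrative only.

```python
import subprocess

def explain_plan(hiveql: str, hive_cmd: str = "hive") -> str:
    """Return Hive's textual execution plan for a query without executing it.

    Assumes a working `hive` CLI; EXPLAIN itself is standard HiveQL.
    """
    result = subprocess.run(
        [hive_cmd, "-e", f"EXPLAIN {hiveql}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    query = (
        "SELECT sourceIP, SUM(adRevenue) "
        "FROM uservisits GROUP BY sourceIP"
    )
    plan_text = explain_plan(query)
    print(plan_text)  # stage dependencies and map/reduce operator trees
```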


In our framework, since only metadata is needed to obtain the right execution plan, we use scripts to directly write the metadata of tables and partitions into MetaStore, bypassing table input data generation and reducing the overall simulation overhead.
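The paper's scripts write the metadata directly into MetaStore. A lighter-weight alternative, sketched below under the assumption that ordinary DDL is sufficient, is to create empty tables and inject size statistics as table properties so the optimizer sees a "large" table although no records were generated. The property names (numRows, rawDataSize) are standard Hive statistics parameters, but whether the optimizer fully trusts manually set values can depend on the Hive version.

```python
import subprocess

def run_hive(statements: str, hive_cmd: str = "hive") -> None:
    subprocess.run([hive_cmd, "-e", statements], check=True)

def register_table_metadata(table: str, columns: str, num_rows: int,
                            raw_data_size: int) -> None:
    """Register a table and its size statistics without loading any data.

    Sketch only: statistics are written via TBLPROPERTIES; a production
    script could instead populate the MetaStore database directly.
    """
    ddl = f"""
      CREATE TABLE IF NOT EXISTS {table} ({columns});
      ALTER TABLE {table} SET TBLPROPERTIES
        ('numRows'='{num_rows}', 'rawDataSize'='{raw_data_size}');
    """
    run_hive(ddl)

# Example: describe a 2-billion-row UserVisit table without generating it.
register_table_metadata(
    table="uservisits",
    columns="sourceIP string, destURL string, adRevenue double",
    num_rows=2_000_000_000,
    raw_data_size=300 * 10**9,  # assumed on-disk size in bytes
)
```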

After the MapReduce job plan is produced by the aforementioned translation engine, an automatic script translates the plan into the CSMethod-compliant format. CSMethod is then activated to execute the MapReduce job simulation and obtain the response time of the Hive query.

It is important to point out that the original CSMethod does not support data skew, and assumes that all map tasks and reduce tasks have the same input size. However, as mentioned before, in the real world the key distributions of tables are commonly unbalanced, which leads to data skew where a small portion of nodes must handle a large bulk of the computation. This can reduce efficiency significantly. In our Hive simulation framework, we support data skew in the map and reduce phases by extending CSMethod to support various input sizes among map and reduce tasks.

Next we review the main components of the simulation framework.

B. User input

The user needs to provide a HiveQL statement, metadata for tables/partitions, and the parameters of the cluster software and hardware stacks as inputs to the simulation framework.

1) Any HiveQL statement is a valid input to the framework.

2) Metadata includes the table schema, partition information, and certain column statistics, including: the count of null values, count of true/false values, maximum value, minimum value, estimated number of distinct values, average column length, maximum column length, and height-balanced histograms.

3) The user needs to input performance-sensitive S/W stack parameters, covering the Hive/Hadoop/JVM/OS layers. Hive parameters are mainly used in the phase of translating the query into an execution plan; our framework supports all Hive parameters supported by the explain command. The Hadoop/JVM/OS parameters are listed in Table I; they are used in the MapReduce job simulation phase by the CSMethod execution engine, and their performance impact is reflected in the CSMethod simulation.

4) The hardware view of a cluster is modeled by defining the cluster topology and hardware components, which include:

• Node counts and network topology/routing
• Network type: 1Gb / 10Gb, NIC/port count
• Storage type and count: SSD or HDD
• Processor type, frequency and thread count
• Memory type and capacity

TABLE I. HADOOP/JVM/OS SIMULATION PARAMETERS

Category: Evaluated Software Parameters
Map Phase: io.sort.mb, io.sort.record.percent, io.sort.spill.percent, io.sort.factor, mapred.min.split.size, mapred.compress.map.output, mapred.map.output.compression.codec
Reduce Phase: mapred.reduce.parallel.copies, mapred.job.reduce.total.mem.bytes, mapred.job.shuffle.input.buffer.percent, mapred.inmem.merge.threshold, mapred.job.shuffle.merge.percent, io.sort.factor, io.bytes.per.checksum, mapred.job.reduce.input.buffer.percent, mapred.output.compression.codec, mapred.output.compress
General MapReduce: mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.heartbeats.in.second, mapred.jobtracker.taskScheduler, tasktracker.http.threads
HDFS: dfs.block.size
JVM: -Xms, -Xmx, -XX:MinHeapFreeRatio, -XX:MaxHeapFreeRatio, -XX:NewRatio=ratio, -XX:NewSize=size, -XX:MaxNewSize=size
OS: flush ratio: /proc/sys/vm/dirty_background_ratio; block ratio: /proc/sys/vm/dirty_ratio; flush interval: /proc/sys/vm/dirty_expire_centisecs
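For concreteness, the user-supplied software and hardware descriptions could be captured in a structure like the one below. The field names are hypothetical; the paper does not define the framework's input file format, and the column statistics shown are invented for illustration while the hardware values mirror the baseline cluster of Table II.

```python
# Hypothetical input description for the simulation framework; the keys are
# illustrative, not the framework's actual file format.
cluster_input = {
    "hiveql": "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP",
    "metadata": {
        "uservisits": {
            "num_rows": 2_000_000_000,
            "columns": {
                "sourceIP": {"distinct": 1_500_000, "avg_len": 12},   # assumed
                "adRevenue": {"min": 0.0, "max": 100.0},              # assumed
            },
        },
    },
    "software": {                      # subset of the Table I parameters
        "io.sort.mb": 100,
        "mapred.reduce.parallel.copies": 5,
        "mapred.tasktracker.map.tasks.maximum": 19,
        "dfs.block.size": 128 * 1024 * 1024,
    },
    "hardware": {
        "nodes": 5,
        "network": "1Gb",
        "disks_per_node": {"type": "HDD", "count": 4},
        "processor": {"cores": 16, "freq_ghz": 2.4},
        "memory_gb": 125,
    },
}
```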

C. Translation engine: generating MapReduce job plan

As we mentioned in the overview, HiveQL is translated into a MapReduce job plan and executed on Hadoop. The execution plan has a significant performance impact on HiveQL. For example, a Hive join query can be translated to a MapJoin or a ReduceJoin, and MapJoin has fewer I/O requests and better performance than ReduceJoin. Thus, it is very important for the simulation framework to obtain the appropriate MapReduce job plan, which is decided by the HiveQL, the metadata, the compiler and the Hive driver optimizer.

In our Hive simulation framework, MapReduce job plan generation can be described in three major steps: first, the metadata of tables/partitions from user input is written directly into MetaStore; then, based on the metadata, the Hive explain command is called to generate the MapReduce job plan; lastly, the output of the explain command is translated into the CSMethod-compliant format for simulation execution. During the last step, data skew information is also taken into account if the user input contains key distribution information. All these processes are implemented using Python scripts.
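To make the last step concrete, the sketch below parses the STAGE DEPENDENCIES section of Hive EXPLAIN output into a small DAG structure that a downstream script could convert into the simulator's job description. The sample text is synthetic, the exact layout of EXPLAIN output varies slightly across Hive versions, and the CSMethod input format itself is not shown in the paper.

```python
import re

def parse_stage_dependencies(explain_text: str) -> dict:
    """Extract the stage DAG from Hive EXPLAIN output.

    Returns {stage_name: [stages it depends on]}.  Illustrative parser only.
    """
    deps = {}
    for line in explain_text.splitlines():
        line = line.strip()
        m = re.match(r"(Stage-\d+) is a root stage", line)
        if m:
            deps[m.group(1)] = []
            continue
        m = re.match(r"(Stage-\d+) depends on stages: (.+)", line)
        if m:
            deps[m.group(1)] = [s.strip() for s in m.group(2).split(",")]
    return deps

sample = """
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
"""
print(parse_stage_dependencies(sample))
# {'Stage-1': [], 'Stage-2': ['Stage-1'], 'Stage-0': ['Stage-2']}
```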

It is worth noting that, without having to generate table records, it is very convenient for users to check the performance impact of data content, i.e., different table/partition sizes and record distributions, which only requires metadata input to the framework.

D. MapReduce jobs simulation

As mentioned before, in our Hive simulation framework CSMethod is used to simulate MapReduce job execution. CSMethod employs a top-down approach to model the behavior of the complete software stack and to simulate the activities of cluster components including processors, memory, network and storage. The input of CSMethod includes the S/W stack, the H/W component configuration and a MapReduce job abstraction.

Fig.2 Workload abstraction of CSMethod

As shown in Fig.2, CSMethod defines a MapReduce job from five aspects: input, map output, reduce output, and the relative CPU costs of the map and reduce functions. For the Hive simulation framework, we use column statistics to calculate the input, map output and reduce output. For the relative CPU costs of the map and reduce functions, our framework refers to a pre-defined database that records the CPU costs of various Hive operators such as Create_Table, Filter, Forward, Group_By, Join, Move, Reduce, etc. The returned values are used to calculate the relative CPU costs of the map and reduce functions. If a User Defined Function (UDF) is called, we can profile the CPU cost of the UDF offline first, and then use the profiling result in the simulation framework.
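A minimal sketch of such a job abstraction, and of estimating its sizes from column statistics for a simple GROUP BY aggregation, is shown below. The field names, the operator-cost values and the selectivity arithmetic are illustrative assumptions, not CSMethod's internal format.

```python
from dataclasses import dataclass

# Relative CPU cost per record for a few Hive operators (illustrative values;
# the paper uses a pre-defined database of profiled operator costs).
OPERATOR_CPU_COST = {"TableScan": 1.0, "Filter": 0.3, "Group_By": 2.0, "Join": 2.5}

@dataclass
class MapReduceJobAbstraction:
    input_bytes: int          # total bytes read by map tasks
    map_output_bytes: int     # bytes emitted by the map phase
    reduce_output_bytes: int  # bytes written by the reduce phase
    map_cpu_cost: float       # relative CPU cost of the map function
    reduce_cpu_cost: float    # relative CPU cost of the reduce function

def aggregation_job(num_rows, row_bytes, group_by_distinct, out_row_bytes):
    """Derive the five aspects of a GROUP BY aggregation job from table and
    column statistics (sketch)."""
    input_bytes = num_rows * row_bytes
    # Map side emits one record per input row (key plus partial aggregate).
    map_output_bytes = num_rows * out_row_bytes
    # Reduce side emits one record per distinct group key.
    reduce_output_bytes = group_by_distinct * out_row_bytes
    return MapReduceJobAbstraction(
        input_bytes=input_bytes,
        map_output_bytes=map_output_bytes,
        reduce_output_bytes=reduce_output_bytes,
        map_cpu_cost=OPERATOR_CPU_COST["TableScan"] + OPERATOR_CPU_COST["Group_By"],
        reduce_cpu_cost=OPERATOR_CPU_COST["Group_By"],
    )

job = aggregation_job(num_rows=2_000_000_000, row_bytes=150,
                      group_by_distinct=1_500_000, out_row_bytes=20)
print(job)
```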

To support data skew caused by uneven key distribution, our simulation framework extends CSMethod to take the different input sizes of map and reduce tasks into consideration. Certain map and reduce tasks can require much longer execution time due to their bigger input size, so it is ideal to balance resources accordingly for more efficient execution. In our framework, the input size configuration of map and reduce tasks is calculated from column distribution statistics.
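A sketch of how per-task input sizes could be derived from a key distribution is shown below; the hash-partitioning assumption and the function name are illustrative, not the framework's actual calculation.

```python
def reduce_task_input_sizes(key_fractions, total_bytes, num_reduce_tasks):
    """Distribute reduce input across tasks given the fraction of records
    carried by each key (sketch: keys are assigned to partitions by hash,
    so skewed keys make their partition's input larger)."""
    sizes = [0.0] * num_reduce_tasks
    for key, fraction in key_fractions.items():
        sizes[hash(key) % num_reduce_tasks] += fraction * total_bytes
    return sizes

# Example: one hot key holds 30% of the records, the rest is spread evenly.
hot = {"hot_key": 0.30}
cold = {f"key_{i}": 0.70 / 470 for i in range(470)}
sizes = reduce_task_input_sizes({**hot, **cold},
                                total_bytes=100 * 10**9,
                                num_reduce_tasks=48)
print(max(sizes) / (sum(sizes) / len(sizes)))  # skew factor of the hottest task
```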

III. Experiment Setup and Simulation Results

This section describes our experiment setup and configurations, followed by a presentation and discussion of simulation results using the HiBench [13] suite.

A. Baseline configuration

Table II lists the target cluster hardware and software stack used for our baseline experiments. The configuration is representative of mainstream datacenter configurations.

TABLE II. CLUSTER H/W AND S/W SETTINGS

Cluster: 5 nodes connected by one on-rack switch; 1 Master + 4 Data Nodes / Task Trackers
Processor: Intel® Xeon® E7-2675, 16 cores w/ SMT disabled per node
Disk: Direct Attached Storage, 5 x 750GB HDD per node, 7200 RPM SATA II; 1 drive for OS, 4 drives for Hadoop
Memory: 125GB, 2-channel DDR3-1333 per node
Network: 1Gbit/s Ethernet
OS: SLES 11 SP2 GM
JVM: JVM 1.6.0_31
Hadoop: Intel Hadoop Distribution 2.3

B. Simulated queries

HiBench is a representative and comprehensive Hadoop benchmark suite. HiBench contains Hive performance evaluation benchmarks: an aggregation query and a join query, which were first proposed in [14]. Hive aggregation computes the sum of each group over a single read-only table, UserVisit, while Hive join computes both the average and the sum for each group by joining two different tables, UserVisit and PageRank.

Hive aggregation is relatively simple and always translates into one MapReduce job. We use three input configurations for simulation correlation, with the biggest input size, 2 billion records of UserVisit, as the base configuration. The configurations are shown in Table III. We then perform map slot number scaling and core number scaling on the base configuration to validate the accuracy of our Hive simulation framework.

Hive join has two input tables. When we configure various input sizes for each table as shown in Table III, different execution plans are returned by the explain command. For the Hive join query, when the record number of the small table (called PageRank) is 500K, the resulting execution plan consists of 3 MapReduce jobs and executes a map-side join. On the other hand, when the record number of PageRank reaches 700K, the execution plan still has 3 MapReduce jobs but executes a reduce-side join. Whether MapJoin or ReduceJoin is used is decided by whether the small table can fit in memory: MapJoin is selected if the small table fits in memory, otherwise ReduceJoin is selected.
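In Hive, this decision is typically governed by the map-join small-table threshold. The sketch below probes the chosen plan with EXPLAIN under explicit settings; hive.auto.convert.join and hive.mapjoin.smalltable.filesize are standard Hive parameters, but the query's table and column names are illustrative (following the schema of [14]), and the exact sizes at which HiBench's join flips between MapJoin and ReduceJoin depend on the deployment.

```python
import subprocess

JOIN_QUERY = (
    "SELECT sourceIP, AVG(pageRank), SUM(adRevenue) "
    "FROM pagerank p JOIN uservisits uv ON (p.pageURL = uv.destURL) "
    "GROUP BY sourceIP"
)

def explain_with_settings(query: str, settings: dict, hive_cmd: str = "hive") -> str:
    """Run EXPLAIN with explicit join-conversion settings (illustrative)."""
    prefix = "".join(f"SET {k}={v};\n" for k, v in settings.items())
    result = subprocess.run([hive_cmd, "-e", prefix + "EXPLAIN " + query],
                            capture_output=True, text=True, check=True)
    return result.stdout

plan = explain_with_settings(
    JOIN_QUERY,
    {
        "hive.auto.convert.join": "true",
        # Tables smaller than this size (bytes) are broadcast for a map-side join.
        "hive.mapjoin.smalltable.filesize": 25_000_000,
    },
)
print("Map Join Operator" in plan)  # True when Hive chose a map-side join
```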

We use these four Hive join input configurations for simulation correlation, with 1.6 billion records of UserVisit and 500K records of PageRank as the base configuration. We then perform map slot number scaling and core number scaling on the base configuration to validate the framework accuracy when simulating the Hive join query.


TABLE III. INPUT CONFIGURATION OF HIVE JOIN

Query Index | Record of UserVisit | Record of PageRank | Execution Plan
Hive Aggregation 1 | 500000000 | 0 | Aggregation
Hive Aggregation 2 | 1000000000 | 0 | Aggregation
Hive Aggregation 3 | 2000000000 | 0 | Aggregation
Hive Join 1 | 1000000000 | 500000 | MapJoin
Hive Join 2 | 1000000000 | 700000 | ReduceJoin
Hive Join 3 | 1600000000 | 500000 | MapJoin
Hive Join 4 | 1600000000 | 700000 | ReduceJoin

C. Simulation platform

CSMethod, developed by Intel CoFluent [15], can run on both Windows and Linux platforms. Considering that we need to use the Hive explain command to translate HiveQL into a MapReduce job plan, a Linux OS with Hive installed is a convenient and straightforward choice. However, users can also choose to obtain the execution plan on a Linux platform and then execute the CSMethod simulation on a Windows platform. From a hardware capability point of view, the computing capability of a commodity laptop is good enough to execute the simulation framework, thanks to our low-overhead, efficient design.

D. Simulation results and discussion

In this paper, we present the results using input size, map slot number and core number as exemplary variables to illustrate the high accuracy of our simulation framework in detail. In fact, we have performed validation tests on all H/W and S/W parameters supported by the framework. The average error rate of the hardware parameter validation is 10%, covering node number, disk type/count, NIC bandwidth, and processor type/core number/frequency. The average error rate of the software parameter validation is 6%, covering the Map Phase, Reduce Phase and General MapReduce parameters of Table I. Moreover, we use the framework for node number scaling simulation to study Xeon server scaling efficiency for Hive join.

1) Hive aggregation

For Hive aggregation, our simulation framework shows high accuracy with various input sizes, with an average ~6% discrepancy between simulation and measurement, as shown in Fig.3. The map slot scaling results are shown in Fig.4. In the simulation tests, the base configuration is set to 19 map slots, as proposed by the Active Tuner of the Intel Hadoop Distribution. However, based on real measurement, the best map slot number is 22, which effectively decreases response time by 6%. Our simulation result is consistent with the real measurement, showing that a map slot number of 22 is the best configuration.

Fig.3 Hive aggregation correlation with different input size

Fig.4 Map slot scaling for Hive aggregation

For the core number scaling simulation, we record execution time with 4, 8, and 16 cores enabled, using both our simulation method and real system measurement. As shown in Fig.5, the simulation results align very well with the real measurements, with an average discrepancy of less than 9%.

Fig.5 Core number scaling for Hive aggregation

2) Hive join

The response time simulation of Hive join is shown in Fig.6. As we can see, our framework exactly reflects the different execution plans of Hive join, and exhibits high accuracy with all four input configurations, with an average error of ~3% compared with the real system measurement results. The map slot number scaling results are shown in Fig.7. In this case, the base configuration of 19 is the optimal configuration according to the measurement results. Our simulation, on the other hand, shows the 20-slot configuration to be slightly better (~1%). Although the simulation result is inconsistent with the measurement result in this particular case, the response times of these two configurations are very close, within 4% of each other. Considering that the measured run-to-run variation can be 3%, this simulation result is still acceptable.

Fig.6 Hive join correlation with different input configurations

Fig.7 Map slot scaling for Hive join

Fig.8 shows the core scaling results for Hive join, with 4, 8 and 16 cores enabled, respectively. Again we can see that the simulation results are very accurate, with a ~5% discrepancy rate. The node scaling efficiency study with simulation is shown in Fig.9. The scaling efficiency is almost 100% when the node number is smaller than 32, and begins to drop when the node number is bigger than 32, but remains higher than 90%.

Fig.8 Core number scaling for Hive join

Fig.9 Node number scaling for Hive join

IV. Validation with Real-World Use Cases

After validating the accuracy of our simulation framework with the HiBench suite, in this section we use two real-world use cases to illustrate the capability of the simulation framework. The data for these two use cases were collected from two different public organizations. One runs a monthly activity summary report, while the other collects user account statistics across several branches. In this paper, we only present high-level simulation performance results without disclosing any detailed information regarding the use cases and workloads.

A. Monthly activity summary report

The monthly activity summary report is the result of a Hive aggregation query on a single table, and the cluster consists of 10 nodes with the Intel Hadoop Distribution installed. The organization plans to add more business logic into the monthly report in the future, ideally without suffering a longer response time. For this purpose, the organization plans to upgrade its datacenter servers (more server nodes cannot be added due to space limitations) to decrease the response time of the current aggregation query by 20%, in order to accommodate the more complex future monthly report. A critical question to answer is: how should the organization upgrade the hardware cluster most effectively, in terms of both cost and performance? Which component has the most significant impact on performance: CPU frequency, number of cores, or the disk configuration?

Our simulation framework can be employed to answer this question without having to buy and deploy the real hardware systems. First, we validate our framework accuracy against the current cluster configuration, showing above 93% accuracy. Then, we perform frequency scaling, core number scaling and disk number scaling, respectively. The frequency of the current processor is 2.4GHz, and we increase the frequency to 2.8GHz and 3.2GHz as simulation inputs. Simulation results in Fig.10 show that response time decreases by 13% and 21%, respectively, translating into frequency scaling efficiencies of 99% and 95%.

The core number of the current processor is 16, and we increase it to 20 and 24 for the simulation inputs. Simulation results in Fig.11 show that the response time decreases by 14% and 19%, with core number scaling efficiencies of 93% and 82%, respectively. For the disk configuration, the current system has 5 HDDs. We configure 7 HDDs and 5 SSDs as simulation inputs. The simulation results in Fig.12 show that response time only decreases by 4% and 6%, respectively, because the workload is more CPU intensive than I/O intensive.
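The efficiency figures above are consistent with defining scaling efficiency as the measured speedup divided by the ideal (linear) speedup; the paper does not state the formula explicitly, so the small check below uses that assumed definition against the reported numbers.

```python
def scaling_efficiency(time_reduction, resource_ratio):
    """Measured speedup divided by ideal speedup.

    time_reduction: fractional decrease in response time (e.g. 0.13 for 13%).
    resource_ratio: new/old frequency or core count.
    Assumed definition, used only to sanity-check the reported efficiencies.
    """
    measured_speedup = 1.0 / (1.0 - time_reduction)
    return measured_speedup / resource_ratio

print(round(scaling_efficiency(0.13, 2.8 / 2.4), 2))  # ~0.99 (2.4 -> 2.8 GHz)
print(round(scaling_efficiency(0.21, 3.2 / 2.4), 2))  # ~0.95 (2.4 -> 3.2 GHz)
print(round(scaling_efficiency(0.14, 20 / 16), 2))    # ~0.93 (16 -> 20 cores)
print(round(scaling_efficiency(0.19, 24 / 16), 2))    # ~0.82 (16 -> 24 cores)
```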

From the above simulation results, we can confidently conclude that upgrading the CPU with a higher frequency and more cores has higher priority, while upgrading only the disk subsystem benefits performance insignificantly. More specifically, a 20% response time decrease can be achieved by increasing the CPU frequency from 2.4GHz to 3.2GHz. Increasing the core number from 16 to 24 also decreases the response time significantly, by 19%, which is very close to the organization's target.

We also find that core scaling is not linear, with efficiency reaching only 81% when increasing the core number to 24. This is likely due to a disk access bottleneck. In order to improve core scaling efficiency, we perform core scaling again with 5 SSDs (instead of the slower HDDs). As shown in Fig.13, the scaling efficiency in this case increases to 88%, and the response time decreases by 24% when going from 16 to 24 cores. This set of experiments tells us that when upgrading processors with more cores, it is optimal to also upgrade the disk subsystem at the same time, in order to take full advantage of the processor capability.


Fig.10 CPU frequency scaling of use case 1

Fig.11 Core number scaling of use case 1

Fig.12 Disk scaling of use case 1

Fig.13 Core number scaling with SSD of use case 1

B. User account statistics analysis across different branches

For the second use case, the organization is interested in adding another business process to the user account statistics analysis across several distributed branches. This process is implemented by a Hive join over multiple tables. In order to optimize the decision making process in terms of both software and hardware upgrades, it is very desirable to know the response time of the new query on the current production cluster. Implementing it on the working cluster would be resource consuming and could potentially disturb the business operation. Therefore, our simulation framework can be used to obtain useful information without having to implement and test the logic on physical working clusters. To simulate the target process, we input the relevant HiveQL to the Hive simulation framework, which translates the query into 3 MapReduce jobs. After proper translation, we carry out the query execution in the simulation framework and obtain the response time. We investigated two different configurations. The first one has a balanced key value distribution, with 48 reduce tasks where each task has the same input size. With the second, more realistic configuration, the key value distribution is uneven, with one key value dominating more than 30% of the records. The second configuration thus also has 48 reduce tasks, but the input size of the task containing the hot key is 20 times bigger than that of the other tasks.
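The stated 20x factor is consistent with the stated skew, assuming the remaining records are spread evenly over the other 47 reduce tasks; a quick check:

```python
# Quick check of the "20x bigger" reduce-task input under the stated skew,
# assuming the remaining records are spread evenly over the other 47 tasks.
total_records = 1.0          # work in fractions of the table
hot_fraction = 0.30          # one key value dominates >30% of the records
num_reduce_tasks = 48

cold_fraction_per_task = (total_records - hot_fraction) / (num_reduce_tasks - 1)
print(hot_fraction / cold_fraction_per_task)   # ~20.1
```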

The simulation results are shown in Fig.14. Our simulation framework can effectively estimate the response time of the planned Hive query. Moreover, we also show that data skew can cause significant extra delay (more than 20 seconds) in response time. This information can help the organization make design and upgrade decisions based on their workload characteristics.

Fig.14 Simulation result of use case 2

V. Related Work

In general, there are two main challenges for accurate and efficient data warehouse query simulation: 1) how to accurately simulate the impact of specific data content on the query data flow; 2) how to simulate the data flow behavior and performance on the specific software stack and hardware components of a cluster. For Hive simulation, these two challenges are reflected in SQL-to-MapReduce job translation and MapReduce job execution simulation. In this section, we briefly review the related work in these two areas and compare it to ours.

A. SQL-to-MapReduce translation

Translating SQL to a MapReduce job plan is essential for Hive simulation. YSmart [16] is a correlation-aware SQL-to-MapReduce job translator. YSmart applies a set of rules aiming to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to other existing translators such as Optiq [17] and HadoopDB [18]. However, YSmart fails to take into consideration the impact of table size when translating SQL queries, which can severely degrade simulation accuracy.

Other than optimizing the translator, SQL can also be rewritten in order to obtain a better MapReduce job plan. QMapper [19] aims to optimize the MapReduce plan by rewriting SQL queries, applying a set of query rewriting rules and a cost-based MapReduce flow evaluation based on column statistics. Evaluation shows that QMapper is able to improve performance significantly while assuring correctness. However, it estimates query cost based on simple formulas, which can be unreliable.


B. MapReduce job simulator

There are several existing simulators dedicated to simulating the MapReduce computing paradigm; among them, MRSG [8], MRPerf [9], HaSim [10] and SimMR [11] are representative. However, all these simulators are strictly MapReduce-centric, without the flexibility to extend to other software components of big data software stacks. Therefore the impact of other components is not modeled well in these simulation solutions. For example, the modeling of HDFS and OS behavior is very limited. Specifically, SimMR does not simulate HDFS; MRPerf supports only one replica for each chunk of output data; and MRSG attaches HDFS to the Map/Reduce behavior so tightly that it cannot model disk or network I/O resource contention among multiple jobs. HaSim does not model OS caching and buffering behavior, which has a significant impact on performance. In contrast, our approach simulates the whole software stack.

Moreover, there are other limitations in the existing solutions. MRPerf is limited to a single storage device per node and only simulates very simple MapReduce behaviors. SimMR and MRSG do not simulate disk I/O overhead. HaSim does not model memory capacity as one of the important system components, and its resource sharing behavior is controlled by only a single parameter, which oversimplifies realistic system behavior. At the core of our Hive simulation framework, CSMethod employs a layered simulation framework which can define the cluster architecture in an accurate and efficient fashion. Hardware resource usage histories are also monitored for accurate activity mapping and performance modeling.

In summary, compared to the existing simulation solutions, our Hive simulation framework is able to accurately and efficiently simulate Hive queries, taking into consideration both the software stack and the cluster hardware configuration.

VI. Conclusion and Future Work

Currently, predicting Hive query performance accurately and efficiently is very challenging, due to the frequent interaction between data flow and data content, hardware diversity and software stack complexity. In this paper, we propose an innovative framework to simulate the complete Hive query execution process, addressing all the challenges mentioned above. By directly writing metadata into MetaStore, we utilize the Hive explain command to obtain the data flow (the MapReduce job plan) without having to generate a large amount of input data. Upon obtaining the MapReduce job plan, we use an enhanced version of CSMethod that also considers data skew to effectively simulate data flow execution. We have implemented the simulation framework and validated its accuracy and efficiency via comprehensive micro-benchmark experiments. Moreover, we have shown, using real-world use cases, that the framework is able to effectively guide cluster system design and deployment decisions.

In the future, we plan to extend the proposed simulation framework to support more execution engines than the current MapReduce, such as Spark [20] and Tez [21]. Meanwhile, since power is another important factor, besides performance, for big data cluster planning, we plan to extend the simulation framework to perform power modeling and estimation.

References

[1] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy, "Hive - a warehousing solution over a Map-Reduce framework," VLDB '09, August 24-28, 2009, Lyon, France.
[2] Hadoop: http://hadoop.apache.org/
[3] MySQL: http://www.mysql.com/
[4] Derby: http://db.apache.org/derby/
[5] Qifa Ke, Vijayan Prabhakaran, Yinglian Xie, Yuan Yu, Jingyue Wu, Junfeng Yang, "Optimizing Data Partitioning for Data-Parallel Computing," HotOS XIII, 2011, Napa, California, USA.
[6] Enrico Barbierato, Marco Gribaudo, Mauro Iacono, "Modeling Apache Hive based applications in Big Data architectures," ICST, 2013, Brussels, Belgium.
[7] Enrico Barbierato, Marco Gribaudo, Mauro Iacono, "A performance modeling language for big data architecture," 27th ECMS, 2013.
[8] Wagner Kolberg, Pedro de B. Marcos, Julio C.S. Anjos, Alexandre K.S. Miyazaki, Claudio R. Geyer, Luciana B. Arantes, "MRSG – a MapReduce simulator over SimGrid," Parallel Computing, Volume 39, Issue 4-5, Pages 233-244, April 2013.
[9] Wang, G., Butt, A. R., Pandey, P., and Gupta, K., "A simulation approach to evaluating design decisions in MapReduce setups," Proceedings of the 17th Annual Meeting of the IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '11), London, UK, 2011.
[10] Palson R Kennedy and T V Gopal, "A MR simulator in facilitating cloud computing," International Journal of Computer Applications 72(5):43-49, June 2013. Published by Foundation of Computer Science, New York, USA.
[11] A. Verma, L. Cherkasova, and R.H. Campbell, "Play It Again, SimMR!," Proc. IEEE Int'l Conf. Cluster Computing (Cluster '11), 2011.
[12] Zhaojuan Bian, Kebing Wang, Zhihong Wang, Gene Munce, Illia Cremer, Wei Zhou, Qian Chen, Gen Xu, "Simulating big data clusters for system planning, evaluation and optimization," ICPP-2014, September 9-12, 2014, Minneapolis, MN, USA.
[13] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang, "The HiBench benchmark suite: characterization of the MapReduce-based data analysis," ICDE Workshops, 2010.
[14] A. Pavlo, A. Rasin, S. Madden, M. Stonebraker, D. DeWitt, E. Paulson, L. Shrinivas, and D. J. Abadi, "A comparison of approaches to large-scale data analysis," SIGMOD, June 2009.
[15] Intel, Simulation software: http://www.intel.com/content/www/ru/ru/cofluent/intel-cofluentstudio.html
[16] Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, Xiaodong Zhang, "YSmart: yet another SQL-to-MapReduce translator," ICDCS, 2011.
[17] Julian Hyde, "Optiq: a SQL front-end for everything," Pentaho Community Meetup, 2012.
[18] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin, "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," VLDB '09, August 24-28, 2009, Lyon, France.
[19] Yingzhong Xu, Songlin Hu, "QMapper: a tool for SQL optimization on hive using query rewriting," WWW 2013 Companion, May 13-17, 2013, Rio de Janeiro, Brazil.
[20] Spark: https://spark.apache.org/
[21] Tez: http://zh.hortonworks.com/hadoop/tez/
