Improving the Memory Efficiency of In-Memory MapReduce Based HPC Systems

Cheng Pei, Xuanhua Shi, and Hai Jin

Services Computing Technology and System Laboratory, Cluster and Grid Computing Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
{peicheng,xhshi,hjin}@hust.edu.cn

Abstract. In-memory cluster computing systems based on MapReduce, such as Spark, have made a great impact in addressing all kinds of big data problems. Because these systems rely heavily on memory to avoid the latency caused by disk I/O operations, some of their design choices can cause resource inefficiency on traditional high performance computing (HPC) systems. Hash-based shuffle, particularly large-scale shuffle, can significantly degrade job performance through excessive file operations and unreasonable use of memory. Some intermediate data unnecessarily overflow to the disk when memory usage is unevenly distributed or when memory runs out. Thus, in this study, Write Handle Reusing is proposed to fully utilize memory in shuffle file writing and reading, a Load Balancing Optimizer is introduced to ensure the even distribution of data processing across all worker nodes, and a Memory-Aware Task Scheduler that coordinates concurrency level and memory usage is developed to prevent memory spilling. Experimental results on representative workloads demonstrate that the proposed approaches decrease the overall job execution time and improve memory efficiency.

Keywords: MapReduce · In-memory · Hash-based shuffle · Load balancing · Task scheduler

1 Introduction

For many data-intensive applications, the high-level cluster programming models MapReduce [1] and Dryad [2] are widely adopted to process growing volumes of data and to realize scalable performance by exploiting data parallelism. The design philosophy of these models is the automatic provision of locality-aware scheduling, fault tolerance, and load balancing, all of which are embodied in some well-known application systems. As a typical representative of the MapReduce framework, Hadoop [3] enables a wide range of users to analyze big datasets on commodity clusters.


Considering the large number of disk I/O operations in Hadoop, some derived systems attempt to change the concrete implementation of MapReduce's execution flow to obtain better performance for different hardware environments or special applications. TritonSort [4] equips Hadoop clusters with abundant disks and memory to improve I/O throughput. HaLoop [5] makes the task scheduler of Hadoop loop-aware and adds various caching mechanisms for iterative programs. Meanwhile, Spark [6], a newer representative of the MapReduce framework, absorbs several of the principles of Dryad with regard to data sharing and the functional programming interface. When coupled with in-memory computing, Spark can outperform Hadoop in many kinds of applications. Spark introduces an abstraction called resilient distributed datasets (RDDs) [7] to persist data in memory for reuse, and retains fault tolerance for memory-resident data. Moreover, Spark is a representative of in-memory cluster computing systems, and any successful improvement in Spark inspires the development of such systems. The approaches in the present study are implemented in Spark, but the underlying ideas are common to in-memory cluster computing systems.

The nature of compute-centric HPC systems requires large and powerful CPU resources for computation; memory and disks are often limited relative to CPU power. According to the published memory configurations of the top 10 supercomputers from June 2010 to November 2012, the majority of these HPC servers are equipped with less than 2 GB of memory per CPU core, and some with less than 1 GB [8]. Thus, this study considers compute-centric HPC systems whose bottleneck is the limited memory resource rather than the CPU resource.

In Spark, hash-based shuffle directly writes intermediate data into separate files for each partition. Because the pressure of memory allocation and garbage collection (GC) on the Java virtual machine (JVM) is not considered, performance deteriorates severely when the numbers of mappers and reducers are large. The tasks of Spark are scheduled according to data locality and the number of CPU cores. However, in compute-centric HPC systems, locality-oriented scheduling techniques for MapReduce jobs can cause performance degradation [9] when data locality is maximized. In Spark, delay scheduling [10] is adopted to delay tasks in favor of data locality. Moreover, tasks with shuffle fetch operations have no preferred locations and are randomly assigned to worker nodes. These design characteristics may lead to load imbalance, causing each worker node to process different amounts of data and to encounter different memory pressures. Furthermore, tasks are scheduled by the free CPU cores of the worker nodes without consideration of the memory resource. When a worker node is left with no spare memory, intense competition for memory and unnecessary spill operations occur.

With the aim of achieving high performance for in-memory cluster computing on compute-centric HPC systems, we propose three optimization techniques, implemented in Spark, that address the aforementioned issues by maximizing memory use. The contributions of this study are as follows:


1. We analyze the causes of performance degradation when the number of partitions increases under hash-based shuffle, and introduce Write Handle Reusing to overcome these deficiencies and improve memory usage efficiency.
2. We develop algorithms, collectively called Load Balancing Optimizer, that implement load balancing by adjusting or resetting the preferred locations of tasks.
3. We design and implement a task scheduler that schedules tasks according to the memory usage of tasks and the resource situation of the worker nodes. The scheduler, which we call Memory-Aware Task Scheduler, dynamically calculates the optimal task concurrency of each worker node and minimizes expensive disk spilling.
4. We conduct extensive experiments for comparison with the original Spark platform. The experimental results show that our proposed methods dramatically improve job execution time on memory-constrained clusters.

The rest of this paper is organized as follows. Section 2 provides a brief overview of in-memory MapReduce systems and discusses the motivation of this research. The overall implementation of the three optimization techniques is presented in Sect. 3. Section 4 reports and analyzes the results of the performance evaluation. Related work is discussed in Sect. 5. The final section concludes this paper and discusses potential future work.

2 Background and Motivation

This section provides a brief background on the design and implementation approaches adopted in existing in-memory MapReduce systems, and discusses the motivation arising from the challenges and opportunities in the job execution and memory usage mechanisms of Spark.

2.1 Spark as the In-Memory MapReduce System

In-memory computing is a concept proposed in the era of big data. Recent studies on in-memory computing focus on hardware architectures and system software: when new hardware is introduced into the traditional architecture, an appropriate modification of the original upper-layer system software ensues. As a distributed computing platform, Spark absorbs the essence of in-memory computing, that is, the maximization of memory speed to match CPU speed, and tolerates data loss through the lineage of RDDs. At its core, however, Spark is a MapReduce framework like Hadoop. The differences in design and implementation between Spark and Hadoop are elaborated below.

MapReduce frameworks consist of a map, a shuffle, and a reduce phase. The shuffle phase is commonly overshadowed by the map and reduce phases. The map phase reads data from file systems, such as HDFS, and transforms the data into key-value pairs. The reduce phase handles the values of identical keys. The shuffle phase is divided into shuffle write and shuffle fetch. Shuffle write is the preparatory stage of shuffle fetch and takes place in the map phase, which tends to obscure the process of writing shuffle files.


Shuffle fetch aims to transmit the data generated by map tasks to reduce tasks in a many-to-many fashion and is often integrated into the reduce phase.

In earlier versions of Hadoop and Spark, map tasks directly write output data to disks, thereby creating M × R shuffle files, where M is the number of map tasks and R is the number of reduce tasks. When M and R are enormous, millions of shuffle files are created. Reading these files then generates many random I/Os and causes a sharp decline in performance. To reduce the number of files, Hadoop creates a large in-memory buffer to cache partitioned data and spills the cached and sorted data to disks when the buffer reaches its upper limit. However, in preparation for the reduce phase, each output file of a map task in Hadoop needs to be sorted and merged into a single file, which requires many disk I/O operations. In Spark, shuffle file consolidation [11] is introduced to reduce the number of shuffle files: only C × R shuffle files are created, where C denotes the number of CPU cores allocated to the Spark application. The smaller number of shuffle files in Spark facilitates the exploitation of the strong locality benefits offered by the underlying file system. Moreover, many applications do not need to sort intermediate data.

In the shuffle fetch phase, both Spark and Hadoop fetch M × R data segments or files. Spark directly stores the fetched data in memory for aggregation, whereas Hadoop stores the fetched data on disks or in memory for sorting and merging. Similar to its map phase, Hadoop's reduce phase needs to sort and merge the fetched data through many disk I/O operations before processing them. Spark instead uses a hash table to aggregate values of the same partition in memory; when no spare memory is available, Spark is compelled to spill part of the data to disks. In addition, Spark leverages the distributed memory of all worker nodes to store much of the intermediate data in memory for reuse in subsequent stages. Spark improves overall performance by avoiding some disk I/O operations that are unavoidable in Hadoop.
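As a concrete illustration of these phases, the following minimal Spark job (in Scala) contains one map stage and one reduce stage separated by a shuffle; the application name and HDFS paths are placeholders, not part of the systems discussed in this paper.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCountExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountExample"))
    // Map stage: read the input and transform it into key-value pairs.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
    // reduceByKey introduces the shuffle boundary: map tasks write partitioned
    // shuffle files (shuffle write), and reduce tasks fetch and aggregate them.
    val counts = pairs.reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///path/to/output")
    sc.stop()
  }
}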

2.2 Motivation

After the shuffle file consolidation mechanism [11] is introduced into Spark, the number of shuffle files decreases and performance improves. However, hash-based shuffle still causes performance degradation when the number of partitions reaches the thousands. A job in Spark is divided into stages that run one at a time; if one stage runs slowly as a result of load imbalance, the total job time is affected. Each stage of a job reads input data from disks or fetches data from the output of the previous stage. Whether or not the input data are distributed evenly across the cluster, Spark's scheduler assigns tasks to worker nodes based on data locality through delay scheduling [10]. If a task processes a partition without preferred locations, Spark randomly sends the task to one worker node. When processing different amounts of data, worker nodes may face different memory pressures and execute some unnecessary and expensive spill operations. Thus, the execution time of each stage can be affected by an imbalance in the amount of data processed by the worker nodes.


Fig. 1. (a) and (b) Illustrations of original hash-based shuffle’s memory usage and GC time. (c) Data size through reading and shuffle fetching.

To analyze the reasons for the performance degradation caused by hash-based shuffle and to demonstrate the existence of load imbalance, we conduct the following benchmark experiment. The experiment platform is a cluster of nine nodes: one node functions as the master node and the remaining eight nodes as worker nodes. Each node is equipped with two 8-core 2.6 GHz Intel Xeon E5-2670 CPUs, 32 GB of memory, and a 300 GB 10,000 RPM SAS disk, and runs RedHat Enterprise Linux 5 (Linux 2.6.18-308.4.1.el5). Spark version 1.1.0 is used in the experiment, running on a 160 GB dataset created by Hadoop's RandomTextWriter. The application is WordCount without combiner (WC), with 25 GB of memory for each executor's JVM. We specify Netty for transferring shuffle blocks between executors and enable the shuffle file consolidation mechanism. In Spark, WC can be divided into two stages (a map stage and a reduce stage) according to the characteristics of its tasks. By adding "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" to the Java options, we obtain the memory usage and the time spent on GC in one worker node. In addition, we record the size of the data read from the HDFS and fetched through shuffle operations on all the worker nodes.

As shown in Fig. 1 (a) and (b), in the reduce stage, memory usage remains stable and little time is spent on GC. The cause of the performance degradation is observable in the map stage: memory usage exhibits an increasing trend as time passes, and the time spent on each GC increases. When no spare memory is left, full GC occurs, indicating that the speed of memory collection by GC is lower than the speed of memory allocation. By analyzing the implementation of the hash-based shuffle write phase in the map stage, we find that each map task needs to create R file writer handles and some auxiliary objects to write data to the shuffle files. Even though the shuffle file consolidation mechanism ensures that every map task executed on the same core shares the same file group, the number of operations on the files is still M × R. Each file operation needs some system calls, a write handle, a 32 KB write buffer, and some related objects. As a result, the system allocates a huge amount of memory for these objects when the number of partitions reaches the thousands, which triggers a serious GC problem and increases the system overhead.
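For reference, the executor memory and GC logging flags above can be expressed through a SparkConf along the following lines. The 25 GB executor memory and the GC flags are taken from the text; the property names and application name are our assumptions based on Spark 1.x configuration and may differ in other versions.

import org.apache.spark.SparkConf

// Sketch of the benchmark configuration described above (property names assumed for Spark 1.x).
val conf = new SparkConf()
  .setAppName("WordCountWithoutCombiner")              // illustrative application name
  .set("spark.executor.memory", "25g")                 // 25 GB per executor JVM
  .set("spark.shuffle.consolidateFiles", "true")       // hash-based shuffle with file consolidation
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")   // GC logging used in the experiment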

From Fig. 1 (c), we observe differences in the amount of data that each worker node reads and fetches in Spark. Such load imbalance can cause the data assigned to one node to overwhelm that node's memory capacity.

As in other multi-threaded execution engines, the tasks of Spark run as threads on the worker nodes, which is conducive to sharing memory and reducing resource waste. Tasks spill to disks when memory runs out. Spark mainly restricts memory usage to two types and sets a maximal amount for each: one type is used for caching data in memory, and the other for aggregation operations. Moreover, Spark ensures that each task on one executor can use between 1/(2N) and 1/N of the memory for aggregation operations, where N is the number of running threads in one executor. When memory is inadequate for aggregation operations, spill operations occur, and excessive spill operations cause performance degradation. Spill operations can be avoided in two ways. One is to create additional partitions so that each task needs less memory, although this may create more shuffle files. The other is to lower the concurrency level so that each task obtains more memory, although some stages may then run very slowly even when memory would have been sufficient at the previous, higher concurrency level.

Motivated by the aforementioned challenges and opportunities, we propose three optimization approaches to improve memory usage efficiency. The specific details of the implementation are presented in the next section.

3 Implementation

This section presents how the corresponding solutions for improving memory efficiency are implemented. The overall architecture of our approaches is shown in Fig. 2, which depicts the native structure of Spark together with our proposed optimization methods, highlighted in red and with table gridlines. Write Handle Reusing (WHR) is a new shuffle mechanism that mainly optimizes the shuffle write phase and is offered as an option to users. Before the tasks of a stage are submitted to the task scheduler, the Load Balancing Optimizer (LBO) checks for load imbalance and fixes it if it exists. The Memory-Aware Task Scheduler (MATS) consists of a feedback-sampling-decision mechanism, which helps the task scheduler consider not only the CPU resource but also the memory resource when dispatching tasks.

Fig. 2. Overall architecture of our approaches

3.1 Write Handle Reusing

The WHR mechanism, which builds on the shuffle file consolidation mechanism, is designed to relieve memory pressure on the JVM and to reduce memory allocation by reusing writer handles and related objects. In Spark, each CPU core has a corresponding group of R partition files, called the ShuffleWriterGroup. When the ShuffleWriterGroup is used for the first time, the corresponding R partition files are opened. These partition files are not immediately closed after a task finishes with the ShuffleWriterGroup; WHR closes the shuffle files of a worker node only upon detecting that no tasks of the same stage are running on that node. Thus, the number of file operations is reduced to C × R. Every map task running on the same core writes data to the partition files and shares the same group of writer handles, file group, and write buffer. To efficiently support fault tolerance in WHR, we follow the shuffle file consolidation approach; that is, we maintain the functional semantics by providing similar guarantees. After one map task finishes, its output is registered in the master node, including the offsets and lengths in the partition files. In WHR, the ShuffleWriterGroup ID is also registered and is later used in the shuffle fetch phase.

During the shuffle fetch phase in the reduce stage, each reduce task fetches the data that every map task outputs. Although the shuffle file consolidation mechanism reduces the number of shuffle files to C × R, the number of fetched data segments is still M × R, and the partition files are divided into multiple parts. Moreover, requests for individual shuffle blocks arrive in a random order because they are issued concurrently by all running reduce tasks. To further optimize the performance of the shuffle fetch phase, we make adjustments that follow from the changes in the shuffle write phase. The data in the same partition file belong to the same partition, so instead of fetching data segments, WHR pulls the entire partition files, which in turn reduces the number of random reads. Using the ShuffleWriterGroup ID, WHR transforms M × R data segments into C × R files. During the shuffle fetch phase, WHR checks whether the total size of the data segments equals the length of the entire file, for fault tolerance.
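A minimal sketch of the reuse idea is given below, assuming one ShuffleWriterGroup per CPU core whose R partition writers stay open across the map tasks of a stage. The class, file naming, and method names are illustrative and do not correspond to Spark's actual shuffle classes.

import java.io.{BufferedOutputStream, FileOutputStream, OutputStream}

// One writer group per core: the R partition writers are opened once and reused by
// every map task scheduled on that core, instead of being created per task.
class ShuffleWriterGroup(shuffleId: Int, groupId: Int, numReducers: Int) {
  private val writers = Array.tabulate[OutputStream](numReducers) { r =>
    // 32 KB buffer per writer, appending to the consolidated partition file.
    new BufferedOutputStream(
      new FileOutputStream(s"shuffle_${shuffleId}_${groupId}_$r", true), 32 * 1024)
  }
  private var runningTasks = 0

  // A map task borrows the open writers instead of opening its own handles.
  def acquire(): Array[OutputStream] = synchronized { runningTasks += 1; writers }

  // Finishing a task does not close the files; the next task on this core reuses them.
  def release(): Unit = synchronized { runningTasks -= 1 }

  // The files are closed only when no task of the stage is still running on this worker.
  def closeIfIdle(): Unit = synchronized {
    if (runningTasks == 0) writers.foreach(_.close())
  }
}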

3.2 Load Balancing Optimizer

Load imbalance easily occurs when data locality is pursued too aggressively. LBO finds and fixes load imbalance before the tasks of a stage start to run; delay scheduling then continues to coordinate the data locality of the input data with the free CPU cores. Before a task starts to run, whether it reads data from the HDFS or processes data produced by shuffle operations, the amount of data on each worker node is known in advance. Thus, LBO is introduced to statically calculate which tasks should be executed on which worker node. In Spark, tasks are scheduled according to their preferred locations, so we can achieve the desired result simply by adjusting or resetting the preferred locations of the tasks. The preferred locations of input files are provided by the distributed file system, whereas tasks obtaining data through shuffle operations have no preferred locations.


For both cases, two different algorithms are designed. We assume that there are p input partition files and n worker nodes. When tasks read data from the partition files on the HDFS, each task has a preferred location that LBO adjusts and resets upon finding load imbalance. The corresponding algorithm of LBO is described in the following pseudo code.

Algorithm 1. Reset the Preferred Location for Input Data from HDFS
Input: the set of sizes of the partition files, S; the set of preferred locations of the tasks, P
Output: the set of adjusted preferred locations of the tasks, N
1: Initialization
2: sumsize = Sum(S[i]); averagesize = sumsize / n; R is an empty collection
3: for i = 1 to p do
4:   j = P[i]
5:   add (i, S[i]) to N[j]
6: end for
7: for j = 1 to n do
8:   if Sum(N[j]) > averagesize then
9:     sort N[j] by the size of its elements
10:    move the redundant elements of N[j] into R, choosing from small to large
11:  end if
12: end for
13: for k = 0 to R.length do
14:   choose the node j with the smallest total size of partition files in N
15:   add R[k] to N[j]
16: end for
17: for j = 0 to n do
18:   for k = 0 to N[j].length do
19:     P[N[j][k].i] = j
20:   end for
21: end for
22: return N

Given that tasks performing shuffle fetch operations have no preferred locations, LBO sets the preferred location of such tasks, which pull data from every worker node. When the shuffle write phase finishes, the lengths of the partition files on each worker node are registered on the master node. The LBO algorithm for shuffle operations takes full advantage of the data distribution to assign tasks to the worker nodes, which makes each worker node process a nearly equal amount of data and reduces network transmission. The specific steps of the LBO algorithm for tasks without preferred locations are described below; a sketch of Steps 4 and 5 follows the list.

Step 1: For each task, calculate the amount of data it processes and the distribution of that data.
Step 2: Divide the tasks into n groups according to the number of nodes and the data sizes, ensuring that the data size of each group is nearly equal.
Step 3: Determine the amount of fetched data if the tasks of every group were executed on every node; this yields an n × n matrix.
Step 4: Choose the largest value in the matrix to decide which group is allocated to which node. Mark the row and column of the selected entry so that this group and node are not chosen again. Repeat Step 4 until no group remains.
Step 5: Set the preferred location of each task.
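The following sketch illustrates Steps 4 and 5 under the assumption that the n × n matrix from Step 3 has already been computed; the names and data structures are illustrative, not Spark types.

// Greedy group-to-node assignment (Steps 4-5): repeatedly pick the largest remaining
// matrix entry, assign that group to that node, and mark its row and column as used.
def assignGroupsToNodes(matrix: Array[Array[Long]], nodes: Array[String]): Map[Int, String] = {
  val n = nodes.length
  val usedGroup = Array.fill(n)(false)
  val usedNode = Array.fill(n)(false)
  var preferredNodeOfGroup = Map.empty[Int, String]

  for (_ <- 0 until n) {
    var bestG = -1
    var bestJ = -1
    var bestVal = Long.MinValue
    for (g <- 0 until n if !usedGroup(g); j <- 0 until n if !usedNode(j)) {
      if (matrix(g)(j) > bestVal) { bestG = g; bestJ = j; bestVal = matrix(g)(j) }
    }
    usedGroup(bestG) = true
    usedNode(bestJ) = true
    // Step 5: every task in group bestG receives nodes(bestJ) as its preferred location.
    preferredNodeOfGroup += (bestG -> nodes(bestJ))
  }
  preferredNodeOfGroup
}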

3.3 Memory-Aware Task Scheduler

The number of CPU cores available for tasks running on the same worker node is set by a static configuration file or parameter and is fixed at runtime. When the tasks of a stage consume a considerable amount of memory, this fixed thread parallelism can cause fierce competition for memory, leading to many spill operations. Thus, we design MATS, which dynamically adjusts a worker node's concurrency level according to its current memory usage. The implementation of MATS is described below.

Let Smem and Sspill denote the size of the memory used by a task and the amount of data it spills to disks, respectively; Spark provides ways to obtain the values of Smem and Sspill for each task. The upper bound on the memory size for aggregation operations, Smax, and the maximal concurrency level, CLmax, are set beforehand. The optimal concurrency level CLop is then calculated as

CLop = CLmax                        if Sspill = 0
CLop = Smax / (Smem + Sspill)       if Sspill > 0

When one task on a worker node finishes, one value of CLop is obtained; this single value, however, does not represent the whole worker node. Meanwhile, the master node receives a large amount of feedback on CLop from finished tasks. MATS therefore establishes a feedback-sampling-decision mechanism, which assigns tasks to a worker node according to the results of a sampling strategy. The strategy sets a dynamic sampling number SN and calculates the optimal concurrency level after collecting SN values of CLop. Let CLcurrent denote the current concurrency level of one worker node. The initial values of SN and CLcurrent are both CLmax. Within a window of SN finished tasks on the same worker node, CLsum accumulates the reported CLop values, and CLcurrent is changed to CLop whenever a reported CLop is smaller than CLcurrent. When the counter reaches SN, both CLcurrent and SN are set to CLsum / SN.
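A compact sketch of this feedback-sampling-decision update is given below; the class is illustrative, not actual Spark code, and assumes that each piece of task feedback carries the task's memory usage and spilled bytes.

// Per-worker MATS state maintained on the master (illustrative sketch).
class MatsState(clMax: Int, sMax: Long) {
  private var clCurrent = clMax // concurrency level currently granted to the worker
  private var sn = clMax        // dynamic sampling number SN
  private var counter = 0
  private var clSum = 0L

  // Called for every finished task; returns the concurrency level to use when dispatching.
  def onTaskFinished(sMem: Long, sSpill: Long): Int = {
    val clOp =
      if (sSpill == 0) clMax                             // no spilling: keep the maximum level
      else math.max(1, (sMax / (sMem + sSpill)).toInt)   // CLop = Smax / (Smem + Sspill)

    clSum += clOp
    counter += 1
    if (clOp < clCurrent) clCurrent = clOp               // shrink immediately on tighter feedback

    if (counter >= sn) {                                 // end of the sampling window
      val avg = math.max(1, (clSum / counter).toInt)
      clCurrent = avg                                    // CLcurrent := CLsum / SN
      sn = avg                                           // SN := CLsum / SN
      counter = 0
      clSum = 0L
    }
    clCurrent
  }
}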

4 Evaluation

This section presents the performance evaluation that verifies the effectiveness of the improved Spark, which integrates the three optimization methods proposed in this paper. The environment and settings of the experiments are the same as those described in Sect. 2.2; any additional settings are described in the corresponding sections. The performance of the improved Spark is compared with that of Spark version 1.1.0. The job configurations of the improved Spark are the defaults and are the same as those of the original Spark; in other words, we do not compare the improved Spark against the original Spark with disadvantaged configurations.

4.1 Validation Experiments on Three Approaches

To evaluate the influence of WHR on memory and to demonstrate the effectiveness of LBO, we conduct an experiment under conditions identical to the test described previously. With the same experimental setup and data, we record the memory usage and GC time, and, as for the LBO analysis, the size of the data read from the HDFS and fetched through shuffle operations on each node.


Fig. 3. (a) and (b) Illustrations of WHR’s memory usage and GC time. (c) Data size through reading and shuffle fetching.

Figure 3 (a) and (b) show that the memory usage in the map stage is reduced and stable, and that the GC time becomes shorter. Compared with the original hash-based shuffle, the maximal memory usage is reduced significantly from nearly 24 GB to 13 GB. This result indicates that WHR can improve the utilization rate of memory and relieve the pressure on the JVM. WHR improves performance because the write handles are shared within each core and the creation of large objects is reduced. With the default block size (128 MB), M and R are equal to 1440 and C is equal to 128. The improved Spark uses WHR to reuse the write handles and related objects, which reduces the number of these objects in the map stage to 128/1440 ≈ 0.089 of that in the native Spark. By fetching C × R partition files instead of M × R data segments in the shuffle fetch phase, the number of random reads is also reduced.

Figure 3 (c) shows that with the LBO the input and shuffle fetching data are relatively even on each node. Compared with Fig. 1 (c), the LBO reduces the standard deviations of the input data size and the fetched data size processed on each worker node from 0.94 to 0.19 and from 0.53 to 0.08, respectively, which indicates good load balancing. In addition, the total execution time of WC with the improved Spark is 33.3 % lower than that with the original Spark, dropping from 24 min to 16 min.

To verify the effectiveness of MATS, the test application should ensure that memory resource competition and spill operations occur. Given that the aggregation operations in the reduce stage consume little memory, no disk spilling occurs in WC. Thus, we choose the PageRank (PR) algorithm, proposed by the founders of Google and implemented in Spark, for the comparison. The hash-based shuffle is still adopted by PR in the original Spark. PR performs iterative computation and caches data in memory, which is conducive to spill operations.


Fig. 4. (a) Average concurrency level of each stage. (b) Spilled data size of the middle three stages.

We set the number of iterations to three. The input dataset of PR is produced by the PR workload of HiBench, which is provided by Intel. The size of the dataset for PR is 115 GB, or nearly 14 GB per worker node. The heap size of the JVM on each worker node is 25 GB. To avoid serious GC problems and reduce the size of the cached data, we serialize the cached data using org.apache.spark.serializer.KryoSerializer. In the experiments, the amount of data spilled on all the worker nodes and the average real-time concurrency level of each worker node are recorded.

The application consists of five stages; the middle three stages consume considerable memory and cause spill operations. From Fig. 4 (a) and (b), we observe that the concurrency level is dynamically changed and decreases when memory resource competition occurs. The total spill size in the improved Spark with MATS is smaller than that in the original Spark by as much as 27x. The total running time of the improved version is reduced by as much as 31 %, from 96 min to 66 min, compared with the original version. MATS sacrifices concurrency to avoid expensive spill operations and additional I/O operations, which in turn reduces the amount of spilled data and maximizes the use of memory. Thus, according to the results of this experiment, replacing the task scheduler with MATS to improve overall performance is worthwhile.

The above experiments show that the three proposed optimizations improve the performance of in-memory MapReduce systems from different aspects. WHR reduces the overheads caused by a large number of partitions, LBO adjusts task scheduling to maintain load balance, and MATS controls the frequency of dispatching tasks to avoid memory resource competition. Thus, the approaches do not interfere with one another and work well together. For example, for a fixed amount of input data, the number of partitions is inversely proportional to the size of each partition, and memory usage is associated with the amount of data being processed. A small number of partitions therefore easily leads to memory resource competition, whereas a large number of partitions causes more file operations. Users need to consider both the partition size and the number of partitions when facing big data problems, but such thorny problems can be handled well by WHR and MATS together.


Fig. 5. (a) Performance comparison with an increased number of partitions. (b) Performance relative to the size of memory usage.

4.2 Increased Number of Partitions

To test the influence of the number of partitions on the improved Spark, we use four different partition numbers to compare the execution time of WC in the improved Spark with that in the original Spark, which adopts either hash-based shuffle with the shuffle file consolidation mechanism or sort-based shuffle. Figure 5 (a) shows the partition number changing from 720 to 2880; the corresponding input data sizes are 80, 160, 240, and 320 GB. The hash-based shuffle performs better than the sort-based shuffle. The original Spark with the shuffle file consolidation mechanism achieves a speedup of 2.53x over the sort-based shuffle when the partition number is 720. When the partition number increases, however, the speedup drops to less than 2x. With an increase in the partition number, the hash-based shuffle runs slowly because memory is not used reasonably, causing serious GC problems, and because too many file operations increase the system overhead. By contrast, the improved Spark adapts well to different partition numbers and has good scalability. Its speedup over the sort-based shuffle consistently remains at 2.11x or higher, reaching as high as 3.18x. Compared with the hash-based shuffle, the improved Spark achieves a small speedup of 1.25x with 720 partitions; however, when the partition number reaches 2880, the improved Spark still decreases the run time by 36.8 % and achieves a maximal speedup of 1.58x over the hash-based shuffle. Because there are no spill operations in WC, the WHR mechanism and the LBO method play the important roles in the improved Spark: the improved Spark greatly reduces the memory used to create objects and ensures load balancing across all the worker nodes.

4.3 Size of Memory Usage

For in-memory MapReduce systems, the size of memory usage for aggregation operations on intermediate data has a great influence on performance. In Spark, different aggregation operations consume different amounts of memory.


Some aggregation operations require much memory to handle intermediate data and therefore cause some disk spilling, especially when memory is inadequate. In WC, a reduceByKey operation is used to aggregate key-value pairs by summing up the values with the same key, which consumes little memory. Meanwhile, PR uses the groupByKey operation to transform (key, value) pairs into (key, value-list) pairs and caches them in memory, which can easily cause spill operations. At every iteration of PR, the join and reduceByKey operations are used to handle the shuffle fetching data, which consumes considerable memory. Thus, WC and PR represent applications with different memory footprints. By analyzing the aforementioned experiments on WC and PR in Sect. 4.1, we study the effect of the size of memory usage on the improved Spark.

Figure 5 (b) shows that, relative to the original Spark, the execution time of WC in the improved Spark reaches a speedup of 1.5x, which is slightly higher than the 1.45x speedup in PR. The performance improvement of the improved Spark on PR is not as significant as that on WC. This result can be explained as follows. The performance improvements from WHR and LBO are reflected well in an application that has no spill operations, such as WC; the performance degradation caused by spill operations, however, reduces those gains. Given its memory-hungry operations, PR spills some of the intermediate data to disks. Although MATS greatly reduces data spilling, the concurrency level decreases, which still limits the improvement from WHR and LBO; MATS cannot fully compensate for the performance degradation caused by data spilling.
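For reference, the two aggregation styles compared above correspond to the following Spark operations; the pipeline is a simplified illustration with a placeholder input path and application name, not the exact code of the benchmark applications.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// reduceByKey keeps one running value per key during aggregation, whereas groupByKey
// materializes the full list of values per key in memory and is far more likely to
// trigger spill operations when memory is tight.
val sc = new SparkContext(new SparkConf().setAppName("AggregationStyles"))
val pairs = sc.textFile("hdfs:///path/to/input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

val summed = pairs.reduceByKey(_ + _)   // WC-style aggregation: small per-key running sum
val grouped = pairs.groupByKey()        // PR-style grouping: per-key collection of all values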

5 Related Work

Hash-based shuffle, which is simple yet effective, is adopted in many MapReduce systems. By reducing the number of shuffle files, the shuffle file consolidation mechanism [11] effectively mitigates the performance issues caused by the large number of random I/Os in the shuffle fetch phase. However, no prior research has studied the relationship between hash-based shuffle and memory; the present research analyzes the influence of hash-based shuffle on memory. By sharing some objects across file operations, the proposed WHR method relieves the pressure exerted by hash-based shuffle on memory and reduces some random I/Os.

Recently, research on load balancing has attracted significant attention. Enhanced Load Balancer [9] dynamically considers the size of the intermediate data generated by tasks and schedules tasks for an even distribution of intermediate data. ShuffleWatcher [12] targets multi-tenant environments and aims to reduce shuffle traffic to improve cluster performance based on network loads. In contrast to previous works, the proposed LBO method statically analyzes the presence of load imbalance in the processed data and adjusts the preferred locations of tasks, which helps locality-oriented scheduling techniques better exploit CPU and memory resources.

With regard to research on task schedulers, the Resource-Aware Adaptive Scheduler [13] is designed to dynamically adjust the number of slots and the workload placement on each machine to maximize the resource utilization of a Hadoop cluster.


Mammoth [8] implements a multi-threaded execution engine based on Hadoop and realizes global memory management through techniques such as disk access serialization, multi-cache, and shuffling from memory. These previous works focus on implementing synchronization between map tasks and reduce tasks. By contrast, the proposed MATS method is adapted to a multi-threaded execution engine in which jobs are divided into stages that run one at a time, so the tasks within each stage share and compete for the unified memory. By collecting information on completed tasks, MATS helps the task scheduler adjust the frequency of dispatching tasks to each worker node, which minimizes the expensive disk spilling caused by serious memory resource competition.

6 Conclusion and Future Work

In this study, we propose three approaches for in-memory MapReduce systems and implement them in Spark. Through an analysis of the disadvantages of hash-based shuffle with respect to memory usage, WHR is introduced to improve the efficiency of memory usage in the shuffle write phase and to reduce random I/Os in the shuffle fetch phase. Moreover, LBO is designed to keep the data evenly processed on each node. When the nodes run out of memory, MATS coordinates the memory usage with the concurrency level. The results of the experiments show that the new system achieves a speedup of the job execution time from 1.25x to 3.18x over the original Spark. In addition, the improved system does not change the current Spark processing phases, so existing upper-layer application techniques of Spark can easily be integrated into the new system, and Spark applications do not need any modification to run on our improved system.

Our future work will focus on two aspects of memory. First, we will build a unified memory management system to break the restriction of static memory configuration, which can make memory usage more efficient. Second, we will design a new execution process to reduce memory resource competition and to avoid long GC times without changing the upper application structure of the system.

Acknowledgments. This paper is partly supported by the NSFC under grants No. 61433019 and No. 61370104, the International Science & Technology Cooperation Program of China under grant No. 2015DFE12860, and the Chinese Universities Scientific Fund under grant No. 2014TS008.

References

1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
2. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM European Conference on Computer Systems (EuroSys), pp. 59–72 (2007)
3. Apache Hadoop. http://apache.hadoop.org


4. Rasmussen, A., Porter, G., Conley, M., Madhyastha, H.V., Mysore, R.N., Pucher, A., Vahdat, A.: TritonSort: a balanced large-scale sorting system. In: Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 29–42 (2011)
5. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endowment 3(1–2), 285–296 (2010)
6. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), pp. 10–10 (2010)
7. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), pp. 2–2 (2012)
8. Shi, X., Chen, M., He, L., Xie, X., Jin, H., Chen, Y., Wu, S.: Mammoth: gearing Hadoop towards memory-intensive MapReduce applications. IEEE Trans. Parallel Distrib. Syst. 26(8), 2300–2315 (2015)
9. Wang, Y., Goldstone, R., Yu, W., Wang, T.: Characterization and optimization of memory-resident MapReduce on HPC systems. In: Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 799–808 (2014)
10. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys), pp. 265–278 (2010)
11. Davidson, A., Or, A.: Optimizing shuffle performance in Spark. Technical report, University of California, Berkeley, Department of Electrical Engineering and Computer Sciences (2013)
12. Ahmad, F., Chakradhar, S.T., Raghunathan, A., Vijaykumar, T.: ShuffleWatcher: shuffle-aware scheduling in multi-tenant MapReduce clusters. In: Proceedings of the 2014 USENIX Annual Technical Conference (ATC), pp. 1–12 (2014)
13. Polo, J., Castillo, C., Carrera, D., Becerra, Y., Whalley, I., Steinder, M., Torres, J., Ayguadé, E.: Resource-aware adaptive scheduling for MapReduce clusters. In: Kon, F., Kermarrec, A.-M. (eds.) Middleware 2011. LNCS, vol. 7049, pp. 187–207. Springer, Heidelberg (2011)
