ARES: Aggressive Replication Enabled Scheduler for Hadoop Systems



Yizhi Ren1,2, Laiping Zhao3, Haiyang Hu4

1 School of Software Engineering, Hangzhou Dianzi University, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
3 School of Computer Software, Tianjin University, China
4 School of Computer & Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou Dianzi University, China
[email protected], [email protected], [email protected]

*Corresponding author: Laiping Zhao; E-mail: [email protected]
DOI: 10.6138/JIT.2014.15.3.14

Abstract
We re-examine the reliability issue in current Hadoop systems, and find that both rescheduling and speculative execution can extend a job's completion time. Since the execution time of a typical job is not long, such a delay can be severe. To reduce this negative impact on execution time, we explore the adoption of aggressive replication in the scheduler and present the design of ARES: an Aggressive Replication Enabled Scheduler for Hadoop systems. First, by relying on predictions of the future workload, ARES prevents denial of service caused by the large resource consumption of replicas. Second, by taking fairness and data locality into account, replicas of map tasks are distributed fairly and do not add much network latency. Finally, experiments show that ARES can even shorten job execution time, because it not only improves data locality but also starts reduce tasks as soon as the fastest map replica finishes.
Keywords: Hadoop, Scheduling, Replication

1 Introduction
Parallel programming frameworks such as MapReduce [6], Hadoop [18], and Dryad [8] have been widely adopted for large-scale data-intensive applications, among which Apache's Hadoop, an open source implementation of Google's MapReduce framework, is the most widely used. For example, Yahoo uses Hadoop to support its research on ad systems and web search; Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it for social network analysis; and Alibaba.com uses Hadoop to process the data generated by its large volume of business transactions. The Apache Hadoop project aims to develop open-source software for reliable, scalable, distributed computing [1], where "reliable" derives from HDFS (the Hadoop Distributed Filesystem) and an application-level, fault-tolerant computation process, and "scalable" is delivered by the MapReduce system. It is generally accepted that the ability to handle failures and allow jobs to complete

is one of the major benefits of using Hadoop. However, as system scale continues to increase, problems caused by failures are becoming more severe than before [16]. For example, Google reported an average of 5 worker deaths per MapReduce job in March 2006 [5]; Los Alamos National Laboratory (LANL) logged more than 1,000 failures annually at their system No. 7, which consists of 2014 nodes in total [16]. In terms of application-level fault tolerance in computation, Hadoop handles failures in a conservative, reactive way: it relies on the technique of "rescheduling" to tolerate failures. That is, whenever a task, either a map or a reduce, fails on a tasktracker, it is restarted on a healthy machine. The jobtracker in Hadoop detects failures by measuring the response time of heartbeats: a tasktracker is marked as failed when it has stopped sending heartbeats for 10 minutes (by default). In fact, the average runtime of jobs is not that long; for example, Google reported that the average runtime of a job is around 15 minutes [5]. Therefore, if a failure occurs, rescheduling the failed tasks after a 10-minute failure detection process and then restarting their execution from the beginning on a new tasktracker significantly hurts the overall makespan of the job. Given this, Hadoop provides another technique, named "speculative execution," to facilitate the progress of job execution. That is, whenever a task is running slower than expected (by default, 20% slower than the average level of the job), the jobtracker launches a backup for the slow task, and the task is considered completed as long as one replica finishes successfully. However, speculative execution is also sensitive to the runtime of the job: it reactively detects slow tasks and starts new replicas for them only after detection. By the time a slow task is detected, the slowdown has already happened, and launching a backup that starts execution from the very beginning results in a further extension of the job. Unlike the conservative, reactive countermeasures in current Hadoop systems, in this paper we explore the possibility of replicating tasks in a Hadoop environment in an aggressive way. With aggressive replication, tasks are assigned copies at the beginning of execution, rather than after a failure or slow progress


is detected. In this way, whenever a replica fails, the job can still proceed thanks to the surviving replicas, and job execution time is not affected. The only problem is that a large number of replicas could degrade system performance and waste resources. Our goal is to restrict the negative impact of aggressive replication on system performance and to enhance the efficiency of job execution by leveraging spare resources.
In particular, the first challenge is the increased demand for resources. Aggressive replication, i.e., starting several replicas simultaneously at the beginning, requires a large amount of resources. For a dedicated system with jobs arriving online, it may become impossible to provide adequate resources for future jobs; for a shared environment, e.g., a virtual cluster rented from a cloud infrastructure service provider, it results in higher economic cost for buying virtual machines. Therefore, aggressive replication has to optimize resource usage and make sure that adequate resources are preserved for future jobs.
We refer to an application consisting of several maps and reduces as a job, and to either a map or a reduce as a task. Our second challenge is determining which job or task should be replicated, and how many replicas should be assigned to the selected task. Generally, jobs and tasks in Hadoop systems are associated with various priorities, process different data, and are submitted at different times. Meanwhile, physical machines in a heterogeneous system exhibit different reliability and processing speeds, and some tasks are approaching the end of their execution while others may be only just beginning. Obviously, it is reasonable to launch more replicas for a task that has high priority, is deployed on an unreliable machine, and has just started. A key question is how to determine the number of replicas so that the required reliability is satisfied. Aggressive replication is supposed to take these situations into account and provide a fair policy for task replication.
In response to these challenges, the primary contribution of this paper is to demonstrate that aggressive replication is a valuable technique for building reliable data-intensive processing environments, especially for Hadoop systems with predictable workloads. Specifically, our contributions are summarized as follows:
- To prevent denial of service caused by excessive resource consumption by replicas, we forecast the workload using triple exponential smoothing, which performs well in systems whose workload shows trend and seasonality.
- Based on the workload prediction and the resource utilization rate, we devise an Aggressive Replication Enabled Scheduler (abbreviated "ARES") for picking tasks for replication.

- We evaluate the design of ARES through experiments, and find that job finish time is rarely postponed by tasktracker failures, and is even shortened compared with "rescheduling" and "speculative execution."
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 motivates the use of aggressive replication. Section 4 introduces the design of ARES, including the design goals and the job/task selection algorithm. Section 5 presents the experimental evaluation. Finally, we conclude and describe future work in Section 6.

2 Related Works
By default, Hadoop schedules jobs in FIFO order; that is, jobs with higher priority and earlier submission time are selected first for scheduling. When several slots are free, the jobtracker assigns maps to the tasktracker closest to the target data. If that tasktracker is not idle at the time, Hadoop selects another one on the same rack, and otherwise one on a remote rack [19].
The Capacity scheduler groups jobs into queues. Each queue is allocated a fraction of the capacity of the system. Whenever a TaskTracker (i.e., a datanode) is free, the Capacity scheduler picks the queue that has the most free space for running a task. Within each queue, the Capacity scheduler uses simple FIFO for job scheduling. Priority is optionally supported; that is, jobs with higher priority gain access to the queue's resources before jobs with lower priority [2].
The Fair scheduler groups jobs into pools. It assigns each pool a guaranteed minimum share and divides excess capacity evenly between pools. In particular, in the scheduling process, the Fair scheduler first fills each pool whose minimum share is larger than its demand. Second, it fills all remaining pools to their guaranteed minimum share. Third, it distributes the remaining demand evenly over unfilled pools, starting at the emptiest pool (see the sketch at the end of this section). Techniques like delay scheduling [20] and Dynamically Iterative MapReduce [21] have been proposed to further improve scheduling efficiency for Hadoop.
The Capacity scheduler and Fair scheduler opened the floodgates to new research and prompted the emergence of new scheduling algorithms. The Dynamic Priority (DP) parallel task scheduler [15] employs economic principles in scheduling, whereby users bid for resources on a market and receive allocations based on various auction mechanisms. This mechanism incentivizes users to tailor their allocations to the importance and requirements of their jobs. Chang et al. [4] focus on a theoretical model of the Hadoop scheduling problem, and formulate a linear


program that minimizes job completion times. They devise approximation algorithms (named OFFA) that achieve feasible schedules within a factor of 3 of the optimal. Moseley et al. [11] consider the problem of minimizing total flow time, and give an efficient 12-approximation in the offline setting and an online (1 + ε)-speed O(1/ε²)-competitive algorithm. Kc and Anyanwu [9] present the design of a constraint-based Hadoop scheduler that takes user deadlines into account and determines the schedulability of a job based on a proposed job execution cost model. Polo et al. [12] present a task scheduling mechanism that enables a MapReduce runtime to dynamically allocate resources in a cluster of machines based on the observed progress rate achieved by each job and the completion time goal associated with it.
As new machines are deployed and old, slow machines are continuously replaced with new, fast ones, distributed systems are expected to become more heterogeneous in the future. As these trends continue, it is necessary to devise algorithms that account for system heterogeneity. Rasooli and Down [13] present a Hadoop cluster scheduling algorithm which uses system information such as estimated job arrival rates and mean job execution times to make scheduling decisions; their approach can be efficiently applied in heterogeneous clusters. Regarding the heterogeneous nature of workloads, Tian et al. [17] optimize scheduling by classifying MapReduce workloads into three categories based on their CPU and I/O utilization. Accounting for heterogeneity of both the environment and the workloads, Lee et al. [10] present a system architecture to allocate resources in a cost-effective manner, and design a scheduling scheme that provides good performance and fairness simultaneously in a heterogeneous cluster.
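To make the Fair scheduler's three-step fill policy described above concrete, the following is a small, simplified Python sketch; the Pool structure and the fair_fill helper are our own illustration and not code from Hadoop's actual Fair scheduler:

```python
# Simplified illustration of the three-step fair-share fill policy:
# 1) satisfy pools whose demand is below their minimum share,
# 2) raise the remaining pools to their guaranteed minimum share,
# 3) spread leftover slots evenly, starting with the emptiest pool.

from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    min_share: int   # guaranteed slots
    demand: int      # slots the pool currently wants
    allocated: int = 0

def fair_fill(pools, total_slots):
    free = total_slots
    # Step 1: pools demanding less than their minimum share get their full demand.
    for p in pools:
        if p.demand <= p.min_share:
            p.allocated = min(p.demand, free)
            free -= p.allocated
    # Step 2: fill the remaining pools up to their guaranteed minimum share.
    for p in pools:
        if p.demand > p.min_share:
            give = min(p.min_share, free)
            p.allocated = give
            free -= give
    # Step 3: hand out leftover slots one at a time, emptiest pool first.
    while free > 0:
        unfilled = [p for p in pools if p.allocated < p.demand]
        if not unfilled:
            break
        emptiest = min(unfilled, key=lambda p: p.allocated)
        emptiest.allocated += 1
        free -= 1
    return pools
```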


3 Why Use Aggressive Replication
3.1 Impact of Failures
One of the many advantages of Hadoop is its high degree of fault tolerance, which is due to its special design for detecting and handling failures at the application layer. Restarting tasks is the primary way in which Hadoop achieves fault tolerance. For example, regarding TaskTracker failures, if the JobTracker does not receive any heartbeat from a TaskTracker for a period of time, the TaskTracker is assumed to have crashed, and every task running on it is restarted on another TaskTracker [18]. However, restarting tasks is ill-suited to time-critical jobs, because the completion time of a job may be significantly lengthened by both the delayed detection of failures and the re-execution of tasks. Suppose the failure detection time is set to the default of 10 minutes. Since, as reported in [5], the average completion time of jobs


at Google is around 15 minutes, a TaskTracker failure may increase the completion time of a job by 66.7% (the 10-minute detection window alone amounts to 10/15 ≈ 66.7% of the average runtime), which is a rather severe problem. We illustrate the impact of failures with an example in Figure 1. At the beginning, Map1, Map2 and Map3 are scheduled on tasktrackers PM1, PM2 and PM3, respectively. Task Map3 fails due to the crash of PM3 at time t1, and is then rescheduled on tasktracker PM4. Task Map3 restarts its execution at time t2, and finally completes at time t4. If there were no failure and no rescheduling, task Map3 would have finished at time t3 on TaskTracker PM3. Due to the failure, the job completion time is therefore postponed by t4 − t3.

Figure 1 Impact of Failures

3.2 Impact of Slow Slots
A slow slot postpones the execution of maps and reduces, and hence delays the job. Hadoop reduces the side effects of slow slots through speculative execution, which launches a replica for the slow task. In this way, the reduces can receive output from the replica if it finishes earlier. We illustrate the side effects of slow slots and speculative execution in Figure 2. As in Figure 1, three maps are launched on three slots. At time t1, the jobtracker detects that Map3's progress, i.e., 60%, is 20% behind Map1 and Map2, which have finished around 80% of their work. Speculative execution is then activated, and a replica of

Figure 2 Speculative Execution


map Map3, named Map3', is started on PM4. At time t3, Map3 completes before Map3', and the jobtracker then kills Map3'. In this process, although speculative execution is activated, it does not speed up job execution; in this example it actually accomplishes nothing. Note that if Map3' had been started at the beginning, it would have completed at time t2. Therefore, aggressive replication could possibly shorten the runtime of the job.
3.3 Resource Utilization
Rescheduling relies on time redundancy to tolerate failures. Speculative execution is activated only after a task's progress falls behind the others. Both of them delay a job's completion. Given this, we propose to facilitate job execution through aggressive replication, which activates replicas for a job or task as early as possible, even without knowing where failures will occur or which slots will be slow. Since aggressive replication consumes a large amount of resources, we have to make sure that the resource supply is adequate before deploying it.
Our motivation for applying aggressive replication in a Hadoop scheduler comes directly from our working experience with our testbed Hadoop system, which consists of merely nine physical machines, one of which serves as the namenode while the others are worker nodes. According to our statistics, the utilization rate of the system is very low: usually only a few experimental jobs are submitted, and the system stays idle most of the time. Therefore, it is quite natural to use aggressive replication to improve system utilization and performance.
To explore the possibility of aggressive replication in commercial systems, we examine the resource utilization rate at Alibaba (www.alibaba.com). Alibaba is the global leader in e-commerce for small businesses; it provides a platform for millions of buyers and suppliers around the world to do business online. In particular, taobao.com, founded by the Alibaba Group, is the largest online retail business in Asia. To analyze its large volume of transaction data, Alibaba has constructed a large-scale Hadoop system, which offers 22,419 map slots and 14,265 reduce slots in total. We randomly select one regular workday, Friday, November 25, 2011; we do not consider weekends because the slot utilization rate on weekends is much lower than on a workday. Figure 3 shows the slot utilization rate for maps and reduces from 0:00 to 24:00: on average, 47% of the map slots are occupied, while 42% of the reduce slots are used. We discuss the feasibility of applying aggressive replication across three periods as below:

Figure 3 Slot Utilization Rate at Alibaba

- Phase 1: The utilization rate for maps is greater than 50% from 10:00 to 17:00, and peaks at around 12:00 to 13:00, when 90% of the slots are used. This is because staff start work at 9:00 every day, and a large number of jobs are submitted shortly afterwards. Thus it is not a good idea to launch aggressive replication during this period.
- Phase 2: The utilization rate from 0:00 to 2:00, from 6:00 to 9:00 and from 18:00 to 24:00 is below 50%, and even reaches its lowest level of 5% at 22:00. Therefore, it is very suitable to activate aggressive replication during this time.
- Phase 3: The utilization rate in the early morning (i.e., from 2:00 to 5:00) increases over time because some "production" jobs run periodically every early morning. These jobs are generally launched in the early morning so as not to interfere with users' experimental jobs. Aggressive replication could be partly employed in this period given the moderate resource utilization rate, but it needs to be controlled carefully.

4 Design of ARES
4.1 Design Goals
Our scheduler is designed to enhance reliability without sacrificing the other advantages provided by existing schedulers, e.g., the Capacity scheduler [2] and the FAIR scheduler [19]. Reviewing these schedulers, we identified two key requirements in scheduler design: fairness and data locality. In addition, we take two further requirements into account in our design: reliability and future workload satisfaction. Aggressive replication is adopted to improve reliability. Since aggressive replication launches a large number of replicas, which consume a large amount of resources, we must also address future workload satisfaction in order to prevent denial of service.
4.1.1 Fairness
Allowing users to share a large cluster is attractive because it improves resource utilization, leading to lower costs. Hadoop's FIFO scheduler performs poorly when groups of users share the system. Because of the lack


of fairness, job response time becomes unacceptable, interactive use becomes impossible, and the utility of the system is greatly reduced [19]. Allocating resources fairly among users improves response time, especially for short-lived, late-submitted jobs. In particular, the FAIR scheduler distributes slots across "pools", each of which is guaranteed a fraction of the capacity, while the Capacity scheduler [2] groups jobs into "queues."
The FAIR scheduler [19] is designed to provide a fair schedule for all jobs. In particular, it treats slots at two different levels: at the top level, FAIR allocates task slots across "pools", and at the second level, each pool allocates its slots among the jobs in the pool. A minimum share is assigned to each pool; the FAIR scheduler gives priority to meeting the minimum shares, and then distributes the remaining slots starting at the emptiest pool. The Capacity scheduler [2] also aims to distribute slots in a fair way. Jobs are submitted to queues, and each queue is allocated a minimum fraction of the capacity. The Capacity scheduler guarantees this minimum fraction, and can further allocate free slots to queues beyond their capacity. Both the FAIR and Capacity schedulers support job priorities; that is, jobs with higher priority are allocated resources first.
Distributing jobs into "pools" (e.g., FAIR) or "queues" (e.g., Capacity scheduler) improves fairness between users and prevents jobs submitted later from being blocked for a long time. Moreover, supporting priority ensures that urgent or important jobs can be scheduled earlier than others. In our proposal, we preserve this design for fairness. However, since both the FAIR and Capacity schedulers schedule at slot-level granularity, the variation in slot capacity in heterogeneous systems can result in unfairness between users. For example, suppose two slots with different speeds are available in the system, and two jobs with different priorities are to be scheduled. Under FAIR or the Capacity scheduler, the low-speed slot may be allocated to the high-priority job, while the high-speed slot is allocated to the low-priority job. In this case, the high-priority job has to execute for a longer time, which is unfair to it. Therefore, in our design we further consider the heterogeneity of slots, and provide a scheme to quantify their differences. Given this, our scheduler works well even in a heterogeneous system.
4.1.2 Data Locality
Because the network bisection bandwidth in a large cluster is much lower than the aggregate bandwidth of the disks in the machines, improving data locality can significantly improve performance [6]. Data locality has been considered in existing schedulers, e.g., FAIR and the Capacity scheduler, and delay scheduling [20] further greatly improved data locality for the FAIR scheduler.


Our scheduler still regards data locality as an important requirement, and turns it into a quantitative factor.
4.1.3 Future Workload Satisfaction
Since resources are limited, aggressive replication must not block the execution of future jobs; that is, adequate resources must be preserved to ensure the execution of coming jobs. To predict the workload, we use exponential smoothing, a very popular scheme for producing smoothed data for presentation or for making forecasts. Exponential smoothing comes in three variants: single, double and triple. Single exponential smoothing (SES) relies on previous values and the smoothed value, and lags behind the current value; double exponential smoothing (DES) enhances SES by including a component that picks up trends in the data; triple exponential smoothing (TES) further enhances DES by considering seasonality in the data. In Figure 4, the black line shows the original data generated by the Alibaba Hadoop cluster within one week, i.e., from Jan. 10, 2012 to Jan. 13, 2012. The original data fluctuates over a 1-day cycle, but shows some slight variations. For example, the map slot utilization rate peaks at 12:00 on Jan. 10 and Jan. 13, but peaks at 4:00 on Jan. 11 and Jan. 12. These variations increase the difficulty of forecasting. We experimentally studied the performance of SES, DES, and TES, and conclude that TES with α = 0.4, γ = 0.1, and β = 0.4 yields the minimum MSE (mean of the squared errors), followed by SES with α = 0.99.
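For reference, a minimal Python sketch of triple exponential smoothing in its additive (Holt-Winters) form follows. The additive formulation, the mapping of α, β and γ onto level, trend and seasonality, and the 24-hour season length are illustrative assumptions; only the smoothing-constant values above come from the evaluation.

```python
def triple_exponential_smoothing(series, season_len, alpha, beta, gamma, n_forecast):
    """Additive Holt-Winters smoothing: level + trend + seasonal components."""
    # Initialize level, trend and one full season of seasonal components.
    level = series[0]
    trend = (series[season_len] - series[0]) / season_len
    seasonal = [series[i] - level for i in range(season_len)]
    smoothed, forecast = [], []
    for t, value in enumerate(series):
        s = seasonal[t % season_len]
        last_level = level
        # Update level, trend and the seasonal component for this position.
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (value - level) + (1 - gamma) * s
        smoothed.append(level + trend + seasonal[t % season_len])
    for h in range(1, n_forecast + 1):
        forecast.append(level + h * trend + seasonal[(len(series) + h - 1) % season_len])
    return smoothed, forecast

# Example usage on hourly slot-utilization samples (placeholder data):
# utilization = [...hourly map-slot utilization rates...]
# _, next_day = triple_exponential_smoothing(utilization, 24, 0.4, 0.1, 0.4, 24)
```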

Figure 4 Exponential Smoothing

DES performs the worst among the three methods. Note that although TES performs best, it still yields a relatively large MSE. In our experiments, we temporarily employ TES as our workload forecasting solution; however, we aim to explore more accurate workload prediction in the future.
4.1.4 Reliability
The major reason for employing aggressive replication is to provide high reliability for job execution. That is, a


job can proceed without any degradation in performance even if some tasks fail. However, we cannot replicate all tasks, due to the limited resources in dedicated Hadoop systems; for example, when slot utilization exceeds 50%, it is impossible to give every task a backup. Hence, we intend to replicate the tasks that need it most. Deciding which task should be replicated requires knowledge of the reliability of each task: tasks executing on less reliable machines should be replicated first. Note that this may require the scheduler to know the reliability of each physical machine. The ability to predict failures has been explored widely in recent years [14], but few of these techniques have been extensively applied in practice.
4.2 Picking a Job for Aggressive Replication
We now present the design of ARES in response to the above requirements. A job with the following properties will be chosen for replication: (1) a job with higher user priority is preferred over jobs with lower priority; (2) a job with lower fairness is preferred over jobs with higher fairness; (3) a job with lower reliability is preferred over jobs with higher reliability. Based on these design principles, we quantify the corresponding parameters as below:
(1) Priority is defined for every user. Suppose there are five different levels of priority; they can be quantified as pu ∈ {0.2, 0.4, 0.6, 0.8, 1.0}, with a greater value implying higher priority.
(2) Fairness is also evaluated for every user. Denote by ald the resources allocated to a user, and by max the maximum limit of resources for that user. Then fairness can be computed as the ratio of allocated resources to guaranteed capacity: fu = ald / max.
(3) Reliability is defined in terms of the number of replicas of a job; that is, a job with more replicas is considered more reliable than others. We could use the ratio nrep / n to indicate the reliability of a job, where nrep denotes the number of replicas and n denotes the number of tasks of the job. When nrep / n approaches 1, every task of the job has on average been replicated once. However, the number of tasks of a job may distort this intention. For example, suppose two jobs j1 and j2 are running in the system and are submitted by the same user, so both show the same fairness level and the same priority. Suppose job j1 consists of 9 maps and 1 reduce, and job j2 consists of 999 maps and 1 reduce. Then assigning a replica to j1 changes its value of nrep / n to 0.1, whereas assigning a replica to

j2 changes its nrep / n to only 0.001. If we keep activating the replication process for both jobs, j2 will be chosen for replication repeatedly until it has more than 100 replicas, because until then its nrep / n stays below 0.1. Only after 100 replicas have been launched for j2 is j1 chosen, after which j2 is chosen again, and this process continues until no more slots are available. In practice, we would prefer to launch replication for j1 earlier, because j2 is dominating the replica slots, which is unfair to j1. In order to reduce the impact of the number of tasks of a job, we define

the reliability of a job as rj = nrep · log(n) / n. With this definition, j1 is chosen for replication only after 33 replicas have been dispatched for j2 (with one replica already assigned to j1, j2 must accumulate roughly (1000/10) · (log 10 / log 1000) ≈ 33 replicas before its rj catches up with j1's).
(4) Progress rate (denoted by ε) is defined to describe the progress of a job: ε = 1 means the job has finished successfully, while ε = 0 means the job has not yet started.
We then compute the importance value v(j) for a job j:



(1)
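For illustration, the job-importance value can be realized as in the Python sketch below; the specific formula v(j) = pu · (1 − ε) / ((fu + c) · (rj + c)) and the constant c are assumptions that merely respect the stated properties (proportional to priority, inversely proportional to fairness and reliability), not the exact form of Equation (1):

```python
# Illustrative job-importance score; the exact form of Equation (1) may differ.
# c avoids division by zero for jobs with no replicas or no allocation yet.
import math

def job_reliability(n_replicas, n_tasks):
    # r_j = n_rep * log(n) / n, the job-level reliability defined above.
    return n_replicas * math.log10(n_tasks) / n_tasks

def job_importance(priority, fairness, reliability, progress, c=0.1):
    # Higher priority raises the score; higher fairness, reliability and
    # progress lower it, so needier jobs are replicated first.
    return priority * (1.0 - progress) / ((fairness + c) * (reliability + c))
```

With such scores, ARES simply picks the running job with the maximum v(j).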

With Equation (1), v(j) is proportional to the job priority, and inversely proportional to the fairness and the reliability, which is consistent with our initial intention.
4.3 Picking a Task for Aggressive Replication
Since a job typically comprises many maps and reduces, we now turn to selecting a task from the chosen job for replication. Because replicating a reduce task may place a much heavier communication burden on the network than replicating a map task, we treat map and reduce tasks differently.
4.3.1 Picking a Map Task
When a heartbeat arrives and declares an available slot, task replication is activated following the principles below:
(1) If a node has a high probability of delaying task progress, then tasks running on it are preferred for replication.
(2) A task that is near completion should preferably not be replicated; in particular, do not replicate a task if its estimated finish time on the arriving node is later than that of the original task.
(3) A task with no replicas is preferred over tasks that already have replicas.
(4) It is better to dispatch a replica on a node that holds the target data locally, i.e., to improve data locality.
Based on these design principles, we quantify the importance of each map of the selected job as below:
(1) Denote by s the probability of task delay. We compute s from historical data: suppose a


slot has executed n tasks so far, and t of them were delayed because of failures or other reasons; then s = t / n.
(2) Progress rate (denoted by ε) is likewise defined to describe the progress of a task. If multiple replicas exist for a task, we take ε to be the maximum value among them. Since a job generally consists of a number of map and reduce tasks, the job progress (ε) can be roughly estimated by averaging the progress of its tasks.
(3) Denote by nr the number of replicas of a task.
(4) We use two values, l ∈ {0.4, 0.6}, to indicate data locality, where 0.4 means the target data is not at the local site and 0.6 means it is. We use these two values because a binary choice of 0 or 1 is not appropriate for the importance value calculation.
The importance value of a map task m is then computed as:

(2)
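Similarly, a Python sketch of one possible map-importance score is given below; the exact combination is an assumption rather than the precise Equation (2), though it keeps the 1/(1 − s) amplification and the stated monotonicities:

```python
def map_importance(locality, delay_prob, n_replicas, progress):
    """Illustrative map-task score: higher locality, higher delay probability,
    fewer replicas and slower progress all increase the score."""
    # locality   : 0.6 if the target data is local to the free slot, else 0.4
    # delay_prob : s = t / n, the slot's historical delay ratio (s < 1)
    # n_replicas : number of replicas already dispatched for this map
    # progress   : epsilon in [0, 1], taken over the fastest replica
    return locality * (1.0 - progress) / ((1.0 - delay_prob) * (n_replicas + 1))
```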

With Equation (2), the map task with higher locality, a higher task delay ratio, fewer replicas and slower progress is selected for replication. Note that we use 1/(1 − s) to amplify the impact of s, because of its importance for improving finish time.
4.3.2 Picking a Reduce Task
In the MapReduce framework, a map takes a set of data chunks and produces key/value pairs, and a reduce merges them. If a map or its replicas are not dispatched on a slot holding the target data, the target data has to be transferred over the network. Replication could therefore increase communication and place an excessive burden on the network. Fortunately, the Hadoop Distributed File System (HDFS) generally stores several copies of the data to improve reliability (by default, 3 copies). Since the copies are dispersed geographically, we are able to reduce the negative impact on the network by taking data locality into account. Moreover, ideas such as delay scheduling [20] can also be adopted to further reduce network latency. Therefore, although replicating map tasks may burden the network, it is possible to limit this impact, to some extent, by improving data locality.
For a reduce task, however, we are unable to reduce the network burden through improved data locality. Because the input of a reduce comes from the outputs of the maps, if a reduce is replicated once, its input data has to be transferred to both the reduce and its replica. Hence, one replica of a reduce would double the


Figure 5 ARES Scheduler

network communication. Given this, we do not adopt aggressive replication for reduce tasks; replicating a reduce task when its progress falls far behind the average level, or when a failure occurs, is still supported.
4.4 The ARES Scheduler
We illustrate the design of the ARES scheduler in Figure 5. ARES is implemented and deployed at the jobtracker. Datanodes periodically send heartbeats to the jobtracker, and the jobtracker gives priority to jobs in the waiting queue: replication is started only after the jobs in the waiting queue have been scheduled. ARES focuses on improving reliability for running jobs, and does not specify rules for scheduling newly arriving jobs; instead, it relies on other schedulers, such as FIFO, the Capacity scheduler or the FAIR scheduler, to schedule new jobs. Therefore, ARES is not intended to replace the other schedulers, but co-works with them to improve reliability using idle resources.
Whenever free slots are available at a datanode and there are no waiting jobs, ARES is activated to decide whether to replicate running tasks. First, ARES evaluates the importance of all running jobs using Equation (1), and selects the one with the maximum v(j). Second, ARES evaluates the importance of all map tasks of the selected job using Equation (2), and selects the one with the maximum v(m); for example, map M3 of Job3 is selected in Figure 5. Then, ARES estimates the execution time of the selected task. Denote by TM3 the execution time of M3: replication is activated only if the predicted workload arriving within TM3 can still be satisfied after deploying a replica of M3 on the free slot; otherwise, ARES does not activate replication. Because system performance would be severely degraded if the slots were fully occupied, we do not use all free slots for replication. Instead, we set a threshold θ: if the fraction of slots occupied by all running tasks and replicas exceeds θ, we do not activate replication. The above process is repeated until the occupied slots exceed θ, or every map has been deployed with at least three replicas.
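The overall decision procedure can be summarized by the following Python sketch; the object attributes, helper callables and return convention are simplified stand-ins for the jobtracker state, not the actual ARES implementation:

```python
MAX_REPLICAS = 3  # stop replicating a map once it already has three replicas (Sec. 4.4)

def ares_on_free_slot(running_jobs, free_slot, cluster,
                      job_score, map_score, forecast_demand, theta=0.8):
    """Decide whether to replicate a running map task on a newly free slot.

    running_jobs    : running job objects, each with a .maps list of map tasks
    job_score       : callable scoring a job, in the spirit of Equation (1)
    map_score       : callable scoring a map task, in the spirit of Equation (2)
    forecast_demand : callable returning the predicted slot demand over a horizon
                      (e.g. from the triple-exponential-smoothing forecast)
    theta           : occupancy threshold beyond which no replication is started
    """
    if cluster.waiting_jobs:                  # waiting jobs are scheduled first
        return None
    if cluster.occupied_fraction() >= theta:  # keep headroom for performance
        return None

    # Step 1: pick the running job with the highest importance.
    job = max(running_jobs, key=job_score)

    # Step 2: pick that job's most important map task, skipping maps that
    # already reached the replica cap.
    candidates = [m for m in job.maps if m.n_replicas < MAX_REPLICAS]
    if not candidates:
        return None
    task = max(candidates, key=map_score)

    # Step 3: replicate only if the forecast workload during the replica's
    # estimated runtime can still be satisfied with one more slot in use.
    horizon = task.estimated_runtime(free_slot)
    if forecast_demand(horizon) + 1 > cluster.free_slots():
        return None

    return (task, free_slot)  # the caller launches the replica on this slot
```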


5 Performance Evaluation
We use the Hadoop simulator MRSim [7] to evaluate the performance of our proposal. MRSim is a MapReduce simulator based on discrete event simulation that accurately models the Hadoop environment. On the one hand, the simulator allows us to measure the scalability of MapReduce-based applications easily and quickly; on the other hand, it captures the effects of different Hadoop configurations on performance.
5.1 Systems
We simulate 20 physical machines in our system, all configured with the same capacity. Table 1 shows the configuration of each server. We configure every machine with 10 map slots and 1 reduce slot, meaning that every machine can run at most 10 maps and 1 reduce; in total there are 200 map slots and 20 reduce slots in the system. The CPU speed of each machine is set to 5.0 × 10^5 MIPS. The capacity of the hard disk of each machine is set to 40,000, with read and write speeds of 30,000 per second.
5.2 Jobs
We simulate 8 types of jobs, initialized with different numbers of maps ranging from 20 to 160. The settings of these jobs are listed in Table 2.

Table 1 The Configuration of Each Server

Server                  Configuration
# of maps               10
# of reduces            1
# of CPU cores          4
CPU speed (MIPS)        5.0 × 10^5
HardDisk capacity       40,000
HardDisk read speed     30,000
HardDisk write speed    30,000

Table 2 The Job Settings

Job     # of maps    # of reduces    Input data    Replicas
Job1    20           6               162,000       3
Job2    40           6               162,000       3
Job3    60           6               162,000       3
Job4    80           6               162,000       3
Job5    100          6               162,000       3
Job6    120          6               162,000       3
Job7    140          6               162,000       3
Job8    160          6               162,000       3

All jobs are set with 6 reduces, and the size of the target data is configured as 162,000. HDFS deploys 3 replicas of each file, which are spread over the cluster. The other settings are consistent with the sample job file in MRSim [7].
5.3 Results
MRSim [7] extends the discrete event engine of SimJava [3] to accurately simulate the Hadoop environment. We implemented and deployed ARES in the simulated jobtracker of MRSim. Jobs are randomly selected from the 8 types of configurations (Table 2) and are submitted continuously. By default, the jobtracker dispatches them using the FIFO scheduler. ARES is activated only when there are free slots for replication and the occupied slots do not exceed 80%.
Figure 6 shows the execution time of jobs with different numbers of maps. In this experiment, all jobs are initialized with the same amount of target data, a number of maps ranging from 20 to 160, and 6 reduces. As shown in Figure 6, job execution time decreases as the number of maps grows, especially when the number of maps increases from 20 to 60. This is because the target data is divided into equal pieces based on the number of maps: with more maps for the same target data, the map execution time decreases accordingly (Figure 7). However, the decrease in execution time becomes much slower when the number of maps exceeds 60, and the execution time even increases when the number of maps grows from 120 to 140. This is because increasing the number of maps cannot improve execution efficiency without limit, similar to Amdahl's law in parallel computing; moreover, it is costly to manage a large number of maps.
Compared with no replication, the execution time achieved by ARES is much shorter. This is because the simulated HDFS file system stores three copies of all data, so replicating maps can increase data locality and hence reduce network communication. Moreover, a reduce task can start processing the output of a map task as soon as one of the map's replicas finishes, which also shortens the overall execution time. Consistent with the above observation, the performance improvement becomes less significant when there are more than 80 maps.
Figure 7 shows the execution time of maps in the same experiment. Since we only replicate maps in our scheduler, the improvement in job execution time mainly comes from the improvement in map execution time; therefore, the changes in map execution time are quite consistent with those shown in Figure 6. For the reduce tasks, however, Figure 8 shows that there is no big difference between no replication and ARES.


Figure 6 Job Execution Time



Figure 7 Map Execution Time

Figure 8 Reduce Execution Time

The current implementation of MRSim does not support injecting failures into machines, so we did not evaluate the performance of ARES when machine failures occur. We are confident that a job would still proceed when a task fails, provided replicas have been deployed for that task, and we expect the job execution time in that case to be comparable to the failure-free case. These experiments will be included in our future work.

6 Conclusion and Future Work
We have presented an aggressive replication enabled scheduler for Hadoop systems, designed with fairness, reliability, data locality and future workload satisfaction in mind. ARES is well suited to systems with plenty of free resources, which can be used to improve reliability. Since ARES consumes a large amount of resources, we propose using an exponential smoothing method to predict the future workload and to avoid possible denial of service by reserving sufficient resources for incoming jobs; this is very helpful for deploying ARES in practical or even commercial systems. The experiments show that applying aggressive replication in the scheduler can not only tolerate failures but also improve job execution time.
In the future, we intend to implement ARES in a real Hadoop system, and to extend our work using data from commercial systems. We will conduct more experiments on performance evaluation, especially in the presence of failures. Since system performance, including network latency and resource utilization, is very sensitive to replication, we would also like to learn more about the negative impact caused by replication in real systems, and further optimize ARES to reduce such impact.

Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61100194, 61272188), the National 973 Program (Grant No. 2013CB329102), the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (Grant No. SKLNST-2013-1-14), the Natural Science Foundation of Zhejiang Province (Grant No. LY12F02005), the Scientific Research Fund of the Zhejiang Provincial Education Department (Grant No. Y201120356), the Tianjin City Application Foundation and Cutting-edge Technology Research Program (Youth Project, Grant No. 14JCQNJC00500), and the Innovation Fund of Tianjin University (Grant No. 2013XQ-0061).

References
[1] Apache Hadoop, 2012, http://hadoop.apache.org/
[2] Hadoop MapReduce Next Generation -- Capacity Scheduler, 2013, http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
[3] SimJava, 2013, http://www.dcs.ed.ac.uk/home/hase/simjava/
[4] Hyunseok Chang, Murali S. Kodialam, Ramana Rao Kompella, T. V. Lakshman, Myungjin Lee and Sarit Mukherjee, Scheduling in MapReduce-Like Systems for Fast Completion Time, Proc. IEEE INFOCOM, Shanghai, China, April, 2011, pp.3074-3082.


[5] Jeffrey Dean, Experiences with MapReduce, an Abstraction for Large-Scale Computation, Proc. ACM PACT, Seattle, WA, September, 2006, doi:10.1145/1152154.1152155.
[6] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, Vol.51, No.1, 2008, pp.107-113.
[7] Suhel Hammoud, Maozhen Li, Yang Liu, Nasullah Khalid Alham and Zelong Liu, MRSim: A Discrete Event Based MapReduce Simulator, Proc. IEEE FSKD, Yantai, China, August, 2010, pp.2993-2997.
[8] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell and Dennis Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, Proc. ACM EuroSys, Lisbon, Portugal, March, 2007, pp.59-72.
[9] Kamal Kc and Kemafor Anyanwu, Scheduling Hadoop Jobs to Meet Deadlines, Proc. IEEE CLOUDCOM, Indianapolis, IN, November/December, 2010, pp.388-392.
[10] Gunho Lee, Byung-Gon Chun and Randy H. Katz, Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud, Proc. HotCloud, Portland, OR, June, 2011, https://www.usenix.org/legacy/events/hotcloud11/tech/final_files/Lee.pdf
[11] Benjamin Moseley, Anirban Dasgupta, Ravi Kumar and Tamás Sarlós, On Scheduling in Map-Reduce and Flow-Shops, Proc. ACM SPAA, San Jose, CA, June, 2011, pp.289-298.
[12] Jorda Polo, David Carrera, Yolanda Becerra, Malgorzata Steinder and Ian Whalley, Performance-Driven Task Co-scheduling for MapReduce Environments, Proc. IEEE/IFIP NOMS, Osaka, Japan, April, 2010, pp.373-380.
[13] Aysan Rasooli and Douglas G. Down, An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems, Proc. CASCON, Toronto, Canada, November, 2011, pp.30-44.
[14] Felix Salfner, Maren Lenk and Miroslaw Malek, A Survey of Online Failure Prediction Methods, ACM Computing Surveys, Vol.42, No.3, 2010, doi:10.1145/1670679.1670680.
[15] Thomas Sandholm and Kevin Lai, Dynamic Proportional Share Scheduling in Hadoop, Proc. JSSPP, Atlanta, GA, April, 2010, pp.110-131.
[16] Bianca Schroeder and Garth A. Gibson, A Large-Scale Study of Failures in High-Performance Computing Systems, IEEE Transactions on Dependable and Secure Computing, Vol.7, No.4, 2010, pp.337-350.
[17] Chao Tian, Haojie Zhou, Yongqiang He and Li Zha, A Dynamic MapReduce Scheduler for Heterogeneous Workloads, Proc. GCC, Lanzhou, China, August, 2009, pp.218-224.
[18] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, CA, 2009, pp.74-87.
[19] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker and Ion Stoica, Job Scheduling for Multi-User MapReduce Clusters, Technical Report No. UCB/EECS-2009-55, EECS Department, University of California, Berkeley, April, 2009.
[20] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker and Ion Stoica, Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Proc. ACM EuroSys, Paris, France, April, 2010, pp.265-278.
[21] Wei-Tsong Lee, Ming-Zhi Wu, Hsin-Wen Wei, Fang-Yi Yu and Yu-Sun Lin, Dynamically Iterative MapReduce, Journal of Internet Technology, Vol.14, No.6, 2013, pp.953-962.

Biographies

Yizhi Ren received his PhD degree from the School of Software, Dalian University of Technology, China in 2011. He is currently an assistant professor at the School of Software Engineering, Hangzhou Dianzi University, China. His research interests include security issues, evolutionary game theory, and social computing in social networks.

Laiping Zhao received his PhD degree from the Department of Informatics, Kyushu University, Japan in 2012. He is currently an assistant professor at the School of Computer Software, Tianjin University, China. His research interests include optimization and design of large-scale distributed systems and cloud services.

Haiyang Hu received his PhD degree from the Department of Computer Science and Technology, Nanjing University, China in 2006. He is currently a professor at the School of Computer, Hangzhou Dianzi University, China. His research interests include mobile computing and distributed computing.