Job Aware Scheduling in Hadoop for Heterogeneous Cluster

Supriya Pati
Computer Engineering Department, Sarvajanik College of Engineering and Technology, Surat, Gujarat
[email protected]

Mayuri A. Mehta
Computer Engineering Department, Sarvajanik College of Engineering and Technology, Surat, Gujarat
[email protected]
Abstract— A Hadoop cluster is specifically designed to store and analyze large amounts of data in a distributed environment. With the ever increasing use of Hadoop clusters, a scheduling algorithm is required for optimal utilization of cluster resources. Existing scheduling algorithms suffer from one or more of the following crucial problems: limited utilization of computing resources, limited applicability to heterogeneous clusters, random scheduling of non-local map tasks, and negligence of small jobs in scheduling. In this paper, we propose a novel job aware scheduling algorithm that overcomes these limitations. In addition, we analyze the performance of the proposed algorithm using the MapReduce WordCount benchmark. The experimental results show that the proposed algorithm increases resource utilization and reduces the average waiting time compared to the existing Matchmaking scheduling algorithm.

Index Terms— Hadoop, scheduling, MapReduce, job aware.
I. INTRODUCTION

In today's world, large amounts of data are generated and processed for various applications such as search engines, web indexing, web crawling, data mining, ad optimization, and machine learning. A Hadoop cluster processes such large amounts of data in parallel on large clusters of commodity hardware in a reliable and fault-tolerant manner. Hadoop consists of a number of project components such as MapReduce (MR), the Hadoop Distributed File System (HDFS), HBase, Pig, Hive, ZooKeeper, Chukwa, and Avro [1]. Amongst these components, MapReduce is vital as it provides a parallel programming model to distribute and execute data intensive jobs [2]. The MapReduce programming model contains Map tasks and Reduce tasks. A Map task takes key-value pairs as input and generates a set of key-value pairs as intermediate output. Map tasks can further be classified as local map tasks and non-local map tasks. The intermediate key and its list of values are given as input to the Reduce tasks, which produce a reduced set of values as output [2]. As the number of data intensive jobs submitted by users increases, the load of the cluster increases. In order to manage this load, a scheduling approach is needed to improve the overall cluster performance. A scheduling approach should consider data locality, synchronization, and fairness to improve the performance of the cluster [3][4]. Data locality ensures that the
map tasks will be executed on the node containing the input data. Synchronization allows transmission of intermediate output from the map tasks to the reduce tasks. The fairness criterion assures that a fair share of resources is allocated to multiple jobs and multiple users. Though numerous job scheduling algorithms are available, the development of an effective job scheduling algorithm that overcomes the following limitations of traditional job scheduling algorithms is crucial [3-8]:
• Limited resource utilization of the cluster [5-8]
• Limited applicability to heterogeneous clusters [8]
• Random scheduling of non-local map tasks [5-8]
• Negligence of small jobs in scheduling [5-8]
To overcome these limitations, we propose a novel job aware scheduling algorithm in Hadoop. The proposed algorithm schedules non-local map tasks based on three criteria: 1) job execution time, 2) earliest deadline first, and 3) workload of the job. Scheduling non-local map tasks with minimum job execution time or earliest deadline first decreases the average waiting time. Scheduling non-local map tasks considering the workload of the job results in efficient utilization of resources. In addition, the proposed algorithm is applicable to heterogeneous clusters.
The remainder of the paper is organized as follows. Section II presents the related work. Section III describes the fundamentals of job scheduling in Hadoop and a parametric evaluation of existing job scheduling algorithms. In Section IV, we discuss the proposed job aware scheduling algorithm. Section V provides a performance analysis of the proposed algorithm. Finally, Section VI presents the conclusion and future work.

II. RELATED WORK

Job scheduling algorithms have been studied abundantly in the literature and hence several job scheduling algorithms are available [3-13]. The default Hadoop FIFO scheduler schedules jobs on a first come, first served basis [4]. However, it does not provide fairness.
The Fair scheduler provides a fair share of resources to users [5] and is extensible towards better resource utilization. Yahoo's Capacity scheduler provides a guaranteed share of cluster capacity to jobs using
hierarchical queues [6]. However, job pre-emption is not permissible. Synchronization is one of the major factors affecting scheduling decisions. A MapReduce cluster decreases synchronization overhead by re-scheduling a speculative copy of late mappers on another node. In [9][10][11], speculative execution of tasks is performed in a heterogeneous environment; however, data locality for launching speculative map tasks can be improved further. In [12][13], the synchronization overhead is overcome by using asynchronous processing. The approach in [12] initially starts a predefined number of reduce tasks and reduces the results collected from map tasks incrementally. In [13], two levels of the map-reduce phase are implemented: local and global. Constraint based schedulers such as priority, deadline, and resource aware schedulers schedule tasks based on the priority, deadline, and resource consumption of the job [4]. In [7], the ordering of jobs for task allocation is relaxed to increase the data locality rate. The Matchmaking algorithm aims at providing each slave node an equal opportunity to grasp local map tasks [8]. However, there is scope for improvement by extending its applicability to heterogeneous clusters, scheduling non-local map tasks based on specific criteria, and considering better utilization of resources.
Thus, several job scheduling algorithms have been proposed for MapReduce in the literature [3-13]. However, reduction in average waiting time has not been a focus. Although existing algorithms attempt to enhance system performance, some of them have been developed assuming a homogeneous cluster [4][8][12-13]. Several existing job scheduling algorithms are limited in resource utilization [5-8]. Further, to the best of our knowledge, no effort has been made to schedule non-local map tasks efficiently based on job awareness. Hence, we propose a novel job aware scheduling algorithm in Hadoop.
In the proposed job scheduling algorithm, we preserve the data locality rate by scheduling the local map tasks first. Considering the importance of small jobs, which have a higher probability of having a large number of non-local map tasks, we schedule non-local map tasks based on job awareness [7]. We schedule non-local map tasks based on the following three criteria: job execution time, earliest deadline first, and workload of the job.
III. JOB SCHEDULING IN HADOOP

Hadoop follows a master-slave architecture. It comprises two main components: MapReduce and HDFS. MapReduce is a parallel programming model that handles data intensive applications [1]. Any application that is to be executed on the MapReduce paradigm is known as a MapReduce job. A job is composed of a number of tasks classified as map tasks and reduce tasks. Map tasks execute the map function while reduce tasks execute the reduce function. In Fig. 1, we illustrate the execution of a MapReduce job [2].

Fig. 1. Execution of MapReduce Job.

The MapReduce programming model comprises a JobTracker (master node) and several TaskTrackers (slave nodes). The JobTracker (JT) is a process that handles and allocates jobs in the JobQueue. A TaskTracker (TT) is a process that sends a heartbeat to the JobTracker and in response receives a task to be executed on a particular node [14][15]. The file system component of Hadoop is HDFS [1]. Large files in HDFS are split into blocks of 64 MB or 128 MB. HDFS comprises two modules: the NameNode and the DataNode. The NameNode is responsible for storing all the metadata, and the DataNode stores the actual input data for the tasks to be executed.

Scheduling in Hadoop has been a great area of research in the past few years. In Table I, we present a parametric evaluation of the existing job scheduling algorithms [4-9][12][13]. Based on our study, we have identified the following imperative parameters to evaluate the algorithms: data locality rate, synchronization overhead, fairness, average response time, job execution time, and average waiting time. A map task that has been executed on a node containing its input data is known as a local map task [3]. The data locality rate (lr) is defined as the ratio of the number of local map tasks tl to the total number of map tasks m:

lr = tl / m    (1)

In order to send output to reduce tasks, a map task has to synchronize and wait for all other map tasks, which leads to synchronization overhead [3]. The fairness criterion deals with fair sharing of resources among the jobs [3].

TABLE I. PARAMETRIC EVALUATION OF JOB SCHEDULING ALGORITHMS
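As a minimal illustration (not part of the paper's implementation), the data locality rate of Eq. (1) can be computed directly from a list of task placements; the node names below are hypothetical.

```python
# Sketch of Eq. (1): lr = tl / m, where tl is the number of map tasks
# that ran on the node holding their input split. Node names are
# illustrative, not taken from the paper.
def locality_rate(placements):
    """placements: list of (execution_node, input_data_node) pairs,
    one per map task. Returns lr in [0, 1]."""
    if not placements:
        return 0.0
    local = sum(1 for run_on, data_on in placements if run_on == data_on)
    return local / len(placements)

# 3 map tasks: two data-local, one non-local -> lr = 2/3.
placements = [("node1", "node1"), ("node2", "node2"), ("node3", "node1")]
print(locality_rate(placements))
```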
2015 IEEE International Advance Computing Conference (IACC)
The average response time (ART) of the total number of map tasks can be defined as:

ART = lr * ARTl + (1 - lr) * ARTnl    (2)

where lr is the data locality rate, ARTl denotes the average response time of the local map tasks, and ARTnl represents the average response time of the non-local map tasks [8]. Job execution time is the time needed for a job to complete [7]. Average waiting time is the average of the waiting times of all the jobs in a job queue [7].

Thus, as per our observations, the development of an effective job scheduling algorithm is crucial, one that overcomes the limitations of traditional job scheduling algorithms identified above [3-8]: limited resource utilization of the cluster, limited applicability to heterogeneous clusters, negligence of small jobs, and random scheduling of non-local map tasks.

IV. THE PROPOSED JOB AWARE SCHEDULING ALGORITHM

The notations used in the proposed algorithm are described in Table II. Before discussing the proposed algorithm, we first describe the terminologies used in the algorithm.
• MapReduce slots: The total number of map tasks and reduce tasks that can be executed in parallel on a cluster node [1-4].
• Locality marker: Used to mark the slave nodes in order to ensure that each slave node gets an equal opportunity to grasp its local map tasks [8].
• Local map task: A map task that is executed on a slave node containing its input data [4][7-8].

TABLE II. LIST OF NOTATIONS

Notation | Description
SNi      | ith slave node, i = 1 to N
MN       | Master node
N        | Total number of slave nodes
NLi      | Locality marker of the ith slave node
sci      | Free slot count of node SNi
JQ       | Job queue
tl       | Unassigned local map task
LJQ      | Length of the job queue
Jk       | kth job of the job queue, k = 1 to LJQ
tnl      | Unassigned non-local map task
PQ       | Priority queue
Cidle    | CPU idle time of node SNi
Didle    | Disk idle space of node SNi
Ridle    | RAM idle space of node SNi
CPUk     | CPU requirement of Jk
Dk       | Disk requirement of Jk
Rk       | RAM requirement of Jk
• Non-local map task: A map task that is executed on a slave node that does not contain its input data [4][7-8].
• Heartbeat: A signal that the slave nodes send to the master node. It carries statistics about the total storage capacity, the fraction of used storage capacity, the number of unused map slots, the number of unused reduce slots, and the number of data transfers in progress [1].
• Small jobs: Jobs with a small number of input files, such as ad-hoc queries, sampling, and periodic reporting jobs [1-4].

The proposed algorithm is designed for a Hadoop cluster that consists of N nodes. Every node in the cluster is a combination of numerous resources including memory, disk, network connectivity, and processor, and every node varies in its capacity of memory, disk, and processor. The cluster consists of slave nodes {SN1, SN2, ..., SNn-1} and a single master node (MN). Each slave node has access to the master node, and it is the responsibility of the master node to analyze the available resources on the slave nodes.

In Fig. 2, we describe the proposed job aware scheduling algorithm. The master node MN maintains a job queue JQ. On the arrival of a new job, it is submitted to JQ. A job is divided into numerous map and reduce tasks, and every map task can further be classified as a local map task tl or a non-local map task tnl. Each slave node SNi has a free slot count sci and sends a heartbeat to MN. The scheduler in MN is responsible for scheduling jobs to SNi. The scheduler checks for unassigned local map tasks tl of the

Input: Job Queue, N slave nodes
Output: Result of completed jobs
// scheduling non-local map tasks based on job awareness
1. FOR i = 1 to N
2.   initialize NLi = NULL
3. FOR i = 1 to N
4.   MN receives heartbeat from SNi:
5.   WHILE sci > 0
6.     initialize flag = NLi
7.     FOR k = 1 to LJQ
8.       IF Jk contains tl then
9.         allocate tl of Jk to SNi
10.        sci = sci - 1
11.        IF NLi equals NULL then
12.          initialize NLi = 1
13.        ELSE
14.          NLi = NLi + 1
15.    IF flag equals NLi then
16.      initialize NLi = 0
17.    ELSEIF NLi equals 0 then
18.      IF (jobs are interactive)
19.        min_task(JQ, SNi)
20.      ELSEIF (jobs have a strict deadline)
21.        deadline_aware(JQ, SNi)
22.      ELSEIF (high load on cluster)
23.        workload_aware(JQ, SNi)
24.      sci = sci - 1

Fig. 2. Algorithm for the proposed job aware scheduling.
jobs in JQ. If it finds any tl, it is allocated to SNi for execution and the locality marker NLi for SNi is incremented by one. Moreover, sci of SNi is decremented by one. The process continues as the scheduler checks the subsequent jobs in JQ for tl. However, if it does not find any tl for the current heartbeat, it sets the slave node's locality marker NLi to zero and does not assign any task to SNi, because during a heartbeat all slave nodes with unused map slots are considered for the assignment of tl. If it again does not find any tl at the next heartbeat, it allocates an unassigned non-local map task tnl from JQ to SNi in order to avoid wastage of cluster resources. The selection of tnl from JQ is based on one of the following three criteria: job execution time, earliest deadline first, and workload of the job.

As per the job execution time criterion, a tnl of the job Jk with the minimum execution time is selected. Hence, the average waiting time of jobs decreases considerably. This criterion is most suitable when the majority of the jobs in the job queue are interactive jobs. The functioning of the proposed algorithm as per the job execution time criterion is shown in Fig. 3. In Fig. 4, we show the functioning of the second criterion, earliest deadline first. As per this criterion, a tnl of the Jk having the earliest deadline is selected first. This criterion is appropriate when jobs have a strict deadline. The third criterion, workload of the job, is shown in Fig. 5. The resource requirements (CPUk, Dk, Rk) of all jobs are calculated. When SNi sends a heartbeat to MN, the job from JQ whose resource requirements best fit the available resources (Cidle, Didle, Ridle) on SNi is selected, and a tnl of that job is scheduled on SNi. The best fit approach to resource allocation is more storage efficient and results in less wastage of resources compared to the first fit approach.
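The heartbeat-driven loop described above can be sketched in Python. This is a simplified, hypothetical model (the Job structure, helper names, and the non-local selection stub are our assumptions, not the paper's code), kept only to show the locality-marker logic: a node receives a non-local task only after two consecutive heartbeats without a local task.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    local_tasks: dict          # node id -> count of unassigned local map tasks
    nonlocal_tasks: int = 0    # count of unassigned non-local map tasks

def on_heartbeat(node, free_slots, job_queue, locality_marker):
    """Handle one heartbeat from `node` with `free_slots` open map slots.
    Returns the list of (job name, task kind) assignments made."""
    assigned = []
    while free_slots > 0:
        flag = locality_marker.get(node)          # NLi before this pass
        for job in job_queue:                     # scan JQ for a local task
            if job.local_tasks.get(node, 0) > 0:
                job.local_tasks[node] -= 1
                assigned.append((job.name, "local"))
                free_slots -= 1
                locality_marker[node] = 1 if flag is None else flag + 1
                break
        else:                                     # no local task found
            if flag == 0:
                # Second consecutive heartbeat without a local task:
                # fall back to a non-local task (the job would be chosen
                # by one of the three criteria; here simply the first job
                # with a non-local task left).
                for job in job_queue:
                    if job.nonlocal_tasks > 0:
                        job.nonlocal_tasks -= 1
                        assigned.append((job.name, "non-local"))
                        free_slots -= 1
                        break
                else:
                    return assigned               # nothing left to schedule
            else:
                locality_marker[node] = 0         # mark and wait a heartbeat
                return assigned
    return assigned
```

On the first heartbeat with no local work the node is only marked (NLi = 0); the fallback to non-local tasks happens on the next heartbeat, which preserves the data locality rate.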
This criterion is most appropriate for a cluster that is highly loaded.

Input: Job Queue, SNi
Output: Result of task
// schedules non-local map tasks of a job with minimum execution time
1. FOR k = 1 to LJQ
2.   insert Jk into PQ based on execution time
3. allocate tnl of Jk to SNi

Fig. 3. Algorithm for scheduling non-local map tasks based on job execution time.
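The priority-queue selection of Fig. 3 can be sketched with Python's heapq (a minimal sketch; the per-job execution-time estimates are assumed to be known, as in Table III):

```python
import heapq

# Jobs are kept in a priority queue keyed on execution time; the
# non-local map task of the job at the head is scheduled first.
def next_job_by_execution_time(job_queue):
    """job_queue: list of (execution_time, job_name) pairs."""
    pq = list(job_queue)
    heapq.heapify(pq)             # min-heap: shortest job at the root
    return heapq.heappop(pq)[1]

jobs = [(3.43, "J1"), (1.14, "J2"), (0.38, "J3"), (0.35, "J4"), (0.18, "J5")]
print(next_job_by_execution_time(jobs))  # J5, the shortest job
```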
Input: Job Queue, SNi
Output: Result of task
// schedules non-local map tasks of a job with earliest deadline first
1. FOR k = 1 to LJQ
2.   insert Jk into PQ based on deadline of Jk
3. allocate tnl of Jk to SNi

Fig. 4. Algorithm for scheduling non-local map tasks based on earliest deadline first.
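Fig. 4 follows the same priority-queue pattern, but keyed on each job's deadline. In this sketch deadlines are assumed to be comparable timestamps; the values are illustrative only:

```python
import heapq

# Earliest deadline first: the job whose deadline is soonest supplies
# the next non-local map task.
def next_job_by_deadline(job_queue):
    """job_queue: list of (deadline, job_name) pairs."""
    pq = list(job_queue)
    heapq.heapify(pq)             # min-heap ordered by deadline
    return heapq.heappop(pq)[1]

jobs = [(120, "J1"), (45, "J2"), (300, "J3")]
print(next_job_by_deadline(jobs))  # J2 has the earliest deadline
```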
Input: Job Queue, SNi
Output: Result of task
// schedules non-local map tasks of a job based on workload of the job
1. FOR k = 1 to LJQ
2.   IF Cidle > CPUk
3.     allocate tnl of Jk to SNi
4.   ELSEIF Didle > Dk
5.     allocate tnl of Jk to SNi
6.   ELSEIF Ridle > Rk
7.     allocate tnl of Jk to SNi

Fig. 5. Algorithm for scheduling non-local map tasks based on workload of the job.
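The listing in Fig. 5 allocates on the first matching resource condition, while the surrounding text describes a best-fit selection. The sketch below follows the best-fit description; the parameter names and the scalar "slack" measure (sum of leftover CPU, disk, and RAM) are our assumptions, not the paper's:

```python
def best_fit_job(jobs, c_idle, d_idle, r_idle):
    """jobs: iterable of (name, cpu_req, disk_req, ram_req) tuples.
    Returns the name of the job whose requirements fit the node's idle
    resources with the least total slack, or None if nothing fits."""
    best_name, best_slack = None, None
    for name, cpu, disk, ram in jobs:
        if cpu <= c_idle and disk <= d_idle and ram <= r_idle:
            slack = (c_idle - cpu) + (d_idle - disk) + (r_idle - ram)
            if best_slack is None or slack < best_slack:
                best_name, best_slack = name, slack
    return best_name

# A node with (4, 100, 8) idle CPU/disk/RAM: J1 does not fit, and J2
# fits more tightly than J3, so best fit picks J2.
jobs = [("J1", 8, 50, 4), ("J2", 4, 90, 8), ("J3", 1, 10, 1)]
print(best_fit_job(jobs, 4, 100, 8))  # J2
```

Best fit leaves the smallest unusable fragments of node capacity, which is why the text prefers it over first fit on a heavily loaded cluster.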
The scheduling of non-local map tasks based on job execution time and earliest deadline first reduces the average waiting time. The scheduling of non-local map tasks considering the workload of the job increases the resource utilization of the cluster.

V. PERFORMANCE ANALYSIS

The proposed algorithm is evaluated using the following two parameters: average waiting time and resource utilization of the cluster. We carry out an experimental analysis using the MapReduce WordCount job as a benchmark in Hadoop 2.2.0. We assume that 5 MapReduce jobs are submitted to the job queue at the same time. The execution time of each job is shown in Table III. As per the proposed job scheduling algorithm in Fig. 2 and Fig. 3, data locality is maintained by executing the local map tasks of each job first. Then the non-local map task of the job having the shortest execution time is scheduled. A job with a shorter execution time completes earlier than a job with a larger execution time.

We evaluate the performance of the proposed algorithm for the best case, worst case, and average case. In the best case, a job consists of all non-local map tasks, whereas in the worst case, a job consists of all local map tasks. In the average case, a job consists of both local and non-local map tasks; that is, if there are tl local map tasks, the number of non-local map tasks tnl will be m - tl, where m is the total number of map tasks and 0 ≤ tl ≤ m. We assume that for the average case there are three map tasks, out of which two are local map tasks and one is a non-local map task. In Fig. 6, we present the turnaround time of jobs in the existing algorithm [8] and the proposed algorithm for the above three cases. The waiting time of a job is calculated as the difference between the turnaround time and the execution time. The average waiting time of MapReduce

TABLE III. EXECUTION TIME OF JOBS

Job Id | Execution Time (min:sec)
J1     | 3.43
J2     | 1.14
J3     | 0.38
J4     | 0.35
J5     | 0.18
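To illustrate the waiting-time computation (a back-of-the-envelope model, not the paper's measured experiment), one can treat the Table III execution times as plain numbers and compare sequential FIFO execution against shortest-job-first ordering:

```python
def average_waiting_time(exec_times):
    """Average waiting time when jobs run one after another in the
    given order: each job waits for all jobs started before it."""
    waits, elapsed = [], 0.0
    for t in exec_times:
        waits.append(elapsed)
        elapsed += t
    return sum(waits) / len(waits)

fifo = [3.43, 1.14, 0.38, 0.35, 0.18]   # J1..J5 in arrival order (Table III)
sjf = sorted(fifo)                       # shortest execution time first
print(average_waiting_time(fifo))        # FIFO average wait
print(average_waiting_time(sjf))         # smaller under shortest-job-first
```

Ordering by shortest execution time lets the four quick jobs finish before the long J1, which is the intuition behind the waiting-time reduction reported below.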
WordCount jobs for the above discussed three cases is shown in Fig. 7. Using the proposed algorithm, the average waiting time reduces considerably in the best case because the local and non-local map tasks are scheduled based on job execution time rather than on a first come, first served basis. In the worst case, the proposed algorithm performs similar to the existing algorithm. However, the worst case is likely to occur seldom.

Fig. 6. Turnaround time of jobs in proposed and Matchmaking algorithms (panels (a)–(c)).

Fig. 7. Average waiting time of MapReduce WordCount job for proposed and Matchmaking algorithms.

In the average case, the performance is highly dependent on the number of local and non-local map tasks. It is observed that the higher the number of non-local map tasks, the lower the average waiting time.

VI. CONCLUSION AND FUTURE WORK

In this paper, a parametric evaluation of the existing job scheduling algorithms has been presented. In order to increase resource utilization and thereby reduce the average job waiting time, we have proposed a novel job aware scheduling algorithm in Hadoop that is also applicable to heterogeneous clusters. The proposed algorithm schedules jobs based on one of the following three criteria: job execution time, earliest deadline first, and workload of the job. The experimental results show that the proposed algorithm reduces the average waiting time by 79% in the best case and 23% in the average case by scheduling the jobs based on job execution time. The worst case is likely to occur seldom. In future, we aim to analyze the performance of the proposed algorithm based on earliest deadline first and workload of the job. In addition, we aim to evaluate its performance with various cluster settings and different benchmarks.

REFERENCES
[1] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, May 2010.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association, pp. 137-150, December 2004.
[3] D. Yoo and K. M. Sim, "A Comparative Review of Job Scheduling for MapReduce," in Proceedings of IEEE Cloud Computing and Intelligence Systems (CCIS), pp. 353-358, September 2011.
[4] B. Thirumala Rao and L. S. S. Reddy, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments," International Journal of Computer Applications, November 2011.
[5] "Hadoop MapReduce Next Generation – Fair Scheduler." [Online]. Available: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html [Last accessed: November 2014].
[6] "Hadoop MapReduce Next Generation – Capacity Scheduler." [Online]. Available: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html [Last accessed: November 2014].
[7] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys), ACM, pp. 265-278, 2010.
[8] C. He, Y. Lu, and D. Swanson, "Matchmaking: A New MapReduce Scheduling Technique," IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), pp. 40-47, December 2011.
[9] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," USENIX OSDI, 2008.
[10] Q. Chen, D. Zhang, M. Guo, Q. Deng, and S. Guo, "SAMR: A Self-Adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment," IEEE 10th International Conference on Computer and Information Technology (CIT 2010), pp. 2736-2743, July 2010.
[11] X. Sun, C. He, and Y. Lu, "ESAMR: An Enhanced Self-Adaptive MapReduce Scheduling Algorithm," IEEE 18th International Conference on Parallel and Distributed Systems, pp. 148-155, December 2012.
[12] M. Elteir, H. Lin, and W. Feng, "Enhancing MapReduce via Asynchronous Data Processing," in Proceedings of IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS), pp. 397-405, December 2010.
[13] K. Kambatla, N. Rapolu, S. Jagannathan, and A. Grama, "Asynchronous Algorithms in MapReduce," in Proceedings of IEEE International Conference on Cluster Computing, pp. 245-254, September 2010.
[14] J. Venner, "Tuning Your MapReduce Jobs," in Pro Hadoop, CA: Apress, 2009.
[15] T. White, "How MapReduce Works," in Hadoop: The Definitive Guide, 3rd ed. CA: O'Reilly, 2012.