Job Aware Scheduling Algorithm for MapReduce Framework

Radheshyam Nanduri, Nitesh Maheshwari, Reddy Raja and Vasudeva Varma
Search and Information Extraction Lab (SIEL), IIIT Hyderabad, India
Email: {radheshyam.nanduri, nitesh.maheshwari, reddy.raja}@research.iiit.ac.in, [email protected]

Abstract—The MapReduce framework has received wide acclaim over the past few years for large scale computing and has become a standard paradigm for batch oriented workloads. As the adoption of this paradigm has increased rapidly, scheduling of MapReduce jobs has become a problem of great interest in the research community. We propose an approach that tries to maintain harmony among the jobs running on the cluster and, in turn, decrease their runtime. In our model, the scheduler is made aware of the different types of jobs running on the cluster. The scheduler allocates a task to a node only if the incoming task does not affect the tasks already running on that node. From the list of available pending tasks, our algorithm selects the one that is most compatible with the tasks already running on that node. We bring up heuristic and machine learning based solutions to our approach and try to maintain a resource balance on the cluster by not overloading any of the nodes, thereby reducing the overall runtime of the jobs. The results show a saving in runtime of around 21% in the case of the heuristic based approach and around 27% in the case of the machine learning based approach when compared to Yahoo's Capacity scheduler.
Keywords—Cloud Computing; Job Scheduling; Machine Learning; MapReduce

I. INTRODUCTION

Cloud computing has emerged as the advanced form of distributed computing, parallel processing and grid computing. It is a new and promising paradigm delivering IT services as computing utilities. As Clouds are designed to provide services to external users, providers need to be compensated for sharing their resources and capabilities [1]. Since these computing resources are finite, there is a need for efficient resource allocation algorithms for cloud platforms. Efficient resource and data allocation would help reduce the number of virtual machines used and, in turn, reduce the carbon footprint, leading to considerable energy savings [2]. Scheduling in MapReduce can be seen as analogous to this problem. If the scheduling algorithms are designed intelligently, so that no node is overloaded and most of the resources on a particular node are utilized, the runtime of the jobs can be lowered considerably, again leading to energy savings. This paper deals with scheduling of jobs on a MapReduce cluster without degrading their runtime while still maintaining the cost savings that the providers expect.

Recently, MapReduce [3] has become a standard programming model for large scale data analysis. It has seen tremendous growth in recent years, especially for text indexing, log processing, web crawling, data mining, machine learning, etc. [4]. MapReduce is best suited for batch-oriented jobs which tend to run for hours to days over a large dataset on the limited resources of the cluster. Hence, an effective scheduling mechanism is vital to make sure that the resources are cogently used.

In this paper, we present an approach that takes into account the interoperability of MapReduce tasks running on a node of the cluster. Our algorithms try to ensure that a task running on a node does not affect the performance of the other tasks. This requires the scheduler to be aware of the resource usage information of each task running on the cluster. We present a heuristic approach as well as a machine learning based approach for task scheduling. Our algorithm tries to select, from the list of pending tasks, the task that is most compatible with the tasks already running on the node.

II. PROPOSED ALGORITHMS

In our approach, we monitor resource usage down to the level of each task and each node in the cluster, as the performance of tasks and nodes is vital in any distributed environment. Our algorithm tries to maintain stability at the node and cluster level through intelligent scheduling of the tasks. The uniqueness of this scheduler lies in its ability to take into account the resource usage pattern of a job before its tasks are scheduled on the cluster.
A. Task Characteristics

Based on the resources a task uses, it can be broadly classified as cpu-intensive, memory-intensive, disk-intensive or network-intensive. In a practical scenario, it might not be possible to place a task in exactly one of these categories. A task will have attributes of more than one category, and to perfectly describe its true nature it should be characterized as a weighted linear combination of parameters from each of these categories. We represent the true and complete nature of a task through its TaskVector T_k, defined as

    \vec{T}_k = E_{cpu} e_1 + E_{mem} e_2 + E_{disk} e_3 + E_{nw} e_4    (1)

where E_x (x is cpu, mem, disk, nw) is the resource usage pattern of a particular job for cpu, memory, disk and network respectively, with E_x taking values 0 ≤ E_x ≤ 100, and e_1, e_2, e_3, e_4 are basis vectors.
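As an illustrative sketch only (not the authors' code), the TaskVector of Equation (1) can be pictured as a small Java value class holding the four usage components together with the vector operations the rest of the paper relies on; all names here are hypothetical:

    // Hypothetical value class for the TaskVector of Equation (1).
    // Each component is a resource usage value in the range [0, 100].
    final class TaskVector {
        final double cpu, mem, disk, nw;   // E_cpu, E_mem, E_disk, E_nw

        TaskVector(double cpu, double mem, double disk, double nw) {
            this.cpu = cpu; this.mem = mem; this.disk = disk; this.nw = nw;
        }

        // Component-wise addition, used to build the compound vector of Equation (2).
        TaskVector plus(TaskVector o) {
            return new TaskVector(cpu + o.cpu, mem + o.mem, disk + o.disk, nw + o.nw);
        }

        // Component-wise subtraction, used later for T_availability and T_unused.
        TaskVector minus(TaskVector o) {
            return new TaskVector(cpu - o.cpu, mem - o.mem, disk - o.disk, nw - o.nw);
        }

        double dot(TaskVector o) { return cpu * o.cpu + mem * o.mem + disk * o.disk + nw * o.nw; }

        double magnitude() { return Math.sqrt(dot(this)); }

        // Angle in degrees between two vectors, as used by the heuristic of Section II-C.
        double angleDegrees(TaskVector o) {
            return Math.toDegrees(Math.acos(dot(o) / (magnitude() * o.magnitude())));
        }
    }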
B. Task Selection Algorithm

The JobTracker (master node) [5] receives an incoming job through the JobClient (client). The received job is queued up in the Pending Job List (J). Our Task Selection algorithm takes a pending job from J and splits it into its sub-units (map and reduce tasks).

TaskVector (T_k) Calculation: Before scheduling any task of a particular job, the algorithm calculates the TaskVector for the task. As a task can be either a map or a reduce, each task has its own corresponding Map-TaskVector (T_k^map) and Reduce-TaskVector (T_k^reduce). We assume the Map-TaskVector of a task of a particular job to be logically equivalent to the Map-TaskVector of the whole job, since the same code runs for each map task of the job; the same holds for reduce tasks. Ideally, MapReduce jobs run over a large dataset with thousands of map tasks and hence are usually very long running. To calculate T_k, an initial sample of map/reduce tasks is executed. The algorithm employs an event capturing mechanism on the TaskTrackers [5] which listens to events related to cpu, memory, disk and network to monitor the resource usage characteristics of that particular task and creates a TaskVector. The TaskVectors calculated from these few initial map/reduce tasks are averaged to get T_k. After T_k is calculated, the remaining tasks of that particular job are said to be ready for scheduling. The initial sample of tasks which are run to calculate the TaskVector process different data splits and may also be scheduled on different nodes. To overcome the minor differences in the TaskVectors generated due to the heterogeneity of the cluster and the data splits processed by the tasks, we use the average TaskVector T_k, which works as an approximate TaskVector for the rest of the tasks in that particular job.

Whenever a TaskTracker TT (slave node) has an empty slot for a task, our Task Selection algorithm checks if the TaskVector List T contains T_k^map for a job J_k from J. If T does not contain T_k^map, it schedules the TaskVector Calculation algorithm on TT to calculate T_k^map. Otherwise, if T contains T_k^map, the algorithm schedules the TaskVector Calculation algorithm on TT to calculate T_k^reduce in a similar fashion, duly taking into consideration that all the map tasks of J_k have finished. But, if the TaskVectors for a job's map and reduce components have already been calculated, the algorithm queues up the remaining map/reduce tasks into the Task Queue. This queue is filled with tasks asynchronously as new jobs arrive at the JobTracker. Our algorithm then takes the Task Queue as input and picks the task task_k that arrived first. This task_k is submitted to the Task Assignment algorithm along with the details of TT to get approval for its scheduling.

C. Task Assignment Algorithm

The Task Assignment algorithm is the second part of our scheduling algorithm. It is primarily responsible for deciding whether a task can be scheduled on a particular TaskTracker. The main objective of this algorithm is to avoid overloading any of the TaskTrackers by meticulously scheduling only compatible tasks on a particular node. By a compatible task, we mean a task that does not affect the tasks already running on that node. We present two approaches to our algorithm: a machine learning based approach and a heuristic-based approach.

1) Machine Learning Approach: In this section, we present the machine learning based variant of our algorithm. We employ an automatically supervised Incremental Naive-Bayes classifier [6], [7] to decide whether a task is compatible on a particular TaskTracker. We have used an Incremental Naive-Bayes classifier since the features used in our algorithm are independent of each other. Moreover, this classifier is fast and consumes little memory and cpu, which avoids any overhead on the scheduler. Whenever the Task Selection algorithm queries for the compatibility of a task on a TaskTracker, our algorithm computes the compatibility through the outcome of the classifier. We consider the following features to train the classifier:
•   Hardware specifications of the TaskTracker (Φ)
•   Network distance between the node of task execution and the corresponding data split of the task (Σ)
•   TaskVector of the incoming task (T_k)
•   TaskVectors of the tasks running on the TaskTracker (T_compound(i))

T_compound(i) is the vector addition of the vectors of all the tasks currently running on the TaskTracker TT_i, given by

    \vec{T}_{compound}(i) = \vec{T}_1 + \vec{T}_2 + \cdots + \vec{T}_n    (2)

where n is the number of tasks currently running on that TaskTracker. The features discussed above form the feature set F = {Φ, Σ, T_k, T_compound(i)}. Whenever the Task Selection algorithm queries the Task Assignment algorithm for the compatibility of a task task_k on a TaskTracker TT, the algorithm tests the compatibility using an Incremental Naive-Bayes classifier model. task_k = compatible denotes the event that task_k would be compatible with the other tasks running on TT. The probability P(task_k = compatible | F) is conditional on the feature set F. The classifier uses the prior knowledge it has accumulated to make decisions about the compatibility of a task. To achieve this, we compute the posterior probability P(task_k = compatible | F) using Bayes' theorem, as shown in Equation (3) below.
    TaskAssignment(task_k, TT)
        Φ = getHardwareSpecifications(TT)
        Σ = getNetworkDistance(task_k, TT)
        T_k = getTaskVector(task_k)
        T_compound = getVectorsOfTasksOnTaskTracker(TT)
        compatibility = classifier(task_k, {Φ, Σ, T_k, T_compound})
        if compatibility ≥ C_ml then
            return TRUE
        else
            return FALSE
        end if

Figure 1. Task Assignment algorithm following the machine learning approach.

Figure 2. Task Selection algorithm: the received task is tested for compatibility on an Incremental Naive-Bayes classifier and is accepted if the posterior probability is greater than or equal to C_ml.
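For illustration only, the decision procedure of Figure 1 could be realized in Java roughly as sketched below. This is not the authors' implementation: TaskVector is the hypothetical class sketched in Section II-A, and CompatibilityClassifier is a stand-in for an incremental Naive-Bayes model (Weka's NaiveBayesUpdateable [7] is one real class that could play this role); all other names are assumptions.

    // Stand-in for an incrementally trainable Naive-Bayes model.
    interface CompatibilityClassifier {
        double posterior(Object[] features);                 // P(task compatible | F)
        void update(Object[] features, boolean compatible);  // incremental re-training
    }

    // Sketch of the machine learning based Task Assignment test (Figure 1).
    final class MLTaskAssignment {
        private final CompatibilityClassifier classifier;
        private final double cMl;   // Minimum Acceptance Probability C_ml (0.45 in Table I)

        MLTaskAssignment(CompatibilityClassifier classifier, double cMl) {
            this.classifier = classifier;
            this.cMl = cMl;
        }

        // Feature set F = {Φ, Σ, T_k, T_compound}; the task is accepted only if the
        // posterior probability of compatibility is at least C_ml.
        boolean isCompatible(Object hardwareSpec, int networkDistance,
                             TaskVector tk, TaskVector tCompound) {
            Object[] features = { hardwareSpec, networkDistance, tk, tCompound };
            return classifier.posterior(features) >= cMl;
        }

        // Feedback from a TaskTracker after a task finishes: negative feedback
        // (node overload) re-trains the classifier so similar mistakes are avoided.
        void onFeedback(Object hardwareSpec, int networkDistance,
                        TaskVector tk, TaskVector tCompound, boolean wasCompatible) {
            classifier.update(new Object[] { hardwareSpec, networkDistance, tk, tCompound },
                              wasCompatible);
        }
    }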
    P(task_k = compatible \mid F) = \frac{P(F \mid task_k = compatible) \, P(task_k = compatible)}{P(F)}    (3)

The quantity P(F | task_k = compatible) thus becomes

    P(F \mid task_k = compatible) = P(f_1, f_2, \ldots, f_4 \mid task_k = compatible)    (4)

where f_1, f_2, ..., f_4 are the features of the classifier ({Φ, Σ, T_k, T_compound(i)}). We assume that all the features are independent of each other (the Naive-Bayes assumption). Thus,

    P(F \mid task_k = compatible) = \prod_{j=1}^{4} P(f_j \mid task_k = compatible)    (5)

The above equation forms the foundation of learning in our classifier. The classifier uses the results of decisions made in the past to make the current decision. This is achieved by keeping track of past decisions and their outcomes in the form of posterior probabilities. If this posterior probability is greater than or equal to the administrator-configured Minimum Acceptance Probability C_ml, then task_k is considered for scheduling on TT_i (see the algorithm in Figure 1). Our algorithm trains the classifier incrementally after every task completes. Every TaskTracker monitors itself for race conditions on its resources by checking whether it is overloaded, and after the completion of every task it sends feedback corresponding to the previous scheduling decision made on it. If negative feedback is received from a TaskTracker pertaining to a previous task allocation, the classifier is re-trained at the JobTracker (Figure 2) to avoid such mistakes in the future. This keeps the classifier model updated with the current state of the cluster.

2) Heuristic-based Algorithm: In this section we present a heuristic-based algorithm to decide whether a task is compatible on a particular TaskTracker. Whenever the Task Selection algorithm queries for the compatibility of a task on a TaskTracker, this algorithm runs the Task Compatibility Test.

Task Compatibility Test: The compound TaskVector of the tasks currently running on the TaskTracker TT_i is given by Equation (2):

    \vec{T}_{compound}(i) = \vec{T}_1 + \vec{T}_2 + \cdots + \vec{T}_n

Each TaskVector T_k in the Task Queue, for all k ∈ [1, n], is represented as

    \vec{T}_k = E_{cpu} e_1 + E_{mem} e_2 + E_{disk} e_3 + E_{nw} e_4    (6)

We obtain T_availability by calculating the difference between the total resources (denoted by T_R, the vector with maximum magnitude, i.e., 100 for each component) and T_compound(i):

    \vec{T}_{availability} = \vec{T}_R - \vec{T}_{compound}    (7)

Assuming that T_k is scheduled on the TaskTracker, the amount of unused resources on it is calculated as

    \vec{T}_{unused} = \vec{T}_{availability} - \vec{T}_k    (8)
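As a purely hypothetical numeric illustration (these values are not from the paper's experiments): if the tasks already running on a node give T_compound = (70, 40, 20, 10) and an incoming task has T_k = (50, 30, 10, 5), then Equation (7) yields T_availability = (30, 60, 80, 90) and Equation (8) yields T_unused = (-20, 30, 70, 85); the negative cpu component indicates that scheduling this task would overload the node's cpu.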
Though there are many similarity/distance calculation functions, we have chosen cosine similarity because the angle between two vectors gives an easy measure of the similarity or dissimilarity of the resource usage patterns of tasks. Since the resource usage patterns of T_k and T_compound(i) should be dissimilar (the tasks should not conflict with each other), the pattern of T_k should be similar to that of T_availability(i). The angle (in degrees) between the vectors is given by

    angle = \arccos \left( \frac{\vec{T}_k \cdot \vec{T}_{availability}(i)}{|\vec{T}_k| \, |\vec{T}_{availability}(i)|} \right)    (9)

The lesser the angle, the lesser the dissimilarity between T_k and T_availability(i) (and the greater the dissimilarity between T_k and T_compound(i)), and hence the better the compatibility. At the same time, the lesser |T_unused| is, the more the resources of a particular TaskTracker are utilized. A T_unused with negative components (when the resource requirement pattern of T_k exceeds the available resources T_availability(i)) indicates that the task would overload the node; hence, the lower the |T_unused| value, the better. Bringing both terms into a single equation, we get

    value_{combination} = \alpha \times angle + (1 - \alpha) \times |\vec{T}_{unused}|    (10)

The values α and C_heu are administrator-configured values which can be set based on the requirements of a particular cluster. If value_combination is less than or equal to the Maximum Value for Acceptance C_heu, then T_k is considered for scheduling on TT_i (Figure 3).

    TaskAssignment(task_k, TT)
        T_k = getTaskVector(task_k)
        T_compound = getVectorsOfTasksOnTaskTracker(TT)
        T_availability = T_R - T_compound
        T_unused = T_availability - T_k
        angle = arccos( (T_k · T_availability(i)) / (|T_k| |T_availability(i)|) )
        value_combination = α × angle + (1 − α) × |T_unused|
        if value_combination ≤ C_heu then
            return TRUE
        else
            return FALSE
        end if

Figure 3. Task Assignment algorithm following the heuristic-based approach.
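Analogously to the machine learning sketch above, a hedged Java rendering of the heuristic test in Figure 3 might look as follows, reusing the hypothetical TaskVector class from Section II-A; the class and method names are illustrative, and the default α and C_heu values shown in comments are the ones listed in Table I:

    // Sketch of the heuristic Task Assignment test (Figure 3, Equations (7)-(10)).
    final class HeuristicTaskAssignment {
        // Total resources of a node: 100 for each of cpu, memory, disk and network.
        private static final TaskVector TR = new TaskVector(100, 100, 100, 100);

        private final double alpha;   // weight between angle and unused-resource terms (0.5 in Table I)
        private final double cHeu;    // Maximum Value for Acceptance C_heu (60 in Table I)

        HeuristicTaskAssignment(double alpha, double cHeu) {
            this.alpha = alpha;
            this.cHeu = cHeu;
        }

        // tk is the TaskVector of the incoming task and tCompound the compound vector
        // of the tasks already running on the TaskTracker.
        boolean isCompatible(TaskVector tk, TaskVector tCompound) {
            TaskVector tAvailability = TR.minus(tCompound);                    // Equation (7)
            TaskVector tUnused = tAvailability.minus(tk);                      // Equation (8)
            double angle = tk.angleDegrees(tAvailability);                     // Equation (9), degrees
            double value = alpha * angle + (1 - alpha) * tUnused.magnitude();  // Equation (10)
            return value <= cHeu;                                              // accept only compatible tasks
        }
    }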
Table I
HADOOP AND ALGORITHM PARAMETERS

    Parameter                                                           Value
    HDFS block size                                                     64 MB
    Speculative execution                                               enabled
    Heartbeat interval                                                  4 seconds
    Number of map task slots per node                                   4
    Number of reduce task slots per node                                2
    Replication factor                                                  3
    Number of queues in Capacity scheduler                              3
    Cluster resources allocated for each queue in Capacity scheduler    33.3%
    alpha, α                                                            0.5
    Maximum Value for Acceptance, C_heu                                 60
    Minimum Acceptance Probability, C_ml                                0.45
III. EVALUATION AND RESULTS

We have implemented our algorithms as scheduler plugins for the MapReduce framework. We customized the method 'List assignTasks(TaskTrackerStatus)' of the org.apache.hadoop.mapred.TaskScheduler class. Upon job submission, our algorithm calculates the TaskVector of the job by running a sample of map/reduce tasks. The TaskVector is constructed separately for the 'map' and 'reduce' phases; the 'sort' and 'shuffle' phases are merged with the reduce phase. In our experiments, five map/reduce tasks are run and the average amongst them is calculated to find T_k. The TaskVector is captured by monitoring the resource usage related to cpu, memory, disk and network through the 'atop' [8] utility.

A. Testing Environment

Our testing environment consisted of 12 nodes (one master and 11 slaves). The master node was designated to run the JobTracker and the NameNode [9], while the slaves ran TaskTrackers and DataNodes (the slave daemons of HDFS). The nodes were heterogeneous, with Intel Core 2 Duo 2.4 GHz processors, a single hard disk with capacity ranging from 80 GB to 2 TB, and 2 or 4 GB of RAM. The nodes were interconnected with a gigabit ethernet switch. All the nodes were installed with the CentOS release 5.5 (Final) operating system and Java 1.6.0_26.

B. Experiments Description

There are very few schedulers which are even minimally resource aware, of which the Fair [10] and Capacity [11] schedulers are widely acclaimed. The Capacity scheduler, developed by Yahoo, is one of the most successful schedulers currently used in the industry for real-world applications, with resource division at the cluster level. Hence, we have chosen it as the baseline for testing our algorithms. Table I provides the Hadoop and algorithm parameters used for testing. Experiments were conducted on jobs like terasort, grep, web crawling, wordcount and video file format conversion, which are close to real-world applications.
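For concreteness, a hedged sketch of how a few of the Table I parameters would typically be expressed as Hadoop configuration properties is shown below. The property names are assumptions based on the Hadoop 0.20 release line used here and are not quoted from the paper; the heartbeat interval and the Capacity scheduler queue settings are omitted because they are configured elsewhere.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterConfigSketch {
        // Returns a Configuration carrying some of the Table I values.
        public static Configuration build() {
            Configuration conf = new Configuration();
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);                 // HDFS block size: 64 MB
            conf.setInt("dfs.replication", 3);                                 // replication factor: 3
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);   // speculative execution: enabled
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);            // map task slots per node: 4
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);         // reduce task slots per node: 2
            return conf;
        }
    }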
C. Comparison on runtime of the jobs

We compare the overall runtime of the jobs by varying the number of input jobs given to the scheduler. From our experiments, we conclude that the saving in overall runtime can go up to 21% with the heuristic based approach and 27% with the machine learning based approach when compared to the Capacity scheduler. The amount of saving also varies with the number of jobs given to the cluster. As the number of jobs increases, the scheduler finds it easier to find diverse jobs
to be scheduled together so that they do not conflict. As the number of jobs increases, the savings in overall runtime also increase, as can be seen from Figures 4 and 5.

Figure 4. Comparison of runtime (in hours) of the jobs between the Capacity and heuristic based algorithms: the saving in the runtime of the jobs increases as the number of jobs increases.

Figure 5. Comparison of runtime (in hours) of the jobs between the Capacity and machine learning based algorithms: the saving in the runtime of the jobs increases as the number of jobs increases.

D. Comparison on resource usage

We present the effect of our scheduling algorithms on the cpu usage of a TaskTracker. We monitored a random TaskTracker for its resource requirement. Figures 6 and 7 show a snapshot of the cpu requirement on a particular TaskTracker for a certain random period of time. From Figures 6 and 7, we can see that the cpu requirement of the tasks running on the TaskTracker with the Capacity scheduler reaches up to 250% (the resource requirement, not the resource usage). This situation occurs because the Capacity scheduler is unaware of the nature of the jobs. On the contrary, our algorithms try not to schedule similar tasks on a TaskTracker and hence do not overload the TaskTracker, except for minor infrequent surges. We also observe that the overall utilization of the node is increased. A similar effect of our algorithms on other resources such as memory, disk and network has been observed.

Figure 6. Comparison of cpu requirement on a TaskTracker between the Capacity and heuristic based algorithms: the cpu requirement mostly stays below 100% in the case of the heuristic based algorithm, except for a few surges. The time stamp is shown in minutes.

Figure 7. Comparison of cpu requirement on a TaskTracker between the Capacity and machine learning based algorithms: the cpu requirement mostly stays below 100% in the case of the machine learning based algorithm, except for a few surges. The time stamp is shown in minutes.

IV. RELATED WORK

In [12], the authors discuss a self-adaptive MapReduce scheduling algorithm which tries to classify the nodes into map-slow and reduce-slow nodes by using historical information. Whenever a task is found to be running slowly, the algorithm selects slow tasks and launches backup tasks accordingly on a faster node. In our approach, we try to understand the task compatibility on a node before a task is scheduled, to avoid the 'slow task' condition for that particular task on a node. And after a task is scheduled on a particular node, we monitor that node for the overload condition and re-train the classifier accordingly to enhance decision making in the future.

In [13], the authors have shown how the MapReduce framework can be leveraged to run heterogeneous sets of workloads, including accelerated and non-accelerated applications, on top of heterogeneous clusters composed of regular nodes and accelerator-enabled systems. The authors propose an 'adaptive scheduler' which provides dynamic resource allocation across jobs, hardware affinity when possible, and is even able to spread a job's tasks across accelerated and non-accelerated nodes in order to meet performance goals in extreme conditions.

In [4], the authors presented a utility based Job Admission algorithm for a MapReduce framework exposed as Software as a Service. The proposed Admission Control algorithm accepts jobs based on the amount of utility they generate for the MapReduce service provider. A machine learning based algorithm is proposed which incorporates a Naive-Bayes classifier with MapReduce related features like 'used map and reduce slots', 'map/reduce time average', etc.

In [14], the authors present the design of an agile data center with integrated server and storage virtualization, along with the implementation of an end-to-end management layer. They also propose a novel VectorDot scheme (similar to our vector model) to address the complexity introduced by the data center topology and the multidimensional nature of the loads on resources.

In [15], the authors present an approach known as 'RS Maximizer' to maximize the utilization of each resource set, which can optimize the performance of the Hadoop job at hand and also obtain potential energy savings by identifying the unused resources.
V. CONCLUSION AND FUTURE WORK

In this paper, we presented scheduling algorithms that try to avoid race conditions for resources on the nodes of the cluster. We have discussed the importance of having information regarding every task and node of the cluster. In this context, we presented the Task Selection algorithm and the Task Assignment algorithm, which select the task that is best suited to a particular node. We have successfully achieved our aim of avoiding overloading any node on the cluster and of utilizing maximum resources on a particular node, thereby decreasing race conditions for resources and the overall runtime of the jobs. Since our algorithm mainly focuses on task assignment, it can also be plugged in in conjunction with a Fair or Capacity scheduler. This would add the Fair and Capacity schedulers' job selection features to our algorithm. Though the proposed algorithms are designed specifically for the MapReduce framework, they can very well be implemented in any distributed environment. For example, we can incorporate our resource-aware algorithms to effectively schedule virtual machines on a given number of physical machines. Future work of our research includes tuning of the MapReduce framework with different configuration parameters to find the best runtime of the jobs through machine learning techniques. We also plan to propose scale up/down algorithms with our resource-aware scheduling to switch virtual machines/nodes on or off based on the resource usage of the cluster to save energy.

ACKNOWLEDGMENT

The authors would like to thank the students and faculty of the Search and Information Extraction Lab (SIEL) for their support and constant feedback throughout this work.

REFERENCES

[1] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599-616, 2009.

[2] Nitesh Maheshwari, Radheshyam Nanduri, and Vasudeva Varma. Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework. Future Generation Computer Systems, 28(1):119-127, 2012.

[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

[4] Jaideep Dhok, Nitesh Maheshwari, and Vasudeva Varma. Learning based opportunistic admission control algorithm for MapReduce as a service. In ISEC '10: Proceedings of the 3rd India Software Engineering Conference, pages 153-160. ACM, 2010.

[5] JobTracker architecture. http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

[6] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, November 2000.

[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software. SIGKDD Explorations, 11(1), 2009.

[8] 'Atop' utility. http://www.atoptool.nl/

[9] Hadoop Distributed File System. http://hadoop.apache.org/common/docs/current/hdfs_design.html

[10] Fair Scheduler. http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html

[11] Capacity Scheduler. http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html

[12] Quan Chen, Daqiang Zhang, Minyi Guo, Qianni Deng, and Song Guo. SAMR: A self-adaptive MapReduce scheduling algorithm in heterogeneous environment. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, 2010.

[13] J. Polo, D. Carrera, Y. Becerra, V. Beltran, J. Torres, and E. Ayguade. Performance management of accelerated MapReduce workloads in heterogeneous clusters. In Parallel Processing (ICPP), 2010 39th International Conference on, pages 653-662, 2010.

[14] Aameek Singh, Madhukar Korupolu, and Dushmanta Mohapatra. Server-storage virtualization: integration and load balancing in data centers. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 53:1-53:12, Piscataway, NJ, USA, 2008. IEEE Press.

[15] Karthik Kambatla, Abhinav Pathak, and Himabindu Pucha. Towards optimizing Hadoop provisioning in the cloud. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud '09, Berkeley, CA, USA, 2009. USENIX Association.