2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA)
Robust Fault Tolerant Job Scheduling Approach In Grid Environment

Mangesh Balpande
Research Scholar, Dept. of CSE, G. H. Raisoni College of Engineering, Nagpur, India

Urmila Shrawankar
Dept. of Computer Science and Engineering, G. H. Raisoni College of Engineering, Nagpur, India
978-1-4799-2494-3/14/$31.00 ©2014 IEEE

Abstract: Grid computing is becoming a new face of distributed computing, allowing the aggregation of geographically dispersed resources. The dynamic and heterogeneous nature of the Grid makes it vulnerable to various faults, which can cause a job to fail, delay its completion, or force it to be re-executed from the starting point. In this paper, an empirical analysis of the different faults is carried out, nine fault tolerant job scheduling approaches for dealing with these faults are discussed, and a comparative quantitative analysis is carried out for each approach. There is still a need for better resource sharing, improved resource utilization and higher computational speed for computationally intensive applications. A technique based on the combination of RFOH and the application checkpointing approach may provide robust fault tolerant job scheduling.

Keywords: Faults, Fault Tolerance, Job Scheduling, Computational Grid, Application Checkpointing.

I. INTRODUCTION

Computational grids provide resource sharing and mutual coordination for solving computationally intensive applications in scientific and industrial areas such as bioinformatics, features of materials, decoding, military organization and simulations [1]. These applications are now common and widely accepted, and they demand computers with high capabilities or parallel processing techniques in order to perform with high efficiency and extensive synchronization.

Grid computing is a type of distributed computing that enhances the computing power of computers by using conventional network interfaces to connect ordinary computers. Grid computing is best suited to parallel applications in which the computations are independent and intermediate results need not be shared [2]. It can thus provide better performance across geographically separated heterogeneous platforms.

Running a program on a high performance supercomputer may require addressing concurrency and dependency issues. It is therefore feasible and effective to write programs that give different parts of the same problem to multiple machines [3]. This eliminates the complexities caused by multiple instances of the same problem on the supercomputer. Nevertheless, due to the heterogeneity, dynamicity and autonomy of grid resources, fault tolerant task scheduling has always been a very popular topic among researchers. Various methods and algorithms have been developed to address these issues, since organizations such as telecom companies, financial institutions, and those running hi-tech military operations and simulations are focusing strongly on fault tolerance [4].

One of the challenges of the Grid is that the nodes actually performing the computation may malfunction and produce erroneous results due to faults [5]. The grid environment is designed to run computationally intensive jobs, e.g. scientific experiments, simulations, animations, and military and aeronautics industry workloads, which may take hours, days, weeks or even months to execute. Because of this long execution time, there is a greater chance of encountering faults during execution. The Grid architecture identifies the key areas where Grid services are required. It defines standard integrated protocols and the functions necessary for the design of interoperable Grid systems [6].

The remainder of the paper is organized as follows: fault tolerance issues are discussed in Section II, fault tolerant job scheduling techniques are elaborated in Section III, a comparative analysis is conducted in Section IV, and Section V gives the conclusions.

II. FAULT TOLERANT ISSUES IN GRID

The motive of Grid computing is to develop an environment that aggregates large-scale, heterogeneous, active and passive resources for solving computationally intense applications. To achieve this, it should consider grid architecture, computer heterogeneity, fault tolerance and computational delay [7].

The following are some challenges in the Grid: 1) It should provide reliability in terms of computer hardware, software and other resources for executing user applications. 2) Fundamental functions such as resource allocation and task scheduling must perform effectively. 3) Since Grids are used for computationally intensive applications, resources with less processing power may become under-loaded or over-loaded [16]. 4) Different jobs may fail at different stages of execution and may require different types of recovery actions.
A. Fault Detection
The Grid is highly heterogeneous and dynamic in nature, so faults are more likely to occur in grid computing. Applications are therefore designed so that they can recover from any type of fault without disturbing the normal schedule or affecting performance. Owing to the fault tolerant nature of the Grid, it can recover from an individual failure without terminating.

Fig. 1. Types of faults in Grid

B. Fault Prevention
Fault prevention techniques are used to prevent faults and to avoid situations that may cause resources to malfunction. These techniques are protective measures taken while scheduling and executing jobs. It is very difficult to design an application in a Grid environment and protect it from faults, and the overhead caused by the preventive measures may degrade the efficiency of the Grid [9].

TABLE I. ANALYTIC COMPARISON OF FAULTS IN GRID

Faults | Causes of Faults | Comments | Fault Tolerance Techniques
Hardware Fault | Hardware | Replacement or troubleshooting is the only solution | No particular technique
Application & OS Fault | Result of DoS exceptions, viruses | Not able to handle user defined exceptions | System & Application Checkpointing [3]
Network Fault | Packet loss or network congestion | May produce inconsistent results | Particle Swarm Optimization [17], Optimal Neighbor [18]
Interaction Fault | Transmission overhead and service interdependencies | Detection of host crash or network failure is required | Multiple Ant Colony Optimization [11]
Middleware Fault | Exception in the working of middleware | Middleware should perform job scheduling, fault tolerance, etc. | Alchemi, Globus, Legion, ICENI
Transient Fault | Malfunction of a particular task specific component | Understanding the idempotent nature of the Grid is necessary | Fuzzy logic [10], NSGA-II and Fuzzy mutation [14]

C. Fault Elimination
To eliminate the faults discussed above, the following approaches need to be considered [9] [8]:
1) Job Replication Approach: Multiple copies of the same job are created and executed on independent resources to deal with a single point of failure.
2) Check-pointing Approach: The state of a running job is stored in stable storage as a checkpoint before a fault occurs. When a fault occurs, the job is rolled back to the last checkpoint saved before the failure and resumed from there, so the checkpoint is used to recover from the fault.
3) Adaptive Approach: A combination of the job replication and checkpointing approaches, used to design a fault tolerant job scheduling system.

D. Some Issues for Fault Resolution
Some problems arise while designing a fault tolerant Grid system. The following problems and solutions may help provide an effective and efficient fault tolerant, load balancing Grid environment:
1) Problem/Solution: Different jobs may fail at different stages of execution, so different recovery mechanisms may be required to deal with different types of faults at different stages.
2) Problem/Solution: A dynamic load balancing policy is not always best suited to the Grid environment. For small scale applications, a priori task information is not required, so collecting it may cause computation overhead.
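The check-pointing approach described under Fault Elimination can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: a plain dict stands in for stable storage, the job is a simple running sum, and the names `run_with_checkpoints`, `interval` and `fail_at` are hypothetical and do not come from any of the surveyed systems.

```python
import copy


def run_with_checkpoints(n_steps, interval, fail_at=None, stable_storage=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every `interval` steps.

    `stable_storage` is a dict standing in for stable storage (assumption).
    `fail_at` injects a single simulated fault just before that step runs
    (hypothetical knob for the demonstration).
    Returns (result, number_of_steps_actually_executed).
    """
    storage = stable_storage if stable_storage is not None else {}
    state = {"step": 0, "total": 0}
    storage["ckpt"] = copy.deepcopy(state)  # initial checkpoint
    executed = 0
    failed = False
    while state["step"] < n_steps:
        if fail_at is not None and not failed and state["step"] == fail_at:
            failed = True
            # Fault: discard the live state and roll back to the last checkpoint.
            state = copy.deepcopy(storage["ckpt"])
            continue
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % interval == 0:
            storage["ckpt"] = copy.deepcopy(state)  # periodic checkpoint
    return state["total"], executed
```

With a checkpoint every 4 steps, a fault injected at step 6 costs only the two steps since the last checkpoint; with no effective checkpointing (an interval longer than the job) the same fault forces a restart from the beginning.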
III. FAULT TOLERANT JOB SCHEDULING TECHNIQUES FOR GRID ENVIRONMENT

In a Grid environment, the Grid scheduler monitors and schedules the resources before dispatching a job. The grid scheduler receives jobs from clients and assigns them to the corresponding resources. Some fault tolerant job scheduling approaches are discussed below.

A. Job Scheduling based on Fuzzy Logic Approach
To overcome fault tolerance fluctuations in the grid environment, a fuzzy logic based self-adaptive job replication (FSARS) algorithm is proposed in [10]. Depending on the success rate, access control and data integrity, a Security Demand (SD) is assigned to a user job after submission. Accordingly, the Trust Level (TL) is evaluated based on the success rate, grid utilization, etc. The job is successfully completed if SD ≤ TL holds.
The Security Demand level ($S_D$) and Trust Level ($T_L$) of the task set are evaluated by considering the SD and TL of the corresponding hosts and the number of hosts satisfying the condition $S_D \le T_L$; finally, the difference ratio ($S_E$) between $T_L$ and SD is calculated. SD, $S_D$ and $S_E$ are each divided into five membership functions [VL, L, M, H, VH], and the number of replication copies (K) of each job is then evaluated by forming fuzzy if-then rules.

B. Job Scheduling based on MACO Approach
An approach for reducing the imbalance between tasks and their execution time is proposed in [11]. Increasing the number of ants may decrease the performance of Ant Colony Computation (ACC), but it is beneficial for Multiple Ant Colony Optimization (MACO), although it increases the communication overhead between nodes.

C. Job Scheduling based on RFOH Approach
In [12], a genetic algorithm is proposed that utilizes the fault occurrence history of resources, which is maintained in the Grid Information Server (GIS). This approach stores the history in a resource fault occurrence history table (RFHT) with two columns: the first column records the faults that have occurred in a particular resource, and the second column records the number of jobs executed by that resource. Initially the value of each cell is zero, and it is updated when: 1) the resource is unable to execute a job within the deadline, and 2) a job is allocated to the resource. For choosing a fault tolerant resource, the fitness function f is given by:

f = ∑ … , ∑ …    (1)

This favors the resources with a lower response time and fault occurrence history. After the algorithm completes, the resources with higher fitness values are selected, but there is still a probability of a fault occurring. To detect this, the resource broker assigns a job to a particular resource and waits for a response within a certain time interval. If the resource fails to respond to the resource broker, a fault is assumed to have occurred and it is recorded in the fault occurrence history table.

D. Job Scheduling based on AJR and BRS Approaches
The design proposed in [13] uses an adaptive number of job replicas according to the Grid failure history. It consists of two algorithms:
1) Adaptive Job Replication (AJR): This determines the adaptive number of job replicas, which is proportional to the failure tendency (FT) of the selected resources and is job dependent, as given in [13]:

…    (2)

2) Backup Resource Selection (BRS): This algorithm decides which resources execute the job replicas determined by AJR.
This scheme reduces the excessive cost of resources and the response time of jobs due to the adaptive nature of the replicas. However, it may overload the Grid if the adaptive number of job replicas increases.

E. Job Scheduling based on NSGA-II with Fuzzy Mutation Approach
The approach proposed in [14] uses NSGA-II with a fuzzy adaptive mutation operator for assigning independent tasks. NSGA-II works on the principle of diversity preservation and assigns a non-dominated rank to every individual in the population. For task scheduling with fuzzy mutation, the inputs are the individual cost variance and the gene variance. The individual cost variance is divided into five membership functions [VL, L, M, H, VH] and the gene variance into three [L, M, H]. The output, a fuzzy mutation, is then evaluated using fuzzy if-then rules, which give the probability for the genes and population within a chromosome. This approach requires fewer iterations for scheduling, but the cost of crossover, mutation and defuzzification is higher.

F. Job Scheduling based on Application Check-pointing Approach
The work proposed in [3] addresses a checkpointing mechanism in which the state of application threads is saved in the form of checkpoints; when a thread fails, the corresponding thread is resumed from its checkpoint.

This scheme consists of a manager, which is the central part of the Grid, and executor nodes. Message based communication between the manager and the executors is used as a liveness check: if the manager does not receive a message from an executor for a particular time period, it considers that a fault has occurred [16]. Instead of continuing the current thread on the faulty node, the thread is transferred to and scheduled on another node from the beginning. While performing checkpointing, the following factors need to be considered: 1) checkpointing overhead, 2) time to resume execution and 3) probability of checkpoints. Checkpointing leads to considerable overhead for jobs with a small execution time. It is efficient in faulty situations but introduces a slight overhead in fault-free situations.

G. Job Scheduling based on Rough Set based Multi-checkpointing Approach
In predictive fault tolerant scheduling [1], rough set theory is used to predict the k supporter nodes with the lowest probability of failure. It makes use of a global scheduler to select the best resources and sites, considering the load, capacity, deadline and capability of jobs. The total price after execution of a job is given by:

Total price = …    (3)

where
k = number of supporter nodes
n = number of tasks
m = execution time of each task
Ec = cost required by various constraints

Since this model is based on probability, it requires information about the grid system and the state of every mechanism to be stored in advance. As a result, its performance depends on network parameters such as bandwidth and topology.
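As a rough illustration of the history-based selection described for the RFOH approach (subsection C), the following Python sketch keeps a two-counter table per resource and lets a broker treat a missing or late response as a fault. It is not the method of [12]: the genetic algorithm is omitted, the faults-per-job ratio used as a score is an assumption standing in for the fitness function, and the names `FaultHistoryTable`, `dispatch`, `response_time` and `timeout` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FaultHistoryTable:
    """Two counters per resource: faults observed and jobs allocated."""
    faults: dict = field(default_factory=dict)
    jobs: dict = field(default_factory=dict)

    def register(self, resource):
        # Every cell starts at zero, as in the description of the table.
        self.faults.setdefault(resource, 0)
        self.jobs.setdefault(resource, 0)

    def record_allocation(self, resource):
        self.register(resource)
        self.jobs[resource] += 1

    def record_fault(self, resource):
        self.register(resource)
        self.faults[resource] += 1

    def fault_rate(self, resource):
        # Assumed score: faults per allocated job; 0.0 for an unused resource.
        jobs = self.jobs.get(resource, 0)
        return self.faults.get(resource, 0) / jobs if jobs else 0.0

    def pick(self, candidates):
        # Lowest assumed fault rate wins; ties go to the earlier candidate.
        return min(candidates, key=self.fault_rate)


def dispatch(table, resource, response_time, timeout):
    """Broker-side check: no response within `timeout` counts as a fault.

    `response_time` is None when the resource never answered.
    """
    table.record_allocation(resource)
    ok = response_time is not None and response_time <= timeout
    if not ok:
        table.record_fault(resource)
    return ok
```

A broker would call `dispatch` for each job and `pick` before the next allocation, so that resources with a poor history are chosen less often.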
H. Job Scheduling based on ESS and QSS Techniques
The work proposed in [15] describes two techniques: one is based on a quadratic form of the distribution function of data chunks (QSS), and the other is based on the slope of the distribution function of data chunks, represented in exponential form (ESS). Quadratic self scheduling (QSS) uses a Taylor series expansion up to the 2nd degree only. Thus, the distribution function C(n), representing the nth chunk assigned to a particular processor, is

$C(n) = a + b\,n + c\,n^{2}$    (4)

where a, b and c are the coefficients of the second-degree expansion.

The self-adaptive approach in this method is designed so that whenever the execution environment changes or new processors are added, new lists of chunks are generated. ESS produces fewer but larger chunks, whereas QSS produces more but smaller chunks. A combination of both approaches provides better performance.

I. Job Scheduling based on Hierarchical Approach
In the hierarchical system of fault detection [4], fault detection for the child nodes of a cluster is carried out by a watchdog timer. If a node is faulty, its job is migrated to the cluster node, and if the cluster fails, it is migrated to the grid level. This is carried out using three components: a fault detector, a fault manager and a load balancing function. In this method, load balancing is carried out in three phases: 1) Intra-cluster load balancing: Depending on the threshold level of the cluster, the cluster manager decides whether or not to start load balancing. 2) Inter-cluster load balancing: If intra-cluster load balancing fails, the cluster manager takes the help of other cluster managers to carry out parallel load balancing of n nodes. 3) Intra-grid load balancing: If inter-cluster load balancing fails, the jobs of overloaded clusters are transferred to under-loaded clusters depending on the priorities of task migration. The watchdog timer sends an interrupt to the fault detector at regular intervals of time. If a fault has occurred, it is recovered using a hierarchical (i.e. divide and conquer) technique.

IV. ANALYSIS AND DISCUSSION

The analytic comparison of different faults in Table I shows that there is no specific technique available to recover from hardware faults. The faults discussed make the Grid less reliable while executing computationally intense jobs with high QoS requirements. From the survey conducted in [9] among Grid users about the kinds of faults they encounter, it can be observed that, to achieve the transparency provided by middleware, operating system and network components, the user needs to have trust in all these elements of the Grid. The performance of the previously discussed approaches can be assessed using the performance measures specified in [11] [14] [19]. The analysis of these techniques is done in terms of Execution Time (ET), Degree of Imbalance (DI), Average Waiting Time (AWT) and Grid Utilization (GU). Lower values of ET, DI and AWT and higher values of GU indicate good performance of a particular approach. These quality measures are evaluated for all the techniques discussed previously.

Degree of Imbalance (DI) measures the imbalance among computing nodes and is computed from the load of the nodes as given in [11]:

$DI = \dfrac{L_{\max} - L_{\min}}{L_{avg}}$    (5)

where $L_{\max}$ and $L_{\min}$ are the maximum and minimum load among all the computing nodes, respectively, and $L_{avg}$ is the average load of the nodes.

Execution Time (ET) is the time of task i on resource j. Let $s_i$ and $p_j$ denote the size of task i and the processing speed of resource j, respectively [14]. Then,

$ET_{ij} = \dfrac{s_i}{p_j}$    (6)

Grid Utilization (GU) is obtained by dividing the sum of all the nodes' utilization by the total number of nodes. The expected utilization of each node is based on the given task assignment [19]. Thus GU can be formulated as

$GU = \dfrac{\sum_{j=1}^{N} U_j}{N}$    (7)

where $U_j$ is the utilization of node j and N is the total number of nodes.

Each resource has a completion time for the tasks assigned to it. In general, the completion time on any processor is

$CT_j = \sum_{i \in A_j} ET_{ij}$    (8)

where $A_j$ is the set of task indexes assigned to resource j.

Parameters such as ET, DI and AWT play a vital role in evaluating the quality of scheduling. As shown in Fig. 2, RFOH, application checkpointing and Fuzzy NSGA-II have lower values of execution time and average waiting time. Application checkpointing has the lowest degree of imbalance and also the highest grid utilization among all the techniques. From the performance comparison, it can be said that scheduling based on a single approach is suitable for an environment with a small number of nodes. But for a computationally intense application, such as evaluating the total number of prime factors of a 40 digit number, where the participating nodes are very large in number and scheduling is more complex and critical, the combination of RFOH and the application checkpointing approach could provide better results. RFOH could be used for scheduling and application checkpointing for error recovery, so that the degree of imbalance could be minimized. This combined approach may: 1) provide better resource sharing, improved resource utilization and higher computational speed for executing computationally intense jobs; 2) prevent resources from being either heavily loaded or lightly loaded; and 3) provide a distributed fault-tolerant scheduling and load-balancing approach that minimizes the communication cost and replication cost of independent jobs.
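The evaluation measures used in this section (Degree of Imbalance, Execution Time, per-resource completion time and Grid Utilization) are simple enough to compute directly. The Python sketch below follows their verbal definitions in the text; the function names and the sample numbers in the usage note are hypothetical, and DI is assumed to take the usual form of (maximum load minus minimum load) divided by the average load.

```python
def degree_of_imbalance(loads):
    """(max load - min load) / average load over all computing nodes."""
    avg = sum(loads) / len(loads)
    return (max(loads) - min(loads)) / avg


def execution_time(task_size, speed):
    """Time of one task on one resource: task size / processing speed."""
    return task_size / speed


def completion_times(assignment, sizes, speeds):
    """Completion time of each resource.

    assignment[i] is the index j of the resource that runs task i, so each
    resource accumulates the execution times of the tasks assigned to it.
    """
    totals = [0.0] * len(speeds)
    for i, j in enumerate(assignment):
        totals[j] += execution_time(sizes[i], speeds[j])
    return totals


def grid_utilization(utilizations):
    """Sum of all node utilizations divided by the number of nodes."""
    return sum(utilizations) / len(utilizations)
```

For example, with task sizes `[10, 20, 30]`, resource speeds `[10, 5]` and assignment `[0, 1, 0]`, both resources finish after 4.0 time units, so the degree of imbalance of those completion times is 0.0.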
Fig. 2. Performance comparison of discussed techniques: (a) based on ET and AWT (time in seconds), (b) based on DI and (c) based on GU (grid utilization in percentage)

Where,
MACO → Multiple Ant Colony Optimization
RFOH → Resource Fault Occurrence History
AJR n BRS → Adaptive Job Replication and Backup Resource Selection
Fuzzy NSGA-II → Fuzzy based Non-dominated Sorting Genetic Algorithm
Appl n Check → Application Check-pointing
RS based MCheck → Rough Set based Multicheck-pointing
QSS n ESS → Quadratic Self Scheduling and Exponential Self Scheduling

V. CONCLUSIONS

In a Grid environment, different failures can occur for various reasons such as hardware failure, packet loss, timeouts during interaction and temporary malfunction of particular system components. In this paper, we have covered the various faults, the fault tolerant job scheduling techniques and the issues related to them. A comparative quantitative analysis of these scheduling approaches based on different performance metrics is carried out, and their nature, pros and cons are provided. Fault tolerant job scheduling approaches with optimal resource utilization, lowest execution time, lowest average waiting time and highest grid utilization are best suited to the design of performance driven fault tolerant job scheduling. The work can be further improved by using a robust method, based on the combined approach of RFOH and application checkpointing, to provide better resource sharing, improved resource utilization and higher computational speed.

REFERENCES

[1] Bouyer, A.; Abdullah, A.H.; Ebrahimpour, H.; Nasrollahi, F., "Fault-Tolerance Scheduling by Using Rough Set Based Multi-checkpointing on Economic Grids," Computational Science and Engineering, 2009. CSE '09. International Conference on, vol. 1, pp. 103-109, 29-31 Aug. 2009.
[2] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, "The Eucalyptus Open-source Cloud-computing System," Proceeding of Cluster Computing and the Grid, University of California, 2009.
[3] Bawa, R.K.; Singh, R., "Application checkpointing in grid environment with improved checkpoint reliability through replication," Computing Communication & Networking Technologies (ICCCNT), 2012 Third International Conference on, pp. 1-6, 26-28 July 2012.
[4] Bhagyashree, A.H.; Pradeep, D.; Jayanthy, N.; Mounica, K.V.; Nivejaa, S.; Dharani, P.S., "A hierarchical fault detection and recovery in a computational grid using watchdog timers," Communication and Computational Intelligence (INCOCCI), 2010 International Conference on, pp. 467-471, 27-29 Dec. 2010.
[5] Christopher Dabrowski, "Reliability in grid computing systems," www.interscience.wiley.com.
[6] Jia Yu, Rajkumar Buyya, "A Taxonomy of Workflow Management Systems for Grid Computing," Department of CS and SE, University of Melbourne.
[7] Balasangameshwara, J.; Raju, N., "Performance-Driven Load Balancing with a Primary-Backup Approach for Computational Grids with Low Communication Cost and Replication Cost," Computers, IEEE Transactions on, vol. 62, no. 5, pp. 990-1003, May 2013.
[8] R.K. Bawa and Ramandeep Singh, "Comparative Analysis of Fault Tolerance Techniques in Grid Environment," International Journal of Computer Applications 41(1):21-25, March 2012.
[9] Medeiros, R.; Cirne, W.; Brasileiro, F.; Sauve, J., "Faults in grids: why are they so bad and what can be done about it?," Grid Computing, 2003, Proceedings. Fourth International Workshop on, pp. 18-24, 17 Nov. 2003.
[10] Wang, Cheng; Jiang, Congfeng; Liu, Xiaohu, "Fuzzy logic-based secure and fault tolerant job scheduling in grid," Tsinghua Science and Technology, vol. 12, no. S1, pp. 45-50, July 2007.
[11] Liang Bai; Yan-Li Hu; Song-Yang Lao; Wei-Ming Zhang, "Task scheduling with load balancing using multiple ant colonies optimization in grid computing," Natural Computation (ICNC), 2010 Sixth International Conference on, vol. 5, pp. 2715-2719, 10-12 Aug. 2010.
[12] Khanli, L.M.; Far, M.E.; Rahmani, A.M., "RFOH: A New Fault Tolerant Job Scheduler in Grid Computing," Computer Engineering and Applications (ICCEA), 2010 Second International Conference on, vol. 1, pp. 422-425, 19-21 March 2010.
[13] Amoon, M., "Design of a Fault-Tolerant Scheduling System for Grid Computing," Networking and Distributed Computing (ICNDC), 2011 Second International Conference on, pp. 104-108, 21-24 Sept. 2011.
[14] Salimi, R.; Motameni, H.; Omranpour, H., "Task scheduling with Load balancing for computational grid using NSGA II with fuzzy mutation," Parallel Distributed and Grid Computing (PDGC), 2012 2nd IEEE International Conference on, pp. 79-84, 6-8 Dec. 2012.
[15] Díaz, J.; Muñoz-Caro, C.; Niño, A., "A Fault Tolerant Adaptive Method for the Scheduling of Tasks in Dynamic Grids," Advanced Engineering Computing and Applications in Sciences, 2009. ADVCOMP '09. Third International Conference on, pp. 51-56, 11-16 Oct. 2009.
[16] Akshay Luther, Rajkumar Buyya, Rajiv Ranjan, and Srikumar Venugopal, "Alchemi: A .NET-based Grid Computing Framework and its Integration into Global Grids".
[17] Lei Zhang, Yuehui Chen, Runyuan Sun, Shan Jing, Bo Yang, "A Task Scheduling Algorithm Based on PSO for Grid Computing," International Journal of Computational Intelligence Research, ISSN 0973-1873, vol. 4, no. 1, 2008.
[18] Balasangameshwara, J.; Raju, N., "A Fault Tolerance Optimal Neighbor Load Balancing Algorithm for Grid Environment," Computational Intelligence and Communication Networks (CICN), 2010 International Conference on, pp. 428-433, 26-28 Nov. 2010.
[19] Y. Li, Y. Yang, and R. Zhu, "A Hybrid Load balancing Strategy of Sequential Tasks for Computational Grids," International Conference on Networking and Digital Society, IEEE, 2009.