Adaptive Failure Tolerant Replication based ... - Semantic Scholar

Adaptive checkpointing strategy to tolerate faults in Economy Based Grid

Babar Nazir (a), Kalim Qureshi (b), Paul Manuel (c)

(a) Department of Computer Science, COMSATS Institute of Information Technology, 22060, Tobe Camp Abbottabad, NWFP, Pakistan

(b) Department of Mathematics and Computer Science, Kuwait University, Safat 13060, State of Kuwait. Email: [email protected]

(c) Department of Information Science, Kuwait University, Safat 13060, State of Kuwait

Abstract

In this paper, we develop a fault tolerant job scheduling strategy in order to tolerate faults gracefully in an economy based grid environment. We propose a novel adaptive task checkpointing based fault tolerant job scheduling strategy for an economy based grid. The proposed strategy maintains a fault index for grid resources. It dynamically updates the fault index based on successful or unsuccessful completion of an assigned task. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault tolerant schedule manager in addition to using a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply different intensities of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. We also compare the “checkpointing fault tolerant job scheduling strategy” with the well known time optimization heuristic in an economy based grid environment. From the measured results we conclude that, even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerates faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and reduces the execution cost of grid jobs.

Keywords: Economy Based Grid, Grid Job Scheduling, Grid Resource Management, Fault Tolerance, Checkpointing, Distributed System.

1. INTRODUCTION

The term Grid computing [1-4] was introduced in the early 1990s by Ian Foster and Carl Kesselman. They explain it as a way to make the computation power of idle workstations available to remote grid users for the execution of their computation-hungry jobs. This was intended to be done in the same pervasive fashion in which we access the electric power grid [5-7]. Grid computing uses idle computational power from different geographical places. It involves runtime aggregation of these resources in the form of a Virtual Organization (VO) [8, 10], according to the needs of the job submitted by the grid user.

The economy based grid [11-13] is a user-centric resource management and job scheduling approach. It offers incentives and profits to resource owners for contributing their resources. On the other hand, it also gives users a dynamic environment in which to maximize their gains by relaxing QoS requirements such as budget and deadline. In this way, economy based grid computing provides a competitive environment that satisfies both parties involved in the grid, i.e. resource producers and resource consumers.

The rapid growth of the Internet in the last 10 years was the first major facilitator of the renewed interest in fault tolerance and related techniques. We believe that the emergence of grid computing will further increase the importance of fault tolerance. Grid computing will pose a number of unique new conceptual and technical challenges to fault tolerance researchers. The following factors make the probability of faults in a grid environment much higher than in a traditional distributed system [4, 7, 31]: lack of a centralized environment, predominant execution of long jobs, highly dynamic resource availability, diverse geographical distribution of resources, and the heterogeneous nature of grid resources. Thus, the incorporation of fault tolerance related features in a grid job scheduling strategy should not be an optional feature, but a necessity.

Fault tolerance becomes more critical when an economy based grid environment [11-13] is considered. Failing to meet the deadline and the quality will adversely affect user faith in the grid, resulting in loss of business. Nainwal et al. [30] have studied adaptive and QoS oriented scheduling in general grid environments. To the best of our knowledge, fault tolerant job scheduling in the economy based grid is not widely studied. In this paper we propose a new strategy for fault tolerant job scheduling for the economy based grid. This is a novel adaptive task checkpointing based fault tolerant job scheduling strategy. Real time grid applications such as multimedia processing and scientific computation are very good candidates for checkpointing. Many real time applications in distributed systems [4, 8, 29] have used checkpointing for performance optimization. The motivation of this paper is to develop a fault tolerant grid scheduling strategy for economy based grid systems.
We investigate the performance of efficient scheduling in the presence of faults so that the penalty paid by the resource provider is minimized [14]. This also enables the grid to uphold the faith of the user by not compromising the user's QoS requirements because of faults at grid resources. The main contributions of the paper are:

1. We advocate the need for a fault tolerant job scheduling mechanism for the economy based grid environment. A fault tolerant grid scheduling model for the economy based grid is proposed. This paper modifies the model for the time optimization strategy presented in [8] and adds fault tolerant features.

2. The paper presents an adaptive task checkpointing based fault tolerant job scheduling strategy for the economy based grid. This tackles the problem of fault tolerant job scheduling for the economy based grid. The proposed strategy uses an adaptive heuristic of task checkpointing, enabling the grid to complete grid jobs within the specified deadline and allotted budget. This is done by using the result of the last saved checkpoint when a fault occurs at a grid resource.

3. To simulate the proposed strategy, we enhance the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. The paper defines the interaction protocol for communication between the core GridSim entities and the new entities introduced by this paper for fault tolerance.

4. Through extensive simulation, the paper compares the performance of the proposed strategy with the time optimization strategy in economy based grid environments.

The rest of this paper is organized as follows. Section 2 briefly reviews different research efforts for providing fault tolerance in grid computing. Section 3 elaborates the problem formulation. Section 4 explains the proposed system model and job scheduling strategy. Section 5 discusses the simulation environment and describes the interaction protocol for communication between the proposed fault tolerance entities and GridSim entities. Section 6 describes the experimental setup. Section 7 explains the simulation results in terms of different performance evaluation parameters. The final section presents conclusions and suggests future work.

2. RELATED WORK

In the literature, research on fault tolerance in the grid environment can be divided into two main types: pro-active and post-active. A pro-active fault tolerance mechanism takes the possible failure of a grid resource into account before scheduling jobs on it. A post-active mechanism, on the other hand, takes appropriate measures only after a job has failed. Most researchers apply the latter approach to deal with failures, using different methods such as the grid monitoring approach mentioned in [15].

As far as fault detection in any resource of the grid is concerned, there are two main strategies: the pull model and the push model, as described in [16]. In the pull model, the different grid components are responsible for sending periodic signals to a fault detector. In the absence of any such signal from a grid component, the fault detector recognizes that a failure has occurred at that component. It then applies the measures dictated by the predefined fault tolerance mechanism. In the push model, it is the responsibility of the fault detection component to send periodic signals to the different grid components. Further, the fault detection component is responsible for detecting and handling the faults.

In [17] an agent oriented pro-active fault tolerance framework is proposed. Here faults in the grid environment are divided into six classes: hardware faults, application and operating system faults, network faults, software faults, response faults and timeout faults. These basic classes of faults are further divided into sub-classes. Different software agents are used for the different classes of faults in order to deal with faults in a pro-active manner. These agents maintain information about the different characteristics of the grid environment and enable the grid system to tolerate different types of faults gracefully. In [18] mobile agents are used for providing fault tolerance.
The mechanism is named MAG (Mobile Agents Technology for Grid Computing Environments). Here fault tolerant components are developed as mobile agents, which form a multi-agent society. In [19], when an originally selected grid resource is recognized as faulty, all the jobs assigned to this faulty resource are remapped to some other grid resource.

A review of the literature [12-16] reveals that grid environments are more failure-prone than general distributed systems. Fault tolerance measures in a grid environment are different from those of general distributed systems. Very little work has been done on fault tolerance in the economy based grid environment, so there is a need for a new fault tolerant scheduling mechanism for this environment.
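As an illustration of the signal-based fault detection described earlier in this section, the following is a minimal sketch in which components send periodic signals and the detector treats prolonged silence as a fault. The class name `HeartbeatDetector`, the `timeout` parameter and the resource names are assumptions made for this example; they do not come from [16] or from GridSim.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Flags a component as failed when its periodic signal stops arriving."""
    timeout: float  # seconds of silence tolerated before a fault is declared (assumed)
    last_seen: dict = field(default_factory=dict)  # component name -> time of last signal

    def signal(self, component: str, now: float) -> None:
        # Record the arrival time of the latest periodic signal.
        self.last_seen[component] = now

    def failed(self, now: float) -> list:
        # A component is considered failed once it has been silent longer than the timeout.
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout)


detector = HeartbeatDetector(timeout=5.0)
detector.signal("resource-A", now=0.0)
detector.signal("resource-B", now=0.0)
detector.signal("resource-A", now=4.0)  # A keeps signalling, B goes silent
print(detector.failed(now=6.0))         # ['resource-B']
```

Time is passed in explicitly rather than read from a clock so that the behaviour is easy to reproduce in a simulation.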

3. PROBLEM FORMULATION

Grid jobs are executed by the economy based grid as follows:

1. Grid users submit their jobs to the grid resource broker (GRB) and specify their QoS requirements, i.e. the deadline by which they want their jobs to be executed and the budget they have for the completion of the jobs.

2. The grid resource broker schedules user jobs on the best available resource by optimizing time.

3. The result of the job is returned to the user upon successful completion of the job.

Such an economy based environment has the following two major drawbacks:

1. If a fault occurs at a grid resource, present heuristics merely reschedule the job on another resource, which eventually results in failing to satisfy the user's QoS requirements, i.e. budget and deadline. The reason is simple: as the job is re-executed, it consumes more budget and time.

2. In economy based grid environments, there are resources that fulfill the deadline and budget constraints (QoS requirements) but have a tendency towards faults. In such a scenario, the grid resource broker still selects the same resource, merely because the resource promises to meet the QoS requirements of the grid jobs. This eventually results in compromising the user's QoS parameters in order to complete the task.

In this paper, in order to address the first problem, we use a task checkpointing heuristic to enable the economy based grid to tolerate faults gracefully, as we are able to restore the partially completed task from the last checkpoint. In order to address the second problem, we make our checkpointing strategy adaptive by maintaining a fault index. This fault index is maintained by taking into consideration the fault occurrence history of the grid resource. In this way, we introduce checkpoints mostly where they are necessary. Simulation experiments show that our proposed strategy is able to tolerate faults gracefully by taking appropriate measures according to the resource's vulnerability to faults.
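The first drawback can be made concrete with a small worked example. The sketch below compares re-executing a failed task from the beginning with resuming it from the last checkpoint. It rests on simplifying assumptions that are not taken from the paper: a single fault, no checkpointing overhead, and a replacement resource of the same speed. The function name `total_time` and the numbers are hypothetical.

```python
from typing import Optional


def total_time(task_length: float, fault_at: float,
               checkpoint_interval: Optional[float]) -> float:
    """Time to finish a task that suffers one fault after `fault_at` units of work."""
    if checkpoint_interval is None:
        saved = 0.0  # no checkpoint: the task restarts from the beginning
    else:
        # Work preserved by the last checkpoint completed before the fault.
        saved = (fault_at // checkpoint_interval) * checkpoint_interval
    # Time already spent before the fault, plus the work that remains after recovery.
    return fault_at + (task_length - saved)


print(total_time(100.0, 70.0, None))  # 170.0  (plain rescheduling)
print(total_time(100.0, 70.0, 25.0))  # 120.0  (resume from the checkpoint at 50)
```

Under these assumptions the rescheduled task needs 170 time units, while the checkpointed task needs 120, which is why plain rescheduling is more likely to exceed the deadline and budget.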

4. ADAPTIVE FAULT TOLERANT JOB SCHEDULING STRATEGY

This section explains the proposed model (see Figure 1) that enables the system to tolerate faults gracefully. The proposed environment for fault detection and the heuristic for maintaining fault occurrence information are as follows.

Grid faults affect the performance of a resource management strategy. As our proposed strategy considers fault tolerance in an economy based grid environment, the aim is to optimize user-centric metrics in the presence of faults. These metrics include the number of tasks executed within deadline and budget in the presence of faults. For simplicity, we assume that a fault occurs when a grid resource is unable to complete its job within the given deadline. When such a fault is detected by a grid resource broker, the fault occurrence information about the grid resource is updated. This information is then used when deciding whether to allocate a job to the grid resource.

We propose to maintain and update the fault index of all available resources of the grid. The fault index of a grid resource indicates its vulnerability to faults (i.e. the higher the fault index, the higher the failure rate). The fault index of a grid resource is incremented every time the resource does not complete the assigned task within the deadline, and decremented whenever the resource completes the assigned task within the deadline.
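The fault index update rule just described can be sketched as follows. This is a minimal illustration, not the GridSim extension itself: the class name `FaultIndexTable`, the step size of 1, the starting value of 0 and the floor at 0 are all assumptions made for the example.

```python
class FaultIndexTable:
    """Keeps one fault index per grid resource (higher means more fault-prone)."""

    def __init__(self):
        self._index = {}  # resource name -> current fault index

    def get(self, resource: str) -> int:
        # Resources with no history start at 0 (assumption).
        return self._index.get(resource, 0)

    def record(self, resource: str, met_deadline: bool) -> int:
        current = self.get(resource)
        if met_deadline:
            # Successful completion within the deadline: decrement, never below 0 (assumption).
            updated = max(0, current - 1)
        else:
            # Deadline missed, which the text treats as a fault: increment.
            updated = current + 1
        self._index[resource] = updated
        return updated


table = FaultIndexTable()
table.record("R1", met_deadline=False)
table.record("R1", met_deadline=False)
table.record("R1", met_deadline=True)
print(table.get("R1"), table.get("R2"))  # 1 0
```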


Components of Proposed Scheduling Strategy:

The interaction between the different components of the economy based grid in the proposed scheduling strategy (see Figure 1) is as follows.

A grid resource is a member of a grid and it offers computing services to grid users. Grid users register themselves to the Grid Information Server (GIS) of a grid by specifying the QoS requirements such as the cost of computation, deadline to complete the execution, the number of processors, speed of processing, internal scheduling policy, and time zone. A GIS contains information about all available grid resources, with their computing capacity and the cost at which they offer their services to grid users. All grid resources that join and leave the grid are monitored by the GIS. Whenever a grid broker has jobs to execute, it consults the GIS to identify an appropriate grid resource.

The Fault Tolerant Schedule Manager (FTScheduleManager) maintains the fault index history of grid resources and updates (increments or decrements) the fault index of a grid resource on request from the broker.

The Checkpoint Manager (CPManager) receives the partially executed result of a task from a grid resource at the checkpoint intervals specified by the grid resource broker. It maintains a table of grid tasks and their checkpoints, which contains information about tasks partially executed by the grid resources. The CPManager also receives and responds to task completion and task failure messages from grid resources. It updates its table and passes this information to the grid resource broker. For a particular task, the CPManager discards the result of the previous checkpoint when a new checkpoint result is received. If the CPManager receives the task completion message for a task from a resource, it removes the task's entry from the checkpoint result table.

The Grid Resource Broker (GRB) is an important entity of a grid. A grid resource broker is connected to an instance of a user.
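The CPManager's checkpoint table described above can be sketched as a small data structure. This is only an illustration of the three rules stated in the text (a new checkpoint replaces the previous one, completion removes the entry, failure hands the last checkpoint back to the broker). The class name `CheckpointTable`, its method names and the stored fields are hypothetical and are not GridSim APIs.

```python
class CheckpointTable:
    """Holds only the most recent checkpoint of each partially executed task."""

    def __init__(self):
        self._latest = {}  # task id -> (percent completed, partial result)

    def save(self, task_id: str, percent_done: float, partial_result: object) -> None:
        # A newer checkpoint overwrites (discards) the previous one for the same task.
        self._latest[task_id] = (percent_done, partial_result)

    def on_completed(self, task_id: str) -> None:
        # Task finished: its checkpoint entry is no longer needed.
        self._latest.pop(task_id, None)

    def on_failed(self, task_id: str):
        # Task failed: return the last checkpoint for the broker, or None if there is none.
        return self._latest.get(task_id)


cp = CheckpointTable()
cp.save("task-1", 25.0, "partial@25")
cp.save("task-1", 50.0, "partial@50")  # replaces the 25% checkpoint
print(cp.on_failed("task-1"))           # (50.0, 'partial@50')
cp.on_completed("task-1")
print(cp.on_failed("task-1"))           # None
```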
Each grid job (composed of gridlets) is first submitted to its broker, which then schedules the grid job according to the user's scheduling policy. When the GRB receives a grid job from a user, it gets the contract information of the available grid resources from the GIS and then requests the resources to send their current workload condition. Based on the current workload condition of the resources, it prepares a list of resources that can execute the task while satisfying the deadline and budget constraints, following the time optimization strategy. The GRB then gets the fault index of each resource in this list from the FTScheduleManager. Depending on the fault index of the resource, the GRB applies Algorithm A to take appropriate decisions.
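The broker's shortlisting step described above can be sketched as follows. The sketch assumes a very simple cost model that is not taken from the paper: execution time is the job length divided by the resource speed (current workload is ignored), and cost is time multiplied by a per-second price. The names `Resource`, `shortlist`, `mips` and `price` are hypothetical.

```python
from typing import NamedTuple


class Resource(NamedTuple):
    name: str
    mips: float   # processing speed in million instructions per second
    price: float  # cost per second of processing (hypothetical unit)


def shortlist(resources, job_length_mi, deadline, budget, fault_index):
    """Return (name, fault index) for feasible resources, fastest first."""
    candidates = []
    for r in resources:
        time_needed = job_length_mi / r.mips
        cost = time_needed * r.price
        # Keep only resources that meet both QoS constraints.
        if time_needed <= deadline and cost <= budget:
            candidates.append((time_needed, r.name, fault_index.get(r.name, 0)))
    candidates.sort()  # time optimization: shortest estimated completion time first
    return [(name, fi) for _, name, fi in candidates]


resources = [Resource("slow", 100.0, 1.0), Resource("fast", 500.0, 4.0),
             Resource("mid", 250.0, 2.0)]
print(shortlist(resources, job_length_mi=10000.0, deadline=60.0, budget=100.0,
                fault_index={"fast": 3}))
# [('fast', 3), ('mid', 0)]
```

Here the slow resource is dropped because it would need 100 seconds against a 60 second deadline. The fault index attached to each remaining candidate is what the broker would then use to decide how intensively to checkpoint.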

Algorithm A:

F: fault index of the selected grid resource
F(i), i = 0, 1, 2, …, N, are integers such that F(0) < F(1) < … < F(N)
C(i), i = 1, 2, …, N, are the percentages of task completed such that 0 … > C(N)

1. IF F(0)
