A Dynamic and Reliable Failure Detection and Failure Recovery Service in Grid Systems

Bahman Arasteh, Manouchehr ZadahmadJafarlou and Mohammad Javad Hosseini
Abstract Fault tolerance and resource monitoring are important services in grid computing systems, which comprise heterogeneous and geographically distributed resources. Reliability and performance must be treated as major criteria when executing safety-critical applications on a grid. Since resource failures can lead to job execution failure, a fault tolerance service is essential for achieving dependability in grid systems. This paper proposes a fault tolerance and resource monitoring service that improves dependability while respecting economic efficiency. The dynamic architecture of the method reduces resource consumption, performance overhead and network traffic. The proposed fault tolerance service consists of failure detection and failure recovery. A two-layered detection service is proposed to improve failure coverage and reduce the probability of false alarms. An application-level checkpointing technique with an appropriate grain size is proposed as the recovery service to attain a trade-off between failure detection latency and performance overhead. An analytical approach is used to analyze the reliability and efficiency of the proposed fault tolerance services.
B. Arasteh (✉)
Department of Computer, Tabriz Branch, Islamic Azad University, Tabriz, Iran
e-mail: [email protected]

M. ZadahmadJafarlou
Department of Computer, Ilkhchi Branch, Islamic Azad University, Ilkhchi, Iran
e-mail: [email protected]

M. J. Hosseini
Department of Computer, Sufian Branch, Islamic Azad University, Sufian, Iran
e-mail: [email protected]
Keywords Grid computing · Fault tolerance service · False alarm · Performance overhead · Dependability
1 Introduction

Grid computing, as a large distributed environment, integrates diverse and heterogeneous resources and services [1]. It enables the aggregation and sharing of geographically distributed computational, data and other resources as a single, unified resource for solving large-scale computation- and data-intensive applications in a parallel manner and at reasonable cost [2-4]. Grid computing can be exploited as an efficient platform for critical and computation-intensive applications, such as molecular sample examination and research on nuclear boiling, which need many hours, days or even weeks of execution. Safety-critical and real-time distributed applications, such as scientific, medical and industrial applications, have rigorous requirements on timing and on the correctness of results.

Because grid resources are highly heterogeneous and can leave or join dynamically, fault, error and failure occurrence in every component of the grid environment must be considered a common event, and consequently the infrastructure of the grid can reconfigure and change dynamically. Therefore, dependability and its related criteria, such as reliability, safety and availability, must be considered in grid resource management and job scheduling. In the Globus toolkit, the Meta Directory Service (MDS) and the heartbeat monitor (HBM) service are used to build a general fault tolerance service for the grid environment [5]. Low coverage, low reliability and low efficiency are the main drawbacks of this monitoring service. Thus, a reliable, efficient, scalable, dynamic and economic failure detection service must be provided as a basic service in grid systems.

In this paper, we propose a reliable, efficient and dynamic fault tolerance service based on component and information replication. The proposed approach can detect and recover from a high percentage of timing and content failures, including Byzantine faults. Our technique reduces the probability of false alarms (false positives and false negatives) and consequently improves the reliability of the fault tolerance service. The cost of the requested services, from both the user's and the resource provider's points of view, is another criterion considered in this work.
2 System Architecture and Assumptions

2.1 System Model

The grid infrastructure consists of layered software components deployed on different nodes [6]. A grid can be defined as a layer of networked services that allows users to access a distributed collection of computing, data, communication and application resources from any location (Fig. 1).
Fig. 1 Overview of grid infrastructure: Grid Users (through web browser); Applications (portals, engineering, science); High-Level Services and Tools (MPI-G, C++); Grid Core Services / Middleware (GRAM, MDS, HBM, GARA, GMT, GTS); Grid Resources
The term service-oriented architecture refers to an architecture for developing reliable distributed systems that deliver functions as services. These services communicate with each other by message passing and are implemented using Web services, which are built on the same technologies (HTTP, XML, web servers) as the World Wide Web. Web services technology and grid middleware together allow a service-oriented architecture to be developed for a grid middleware. The Open Grid Services Architecture (OGSA) represents an evolution towards a grid system architecture based on Web services concepts and technologies [2, 7]. OGSA, as a new architecture for grid middleware, provides a more unified and simplified approach to grid applications.

The Globus Toolkit is an open-source software toolkit used for building grid projects. It includes software services and libraries for resource monitoring, discovery and management, plus security and file management [3, 8]. Grid Resource Allocation Management (GRAM), the Monitoring and Discovery System (MDS) and the Grid Security Infrastructure (GSI) are the main components of the Globus toolkit, and the heartbeat monitor (HBM) technique is used to handle faults and failures [2, 5, 7] (Figs. 1, 2). GRAM is responsible for managing local resources and comprises a set of Web services to locate, submit, monitor and cancel jobs on grid computing resources. MDS is the information services component of the Globus toolkit and provides information about the available resources on the grid and their status. GSI is a set of tools, libraries and protocols used in Globus to allow users and applications to securely access resources. MDS consists of two services: the Grid Resource Information Service (GRIS) and the Grid Index Information Service (GIIS). GRIS is a machine-specific service that contains information about the machine on which it is running and provides resource discovery services in a Globus deployment.
Fig. 2 The hierarchical architecture of the Globus failure detection service: in each site, GRIS instances monitor local resources and store their information over the LAN in the site GIIS, while higher-level GIIS nodes query the site GIISs for resource status
The resource information providers use a push protocol to update the GRIS periodically. The GIIS provides a global view of the grid resources and pulls information from multiple GRIS instances to combine into a single coherent view of the grid [3, 7, 9, 10]. Globus falls into the push category of resource dissemination, since resource information is periodically pushed from the resource providers. Resource discovery is performed by querying MDS. The GRIS monitors the state of the registered resources and processes, and the GIIS acts as the data collector that receives the heartbeat messages generated by the local monitors [5, 7]. When a failure occurs in a local resource, the corresponding GRIS informs its domain GIIS by sending this information.
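As an illustration of this push-based dissemination, the following Python sketch shows a GRIS-like provider periodically pushing its local status to a GIIS-like aggregator that builds the global view and suspects resources whose reports are overdue. The class names, the report format and the interval/timeout values are assumptions made for illustration; they are not the actual Globus MDS interfaces.

```python
import time

HEARTBEAT_INTERVAL = 5.0                    # seconds between pushes (illustrative value)
FAILURE_TIMEOUT = 3 * HEARTBEAT_INTERVAL    # missed-heartbeat threshold


class GIISAggregator:
    """Aggregates per-resource status pushed by GRIS-like providers (simplified)."""

    def __init__(self):
        self.last_report = {}               # resource_id -> (timestamp, status dict)

    def receive_push(self, resource_id, status):
        # A provider pushes its local resource information periodically.
        self.last_report[resource_id] = (time.time(), status)

    def global_view(self):
        # Resources whose report is overdue are flagged as suspected failed.
        now = time.time()
        view = {}
        for rid, (ts, status) in self.last_report.items():
            alive = (now - ts) <= FAILURE_TIMEOUT
            view[rid] = {"status": status, "suspected_failed": not alive}
        return view


class GRISProvider:
    """Machine-specific information provider that pushes to its site aggregator."""

    def __init__(self, resource_id, giis):
        self.resource_id = resource_id
        self.giis = giis

    def collect_status(self):
        # A real provider would query CPU load, free memory, queue length, ...
        return {"load": 0.4, "free_memory_mb": 2048}

    def push_once(self):
        self.giis.receive_push(self.resource_id, self.collect_status())


if __name__ == "__main__":
    giis = GIISAggregator()
    gris = GRISProvider("node-17", giis)
    gris.push_once()                        # one push cycle
    print(giis.global_view())
```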
2.2 Fault Model

Resources may enter and leave the grid at any time. Hence, the grid is a hazardous environment and resource failure is a common event, not an exception. Moreover, the probability of faults, errors and failures in each remote resource and in the network framework is not negligible [9, 11]. Failures might happen at many stages of job processing as a consequence of many software and hardware faults. The focus of this paper is on resource failure and local environment failure during job execution. Many transient and permanent hardware faults can lead to resource failure in a grid system, and omission faults can do so as well [12]. The fault model in this paper covers physical faults: faults in the host machine's CPU, memory and storage, faults in the software layers of the host machine (e.g. failure of the operating system), and faults in the transmission channels. As mentioned, omission faults also arise when resources become unavailable or leave the grid environment during a job execution. Early and late results of a request are another type of fault in the grid. These faults can lead to fail-stop,
Byzantine and timing failures. In a fail-stop failure, the system does not output any data; it immediately stops sending any events or messages and does not respond to any message. In a Byzantine failure, the system does not stop but behaves incorrectly: it may send out wrong information or respond late to a message. One assumption in this paper is the correctness and fault-freeness of the submitted jobs, which are replicated by the scheduler. The proposed method, based on replication techniques, focuses on handling fail-stop, Byzantine and timing failures.
3 Related Work

Component replication [13, 14], job replication [15] and data replication [16] are different replication methods that can be used in different layers of grid computing to achieve fault tolerance. At the application level, fault tolerance (FT) mechanisms are implemented in the application code by exploiting application-level knowledge [13, 17, 18]. Major middleware tools that make use of application-level checkpointing are BOINC and XtremWeb [19]. The significant features of this technique are efficiency, low performance overhead, high flexibility and portability. System-level fault tolerance [20] is an automatic and transparent technique that is unaware of application details: the application is seen as a black box and nothing is known about its characteristics. System-level FT is used, for example, in Condor and Libckpt. Transparency and simplicity are the advantages of this technique, but it becomes impractical when the system has a large number of resources because of its performance and communication overhead. Multi-level FT tries to combine the advantages of both techniques [21, 22]: different FT techniques are embedded in different layers of the grid and are responsible for handling the corresponding errors and failures. However, embedding FT techniques in different layers and components (mixed-level FT) is sometimes not possible because of the diversity and distribution of grid components.

In grid systems, the heartbeat mechanism is the conventional way to implement failure detection services [23]. In this technique, the GRIS periodically sends a heartbeat message to the data collector to show that it is still alive. The heartbeat interval affects the detection coverage, detection latency, and performance and communication overhead. This technique is unreliable because its reliability depends on the reliability of the GRIS and GIIS; low detection coverage and low scalability are its other significant drawbacks. We propose a fault tolerance and resource monitoring service that addresses the following criteria. Reliability: the continuity of correct service in detecting and recovering from resource failures; a reliable FT service has a low probability of false alarms. Coverage: the percentage of faults that can be handled by the FT service. Latency: the time interval between a resource or process failure and its detection. Performance overhead: the time, resource and communication overhead imposed by the FT service. Scalability, portability and
flexibility: the FT service should be independent of any particular platform, should be able to adapt to different types of platforms and applications, and must scale to a large number of grid resources and processes. Resource utilization: using dynamic methods in the FT service improves resource utilization and consequently reduces the cost of services in economic grids.
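To make these criteria concrete, the short sketch below computes coverage, mean detection latency and false-alarm rate from a hypothetical event log; the record format and the example numbers are assumptions for illustration only.

```python
# Each record: (failure_time or None, alarm_time or None), times in seconds.
#   (failure, alarm)  -> detected failure
#   (failure, None)   -> missed failure (false negative)
#   (None, alarm)     -> spurious alarm (false positive)
events = [(10.0, 12.5), (40.0, None), (None, 55.0), (80.0, 81.2)]

failures = [e for e in events if e[0] is not None]
detected = [e for e in failures if e[1] is not None]
false_positives = [e for e in events if e[0] is None and e[1] is not None]

coverage = len(detected) / len(failures)                        # fraction of failures detected
mean_latency = sum(a - f for f, a in detected) / len(detected)  # mean detection latency (s)
false_alarms = len(false_positives) + (len(failures) - len(detected))
false_alarm_rate = false_alarms / len(events)                   # false positives + negatives

print(f"coverage={coverage:.2f}, mean latency={mean_latency:.2f}s, "
      f"false alarm rate={false_alarm_rate:.2f}")
```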
4 Proposed Method

After a job is submitted through a host machine, if the machine cannot schedule it because of the required resources and the job deadline, the grid resource management services are invoked to select the needed resources and schedule the job. We focus on the execution of safety-critical and soft real-time applications on the grid. A hierarchical architecture arranges processes into some form of hierarchy; Fig. 2 shows the hierarchical architecture of the fault tolerance service in the Globus toolkit. Our work focuses on the drawbacks and single points of failure of this model and improves it by utilizing dynamic redundancy techniques.

In the hierarchical model, failure detectors monitor processes and resources directly, or indirectly through other levels of the hierarchy. This reduces communication overhead by combining information about several processes into a single message and by storing information at several levels of the system. The local monitor of a resource monitors the status of the process running on the host machine. If a failure, timing or content, occurs, the corresponding local monitor detects it and informs the site failure detector and resource manager. Local monitor coverage refers to the probability that a local monitor detects a failure occurring in a running resource. Local monitor latency refers to the time between a failure occurrence and its detection by the corresponding monitor. Local monitor reliability refers to the probability that an alarm generated by the monitor is accurate. The same considerations apply to the other levels of the FT service; for example, the site failure detector may itself be erroneous and offer faulty information to the requesting client. The FT service must detect a process or resource failure before it propagates to other informational states and before the deadline of the running application. Therefore, we need FT services that tolerate timing, content and Byzantine failures at different layers. The proposed method comprises failure detection and recovery services.
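The hierarchy described above can be illustrated with a minimal sketch in which local monitors report timing and content failures to a site-level failure detector that aggregates them into a single upward message. All class names and the callback interface are illustrative assumptions, not part of the Globus toolkit.

```python
class SiteFailureDetector:
    """Site-level detector: aggregates reports from many local monitors
    and forwards one combined message upward, reducing network traffic."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.suspected = {}                 # process_id -> failure reason

    def report_failure(self, process_id, reason):
        self.suspected[process_id] = reason

    def summary_message(self):
        # One message summarising several processes, as in the hierarchical model.
        return {"site": self.site_id, "suspected": dict(self.suspected)}


class LocalMonitor:
    """Monitors one running process on a host machine (timing and content)."""

    def __init__(self, process_id, deadline, detector):
        self.process_id = process_id
        self.deadline = deadline
        self.detector = detector

    def check(self, elapsed_time, mid_result_ok):
        if elapsed_time > self.deadline:                 # timing failure
            self.detector.report_failure(self.process_id, "deadline exceeded")
        elif not mid_result_ok:                          # content failure
            self.detector.report_failure(self.process_id, "invalid mid-result")


fd = SiteFailureDetector("site-A")
LocalMonitor("job-42-replica-1", deadline=120.0, detector=fd).check(130.0, True)
print(fd.summary_message())
```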
4.1 Failure Detection Services

After receiving a job from a host, the host scheduler analyzes the job information, including the needed resources and the remaining deadline. If the scheduler cannot serve the job within the remaining deadline using the ready resources, the resource discovery service is invoked. GRAM provides an interface for accessing
system resources. When a job is submitted by a client, MDS, which consists of the GIIS and GRIS, is invoked to report the available resources (in the local domain or a remote domain). The required degree of dependability and performance and the remaining deadline of the job are important when discovering candidate resources. The resource discovery and selection algorithm must consider the following parameters: the performance of the machine, the dependability of the machine and the locality of the machine (resources with a high degree of locality impose low communication and performance overhead). These are combined into a selection criterion of the form

Resource Selection Criterion = (Locality × Dependability) / Workload.

After discovering the needed resources, GRAM generates K replicas of the job. The parameter K is adaptable with respect to the required degree of dependability. In the next step, GRAM selects one ready machine as a candidate from the K machines in the site, dispatches the job and starts it. Checkpoints are created concurrently with the running job, and the local monitor controls the status of the corresponding process. The main questions at this step are: Has the running machine failed? Is the machine running the job correctly? Are the intermediate results correct? Are the time thresholds respected? To answer these questions, FT and monitoring services are needed to detect errors while the job is running and also to detect the failure of the host node. The monitor of a host machine monitors the status of the corresponding machine and informs the fault detector of the corresponding domain via the HBM and GRIS. The local GRIS of a host machine uses the heartbeat as a notification message to inform the corresponding fault detector. A periodic signal used as a notification message can detect a resource halt or disconnection, but it cannot detect the correctness of the delivered services and information (content and Byzantine failures). In order to detect timing and content failures, this work applies dynamic redundancy and periodic diagnosis mechanisms during job execution. Detecting an error during job execution, before it leads to failure, is a very complex process because many characteristics of a real-time application on the remote machine are unknown. A content failure occurs when the content of the information delivered at the service interface deviates from the specified results [11]. In the proposed model, the detection mechanisms combine function-level detection and periodic diagnosis: a diagnosis module, a comparator and acceptance tests (AT), as functions of the detection service, are invoked periodically to verify intermediate results. The proposed failure detection schema covers both timing failures and content failures. In a timing failure, the determined deadline for the computation of the result or the delivery of the requested service is not met; the detection service therefore monitors time thresholds and checks time-out and deadline constraints of the running job. Checking deadline constraints is an important factor in failure detection mechanisms, especially for real-time applications (Figs. 3, 4). An acceptance test (AT) [11] is a software module that is executed on the outcome of the relevant independent replica to confirm whether the result is reasonable.
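Returning to the resource selection criterion given above, the following sketch ranks candidate machines by (locality × dependability) / workload and picks the K replica hosts. The field names and the assumption that all three quantities are already normalized to (0, 1] are illustrative choices, not specified by the paper.

```python
def selection_score(machine):
    # Resource Selection Criterion = (Locality * Dependability) / Workload
    # (all three quantities assumed to be normalized to the interval (0, 1]).
    return machine["locality"] * machine["dependability"] / machine["workload"]


def select_replica_hosts(candidates, k):
    """Pick the K best host machines for the K job replicas."""
    ranked = sorted(candidates, key=selection_score, reverse=True)
    return ranked[:k]


candidates = [
    {"name": "h1", "locality": 0.9, "dependability": 0.95, "workload": 0.5},
    {"name": "h2", "locality": 0.4, "dependability": 0.99, "workload": 0.2},
    {"name": "h3", "locality": 0.8, "dependability": 0.90, "workload": 0.9},
]
K = 2   # K is chosen according to the required degree of dependability
print([m["name"] for m in select_replica_hosts(candidates, K)])
```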
Fig. 3 Hierarchical monitoring and failure detection service: local monitors observe the processes running on each host and report over the LAN, via heartbeat messages, to site-level failure detectors
Fig. 4 An overview of the two-layer failure detection service: when the replica on host H1 cannot pass the AT, the failure detector redoes the failed process on another host (H2) from the last checkpoint; if all K host machines fail to pass the AT, majority voting is invoked on the results of the K host machines
Generating a perfect acceptance test for a real-time application is sometimes not feasible. In this model, the local monitor of the selected host machine starts executing the first replica. During the execution of the first replica, the AT is executed periodically to check the results of the execution. If the output of the AT is true, the intermediate result of the computation is considered valid and the local monitor saves it as a checkpoint in reliable storage. Otherwise, a fault is detected and a copy of the outcome is stored as a backup by the local monitor. The local monitor also informs the corresponding domain failure detector (FD) by sending a notification message composed of the last committed checkpoint. The corresponding FD and site resource manager (domain MDS and GRAM) restart the same job on a different selected machine from that checkpoint. This process is repeated until either one of the alternate replicas passes the AT and delivers the result, or no more of the K replicas are left. When the last alternate replica fails to pass the AT, there are two possible
reasons: either all K replicas on the corresponding host machines failed, or the AT itself failed even though some correct results exist. In this model, the failure of a site FD can lead to failure of the fault tolerance service in the corresponding domain. If the MDS, GRAM and GIIS services in a domain fail, the corresponding monitoring service may fail; for example, the local monitor can fail because of corrupted services from the corresponding GRIS, and if a GIIS offers faulty information, the corresponding monitoring system may fail. Hence, the probability of false alarms is not negligible and must be considered to attain a reliable detection service. In order to harden this single point of failure, we use information redundancy at this level: in each site, the corresponding FD sends a notification message (heartbeat message) with a specific period to a monitor at another level of the hierarchy. The most recently saved information of an FD, which is received as a notification message, is used by the recovery services. The same mechanism is used at the other levels of the grid environment.
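The overall detection-and-recovery cycle described in this subsection (run one replica at a time, check intermediate results with the AT, commit checkpoints on success, restart on another host from the last checkpoint on failure, and fall back to majority voting when all K replicas fail the AT) can be summarized by the sketch below. The job/state objects, the checkpoint store and the run_segment and acceptance_test callables are placeholders assumed for illustration; a production implementation would also handle host discovery and the notification messages sent to the domain FD.

```python
from collections import Counter


def run_job_with_ft(job, hosts, acceptance_test, run_segment, checkpoint_store):
    """Dynamic-redundancy execution sketch: one active replica at a time,
    periodic AT checks, checkpoint-based recovery, majority vote as last resort."""
    failed_outputs = []                                        # backups of rejected outcomes
    state = checkpoint_store.last(job) or job.initial_state

    for host in hosts:                                         # the K candidate host machines
        replica_ok = True
        while not state.finished:
            mid_result, state = run_segment(host, job, state)  # one diagnosis period
            if acceptance_test(mid_result):
                checkpoint_store.save(job, state)              # commit checkpoint
            else:
                failed_outputs.append(mid_result)              # keep a backup copy
                state = checkpoint_store.last(job) or job.initial_state  # roll back
                replica_ok = False
                break                                          # restart on the next host
        if replica_ok:
            return state.result                                # this replica passed every AT

    # All K replicas failed to pass the AT: the AT itself may be faulty,
    # so a majority vote over the backed-up outputs decides the result
    # (outputs assumed hashable for this sketch).
    value, count = Counter(failed_outputs).most_common(1)[0]
    if count > len(failed_outputs) // 2:
        return value
    raise RuntimeError("no replica passed the AT and no majority result exists")
```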
4.2 Reliability and Performance Trade-off

It must be noted that this method relies on dynamic redundancy in order to trade off between reliability and performance. The dynamic-redundancy approach assumes that the submitted job is not hard real-time and can tolerate temporarily incorrect outputs. In this method, GRAM needs only one host machine to start the job. Hence, the average waiting time to discover the needed resources is low. This feature also improves the efficiency of the resources, decreases the average resource consumption and consequently reduces the cost of services in an economic grid.

Different types of notification messages are used at the different levels of this model. The time interval between notification messages has a significant impact on reliability, performance slowdown and communication overhead. Increasing the interval reduces the network traffic but increases the detection latency, and increasing the latency may allow erroneous results to propagate to other components of the system. Conversely, decreasing the interval increases the frequency of checkpoints and notification messages and consequently increases the imposed time overhead (Fig. 5). Hence, an optimal value must be calculated with respect to the system status (job deadline, network delay, and the required reliability and performance); a minimal sketch of one such calculation is given after the list below. The optimal interval between notification messages improves the following parameters:

• Detection and recovery latency
• Time and space overhead imposed by checkpointing
• Number of messages and network traffic
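As a minimal illustration of how such an interval might be chosen, the sketch below evaluates a simple cost model, expected detection latency plus checkpointing and messaging overhead, over a set of candidate intervals. The linear cost model, the weights and the per-checkpoint and per-message costs are assumptions for illustration and are not calibrated values from the paper.

```python
def interval_cost(interval, job_deadline, net_delay,
                  checkpoint_cost=0.5, message_cost=0.05,
                  w_latency=1.0, w_overhead=1.0):
    """Illustrative cost model for a notification/checkpoint interval (seconds).

    - Expected detection latency ~ half the interval plus the network delay.
    - Overhead ~ number of checkpoints and messages sent before the deadline.
    """
    detection_latency = interval / 2.0 + net_delay
    periods = job_deadline / interval
    overhead = periods * (checkpoint_cost + message_cost)
    return w_latency * detection_latency + w_overhead * overhead


def best_interval(candidates, job_deadline, net_delay):
    return min(candidates, key=lambda t: interval_cost(t, job_deadline, net_delay))


candidates = [1, 2, 5, 10, 20, 30, 60]      # candidate intervals in seconds
print(best_interval(candidates, job_deadline=600.0, net_delay=0.2))
```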
Fig. 5 Notification messages (NM) exchanged between the resource, site and domain monitors with different periods; the figure marks the failure point, detection point, detection latency, recovery latency and periodic checkpoints
5 Evaluation

5.1 Reliability Evaluation

In order to evaluate the reliability and safety of the proposed scheduling model, we use an analytical approach. The Markov model is a conventional analytical method for evaluating the dependability and performance of software and hardware systems [12, 24], and we have used a Markov approach to analyze the reliability of the proposed service. Resource failure and local environment failure, especially computing host failure, are taken as the fault model, and both timing and content failures are considered. The basic assumptions are as follows: resource failures are independent and have a constant failure rate λ; the majority voter, as a software module, is considered perfectly reliable; the AT, as a software module, is not considered perfectly reliable; and the submitted jobs are soft real-time and free of software development faults. Figure 6 compares the reliability of the proposed monitoring and fault tolerance system with a monitoring system without redundancy and a monitoring system based on triple modular redundancy (TMR). The mean time to failure (MTTF) is one of the significant factors in evaluating the reliability of a system. In the following, we evaluate the MTTF of the proposed method:

MTTF = ∫₀^∞ R(t) dt,  MTTF_TMR-FD = ∫₀^∞ (3e^(-2λt) - 2e^(-3λt)) dt = 5/(6λ),

MTTF_Proposed-FD = (1/λ) ∑_{k=1}^{n} (1/k),  MTTF_TMR = 5/(6λ) < 1/λ,  hence MTTF_TMR-FD < MTTF_Proposed-FD.
Fig. 6 Reliability of proposed monitoring and FT system
Fig. 7 MTTF of the proposed FT system compared with the basic FT service
Given the mean time to repair (MTTR), Availability = MTTF / (MTTF + MTTR); accordingly, Availability_Proposed-FD > Availability_TMR-FD (Figs. 6, 7).
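The following sketch evaluates these expressions numerically, comparing the TMR-based detector with the proposed dynamic scheme for a concrete failure rate, repair time and replica count; the values of λ, MTTR and n (the number of replicas, K in the text) are illustrative, not measured.

```python
import math

lam = 1e-4        # constant resource failure rate per hour (illustrative)
mttr = 2.0        # mean time to repair in hours (illustrative)
n = 5             # number of replicas available to the proposed scheme (K in the text)


def r_tmr(t):
    """Reliability of a TMR configuration: R(t) = 3e^(-2*lam*t) - 2e^(-3*lam*t)."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)


# MTTF = integral of R(t) dt, evaluated in closed form.
mttf_simplex = 1.0 / lam
mttf_tmr = 5.0 / (6.0 * lam)
mttf_proposed = (1.0 / lam) * sum(1.0 / k for k in range(1, n + 1))


def availability(mttf):
    return mttf / (mttf + mttr)


print(f"R_TMR(1000 h)        = {r_tmr(1000):.6f}")
print(f"MTTF simplex         = {mttf_simplex:,.0f} h")
print(f"MTTF TMR             = {mttf_tmr:,.0f} h")
print(f"MTTF proposed (n={n}) = {mttf_proposed:,.0f} h")
print(f"Availability TMR     = {availability(mttf_tmr):.6f}")
print(f"Availability proposed= {availability(mttf_proposed):.6f}")
```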
5.2 Performance and Resource Consumption Analysis

The proposed fault tolerance (FT) service has a dynamic architecture. The model needs only a single host machine to start the job; hence it does not lead to starvation and reduces the waiting time. When a failure is detected during a job execution, the FD and monitor discover (through the MDS server) another candidate machine, so no resources are wasted before a failure actually occurs. The dynamic architecture of the proposed model therefore reduces the total service time and improves resource efficiency, and it increases the percentage of accepted resource and service requests in the grid system. By contrast, an NMR-based FD must keep seven host machines active in order to tolerate three host machine failures. Hence, the new model needs fewer resources on average, which improves efficiency and reduces the service cost.
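To illustrate the resource-consumption argument numerically, the sketch below compares the expected number of hosts actually consumed by the dynamic scheme, which activates a new host only after a detected failure, with a static seven-host NMR configuration that keeps every host active. The per-replica failure probability and the geometric model behind it are illustrative assumptions, not results from the paper.

```python
def expected_hosts_dynamic(k, p_fail):
    """Expected number of hosts used when only one replica runs at a time and
    a new host is activated only after a detected failure (truncated at k).

    E[hosts] = sum_{i=1..k-1} i * p^(i-1) * (1 - p) + k * p^(k-1)
    """
    expected = 0.0
    for i in range(1, k):
        expected += i * (p_fail ** (i - 1)) * (1 - p_fail)
    expected += k * (p_fail ** (k - 1))     # remaining probability mass on the last replica
    return expected


p_fail = 0.1            # assumed per-replica failure probability (illustrative)
k = 7                   # replicas available to the proposed dynamic scheme
static_nmr_hosts = 7    # a seven-host NMR detector keeps every host active

print(f"dynamic scheme, expected hosts used: {expected_hosts_dynamic(k, p_fail):.2f}")
print(f"static NMR, hosts used:              {static_nmr_hosts}")
```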
6 Conclusions

In this paper we proposed a dynamic monitoring and fault tolerance service based on dynamic redundancy techniques, covering timing and content failures. The failure detection service is organized in two levels, acceptance tests (AT) and majority voting, which reduce the probability of false alarms and consequently improve the reliability of the FT service. The monitors at each level periodically store the status of the corresponding monitorable component as checkpoints in reliable storage. In order to mask failures of the AT, a majority voting module is used to harden the failure detection module, so the proposed FT model is less dependent on the quality of the acceptance test. The analytical evaluation using a Markov model shows that the proposed FT service has higher reliability, higher resource utilization, lower resource consumption and lower performance overhead.
References

1. Foster I, Kesselman C (1998) The grid: blueprint for a new computing infrastructure. Morgan Kaufmann, Los Altos
2. Foster I, Kesselman C, Tuecke S (2001) The anatomy of the grid: enabling scalable virtual organizations. Int J Supercomput Appl 15(3):200–222
3. Foster I, Kesselman C (1998) The Globus project: a progress report. In: Proceedings of the heterogeneous computing workshop
4. Jacob B, Ferreira L, Bieberstein N, Gilzean C, Girard J, Strachowski R, Yu S (2003) Enabling applications for grid computing with Globus. IBM
5. Stelling P, Foster I, Kesselman C, Lee C, von Laszewski G (1998) A fault detection service for wide area distributed computations. In: High performance distributed computing, pp 268–278
6. Baker M, Buyya R, Laforenza D (2002) Grids and grid technologies for wide-area distributed computing. Software: Practice and Experience. doi:10.1002/spe.488
7. OGSA, http://www.globus.org/ogsa/
8. Czajkowski K, Foster I, Kesselman C, Karonis N, Martin S, Smith W, Tuecke S. A resource management architecture for metacomputing systems. In: Proceedings of the workshop on job scheduling strategies for parallel processing
9. Bouteiller A, Desprez F (2008) Fault tolerance management for hierarchical GridRPC middleware. In: Cluster computing and the grid
10. Huedo E, Montero S, Llorente M (2002) An experimental framework for executing applications in dynamic grid environments. ICASE technical report
11. Avizienis A, Laprie J, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Dependable Secur Comput 1:11–33
12. Shooman ML (2002) Reliability of computer systems and networks: fault tolerance, analysis, and design. Wiley, New York
13. Nguyen-Tuong A (2000) Integrating fault-tolerance techniques in grid applications. Ph.D. dissertation, University of Virginia
14. Arshad N (2006) A planning-based approach to failure recovery in distributed systems. Ph.D. thesis, University of Colorado
15. Townend P, Xu J (2004) Replication-based fault tolerance in a grid environment. e-Demand project, University of Leeds, Leeds
16. Antoniu G, Deverge J, Monnet S (2004) Building fault-tolerant consistency protocols for an adaptive grid data-sharing service. IRISA/INRIA and University of Rennes 1, Rennes
17. Medeiros R, Cirne W, Brasileiro F, Sauve J (2003) Faults in grids: why are they so bad and what can be done about it? In: Fourth international workshop on grid computing, p 18
18. Fagg GE, Dongarra JJ (2000) FT-MPI: fault tolerant MPI, supporting dynamic applications in a dynamic world. Lecture Notes in Computer Science, vol 1908, pp 346–354
19. Domingues P, Andrzejak A, Silva LM (2006) Using checkpointing to enhance turnaround time on institutional desktop grids. Amsterdam, p 73
20. Bronevetsky G, Fernandes R, Marques D, Pingali K, Stodghill P (2006) Recent advances in checkpoint/recovery systems. In: Workshop on NSF next generation software, held in conjunction with the 2006 IEEE international parallel and distributed processing symposium
21. Kola G, Kosar T, Livny M (2004) Phoenix: making data-intensive grid applications fault tolerant. In: Proceedings of the 5th IEEE/ACM international workshop on grid computing
22. Thain D, Livny M (2002) Error scope on a computational grid: theory and practice. In: 11th IEEE international symposium on high performance distributed computing, p 199
23. Aguilera MK, Chen W, Toueg S (1997) Heartbeat: a timeout-free failure detector for quiescent reliable communication. In: Proceedings of the 11th international workshop on distributed algorithms (WDAG '97), pp 126–140
24. Lyu M (1996) Handbook of software reliability engineering. McGraw-Hill, New York