MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-enabled MPI Processes

Namyoon Woo, Heon Y. Yeom
School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, KOREA
{nywoo,jhs,yeom}@dcslab.snu.ac.kr

Taesoon Park
Department of Computer Engineering, Sejong University, Seoul, 143-747, KOREA
[email protected]
Abstract. Fault-tolerance is an essential element of distributed systems that require a reliable computing environment. In spite of extensive research over two decades, practical fault-tolerance systems have not been widely deployed, largely because of the high overhead and the unhandiness of previous systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and the practice of fault-tolerance systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management for MPI, and message queue management. MPICH-GF requires no modification of application source code and affects the MPICH communication layer as little as possible. The distinguishing features of MPICH-GF are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the virtual device level. We have evaluated MPICH-GF with NPB applications on the Globus middleware.
1 Introduction

A 'computational grid' (or simply a grid) is a specialized instance of a distributed system that includes a heterogeneous collection of computers in different domains connected by networks [17, 18]. The grid has attracted considerable attention for its ability to utilize ubiquitous computational resources under a single system image, and it is expected to benefit many computation-intensive parallel applications. Argonne National Laboratory has proposed the 'Globus Toolkit' as a framework for the grid, and it has become the de-facto standard of grid services [16]. The main features of the Globus Toolkit are global resource allocation and management, directory service, remote task execution, user authentication, and so on. Although the Globus Toolkit can monitor and manage global resources, it lacks dynamic process management, such as fault-tolerance or dynamic load balancing, which is essential to a distributed system.
Distributed systems are not reliable enough to guarantee the completion of parallel processes within a determinate time because of their inherent failure factors. Such a system consists of a number of nodes, disks and network links, all of which are exposed to failures. Even a single local failure can be fatal to a set of parallel processes, since it nullifies all of the computation the processes have performed in cooperation with one another. Assuming that there is a one percent chance that a single machine crashes during the execution of a parallel application, the probability that a system of one hundred machines stays alive throughout the application run is only 0.99^100 ≈ 0.366, i.e., roughly 36 percent.
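This back-of-the-envelope estimate generalizes directly. As a minimal restatement, assuming independent failures with per-machine failure probability p and N machines (here p = 0.01 and N = 100, the values used above):

    \Pr[\text{no machine fails}] = (1 - p)^{N}, \qquad (1 - 0.01)^{100} \approx 0.366 .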
In order to increase the reliability of distributed systems, it is important to provide them with fault-tolerance. Checkpointing and rollback-recovery is a well-known technique for fault-tolerance. Checkpointing is an operation that stores the states of processes in stable storage for the purpose of recovery or migration [12]. A process can resume its previous state at any time from its latest checkpoint file, and periodic checkpointing can minimize the computation lost to a failure. Although several stand-alone checkpoint toolkits [23, 31, 38] have been implemented, they are not sufficient for parallel computing for the following reasons. First, they cannot restore communication states such as sockets or shared memory. Second, they do not consider the causal relationship among the states of processes. Process states may depend on one another in a message-passing environment [9, 27]; hence, recovering a process independently, without considering these dependencies, may leave the global process state inconsistent. The inter-process relations must be reconstructed for consistent recovery.
Consistent recovery for message-passing processes has been studied extensively for over two decades, and several distributed algorithms for consistent recovery have been proposed. However, implementing these algorithms is another issue, because they typically assume the following:
• A process detects a failure and revives by itself.
• Both revived and surviving processes can communicate after recovery without any channel reconstruction procedure.
• Checkpointed binary files are always available.
There have also been many efforts to make the theory practical [19, 28, 10, 6, 13, 24, 34, 22, 36, 4, 33, 30]. These approaches take different strategies in the following respects: system-level versus application-level checkpointing, kernel-level versus user-level checkpointing, user transparency, support for non-blocking message transfer, and direct versus indirect communication. Their implementation levels differ according to these strategies, but the resulting frameworks remain proofs of concept or evaluation tools for the recovery algorithms. To the best of our knowledge, only a few such systems are actually used in practice [19, 11], and they are valid only for a specific parallel programming model or a specific machine.
Our goal in this paper is to construct a practical fault-tolerance system for message-passing applications on grids. We have integrated a rollback-recovery algorithm with the Message Passing Interface (MPI) [14], the de-facto standard for message-passing programming. As a result of this work, we present MPICH-GF, which is based on MPICH-G2 [15], the grid-enabled MPI implementation. MPICH-GF is completely transparent to application developers and requires no modification of application source code. No in-transit message is lost during checkpointing, whether it belongs to a blocking or a non-blocking operation. We have extended the Globus job manager module to support rollback-recovery and to control processes distributed over multiple domains. Above all, our main implementation issue is to provide dynamic process management, which is not defined in the original MPI standard (version 1). Since the original MPI standard specifies only static process group management, a revived process, regarded as a new instance, cannot rejoin its process group.
In order to enable a new MPI instance to communicate with the running processes, we have implemented an 'MPI_Rejoin()' function. Currently, the coordinated checkpointing protocol has been implemented and tested with MPICH-GF, and other consistent recovery algorithms (e.g., message logging) are under development. MPICH-GF runs on Linux kernel 2.4 with Globus Toolkit 2.2.
The rest of this paper is organized as follows. In Section 2, we present the concept of consistent recovery and related work on fault-tolerance systems. Section 3 describes the operation of the original MPICH-G2. We propose the MPICH-GF architecture in Section 4 and address implementation issues in Section 5. The experimental results of MPICH-GF with the NAS Parallel Benchmarks are shown in Section 6, and the conclusions are presented in Section 7.
2 Background

2.1 Consistent Recovery

In a message-passing environment, the states of processes may become dependent on one another through message-receipt events. A consistent system state is one in which, for every message-receipt event reflected in the state, the corresponding send event is also reflected [9]. If a process rolls back to a past state but another process whose current state depends on the lost state does not roll back, inconsistency occurs.
Figure 1. Inconsistent global checkpoint (two processes P1 and P2, messages m1 and m2, and their local checkpoints)
Figure 1 shows a simple example of two processes whose local checkpoints do not form a consistent global checkpoint. Suppose that the two processes roll back to their latest local checkpoints. Then process P1 has not yet sent message m1, but P2 has recorded m1 as received. In this case, m1 becomes an orphan message and causes inconsistency. Message m2 is a lost message, in the sense that P2 waits for the arrival of m2, which P1 has recorded as already sent. Any in-transit message that is not recorded during checkpointing becomes a lost message. Both of these message types cause abnormal execution during recovery.
Extensive research on consistent recovery has been conducted [12]. Approaches to consistent recovery can be categorized into coordinated checkpointing, communication-induced checkpointing and message logging. In coordinated checkpointing, processes synchronize their local checkpointing so that a consistent set of global checkpoints is guaranteed [9, 21]. On a failure, all processes roll back to the latest global checkpoint for consistent recovery. A common objection is that the coordination would not scale. Communication-induced checkpointing (CIC) lets processes checkpoint independently while preventing the domino effect using information piggy-backed on messages [2, 12, 37]. In [2], Alvisi et al. disputed the belief that CIC is scalable: according to their experimental report, CIC generates an enormous number of forced checkpoints, and the autonomy of processes in placing local checkpoints does not seem to pay off in practice. Message logging records messages along with checkpoint files in order to replay them during recovery. It can be further classified into pessimistic, optimistic and causal message logging according to the policy on how message logs are stored [3]. Log-based rollback-recovery minimizes the amount of lost computation at the cost of high storage overhead.
2.2 Related Work

Fault-tolerant systems for message-passing processes can be categorized according to whether they support direct or indirect message transfer. With direct message transfer, application processes communicate with one another directly, that is, client-to-client communication is possible; the MPICH implementation is the representative case of the direct transfer mode. With indirect message transfer, messages are sent through a medium such as a daemon; PVM and LAM-MPI [7], for example, take this approach. In these systems, processes do not have to know the physical
address of another process; instead, they maintain a connection with the medium, so recovering the communication context is relatively easy. However, an increase in message delay is inevitable.
CoCheck [36] is a coordinated checkpointing system for PVM and tuMPI. While CoCheck for tuMPI supports the direct transfer mode, the PVM version exploits the PVM daemon to transfer messages. CoCheck exists as a thin library layered over PVM (or MPI) that wraps the original API. The process control messages are implemented at the same level as application messages, so unless an application process explicitly calls the recv() function, it cannot receive any control message. MPICH-V [6] is a fault-tolerant MPICH version that supports pessimistic message logging. Every message is transferred to a remote Channel Memory server (CM) that logs and replays messages. CMs are assumed to be stable, so a revived process can recover simply by reconnecting to the CMs. According to the authors, the main reasons for using CMs are to cope with volatile nodes and to keep log data safe, even though the system pays twice the cost for message delivery. FT-MPI [13], proposed by Fagg and Dongarra, supports MPI-2's dynamic task management. FT-MPI is built on the PVM or HARNESS core library, which exploits daemon processes. Li and Tsay also proposed a LAM-MPI based implementation in which messages are transferred via a multicast server on each node [22]. MPI/FT, proposed by Batchu et al. [4], adopts task redundancy to provide fault-tolerance; this system has a central coordinator that relays messages to all the redundant processes. MPI-FT from the University of Cyprus [24] adopts message logging: an observer process copies all the messages and reproduces them for recovery. MPI-FT pre-spawns processes at the beginning so that one of the spare processes can take over a failed process. This system has a high overhead for the storage of all the messages.
There are also some systems supporting the direct message transfer mode. Starfish [1] is a heterogeneous checkpointing toolkit based on a virtual machine environment, which allows processes to migrate among heterogeneous platforms. Its limitations are that applications have to be written in OCaml and that byte code runs more slowly than native code. Egida [33] is an object-oriented toolkit that provides both communication-induced checkpointing and message logging for MPICH with the ch_p4 device. An event handler intercepts MPI operation events in order to perform the corresponding pre-defined actions for rollback-recovery. To support the atomicity of message transfer, it substitutes blocking operations for non-blocking ones. The master process with rank 0 is responsible for updating communication channel information on recovery; therefore, the master process is not supposed to fail. The current version of Egida is able to detect only process failures, and a failed process is recovered on the same node as it previously ran on; as a result, Egida cannot manage hardware failures. Hector [34] is similar to our MPICH-GF. Hector exists as a movable MPI library, MPI-TM, and several executables; hierarchical process managers create and migrate application processes, and coordinated checkpointing has been implemented. Before checkpointing, every process closes its channel connections to ensure that no in-transit message is left in the network, and processes have to reconnect their channels after checkpointing.
One technique to prevent in-transit messages from being lost is to postpone checkpointing until all in-transit messages are delivered. In CoCheck, processes exchange ready-messages (RMs) with one another to coordinate and to guarantee the absence of in-transit messages [36]. RMs are sent through the same channels as application messages; since the channels are based on TCP sockets, which preserve FIFO order, the arrival of an RM guarantees that all messages sent before it have arrived at the receiver. In Legion MPI-FT [30], each process reports the number of messages it has sent and received to the coordinator at the checkpoint request. The coordinator calculates the number of in-transit messages and then waits for the processes to report that all in-transit messages have arrived. The other way to prevent the loss of in-transit messages is to build a user-level reliable communication protocol in which in-transit messages are recorded in the sender's checkpoint file as undelivered. Meth et al. have named this technique 'stop and discard' in [25]: messages received during checkpointing are discarded, and after checkpointing the discarded messages are re-sent by the user-level reliable communication protocol. This mechanism requires one more memory copy at the sender side. RENEW [28] is a recoverable runtime system proposed by Neves et al.; it also places user-level reliable communication layers on top of the UDP transport protocol in order to
log messages and to prevent the loss of in-transit messages.
Some systems use application-level checkpointing for migration among heterogeneous nodes or for minimizing checkpoint file size [5, 29, 30, 32, 35]. This type of checkpointing burdens application developers with deciding when to checkpoint, what to store in the checkpoint file and how to recover from the stored information. The CLIP toolkit [10] for the Intel Paragon provides a semi-transparent checkpointing environment: the user must perform minor code modifications to define the checkpointing locations. Although a system-level checkpoint file is not itself portable across heterogeneous platforms, we believe that user transparency is the more important virtue, since it rarely happens that a process must recover on a heterogeneous node when plenty of homogeneous nodes exist; in addition, application developers are not willing to accept such a programming effort. To sum up, most existing systems either use the indirect communication mode or are valid only for specialized platforms.
3 MPICH-G2

In this section, we describe the execution of MPI processes on the Globus middleware and the communication mechanism of MPICH-G2. The 'Message Passing Interface' (MPI) is the de-facto standard specification for message-passing programming, abstracting low-level message-passing primitives away from the developer [14]. Among the several MPI implementations, MPICH [20] is the most popular for its good performance and portability. The good portability of MPICH can be attributed to its abstraction of low-level operations, the Abstract Device Interface (ADI), as shown in Figure 2. An implementation of the ADI is called a virtual device; MPICH version 1.2.3 includes about 15 virtual devices. In particular, MPICH with the grid device globus2 is called MPICH-G2 [15]. For reference, our MPICH-GF consists of the unmodified upper MPICH layers and our own virtual device, 'ft-globus', derived from globus2.
Figure 2. Layers of MPICH: collective operations (Bcast(), Barrier(), Allreduce(), ...) and point-to-point operations (Send()/Recv(), Isend()/Irecv(), Waitall(), ...) sit in the MPI implementation above the ADI, which is implemented by virtual devices such as ch_shmem, ch_p4, globus2 and ft-globus.
3.1 Globus Run-time Module

The Globus Toolkit [16], proposed by ANL, is the de-facto standard grid middleware. It provides directory service, resource monitoring/allocation, data sharing, authentication and authorization. However, it lacks dynamic run-time process control, for example dynamic load balancing or fault-tolerance. Figure 3 shows how MPI processes are launched on the grid middleware. Three main Globus modules are concerned with process execution: DUROC (Dynamically-Updated Request Online Co-allocator), the GRAM (Globus Resource Allocation Management) job managers and the gatekeeper. DUROC distributes a user request to the local GRAM modules, and a gatekeeper on each node then checks whether the user is authenticated. If so, the gatekeeper launches a GRAM job manager. Finally, the GRAM job manager forks and executes the requested processes.
Figure 3. The procedure of process launching in Globus: DUROC (the central manager) contacts the gatekeeper on each node; the gatekeeper forks a GRAM job manager (the local manager), which in turn forks the MPI application process.
Neither DUROC nor the GRAM job manager controls processes dynamically; they merely monitor them. Nevertheless, Globus provides the basic framework of hierarchical process management. For dynamic process management, we have extended the capabilities of DUROC and the GRAM job manager by modifying their source code. We present these extended capabilities in Section 4; from here on, we use the terms 'central manager' and 'local manager' instead of DUROC and GRAM job manager, respectively, for convenience.
3.2 globus2 Virtual Device

Collective communication in MPICH-G2 is implemented as a combination of point-to-point (P2P) communications based on non-blocking TCP sockets and active polling. MPICH provides P2P communication primitives with the following semantics:
• Blocking operations: MPI_Send(), MPI_Recv()
• Non-blocking operations: MPI_Isend(), MPI_Irecv()
• Polling: MPI_Wait(), MPI_Waitall()
The blocking send operation submits a send request to the kernel and waits until the kernel copies the message into kernel-level memory. The blocking receive operation likewise waits until the kernel delivers the requested message into user-level memory, while a non-blocking operation only registers its request and returns. The actual message delivery for a non-blocking operation is not performed until a polling function is called. Indeed, a blocking operation is the combination of the corresponding non-blocking operation and the polling function; the short sketch below illustrates this decomposition.
The communication mechanism of the globus2 device is similar to that of the ch_p4 device [8]. Each MPI process opens a 'listener port' in order to accept requests for channel opening. On receipt of a request, the receiver opens another socket and constructs a channel between the two processes. All listener information is transferred to every process at MPI initialization: the master process with rank 0 collects the listener information of the others and broadcasts it. Unlike ch_p4, however, globus2 does not fork a separate listener process, so a channel-opening request is not accepted until the receiver explicitly handles it by calling the polling function.
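The decomposition of a blocking operation into a non-blocking request plus polling can be illustrated with standard MPI calls. This is a minimal sketch of the semantics described above, assuming a simple contiguous buffer; it is not code from the globus2 device itself.

    #include <mpi.h>

    /* Sending a buffer with the non-blocking primitive plus polling.
     * Semantically, MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm)
     * behaves like this request/wait pair. */
    void send_as_request_plus_wait(double *buf, int count,
                                   int dest, int tag, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, comm, &req); /* register only */
        MPI_Wait(&req, &status);       /* polling drives the actual delivery */
    }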
Figure 4. commworldchannel structure: each channel entry i is a tcp_miproto_t holding the hostname, the listener port, handlep, and the head/tail of a send queue of tcpsendreq elements (buffer, datatype, source rank, destination rank, tag, ...).
Figure 4 shows the structure of the process group table commworldchannel of the globus2 device; the i-th entry contains the information of the channel to the process with rank i. The figure abstracts this structure to show only the fields of our concern. The pair of hostname and port is the address of a listener. handlep holds the actual channel information; if handlep is null, the channel has not been opened yet. A send operation pushes a request into the send queue and registers the request with the globus_io module; the polling function is then called to wait until the kernel handles the request.
There are two receive queues in MPICH: the unexpected queue and the posted queue (Figure 5). The former contains arrived messages whose receive requests have not been issued yet. When a receive operation is called, it first examines the unexpected queue to see whether the message has already arrived; if the corresponding message exists there, it is delivered, and otherwise the receive request is enqueued into the posted queue. On message arrival, the handler checks whether there is a corresponding request in the posted queue; if so, the message is copied into the requested buffer, and otherwise the message is pushed into the unexpected queue. This matching logic is sketched after Figure 5.
Figure 5. Receive-queues of MPICH: the unexpected queue and the posted queue hold Rhandle entries recording the source rank, tag, datatype, context id and buffer.
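The two-queue matching logic described above can be sketched as follows. This is a self-contained simplification for illustration only: the structures, the fixed-size queues and the matching rule (source rank and tag) are assumptions, not MPICH's actual data structures, and delivery is represented by a printf.

    #include <stdio.h>

    #define QCAP 64

    typedef struct { int src, tag, valid; } entry_t;
    typedef struct { entry_t e[QCAP]; } queue_t;

    static queue_t unexpected_q; /* arrived messages with no matching request yet */
    static queue_t posted_q;     /* receive requests still waiting for a message  */

    /* remove and report a matching entry (source rank + tag), if any */
    static int take_match(queue_t *q, int src, int tag)
    {
        for (int i = 0; i < QCAP; i++)
            if (q->e[i].valid && q->e[i].src == src && q->e[i].tag == tag) {
                q->e[i].valid = 0;
                return 1;
            }
        return 0;
    }

    static void push(queue_t *q, int src, int tag)
    {
        for (int i = 0; i < QCAP; i++)
            if (!q->e[i].valid) {
                q->e[i].src = src; q->e[i].tag = tag; q->e[i].valid = 1;
                return;
            }
    }

    /* A receive request checks the unexpected queue first; if the message has
     * already arrived it is delivered (printf here), otherwise it is posted. */
    void post_receive(int src, int tag)
    {
        if (take_match(&unexpected_q, src, tag))
            printf("deliver message (%d,%d) from the unexpected queue\n", src, tag);
        else
            push(&posted_q, src, tag);
    }

    /* An arriving message is delivered if a matching request was posted,
     * otherwise it is kept in the unexpected queue. */
    void on_message_arrival(int src, int tag)
    {
        if (take_match(&posted_q, src, tag))
            printf("copy message (%d,%d) into the posted buffer\n", src, tag);
        else
            push(&unexpected_q, src, tag);
    }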
4 MPICH-GF

In the MPICH-GF system, MPI processes run under the control of hierarchical managers: a central manager and local managers. The central manager and the local managers are responsible for hardware or network failures and for process failures, respectively; they are also responsible for checkpointing and automatic recovery. In this section, we present MPICH-GF's structure and the implementation of its checkpointing and recovery protocols.
Figure 6. The structure of MPICH-GF: collective and P2P communication in the MPI implementation sit above the ADI; the ft-globus device extends the original globus2 device with a checkpoint toolkit, message queue management and dynamic process management.
4.1 Structure

Figure 6 shows the structure of the MPICH-GF library. The fault-tolerance module has been implemented at the virtual device level, in 'ft-globus', so MPICH-GF requires no modification of the upper MPI implementation layers. The ft-globus device contains dynamic process group management, a checkpoint toolkit and message queue management. Most previous works implemented their fault-tolerance modules at an upper layer, abstracting away the communication primitives, whereas our implementation is at the lower layer. Upper-layer approaches cannot respect the characteristics of certain communication operations, for example the non-blocking operations. In addition, the low-level approach is unavoidable if the physical communication channels are to be reconstructed.
4.2 Coordinated Checkpointing

For consistent recovery, the coordinated checkpointing protocol is employed. The central manager initiates global checkpointing periodically, as shown in Figure 7. The local managers then signal the processes with SIGUSR1 so that the processes can prepare for checkpointing; the signal handler for checkpointing is registered in each MPI process during MPI initialization. On receipt of SIGUSR1, the signal handler executes a barrier-like function before checkpointing. Performing this barrier guarantees two things: first, there is no orphan message between any two processes; second, there is no in-transit message, because the barrier messages push any previously issued message to its receiver. The channel implementation of globus2 is based on TCP sockets, so the FIFO property holds. Pushed messages are stored in the receiver's queue in user-level memory so that the checkpoint file includes them, and when processes are recovered from this global checkpoint, all in-transit messages are restored as well. This technique is similar to CoCheck's Ready Message. After the quasi-barrier, each process generates its checkpoint file and informs its local manager that checkpointing has completed successfully. The central manager checks that all the checkpoint files have been generated and confirms the collection of checkpoint files as a new version of the global checkpoint.
4.3 Consistent Recovery

In the MPICH-GF system, the hierarchical managers are responsible for failure detection and automatic recovery. We have implemented the hierarchical managers by modifying the original Globus run-time modules, as described in Section 3. Since application processes are forked by the local manager, the local manager receives SIGCHLD when one of its forked application processes terminates.
Figure 7. Coordinated checkpointing protocol: the central manager requests checkpointing, each local manager signals its process with SIGUSR1, the processes perform the barrier and checkpoint, completion is reported back, and after waiting for all n replies the central manager confirms.
Upon receiving the signal, the local manager checks whether the termination was a normal exit() by calling the waitpid() system call (a sketch of this check appears at the end of this subsection). If it was, the execution finished successfully and the local manager does not have to do anything. Otherwise, the local manager regards it as a process failure and notifies the central manager. It is, however, possible for a local manager to fail as well. The central manager therefore monitors all local managers by pinging them periodically; if a local manager does not answer, the central manager assumes that the local manager, the hardware or the network has failed. It then re-submits the request to the GRAM module in order to restore the failed processes from their checkpoint files.
In coordinated checkpointing, a single failure results in the rollback of all processes to the consistent global checkpoint; the central manager broadcasts both the failure event and a rollback request. Our first approach to recovery was to kill all processes on a single failure and to recreate them by submitting sub-requests to the gatekeeper on each node. This approach spent too much time on the gatekeeper's authentication and on reconstructing all the channels. To improve the efficiency of recovery, we let the surviving processes keep running: they load the checkpoint file into their memory by calling exec(). This mechanism affects only the user-level memory and leaves the channel state on the kernel side untouched, so after reloading their memory the surviving processes can communicate without any channel reconstruction. Only the channel to the failed process needs to be reconstructed.
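A minimal sketch of this detection path, using standard POSIX calls; the notify_central_manager() hook and the single-handler structure are illustrative assumptions rather than MPICH-GF's actual manager code.

    #include <signal.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* assumed reporting hook: tell the central manager that 'pid' failed */
    static void notify_central_manager(pid_t pid) { (void)pid; }

    static void sigchld_handler(int signo)
    {
        int status;
        pid_t pid;

        (void)signo;
        /* reap every terminated child without blocking */
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                continue;                   /* normal exit(): nothing to do */
            notify_central_manager(pid);    /* abnormal termination: failure */
        }
    }

    int main(void)
    {
        struct sigaction sa;

        sa.sa_handler = sigchld_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
        sigaction(SIGCHLD, &sa, NULL);

        /* ... fork() the MPI application processes and monitor them ... */
        for (;;)
            pause();
    }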
5 Implementation Issues

5.1 Communication Channel Reconstruction

The original MPI specification supports only static process group management: once a process group and its channels are built, this information cannot be altered at run time. In other words, no new process can join the group. A recovered process is regarded as a new instance from the group's point of view; although it can restore its process state, it cannot communicate with the surviving processes. Our solution is to invalidate the communication channels of the failed process and to reconstruct them. We have implemented a new 'MPI_Rejoin()' function for this purpose. Before a restored process resumes computation, it calls MPI_Rejoin() in order to update its listener entry in the others' commworldchannel tables and to re-initialize its own channel information. To update the listener information, the restored process reports the following values:
Figure 8. Communication channel reconstruction protocol: (1) the recovered process informs its local manager of its new listener information; (2) the local manager forwards the new listener information to the central manager; (3) the central manager broadcasts it; (4) the local managers deliver the new commworldchannel information to the recovered and surviving processes.

    MPI_function() {
        ...
        requested_for_coordination = FALSE;
        ft_globus_mutex = 1;
        ...
        if (requested_for_coordination == TRUE)
            do_checkpoint();
        ft_globus_mutex = 0;
    }

    signal_handler() {
        if (ft_globus_mutex == 1)
            requested_for_coordination = TRUE;
        else
            do_checkpoint();
        return;
    }

Figure 9. Atomicity of message transfer
• global rank: the logical process ID of the previous run.
• hostname: the address of the node where the process has been restored.
• port number: the new listener port number.
This information is broadcast via the hierarchical managers. According to the event information received from its local manager, each surviving process either invalidates its channel to the failed process or renews the listener information of the restored process. Figure 8 describes the interaction among the managers and the processes for the MPI_Rejoin() call. The restored process frees its channel handles so that it behaves as if it had never created a channel to any other process; the procedure of channel reconstruction is then the same as that of channel creation (as described in Section 3.2). A sketch of the reported rejoin record follows.
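A minimal sketch of the information a restored process reports to its local manager. The structure layout and the function name are illustrative assumptions; only the three fields listed above come from MPICH-GF's description.

    #include <unistd.h>

    /* illustrative record of the values reported on rejoin */
    typedef struct {
        int  global_rank;    /* logical process ID of the previous run       */
        char hostname[256];  /* node on which the process has been restored  */
        int  listener_port;  /* newly opened listener port                   */
    } rejoin_info_t;

    void fill_rejoin_info(rejoin_info_t *info, int rank, int new_port)
    {
        info->global_rank   = rank;
        info->listener_port = new_port;
        gethostname(info->hostname, sizeof(info->hostname));
        /* the record is then sent to the local manager, which forwards it to
         * the central manager for broadcasting (Figure 8) */
    }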
5.2 Atomicity of Message Transfer

Messages in MPICH-G2 are sent as a header followed by a payload. If checkpointing is performed after a header has been received but before its payload arrives, part of the message may be lost. We therefore provide atomicity of message transfer, so that the communication context is stored and restored safely; in other words, checkpointing is never performed while a message transfer is in progress. We make the MPI communication operations mutually exclusive with the checkpoint signal handler, as shown in Figure 9. The mutually exclusive regions for the send and receive operations are implemented at different levels.
We set the whole send-operation area (MPI_Send and MPI_Isend) as a critical section. The process state captured in a checkpoint file should be either that a send operation has not been called at all or that the message has been sent completely. We do not want a checkpoint file to contain send requests in the send queue, because each send request refers to a physical channel: if a process restored its state with send-queue entries, some of them might not be sent correctly after the channels are updated. The kernel does not process a non-blocking send request until the polling function MPI_Wait() is called explicitly in the user code, so MPICH-GF replaces the non-blocking send operation with the blocking send operation to ensure that no send request remains in a checkpoint file. This replacement affects neither the performance nor the correctness of the process, because the blocking operation merely submits its request to the kernel and does not wait for the message to be delivered at the receiver side.
The critical section of the receive operation is placed at a lower level than that of the send operation. If the whole receive operation were a critical section, a deadlock could occur in coordinated checkpointing (Figure 10): a sender waits for all processes to enter the coordination procedure, while a receiver waits for the arrival of the requested message, so at the receiver side the coordination signal is delayed until the receive call finishes. The blocking receive is a combination of the non-blocking receive and a loop around the polling function, and the actual message delivery from the kernel to user memory is done by polling. To prevent the deadlock, we set only the polling function inside the loop as a critical section. The checkpoint file may then contain receive requests in the receive queues, which does not matter because they are not related to physical channel information. To match an arrived message with a receive request, MPI's receive module checks only the sender's rank and the message tag, so the recovered process can still receive the messages corresponding to the restored receive requests. For this reason, the non-blocking receive does not need to be replaced with a blocking one.
Figure 10. Deadlock of blocking operation in coordinated checkpointing: after a checkpoint request, one process waits for coordination around its Send() while the peer is blocked in Recv(), so the peer's coordination is delayed.
6 Experimental Results

In this section, we present performance results for five NAS Parallel Benchmark applications [26] – LU, BT, CG, IS and MG – executing on a cluster-based grid system of four Intel Pentium III 800 MHz PCs with 256 MB of memory connected by ordinary 100 Mbps Ethernet. Linux kernel version 2.4 and Globus Toolkit version 2.2 were installed. We measured the overhead of checkpointing and the cost of recovery.
Table 1 shows the characteristics of the applications. IS uses all-to-all collective operations only. MG and LU use anonymous receive operations with the MPI_ANY_SOURCE option; however, the corresponding sender of each receive call is determined statically by message tagging. LU is the most communication-intensive application, but its messages are the smallest. Each CG process has two or three neighbors and communicates with only one of them per iteration; the message size of CG is relatively large. BT is also a communication-intensive application: each process has six neighbors in all directions of a 3D cube, and all of them use non-blocking operations in every iteration (collective operations are implemented by calling non-blocking P2P operations).

Application (Class)  Description                 Communication Pattern  Avg. Message Size (KB)  Messages Sent per Process  Executable Size (KB)  Avg. Checkpoint Size (MB)
BT (A)               Navier-Stokes equation      3D mesh                112.3                   2440                       1018.9                86.8
CG (B)               Conjugate gradient method   Chain                  144.9                   7990                       993.2                 125.9
IS (B)               Integer sort                All-to-all             2839.2                  106                        901.1                 154.9
LU (A)               LU decomposition            Mesh                   3.8                     31526                      1054.7                18.1
MG (A)               Multiple grid               Cube                   35.4                    3046                       952.3                 124.2

Table 1. Characteristics of the NPB applications used in the experiments
We measured the total execution time of the NPB applications using MPICH-G2 and MPICH-GF, respectively. To evaluate the checkpointing overhead, we varied the checkpoint period from 10% to 50% of the total execution time: with a 10% checkpoint period an application performs checkpointing 9 times, while only one checkpoint is taken with a 50% period. Figure 11 (a) shows the total execution times of each case without any failure. To evaluate the monitoring overhead, we also measured the total execution times of MPICH-G2 and of MPICH-GF without checkpointing. In this experiment, the central manager queries the local managers for their status every five seconds; if a local manager does not reply, it is assumed to have failed. The difference between the first two cases in Figure 11 indicates the monitoring overhead. The other five cases show the execution times using MPICH-GF with 1, 2, 3, 4 and 9 checkpoints, respectively.
Figure 11 (b) shows the average checkpointing overhead measured at a process. The checkpointing overhead consists of the communication overhead, the barrier overhead and the disk overhead. We regard the communication overhead as the network delay among the hierarchical managers during a single coordinated checkpoint. We used the O_SYNC option when opening the checkpoint file in order to guarantee that the blocking disk writes are flushed physically, which is considerably time-consuming (a small sketch of such an open call is given at the end of this section). As shown in the figure, the barrier overhead is quite small and the communication overhead is about the same for all applications. The disk overhead is the dominant part of the checkpointing overhead and is almost proportional to the checkpoint file size, as expected. The barrier overhead of IS (class B) is exceptionally large: the in-transit messages of IS are the largest among all applications (Table 1), and during the barrier operation they are pushed into the receiver's user-level memory, so it takes IS more time to complete the barrier. Incremental checkpointing and forked checkpointing could be applied to reduce the disk overhead. However, we are not sure that the benefit of incremental checkpointing would be as large as expected, since all the applications used have relatively small executables (about 1 MB) while using large data segments. For example, we measured the checkpoint file size of IS for two problem sizes, class A and class B: the checkpoint files are 41.7 MB and 155.8 MB respectively, while the executable remains the same. The difference in checkpoint file size comes from the heap and the stack, which tend to change frequently.
In order to measure the recovery cost, we measured the time at the central manager from failure detection to the completion of the channel update broadcast. The recovery cost per single failure is presented in Figure 12. Recovery proceeds as follows: failure detection, failure broadcast, authentication of the request and process re-launching. Among these, authentication is the dominant factor in recovery, and it is closely tied to the workings of the Globus middleware. The second largest overhead is process re-launching, most of which is spent reading the checkpoint file from disk. We note that neither of these significant overhead factors is related to scalability.
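A minimal sketch of opening a checkpoint file so that each blocking write is flushed to the storage device before returning, as described above; the path handling is illustrative and not MPICH-GF's actual checkpoint code.

    #include <fcntl.h>

    /* open the checkpoint file for synchronous writes */
    int open_checkpoint_file(const char *path)
    {
        /* O_SYNC makes each write() return only after the data has reached
         * the device, which accounts for most of the observed disk overhead */
        return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    }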
Figure 11. Failure-free overhead: (a) total execution time (in seconds) of BT (A), CG (B), IS (B), LU (A) and MG (A) for MPICH-G2, MPICH-GF without checkpointing, and MPICH-GF with checkpoint periods of 50%, 33%, 25%, 20% and 10% of the execution time; (b) composition of the checkpointing overhead (communication, disk and barrier overhead).
7 Conclusions

In this paper, we have presented the feasibility, architecture and evaluation of a fault-tolerant system for grid-enabled MPICH. Our system minimizes the computation lost by processes through periodic checkpointing and guarantees consistent recovery. It requires no modification of application source code or of the upper MPICH layers. While previous systems have modified the higher layers at the cost of performance, MPICH-GF respects the communication characteristics of MPICH through its lower-level approach; handling the communication context at the lower level makes fine-grained checkpoint timing possible. Moreover, all of our work has been done at the user level. MPICH-GF inter-operates with the Globus Toolkit and can restore a failed process on any node across domains, provided that the GASS service is available. We have also discussed the implementation issues and presented the evaluation of coordinated checkpointing. At the time of writing, an implementation of independent checkpointing with message logging is under way.
The central manager is a single point of failure; we plan to use redundancy on the central manager for high availability. Our ultimate target grid system is a collection of clusters, and local failures occurring inside a cluster should preferably be managed within that cluster. As shown in our experimental results, the recovery cost contains a time-consuming authentication overhead; since the original task request has already been authenticated, a recovery request may skip this procedure. To manage such local failures efficiently, a deeper hierarchical management architecture is required, and we are currently developing a multi-level hierarchical manager system with redundancy.
References

[1] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing, 1999.
[2] L. Alvisi, E. N. Elnozahy, S. Rao, S. A. Husain, and A. D. Mel. An analysis of communication induced checkpointing. In Symposium on Fault-Tolerant Computing, pages 242–249, 1999.
Figure 12. Recovery cost (in seconds) per single failure for BT (A), CG (B), IS (B), LU (A) and MG (A), broken down into broadcasting the failure event, job re-submission and process re-launching.
[3] L. Alvisi and K. Marzullo. Message logging: Pessimistic, optimistic, causal and optimal. IEEE Transactions on Software Engineering, 24(2):149–159, Feb. 1998.
[4] R. Batchu, A. Skjellum, Z. Cui, M. Beddhu, J. P. Neelamegam, Y. Dandass, and M. Apte. MPI/FT: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In 1st International Symposium on Cluster Computing and the Grid, May 2001.
[5] A. Beguelin, E. Seligman, and P. Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, 43(2):147–155, 1997.
[6] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. F. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In SuperComputing 2002, 2002.
[7] G. Burns, R. Daoud, and J. Vaigl. LAM: An open cluster environment for MPI. In Proceedings of the Supercomputing Symposium, pages 379–386, Toronto, Canada, 1994.
[8] R. Butler and E. L. Lusk. Monitors, messages, and clusters: The p4 parallel programming system. Parallel Computing, 20(4):547–564, 1994.
[9] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1):63–75, Feb. 1985.
[10] Y. Chen, K. Li, and J. S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of SC97: High Performance Networking and Computing, Nov. 1997.
[11] IBM Corporation. IBM LoadLeveler: User's guide, Sept. 1993.
[12] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, 2002.
[13] G. E. Fagg and J. Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000, pages 346–353, 2000.
[14] Message Passing Interface Forum. MPI: A message passing interface standard, May 1994.
[15] I. Foster and N. T. Karonis. A grid-enabled MPI: Message passing in heterogeneous distributed computing systems. In Proceedings of SC 98. ACM Press, 1998.
[16] I. Foster and C. Kesselman. The Globus project: A status report. In Proceedings of the Heterogeneous Computing Workshop, pages 4–18, 1998.
[17] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1999.
[18] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications, 15(3), 2001.
[19] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), Aug. 2001.
[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. MPICH: A high-performance, portable implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789–828, 1996.
[21] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23–31, 1987.
[22] W.-J. Li and J.-J. Tsay. Checkpointing message-passing interface (MPI) parallel programs. In Pacific Rim International Symposium on Fault-Tolerant Systems (PRFTS), 1997.
[23] M. J. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In USENIX Conference Proceedings, pages 283–290, San Francisco, CA, Jan. 1992.
[24] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou. Portable fault tolerance scheme for MPI. Parallel Processing Letters, 10(4):371–382, 2000.
[25] K. Z. Meth and W. G. Tuel. Parallel checkpoint/restart without message logging. In Proceedings of the 2000 International Workshops on Parallel Processing, 2000.
[26] NASA Ames Research Center. NAS parallel benchmarks. Technical report, http://science.nas.nasa.gov/Software/NPB/, 1997.
[27] R. Netzer and J. Xu. Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165–169, 1995.
[28] N. Neves and W. K. Fuchs. RENEW: A tool for fast and efficient implementation of checkpoint protocols. In Symposium on Fault-Tolerant Computing, pages 58–67, 1998.
[29] G. T. Nguyen, V. D. Tran, and M. Kotocová. Application recovery in parallel programming environment. In European PVM/MPI, pages 234–242, 2002.
[30] A. Nguyen-Tuong. Integrating Fault-Tolerance Techniques in Grid Applications. PhD thesis, University of Virginia, USA, 2000.
[31] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under Unix. In USENIX Winter 1995 Technical Conference, Jan. 1995.
[32] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972–986, 1998.
[33] S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48–55, 1999.
[34] S. H. Russ, J. Robinson, B. K. Flachs, and B. Heckel. The Hector distributed run-time environment. IEEE Transactions on Parallel and Distributed Systems, 9(11):1102–1114, Nov. 1998.
[35] L. Silva and J. G. Silva. System-level versus user-defined checkpointing. In Symposium on Reliable Distributed Systems 1998, pages 68–74, 1998.
[36] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526–531, Apr. 1996.
[37] J. Tsai, S.-Y. Kuo, and Y.-M. Wang. Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability. IEEE Transactions on Parallel and Distributed Systems, 9(10):963–971, 1998.
[38] V. Zandy, B. Miller, and M. Livny. Process hijacking. In Eighth International Symposium on High Performance Distributed Computing, pages 177–184, Aug. 1999.