User-Level Checkpoint and Recovery for LAM/MPI

Youhui ZHANG, Dongsheng WANG, Weimin ZHENG
Department of Computer Science, Tsinghua University, Beijing 100084, China

zyh02@tsinghua.edu.cn

Abstract: As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. We integrated a user-level checkpointing and rollback recovery (CRR) library into LAM/MPI, a high-performance implementation of the Message Passing Interface (MPI), to improve its availability. Compared with the current CRR implementation of LAM/MPI, our work supports file checkpointing and offers higher portability, running on more platforms, including IA32 and IA64 Linux. In addition, tests show that the CRR mechanism of our implementation introduces less than 15% performance overhead.

1. Introduction

A cluster of computers (COC) is a parallel system built from off-the-shelf computers connected through a high-speed network. It offers a cost-effective platform for high-performance computation. While the growth in CPU count has provided great increases in computing power, it also presents significant reliability challenges to applications: as the node count increases, the reliability of the parallel system decreases. Failures in the environment make it more difficult to complete long-running jobs, and reliability is becoming a limiting factor on scalability.

The Message Passing Interface (MPI) is the de facto standard for message-passing parallel programming on large-scale distributed systems [1,2,3,4,5,6]. Implementations of MPI form the middleware layer for many large-scale high-performance applications [7,8,9,10]. Improving the availability of MPI systems is therefore a meaningful task, since the MPI standard itself does not specify any particular kind of fault-tolerant behavior.

LAM/MPI is a high-performance implementation of the MPI standard from Indiana University. In addition to high performance, LAM provides a transparent checkpointing and rollback recovery (CRR) framework [11] with high portability. The framework implements most of the management of coordinated checkpointing and rollback recovery for MPI parallel applications, including the interaction between checkpointing and the execution of processes, checkpointing synchronization, etc. [12]. In other words, the framework itself can be used to support a wide variety of single-process CRR tools. The current implementation in LAM/MPI uses the BLCR tool [13], which is available for Linux. Although LAM/MPI with BLCR is a transparent, high-performance implementation of CRR, it lacks portability because BLCR is a kernel-level single-process CRR tool for Linux. Since it works as a kernel module, it is not a universal tool across different Linux versions and platforms. Moreover, BLCR does not support file checkpointing.

To solve these problems, we integrate a portable single-process CRR tool, libcsm, into LAM/MPI under its framework, so that it can provide a transparent checkpointing and rollback recovery mechanism for MPI applications on different platforms, including the IA32 and IA64 architectures and different Linux versions. Libcsm is a user-level CRR tool for single processes developed by our team, which supports Linux (IA32 & IA64), AIX, Solaris and Windows NT. It supports incremental, memory-excluded and copy-on-write CRR modes. In addition, we design and realize a low-overhead approach to checkpointing user files, called MOB (Modification Operation Buffering) [14]. The basic idea of MOB is to make all the modifications between two checkpoints atomic: either all the modifications are executed (if the process runs correctly to the next checkpoint), or all the modifications are aborted (if not). Checkpointing of user files is achieved in MOB by buffering all modification operations issued after a checkpoint; at the time of the next checkpoint, the buffered operations are flushed to the user files and then cleared (a minimal sketch of this idea appears at the end of this section).

Although the framework completes most of the management of CRR, integrating libcsm with LAM/MPI is not as easy as it looks, because LAM/MPI builds several assumptions about the CRR tool into the framework. For example, it supposes that the single-process CRR tool is thread-based, while libcsm is signal-based.

The remainder of this paper is organized as follows. Section 2 introduces related work. Some background knowledge, including the CRR framework of LAM/MPI, is described in Section 3. Section 4 gives the details of our implementation, and some testing results are listed in Section 5. Finally, conclusions are presented.
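As a concrete illustration of the MOB idea described above, the following C sketch shows one way to buffer file modifications between checkpoints. It is a minimal illustration under our own naming assumptions: mob_write, mob_flush and mob_abort are hypothetical helpers, not the paper's actual code, and a full implementation would also have to serve reads from the buffer so the process sees its own pending writes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* One buffered modification: a write of len bytes at a file offset. */
    typedef struct mob_op {
        long offset;
        size_t len;
        char *data;
        struct mob_op *next;
    } mob_op;

    static mob_op *log_head = NULL, *log_tail = NULL;

    /* Intercepted write: record the modification instead of applying it. */
    void mob_write(long offset, const void *buf, size_t len) {
        mob_op *op = malloc(sizeof *op);
        op->offset = offset;
        op->len = len;
        op->data = malloc(len);
        memcpy(op->data, buf, len);
        op->next = NULL;
        if (log_tail) log_tail->next = op; else log_head = op;
        log_tail = op;
    }

    /* At the next checkpoint: commit all buffered writes in issue order,
     * then clear the log, so the file matches the checkpointed state. */
    void mob_flush(FILE *f) {
        for (mob_op *op = log_head; op; ) {
            fseek(f, op->offset, SEEK_SET);
            fwrite(op->data, 1, op->len, f);
            mob_op *done = op;
            op = op->next;
            free(done->data);
            free(done);
        }
        log_head = log_tail = NULL;
        fflush(f);
    }

    /* On rollback: discard the log; the file stays as it was at the last
     * checkpoint, so the aborted modifications never reach the disk. */
    void mob_abort(void) {
        for (mob_op *op = log_head; op; ) {
            mob_op *done = op;
            op = op->next;
            free(done->data);
            free(done);
        }
        log_head = log_tail = NULL;
    }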

2. Related Work

Checkpoint/restart for sequential programs has been well studied. Libckpt is an open-source library for transparent checkpointing of UNIX processes. It contains support for incremental checkpoints, in which only pages that have been modified since the last checkpoint are saved. Condor [15] is another system that provides checkpointing services for single-process jobs on a number of UNIX platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [16] provides a kernel implementation of checkpoint/restart for Linux. CRAK also supports migration of networked processes by adopting a novel approach to socket migration. BLCR (Berkeley Lab's Checkpoint/Restart) is a kernel implementation of checkpoint/restart for multi-threaded applications on Linux.

In the context of parallel programs, some implementations are also available for checkpointing MPI applications running on commodity hardware, including CoCheck [17] and the NCCU MPI implementation [18]. CoCheck is built into a native MPI library called tuMPI and layered on top of a portable single-process checkpointing mechanism. CoCheck uses a special process to coordinate checkpoints, which sends a checkpoint request notification to all the processes belonging to the MPI job. On receiving this trigger, each process sends a "ready message" (RM) to all other processes, and stores all incoming messages from each process in specially reserved buffers until all the RMs have been received. The underlying checkpointer then saves the execution context of each process to stable storage. At restart, a receive operation first checks the buffers for a matching message. If there is such a message, it is retrieved from the buffer; otherwise, a real receive operation fetches the next matching message from the network.

The NCCU MPI implementation uses libckpt as the back-end checkpointer. Checkpointing of processes running on the same node is coordinated by a local daemon process, while processes on different nodes are checkpointed in an uncoordinated manner using message logging. Both systems, however, are built on MPI libraries that primarily serve as research platforms and are not widely used. Compared with these implementations, LAM/MPI is a widely used, industrial-strength open-source implementation of MPI.

3. Background

3.1 Checkpoint-Based Rollback Recovery

In the context of message-passing parallel applications, a global state is a collection of the individual states of all participating processes and of the states of the communication channels. A consistent global state is one that may occur during a failure-free, correct execution of a distributed computation; it can be used to restart process execution upon failure. Checkpoint/restart techniques for parallel jobs can be broadly classified into three categories: uncoordinated, coordinated, and communication-induced [19]. This paper deals with coordinated CRR.

3.2 User-level vs. Kernel-level

Many existing checkpoint/restart projects use a user-level implementation strategy. In this approach, the operating system is unmodified and remains completely unaware of checkpoints and restarts. In order to save and restore the state of a program, user-level checkpoint/restart implementations intercept a wide range of system calls, so as to keep track of program state (such as memory-mapped regions and open file handles) that needs to be saved during a checkpoint and restored during a restart. User-level checkpoint/restart implementations have disadvantages by their very nature: they cannot fully restore resources that are not fully specifiable through user APIs. The process id and session id of a job, for instance, cannot be restored to their original values in a user-level implementation. This rules out checkpointing a range of applications (such as some standard UNIX shells and scripting languages) that may rely on their parent's and/or children's pids remaining constant. In contrast, a kernel-level checkpoint implementation can simply access the data it needs right in the kernel, reducing the potential for such inconsistency. We believe, however, that the CRR capabilities of user-level tools, such as file CRR and the recovery of process execution, are sufficient for high-performance scientific computation. Moreover, since a kernel-level implementation works as a kernel module, it requires substantial modification when ported to different platforms and different Linux versions; the portability of the user-level approach is therefore much higher.
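For illustration, the following sketch shows the kind of system-call interception such user-level tools rely on: a wrapped open() that records enough information (path, flags) to re-open the file at restart. The tracked[] table and its layout are assumptions made for this example, not any particular library's internals; on Linux the wrapper would be built into a preloaded library and linked with -ldl.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>
    #include <sys/types.h>

    /* What must survive a checkpoint for each open descriptor. */
    struct tracked_file { char path[256]; int flags; int in_use; };
    static struct tracked_file tracked[1024];

    int open(const char *path, int flags, ...) {
        /* Look up the real open() in libc the first time through. */
        static int (*real_open)(const char *, int, ...);
        if (!real_open)
            real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {      /* mode argument exists only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }
        int fd = real_open(path, flags, mode);

        /* Record the state a checkpoint must save; at restart the library
         * replays these open() calls and seeks to the saved offsets. */
        if (fd >= 0 && fd < 1024) {
            strncpy(tracked[fd].path, path, sizeof tracked[fd].path - 1);
            tracked[fd].flags = flags;
            tracked[fd].in_use = 1;
        }
        return fd;
    }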

3.3 Coordinated Checkpointing

With the coordinated approach, the taking of local checkpoints by individual processes is orchestrated in such a way that the resulting global state is guaranteed to be consistent. It simplifies recovery and is not susceptible to the domino effect, since every process always restarts from its most recent checkpoint. Also, a coordinated protocol requires each process to maintain only one checkpoint file, reducing storage overhead and eliminating the garbage collection that uncoordinated protocols need. The main disadvantage, however, is the large latency of coordination [20]. A minimal sketch of the coordination pattern follows.
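This sketch assumes a hypothetical take_local_checkpoint() supplied by some single-process CRR tool. It shows only the synchronization skeleton; real protocols, including LAM/MPI's, must additionally drain in-flight messages, which the bare barriers below do not do.

    #include <mpi.h>

    extern void take_local_checkpoint(int rank);  /* assumed CRR primitive */

    void coordinated_checkpoint(MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* 1. Enter the checkpoint together: no process may run ahead and
         *    checkpoint while others still depend on its future messages. */
        MPI_Barrier(comm);

        /* 2. Each process saves its local state; the set of these local
         *    checkpoints is the single recovery line, so there is no
         *    domino effect and only one checkpoint file per process. */
        take_local_checkpoint(rank);

        /* 3. Leave together: nobody resumes until all states are saved. */
        MPI_Barrier(comm);
    }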

3.4 The CRR Framework of LAM/MPI

The LAM/MPI library consists of two layers. The upper layer is portable and independent of the communication subsystem (i.e., MPI function calls and accounting utility functions). The lower layer consists of a modular framework for components called SSI, the System Services Interface. SSI is composed of a number of component types, each of which provides a single service to the LAM/MPI implementation.

Fig. 1. The layered design of LAM/MPI: the MPI layer, containing the RPI and CR components, sits above the LAM layer. (This figure is taken from [11].)

There are two types of components relevant to CRR: the Request Progression Interface (RPI) and Checkpoint/Restart (CR). The RPI component type is responsible for all MPI point-to-point communication, and the CR component type is the sole interface to the back-end checkpointing system that actually performs the checkpoint and restart functionality. The layered design of LAM/MPI is presented in Fig. 1.

For an MPI job to be checkpointable, it must have a valid CR module, and each of the other SSI modules that it has chosen at run time must support some abstract checkpoint/restart functionality. The internal SSI checkpoint/restart interfaces were carefully designed to preserve strict abstraction barriers between the CR SSI and the other SSI modules. This strict separation of back-end checkpointing services and communication allows new back-end checkpointing systems to be "plugged in" simply by providing a new CR SSI module; the existing RPI modules (and other SSI component types) can then utilize their services with no modifications. Although LAM has multiple RPI modules available, there is currently only one CR module available: blcr, which utilizes the BLCR single-process CRR tool.

3.4.1 The CR SSI

At the start of execution of an MPI job, the SSI framework chooses the set of modules from each SSI component type that will be used. In the case of the CR SSI, it determines whether checkpoint/restart support was requested and, if so, selects a CR module to run (in the current case blcr, since it is the only module available). All modules in the CR SSI provide a common set of APIs to be used by the MPI layer.
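As an illustration of what such a common set of APIs can look like, here is a hypothetical CR-module action table in C. The struct and its fields are our own invention for this sketch, not the actual LAM/MPI declarations.

    /* Hypothetical shape of a CR module: the framework queries each module,
     * picks one (e.g., blcr), and thereafter drives checkpoint/restart only
     * through these entry points, never through the back-end tool directly. */
    typedef struct cr_module {
        const char *name;             /* e.g., "blcr"                     */
        int (*query)(int *priority);  /* is this module usable here?      */
        int (*init)(void);            /* register callbacks at MPI_INIT   */
        int (*suspend)(void);         /* quiesce the app for a checkpoint */
        int (*finalize)(void);        /* tear down at MPI_FINALIZE        */
    } cr_module;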

3.4.2 The RPI SSI

To support checkpointing, an RPI module must have the ability to generically "prepare for checkpoint," "continue after checkpoint," and "restore from checkpoint." A checkpointable RPI module must therefore provide API functions to perform this functionality. The following functions are invoked from the thread-based callback in the CR SSI:

Checkpoint: invoked when a checkpoint request comes in, usually to consume any in-flight messages.

Continue: invoked to perform any operations that might be required when a process continues execution after a checkpoint is taken.

Restart: invoked to re-establish connections and perform any other operations that might be required when a process restarts execution from a saved state.
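In C terms, a checkpointable RPI module therefore exports three entry points along these lines. This is an illustrative signature, not the real LAM/MPI type; "continue" is spelled cr_continue below because continue is a C keyword.

    typedef struct rpi_cr_actions {
        /* Checkpoint: a request arrived; drain in-flight messages so the
         * saved global state is consistent. */
        int (*checkpoint)(void);
        /* Continue: the checkpoint was taken and the process keeps running;
         * usually little to do, since connections are still alive. */
        int (*cr_continue)(void);
        /* Restart: the process was restored from a saved state; re-establish
         * connections and anything else that did not survive the save. */
        int (*restart)(void);
    } rpi_cr_actions;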

These functions are independent of the back-end single-process CRR tool used, which means this part of the framework is kept almost unmodified in our work. To replace BLCR with libcsm, we design new CR modules and modify our tool to provide the needed APIs. Because LAM/MPI supposes that checkpointing is thread-based and implements the framework on this assumption, its workflow has to be modified. The details are presented in the next section.

4. Design and Implementation

The CR SSI type is composed of two SSI subtypes: crmpi and crlam. Hence, we implement a component comprising two modules, one of each subtype. These two modules work together at run time to effect the overall checkpoint/restart functionality. crmpi modules are used to invoke checkpoint/restart functionality in MPI processes (i.e., processes that invoke MPI_INIT and MPI_FINALIZE). Coordination between running MPI processes at checkpoint and restart time is accomplished, in part, by mpirun; mpirun uses crlam modules to perform this coordination and to save its state.

4.1 Libcsm-based crmpi module

Libcsm is a user-level single-process CRR tool designed for application programmers. To use libcsm, all one must do is recompile the source code with libcsm.a. During execution, the user can send a predefined checkpoint signal to the process (the kill command can be used), which triggers checkpointing to save the execution state to disk. The state includes the running context, data segments, stack, heap and information about open files. On reception of a recovery signal, the process restores the saved state from disk. Libcsm has been ported to a variety of operating systems and architectures, including Linux (IA32 & IA64), AIX, Solaris and Windows NT.

To integrate libcsm into LAM/MPI, we implement a new crmpi module named crmpi-charm, which supplies the following interfaces:

lam_ssi_crmpi_charm_init: registers the signal-based callbacks that perform the actual checkpoint/restart functionality provided by libcsm.

checkpoint_handler & recover_handler: the callback handlers registered by the previous function.

lam_ssi_crmpi_charm_query: called by the framework to register the module itself.

Some other optional APIs needed by the framework, including lam_ssi_crmpi_charm_finalize and lam_ssi_crmpi_charm_app_suspend, are provided but do not do any actual work. The crmpi module is only responsible for the high-level coordination required in MPI processes: receiving CRR requests, coordinating the checkpoint, continue, and restart action points, and interfacing to the back-end checkpointing library. crmpi modules are not responsible for closing/re-establishing network connections, draining "in-flight" MPI messages, or any other prepare-to-checkpoint/restore-from-checkpoint actions, which are accomplished by the predefined RPI modules in the LAM/MPI framework.

4.2 Libcsm-based crlam module

The crlam module, named crlam-charm, provides the following functions:

lam_ssi_crlam_charm_query: called by the framework to register the module itself.

lam_ssi_crlam_charm_init: registers the signal-based callbacks that perform the actual checkpoint/restart functionality provided by libcsm. It differs from lam_ssi_crmpi_charm_init in that it is used by mpirun, whereas the latter is called by MPI processes.

recover_handler & checkpoint_handler: the callback handlers registered by the previous function.

lam_ssi_crlam_charm_checkpoint: invoked by mpirun when it receives the checkpoint request. It propagates the checkpoint request out to all MPI processes started by mpirun and creates an application schema suitable for restarting the MPI application.

Fig. 2. The original checkpoint workflow: while the application thread executes outside the MPI library, the CR thread sleeps; on a checkpoint request it acquires the MPI mutex (blocking the application thread's calls into the MPI library), prepares the RPI for checkpoint, takes the checkpoint, lets the RPI continue, releases the mutex and sleeps again.

Some other optional APIs needed by the framework, including lam_ssi_crlam_charm_disable_checkpoint, lam_ssi_crlam_charm_enable_checkpoint, lam_ssi_crlam_charm_continue, and lam_ssi_crlam_charm_finalize, are provided but do not do any actual work. A sketch of the checkpoint hook follows.
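The heart of crlam-charm's checkpoint hook can be pictured as follows. The argument list, the schema file format, and the direct use of kill() are simplifying assumptions of this sketch; in LAM the signalling actually goes through the lamd daemons, and the real lam_ssi_crlam_charm_checkpoint has a different signature.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    #define CKPT_SIGNAL SIGUSR1  /* assumed checkpoint-request signal */

    int lam_ssi_crlam_charm_checkpoint(const pid_t *pids, int nprocs,
                                       const char *schema_path) {
        /* 1. Propagate the checkpoint request to every MPI process. */
        for (int i = 0; i < nprocs; i++)
            if (kill(pids[i], CKPT_SIGNAL) != 0)
                return -1;

        /* 2. Save the application schema (the job's process topology) so a
         *    new mpirun can restart the job from the checkpoint files. */
        FILE *f = fopen(schema_path, "w");
        if (f == NULL)
            return -1;
        for (int i = 0; i < nprocs; i++)
            fprintf(f, "rank %d pid %ld\n", i, (long)pids[i]);
        fclose(f);
        return 0;
    }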

4.3 Workflow

LAM provides a daemon-based run-time environment (RTE). A user-level daemon (lamd), launched on every node at the beginning of an execution, provides many of the services needed for the MPI RTE; lamd provides process control for all MPI jobs executed under LAM/MPI.

In the original implementation, mpirun invokes the initialization function of the blcr checkpoint/restart SSI module to register callback functions with BLCR and spawns a CRR thread at the start of execution. The thread-based callback running in the new thread is required to propagate checkpoint requests to the MPI processes. The callback computes the names under which the images of each MPI process will be stored to disk and saves the process topology of the MPI job (called the "application schema" in LAM) in mpirun's address space, to be used for restoring the applications at restart. It then signals all the MPI processes about the pending checkpoint request by instructing the relevant lamds to invoke CR checkpoint for every process that is part of this MPI job. Once this is done, the callback thread indicates that mpirun is ready to be checkpointed. When an MPI process receives a checkpoint request from mpirun, the threaded callback in the blcr module starts executing.

Therefore, in the original CRR implementation of LAM/MPI, most of the work of the CR SSI is done in the separate CRR thread, and the checkpoint workflow of one MPI process is described in Fig. 2. Similarly, mpirun has two threads; its workflow is omitted here. More detailed information can be found in [21].

In contrast, libcsm is signal-based, which means we have to modify this workflow; the new workflow is presented in Fig. 3. In our implementation there is only one thread for each MPI process and for mpirun, so all the steps of Fig. 2 are completed in this single thread. When mpirun receives the checkpoint request, it propagates requests to all MPI processes through lam_ssi_crlam_charm_checkpoint of crlam-charm. On reception of this signal, an MPI process breaks its normal execution and calls the RPI module to coordinate with the other processes so as to save any in-flight messages, which is completed by crmpi-charm. Then the process saves its state, and the RPI module is used to restore the communication channels. After these steps, mpirun is ready to be checkpointed through crlam-charm.

Fig. 3. The new checkpoint workflow: the application thread is interrupted by the checkpoint signal, prepares the RPI for checkpoint, takes the checkpoint, lets the RPI continue, and then resumes normal execution.

When a checkpointed MPI job is restarted by mpirun on receiving a recovery signal, the signal-based callback function exec()s a new mpirun. Through crlam-charm, mpirun propagates recovery requests to all the MPI processes recorded in the application schema that was saved at checkpoint time. The MPI processes then call the RPI module to coordinate with the other processes and recover from the saved states. Finally, the signal-based callback re-establishes new TCP sockets with each of its MPI peers so that execution can continue, which is also completed by the RPI module. This workflow is presented in Fig. 4.

Fig. 4. The new recovery workflow.

As noted before, the RPI modules of the LAM/MPI framework shoulder the CRR synchronization and coordination work, so our modifications focus on the interfaces and workflows related to the CR modules. A sketch of the signal-driven checkpoint path follows.
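Putting the pieces together, the signal-driven checkpoint path of Fig. 3 amounts to the following handler, where rpi_prepare_checkpoint(), libcsm_checkpoint() and rpi_continue() stand in for the real RPI and libcsm calls; the names are ours, chosen for the sketch.

    #include <signal.h>

    extern void rpi_prepare_checkpoint(void);  /* drain in-flight messages   */
    extern void libcsm_checkpoint(void);       /* save context, data, stack,
                                                  heap and file state to disk */
    extern void rpi_continue(void);            /* restore the comm channels  */

    static void checkpoint_handler(int sig) {
        (void)sig;
        /* The application thread itself is interrupted by the signal, so,
         * unlike Fig. 2, no mutex with a separate CR thread is involved. */
        rpi_prepare_checkpoint();   /* coordinate with peers (crmpi-charm) */
        libcsm_checkpoint();        /* single-process checkpoint to disk   */
        rpi_continue();             /* re-arm the communication channels   */
        /* Returning from the handler resumes normal execution. */
    }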

5. Performance

Several programs from the NAS NPB-2.2 benchmark suite are selected to evaluate the overhead of the system on four Intel servers, each equipped with two 1.30 GHz Itanium-2 CPUs and 2 GB of RAM. First, we execute the programs with different arguments without any checkpointing and record the running time. For example, cg.A.8 means the cg program is compiled into 8 tasks, with every node running two of them. Then the programs are recompiled to include the checkpointing functions and are checkpointed 10 times during the run. The running times are listed in Table 1.

Table 1. Parallel testing results of the system (Program | Normal Run Time (s) | Run Time with 10 Checkpoints (s) | Overhead (%)).

It is obvious that the time overhead introduced by our checkpointing software is small: the ratio of the extra time to the normal time is always below 15%. Next, we test the CRR performance of our single-process CRR tool on one IA-64 node, using a sequential matrix-multiplication program. The test focuses on the running-time overhead with and without checkpointing; the results are listed in Table 2.

Matrix Size    Checkpoint File Size (bytes)    Running Time without Checkpointing (s)    Running Time with 10 Checkpoints (s)
800x800        5403856                         105                                       106
1000x1000      8283856                         210                                       211
1500x1500      18283888                        726                                       727
2000x2000      32283952                        1758                                      1768
3000x3000      72257080                        15971                                     15959

Table 2. Sequential testing results of our single-process CRR tool. The program is checkpointed 10 times during the run, and we record the running time and the checkpoint files' size. It is obvious that the time overhead introduced by our checkpointing software is very small: the ratio of the extra time to the normal time is always below 1%.

6. Conclusions

We replace the CRR library employed in LAM/MPI with libcsm, a user-level CRR tool developed by our group. Compared with the current CRR implementation of LAM/MPI, our work supports file checkpointing and can run on more platforms, including IA32, IA64 and other UNIX systems. LAM/MPI introduced a CRR framework that enforces a strict separation between back-end checkpointing services and communication, and thus allows new back-end checkpointing systems to be "plugged in" simply by providing a new CR SSI module. However, it supposes that checkpointing is thread-based, while libcsm is signal-based. The workflow is therefore adjusted so that, for every MPI process and for mpirun, all CRR-related work is done in a single thread. Testing shows that the CRR mechanism of our implementation introduces negligible performance overhead, and that the time spent storing CRR files is the key performance issue.

References

[1] A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, W. Saphir, T. Skjellum, and M. Snir. MPI-2: Extending the Message-Passing Interface. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par'96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 128-135. Springer Verlag, 1996.
[2] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir. MPI - The Complete Reference: Volume 2, the MPI-2 Extensions. MIT Press, 1998.
[3] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1994.
[4] W. Gropp, E. Lusk, and R. Thakur. Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, 1999.
[5] Message Passing Interface Forum. MPI: A Message Passing Interface. In Proc. of Supercomputing '93, pages 878-883. IEEE Computer Society Press, November 1993.
[6] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, MA, 1996.
[7] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In J. W. Ross, editor, Proceedings of Supercomputing Symposium '94, pages 379-386. University of Toronto, 1994.
[8] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, Sept. 1996.
[9] W. D. Gropp and E. Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
[10] The LAM Team. Getting Started with LAM/MPI. University of Notre Dame, Department of Computer Science, http://www.lam-mpi.org/, 1998.
[11] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. In LACSI Symposium, October 2003.
[12] E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 39-47, Oct. 1992.
[13] J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart, 2002.
[14] Pei Dan and Wang Dongsheng. MOB: A Novel Approach to Checkpoint Active Files. Acta Electronica Sinica, 28(5):9-12, 2000.
[15] M. Litzkow and M. Solomon. The Evolution of Condor Checkpointing, 1998.
[16] H. Zhong and J. Nieh. CRAK: Linux checkpoint/restart as a kernel module. Technical Report CUCS-014-01, Department of Computer Science, Columbia University, 2001.
[17] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, Sept. 1996.
[18] W-J. Li and J-J. Tsay. Checkpointing Message-Passing Interface (MPI) Parallel Programs. In Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, 1997.
[19] M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, 1996.
[20] D. Briatico, A. Ciuffoletti, and L. Simoncini. A distributed domino-effect free recovery algorithm. In Proceedings of the Fourth International Symposium on Reliability in Distributed Software and Databases, pages 207-215, 1984.
[21] S. Sankaran, J. M. Squyres, B. Barrett, and A. Lumsdaine. Checkpoint/Restart System Services Interface (SSI) Modules for LAM/MPI. Open Systems Laboratory, Pervasive Technologies Labs, Indiana University, http://www.lam-mpi.org/, August 4, 2003.