New User-guided and ckpt-based Checkpointing Libraries for Parallel MPI Applications

Paweł Czarnul and Marcin Fraczak
Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland
[email protected], [email protected]
http://fox.eti.pg.gda.pl/∼pczarnul

Acknowledgements. Partially covered by the Polish National Grant KBN No. 4 T11C 005 25. Task WP.13 of 6 T11 2003 C/06098 "CLUSTERIX - The National Linux Cluster". Calculations carried out at the Academic Computer Center in Gdansk, Poland.

Abstract. We present design and implementation details as well as performance results for two new parallel checkpointing libraries developed by us for parallel MPI applications. The first one, a user-guided library, requires the programmer to supply packing and unpacking code through an easy-to-use API based on MPI constants. It uses MPI-2 collective I/O calls or a dedicated master process for checkpointing. The other is a technically advanced parallel implementation of checkpointing based on the user-level ckpt library. It uses wrappers for MPI calls in the user program, which makes it possible to run a shadow MPI application just for communication purposes. Communication between the original processes and the shadow MPI code is done via shared memory segments to which communication buffers are mapped. We present checkpoint/restart times for the two approaches and the subversions proposed by us, compared to the available LAM MPI/BLCR checkpointing solution for MPI applications. The performance of all the versions and I/O optimizations are discussed for a 4-node, 16-processor cluster with NFS and specifically for single SMP nodes with a local file system.

1 Introduction and Goals

Checkpointing allows an application to be saved periodically so that it can be restarted after a system crash. Secondly, it enables a process to migrate to another node in order to balance load or to make some nodes available to the user. For checkpointing of parallel MPI programs, the solution must either handle all pending communication at the time a checkpoint signal is issued or assume a simplified model in which checkpoints are generated at designated points where there is no application data in the buffers or in the network. The following checkpointing methods can be distinguished:

1. user-guided ([1], [2], [3]) – the programmer specifies what data needs to be included in or excluded from the checkpoint,
2. user-level libraries such as ckpt ([4]), Condor ([5]) and Libckpt ([6]) – these usually require linking a library to a program with slight or no modifications to the code. They do not require root privileges but are often limited in handling system calls, threads etc. [7] announces Hector (alpha version) – checkpointing for MPI with Dynamite 2.0,

3. hybrid approaches – [8] presents an interesting work in which the programmer inserts calls to PotentialCheckpoint at points in the code where a checkpoint can be invoked; the whole memory space is saved automatically though,
4. modifications or extensions of existing implementations, e.g. LAM MPI/BLCR (LAM – [9] coupled with the kernel-level BLCR – [10]) or MPICH-V ([11], MPICH coupled with Condor – [5]). This kind of solution is especially attractive for parallel MPI programs as it can offer a truly transparent solution, but it is tightly coupled with the internals of a particular implementation and may require root privileges, as LAM/BLCR does.

The contributions of our work are two advanced implementations of checkpointing libraries using coordinated ([8]) algorithms (available from the authors, to be released at http://fox.eti.pg.gda.pl/∼pczarnul):

PARUG: a flexible user-guided version with fast MPI-2 file system calls, saving checkpoints collectively or through a dedicated master process. Only the necessary data needs to be packed, selected by the knowledgeable programmer in the code. Thus it has the potential of saving only the data necessary after restart.

PARCKPT: an extensible, fast and transparent checkpointing version for any MPI implementation using a sequential checkpointing library – a ckpt-based ([4]) version was developed, with checkpoints saved locally or through a checkpoint server (a ckpt feature). Contrary to other transparent solutions like LAM MPI/BLCR or MPICH-V, PARCKPT can be used with any MPI implementation, giving transparent checkpointing, and can also be adapted for use with other sequential checkpointing libraries/tools, examples of which are Condor ([5]) and Libckpt ([6]). These benefits come at the cost of a slightly limited application model – a synchronous parallel application – it is assumed that all processes successively reach the same points in the program code where a checkpoint can occur, and pending messages are not considered at those points. This shortens the time to checkpoint, and the model is suitable for a wide range of applications, e.g. SPMD ([12]) or synchronous Master-Slave.

2 Proposed User-guided Checkpointing

In PARUG, the collective routine CX_CheckCheckpoint() (sample code shown in Figure 1) needs to be inserted by the programmer at potential points where checkpoints can occur, denoting iterations; if ordered by a signal earlier, it saves the states of the processes using MPI-2. Alternatively, the state of the application can be saved by one designated process if the programmer does not provide synchronized operations. The sequence of actions for the checkpointing procedure is as follows (Figure 1):

1. A SIGUSR1 signal is sent to any MPI process, which sets a global flag.
2. As soon as the process calls function CX_CheckCheckpoint() (potentially in a loop of SPMD computations), the flag is read and asynchronous messages are sent to the other processes. An iteration number (Paragraph 1, hidden from the programmer) at which the checkpoint is to occur is also propagated (received by MPI_Irecv calls).


Fig. 1: Proposed User-guided Approach: Inter Process Communication Schema

3. In the following operating modes of PARUG, the corresponding actions occur:
– CX_SYNCHRONIZED: when the program reaches the required iteration number, all processes save the data pointed to by the programmer to one checkpoint file. All processes can synchronize on CX_CheckCheckpoint() for checkpointing;
– CX_LOOSE: only one selected master process saves data to the checkpoint file at the first call to CX_CheckCheckpoint().

Independently of the above, the library can operate in two modes:
1. Parallel data write (default) – checkpoint data is written/read by the MPI_File_write_at_all/MPI_File_read_at_all MPI-2 functions, which can speed up data write times/access by request grouping, collective buffering etc. ([13]).
2. Data write through a master process – all checkpoint data from all processes is sent to the process with MPI rank 0, which then writes the data to a file using MPI-2 calls.
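To illustrate the first mode, the sketch below shows the standard MPI-2 collective write pattern referred to above. It is our own minimal example, not PARUG source code; the function name is ours, and it assumes every process contributes the same number of doubles written at a rank-dependent offset of one shared checkpoint file:

/* Minimal sketch (our illustration, not the actual PARUG code) of the
 * "parallel data write" mode: each process writes its packed checkpoint
 * data at a rank-dependent offset of one shared file using the MPI-2
 * collective call mentioned above. Equal counts per process are assumed. */
#include <mpi.h>

void write_checkpoint_collectively(const char *fname, const double *data, int count)
{
    int rank;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    offset = (MPI_Offset) rank * count * sizeof(double);

    MPI_File_open(MPI_COMM_WORLD, (char *) fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Collective write: lets the MPI-2 layer group requests and use
     * collective buffering across processes ([13]). */
    MPI_File_write_at_all(fh, offset, (void *) data, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}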

3 Proposed ckpt-based Parallel MPI Checkpointing Library

In PARCKPT, no code changes are required, but all MPI functions are replaced by wrappers (prefixed with RES_, sample code shown in Figure 2). The wrappers for MPI communication routines denote the aforementioned potential checkpoint points and count iterations internally, which is used to calculate a global iteration number for checkpointing. Thus, currently, the library can be used with synchronous applications with a uniform number of communication actions per process per iteration. In this solution, a static library is linked with the original user application instead of an MPI library. The new library includes functions substituting the MPI functions (prefixed with RES_). The original MPI functions are called by another process, a wrapper. This makes it possible to checkpoint the original processes using ckpt, since those processes do not

call true MPI functions. The wrapper also prepares the MPI world before the start and after the restart of the user application. For each process of the application a separate wrapper process is created (Figure 2).
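As an illustration of this split, the following sketch shows what one RES_-prefixed substitute could look like. The RES_ prefix comes from the text; everything else (the shared request layout, the byte-oriented count, the SIGUSR2 notification, wait_for_wrapper()) is a hypothetical simplification of the actual PARCKPT protocol:

/* Hypothetical sketch of a RES_ substitute: the user process never calls
 * MPI itself; it hands the request to its wrapper process through shared
 * memory and a signal, and the wrapper performs the real MPI_Send. */
#include <signal.h>
#include <string.h>
#include <sys/types.h>

struct shm_request {                /* layout shared with the wrapper (assumed)  */
    int  op;                        /* which MPI call the wrapper should issue   */
    int  dest, tag, count;          /* count is a byte count in this sketch      */
    char payload[1 << 20];          /* used by the naive, copy-based path        */
};

extern struct shm_request *req;     /* segment attached with shmat() at startup  */
extern pid_t wrapper_pid;
extern void wait_for_wrapper(void); /* blocks until the wrapper reports completion */

int RES_MPI_Send(const void *buf, int count, int dest, int tag)
{
    req->op = 1;                    /* 1 = send (assumed encoding) */
    req->dest = dest;
    req->tag = tag;
    req->count = count;
    memcpy(req->payload, buf, (size_t) count);

    kill(wrapper_pid, SIGUSR2);     /* ask the wrapper to call the real MPI_Send */
    wait_for_wrapper();
    return 0;
}

The memcpy above is exactly the copy that the real implementation avoids by remapping the user buffer into the shared segment, as described after Figure 2.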

Fig. 2: Proposed ckpt-based Checkpointing: Inter Process Communication Schema

Application–wrapper communication uses signals and shared memory, i.e. the user application only passes data through shared memory to the wrapper, which calls the MPI functions. However, copying user data into/from shared memory regions when passing/fetching it to/from the wrapper for sending/receiving decreases performance. Therefore we attach the shared memory 'window' to the memory region that already contains the user data – the buffers. This is done using the shmat function with the SHM_REMAP flag set, which, however, removes all other memory mappings from that memory region. Thus the buffer data is saved in a temporary buffer and restored after the attachment (a minimal sketch of this remapping is shown after the list below). This causes a serious slowdown once, but speeds things up if the same buffer is used repeatedly. A typical scenario for start/checkpoint/restart looks as follows:

1. Preprocessing of the application source code, substituting any calls to MPI functions with their RES_ substitutes and page-aligning data buffers.
2. Start of the wrapper, which starts the user application process.
3. A SIGUSR1 signal to an application process starts checkpointing.
4. During the next MPI action after receiving the signal, the process that received it sends a specific MPI message to every other process of the application. The message defines the checkpoint at some iteration in the future.
5. At the defined iteration, the processes order their wrappers to leave the MPI world gracefully (call MPI_Finalize()) and exit.
6. The application processes checkpoint themselves (using ckpt, [4]) and exit.
7. Upon restart the wrappers restart the application processes, which continue without noticing any checkpoint/restart.
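The remapping mentioned above can be sketched as follows. This is our illustration of the technique (Linux-specific, error handling omitted, map_buffer_into_shm is a hypothetical name), not the PARCKPT source; it assumes buf is page-aligned, which is what the preprocessing step ensures:

/* One-time remapping: attach a shared memory segment exactly on top of a
 * user communication buffer so that later sends/receives need no copy. */
#define _GNU_SOURCE                 /* for the Linux-specific SHM_REMAP flag */
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

void *map_buffer_into_shm(void *buf, size_t len, int shmid)
{
    /* 1. Preserve the current contents: SHM_REMAP discards whatever
     *    mapping (and hence data) currently occupies this address range. */
    void *backup = malloc(len);
    memcpy(backup, buf, len);

    /* 2. Attach the shared segment at the buffer's (page-aligned) address. */
    void *shared = shmat(shmid, buf, SHM_REMAP);

    /* 3. Restore the saved data into the now-shared region. */
    memcpy(shared, backup, len);
    free(backup);
    return shared;                  /* equals buf on success */
}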

Similarly to PARUG, we distinguish two subversions of the implementation:
1. standard – all processes simply write their data in the local file system,
2. ckptserv – a ckpt server (implemented in ckpt) is used, to which checkpoints are sent over TCP and then saved locally.

4 Experimental Results

4.1 Testbed Environment and Parallel MPI Application

All experiments used a 16-processor cluster (four 4-processor nodes, 512MB RAM each) with Pentium III Xeons and Ethernet switches. On one node (g55) checkpoints were saved locally, while the other nodes (g52-g53) saved to g55 via NFS. We used an SPMD MPI application (LAM MPI 7.0.6; BLCR 0.3.1 for LAM/BLCR) which runs 1000 time steps in which the cells of a 2D domain are updated. The domain is divided equally among the processors. Between iterations, processes exchange boundary cell data. We varied the size of the domain from 32MB to 128MB. The implementation corresponds to parallel applications like electromagnetic modeling or medical simulations ([12]). For PARUG we pack the whole domain data. PARCKPT and LAM/BLCR additionally save communication buffers etc.; in practice, they will save more data than PARUG. We aimed at an assessment of the checkpoint/restart costs for all the methods.

4.2 Proposed User-guided Approach vs Checkpointing with the ckpt Library

Figure 3 presents PARUG's execution times with one checkpoint/restart executed after 500 out of 1000 iterations. Within one node (2 and 4 processors on g55) the parallel data write method was faster, i.e. MPI-2 calls were more efficient than routing data through one process on this node. However, for larger configurations, writes through a master residing on node g55 were faster than even MPI-2 collective calls. The internode NFS throughput appeared to be lower compared to native MPI send/recvs combined with fast disk access from node g55, or to the (measured) rcp internode throughput. Figure 4 shows execution times for the standard and ckptserv versions of PARCKPT, with one checkpoint/restart executed after 500 out of 1000 iterations. The ckptserv version is faster for configurations larger than one node. On one node, standard writes to separate files are faster than routing through one local process.

4.3 Comparison of Parallel MPI Checkpointing Methods

Finally, we compared both PARUG and PARCKPT (combinations of their best subversions) against each other and against LAM/BLCR (Table 1, Figures 5 and 6). Both PARCKPT and LAM/BLCR use sequential checkpointing libraries in a parallel MPI environment. PARUG is the fastest (see the LAM/BLCR note * in Table 1) since it packs/unpacks the least data and, apparently, because of fast collective MPI-2 calls within one SMP node. On larger configurations it uses one designated master on node g55. It is followed by:

Fig. 3: PARUG: Execution Times of the Testbed Application with One Checkpoint/Restart

Fig. 4: PARCKPT: Execution Times of the Testbed Application with One Checkpoint/Restart

– on 2 and 4 processors (one node): LAM/BLCR generates smaller checkpoints than PARCKPT. This can account for LAM/BLCR being faster for smaller sizes. For larger checkpoints LAM/BLCR was slower. [10] and [14] list performance limitations of BLCR for larger checkpoints in the VMADump module used by BLCR. These are ([10]) writing memory pages using separate write() calls and making copies of pages while checkpointing, which can cause memory overuse and swapping. LAM/BLCR drains pending messages from the network while keeping the application working, unlike PARCKPT. Also see note * in Table 1.
– on 8-16 processors: PARCKPT is faster than LAM/BLCR – it sends checkpoints from the processes via TCP to ckptserv on node g55 rather than saving locally via the slow NFS as LAM/BLCR does. LAM/BLCR failed to run any MPI application on more than two nodes (cr_init() failed).

Fig. 5: Comparison of Checkpointing Approaches: Execution Times of the Testbed Application with One Checkpoint/Restart

PARUG – MPI-2 version: checkpoint+restart slow on NFS, fast on an SMP node.
PARUG – designated master: checkpoint+restart fast on NFS, slow on an SMP node.
  Features: flexible, not transparent to the programmer, fast; packs only the necessary data; can restart on a different number of processes than it was checkpointed on; limited programming model (extendable with some programming effort).

PARCKPT – standard local writes: checkpoint+restart slow on NFS, fast on an SMP node.
PARCKPT – ckptserv: checkpoint+restart fast on NFS, slow on an SMP node.
  Features: theoretically (almost) fully transparent, although synchronous operation is required; limited set of MPI functions supported at present; fast; checkpoints larger than PARUG's and LAM/BLCR's; uses Linux-specific memory mappings for high performance.

LAM MPI/BLCR: checkpoint+restart slower than PARUG; faster than PARCKPT for smaller sizes, slower for larger ones.
  Features: fully transparent to the programmer, easy to use; checkpoints smaller than PARCKPT's and only slightly larger than PARUG's. (*) For 1-node runs, checkpoints of the application processes appeared several seconds earlier than the mpirun checkpoint; the application processes kept working until that time. Measuring to the application-process checkpoints yields times very close to (but longer than) PARUG's, although the mpirun checkpoint is required for restart.

Table 1: Comparison of Tested Parallel Checkpointing Methods

We also assessed the overhead of the following components for the testbed application, compared to a standard MPI application without checkpointing: LAM MPI with BLCR – no measurable overhead; ckptserv – no measurable overhead compared to the standard PARCKPT version; PARCKPT – the overhead due to the additional wrappers, shared memory communication and signal synchronization ranged from 2% on 4 processors to 6% on 16 processors for the domain size of 128MB.

5 Summary and Future Work

We have presented two new checkpointing libraries, described their design, and showed that they offer better performance than LAM/BLCR for large checkpoints in a specific (NFS) environment, at the cost of a constrained application model. For the two solutions, we have investigated two subversions each: fast MPI-2 calls vs. a designated master process for PARUG, and local writes vs. ckptserv for PARCKPT. We showed that the latter options are faster on a shared NFS on two or more nodes, while the former are faster on single SMP nodes. NFS optimizations will be investigated as well. We plan to incorporate other checkpointing libraries into the PARCKPT scheme. Currently our PARCKPT supports a limited set of MPI functions, which will be extended. We have also developed a parser for user applications which replaces MPI calls with PARCKPT-specific wrappers.

References

1. Silva, L., Silva, J.: System-level versus user-defined checkpointing. In: Proceedings of the Seventeenth IEEE Symposium on Reliable Distributed Systems (1998) 68–74

Fig. 6: Comparison of Checkpointing Approaches: Checkpoint/Restart Times

2. Czarnul, P.: Programming, Tuning and Automatic Parallelization of Irregular Divide-and-Conquer Applications in DAMPVM/DAC. International Journal of High Performance Computing Applications 17 (2003) 77–93
3. CUMULVS (Collaborative User Migration, User Library for Visualization and Steering). Distributed Computing Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory, http://www.csm.ornl.gov/cs/cumulvs.html
4. Zandy, V.C.: ckpt library. http://www.cs.wisc.edu/∼zandy/ckpt/
5. Condor Team, Computer Sciences Department, University of Wisconsin-Madison: The Condor Project, Condor's Checkpoint Mechanism
6. Plank, J.S., Beck, M., Kingsley, G., Li, K.: libckpt: Transparent Checkpointing Under UNIX. In: Conference Proceedings of the USENIX Winter 1995 Technical Conference (1995)
7. Romanov, S., Malashonok, D.Y., Iskra, K., Gubala, T.: The Dynamite checkpointer 2.0. Faculty of Science, Informatics Institute (2003) http://www.science.uva.nl/research/scs/Software/ckpt/#hector
8. Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. In: Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, USA (2003) 84–94
9. Sankaran, S., Squyres, J., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. Los Alamos Computer Science Institute (LACSI) Symposium (2003)
10. Duell, J., Hargrove, P., Roman, E.: The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Future Technologies Group white paper (2003)
11. Cappello, F., et al.: MPICH-V: MPI implementation for volatile resources. http://www.lri.fr/∼bouteill/MPICH-V
12. Czarnul, P., Grzeda, K.: Parallel Simulations of Electrophysiological Phenomena in Myocardium on Large 32 and 64-bit Linux Clusters. In: 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 19–22, 2004. Proceedings (Volume 3241/2004)
13. Message Passing Interface Forum: MPI-2: Extensions to the Message-Passing Interface. University of Tennessee, Knoxville, Tennessee (1997)
14. Sankaran, S., Squyres, J., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. Los Alamos Computer Science Institute (LACSI) Symposium (2003)