Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver

Md Mohsin Ali*, James Southern†, Peter Strazdins* and Brendan Harding‡
* Research School of Computer Science, The Australian National University, Canberra, ACT 0200, Australia
† Fujitsu Laboratories of Europe, Hayes Park Central, Hayes End Road, Hayes, Middlesex, UB4 8FE, United Kingdom
‡ Mathematical Sciences Institute, The Australian National University, Canberra, ACT 0200, Australia

Abstract—A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research on user level failure recovery, including the implementation and performance evaluation of this prototype, has been carried out. This paper contributes a fault-tolerant implementation of an application solving 2D partial differential equations (PDEs) by means of the sparse grid combination technique, capable of surviving multiple process failures. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size, by re-spawning failed MPI processes on the same physical processors where they ran before the failure (for load balancing). It also involves restoring lost data from either exact checkpointed data on disk, approximated data in memory (via an alternate sparse grid combination technique), or a near-exact copy of replicated data in memory.

The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM implementation, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency, for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 of the baseline in all cases, with the surprising result that the alternate combination technique is more accurate than the near-exact replication method. The implementation details contributed in this paper, including the analysis of the experimental results, will help application developers to resolve design and implementation issues of fault-tolerant applications based on the Open MPI ULFM proposal.

Keywords—fault tolerance; ULFM; process failure recovery; PDE solver; sparse grid combination; approximation error

I. INTRODUCTION

Today's largest High Performance Computing (HPC) systems consist of thousands of nodes that are capable of concurrently executing up to millions of threads to solve complex problems within a short period of time. The nodes within these systems are connected by high-speed network infrastructures to minimize communication costs and maximize reliability [1]. A considerable effort is required to exploit the full performance of these systems. Extracting this performance is essential in research areas such as climate, environment, physics and energy, which are characterized by complex scientific models.

In the near future, besides exploiting the full performance of such large clusters, dealing with failures caused by faults will become a critical issue. Generally, the failure rate of a system is roughly proportional to its number of cores [2]. For instance, a study at Oak Ridge National Laboratory showed that a 100,000-processor supercomputer with all its associated support systems could experience a failure every few minutes [3]. Since the typical size of HPC systems is becoming larger, the rate at which they experience failures is also increasing [4].

The Message Passing Interface (MPI) [5] is a widely used standard for parallel and distributed programming of HPC systems. However, the standard does not include methods to deal with one or more process failures at run-time. Generally, MPI provides two options for handling failures. The first and default option sets the error handler MPI_ERRORS_ARE_FATAL, which immediately aborts the application. The second option uses the error handler MPI_ERRORS_RETURN, which mainly gives an application developer the opportunity to perform some local operations before exiting, but without guaranteeing that any further communication can occur.

Another important challenge in HPC relating to fault tolerance is the lack of practical examples from which to study the range of issues experienced during the development of fault-tolerant applications. Further, there is a discrepancy between the capabilities of current HPC systems and the most widely used parallel programming paradigm (MPI). Although the MPI standard has proved capable of fully exploiting current architectures, it cannot handle the failure of processes. Thus, many researchers have preferred Fault-Tolerant MPI (FT-MPI) [6] as an interface for implementing their applications, due to its capability of handling process failures at run-time. For details of how this can be applied to an application, see [7]. However, FT-MPI did not comply with the MPI standard and so its development was discontinued.

Recently, the MPI Forum's Fault Tolerance Working Group began work on designing and implementing a standard for User Level Failure Mitigation (ULFM) by introducing a new set of tools for creating fault-tolerant applications and libraries. This draft standard (targeted for Open MPI 3.1) allows applications themselves to design their recovery methods and control them from the user level, rather than specifying an automatic form of fault tolerance managed by the operating system or communication library [8, 9]. However, the amount of practical work detailing implementation and performance measurement with this standard is very limited. Some of the work that is available assumes a fail-stop process failure scenario,
i.e., one in which a failed process is permanently stopped without being recovered and the application continues with the remaining processes. One such example can be found in [10]. However, this scenario is not adequate for all applications: some applications cannot compensate by reducing the size of the MPI communicator and thus require recovery of the failed processes in order to finish the remaining computation successfully.

The contributions of this paper are, firstly, to demonstrate how ULFM, as a proposed MPI standard, may be used to create a fault-tolerant Partial Differential Equation (PDE) application, and to evaluate its current effectiveness. Our approach features the preservation of communicator size and rank distribution after faults, the preservation of load balance, and either exact or approximate recovery of the data of failed processes using general PDE techniques. Secondly, we evaluate the effectiveness and accuracy of three data recovery techniques which may be used with the sparse grid combination technique on PDE solvers.

The rest of the paper is organized as follows. Section II describes how a 2D advection PDE solver may be made fault-tolerant using ULFM. Experimental results detailing process failure recovery performance, beta Open MPI 3.1 performance, approximation errors caused by either exact or user level approximate data recovery, and the overall performance of the application are presented and analyzed in Section III. A review of related work is presented in Section IV. Finally, Section V concludes the paper.

II. FAULT-TOLERANT IMPLEMENTATION OF SCALABLE PDE SOLVER

In this section, we describe our PDE solver, which has either sufficient redundancy for approximate data recovery or checkpointed data for exact data recovery after a process failure. We then describe, both in text and in pseudo code, how ULFM can be used to detect and identify process failures and then recover the failed processes and the associated communicator. Finally, we describe how data recovery is carried out. Since ULFM is a new and prominent tool under evaluation for creating fault-tolerant applications, it may become widely used by application developers in the near future. To help reduce their effort, complete MPI code for the extended version of this paper will be made available from http://users.cecs.anu.edu.au/~mohsin/.

A. Scalable PDE solver and the sparse grid combination technique

A realistic parallel application targeted for fault-tolerant implementation is a scalable PDE solver. The PDE solved by this solver is the scalar advection equation in two spatial dimensions. The problem is solved on regular grids using a Lax–Wendroff scheme [11]. Rather than solving once on a single large isotropic grid, it is instead solved on several smaller anisotropic grids called sub-grids, and these solutions are then combined according to the sparse grid combination technique [12, 13]. Let $u_{i,j}$ denote the approximate solution of the PDE on the sub-grid of $(2^i + 1) \times (2^j + 1)$ points. The combination solution $u^s_{n,l}$ can be expressed as

$$u^s_{n,l} \;=\; \sum_{\substack{i+j = 2n-l+1 \\ i,j \le n}} u_{i,j} \;-\; \sum_{\substack{i+j = 2n-l \\ i,j \le n-1}} u_{i,j}, \qquad (1)$$

where $n$ is the full grid size and $l \ge 4$ is the level. An example of this is shown in Fig. 1.

Fig. 1: The sparse grid combination solution $u^s_{5,4}$. Grids with IDs 0, 1, 2, and 3 (4, 5, and 6) form the diagonal (lower diagonal) sub-grids, denoted by the left (right) sum of Equation (1). Grids with IDs 7, 8, 9, and 10 (red filled circles) are duplicates of the sub-grids with IDs 0, 1, 2, and 3, respectively. Grids with IDs 11, 12, and 13 (gray filled circles) are the sub-grids on the two extra layers. Sub-grids 0–6 are used for the Checkpoint/Restart technique, sub-grids 0–10 for the Resampling and Copying technique, and sub-grids 0–6 together with 11–13 for the Alternate Combination technique.

Fig. 2: Technique for recovering the failed processes and assigning them the same ranks as they had on the original communicator before the failure. The steps shown are: a communicator with global size 7; processes 3 and 5 on the parent fail; the communicator is shrunk and the failed processes are spawned as children with ranks 0 and 1; an intercommunicator merge assigns the two highest ranks to the newly created processes on the child part; the failed ranks are sent from the parent to the two highest ranks on the child and the communicator is split with a single color so that ranks 3 and 5 are assigned to the child processes, restoring the pre-failure rank order; finally, the child is changed into a parent.
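To make the index sets of Equation (1) concrete, the following C sketch (our illustration, not part of the solver's code) enumerates the sub-grids of the two sums for a given full grid size n and level l, together with their coefficients and point counts; calling it with n = 5 and l = 4 reproduces the seven combination sub-grids (IDs 0–6) of Fig. 1.

#include <stdio.h>

/* Enumerate the sub-grids appearing in Equation (1) for full grid size n and
 * level l.  Sub-grids in the first (diagonal) sum carry coefficient +1, those
 * in the second (lower diagonal) sum carry coefficient -1. */
static void list_combination_subgrids(int n, int l)
{
    int i, j;

    /* Diagonal sub-grids: i + j = 2n - l + 1, i, j <= n */
    for (i = 0; i <= n; i++) {
        j = 2 * n - l + 1 - i;
        if (j >= 0 && j <= n)
            printf("+1 * u_{%d,%d}  (%d x %d points)\n",
                   i, j, (1 << i) + 1, (1 << j) + 1);
    }

    /* Lower diagonal sub-grids: i + j = 2n - l, i, j <= n - 1 */
    for (i = 0; i <= n - 1; i++) {
        j = 2 * n - l - i;
        if (j >= 0 && j <= n - 1)
            printf("-1 * u_{%d,%d}  (%d x %d points)\n",
                   i, j, (1 << i) + 1, (1 << j) + 1);
    }
}

int main(void)
{
    list_combination_subgrids(5, 4);   /* the example of Fig. 1 */
    return 0;
}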
Function MPI_Comm communicatorReconstruct(MPI_Comm myWorld)
Input: Broken communicator (myWorld).
Output: Reconstructed communicator (reconstructedComm).
 1:  iterCounter ← 0;
 2:  MPI_Comm_create_errhandler(mpiErrorHandler, &newErrHand);
 3:  MPI_Comm_get_parent(&parent);
 4:  do
 6:      failure ← 0;
 7:      returnValue ← MPI_SUCCESS;
 8:      if parent = MPI_COMM_NULL then                                  // Parent
 9:          if iterCounter = 0 then
10:              reconstructedComm ← myWorld;
11:          MPI_Comm_set_errhandler(reconstructedComm, newErrHand);
12:          OMPI_Comm_agree(reconstructedComm, &flag);                  // Synchronize
13:          returnValue ← MPI_Barrier(reconstructedComm);               // To detect failure
14:          if returnValue ≠ MPI_SUCCESS then
15:              tempIntracomm ← repairComm(&reconstructedComm);
16:              failure ← 1;
17:          else
18:              failure ← 0;
19:      else                                                            // Child
20:          MPI_Comm_set_errhandler(parent, newErrHand);
21:          OMPI_Comm_agree(parent, &flag);                             // Synchronize (child part)
22:          MPI_Intercomm_merge(parent, true, &unorderIntracomm);       // Merge intercommunicator (child part)
23:          MPI_Recv(&oldRank, 1, MPI_INT, 0, MERGE_TAG, unorderIntracomm, &mpiStatus);   // Receive rank information from parent
24:          MPI_Comm_split(unorderIntracomm, 0, oldRank, &tempIntracomm);   // Order ranks in the new intracommunicator
25:          returnValue ← MPI_ERR_COMM;
26:          failure ← 1;
27:      if returnValue ≠ MPI_SUCCESS then
28:          reconstructedComm ← tempIntracomm;
29:      if returnValue ≠ MPI_SUCCESS and parent = MPI_COMM_NULL then    // Parent was failed
30:          parent ← reconstructedComm;
31:      if returnValue ≠ MPI_SUCCESS and parent ≠ MPI_COMM_NULL then    // Child was failed
32:          parent ← MPI_COMM_NULL;
33:      iterCounter++;
34:  while (failure);
35:  return reconstructedComm;

Fig. 3: Procedure for reconstructing the communicator broken due to process failures.
Function void mpiErrorHandler(MPI_Comm *comm, int *errorCode, ...)
Input: A communicator (comm).
Output: Error handler of communicator comm.
1:  MPI_Group failedGroup;
2:  OMPI_Comm_failure_ack(*comm);
3:  OMPI_Comm_failure_get_acked(*comm, &failedGroup);
    /* Sometimes a delay of at least 10 milliseconds (with usleep(10000);) is needed here */

Fig. 4: Procedure for handling errors.
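For reference, the error handler of Fig. 4 can be written as compilable C as follows. This is a sketch under the assumption that the ULFM prototype used here declares OMPI_Comm_failure_ack and OMPI_Comm_failure_get_acked (e.g., via mpi-ext.h, depending on the build); the group-size query and the diagnostic message are our additions for illustration.

#include <mpi.h>
#include <mpi-ext.h>   /* OMPI_ ULFM extensions; header location may vary between builds */
#include <stdio.h>
#include <unistd.h>

/* C version of the error handler of Fig. 4: acknowledge the locally observed
 * failures and fetch the group of acknowledged failed processes. */
static void mpiErrorHandler(MPI_Comm *comm, int *errorCode, ...)
{
    MPI_Group failedGroup;
    int nFailed;

    OMPI_Comm_failure_ack(*comm);
    OMPI_Comm_failure_get_acked(*comm, &failedGroup);
    MPI_Group_size(failedGroup, &nFailed);
    fprintf(stderr, "error handler: %d failed process(es) acknowledged\n", nFailed);
    usleep(10000);                      /* delay sometimes needed, as noted in Fig. 4 */
    MPI_Group_free(&failedGroup);
    (void)errorCode;
}

/* Create the handler and attach it to a communicator, as done in Fig. 3. */
static void installErrorHandler(MPI_Comm comm)
{
    MPI_Errhandler newErrHand;
    MPI_Comm_create_errhandler(mpiErrorHandler, &newErrHand);
    MPI_Comm_set_errhandler(comm, newErrHand);
}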
Function MPI_Comm repairComm(MPI_Comm *brokenComm)
Input: Broken communicator (brokenComm).
Output: Part of the repaired communicator (repairedComm).
 1:  SLOTS ← 12;                                     // Suppose the number of slots in each host (node) is 12
 2:  OMPI_Comm_revoke(*brokenComm);                  // Revoke the communicator
 3:  OMPI_Comm_shrink(*brokenComm, &shrinkedComm);   // Shrink the communicator
 4:  (failedRanks, totalFailed) ← failedProcsList(brokenComm);
 5:  for i ← 0; i < totalFailed; i++ do
 6:      hostfileLineIndex ← failedRanks[i] / SLOTS;
 7:      read the hostfileLineIndex-th entry (counting from 0) of the hostfile, without its slots information, into hostNameToLaunch[i];
 8:      appLaunch[i] ← "./ApplicationName";
 9:      argvLaunch[i] ← argv;
10:      procsLaunch[i] ← 1;
11:      MPI_Info_create(&hostInfoLaunch[i]);         // Host information about where to spawn the process
12:      MPI_Info_set(hostInfoLaunch[i], "host", hostNameToLaunch[i]);
13:  MPI_Comm_spawn_multiple(totalFailed, appLaunch, argvLaunch, procsLaunch, hostInfoLaunch, 0, shrinkedComm, &tempIntercomm, MPI_ERRCODES_IGNORE);   // Spawn new processes on the same hosts that experienced the process failures
14:  MPI_Intercomm_merge(tempIntercomm, false, &unorderIntracomm);   // Merge intercommunicator (parent part)
15:  OMPI_Comm_agree(tempIntercomm, &flag);            // Synchronize (parent part)
16:  MPI_Comm_size(shrinkedComm, &shrinkedGroupSize);
17:  for i ← 0; i < totalFailed; i++ do
18:      child[i] ← shrinkedGroupSize + i;
19:  MPI_Comm_rank(unorderIntracomm, &newRank);
20:  MPI_Comm_size(unorderIntracomm, &totalProcs);
21:  if newRank = 0 then                               // Send rank information to the children
22:      for i ← 0; i < totalFailed; i++ do
23:          MPI_Send(&failedRanks[i], 1, MPI_INT, child[i], MERGE_TAG, unorderIntracomm);
24:  rankKey ← selectRankKey(newRank, shrinkedGroupSize, failedRanks, totalProcs);   // Order ranks in the new intracommunicator
25:  MPI_Comm_split(unorderIntracomm, 0, rankKey, &repairedComm);
26:  return repairedComm;

Fig. 5: Procedure used as a part of repairing the broken communicator.
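The host selection of lines 5–13 of Fig. 5 can be sketched in C as below. The hostfile name and its "hostname slots=N" line format are assumptions made only for illustration; the paper merely states that the entry at index failedRank/SLOTS is read without its slots information and passed to the spawn call through an MPI_Info object.

#include <mpi.h>
#include <stdio.h>

#define SLOTS 12   /* slots (cores) per host, as assumed in Fig. 5 */

/* Read the host name on line 'index' (0-based) of a plain hostfile whose
 * lines look like "hostname slots=12".  'host' must have room for 256 chars. */
static int hostForLine(const char *hostfile, int index, char *host)
{
    char line[256];
    FILE *fp = fopen(hostfile, "r");
    if (fp == NULL)
        return -1;
    while (index >= 0 && fgets(line, sizeof line, fp) != NULL)
        index--;
    fclose(fp);
    if (index >= 0)
        return -1;                     /* not enough lines in the hostfile */
    sscanf(line, "%255s", host);       /* keep the host name only */
    return 0;
}

/* Re-spawn one failed rank on the host it occupied before the failure. */
static void respawnOnSameHost(MPI_Comm shrunkComm, int failedRank,
                              char *appName, char **argvLaunch,
                              MPI_Comm *interComm)
{
    char host[256];
    MPI_Info info;
    int procs = 1;

    if (hostForLine("hostfile", failedRank / SLOTS, host) != 0)
        return;                        /* hostfile entry not found */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);  /* pin the spawn to that host */
    MPI_Comm_spawn_multiple(1, &appName, &argvLaunch, &procs, &info,
                            0, shrunkComm, interComm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
}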
The computation of the solutions on the different sub-grids is embarrassingly parallel, and each sub-grid is assigned to a different process group. Each process group then uses a domain decomposition for the parallel solution of the sub-grid assigned to it. The number of unknowns (grid size) on the lower diagonal sub-grids is half that of the others. As we use a fixed simulation timestep (Δt) across all grids for stability purposes, our load balancing strategy is to use half as many processes on these grids as on the others. The solutions are combined in parallel using a gather–scatter approach. This is all handled by setting up process grids with corresponding process maps, which govern the communication between different sub-grids and domains.

B. Detection and identification of the failed processes

The first and most important step in making any application fault-tolerant is to detect whether any component failures have occurred and, if so, to build a list of the failed components. By component failure we mean, as discussed above, process failure caused by any type of fault. Process failures in ULFM [9] are reported via the return codes of MPI communication routines. While a point-to-point communication routine returns with success or (eventually) reports the failure of the partner process, the result of a collective communication might be non-uniform, i.e., some processes may report a process failure while others complete the operation successfully. As a result, the failure information must be propagated explicitly, at the expense of some overhead, in order to build a globally consistent list of failed processes. Although it is sometimes sufficient, and more efficient, to create only a locally consistent list of failed processes, this is not possible in all cases. One solution for creating a consistent list is to call MPI_Barrier (line 13 of Fig. 3) on the parent communicator, with an error handler attached to the communicator that acknowledges and manages the local failures using OMPI_Comm_failure_ack and OMPI_Comm_failure_get_acked (lines 2 and 3 of Fig. 4). If this collective operation fails, the communicator is revoked with OMPI_Comm_revoke and shrunk with OMPI_Comm_shrink (lines 2 and 3 of Fig. 5), the old and new groups are compared with MPI_Group_compare, their difference is computed with MPI_Group_difference, and the ranks are translated between the groups with MPI_Group_translate_ranks (lines 4, 6, and 11 of Fig. 6) to create a globally consistent list of failed processes.
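The detection path just described can be condensed into a short C sketch: the local outcome of a collective is converted into a flag on which all surviving processes agree before any repair is attempted. The function below is illustrative only; repairComm stands for the procedure of Fig. 5, and an error handler that returns rather than aborts is assumed to be installed on the communicator.

#include <mpi.h>
#include <mpi-ext.h>   /* OMPI_ ULFM extensions; header location may vary between builds */

MPI_Comm repairComm(MPI_Comm *brokenComm);   /* the procedure of Fig. 5 */

static MPI_Comm detectAndRepair(MPI_Comm comm)
{
    int rc = MPI_Barrier(comm);          /* a failure shows up in the return code */
    int flag = (rc == MPI_SUCCESS);

    OMPI_Comm_agree(comm, &flag);        /* agree on one global outcome (logical AND) */
    if (!flag)
        return repairComm(&comm);        /* revoke, shrink, re-spawn, reorder */
    return comm;
}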
Function int* failedProcsList(MPI_Comm *brokenComm)
Input: Broken communicator (brokenComm).
Output: List of failed processes (failedRanks) and number of failed processes (totalFailed).
 1:  MPI_Comm_group(*brokenComm, &oldGroup);
 2:  MPI_Comm_group(shrinkedComm, &shrinkGroup);
 3:  MPI_Comm_size(*brokenComm, &oldSize);
 4:  MPI_Group_compare(oldGroup, shrinkGroup, &result);
 5:  if result ≠ MPI_IDENT then
 6:      MPI_Group_difference(oldGroup, shrinkGroup, &failedGroup);
 7:      MPI_Comm_rank(*brokenComm, &oldRank);
 8:      MPI_Group_size(failedGroup, &totalFailed);
 9:      for i ← 0; i < oldSize; i++ do
10:          tempRanks[i] ← i;
11:      MPI_Group_translate_ranks(failedGroup, totalFailed, tempRanks, oldGroup, failedRanks);
12:  return failedRanks and totalFailed;

Fig. 6: Procedure for creating the list and number of failed processes.
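A compilable C variant of Fig. 6 is given below; it differs from the pseudo code only in that the temporary rank array holds exactly totalFailed entries and in the explicit memory management, which are our choices for the sketch.

#include <mpi.h>
#include <stdlib.h>

/* Build the list of failed ranks by comparing the group of the broken
 * communicator with the group of its shrunken copy. */
static int *listFailedRanks(MPI_Comm brokenComm, MPI_Comm shrunkComm,
                            int *totalFailed)
{
    MPI_Group oldGroup, shrinkGroup, failedGroup;
    int result, *failedRanks = NULL;

    MPI_Comm_group(brokenComm, &oldGroup);
    MPI_Comm_group(shrunkComm, &shrinkGroup);

    MPI_Group_compare(oldGroup, shrinkGroup, &result);
    *totalFailed = 0;
    if (result != MPI_IDENT) {
        MPI_Group_difference(oldGroup, shrinkGroup, &failedGroup);
        MPI_Group_size(failedGroup, totalFailed);

        /* translate ranks 0..totalFailed-1 of failedGroup into ranks of the
         * original (broken) communicator's group */
        int *tempRanks = malloc(*totalFailed * sizeof *tempRanks);
        failedRanks = malloc(*totalFailed * sizeof *failedRanks);
        for (int i = 0; i < *totalFailed; i++)
            tempRanks[i] = i;
        MPI_Group_translate_ranks(failedGroup, *totalFailed, tempRanks,
                                  oldGroup, failedRanks);
        free(tempRanks);
        MPI_Group_free(&failedGroup);
    }
    MPI_Group_free(&oldGroup);
    MPI_Group_free(&shrinkGroup);
    return failedRanks;
}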
Function int selectRankKey(int mpiRank, int shrinkedGroupSize, int *failedRanks, int totalProcs)
Input: MPI rank (mpiRank), shrunken communicator size (shrinkedGroupSize), list of failed processes (failedRanks), and MPI global communicator size (totalProcs).
Output: Key value of a rank (key), to be used for splitting the communicator to order the ranks.
1:  j ← 0;
2:  for i ← 0; i < totalProcs; i++ do
3:      if i ∉ failedRanks then
4:          shrinkMergeList[j] ← i;
5:          j++;
6:  for i ← 0; i < shrinkedGroupSize; i++ do
7:      if mpiRank = i then
8:          key ← shrinkMergeList[i];
9:  return key;

Fig. 7: Procedure for selecting the key of each rank, used for splitting the communicator to order the ranks.
C. Reconstruction of the faulty communicator and recovery of the failed processes

The next step, after detecting the process failures and creating the list of failed processes, is to reconstruct the faulty communicator(s). This reconstruction is done by re-spawning the failed MPI processes with MPI_Comm_spawn_multiple (line 13 of Fig. 5). The hosts for the new processes are determined from the ranks of the failed processes and the slots information of the hosts. An MPI_Info object is created with this host information and passed to the spawn call so that the new processes are created on the same hosts as the failed processes (lines 5–12 of Fig. 5). The newly created processes are referred to as child processes and the rest as parents. The child and parent processes communicate through their intercommunicators. Attaching the children to the parents is accomplished by merging these intercommunicators with MPI_Intercomm_merge (line 22 of Fig. 3 and line 14 of Fig. 5). The ranks of the child processes on the merged (reconstructed) communicator should be the same as they were on the original communicator (before the failure), so that the application's original communication pattern is not disrupted. This ordering of ranks is achieved with MPI_Comm_split (line 24 of Fig. 3 and line 25 of Fig. 5), with a proper selection of the key argument; how the keys are selected is shown in Fig. 7. Fig. 2 illustrates this technique of reconstructing a faulty communicator, including the ordering of the child processes on it. Finally, the identity of the child communicator is converted into a parent communicator by assigning MPI_COMM_NULL to it (line 32 of Fig. 3). In this way the reconstructed communicator(s) become ready to use within the application.
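The key-based reordering described above can be isolated in a few lines of C: after the intercommunicator merge, a split with a single color and with each process's original rank as the key restores the pre-failure rank order. The function name and arguments below are illustrative.

#include <mpi.h>

/* After MPI_Intercomm_merge the re-spawned processes hold the highest ranks.
 * Splitting the merged intracommunicator with one color and with each
 * process's ORIGINAL rank as the key restores the pre-failure rank order,
 * which is the trick used in Figs. 3 and 5. */
static MPI_Comm restoreRankOrder(MPI_Comm mergedComm, int originalRank)
{
    MPI_Comm orderedComm;
    /* color 0: keep everyone in one communicator; key: the old rank */
    MPI_Comm_split(mergedComm, 0, originalRank, &orderedComm);
    return orderedComm;
}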
D. Data recovery for the failed processes

The final step, after the reconstruction of the faulty communicator(s), is to recover the data for the re-spawned failed processes. This is usually done by recovering the data for the whole sub-grid that experiences one or more process failures. Recovering data only for the failed processes of a sub-grid is not sufficient, as the data of the surviving processes on a communicator may have been updated locally by the solver before the failure is detected.
We have implemented and analyzed three techniques for recovering the lost data of a reconstructed failed grid: Checkpoint/Restart, Resampling and Copying, and Alternate Combination.

Checkpoint/Restart [14] is an exact data recovery technique which involves taking periodic checkpoints to disk while the computation on each sub-grid is in progress. In the event of a process failure, the solver restarts from the most recent checkpointed data and recomputes the timesteps by which that checkpoint lags the point at which the process failure check is performed. The optimal number of checkpoints, C, for the application is calculated as

$$C = T / T_{I/O}, \qquad (2)$$

where $T$ is the Mean Time Between Failures (MTBF) (which is half the application run time in our setup) and $T_{I/O}$ is the time for a single checkpoint write to disk by a process.

Resampling and Copying combines exact and approximate data recovery. It creates a redundant copy of each diagonal sub-grid so that the lost data of a diagonal sub-grid can be recovered exactly by copying from its duplicate (and vice versa). As each lower diagonal sub-grid is a subset of the (finer) diagonal sub-grid above it, a resampling of that diagonal grid is used to recover the lost data of the lower diagonal sub-grid below it. The notion of diagonal, lower diagonal, and duplicate/redundant sub-grids is illustrated in Fig. 1.
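As an illustration of how Equation (2) might drive the checkpoint schedule, the sketch below spreads the C checkpoints evenly over the timesteps; the even spacing and the writeCheckpoint stand-in are our assumptions, since the paper does not state how the checkpoints are distributed over the run.

#include <stdio.h>

/* Checkpoint schedule following Equation (2): C = T / T_IO checkpoints, where
 * T is the MTBF and T_IO the time of one checkpoint write.  writeCheckpoint()
 * is a stand-in for the solver's own per-sub-grid checkpoint routine. */
static void writeCheckpoint(int step) { printf("checkpoint at step %d\n", step); }

static void runWithCheckpoints(int totalSteps, double mtbf, double t_io)
{
    int nCheckpoints = (int)(mtbf / t_io);              /* Equation (2) */
    if (nCheckpoints < 1)
        nCheckpoints = 1;
    int interval = totalSteps / nCheckpoints;
    if (interval < 1)
        interval = 1;

    for (int step = 1; step <= totalSteps; step++) {
        /* ... advance the sub-grid solution by one timestep ... */
        if (step % interval == 0)
            writeCheckpoint(step);
    }
}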
TABLE I: Beta Open MPI 3.1 performance measurement on the OPL cluster when two processes have failed (wall time in seconds).

# cores | OMPI_Comm_shrink | OMPI_Comm_agree | MPI_Comm_spawn_multiple | MPI_Intercomm_merge
     19 |             0.01 |            0.01 |                    0.49 |                0.01
     38 |             4.19 |            2.46 |                    0.51 |                0.01
     76 |            60.75 |           43.35 |                    1.03 |                0.02
    152 |            86.45 |           50.80 |                    2.36 |                0.02
    304 |           112.61 |           55.57 |                   12.83 |                0.03
According to that grid arrangement, exact data recovery of sub-grid 0 is done from sub-grid 7, of 7 from 0, of 1 from 8, of 8 from 1, of 2 from 9, of 9 from 2, of 3 from 10, and of 10 from 3. Approximate data recovery of sub-grid 4 is done by resampling data from sub-grid 1, of 5 from 2, and of 6 from 3.

Alternate Combination [15] recovers the lost grid data approximately. It employs some extra layers of sub-grids below the lower diagonal sub-grids; see Fig. 1 for the notion of the sub-grids on the extra layers (two extra layers are used in our implementation). In the presence of single or multiple failures, all the surviving sub-grids, including those on the extra layers, are assigned new combination coefficients, and a sample of the resulting combined solution is then used as the recovered data for each sub-grid whose data was lost. Note that, unlike the Checkpoint/Restart and Resampling and Copying techniques, data recovery with this technique is only possible once the combination of the sub-grid solutions has been carried out.
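The resampling step used by the Resampling and Copying technique amounts to an injection from a finer sub-grid onto a coarser one whose points are a subset of it, as described above. The sketch below assumes row-major storage of a local 2D array; in the actual solver the sub-grids are distributed over process grids, so such a copy would be combined with the gather/scatter communication mentioned earlier.

/* Approximate recovery of a coarser sub-grid from a finer one by injection:
 * the points of the (2^ic+1) x (2^jc+1) coarse grid are a subset of those of
 * the (2^if+1) x (2^jf+1) fine grid whenever ic <= if and jc <= jf, so the
 * coarse values can be copied with strides 2^(if-ic) and 2^(jf-jc). */
static void resampleFromFiner(const double *fine, int i_f, int j_f,
                              double *coarse, int i_c, int j_c)
{
    int strideX = 1 << (i_f - i_c);
    int strideY = 1 << (j_f - j_c);
    int nxC = (1 << i_c) + 1, nyC = (1 << j_c) + 1;
    int nyF = (1 << j_f) + 1;

    for (int x = 0; x < nxC; x++)
        for (int y = 0; y < nyC; y++)
            coarse[x * nyC + y] = fine[(x * strideX) * nyF + (y * strideY)];
}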
III. EXPERIMENTAL RESULTS
Experiments were conducted on two systems. The first was the 432-core OPL cluster, located at Fujitsu Laboratories of Europe (FLE) and consisting of 36 dual-socket nodes, each with two 6-core Intel Xeon X5670 CPUs (2.93 GHz), 24 GB of RAM, and InfiniBand QDR connection blades in each chassis. The second system, Raijin, is located at the Australian National University and has a total of 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) distributed across 3,592 compute nodes, approximately 160 TBytes of main memory, an InfiniBand FDR interconnect, and approximately 10 PBytes of usable fast filesystem [16]. We used git revision icldistcomp-ulfm-3bc561b48416 of ULFM, under the development branch 1.7ft of Open MPI, for our implementation.

Faults are injected into the application using a failure generator which aborts one or more randomly chosen MPI processes with the system call kill(getpid(), SIGKILL) at some point before the combination of the sub-grid solutions. For the Resampling and Copying technique, the aborted processes can be on any of the sub-grids, under the constraint that failures must not occur at the same time on both sub-grids of any pair involved in the communication of the recovery process. For example, process failures should not occur simultaneously on sub-grids 3 and 6, 2 and 5, 1 and 4, 0 and 7, 1 and 8, 2 and 9, or 3 and 10, according to the grid arrangement shown in Fig. 1. There are no such constraints for the Checkpoint/Restart and Alternate Combination techniques. For all three techniques there is a common constraint that process 0 must not fail, as it is used for control purposes.
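A failure generator of the kind described above can be sketched as follows. The way the victim ranks are chosen here (seeded rand() on rank 0, broadcast to all processes) is an assumption for illustration; the paper states only that one or more random MPI processes are killed with kill(getpid(), SIGKILL) and that process 0 is never a victim.

#include <mpi.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Abort numVictims randomly chosen processes of comm with SIGKILL. */
static void maybeInjectFailure(MPI_Comm comm, int numVictims)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *victims = malloc(numVictims * sizeof *victims);
    if (rank == 0) {
        srand(12345);                               /* fixed seed for repeatability */
        for (int v = 0; v < numVictims; v++)
            victims[v] = 1 + rand() % (size - 1);   /* never rank 0 */
    }
    MPI_Bcast(victims, numVictims, MPI_INT, 0, comm);

    for (int v = 0; v < numVictims; v++)
        if (rank == victims[v])
            kill(getpid(), SIGKILL);                /* simulate a hard crash */

    free(victims);
}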
In our experiments, the 2D advection solver is run for $2^{13}$ timesteps, at which point failure detection is tested and, if needed, the recovery process is initiated for the Resampling and Copying and Alternate Combination techniques. For the Checkpoint/Restart technique, failure detection is instead tested prior to initiating each checkpoint write to disk and, if required, the solver restarts from the most recent checkpoint data rather than writing to disk. When the solver completes its execution for all timesteps, the sub-grids are combined and the overall solution is tested for accuracy.

The wall times (measured with MPI_Wtime) and efficiencies shown in the graphs of this section are averages over 5 experiments, while each average error measurement is an average over 20 experiments. The results in Figs. 8 and 11 and Table I involve real process failures; those in Figs. 9 and 10 use simulated failures, i.e., it is only assumed that one or more processes of a grid have failed.

A. Failure identification and communicator reconstruction times

Fig. 8 shows the performance of fault-tolerant Open MPI, split into two categories: creating the list of failed processes and reconstructing the faulty communicator. The wall times for failed-list creation and communicator reconstruction are the times taken to execute the pseudo code shown in Figs. 6 and 3, respectively. The MCA parameter coll_ftbasic_method is set to 2 (the default) for the single process failure experiments and to 3 for the double process failure experiments.

Fig. 8: Time for generating the failure information and repairing the faulty communicator of the application on the OPL cluster with level l = 4 and full grid size n = 13. Times for other levels show the same characteristics. (a) creating a list of failed processes; (b) reconstructing a faulty communicator.

The experimental results show that the wall times for creating the list of failed processes and for reconstructing the faulty communicator increase as the number of cores increases. This trend can be explained as follows: creating the failed-process list and reconstructing the communicator involve collective operations, and the more cores are involved in these collective operations, the more time the two operations take.

However, the wall times for creating the list of failed processes and for reconstructing the communicator in the case of two (or more) process failures are unsatisfactory: they take far longer than in the single-failure case, whereas in principle these two times should be roughly the same irrespective of the number of process failures. It would therefore be worth investigating the implementation of the functions of the beta version of fault-tolerant Open MPI relating to failure detection, failed-list creation, and communicator reconstruction for more than one process failure. We observe that OMPI_Comm_shrink and OMPI_Comm_agree are the two main contributors to this unsatisfactory behavior when two processes fail; see Table I for detailed measurements.

Fig. 9: Failed grid data recovery overhead of the application with level l = 4 and full grid size n = 13. The number of processes on each diagonal (including duplicate), lower diagonal, upper extra layer, and lower extra layer sub-grid is 8, 4, 2, and 1, respectively. RC, AC, and CR stand for the Resampling and Copying, Alternate Combination, and Checkpoint/Restart techniques, respectively. (a) data recovery overhead; (b) process-time data recovery overhead.

B. Failed grid data recovery overhead

Fig. 9a shows the data recovery overheads of the failed grids for the Checkpoint/Restart, Alternate Combination, and Resampling and Copying techniques when the data of one or more grids is lost. The recovery overhead of the first technique includes the time for writing all checkpoints to disk, reading the most recent checkpoint, and performing the necessary recomputation. For the second technique, only the time needed to compute the new combination coefficients is counted as recovery overhead, since the grid data itself is recovered as part of the (compulsory) combination stage later. The recovery overhead of the third technique includes the time for copying and/or resampling data from finer grids.

For these experiments, we simulated the failure of up to 5 processes in order to determine the effect of the number of lost grids on the data recovery time; the results therefore do not include the faulty communicator reconstruction time. We observe that, in all cases, the data recovery time is almost independent of the number of lost grids.

Fig. 9a shows that the Checkpoint/Restart technique has the highest overhead, Alternate Combination the lowest, and Resampling and Copying lies between the two. Since comparing the data recovery overhead of the Alternate Combination technique with that of the other techniques in Fig. 9a may not be fair, because it excludes the data recovery time itself and because different numbers of processes work on the three techniques, we also consider process-time data recovery overheads. These are calculated as

$$T'_{rec,c} = C \times T_{I/O} + T_{rec,c},$$
$$T'_{rec,r} = (T_{rec,r} P_r + T_{app,r}(P_r - P_c))/P_c,$$
$$T'_{rec,a} = (T_{rec,a} P_a + T_{app,a}(P_a - P_c))/P_c,$$

where
• $T'_{rec,c}$, $T'_{rec,r}$, and $T'_{rec,a}$ are the normalized total process-time overheads of the Checkpoint/Restart, Resampling and Copying, and Alternate Combination techniques, respectively (normalized with respect to the number of processes used in Checkpoint/Restart),
• $C$ is the optimal number of checkpoints of the application (see Equation (2)),
• $T_{I/O}$ is the time for a single checkpoint write to disk by a process,
• $T_{rec,c}$ is the recovery time of the Checkpoint/Restart technique (time for reading the checkpoint file and performing recomputation),
• $T_{rec,r}$ is the recovery time of the Resampling and Copying technique (time for copying and/or resampling grid data),
• $T_{rec,a}$ is the recovery time of the Alternate Combination technique (time for calculating the combination coefficients),
• $T_{app,r}$ and $T_{app,a}$ are the total application times (excluding communicator reconstruction time) of the Resampling and Copying and Alternate Combination techniques, respectively, and
• $P_c$, $P_r$, and $P_a$ are the total numbers of processes used for the Checkpoint/Restart, Resampling and Copying, and Alternate Combination techniques, respectively.

Based on these calculations, the measured process-time data recovery overheads presented in Fig. 9b show that, on the OPL cluster, the Checkpoint/Restart technique has the largest process-time overhead, Alternate Combination the smallest, and Resampling and Copying lies between the two. On the Raijin cluster, however, Checkpoint/Restart has the least overhead. This is due to Raijin's remarkably low disk write latency (resulting in $T_{I/O}$ = 0.03 s), whereas the OPL cluster has a more typical latency ($T_{I/O}$ = 3.52 s).

Fig. 10: Average approximation errors of the combined solution of the application on the OPL cluster with level l = 4 and full grid size n = 13. The number of processes on each diagonal (including duplicate), lower diagonal, upper extra layer, and lower extra layer sub-grid is 8, 4, 2, and 1, respectively.

C. Approximation error

The accuracy of the combined solution of the application, with and without failures, is shown in Fig. 10. The error is the average of the $l_1$-norm of the difference between the combined grid solution and the exact analytical solution (which, for advection, can be calculated from the initial conditions).

As expected, Fig. 10 shows that the Checkpoint/Restart technique has an error that is independent of the number of grids lost, since it recovers data exactly; its error simply reflects that of an advection solver using the sparse grid combination technique at the given grid resolutions. The average approximation errors of the other two techniques grow as the number of lost grids increases, and they are always larger for the Resampling and Copying technique. This surprising result indicates that resampling a lost low-resolution grid from a higher-resolution grid is actually less accurate than the Alternate Combination technique, which utilizes data from a lower-resolution grid for each grid that is lost.

D. Scalability

The overall parallel performance of the application is shown in Fig. 11. Fig. 11a shows that in all cases (zero, one or two failures) the Checkpoint/Restart technique is the most costly, followed by Resampling and Copying, with Alternate Combination the least costly, regardless of the number of cores used. Similarly, Fig. 11b shows that the Alternate Combination and Resampling and Copying techniques are more scalable than the Checkpoint/Restart technique (with Resampling and Copying slightly less scalable than Alternate Combination). For example, with no process failures, both the Alternate Combination and Resampling and Copying techniques achieve more than 80% parallel efficiency. However, these results vary greatly for two (or more) process failures compared with the zero- or one-failure cases, due to the unstable nature of the beta version of fault-tolerant Open MPI.
Fig. 11: Overall parallel performance of the application on the OPL cluster with level l = 4 and full grid size n = 13. Data for other levels show the same characteristics. RC, AC, and CR stand for the Resampling and Copying, Alternate Combination, and Checkpoint/Restart techniques; the number following RC, AC, or CR is the total number of failed processes. (a) overall execution time; (b) overall parallel efficiency.

These performance characteristics may be explained by noting that the Alternate Combination technique involves only a relatively small amount of extra computation, whereas Resampling and Copying has a significant degree of replicated computation and Checkpoint/Restart performs many disk I/O operations (as well as some recomputation).
IV. RELATED WORK
The amount of work closely related to this research is limited. The research on which this work builds is described in [17]. The author contributes a technique for replacing a single failed process on a communicator and repairing the data in the matrix for a QR factorization problem, together with an analysis of the execution time and overhead on a fixed number of processes in the presence of a single process failure. However, a detailed analysis of the recovery performance is still missing for multiple process failures, as well as for a varying number of processes and for other realistic parallel applications. Moreover, ready-to-use implementation details are not provided.

Algorithm-Based Fault Tolerance (ABFT) techniques for constructing robust PDE solvers based on the modified sparse grid combination technique are proposed in [15, 18]. Although the proposed solver can accommodate the loss of one or two sub-grids, either by deriving new combination coefficients to avoid a faulty solution or by approximating a faulty solution by projecting the solution from a finer sub-grid, it handles only simulated process failures. A simulated process failure is not the same as an actual MPI process failure: it assumes that the failure of a process in the application is followed by a recovery action, but does not actually implement the recovery.

A fault-tolerant implementation of the multi-level Monte Carlo method that relies neither on Checkpoint/Restart nor on recomputation of samples is proposed in [19]; it uses the ULFM standard to deal with actual process failures. It incorporates all samples unaffected by failures into the computation of the final result and simply ignores samples affected by failures [20]. The samples are computed in parallel on each level and periodically sent to all processes of the level communicator to be added to their local sums, so that intermediate results are not lost if some of the processes on that level fail. This periodic communication may be costly if the processes working on a level are distributed across multiple nodes; experimental results for multiple nodes are not presented. Moreover, reconstruction of the faulty communicator is not considered, nor is data recovery used.
V. CONCLUSIONS
A fault-tolerant implementation of a realistic application (a 2D PDE solver) capable of surviving multiple process failures is described in this paper. The implementation uses a beta version of Open MPI that includes an implementation of a draft process fault tolerance specification under consideration for inclusion in Open MPI 3.1. This version of Open MPI allows the recovery methods of the application to be designed and controlled at the user level. It is clear from the experimental results that it is possible to implement a fault-tolerant application capable of surviving multiple real process failures by using fault-tolerant Open MPI. However, the time required for gathering the failed process information and reconstructing the faulty communicator for multiple process failures is unsatisfactory in this beta version. Our analysis of the various overheads may guide future fault-tolerant MPI implementations; our methods for process failure detection and recovery may be useful for developers of fault-tolerant applications.

In this paper, we have looked at three data recovery techniques which may be used in conjunction with the sparse grid combination technique. The data recovery time of checkpointing on a cluster with typical disk write latency is of a similar order; however, it is vastly smaller for the other two techniques. In order to compare the data recovery times of the three techniques, it is also necessary to take into account the extra number of processes used by the latter two. When this is done, the Alternate Combination technique is nearly an order of magnitude better than the others, except on a cluster with an ultra-low write latency (two orders of magnitude lower than on typical clusters), which gives checkpointing a clear advantage. In all cases, the data recovery time is almost independent of the number of grids lost, a surprising result for the non-checkpointing techniques.

The Alternate Combination technique proved to be more accurate than the 'near-exact' technique of replication and resampling; for a single failure, it introduces on average only a few percent error. Up to the loss of 5 out of 10 grids, both methods had errors within a factor of 10 of the baseline, indicating that both are robust.

In future work, we plan to consider the use of spare nodes in the case of node failure, in which case all the processes on the failed node will be restarted on the new node. This will have the same load balancing characteristics as our current approach of restarting the failed processes on the same node. We also plan to investigate how more advanced sparse grid combination techniques may be used for more efficient data recovery.
VI. ACKNOWLEDGMENTS
This research was supported under the Australian Research Council's Linkage Projects funding scheme (project number LP110200410). We are grateful to Fujitsu Laboratories of Europe for providing funding as the collaborative partner in this project and for the use of the OPL cluster. We thank the NCI National Facility for the use of the Raijin cluster.

REFERENCES

[1] Y. Ajima, S. Sumimoto, and T. Shimizu, "Tofu: A 6D mesh/torus interconnect for exascale computers," Computer, vol. 42, no. 11, pp. 36–40, November 2009.
[2] B. Schroeder and G. A. Gibson, "A large-scale study of failures in high-performance computing systems," in Proc. International Conference on Dependable Systems and Networks (DSN '06). Washington, DC, USA: IEEE Computer Society, 2006, pp. 249–258.
[3] A. Geist and C. Engelmann, "Development of naturally fault tolerant algorithms for computing on 100,000 processors," 2002. [Online]. Available: http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf
[4] G. Gibson, B. Schroeder, and J. Digney, "Failure tolerance in petascale computers," Software Enabling Technologies for Petascale Science, vol. 3, no. 4, pp. 4–10, November 2007.
[5] Message Passing Interface Forum, "MPI: A message passing interface," in Proc. Supercomputing. IEEE Computer Society Press, November 1993, pp. 878–883.
[6] G. E. Fagg and J. J. Dongarra, "FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world," 2000.
[7] M. M. Ali and P. Strazdins, "Algorithm-based master-worker model of fault tolerance in time-evolving applications," in PESARO 2013, The Third International Conference on Performance, Safety and Robustness in Complex Systems and Applications, 2013, pp. 40–47.
[8] Fault Tolerance Working Group, "Run-through stabilization interfaces and semantics." [Online]. Available: svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
[9] W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. J. Dongarra, "An evaluation of user-level failure mitigation support in MPI," in Recent Advances in the Message Passing Interface, ser. Lecture Notes in Computer Science, vol. 7490. Springer Berlin Heidelberg, 2012, pp. 193–203. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-33518-1_24
[10] J. Hursey and R. Graham, "Building a fault tolerant MPI application: A ring communication example," in IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), May 2011, pp. 1549–1556.
[11] P. Lax and B. Wendroff, "Systems of conservation laws," Communications on Pure and Applied Mathematics, vol. 13, no. 2, pp. 217–237, 1960. [Online]. Available: http://dx.doi.org/10.1002/cpa.3160130205
[12] M. Griebel, M. Schneider, and C. Zenger, "A combination technique for the solution of sparse grid problems," in Iterative Methods in Linear Algebra, P. de Groen and R. Beauwens, Eds. IMACS, Elsevier, North Holland, 1992, pp. 263–281; also as SFB Bericht 342/19/90 A, Institut für Informatik, TU München, 1990.
[13] H.-J. Bungartz and M. Griebel, "Sparse grids," Acta Numerica, vol. 13, pp. 147–269, 2004.
[14] J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, March 2007. [Online]. Available: http://tinyurl.com/yhxnowd
[15] B. Harding and M. Hegland, "A robust combination technique," ANZIAM Journal, vol. 54, 2013. [Online]. Available: http://journal.austms.org.au/ojs/index.php/ANZIAMJ/article/view/6321
[16] "NCI: National Computational Infrastructure." [Online]. Available: http://nci.org.au/~raijin/
[17] W. B. Bland, "Toward message passing failure management," Ph.D. dissertation, University of Tennessee, 2013.
[18] J. Larson, M. Hegland, B. Harding, S. Roberts, L. Stals, A. Rendell, P. Strazdins, M. Ali, C. Kowitz, R. Nobes, J. Southern, N. Wilson, M. Li, and Y. Oishi, "Fault-tolerant grid-based solvers: Combining concepts from sparse grids and MapReduce," Procedia Computer Science, vol. 18, pp. 130–139, 2013 (2013 International Conference on Computational Science). [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050913003190
[19] S. Pauli, M. Kohler, and P. Arbenz, "A fault tolerant implementation of multi-level Monte Carlo methods," Department of Computer Science, ETH Zürich, Tech. Rep., 2013.
[20] S. Pauli, P. Arbenz, and C. Schwab, "Intrinsic fault tolerance of multi level Monte Carlo methods," Seminar for Applied Mathematics, ETH Zürich, Tech. Rep. 2012-24, 2012.