Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Using Message Semantics for Fast-Output Commit in Checkpointing-and-Rollback Recovery Luís Moura Silva

João Gabriel Silva

Departamento Engenharia Informática Universidade de Coimbra, Polo II 3030 - Coimbra PORTUGAL Email: [email protected]

Abstract Checkpointing is a very effective technique to ensure the continuity of long-running applications in the occurrence of failures. However, one of the handicaps of coordinated checkpointing is the high latency of committing output from the application to the external world. Enhancing the checkpointing scheme with a message-logging protocol is a good solution to reduce the output latency. The idea is to track the sources of non-determinism in order to replay the application in a reproducible way during rollback-recovery. In this paper, we present a new event-logging scheme that only logs those messages that may be delivered non-deterministically to the application. While other schemes keep track of the arrival order of all the messages, we save only the delivery order of some of them. Our scheme exploits the semantics of message passing and considerably reduces the number of logged receive events compared with other existing schemes. We present performance results that compare the output latency of coordinated checkpointing, pessimistic message logging, optimistic message logging and our event-logging scheme.

1. Introduction Checkpointing allows long-running programs to save their state at regular intervals so that they may be restarted after interruptions without unduly retarding their progress. It has been widely studied, and several checkpointing schemes have been presented in the literature [1]. One limitation of some checkpointing schemes is the difficulty of committing output from the application to the “outside world”. Since applications may communicate with entities outside the system, it may be necessary to hide the effects of rollback. Some algorithms assume that applications only perform I/O operations at the beginning and at the end of the execution. This is not true of all applications, and intermediate I/O operations should also be taken into account. Herein, we have to distinguish between two classes of I/O operations:

(i) disk I/O operations, which usually represent the majority of the I/O operations; (ii) other external operations that are considered not reversible, like output that is sent to a data-visualization system, a printer, or some other device whose operations cannot be undone. Previous checkpointing schemes do not distinguish between these two classes of operations and treat them in the same way. We depart from this approach and treat the disk operations separately. In [2] we have presented some file-checkpointing mechanisms. In this paper, we discuss the relevance of external output to other devices. Some external outputs can be repeated without harm. These output operations are associated with idempotent actions. In these situations, the application can be rolled back to a previous point in the execution and exhibit some “stuttering” effect, by sending the message to the external device again, without causing any inconsistency. However, in other cases the application may interact with external devices that are unable to undo changes, such as some human-interface devices (e.g. on-line data visualization or writing to a printer). Output operations that cannot be undone are called non-idempotent. A message that is sent to the external world is called an output message, while a message that is received from it is called an input message. Input messages are easy to handle: we log the data in stable storage and replay it during rollback-recovery, as was proposed in [3]. However, the sending of output messages to an external device should be delayed until the system can assure that the operation will not be undone. The time it takes to release the output is called the output-commit latency. For applications that make extensive use of I/O operations it is very important to reduce that latency.
To guarantee a correct external behaviour of the application, all the output messages associated with non-idempotent operations can only be sent when the system is sure that they will not be rolled back.
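The input-message side of this contract can be sketched in a few lines: inputs are appended to stable storage before delivery, and during rollback-recovery the log is replayed instead of the live device. This is a minimal illustration of the idea attributed to [3]; the class and method names are our own, not from the paper.

```python
# Hedged sketch: log input messages to stable storage on arrival and
# replay them from the log during rollback-recovery. InputLog and its
# methods are illustrative names, not part of any described system.
import json
import os
import tempfile

class InputLog:
    def __init__(self, path):
        self.path = path
        self.replay = []

    def record(self, msg):
        # Append to stable storage before delivering to the application.
        with open(self.path, "a") as f:
            f.write(json.dumps(msg) + "\n")
        return msg

    def load_for_replay(self):
        # After a rollback, re-read the stable log.
        with open(self.path) as f:
            self.replay = [json.loads(line) for line in f]

    def next_input(self, live_source=None):
        # During recovery, serve logged inputs instead of the live device.
        if self.replay:
            return self.replay.pop(0)
        return live_source() if live_source else None

path = os.path.join(tempfile.mkdtemp(), "input.log")
log = InputLog(path)
log.record({"sensor": 1})
log.record({"sensor": 2})
log.load_for_replay()                 # simulate rollback-recovery
assert log.next_input() == {"sensor": 1}
assert log.next_input() == {"sensor": 2}
```

Output messages have no such simple treatment, which is the subject of the rest of the paper.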

0-7695-0001-3/99 $10.00 (c) 1999 IEEE


This requires that all the messages that causally precede the output operation be logged in stable storage or saved in a process’s checkpoint, so that they cannot be lost in case of failure. Because of the causal dependencies, a process cannot decide locally whether a particular event is committable; it must gather information from the other processes of the application. In any case, it is necessary to execute some output-commit procedure to ensure that the sending of an output message will never need to be undone. The solution adopted by previous coordinated checkpointing schemes [4] was to force a global checkpoint before committing any message to the external world. Albeit feasible, this solution presents a high overhead and, in some cases, a large latency. The time to commit an output can be reduced by a message-logging protocol. The first possible solution is to use a pessimistic message-logging scheme together with a checkpointing mechanism. Pessimistic protocols can commit output messages immediately, without any coordination with the other nodes of the system. This scheme ensures a very fast output commit [5] but at the expense of a high performance overhead. Using an optimistic message-logging scheme is another solution. Before committing an output message it is necessary to run a multihost protocol to ensure that all the messages that causally precede that event are logged in stable storage. The communication that is required and the saving of the message logs to stable storage may imply a large output-commit latency [3]. In [6] an output-commit algorithm for optimistic message logging was presented that requires communication with the minimum number of other processes of the application. As a result, it supports a faster commit of output operations than most previous algorithms based on optimistic message logging, but it still requires the synchronous logging of a considerable number of messages.
In [7] a mixed scheme was presented that uses message logging for deterministic processes and checkpointing for non-deterministic processes. Output commit only involves the non-deterministic processes, which are forced to take additional checkpoints. Coordinated checkpointing can also be used together with message logging [8]. Message-logging protocols usually exhibit better output-commit latency than protocols that use checkpointing alone. Schemes that force a global checkpoint before committing any output to the external system may represent a costly solution and introduce a high output latency. Hence, the time to commit output can be reduced by adding a message-logging protocol. The best approach presented so far was the Manetho system [9]. It reduces the latency of output commit by allowing messages to be sent to the outside world without multihost coordination. That

scheme used an antecedence graph to capture the causality relationship between events of the distributed application. The antecedence graph records the message arrival order and the information necessary to reproduce some internal non-deterministic events that can be tracked efficiently by the operating system. It is piggybacked on each application message and is maintained by each process of the system. If a process wants to send some irreversible output message it just needs to save the state of its local graph to stable storage. The practical implementation showed that the output latency of this system is considerably lower than the output latency achieved with optimistic message logging, which requires multihost coordination, and with another scheme that forces a consistent checkpoint every time a message is sent to the outside world. The cost of this scheme includes the information overhead that is piggybacked on each application message and the storage overhead that is required to keep all the messages in the senders’ volatile memory. We seek a solution that assures a short output-commit latency and introduces a low information overhead. Section 2 presents the rationale for a new output-commit scheme. Section 3 describes our algorithm, while section 4 presents some performance results: several schemes have been implemented and we have measured their output latency. Finally, section 5 concludes the paper.

2. Rationale for a New Output Commit Algorithm The recovery scheme must ensure that after a recovery operation the system still “agrees” with the state of the external world, either by guaranteeing deterministic re-execution or by using some form of output commit. Output data should be buffered until it is guaranteed that it can be safely committed. If the process execution and the message delivery are both deterministic then it is enough to tag the output interactions in order to discard “stuttering” outputs in the occurrence of a rollback. There is no need to log messages, perform global checkpoints or execute a multihost protocol. Thereby, the important issue is to force the determinism of the applications, and then the output messages can be committed with a shorter latency. The same conclusion was reached in [10]. Their idea was to have the language and the compiler force the determinism of the applications in order to provide very fast output commit. If the compiler is able to do that, then output messages can be committed immediately, without requiring any multihost coordination; numbering the output events is enough to prevent duplicated output during re-execution. In that scheme, the checkpoints are taken only at points in the code where the network is known to be empty of messages. However, it is not certain that


this property can be achieved with off-the-shelf compiler technology. Our scheme is library-oriented and does not rely on any special compiler functionality. We have to record all the information related to potential non-deterministic events and replay them in the same order during rollback-recovery. Efficient tracking of non-determinism is crucial to supporting interactive applications. There are three main classes of non-deterministic events: (i) internal synchronous events, like some kernel calls, the random() primitive, etc.; (ii) internal asynchronous events, like the synchronization operations between threads, asynchronous software interrupts such as signals, etc.; (iii) the arrival order of the messages. Some techniques have been provided to handle the non-determinism that results from reading a value returned by the operating system or a random-number generator [11][12][13]. The basic idea is to save those values and replay them in the same order during rollback-recovery. It was shown in [13] that the overhead added to kernel calls is about 3%. In [14] a very interesting technique was presented to track the non-determinism resulting from asynchronous events, like software interrupts (e.g. signals) and shared-memory manipulation in multithreaded applications. That technique relies on a software counter to compute the number of instructions between non-deterministic events in normal operation. Upon the occurrence of a failure, the instruction counts are used to force the replay of those events at the same execution points. This technique was implemented on an Alpha processor and was shown to produce a low overhead, typically less than a 6% increase in the running time. The last source of non-determinism is the message arrival order. According to [15], it represents the majority of the non-deterministic events.
For this reason, and since the other sources of non-determinism already have effective solutions, we decided to optimize the tracking of the potential non-determinism produced by the communication. First of all, if the recovery system can determine that an application executes deterministically, independently of message arrival order, then there is no need to log the message arrival order at all. This can be traced by the run-time system. Some interesting observations can be made about message arrival order. First, there is no need to log the body of a message: only the message arrival order. The message contents can be recreated during re-execution and sent in the order recorded in the log. This was corroborated in [8] and [16]. It represents a significant optimization over other schemes that log the contents of the messages [11][12].

The Manetho system [9] also logs the arrival order of each message. However, a further optimization was proposed in [17]. A tracing technique was presented that checks each message to determine if it races with another one and only saves one of the racing messages. The success of that technique relies on the fact that only those messages that introduce non-determinism are logged by the system. The results achieved by that scheme were outstanding, improving by up to two orders of magnitude over earlier techniques that log every message. That scheme requires the piggybacking of a Vector Time timestamp on each application message but is very easy to implement. The second important observation is that message arrival order, while potentially non-deterministic, is often irrelevant to the application. Usually the application receives the messages in some static order and will produce the same results regardless of the arrival order. In practice, the message arrival order is not the actual issue: what really counts is the order in which messages are delivered to the application. Let us explain through an example. Suppose that messages m1 and m2 are sent by processes Pj and Pk (respectively) to process Pi. Since those messages are concurrent they may arrive at process Pi in a different order in different executions of the program. What happens next depends on the application code of process Pi. Two cases can happen, and they are presented in Figure 1.

1st case:
  …
  recv(---, Pj, ---)
  …
  recv(---, Pk, ---)
  …

2nd case:
  …
  recv(---, ANY_PROC, ---)
  …
  recv(---, ANY_PROC, ---)
  …

Figure 1: Deterministic and non-deterministic message delivery.

In the first case, the code of Pi has two receiving statements that specify the identity of the sending processes, while in the second case, process Pi may receive the two messages from any other process. In the first case, message m1 will always be delivered first and message m2 will always be delivered second. Thus, the application behaves deterministically regardless of the message arrival order. The only requirement is that messages sent from the same process are delivered in the same order at the receiver. In the second case, the value ANY_PROC represents a sort of wildcard, meaning that a process may receive a message from any source. In this situation, the first message to be delivered to the application is the first one to arrive. This situation can introduce some non-determinism.
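The two cases of Figure 1 can be made concrete with a small simulation. The sketch below, with illustrative names modelled on the figure (`recv`, `ANY_PROC`), shows that a receive naming its sender delivers the same message in every run regardless of arrival order, while a wildcard receive delivers whichever message arrived first.

```python
# Hedged sketch of the two receive cases in Figure 1. recv() matches the
# first pending message compatible with the requested sender; ANY_PROC is
# the wildcard. These names are illustrative, not a real MPI-like API.
ANY_PROC = "*"

def recv(pending, sender):
    """pending: list of (sender, payload) tuples in arrival order."""
    for i, (src, _) in enumerate(pending):
        if sender == ANY_PROC or src == sender:
            return pending.pop(i)
    raise RuntimeError("no matching message")

# Two executions where m1 (from Pj) and m2 (from Pk) arrive in opposite order.
for arrival in ([("Pj", "m1"), ("Pk", "m2")], [("Pk", "m2"), ("Pj", "m1")]):
    pend = list(arrival)
    # 1st case: senders named -> delivery is m1 then m2 in BOTH runs.
    assert recv(pend, "Pj") == ("Pj", "m1")
    assert recv(pend, "Pk") == ("Pk", "m2")

# 2nd case: wildcard -> delivery follows arrival order, so the delivery
# order must be logged to make the replay reproducible.
pend = [("Pk", "m2"), ("Pj", "m1")]
assert recv(pend, ANY_PROC) == ("Pk", "m2")
```

Only the second case produces an event that the recovery system needs to record.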


Thereby, the delivery order of those two messages should be logged in stable storage. During the recovery phase those messages will be delivered in exactly the same order in which they were delivered in the previous execution. From our experience, it is usual for the application developer to specify the message sender in each receiving operation. Only in some particular situations is it advisable to use the wildcard to receive the message from any process. Therefore, it suffices to log the delivery order for the receiving operations that use that wildcard, instead of the message arrival order for all the messages. This optimization significantly reduces the number of events that need to be logged. Even so, there are some global collective operations (e.g. calculating the maximum value among all the processes) that may be implemented using wildcard message receiving. In this case the delivery order is not relevant, since the final result of the operation is not sensitive to it. To summarize, what is definitely relevant for the execution determinism is the delivery order and not the message arrival order. This is a very important point that was neglected in most of the previous work.

3. A New Output Commit Algorithm In order to guarantee a correct behaviour of the application with respect to the external world, the recovery system has to treat the input and output messages properly. Input messages are synchronously saved in stable storage and reproduced during the replay phase without any user intervention. Output messages are only committed when the system can be sure that all the information concerning the non-deterministic events has been saved in stable storage. During a recovery operation, the exact same computation will be reproduced, and a simple scheme that assigns sequence numbers to output events will be used to discard duplicated output messages. Together with the periodic consistent checkpoints, the system should track the non-deterministic events and save the necessary information so that the same behaviour can be reproduced in a recovery operation. Those events are kept in a volatile log (ND_Log). Typically it contains the necessary information about the non-deterministic events that occur during normal operation. For instance, it includes the values returned by system calls, the instruction counts for the asynchronous interrupts, the message receive order and the interactions with the external world (input and output messages). Since the events can have a different nature, the log entries may have different formats. The tracking of internal non-deterministic events can be done with the techniques presented in [12].
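The sequence-number scheme for discarding duplicated output messages can be illustrated as follows. This is a minimal sketch under the assumption of deterministic replay; the `OutputGate` name and its fields are ours, not the paper's.

```python
# Hedged sketch: suppress duplicated ("stuttering") outputs during a
# deterministic replay by numbering output events. OutputGate is an
# illustrative name, not part of the described system.
class OutputGate:
    """Tags each output with a sequence number; during replay, outputs
    whose number was already released before the failure are discarded."""

    def __init__(self):
        self.next_seq = 0        # sequence number of the next output event
        self.committed_seq = -1  # highest sequence number already released

    def send(self, msg, device):
        seq = self.next_seq
        self.next_seq += 1
        if seq <= self.committed_seq:
            return None          # replayed output: already seen, discard
        self.committed_seq = seq
        device.append((seq, msg))  # release to the external world
        return seq

device = []
gate = OutputGate()
gate.send("result=42", device)   # normal run: released
gate.next_seq = 0                # simulate rollback to the beginning
gate.send("result=42", device)   # deterministic replay: discarded
assert device == [(0, "result=42")]
```

The external world observes each output exactly once, even though the computation executed it twice.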

Concerning the non-determinism of the communication, there are basically three choices: (1) log the arrival order of every message, as in Manetho [9]; (2) log the arrival order of only those messages that may race [17]; (3) log the delivery order of those messages that can be received from more than one process. This is our choice. Depending on the recovery style, there are three ways to maintain the log: (A) it may be flushed to disk synchronously [11]; (B) it may be piggybacked on the application messages, as in [9]; (C) it may be buffered in volatile memory and flushed to disk asynchronously. Our main guidelines are twofold: first, to optimize the performance of the applications and, secondly, to avoid carrying excessive redundant information on each message. Method A involves the synchronous saving of log entries in stable storage. It can introduce considerable overhead when the frequency of non-deterministic events is high. If non-deterministic events seldom occur then this method can be considered, since the output messages can be committed immediately without any multihost coordination. When the non-deterministic events occur very often, it is advisable to flush the log to disk asynchronously with the application (method C), or to distribute the log information to the other processes (method B). This latter method may introduce a substantial information overhead in the normal messages, and thus method C seems to be the best alternative. In our opinion, the best way is to provide a hybrid scheme: the recovery system uses method A when the application has only a few non-deterministic events or when it is very important to provide output commit with a short latency. Method C is used in the other cases and is suited for systems with frequent output. Method B is not adopted since it may introduce a considerable information overhead.
In practice, the method to adopt should be decided by the run-time system, which can use some heuristics about the frequency of occurrence of non-deterministic events. For the sake of simplicity, let us assume that there is only one process that interacts with the external world. This is in fact the usual situation. That process is called Pout, and the event corresponding to the output message is designated eout. To commit the output message, process Pout has to make sure that all the (local and remote) non-deterministic events that causally precede eout are stable, that is, saved in stable storage. Deterministic events are always stable, albeit not saved in stable storage.


When the system makes use of method A, all the information about non-determinism is synchronously and atomically saved in stable storage. Thus, a process can commit the output immediately, without requiring any multihost coordination. If the system uses method C, then it requires a commit protocol in order to guarantee that the same state that produced the output can be recovered after a failure. In this method, the non-deterministic events are first maintained in a volatile log located in the main memory of the process. That log is flushed to stable storage when it gets full, when the process takes a local checkpoint, or as part of an output-commit operation. Process Pout needs to know whether all the others have executed deterministically. To achieve that, the system uses a local flag in each application process (DET_FLAG) and adds one bit of control to every application message (DET_BIT). The local flag is set to 1 when the process is created, in each checkpoint operation, and as part of the output-commit protocol (when the log is flushed to stable storage). When a process receives a message it keeps the value of the DET_BIT carried in that message in a boolean variable that we call LAST_DET_BIT. The bit that is piggybacked on every message absorbs the status of the sender’s local flag. That is, the sending process performs a logical AND between its DET_FLAG and its LAST_DET_BIT; the result of this operation is piggybacked in the DET_BIT of the next message sent by the process. In this way, the DET_BIT carries transitive information about the occurrence of non-deterministic events in different processes. When some process executes a non-deterministic event it logs some information in the volatile ND_Log and sets the DET_FLAG to zero. Every message sent by that process after that event will certainly have the DET_BIT set to 0.
The bit remains 1 if the process received the last message with that value and the process itself has been deterministic (i.e. its local flag is still equal to 1). An important optimization can be achieved by the following rule: process Pout does not need to tell the others whether it is deterministic. This avoids reflective information in the DET_BIT. That process receives the non-deterministic status of the other processes but does not send its own status to them. When that process executes event eout it has to execute the procedure presented in Figure 2. When all the other processes are deterministic there is no need for multihost coordination: output is committed immediately. However, if process Pout has some non-deterministic events then it should flush its log before sending the output to the external world.
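The DET_FLAG / DET_BIT bookkeeping described above can be sketched directly: each sender ANDs its own determinism flag with the last bit it received, so one non-deterministic event anywhere upstream drives the bit to 0 along every causal path. The class and method names below are illustrative.

```python
# Hedged sketch of DET_FLAG / DET_BIT propagation. Message passing is
# replaced by direct method calls; Process and its methods are our own
# illustrative names for the bookkeeping the paper describes.
class Process:
    def __init__(self):
        self.det_flag = 1       # 1 while this process has been deterministic
        self.last_det_bit = 1   # DET_BIT carried by the last received message
        self.nd_log = []        # volatile log of non-deterministic events

    def log_nondeterministic(self, event):
        self.nd_log.append(event)
        self.det_flag = 0       # all later outgoing DET_BITs become 0

    def send_bit(self):
        # DET_BIT piggybacked on the next outgoing application message:
        # logical AND of the local flag and the last received bit.
        return self.det_flag & self.last_det_bit

    def receive_bit(self, det_bit):
        self.last_det_bit = det_bit

p, q, r = Process(), Process(), Process()
q.log_nondeterministic("wildcard recv")  # q becomes non-deterministic
r.receive_bit(q.send_bit())              # q -> r carries DET_BIT = 0
p.receive_bit(r.send_bit())              # r -> p: the 0 propagates transitively
assert p.last_det_bit == 0               # p learns a predecessor was ND
```

Note that r itself executed no non-deterministic event; the AND is what makes the information transitive.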

When the other processes have non-deterministic events in their volatile logs they have to flush their logs before sending the FLUSH_OK message. The DET_FLAG is set to 1 after this operation, since at this point the process can be classified as deterministic. Process Pout waits for all the FLUSH_OK messages and only after that is it allowed to send the output message to the external world.

Process Pout:
  Output_Commit()
    if (DET_FLAG = 0) then
      flush local ND_Log;
      DET_FLAG := 1;
    fi
    if (LAST_DET_BIT = 0) then
      bcast FLUSH_LOG message;
      wait_for_ack();
    fi
    send_output_to_external_world();

Other Process:
  when recv FLUSH_LOG msg:
    if (DET_FLAG = 0) then
      flush local ND_Log;
      DET_FLAG := 1;
    fi
    send FLUSH_OK msg to Pout;

Figure 2: Output commit procedure.

The output message is buffered and the computation can proceed while the output-commit protocol executes in the background. Thus, it does not considerably affect the performance of the application, only the output latency.
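The procedure of Figure 2 can be exercised end-to-end in a small simulation. The sketch below replaces message passing with direct calls and stable storage with an in-memory list; `Proc`, `flush`, and `output_commit` are illustrative stand-ins for the paper's pseudocode, not its implementation.

```python
# Hedged simulation of the Figure 2 output-commit procedure. "Stable
# storage" is an in-memory list and the FLUSH_LOG broadcast is a loop of
# direct calls; all names are illustrative.
stable_storage = []

class Proc:
    def __init__(self, name):
        self.name = name
        self.det_flag = 1
        self.last_det_bit = 1
        self.nd_log = []

    def flush(self):
        # flush local ND_Log; DET_FLAG := 1  (both branches of Figure 2)
        if self.det_flag == 0:
            stable_storage.extend(self.nd_log)
            self.nd_log = []
            self.det_flag = 1

    def on_flush_log(self):
        # Handler run by the other processes on receiving FLUSH_LOG.
        self.flush()
        return "FLUSH_OK"

def output_commit(p_out, others, external_world, msg):
    p_out.flush()                  # flush Pout's own ND_Log if needed
    if p_out.last_det_bit == 0:    # some other process was non-deterministic
        acks = [q.on_flush_log() for q in others]   # bcast FLUSH_LOG
        assert all(a == "FLUSH_OK" for a in acks)   # wait_for_ack()
    external_world.append(msg)     # send_output_to_external_world()

p_out, q = Proc("Pout"), Proc("Q")
q.nd_log, q.det_flag = ["evt"], 0  # Q logged a non-deterministic event
p_out.last_det_bit = 0             # Pout learned of it via a DET_BIT
world = []
output_commit(p_out, [q], world, "output#1")
assert world == ["output#1"]
assert stable_storage == ["evt"] and q.det_flag == 1
```

When every `last_det_bit` is 1, the broadcast is skipped entirely and the output is released without any multihost coordination, which is the fast path the scheme is designed around.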

4. Implementation Results In this section, we present the results of an experimental study that was conducted on a commercial parallel machine. We have implemented coordinated checkpointing, independent checkpointing, and pessimistic and optimistic message logging. The goal was to compare the costs and the performance penalty incurred by each scheme. All these schemes were included in a checkpointing library (CHK-LIB) that we have developed for the Parix operating system1 [18]. That library provides reliable FIFO communication and an MPI-like programming interface. The interested reader is referred to [19] for more details about the CHK-LIB. The testbed machine was a Parsytec Xplorer with 8 Transputers (T800). In that machine, each processor has 4 Mbytes of main memory. All of the processors have direct access to the host file system. In our particular case, the host machine was a SunSparc2.

4.1 The Application Benchmarks To evaluate the output latency of our scheme we have used the following application benchmarks:

1 Parix is a product of Parsytec Computer GmbH.


• ISING: This program simulates the behaviour of spin-glasses. Each particle has a spin, and it can change its spin from time to time depending on the state of its 4 direct neighbours and the temperature of the system. Above a critical temperature the system is in complete disarray. Below this temperature the system has a tendency to establish clusters of particles with the same spin. Each element of the grid is represented by an integer, and we executed this application for several grid sizes. • SOR: successive over-relaxation is an iterative method to solve Laplace’s equation on a regular grid. The grid is partitioned into regions, each containing a band of rows of the global grid. Each region is assigned to a process. The update of the points in the grid is done by a Red/Black scheme. This requires two phases per iteration: one for black points and another for red points. In every iteration the slave processes have to exchange the boundaries of their data blocks with two other neighbours, and at the end of the iteration all the processes perform a global synchronization and evaluate a global test of convergence. Each element of the grid is represented in double precision, and we executed this application for several grid sizes. • ASP: solves the All-pairs Shortest Paths problem, i.e. it finds the length of the shortest path from any node i to any other node j in a given graph with N nodes by using Floyd’s algorithm. The distances between the nodes of the graph are represented in a matrix and each slave computes part of the matrix. It is an iterative algorithm. In each iteration one of the slaves has the pivot row and broadcasts its value to all the other slaves. We solve the problem with two graphs of 512 and 1024 nodes. • NBODY: this program simulates the evolution of a system of bodies under the influence of gravitational forces.
Every body is modelled as a point mass that exerts forces on all other bodies in the system, and the algorithm calculates the forces in three-dimensional space. This computation is the kernel of particle-simulation codes that simulate the gravitational forces between galaxies. We ran this application for 4000 particles. • GAUSS: solves a system of linear equations using the method of Gauss elimination. The algorithm uses partial pivoting and distributes the columns of the input matrix among the processes in an interleaved way to avoid imbalance problems. In every iteration, one of the processes finds the pivot element and sends the pivot column to all the other processes. We solve two systems of 512 and 1024 equations. • TSP: solves the travelling-salesman problem for a dense map of 16 cities, using a branch-and-bound algorithm. The jobs were divided according to the possible combinations of the first 3 cities. • NQUEENS: counts the number of solutions to the N-queens problem. The problem is distributed over several jobs, assigning to each job a possible placement of the first two queens. We solved this problem with 13 queens.

4.2 Checkpointing and Message-Logging Schemes In our experimental study we have implemented several schemes and measured their output latency. Our output-commit protocol can be used with a pessimistic logging strategy, an optimistic logging strategy or a coordinated checkpointing algorithm. For this latter case, we have implemented one of the algorithms existing in the literature [20]. • Coordinated Checkpointing (Coord_Chkp): this scheme uses a non-blocking checkpointing protocol that achieves a consistent global state of the application. The algorithm is non-blocking in the sense that after a process takes its local checkpoint it proceeds with its computation without waiting for the others. All the synchronization is done in the background in order to reduce the performance degradation. More details about the algorithm can be found in [20]. To evaluate the effectiveness of our output-commit protocol we have also implemented the two main schemes of message logging: • Pessimistic Message Logging (PML): In this scheme every message exchanged by the application is synchronously saved in stable storage by the receiving process. Thus, every message that is received by an application process implies a disk-write operation. To assure the failure atomicity of the logging operation, a copy of the message is kept in the volatile memory of the sender process until it receives a notification from the receiving process saying that the message has been logged. When the sender receives this acknowledgement it can discard the message copy. This notification is done in the background and does not introduce any blocking at the sender process. • Optimistic Message Logging (OML): In this scheme, every message received by a process is logged in a volatile cache. Several messages are collected in that cache and are asynchronously written to disk in one single write operation.
In this way, this asynchronous approach introduces a much lower cost than the previous logging scheme. In our optimistic message-logging scheme we have used a cache of 100 Kbytes. The cache is flushed to disk in three different situations: (i) when it gets full; (ii) when the process is checkpointed; (iii) when the application sends an output message to the external world. These two previous schemes log the whole contents of the message. This is necessary when we use the message-logging protocol with an independent checkpointing algorithm or when we want to provide autonomous and localized recovery. However, it is unnecessary when we use the message-logging scheme with coordinated checkpointing with the aim of reducing the output latency.
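The volatile cache and its three flush triggers can be sketched compactly. A message-count capacity stands in for the paper's 100 Kbyte cache; the class and method names are illustrative.

```python
# Hedged sketch of the OML volatile cache: received messages accumulate
# in memory and are flushed to "disk" in a single batched write when the
# cache fills, at a checkpoint, or before an output commit. VolatileLog
# and a count-based capacity are our own illustrative simplifications.
class VolatileLog:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = []
        self.disk_writes = []        # each entry models one batched write

    def log_message(self, msg):
        self.cache.append(msg)
        if len(self.cache) >= self.capacity:   # (i) cache full
            self.flush()

    def flush(self):
        if self.cache:
            self.disk_writes.append(list(self.cache))  # one write op
            self.cache = []

    def on_checkpoint(self):         # (ii) process is checkpointed
        self.flush()

    def on_output_commit(self):      # (iii) output sent to external world
        self.flush()

log = VolatileLog(capacity=3)
log.log_message("m1")
log.log_message("m2")
log.on_output_commit()               # flushes m1, m2 in one write
log.log_message("m3")
log.on_checkpoint()
assert log.disk_writes == [["m1", "m2"], ["m3"]]
```

Batching is the point: two disk writes cover three messages here, whereas a pessimistic scheme would pay one synchronous write per message.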

0-7695-0001-3/99 $10.00 (c) 1999 IEEE


In this case, message logging is mainly used to force the determinism of processes. Saving only the message header, together with keeping track of all the other sources of non-determinism in the process execution, reduces the amount of data that needs to be logged. For this reason we implemented four other schemes:

• PML_H: this scheme uses pessimistic message logging, but instead of logging the whole message contents it saves only the message headers. This is the information necessary to reproduce the order in which the messages were received.

• OML_H: this scheme follows the previous optimistic message-logging protocol, but likewise saves only the message headers. In other words, the log only records the message arrival order.

• PML_ND: this scheme uses synchronous saving of the non-deterministic events to stable storage. It does not log the headers of all the messages, but only of those messages that may be delivered to the application in a non-deterministic way. This scheme uses the ideas presented in Section 3.

• OML_ND: uses exactly the same technique as PML_ND but saves the non-deterministic events asynchronously to stable storage. Those events are locally saved in an ND_Log that is flushed during the checkpointing and the output-commit protocols. This scheme also uses the ideas presented in Section 3.

In the last two cases the log of each process only records the message delivery order of some receiving operations, plus the information necessary to reproduce those internal non-deterministic events that can be easily tracked by the checkpointing library. It was interesting to observe that, of those 7 application benchmarks, only 3 actually present non-deterministic events: these include messages that can be delivered in a non-deterministic way and the use of a random() generator. There was no use of threads, shared memory, signals or other potentially non-deterministic system features; in fact, it is unusual to use those features in scientific applications. Since these applications have only a few non-deterministic events, we decided to give more emphasis to the results of pessimistic logging within our scheme (PML_ND) instead of using the multihost protocol presented in Figure 2 (OML_ND).

4.3 Experimental Results: Output Commit Latency

In the case of pessimistic message logging (PML, PML_H and PML_ND) the output commit only involves a single write to disk to log the number of the output operation. Numbering the output events is enough to prevent duplicated output during re-execution: every output message is tagged with a unique increasing sequence number. The average output latency for all the pessimistic message-logging protocols was 2.5 milliseconds. This was the average time spent writing the output sequence number to a special disk file, measured over several runs and considering all the applications. There were some small deviations in the output latency of the pessimistic message-logging protocols. Usually the output latency is higher for PML than for PML_H, since the first scheme introduces a higher congestion in the stable storage. Figure 3 shows the maximum and minimum values that were observed with the PML_H scheme on the ISING application. For small sizes of the grid, the application is more communicative and accesses the stable storage more frequently. This affects the write operation performed when executing the output-commit procedure, and is the reason why the output latency is somewhat higher for the applications that are communication-intensive.

[Figure: latency in msec (min and max curves) versus grid size, 256 to 2048.]
Figure 3: Output latency for the PML_H scheme (ISING Application).
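The selective logging used by PML_ND and OML_ND can be illustrated with a small sketch. The assumption here (ours, for illustration only) is that a receive naming a specific sender is deterministic and reproducible by itself, while a receive using a wildcard such as MPI's ANY_SOURCE is not; only the latter produces an entry in the ND log:

```python
# Sketch: log a delivery event only when the receive is non-deterministic,
# i.e. when it could have matched messages from more than one sender.
ANY_SOURCE = -1  # hypothetical wildcard, in the style of MPI_ANY_SOURCE

class NDLogger:
    def __init__(self):
        self.nd_log = []   # delivery order of non-deterministic receives

    def on_deliver(self, requested_source, actual_source, seq):
        # A receive from a specific sender needs no log entry: replaying
        # the same receive yields the same message. A wildcard receive is
        # non-deterministic, so we record which sender actually matched
        # (and the message sequence number) to force the same choice
        # during rollback-recovery.
        if requested_source == ANY_SOURCE:
            self.nd_log.append((actual_source, seq))

log = NDLogger()
log.on_deliver(requested_source=3, actual_source=3, seq=7)           # not logged
log.on_deliver(requested_source=ANY_SOURCE, actual_source=1, seq=2)  # logged
log.on_deliver(requested_source=ANY_SOURCE, actual_source=4, seq=9)  # logged
print(log.nd_log)   # only the two wildcard deliveries were recorded
```

Under this assumption the log grows with the number of wildcard receives rather than with the total message count, which is why the schemes above log so much less than PML_H or OML_H.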

The optimistic message-logging schemes (OML and OML_H) require an additional multihost coordination protocol that involves flushing the message logs of all the processes that have direct or transitive dependencies on the process that wants to send some output to the external world. The output latency of these schemes depends on the execution time of that distributed commit protocol. If we do not use any message-logging protocol, the output commit requires a global coordinated checkpoint of the application; in this case, the output latency corresponds to the average checkpoint duration. In Table 1 we present the output latency for coordinated checkpointing, pessimistic message logging and the two versions of optimistic message logging. The numbers presented for these two last schemes are the maximum values observed over several runs of the application.

Applications    Coord_Chkp   PML   OML    OML_H
ISING 2048      39752        2.5   634    188
SOR 1280        31650        2.5   1427   115
GAUSS 1024      22604        2.5   1631   87
ASP 1024        14824        2.5   1362   99
NBODY 4000      11918        2.5   749    72
TSP 16          45694        2.5   80     45
NQUEENS 13      22652        2.5   132    57

Table 1: Output Latency (msec).
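The duplicate-suppression mechanism behind the pessimistic numbers above (tagging every output with a unique increasing sequence number and logging only that number at commit time) can be sketched as follows; the class and field names are ours, chosen for illustration:

```python
# Sketch: numbering output events so that re-execution after a rollback
# does not repeat output already seen by the external world.

class OutputCommitter:
    def __init__(self):
        self.next_seq = 0        # sequence number of the next output
        self.committed_seq = -1  # highest number written to stable storage
        self.sent = []           # models what the external world received

    def output(self, data):
        seq = self.next_seq
        self.next_seq += 1
        if seq <= self.committed_seq:
            return               # replayed, already-committed output: drop it
        self.committed_seq = seq # one small synchronous write to stable storage
        self.sent.append(data)   # now the output can be released

oc = OutputCommitter()
oc.output("a"); oc.output("b")   # normal execution commits outputs 0 and 1
oc.next_seq = 0                  # simulate rollback and re-execution
oc.output("a"); oc.output("b"); oc.output("c")
print(oc.sent)                   # duplicates suppressed: ['a', 'b', 'c']
```

Since the only stable-storage traffic at commit time is one small write (the sequence number), the latency stays constant regardless of application or message size, matching the flat 2.5 ms column in Table 1.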


We can see that the multihost coordination protocol used by the optimistic message-logging schemes presents an output latency that can be at least 2 or 3 orders of magnitude higher than that of the pessimistic version. The performance of that protocol depends mainly on the size of the logs and the message traffic in the network. After some experiments we concluded that the major source of overhead was not the multihost coordination messages, but rather the size of the log caches at the time the protocol was executed. Between OML and OML_H there are also considerable differences. For instance, for the GAUSS application the output latency with OML_H was 5% of OML's latency. The smallest difference happened with the TSP application, where the output latency of OML_H is still only 56% of the latency of the OML scheme. The output latency of using a global checkpoint is 5 orders of magnitude higher than the latency of the PML schemes.

In Figure 4 we compare the output latency of OML with OML_H for the ISING application. The OML scheme logs more data than OML_H, so it is natural that at the time of the output-commit protocol the message logs of each processor have more data to be flushed to disk. This delays the completion of the output-commit protocol and increases the output latency. The results for the ISING application show that the output latency of OML_H was 10% to 30% lower than the latency of OML.

[Figure: output latency in seconds (OML and OML_H curves) versus grid size, 256 to 2048.]
Figure 4: Output latency of optimistic message-logging schemes (ISING Application).

In most of our experiments we used a local cache of 100 Kbytes. That cache keeps the messages until they are saved to stable storage. In the next experiment we used different sizes for the caches (100K, 500K and 1000K) and measured the effect of that factor on the output latency of OML. The results were taken with the SOR application and are presented in Figure 5. We can see in Figure 5 that the lowest output latency was always achieved with a log of 100 Kbytes. Increasing the size of the log means more data to flush at the time of the output-commit protocol, which results in an inevitable increase in the output latency. This corroborates our previous claim that the output latency of the OML schemes is mainly affected by the amount of data that needs to be logged in stable storage.

[Figure: output latency in seconds versus grid size, 256 to 1280, for log caches of 100K, 500K and 1000K.]
Figure 5: Output latency with different log caches (SOR application with OML).

We also measured the total overhead of the OML scheme with those three log-cache sizes. The results are presented in Figure 6, where we also include the overhead of OML_H for comparison. The SOR application was executed for 100 iterations in all cases.

[Figure: overhead in seconds versus grid size, 256 to 1280, for OML with 100K, 500K and 1000K caches and for OML_H with a 100K cache.]
Figure 6: Overhead of the OML scheme with different sizes of the log.

Two things can be observed: first, the overhead of the OML scheme with a cache of 500K or 1000K is about the same; second, there was a reduction in the overhead when we used a log size of 500K instead of the usual 100K. Although the difference is not very significant, a trade-off decision has to be made about the size of the log caches: with a 100-Kbyte cache the output latency is smaller, at the expense of a higher overhead during normal execution. If we want to reduce this overhead we can choose a 500-Kbyte log cache, but this will increase the output latency. For those applications that communicate intensively with the external world it is very important to assure a fast output commit. Suppose that the application has to send some output periodically to an X-server. From the point of view of the end user, what is perceptible is the time between two adjacent outputs.
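The cache-size trade-off can be mimicked with a small back-of-the-envelope simulation. The cost constants below are invented placeholders; only the direction of the two effects matters, not the numbers: a small cache leaves less to flush at output-commit time but forces more flushes during normal execution.

```python
# Sketch: qualitative cache-size trade-off with made-up cost constants.
def simulate(cache_kb, msgs, msg_kb=1, flush_fixed_ms=5.0, per_kb_ms=0.1):
    flushes, cached = 0, 0
    for _ in range(msgs):              # normal execution: fill the cache
        cached += msg_kb
        if cached >= cache_kb:         # trigger (i): cache full
            flushes += 1
            cached = 0
    # overhead paid during normal execution (full-cache flushes)
    normal_overhead = flushes * (flush_fixed_ms + cache_kb * per_kb_ms)
    # output commit must flush whatever residue is still cached
    commit_latency = flush_fixed_ms + cached * per_kb_ms
    return normal_overhead, commit_latency

small = simulate(cache_kb=100, msgs=999)
large = simulate(cache_kb=500, msgs=999)
print(small, large)
# small cache: higher normal-execution overhead, lower commit latency;
# large cache: the opposite, as in Figures 5 and 6.
```

The same inversion appears in the measurements: the 100K cache gives the lowest output latency (Figure 5) but not the lowest normal-execution overhead (Figure 6).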


This means that it is not enough to judge a checkpointing or message-logging scheme by its output latency. It is also important to know the overhead imposed by the scheme in the interval between output interactions. The time between output samples corresponds to the sum of the output latency and the performance overhead during that interval.

In Figure 7 we present the overhead in the time between two adjacent outputs for the ISING(2048) application. We considered that the application performed an external output every 10, 50, 100 and 200 iterations. It was executed for a long period of time and the interval between checkpoints was one hour. The time to execute 200 iterations is around 40 minutes, so there was no normal checkpoint between those output interactions.

[Figure: overhead in seconds versus number of iterations between every output (10, 50, 100, 200), for Coord_Chkp, PML, OML, PML_H, OML_H and PML_ND.]
Figure 7: Overhead in the time to output (ISING 2048).

The overhead in the time to output corresponds to the additional time it takes to perform the output due to the use of the checkpointing or message-logging schemes. Two sources of overhead contribute to this metric: the overhead of the message-logging protocols during normal operation and the duration of the output-commit protocol. In the Coord_Chkp scheme there is no overhead during normal execution; the additional overhead in the interval between outputs is due only to the output latency, that is, the checkpoint duration. In the PML_ND scheme there is also no overhead during normal execution in this particular case of the ISING application. The only overhead is due to one disk write at the time of the output; with this scheme, the overhead in the time to output is only 2.5 milliseconds. In the other logging protocols the overhead in the time to output is mainly affected by the cost of message logging in stable storage, that is, the overhead during normal execution. For the ISING(2048) application, the overhead in the time to output is quite small for the PML_H, OML_H and PML_ND schemes: the maximum overhead was 0.2, 0.4 and 0.0025 seconds, respectively. In the case of the Coord_Chkp scheme the overhead is constant and equal to 36.752 seconds. The most interesting result was that the overhead in the time to output with the PML and OML schemes increases almost linearly with the interval between outputs, in terms of iterations. When the application sends an external output every 200 iterations, the overhead of those two message-logging schemes is higher than the overhead introduced by the checkpointing scheme.

The same experiment was performed with the ISING(1024) application, which communicates more frequently than the previous case. The application was executed for some hours, and the interval between checkpoints was also 1 hour. Since 200 iterations take about 10 minutes, there is no normal checkpoint in the times presented in Figure 8.

[Figure: overhead in seconds versus number of iterations between every output (10, 50, 100, 200), for Coord_Chkp, PML, OML, PML_H, OML_H and PML_ND.]
Figure 8: Overhead in the time to output (ISING 1024).

In this case the results were rather different. The overhead in the time to output with PML_ND was still the lowest: 2.5 milliseconds. The same overhead when using checkpointing for the output-commit operation was 13.786 seconds. If the interval between outputs was 200 iterations (i.e., 10 minutes), the overhead in the time to output of the message-logging protocols was higher than that of the checkpointing scheme, except for the OML_H scheme. Even for 50 iterations the overhead with the PML scheme was higher than taking a checkpoint, although the output latency of that scheme is only 2.5 milliseconds: the major contribution to that overhead comes from the synchronous message logging during those 50 iterations. These two figures show that it is not enough to judge the schemes by their absolute output latency; in some cases, the overhead during normal execution is the most important factor in the interval between outputs. From all these experimental results we can conclude that our output-commit protocol (used by PML_ND and OML_ND) is the scheme that achieves the lowest output latency. This scheme can be used together with a coordinated checkpointing algorithm; however, at each output operation there is no need to perform a global checkpoint, since it is only necessary to save the non-deterministic events.
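The metric just discussed decomposes as: overhead in the time to output = logging overhead accumulated over the interval + output-commit latency. A tiny helper makes the comparison concrete; the per-iteration logging cost below is an invented placeholder, while the two commit latencies (2.5 ms for PML_ND, 36.752 s for Coord_Chkp on ISING 2048) are the measured values reported above.

```python
# Sketch: overhead in the time to output = per-iteration logging cost
# accumulated over the interval + output-commit latency (seconds).
def time_to_output_overhead(iters, per_iter_logging_s, commit_latency_s):
    return iters * per_iter_logging_s + commit_latency_s

# Coord_Chkp: no logging overhead, commit = checkpoint duration (measured).
coord = time_to_output_overhead(200, 0.0, 36.752)
# PML_ND on ISING: no per-iteration cost either; commit is one disk write.
pml_nd = time_to_output_overhead(200, 0.0, 0.0025)
# PML with a hypothetical 0.25 s/iteration synchronous-logging cost:
pml = time_to_output_overhead(200, 0.25, 0.0025)

print(coord, pml_nd, pml)
# With a long enough interval, synchronous logging (pml) overtakes the
# coordinated checkpoint (coord) despite its tiny commit latency, which
# is the crossover visible in Figures 7 and 8.
```

This is why a scheme with a 2.5 ms commit can still lose to checkpointing once its normal-execution overhead dominates the interval between outputs.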


5. Conclusions

Message-logging protocols generally exhibit better output-commit performance than protocols that use checkpointing alone, but this paper has shown that output commit can be performed even more efficiently. We propose a technique that only requires the tracing of non-deterministic events. There is no need to log the message contents, and it is unnecessary to log the arrival order of every message: only the delivery order of some of the messages needs to be recorded in the log of non-deterministic events. This represents a substantial optimization, since the amount of data that needs to be logged is significantly less than the data saved by comparable existing schemes.

Acknowledgments This work was partially supported by the Portuguese Ministério da Ciência e Tecnologia, the European Union through the R&D Unit 326/94 (CISUC) and the project PRAXIS XXI 2/2.1/TIT/1625/95 (PARQUANTUM).

References
[1] E.N. Elnozahy, D.B. Johnson, Y.M. Wang. "A Survey of Rollback-Recovery Protocols in Message Passing Systems", Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, October 1996
[2] L.M. Silva, V.N. Tavora, J.G. Silva. "Mechanisms of File-Checkpointing for UNIX Applications", Proceedings of the 14th IASTED Conf. on Applied Informatics, Innsbruck, Austria, pp. 358-361, February 1996
[3] R.E. Strom, S.A. Yemini. "Optimistic Recovery in Distributed Systems", ACM Transactions on Computer Systems, Vol. 3, No. 3, pp. 204-226, August 1985
[4] Y. Tamir, C.H. Sequin. "Error Recovery in Multicomputers Using Global Checkpoints", Proc. 13th Int. Conf. on Parallel Processing, pp. 32-41, 1984
[5] Y. Huang, Y.M. Wang. "Why Optimistic Message Logging has not been used in Telecommunication Systems", Proc. 25th Int. Fault-Tolerant Computing Symposium, FTCS-25, pp. 459-463, 1995
[6] D.B. Johnson. "Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs", Proc. 12th Symposium on Reliable Distributed Systems, SRDS-12, pp. 86-95, 1993
[7] E.L. Ellenberger. "Transparent Process Rollback Recovery: Some New Techniques and a Portable Implementation", Master Thesis, Department of Computer Science, Texas A&M University, USA, August 1995
[8] E.N. Elnozahy, W. Zwaenepoel. "On the Use and Implementation of Message Logging", Proc. 24th Fault-Tolerant Computing Symposium, FTCS-24, pp. 298-307, June 1994
[9] E.N. Elnozahy, W. Zwaenepoel. "Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit", IEEE Transactions on Computers, Vol. 41 (5), pp. 526-531, May 1992

[10] A.C. Klaiber, H.M. Levy. "Crash Recovery for Scientific Applications", Proc. Int. Conf. on Parallel and Distributed Systems, ICPADS'93, Taiwan, 1993
[11] A. Borg, W. Blau, W. Graetsch, F. Herrmann, W. Oberle. "Fault-Tolerance Under UNIX", ACM Transactions on Computer Systems, Vol. 7, No. 1, pp. 1-24, February 1989
[12] A. Goldberg, A. Gopal, K. Li, A. Lowry, R. Strom. "Transparent Recovery of Mach Applications", Proc. 1st USENIX Mach Workshop, October 1990
[13] E.N. Elnozahy. "Manetho: Fault-Tolerance in Distributed Systems Using Rollback-Recovery and Process Replication", PhD Thesis, Rice University, Tech. Report TR-93-212, October 1993
[14] J.H. Slye, E.N. Elnozahy. "Supporting Nondeterministic Execution in Fault-Tolerant Systems", Proc. 26th Int. Fault-Tolerant Computing Symposium, FTCS-26, pp. 250-259, 1996
[15] A. Goldberg, A. Gopal, A. Lowry, R. Strom. "Restoring Consistent Global States of Distributed Computations", Proc. of the ACM/ONR Workshop on Parallel and Distributed Debugging, pp. 144-154, May 1991
[16] M. Russinovich. "Application-Transparent Fault Management", PhD Thesis, Carnegie Mellon University, August 1994
[17] R. Netzer, B. Miller. "Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs", Proc. Supercomputing'92, Minneapolis, November 1992
[18] "Parix 1.2: Software Documentation", Parsytec Computer GmbH, March 1993
[19] L.M. Silva. "Checkpointing Mechanisms for Scientific Parallel Applications", PhD Thesis, Univ. of Coimbra, Portugal, January 1997, ISBN 972-97189-0-3
[20] L.M. Silva, J.G. Silva. "Global Checkpointing for Distributed Programs", Proc. 11th Symposium on Reliable Distributed Systems, Houston, USA, pp. 155-162, October 1992
