Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback*

G. Deconinck, J. Vounckx, R. Lauwereins**, J.A. Peperstraete
Katholieke Universiteit Leuven, ESAT-ACCA Laboratory
Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium
Tel: +32-16-22 09 31, Fax: +32-16-22 18 55
e-mail: [email protected]

Proc. of IASTED Int. Conf. on Modelling and Simulation, Pittsburgh, PA, May 10-12, 1993, pp. 262-265

Abstract—For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpointing and rollback, is often used. During failure-free operation, the process states are regularly saved; after a fault is detected, the system is rolled back to a previously saved state. Four classes of techniques can be distinguished: semi-automatic techniques, message logging, coordinated checkpointing and hybrid techniques. This paper surveys these classes and discusses the overhead they may involve, allowing the user to choose an optimal checkpointing and rollback technique for given facilities and applications.

Keywords—fault-tolerance, backward error recovery, checkpointing, rollback

* Partly supported by ESPRIT-project 6731 (FTMPS) and by IUAP-50
** Senior Research Assistant of the Belgian National Fund for Scientific Research

1. Introduction
Fault-tolerance is essential for reliable operation, because the probability that a failure occurs is not negligible in large systems. Implementing fault-tolerance in multicomputers (multiprocessors as well as distributed systems) by backward error recovery can be seen as a warm backup [3, 14], but without hardware duplication. It works as follows: periodically the processes' states are saved as checkpoints [22] (e.g. on stable storage). After a failure is detected (and the system is possibly reconfigured), processes are rolled back to previously saved checkpoints, i.e. their state is restored and normal operation is resumed from there. Hence, two different tasks can be distinguished: checkpointing and rollback.
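As a minimal single-process illustration of these two tasks (a sketch with hypothetical names such as CKPT_FILE and save_checkpoint, not a scheme from the literature discussed below), the loop below periodically saves its state and, when restarted after a failure, resumes from the last saved checkpoint. In a multicomputer, the additional difficulty is to keep the checkpoints of communicating processes mutually consistent, which is exactly what the techniques surveyed here address.

```python
import os
import pickle

CKPT_FILE = "proc_state.ckpt"            # stands in for stable storage

def save_checkpoint(state):
    """Write the state atomically; the tentative file becomes the permanent checkpoint."""
    tmp = CKPT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_FILE)

def restore_checkpoint():
    """Rollback: reload the last permanent checkpoint, or start from the initial state."""
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0}

state = restore_checkpoint()             # after a failure, execution resumes here
while state["step"] < 1000:
    state["result"] += state["step"]     # the actual computation
    state["step"] += 1
    if state["step"] % 100 == 0:         # periodic checkpointing
        save_checkpoint(state)
```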

If a process rolls back after a failure, this can cause other processes to roll back to keep consistency [16], which can in turn force the first one to roll back even further, and so on. This is called the domino-effect [16, 22]: processes may even roll back to their initial state. Several classes of checkpointing and rollback techniques avoid this domino-effect [11]. They differ in the way the checkpoints are taken, such that the system is rolled back to a consistent state, i.e. a valid recovery-line [5] is restored. Four classes are important:
• Semi-automatic techniques: the application programmer explicitly invokes the checkpointing routine and specifies consistent recovery-lines [10].
• Message logging: processes save their state independently and inter-process messages are logged [7, 12, 15, 21, 26, 27].
• Coordinated checkpointing: all [4, 30] or an interacting set [5, 16, 17, 28] of processes save their states in a coordinated way.
• Hybrid techniques: a combination of the two previous techniques that merges their advantages [18, 24, 31, 32].
The latter three classes are user-transparent. Overhead is associated with each class of techniques, both during failure-free operation and after a fault is detected. This overhead, together with other characteristic features of a particular technique, determines what scheme is best suited for an application.
The rest of the paper is organised as follows. The overhead (in time, storage and communication) and the other requirements and aspects used to compare the techniques are presented in section 2. The semi-automatic technique is discussed in section 3, while the classes of user-transparent checkpointing techniques are described in section 4. A comparison is made in section 5, after which some general conclusions are presented in section 6.

2. Aspects for Comparison
In this section, the topics used to compare the different checkpointing and rollback techniques are considered. The overhead, capabilities and constraints of the algorithms must be examined.

2.1. Overhead

The overhead can be expressed in terms of time, storage or communication. Overhead during normal operation should be traded off against overhead during rollback-recovery.

2.1.1. Time Overhead

• Computation time is lost during checkpointing, where it would normally be used for computation.
• Computation time is lost during rollback (recovery-time): the recovery-line must be computed and restored, and all computations performed since these checkpoints were taken will be repeated (a rough numeric illustration of this trade-off is sketched after this list).
• Kernel overhead: part of the CPU time is used for the management of checkpoint- and rollback-related tasks.
• Synchronisation overhead: other processes sometimes have to suspend their operation until all processes have finished their checkpointing or rollback [4, 5, 16, 17, 28, 30].
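As a minimal illustration of the trade-off between checkpointing cost and recomputation (a deliberately simple model, assuming a checkpoint interval T, a checkpoint cost C, a restart cost R, a fault rate lam with at most one fault per interval, and on average half an interval of lost work; these symbols are illustrative and not tied to any particular scheme discussed here), the expected time overhead per interval is roughly C + lam*T*(R + T/2):

```python
# Illustrative back-of-the-envelope model of the checkpointing/rollback trade-off.
def expected_overhead(T, C, R, lam):
    failure_free = C                     # time spent taking the checkpoint
    rollback = lam * T * (R + T / 2.0)   # fault probability times recovery cost
    return failure_free + rollback

# A longer interval saves checkpointing time but loses more work per rollback:
for T in (10.0, 60.0, 300.0):
    print(T, expected_overhead(T, C=1.0, R=5.0, lam=1e-3))
```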

2.1.2. Storage Overhead

• Local memory and disc space are used to store the checkpoints: at least one complete (permanent) recovery-line should be stored [16], together with the tentative checkpoint(s). If local memory is used for this, its usable part is significantly reduced [4].
• Local memory and discs are used for the management of the checkpointing and rollback schemes. This includes data about the stored checkpoints, the running processes and their interactions, message logs, databases and other bookkeeping information.
• Stable storage: it must be assured that the checkpoints are safely stored and can be retrieved after the failure [21].

2.1.3. Communication Overhead

• Load on the I/O-network caused by exchanging checkpoints: the disc bandwidth available for the application is reduced. In the worst case an entire system state is checkpointed or restored at once [4, 30].
• Load on the data-network: control-messages are sent, and data-messages may be enlarged to include information for the schemes (send/receive sequence numbers [15, 26, 27], incarnation numbers [12, 27, 32], crash counters [24], etc.) or to support special communication protocols (e.g. two-phase commit-protocols).

2.2. Requirements and Other Aspects

A particular algorithm may require specific hardware features or communication facilities. On the other hand, the application program itself can have requirements that exclude or favour particular algorithms. This section concludes with some other important aspects.

2.2.1. Requirements on Hardware and Communication

• Special CPU facilities may be needed, e.g. a memory management unit to determine the checkpoint's contents, or the ability to re-download applications and restore their state.
• Locked or loose [31] synchronisation of the processors (possibly in hardware [1]) may be required.
• A special network topology [18, 21] (e.g. hypercube, mesh or common bus architecture) can be required. High connectivity is necessary if partitioning of the system is not allowed. Disjoint paths may be necessary to retransmit messages [9].
• Stable storage [20] can be needed to preserve information across the failure [21].
• Redundant hardware (spare processing capacity) is necessary to reconfigure the system after a permanent failure.
• Fail-stop processors [23] can be assumed, or a non-zero error-detection latency may be tolerated [24].
• Fault-tolerance may be restricted to permanent or transient faults; the latter seem to occur one or two orders of magnitude more frequently [25]. Mostly single faults at a time are assumed. Other (more robust) schemes tolerate more faults (n-fault-tolerance), even during recovery [12, 17, 26]. After rollback, the system is assumed to be fault-free again.
• Reliable channels may be required [16, 26], or messages may be delivered out-of-order, get lost, be duplicated, etc.
• (Atomic) broadcasting features can be necessary [5, 10, 15, 18, 21, 24, 27, 30, 31].
• Special communication protocols may be required [5, 7, 31].
• Time-out mechanisms may be forbidden, because congestion problems (multiple retransmissions of messages, triggering themselves) can appear [29].

2.2.2. Requirements on the Schemes

• Program paradigm: (piecewise) non-determinism must be supported [4, 5, 12, 16, 17, 28, 30].
• Checkpointing overhead and, a priori, the rollback-time should be within the fault-tolerance time of a process [8].
• Inter-process communication influences the overhead if this is related to the number of messages [15, 26, 27].
• If I/O fault-tolerance is needed, the input must be "logged" on stable storage for replay [27], and output must be delayed and only released when it will never be undone. Special care should be taken to preserve consistency if servers are shared between several applications: checkpointing these applications together should be avoided to maintain locality [2, 7]. If no I/O fault-tolerance is provided, closed systems are required [28, 30], i.e. input is stored on disc before operation begins and output is released after all computations when the application ends.
• The fault-tolerance as a whole may be optional, or the user may wish to fine-tune fault-tolerance characteristics [5, 7, 10, 24].

2.2.3. Other Aspects

Some other aspects distinguish the schemes and are important for making an appropriate choice.
• Complexity of the checkpointing and rollback protocols (possibly introducing a higher fault-probability) should be considered.
• Scalability to more faults (n-fault-tolerance), to larger architectures, to other architectures (e.g. common-bus architectures, shared memory systems), etc. is important.
• Locality is needed to avoid that one process' checkpointing or rollback forces others to do the same [2].
• A distributed kernel avoids the need for master-processes (which have a consistent global overview of the system). Interference between parts of this kernel must be avoided [16]; otherwise a unique (possibly predetermined) coordinator may be necessary [30].
• The degree of parallelism will be decreased, e.g. if inter-process communication causes recovery-related overhead.
• If user-involvement is necessary (semi-automatic checkpointing), the application-programmer has to be aware of it.
• Implementation aspects must not be neglected.

3. Semi-automatic Techniques
In this section, the first class of backward error recovery techniques is considered: the techniques where the application-programmer is directly involved in the backward error recovery. In this approach the programmer calls the checkpointing routine, specifying to which recovery-line the checkpoint belongs and possibly its contents [10]. Hence, it is the programmer's responsibility to create consistent recovery-lines. In parallel processes that are mutually very similar or that have substantial geometrical symmetry (as in many number-crunching problems), it is rather easy to find consistent recovery-lines. When the threads are completely different, finding a recovery-line can be quite difficult or even impossible. The user only has to determine the position where the checkpoint routine is to be invoked; the recovery-software takes care of the rest, i.e. storage of the checkpoint, determining which recovery-lines are completely saved, the necessary bookkeeping and rollback-recovery are handled user-transparently.
The advantages of this kind of semi-automatic checkpointing are:
• Minimal failure-free overhead: the process suspends its operation only while the checkpoint is taken (on demand of the programmer).
• Fault-tolerance characteristics can be fine-tuned to the user's specific needs: the user specifies how often the checkpoints are taken.
• Simple recovery software, as the programmer specified consistent recovery-lines: the last completely saved valid recovery-line is restored and operation is resumed.
• Smaller checkpoints if the programmer specifies the memory-ranges and system parameters to be included in the checkpoint.
• Sub recovery-lines can be specified, i.e. a subset of the processes advances its recovery-line and remains consistent with the other processes. This reduces the rollback-time when a process of this subset fails.
• If the user-transparent recovery fails (e.g. after a software error), the user may try forward error recovery based on some exception-handling, as he knows the contents of the checkpoint and the structure of the application.
• Loose synchronisation: checkpointing of different processes is allowed to drift apart by about one checkpoint interval [31].
The drawback is the complexity for the programmer, especially in non-symmetrical applications, of specifying consistent recovery-lines.
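A hypothetical sketch of this programmer-invoked style (the names checkpoint, stable_store and exchange_boundaries are illustrative, not an existing API from [10]): because every process invokes the routine at the same iteration of a symmetric computation, the programmer knows that the resulting recovery-line is consistent, while storage, bookkeeping and rollback remain the task of the recovery software.

```python
# Hypothetical semi-automatic checkpointing API; all names are illustrative.
stable_store = {}   # stands in for stable storage managed by the recovery layer
my_rank = 0         # this process' identifier

def checkpoint(recovery_line, contents):
    # Placeholder: record this process' state under the given recovery-line.
    stable_store.setdefault(recovery_line, {})[my_rank] = dict(contents)

def exchange_boundaries(grid):
    # Stub for the inter-process communication of a real data-parallel code.
    return grid

grid = [0.0] * 1024
for iteration in range(1, 101):
    grid = [x + 1.0 for x in grid]          # symmetric "number-crunching" step
    grid = exchange_boundaries(grid)
    if iteration % 10 == 0:
        # Every process reaches this point in the same iteration, so the
        # programmer knows that recovery-line `iteration` is consistent.
        checkpoint(recovery_line=iteration,
                   contents={"iteration": iteration, "grid": grid})
```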

4. User-transparent Techniques

4.1. Message Logging

In Message logging techniques, the different processes are checkpointed independently of each other (one process at a time). All inter-process messages are recorded in a message-log. After a failure is detected, the previous checkpoint is restored and the logged messages are replayed (in the same order) to bring the failed process back to a consistent system state. Two approaches are possible. In "pessimistic" schemes [7, 21] the processes are suspended after every communication until the message is logged (synchronisation). "Optimistic" schemes [12, 15, 26, 27] continue operation during the (asynchronous) logging of messages, but need extra bookkeeping to know which computation depends on which message and which messages have been logged. Because processes can fail before messages are logged, it must be possible to undo the computations that depend on these messages, i.e. dependencies must be tracked [27].
The asynchronous logging of the "optimistic" approach [12, 15, 26, 27] is clearly advantageous compared to the synchronous logging of the earlier "pessimistic" schemes [7, 21], as computation, communication, checkpointing and committing checkpoints do not block each other. A constraint of all Message logging techniques is that applications must be (at least piecewise) deterministic. However, the independent checkpointing requires only one process to take a checkpoint at a time, hence a small communication bandwidth overhead. On the other hand, there is a continuous burden on normal operation: time overhead is associated with the recording of dependencies, storage overhead with the logging of messages, and communication overhead with broadcasting this information, especially if there is much inter-process interaction. In the most recent schemes [12] the overhead during failure-free operation is minimised at the cost of a more complex recovery scheme. All Message logging techniques (except [26]) require only the last checkpoint of each process to be stored, together with a message- and an interaction-log. More involved communication protocols and complex recovery schemes are the price for these advantages.
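As a minimal sketch of the pessimistic, receiver-based flavour (an illustration only, with hypothetical names such as LoggedProcess; the actual schemes [7, 21] differ in detail), each message is logged before it is processed, so that after a failure the last checkpoint plus a deterministic replay of the log reproduces the lost state:

```python
import copy

class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.checkpoint = copy.deepcopy(self.state)
        self.message_log = []          # assumed to survive the failure (stable storage)

    def deliver(self, msg):
        self.message_log.append(msg)   # pessimistic: log first (blocking) ...
        self._process(msg)             # ... then hand the message to the application

    def _process(self, msg):
        self.state += msg              # deterministic computation

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.message_log.clear()       # older messages are no longer needed

    def recover(self):
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.message_log:   # replay in the original order
            self._process(msg)

p = LoggedProcess()
for m in (3, 5, 7):
    p.deliver(m)
saved = p.state
p.state = None                         # simulate a crash losing the volatile state
p.recover()
assert p.state == saved
```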

4.2. Coordinated Checkpointing

The second class of user-transparent checkpointing techniques, called Coordinated checkpointing, avoids the domino-effect by checkpointing all processes together, at system level or at interacting-process level. Hence, we can distinguish two approaches.
In the first approach, Global checkpointing, the whole system is frozen in order to take a snapshot of the entire system state [4, 10, 30]. All application-processes are suspended until the complete checkpoint is taken, to preserve consistency. The rollback is simple: these checkpoints, which form a consistent recovery-line, are restored and operation is resumed from there. Inter-process communication in transit during checkpointing should be handled appropriately to avoid losing consistency [30].
The second approach, Process-level checkpointing [5, 16, 17, 28], is an extension of the Global checkpointing technique. The key idea is that only interacting processes (i.e. processes that have been communicating since the last checkpoint, directly or transitively [16]) are checkpointed together. A new checkpoint of a process in one interacting set is consistent with the checkpoint of a process in some other interacting set, because there was no communication between processes of different sets since their last checkpoint. Hence, a communication tree should be maintained.
The major benefits of the Coordinated checkpointing techniques are the straightforward approach of snapshotting a complete recovery-line (only messages in transit have to be handled) and the lack of the determinism-restriction. Furthermore, the small failure-free overhead (no logging of messages necessary) is important, hence a small performance decrease during normal operation. The cost (time and communication overhead to save complete recovery-lines at once) is the blocking of all concerned processes. In the Global checkpointing techniques this latter effect is worst: quite some time is lost, because the whole system is blocked while a checkpoint is taken. The benefits are that there is no failure-free overhead between the checkpointing sessions and that the algorithms are relatively simple: a complete consistent state is snapshotted and restored at once. In Process-level checkpointing, the straightforward snapshotting approach is kept and the load on the communication network is better balanced, because only the interacting processes take their checkpoints at the same time. If there is little locality between the processes, this system comes close to global checkpointing; otherwise, there is a significant reduction of the blocking time. The overhead of the bookkeeping for inter-process communication and of constructing the communication tree at checkpointing and rollback time is the price for this.
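A minimal sketch of the blocking, global variant, structured as a two-phase protocol (the names Process and coordinated_checkpoint are illustrative; real protocols [4, 16, 30] additionally handle messages in transit and reconfiguration): all processes take a tentative checkpoint while suspended, and only when every tentative checkpoint is safely stored does the set become the new permanent recovery-line.

```python
class Process:
    def __init__(self, pid):
        self.pid, self.state = pid, 0
        self.tentative, self.permanent = None, None

    def take_tentative(self):
        self.tentative = self.state          # process is suspended while this happens
        return True                          # acknowledge to the coordinator

    def commit(self):
        self.permanent = self.tentative      # tentative checkpoint becomes permanent

    def rollback(self):
        self.state = self.permanent          # restore the last permanent checkpoint

def coordinated_checkpoint(processes):
    # Phase 1: every process takes a tentative checkpoint while blocked.
    if all(p.take_tentative() for p in processes):
        # Phase 2: the recovery-line is complete, so make it permanent.
        for p in processes:
            p.commit()

procs = [Process(i) for i in range(4)]
for p in procs:
    p.state = 10 * p.pid
coordinated_checkpoint(procs)                # consistent recovery-line
for p in procs:
    p.state += 1                             # work performed after the checkpoint
for p in procs:
    p.rollback()                             # a fault rolls everyone back together
assert all(p.state == 10 * p.pid for p in procs)
```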

4.3. Hybrid Techniques

The last class of user-transparent techniques are the hybrid techniques [18, 24, 31, 32]. These schemes are based on Coordinated checkpointing, but avoid freezing the application (hence, a non-blocking checkpoint) by logging some messages (e.g. those crossing the recovery-line). These hybrid techniques thus merge the advantage of the Message logging techniques, taking checkpoints of single processes one at a time (without blocking the system), with the benefit of the Coordinated checkpointing techniques, saving a complete recovery-line as a whole. To maintain consistency, some messages have to be logged. More overhead (stricter communication protocols, more complex algorithms, etc.) is the cost for a more flexible failure-free operation.
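A toy illustration of messages crossing the recovery-line (a simplification with hypothetical names, not one of the specific schemes in [18, 24, 31, 32]): each message carries the sender's checkpoint epoch; a message sent before the sender's n-th checkpoint but received after the receiver's n-th checkpoint crosses the recovery-line and must be logged so that it can be replayed after a rollback.

```python
def receive(msg, sender_epoch, receiver_epoch, crossing_log):
    # Sent before the sender's checkpoint n, received after the receiver's
    # checkpoint n: the message crosses the recovery-line and is logged.
    if sender_epoch < receiver_epoch:
        crossing_log.append(msg)       # must be replayed after a rollback
    return msg

log = []
receive("a", sender_epoch=3, receiver_epoch=3, crossing_log=log)  # same epoch, not logged
receive("b", sender_epoch=3, receiver_epoch=4, crossing_log=log)  # crosses the line, logged
print(log)                                                        # ['b']
```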

5. Comparison of the Classes
In the previous sections the four classes of checkpointing and rollback techniques that avoid the domino-effect were discussed with their benefits and drawbacks. Semi-automatic techniques let the programmer specify the recovery-line. The Message logging techniques take checkpoints of single processes and log messages to provide consistency. The Coordinated checkpointing techniques take snapshots of consistent states as a whole (at system or at interacting-process level). The hybrid techniques use aspects of the latter two techniques.
Besides the described overhead and requirements of the algorithms, the kind of application that will run and the available facilities are important in determining which class is best suited.
• Applications: communication-intensity, program paradigm, size, real-time constraints, general purpose or specific, etc.
• Facilities: load, provided communication protocols, distributed system or multiprocessor, etc.
Furthermore, the desired fault-tolerance properties are important: both the overhead during normal failure-free operation (checkpointing) and after a fault is detected (rollback) should be taken into account. Hence fault-probability and fault-characteristics influence the performance. If a low fault-probability is expected, the emphasis should be on minimising the failure-free overhead to maximise the throughput of the system; if the fault-probability is higher, the global overhead is more important. The characteristics of the faults also influence the performance (e.g. transient faults require only rollback, whereas permanent faults may also require reconfiguration of the system). Some general optimisations (such as forcing a minimum number of nodes to roll back [24, 31] or minimising the number of control messages using the interconnection topology [18]) or other properties (such as a damage-assessment phase [24]) can be implemented in several classes of techniques.
The semi-automatic checkpointing techniques have minimal overhead, but require programmer-involvement. For the two original classes of user-transparent techniques [13] we can make the following remark. Message logging techniques are less interesting for communication-intensive applications, due to the higher time, communication and storage overhead associated with the logging of the messages, but are better suited in the opposite case, because freezing is avoided as only one process takes a checkpoint at a time, requiring a smaller communication bandwidth for checkpointing. Coordinated checkpointing techniques are suited for both communication-intensive and communication-sparse applications as they have nearly no overhead during failure-free operation, but they freeze or block the applications periodically until the whole system or the interacting processes are checkpointed. The rollback-recovery after Coordinated checkpointing is more straightforward because no messages have to be replayed (no deterministic algorithms required), but more processes are normally involved in the rollback. In Message logging techniques, the rollback is more complex because the dependencies must be tracked (but a more complex rollback is tolerable, as faults are the exception rather than the rule). A comparison [6] between specific algorithms of both techniques shows similar behaviour. The hybrid techniques merge advantages of these two groups and optimise the failure-free overhead. These major characteristics are summarised in Table 1.

OVERVIEW                                      | Fault-free overhead                       | Overhead after a fault is detected
Semi-automatic checkpointing                  | programmer invokes CP of single process   | RB to valid recovery-line
Message logging: pessimistic approach         | CP of single process, log synchronously   | RB single process, replay messages
Message logging: optimistic approach          | CP of single process, log asynchronously  | RB some processes, replay messages
Coord. checkp.: Global checkpointing          | CP of all processes                       | RB all processes
Coord. checkp.: Process-level checkpointing   | CP of interacting set                     | RB interacting set
Hybrid techniques                             | CP of single process, log some messages   | RB all processes, replay messages

Table 1: Comparison between checkpointing techniques (CP = checkpoint, RB = rollback)

6. Conclusions
In this paper, we described the main aspects needed to compare backward error recovery techniques based on checkpointing and rollback: the overhead in time, storage and communication, and other important aspects such as the requirements of the schemes or of the applications. Four classes of techniques that avoid the domino-effect were discussed with their major benefits and drawbacks: semi-automatic, Message logging, Coordinated checkpointing and hybrid techniques. Semi-automatic checkpointing seems to be the simplest approach, but its major drawback is the required user-involvement. Message logging has the advantage of non-blocking (because independent) checkpointing, but a constant overhead during normal computation due to the logging of the messages. Coordinated checkpointing techniques have a smaller overhead during normal computation, but have to tolerate the computation time lost during the checkpointing. The hybrid techniques merge advantages of both other user-transparent classes. The kind of application, the available facilities and the desired fault-tolerance properties further determine which technique is best suited.

References
[1] D. Agrawal, J.R. Agre, "Recovering from Multiple Process Failures in the Time Warp Mechanism", IEEE Trans. on Computers, 41(12), Dec. 1992, pp.1504-1514
[2] M. Ahamad, L. Lin, "Using Checkpoints to Localise the Effects of Faults in Distributed Systems", Proc. 8th Symp. on Reliable Distributed Systems, 1989, pp.2-11
[3] T. Anderson, A. Lee, "Fault-tolerance - Principles and Practice", Prentice Hall, Englewood Cliffs, 1981
[4] A. Bauch, B. Bieker, E. Maehle, "Backward Error Recovery in the Dynamical Reconfigurable Multiprocessor System DAMP", Workshop on Fault-Tolerant Parallel and Distributed Systems, Amherst, MA, Jul. 1992, pp.36-43
[5] G. Barigazzi, L. Strigini, "Application-Transparent Setting of Recovery Points", Digest 13th Fault-Tolerant Computing Symp., Milano, Italy, Jun. 1983, pp.48-55
[6] B. Bhargava, S.R. Lian, P.J. Leu, "Experimental Evaluation of Concurrent Checkpointing and Rollback Recovery Algorithms", Proc. of 6th Int. Conf. on Data Engineering, 1990, pp.182-189
[7] A. Borg, J. Baumbach, S. Glazer, "A Message System Supporting Fault-tolerance", Proc. 9th Symp. on Operating Systems Principles, Oct. 1983, pp.90-99
[8] R. Cuyvers, R. Lauwereins, J.A. Peperstraete, "Fault-tolerance in Process Control: Possibilities, Limitations and Trends", Journal A, 31(4), Dec. 1990, pp.33-40
[9] R. Cuyvers, R. Lauwereins, "Hardware Fault-Tolerance: Possibilities and Limitations offered by Transputers", chapter 8 from "Transputers in Real-time Control", G.W. Irwin, P.J. Fleming, Eds., Research Studies Press Ltd., Taunton, England, 1992, pp.202-238
[10] M. Dal Cin, A. Grygier et al., "Fault-tolerance in Distributed Shared Memory Multiprocessors", to appear in: Springer Lecture Notes on Computer Science, 1993
[11] G. Deconinck, "Survey of Checkpointing and Rollback Techniques", Technical Report O3.1.8 of ESPRIT Project 6731 (FTMPS), May 1993
[12] E.N. Elnozahy, W. Zwaenepoel, "Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit", IEEE Trans. on Computers, 41(5), May 1992, pp.526-531
[13] T.M. Frazier, Y. Tamir, "Application-Transparent Error-Recovery Techniques for Multicomputers", Proc. of the 4th Conf. on Hypercubes, Concurrent Computers and Applications, Monterey, CA, Mar. 1989, pp.103-108
[14] B.W. Johnson, "Design and Analysis of Fault-Tolerant Digital Systems", Addison-Wesley Publishing Company Inc., 1989
[15] D.B. Johnson, W. Zwaenepoel, "Sender-Based Message Logging", Digest FTCS-17, Pittsburgh, PA, Jul. 1987, pp.14-19
[16] R. Koo, S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE Trans. on Software Engineering, 12(1), Jan. 1987
[17] P.Y. Leu, B. Bhargava, "Concurrent Robust Checkpointing and Recovery in Distributed Systems", Proc. 4th IEEE Int. Conf. on Data Engineering, 1988, pp.154-163
[18] K. Li, J.F. Naughton, J.S. Plank, "Checkpointing Multicomputer Applications", Proc. 10th Symp. on Reliable Distributed Systems, 1991, pp.2-11
[19] E. Nett, R. Kröger, J. Kaiser, "Implementing a General Error Recovery Mechanism in a Distributed Operating System", Proc. of FTCS-16, Vienna, Austria, Jul. 1986, pp.124-129
[20] G. Peattie, "Quality Control for ICs", IEEE Spectrum, 18(10), Oct. 1981, pp.93-97
[21] M.L. Powell, D.L. Presotto, "PUBLISHING: A Reliable Broadcast Communication Mechanism", Proc. of 9th ACM Symp. on Operating Systems Principles, Oct. 1983, pp.100-109
[22] B. Randell, "System Structure for Software Fault-tolerance", IEEE Trans. on Software Engineering, 1(2), Jun. 1975, pp.220-232
[23] F.B. Schneider, "Fail-Stop Processors", Digest of Papers from Spring Compcon of the IEEE Computer Society, Mar. 1983, pp.66-70
[24] L.M. Silva, J.G. Silva, "Global Checkpointing for Distributed Programs", Proc. of 11th Symp. on Reliable Distributed Systems, Houston, Texas, Oct. 1992, pp.155-162
[25] D.P. Siewiorek, R.S. Swarz, "The Theory and Practice of Reliable System Design", Digital Press, Bedford, MA, 1982
[26] R.E. Strom, D.F. Bacon, S.A. Yemini, "Volatile Logging in n-Fault-Tolerant Distributed Systems", Digest of FTCS-18, Tokyo, Japan, Jun. 1988, pp.44-49
[27] R.E. Strom, S.A. Yemini, "Optimistic Recovery: an Asynchronous Approach to Fault-tolerance in Distributed Systems", Proc. of 14th Int. Fault-tolerant Computing Symposium, Florida, 1984, pp.374-379
[28] Y. Tamir, T.M. Frazier, "Application-Transparent Process-level Error Recovery for Multicomputers", Hawaii Int. Conf. on System Sciences, Kailua-Kona, Hawaii, Jan. 1989
[29] A.S. Tanenbaum, "Computer Networks", Prentice Hall, Englewood Cliffs, NJ, 1981
[30] Y. Tamir, C.H. Séquin, "Error Recovery in Multicomputers Using Global Checkpoints", 13th Int. Congress on Parallel Processing, Bellaire, Michigan, Aug. 1984, pp.32-41
[31] Z. Tong, R.Y. Kain, W.T. Tsai, "Rollback Recovery in Distributed Systems Using Loosely Synchronised Clocks", IEEE Trans. on Parallel and Distributed Systems, 3(2), Mar. 1992, pp.246-251
[32] Z. Wójcik, B.E. Wójcik, "Fault-tolerant Distributed Computing Using Atomic Send and Receive Checkpoints", Proc. 2nd IEEE Symp. on Parallel and Distributed Processing, 1990, pp.215-222
