Why Optimistic Message Logging Has Not Been Used In Telecommunications Systems

Yennun Huang and Yi-Min Wang
AT&T Bell Laboratories
Murray Hill, NJ 07974
Abstract

Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach [1] that places more emphasis on failure-free overhead than recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.
1 A Brief Literature Survey

Much of the existing work on message logging and checkpointing assumes a piecewise deterministic (PWD) execution model [2]. Under the PWD assumption, each process execution is viewed as a number of state intervals bounded by nondeterministic message-receiving events (or, more generally, nondeterministic events [3]). Execution within each state interval is completely deterministic, and hence replayable. This allows the use of message logging as a form of checkpointing because the state of a process can be deterministically reconstructed by replaying the logged messages in their original order.

There are primarily three factors that need to be considered for trade-offs when designing a message logging protocol: failure-free overhead, the number of rolled-back surviving processes, and recovery time. A protocol is called optimistic if it optimistically assumes that failures are rare events, so optimizing failure-free performance is more important than
achieving good recovery performance. In contrast, a pessimistic protocol always pessimistically prepares for failures, so it is willing to pay a higher failure-free overhead in order to recover faster should a failure occur.

A pessimistic logging protocol always logs a message before processing (or delivering) it, so that every message is retrievable and every process state is recreatable through message replaying under the PWD assumption. A significant advantage is that any failed process can then recover locally by itself, without requiring the cooperative rollbacks of its correspondents. Since a failure between message receipt and message logging may result in a lost message unknown to both the sender and the receiver, the issue of atomic message receipt-logging needs to be addressed. Traditional pessimistic logging protocols guarantee the atomicity either by implementing atomic three-way message delivery (to the receiver, the receiver's backup, and the sender's backup) [4, 5] or atomic two-way transmission (to the receiver and a centralized recorder) [6].

The optimistic approach [1] was proposed based on the assumption that synchronously logging every message upon its receipt can result in an unacceptably high failure-free overhead. Since failures rarely occur, it is argued that minimizing failure-free overhead by sacrificing recovery efficiency can achieve the best overall system performance. By grouping a number of received messages and logging them onto stable storage in a single write operation, optimistic logging techniques [1, 7-9] can greatly reduce message logging overhead. However, since messages can be lost if a failure occurs before they are logged, some states of a failed process may not be recreatable. Therefore, consistent recovery in general cannot be achieved by a local recovery of the failed process. It may require the rollbacks of those surviving processes whose states depend on the non-recreatable states, and the rollbacks of the senders of lost messages in order to regenerate
those messages [8]. The recovery process usually takes longer compared to the pessimistic approach.

In contrast with the above receiver-based logging protocols, several sender-based logging protocols [2, 10, 11] have also been proposed. Since sender-based techniques require only volatile logging of message contents (as opposed to stable logging), the message logging overhead can be further reduced. However, since a failed process has to either request the logged messages from the sender logs or wait for the failed senders to regenerate and retransmit those messages, the recovery time is in general even higher.

A new class of causal logging protocols [3, 11-13] has recently emerged in the literature. For each message, the message content is volatile-logged at the sender and the processing order is volatile-logged at the receiver's receivers. Causal logging in general requires piggybacking more information on each message, but can approximately achieve localized recovery of a failed process, except that the surviving processes may still need to be involved in the recovery to supply the message contents in the sender logs. Fast recovery is not guaranteed because a failed receiver may still have to wait for its failed senders to regenerate and retransmit pre-failure messages. (One possible way to achieve fast and localized recovery without the overhead of synchronous logging is to combine asynchronous receiver-based logging with causal logging as a mechanism to avoid orphans [14].)

In summary, most of the existing message logging protocols have focused on reducing failure-free overhead. The new causal logging protocols have been designed to also optimize the recovery performance by not requiring the rollbacks of any surviving processes. Unfortunately, not much attention has been given to the third factor, the recovery time, which turns out to be a very important factor for telecommunications systems.
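To make the trade-off concrete, the following C sketch (our illustration, not code from any of the cited systems) contrasts the two logging disciplines: the pessimistic routine forces each message to stable storage before it is delivered, while the optimistic routine batches messages and writes the whole group in one operation. The buffer size and function names are assumptions.

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Pessimistic: one synchronous write (plus fsync) per message, so
       the message is on stable storage before it is delivered. */
    int log_pessimistic(int log_fd, const char *msg, size_t len)
    {
        if (write(log_fd, msg, len) != (ssize_t)len)
            return -1;
        return fsync(log_fd);
    }

    /* Optimistic: buffer messages and flush the whole group in a single
       write.  Messages still sitting in buf when a failure strikes are
       lost, which is why surviving processes may be forced to roll back. */
    static char   buf[64 * 1024];
    static size_t used;

    int log_optimistic(int log_fd, const char *msg, size_t len)
    {
        if (len > sizeof buf)
            return -1;                        /* oversized message */
        if (used + len > sizeof buf) {        /* group full: flush it */
            if (write(log_fd, buf, used) != (ssize_t)used)
                return -1;
            used = 0;
        }
        memcpy(buf + used, msg, len);         /* just buffer; no disk I/O */
        used += len;
        return 0;
    }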
2 Message Logging and Checkpointing in Telecommunications Systems

Typical telecommunications systems are continuously-running server applications. They are quite different from the long-running scientific applications that are usually considered in the literature [11, 15, 16]. In a long-running scientific application, it is generally assumed that the entire state of a process at any point is dependent on any previous state of the same process. Checkpoints
are typically taken in a transparent fashion: a process is interrupted when a timer expires and a snapshot of the entire state is saved. (Some limited application-independent optimizations may be performed [11, 15, 16] to reduce the checkpoint size.) The checkpoint interval is on the order of tens of minutes to hours. Recovery is performed by restoring the checkpointed state, and the execution resumes from the point at which the checkpoint was taken [17].

In contrast, a typical continuously-running server application consists of an initialization step for both data and communication, followed by an infinite loop which receives a service request from a client, performs the requested processing, sends the results back to the client (if required), and gets ready for the next request. At the loop boundary, much of the significant process state has been saved on stable storage and only a small amount of critical data [18] remains in volatile memory. Checkpoints are typically taken in a non-transparent fashion: an application explicitly invokes the checkpoint function at the loop boundary, and only critical data is saved. The checkpoint interval is on the order of seconds to a few minutes. Recovery is performed by re-executing the initialization step in the new environment and then restoring the checkpointed critical data before entering the loop.
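The loop structure just described can be sketched in a few lines of C. This is only an illustration of the structure, with hypothetical stand-ins for the application work and for the checkpoint and restore calls; libft's actual interface is described in [18].

    #include <stdio.h>

    struct critical_data { int next_seq; };    /* the small surviving state */

    /* Hypothetical stand-ins for the application and the checkpoint library. */
    static void initialize(void) { /* set up data and communication */ }
    static int  receive_request(void) { return getchar(); }
    static void process_and_reply(int req, struct critical_data *cd)
    { printf("served request %d (seq %d)\n", req, cd->next_seq++); }
    static void take_checkpoint(const struct critical_data *cd)
    {
        FILE *f = fopen("ckpt", "wb");
        if (f) { fwrite(cd, sizeof *cd, 1, f); fclose(f); }
    }
    static void restore_checkpoint(struct critical_data *cd)
    {
        FILE *f = fopen("ckpt", "rb");
        if (f) {
            if (fread(cd, sizeof *cd, 1, f) != 1)
                cd->next_seq = 0;              /* no usable checkpoint */
            fclose(f);
        }
    }

    int main(void)
    {
        struct critical_data cd = { 0 };

        initialize();              /* re-executed in the new environment on recovery */
        restore_checkpoint(&cd);   /* then only the critical data is restored */

        int req;
        while ((req = receive_request()) != EOF) {  /* the "infinite" service loop */
            process_and_reply(req, &cd);
            take_checkpoint(&cd);  /* loop boundary: no messages in flight */
        }
        return 0;
    }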
2.1 Why use message logging

Telecommunications systems use message logging to reduce service down time or, equivalently, to achieve high availability. (In this paper, we focus on message logging for interprocess communications inside a system. Messages that are generated from outside the system must be synchronously logged for recoverability, not just for fast recovery.) Specifically, by replaying the messages directly from a local log file instead of waiting for the senders to regenerate and resend the messages, a failed process can often recover much faster. To illustrate that point, we next describe our experience with three applications which have used libft [18] to do checkpointing and pessimistic message logging.
System G

System G is a tool used by maintenance people to test telephone circuits for large customers. In each test session, an operator selects several phone lines through a user interface module UI. Module UI communicates with a server module CM, which processes the requests by communicating with another server module DM. Module DM then connects to several switches through
the phone lines and conducts the tests. The test results are sent back to module UI to be presented to the operator.

Before using message logging and checkpointing, a failure in CM would cause all connections and activities to be lost. Module UI would hang because it could not receive any response. In that case, the operator had to exit the interface program, re-login, and redo the entire session. The time lost due to a failure ranged from 2 minutes to 20 minutes, which often upset the operators.

To achieve fast and localized recovery in case of a server failure, the CM server synchronously logs the requests from the interface module in a file and takes a checkpoint at the end of each session. Checkpointed data include the mapping table between the interface module id and message id, communication port numbers, and the status table of the phone lines. When a failure occurs in CM during a session, the server is quickly restarted, checkpointed data are restored, and all the logged requests are replayed. Message replaying allows the server to reconstruct its pre-failure state, continue the interrupted sessions, and forward the returned results to the right operators. The recovery time is reduced to 40 seconds or less, and the operators no longer have to exit and re-login.
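A recovery pass of the kind described for System G can be sketched as follows. The length-prefixed record format, buffer size, and file name are our assumptions, not System G's actual layout.

    #include <stdio.h>
    #include <stdint.h>

    /* Normal request processing; deterministic under the PWD assumption,
       so re-executing it rebuilds the pre-failure state. */
    static void handle_request(const char *req, uint32_t len)
    { (void)req; (void)len; }

    /* After restart and checkpoint restoration, replay every logged
       request in its original order. */
    static int replay_log(const char *path)
    {
        FILE *log = fopen(path, "rb");
        if (!log)
            return -1;

        uint32_t len;
        char buf[4096];
        while (fread(&len, sizeof len, 1, log) == 1 && len <= sizeof buf) {
            if (fread(buf, 1, len, log) != len)
                break;                 /* a record cut off by the crash is dropped */
            handle_request(buf, len);  /* reconstructs state, session by session */
        }
        fclose(log);
        return 0;
    }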
System U

System U is a telecommunications operations system which receives call information from a switch, analyzes the data, stores the results, and forwards the results to a billing system. Since each switch can store call information for only 10 minutes, System U must not be unavailable for more than 10 minutes; otherwise, some data will be lost.

System U uses a duplex architecture with a primary machine and a backup machine. The checkpoints and message logs for the primary processes are saved onto the backup machine. When the primary processes fail, the backup processes take over by restoring the checkpointed data and replaying the logged messages. Checkpointed data include the mapping table between the file names and file descriptors, environment variables, message sequence numbers, etc. A checkpoint is taken when the size of a log file exceeds 6 megabytes.

Without message logging and checkpointing, each primary machine failure would cause a loss of data and a long service disruption. By incorporating the logging and checkpointing capabilities of libft and REPL [19], the system down time due to a primary machine failure is reduced to less than 90 seconds, which includes
15 seconds for activating the backup processes and 75 seconds or less for replaying logged messages.
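System U's size-triggered checkpoint policy can be sketched like this; the helper names are hypothetical and only the 6-megabyte threshold comes from the text.

    #include <sys/stat.h>
    #include <unistd.h>

    #define LOG_LIMIT (6L * 1024 * 1024)   /* the 6-megabyte trigger used by System U */

    /* Hypothetical stand-in: save the critical data (file-descriptor map,
       environment variables, sequence numbers, ...) to the backup machine. */
    static void take_checkpoint(void) { }

    /* Called after each logged message: once the log outgrows the limit,
       checkpoint and truncate, since replay never needs messages that
       precede the checkpoint. */
    static void maybe_checkpoint(int log_fd)
    {
        struct stat st;
        if (fstat(log_fd, &st) == 0 && st.st_size > LOG_LIMIT) {
            take_checkpoint();
            if (ftruncate(log_fd, 0) == 0)
                lseek(log_fd, 0, SEEK_SET);
        }
    }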
System D

System D is a cross-connection system for fiber cables. The availability requirement for such a system is no more than 4 minutes of down time per year and no more than 30 seconds of recovery time per failure. In order to achieve the level of fast recovery that can meet the high-availability requirement, System D uses an extremely pessimistic technique for message logging and checkpointing: each message is logged before being processed and, at the end of each processing step, the part of the process state that has been changed is checkpointed. After a checkpoint is successfully taken, the checkpoint file then overwrites the message log file. This guarantees that no more than one message needs to be replayed for any recovery. The recovery time is measured to be less than 30 seconds.

For System D, the pessimistic approach has another advantage. There are frequent output commit [11, 20] points in the system for controlling switches and committing database operations. The low failure-free overhead of the optimistic approach would have been offset by the frequent execution of output commit protocols. In contrast, the pessimistic approach allows each output to be performed without latency.
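System D's extreme variant reduces to the following per-message cycle. The sketch is our reconstruction of the described behavior; in particular, the rename()-based replacement is one way the "checkpoint file overwrites the message log file" step could be made atomic, not necessarily what System D does, and all helper names are hypothetical.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for the logging, processing, and incremental
       checkpointing steps described in the text. */
    static void log_message(const char *msg, size_t len)   { (void)msg; (void)len; }
    static void process(const char *msg, size_t len)       { (void)msg; (void)len; }
    static void checkpoint_changed_state(const char *path)
    { FILE *f = fopen(path, "wb"); if (f) fclose(f); }

    static void serve_one(const char *msg, size_t len)
    {
        log_message(msg, len);                 /* 1. log before processing        */
        process(msg, len);                     /* 2. process the message          */
        checkpoint_changed_state("ckpt.tmp");  /* 3. checkpoint only what changed */
        rename("ckpt.tmp", "msglog");          /* 4. the checkpoint replaces the
                                                  log, so at most one message
                                                  ever needs replaying */
    }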
2.2 How expensive is pessimistic logging

The three examples demonstrate that fast recovery is an extremely important issue when a telecommunications system incorporates the capability of message logging and checkpointing. Pessimistic receiver logging appears to have been the best choice so far, because optimistic logging often requires rolling back more processes, which translates into slower recovery, and sender logging simply delays the availability of the logged messages for replaying. One remaining question is: can the failure-free overhead of application-level pessimistic receiver logging be reduced to a level that is considered a fair price to pay for the reduced down time? The answer is yes for many telecommunications systems, if we can take advantage of certain application-dependent information. Instead of implementing atomic message receipt-logging, libft relies on one of the following approaches to handle the rare cases of a lost message due to a failure occurring between message receipt and logging.

For most client-server applications, the effect of such a lost message can often be detected by
other high-level mechanisms. For example, one server process will eventually time out if it has been waiting for a message from another server process for an unusually long time; a client often has a built-in retry mechanism so that, if the expected server response does not arrive within the time-out interval, the service request will be resubmitted.

The special loop structure of the server applications makes the task of checkpoint coordination very simple. The checkpoints that each server process takes independently at the loop boundary are either automatically consistent, because no messages are passing around at that point, or can be made consistent through simple blocking coordination. When a lost message is detected, the server processes can always roll back to the latest coordinated checkpoints as a last resort, and redo the processing when receiving the resubmitted service request.
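The client-side retry mechanism just mentioned amounts to a loop like the following: a generic sketch over a connected socket, not libft code, with illustrative parameter names.

    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Send req; if no reply arrives within secs seconds, resubmit, up to
       tries attempts.  A request lost at the server between receipt and
       logging is eventually resent this way. */
    static ssize_t request_with_retry(int fd, const char *req, size_t len,
                                      char *reply, size_t cap,
                                      int secs, int tries)
    {
        for (int i = 0; i < tries; i++) {
            if (write(fd, req, len) != (ssize_t)len)
                return -1;                      /* (re)send the request */

            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            struct timeval tv = { .tv_sec = secs, .tv_usec = 0 };

            int r = select(fd + 1, &rfds, NULL, NULL, &tv);
            if (r > 0)
                return read(fd, reply, cap);    /* the reply arrived */
            if (r < 0)
                return -1;                      /* select() failed */
            /* r == 0: timed out; fall through and resubmit */
        }
        return -1;
    }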
In libft, pessimistic receiver logging is performed user-transparently when an application calls ftread() to receive a message. Similarly, the function ftwrite() can be used to perform pessimistic sender logging when sending a message. If an application uses both ftread() and ftwrite() to redundantly log each message at both the sender and the receiver sides, a lost message can always be retrieved from the sender log. In other words, the receiver logs are used in most situations to provide fast recovery, while the sender logs are used in the rare cases of lost messages to avoid the need for a much more expensive global rollback. The failure-free overhead of combining both pessimistic receiver and sender logging has been measured to be between 3% and 10% [21], which should be acceptable to those applications that require such extra protection against lost messages.
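The redundant-logging combination can be pictured with the following wrappers. ftread() and ftwrite() are libft's real entry points, but we do not reproduce their signatures here; logread() and logwrite() below merely illustrate the log-at-both-ends pattern under assumed file-descriptor arguments.

    #include <sys/types.h>
    #include <unistd.h>

    /* Receiver side: synchronously log each received message locally, so
       a failed receiver can replay from its own log (fast, localized
       recovery). */
    static ssize_t logread(int fd, int log_fd, void *buf, size_t n)
    {
        ssize_t got = read(fd, buf, n);
        if (got > 0) {
            if (write(log_fd, buf, (size_t)got) != got)
                return -1;
            if (fsync(log_fd) != 0)
                return -1;
        }
        return got;
    }

    /* Sender side: keep a copy before sending, covering the rare message
       lost between the receiver's read and its log write. */
    static ssize_t logwrite(int fd, int log_fd, const void *buf, size_t n)
    {
        if (write(log_fd, buf, n) != (ssize_t)n || fsync(log_fd) != 0)
            return -1;
        return write(fd, buf, n);
    }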
For some applications, the problem of lost messages is simply ignored because they can tolerate the resulting errors. The main strength of libft is to provide low-cost fault tolerance for those applications that would benefit from the reduced service down time, but are not life-critical. Incurring additional failure-free overhead in such applications just for the purpose of tolerating certain rare failure scenarios is usually not considered a worthwhile trade-off. For example, occasionally losing the billing record of a single phone call in case of a failure may not justify the additional
overhead and complexity to avoid that rare situation.

The failure-free overhead, which translates into longer response time, has been observed to be 2%, 9%, and 5% for System G, System U, and System D, respectively. All three systems have been able to absorb the overhead without unacceptable negative impacts on system performance. In return, the reduced system down time either greatly improves the service quality or allows an in-time and correct failure recovery. In general, telecommunications applications typically require tens of system calls for processing each service request. Since the libft message logging mechanism adds only one more system call, i.e., write() (the message size is usually less than 16 kilobytes), the failure-free overhead is typically below 10%.
3 Conclusions

The arguments for optimistic techniques are valid only when a system designer is free to make an arbitrary trade-off between failure-free overhead and recovery efficiency. Since the objective of most fault-tolerant telecommunications systems is to meet their stringent down time requirements, the usual practice is to go for a pessimistic technique for fast recovery, and then see if their performance requirements can absorb the resulting overhead. This approach has proved to be quite successful, as illustrated by the three applications.

We would like to point out, though, that the techniques of dependency tracking and recovery line computation that have been developed in the literature are still applicable to fault-tolerant telecommunications systems with pessimistic message logging and checkpointing. While deterministic state reconstruction through message replaying is effective for tolerating fail-stop transient failures, non-fail-stop software or protocol failures may require some message logs to be discarded in order to force a different execution path to bypass the software bugs [21]. Dependency tracking and recovery line computation remain essential for bringing a system back to a consistent state.
Acknowledgement
We thank Bill Sanders for the discussion that motivated this report, and Chandra Kintala, Pi-Yu (Emerald) Chung, Gaurav Suri, and Fred Douglis for their useful comments and valuable discussions.
References

[1] R. E. Strom and S. Yemini, "Optimistic recovery in distributed systems," ACM Trans. Comput. Syst., Vol. 3, No. 3, pp. 204-226, Aug. 1985.
[2] R. E. Strom, D. F. Bacon, and S. A. Yemini, "Volatile logging in n-fault-tolerant distributed systems," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 44-49, 1988.
[3] E. N. Elnozahy and W. Zwaenepoel, "Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit," IEEE Trans. Comput., Vol. 41, No. 5, pp. 526-531, May 1992.
[4] A. Borg, J. Baumbach, and S. Glazer, "A message system supporting fault-tolerance," in Proc. 9th ACM Symp. Oper. Syst. Principles, pp. 90-99, Oct. 1983.
[5] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle, "Fault tolerance under UNIX," ACM Trans. Comput. Syst., Vol. 7, No. 1, pp. 1-24, Feb. 1989.
[6] M. L. Powell and D. L. Presotto, "Publishing: A reliable broadcast communication mechanism," in Proc. 9th ACM Symp. Oper. Syst. Principles, pp. 100-109, Oct. 1983.
[7] D. B. Johnson and W. Zwaenepoel, "Recovery in distributed systems using optimistic message logging and checkpointing," J. Algorithms, Vol. 11, pp. 462-491, 1990.
[8] A. P. Sistla and J. L. Welch, "Efficient distributed recovery using message logging," in Proc. 8th ACM Symp. on Principles of Distributed Computing, pp. 223-238, Aug. 1989.
[9] T. T.-Y. Juang and S. Venkatesan, "Crash recovery with little overhead," in Proc. IEEE Int. Conf. Distributed Comput. Syst., pp. 454-461, 1991.
[10] D. B. Johnson and W. Zwaenepoel, "Sender-based message logging," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 14-19, 1987.
[11] E. N. Elnozahy and W. Zwaenepoel, "On the use and implementation of message logging," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 298-307, 1994.
[12] L. Alvisi and K. Marzullo, "Message logging: Pessimistic, optimistic, and causal," in Proc. IEEE Int. Conf. Distributed Comput. Syst., pp. 229-236, May 1995.
[13] L. Alvisi, B. Hoppe, and K. Marzullo, "Nonblocking and orphan-free message logging protocols," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 145-154, 1993.
[14] L. Alvisi, "Fast and localized recovery with asynchronous logging." Personal communication, 1995.
[15] E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel, "The performance of consistent checkpointing," in Proc. IEEE Symp. Reliable Distributed Syst., pp. 39-47, Oct. 1992.
[16] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," in Proc. USENIX Technical Conference, pp. 213-224, Jan. 1995.
[17] Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its applications," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 22-31, June 1995.
[18] Y. Huang and C. Kintala, "Software implemented fault tolerance: Technologies and experience," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 2-9, June 1993.
[19] G. Fowler, Y. Huang, D. Korn, and H. Rao, "A user-level replicated file system," in Proc. Summer '93 USENIX, pp. 279-290, June 1993.
[20] D. B. Johnson, "Efficient transparent optimistic rollback recovery for distributed application programs," in Proc. IEEE Symp. Reliable Distributed Syst., pp. 86-95, Oct. 1993.
[21] Y. M. Wang, Y. Huang, W. K. Fuchs, C. Kintala, and G. Suri, "Progressive retry for software failure recovery in message-passing applications," IEEE Trans. Comput., under revision.