Fault Tolerance Via N-Modular Software Redundancy
Timothy Tsai
Lucent Technologies, Bell Labs
Murray Hill, NJ 07974

Abstract
This paper presents a novel method of "indirect" software instrumentation to achieve fault tolerance at the application level. Error detection and recovery are based on the well-known approach of replicating application processes on multiple computers in a network. The advantages of this fault tolerance scheme based on indirect instrumentation include (1) a general error detection method that ensures data integrity for critical data without the need for any modification of code, (2) a high degree of automation and transparency for fault-tolerant configuration and operation (i.e., the setup time for a new application is on the order of a few minutes), and (3) the ability to perform error detection for applications for which no source code or only minimal knowledge of the code is available, including legacy applications. The types of faults that are tolerated include transient and permanent hardware faults on a single machine and certain types of application and OS software faults.
1 Introduction
This paper presents a novel method of "indirect" software instrumentation to achieve fault tolerance at the application level. Error detection and recovery are based on the well-known approach of replicating application processes on multiple computers in a network. The indirect software instrumentation approach focuses on the minimization of the effort required of the human user. The advantages of this approach coupled with process replication include (1) a general error detection method that ensures data integrity for critical data without the need for any modification of code, (2) a high degree of automation and transparency for fault-tolerant configuration and operation (i.e., the setup time for a new application is on the order of a few
minutes), and (3) the ability to perform error detection for applications for which no source code or only minimal knowledge of the code is available, including legacy applications.

The main innovation offered by this approach is the execution of the original software program under the management of a controller program. The original software program will be referred to in this paper as the target program. The controller program has the ability to direct the target program to commence, suspend, and terminate execution at any point. It is also able to view and manipulate the entire data space of the target program, including static and dynamic data and processor registers. In fact, the implementation of the approach presented later in this paper utilizes a well-known debugger. However, any controller program with these characteristics can be used. For instance, on UNIX systems, a simple program based on the ptrace() facility is sufficient. The indirect instrumentation concept has been employed in the past to a limited extent. For instance, the FERRARI [4] fault injection tool used the UNIX ptrace() facility to provide precise triggering of fault injection.

The use of the controller program is important because it avoids all direct modification of the target program. Most software fault tolerance techniques necessitate the modification of either source code or binary code to add error detection and recovery functionality. These modifications are performed prior to execution of the target program and often require the user to edit files or to run instrumentation software. The use of the controller program does not completely eliminate the need for instrumentation of the target program; rather, the instrumentation is performed indirectly. Indirect instrumentation is useful for adding functionality to a program in a manner that provides a large degree of transparency and automation for the human user. This is particularly true for adding fault tolerance to a program that has not been developed with any explicit fault tolerance capabilities. With indirect instrumentation, such a non-fault-tolerant program can be executed in a fault-tolerant mode after a few minutes of
configuration via a graphical user interface. No source code modification or recompilation is required because the error detection and recovery software is mostly contained within the controller program.

Section 2 describes in detail the indirect instrumentation approach and the general-purpose Prism tool, which implements the indirect instrumentation idea. A fault tolerance implementation based on the indirect instrumentation concept is described in Section 3. The implementation is called Prism and uses process replication and voting to perform error detection; checkpointing is used to perform recovery. This particular approach to error detection and recovery is not a new idea. Similar ideas have already been proposed [5][6][7][8]. N-version programming [1] uses redundant code execution and software-implemented voting to achieve fault tolerance, as does the approach described in this paper. Although N-version programming focuses on the detection of software design faults via design diversity, many of the same issues and concerns need to be addressed with the Prism approach. The difference between previous approaches and the approach in this paper is the use of the indirect instrumentation concept to manage the process redundancy and voting. A prototype has been implemented to demonstrate the viability of indirect instrumentation for fault tolerance, as well as to study some of its practical concerns.

One task that is performed especially well by the Prism fault tolerance implementation is error detection. Some software-implemented schemes rely on the operating system to detect errors. These schemes have the disadvantage of being unable to preserve data integrity when no operating system exception is triggered. Other schemes use algorithm-based detection methods that are not applicable to most programs [2][3]. The indirect-instrumentation-based fault tolerance scheme provides generally applicable error detection that preserves the data integrity of an explicitly selected set of critical data. The fault types that are tolerated include all single occurrences of transient or permanent hardware faults, as well as software faults that cause error conditions in the operational environment. Faults that are not detected include communication faults and other software faults that only cause internal value errors.

Section 4 discusses some of the performance issues related to the Prism fault tolerance implementation. Section 5 summarizes with concluding thoughts.
2 Prism instrumentation
The idea of indirect instrumentation has been embodied in the Prism tool [10], which has been used to implement a software-based fault tolerance scheme using process replication. Prism was originally designed
to be a general-purpose instrumentation tool based on the principle of distributed, automated, indirect instrumentation (DAII). Instrumentation refers to modification of a target program to alter or add functionality. Traditional instrumentation requires direct modification of the target program source or executable code. In contrast, indirect instrumentation requires no such code modification, neither in the disk nor the memory image. Instead, indirect instrumentation means modifying the target program execution via a controller that directs the program's control flow and provides observability into the program's internal state. The controller may take the form of a debugger (such as gdb or dbx) or a low-level facility provided by the operating system (such as ptrace or the /proc file system). The gdb debugger [9] was selected as the Prism controller program because it possesses a well-developed programming interface and has been ported to multiple platforms.

Although Prism can be constructed with a debugger as one of its components, it is an instrumentation tool and should therefore not be confused with a debugger. It greatly expands on the features offered by a debugger. A debugger is capable of performing many low-level tasks such as managing breakpoints, executing debugger commands when breakpoints are encountered, and printing and modifying the values of variables. In contrast, the Prism software configures the debugger to perform all the instrumentation tasks needed in an automated, transparent, and distributed manner. In addition, the software adds the capability of executing user-specified code in conjunction with the target program to perform tasks that a debugger alone is unable to do.

Prism is a software instrumentation tool. It grants the user access to the address space of a program without modifying the source code or even the executable. In fact, source code is not required, which means that even legacy code can be instrumented. (If the target program source code is not available, then the instrumentation must rely on the symbol table attached to the target program. If no symbol table is available, Prism can still be used for instrumentation, but instead of using symbolic information, all references to the target program will be in terms of virtual addresses, which currently entails some investigative work on the user's part.) Because the instrumentation is indirect and performed dynamically, the instrumentation itself can be altered, added, or deleted as the target program executes. As a consequence, Prism can even be used to instrument an already executing target program. If the program's executable contains a symbol table, Prism organizes accesses to the address space according to the symbolic information.
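To make the controller concept concrete, the following is a minimal sketch, not the actual Prism controller, of a ptrace()-based controller on a UNIX/Linux system. It attaches to a running target process, reads one word of the target's data space, and lets the target resume; the target PID and the address of the variable of interest are assumed to be supplied by the user (for example, taken from the symbol table with nm).

/* Minimal sketch of an indirect-instrumentation controller using ptrace().
 * Illustration only; not the Prism implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atol(argv[1]);
    unsigned long addr = strtoul(argv[2], NULL, 16);

    /* Attach: the target is stopped and placed under our control. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        perror("PTRACE_ATTACH");
        return 1;
    }
    waitpid(pid, NULL, 0);          /* wait until the target stops */

    /* Observe: read one word of the target's data space. */
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
    if (word == -1 && errno != 0)
        perror("PTRACE_PEEKDATA");
    else
        printf("word at 0x%lx = 0x%lx\n", addr, word);

    /* Resume: the target continues, unmodified on disk and in memory. */
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}

A full controller would additionally plant breakpoints (for example, by writing trap instructions with PTRACE_POKETEXT), resume the target with PTRACE_CONT, and field the resulting stops, which is essentially what a debugger such as gdb automates.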
The basic Prism architecture includes a frontend and one or more backends. The frontend handles configuration and management of the backends, each of which contains a controller program. Some features have been added to the basic Prism tool to make it suitable for use in fault-tolerant applications based on replicated target programs. In particular, support for error detection and recovery has been added. The frontend acts as a voter for data that is collected from the replicated backends. The voter code includes management code for synchronization of the backends. For recovery, the libckp checkpointing package [11] is used to perform restarting and migration of target program processes that have been determined to be erroneous. The frontend graphical interface was also modified to allow the specification of voting parameters and recovery parameters.
3 Fault tolerance implementation
The Prism tool described in Section 2 provides a platform that is well suited to the development of fault tolerance implementations based on process replication. To illustrate the effectiveness of the Prism instrumentation idea and the utility of the Prism tool, a specific fault tolerance scheme was implemented based on an N-Modular Redundancy (NMR) architecture.

In this scheme, the target program is replicated on different machines. Each copy of the target program is controlled by a separate Prism backend. A single frontend serves to coordinate the operation of the backends. Any number of backends (and therefore replicas of the target program) can be supported, although at least three backends are required to perform recovery. The frontend and backends should be placed on different machines to maximize performance and error containment for each backend.

Because multiple copies of the entire data space for the target program exist, error detection is performed by voting on the same data for all copies of the target program. The backends send the data to be voted upon to the frontend, which serves as the voter. If the vote is unanimous, the frontend instructs the backends to continue execution of the target program. If the vote is not unanimous, then an error is detected, and recovery is initiated, assuming that a majority consensus can still be reached.

Although the actual voting is a straightforward comparison, some considerations must be addressed. The copies of the target program execute independently. Thus, a synchronization mechanism is needed to ensure that all copies of the target program are at the identical logical execution point. One convenient method to guarantee this synchronization is to insert
breakpoints at the same address in all copies. The breakpoints serve as a form of barrier synchronization, where all copies must wait until the last copy reaches the breakpoint. Voting must only be performed upon data that is deterministic and is therefore guaranteed to be identical for all copies; this excludes data such as machine names, non-global clock times, and dynamic virtual addresses. The data must be delivered to the frontend in a standardized format. The use of a debugger as the backend controller program provides a convenient method for standardizing data. If symbolic information for the target program is available, then the data to be voted upon can be specified as variable names. Otherwise, data can still be specified in terms of virtual addresses.

If the outcome of the vote is not unanimous, then a divergence is detected for a particular copy of the target program, which is then assumed to have experienced an error and which must then undergo recovery. The recovery process is based on using checkpoints to restart and possibly migrate the erroneous copy of the target program. One of the non-erroneous backends is directed to produce a checkpoint of itself. The checkpoint is then copied to the machine on which the erroneous target program resides, and that target program is restarted using the checkpoint.

The idea of using checkpointing to restart a failed process is neither unique nor complicated; however, the following points should be considered. The code to save the checkpointed process state to a file and to restart a new process from the checkpointed data file can either be part of the target program or exist externally to the target program. If the checkpoint code is intended to be part of the target program, then it may be integrated with the original target program by (1) source code modification, (2) compile-time linking, or (3) run-time linking. If the checkpoint code is external to the target program, then two options exist: (1) use a checkpointing package that is totally separate from Prism, or (2) use the Prism backend controller program to perform the checkpoint operations. The latter option has the advantage of being more platform independent, especially if symbolic information is used to store the checkpointed data in a standard intermediate format.

Special care must be taken to ensure that the newly restarted target program is synchronized with the original target programs. This might be a concern if the checkpointing code is integrated into the target program. In that case, both the restarted target program and the checkpointed target program will execute code asymmetrically. The restarted program does not necessarily
have to be restarted at the same location as the other programs. It just has to be restarted at a location such that the next vote point it reaches will be identical to that for the other programs. This implies that the set of voting breakpoints must be preserved when restarting the new target program.

A policy must be established to specify the machine to use to restart the target program. The new target program is restarted on the same machine up to a maximum restart threshold of X times, where X ≥ 0. If a target program is restarted on a different machine, then a checkpoint file must be copied to the new machine, which must be sufficiently similar to the failed machine to be able to restart the target program using the copied checkpoint file.
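The restart policy just described can be summarized by a small decision routine. The following is a hypothetical sketch, with illustrative names that are not part of Prism, of a policy that restarts a faulty replica on its current machine up to the threshold X and then migrates it to a spare machine, copying the checkpoint file first.

/* Hypothetical sketch of the restart policy (illustrative names only). */
#include <stdio.h>

#define MAX_SAME_HOST_RESTARTS 2   /* the threshold X, X >= 0 (assumed value) */

struct replica {
    const char *host;       /* machine currently running this replica  */
    int restarts_on_host;   /* restarts already attempted on that host */
};

/* Returns the host on which to restart; notes the checkpoint copy if migrating. */
const char *choose_restart_host(struct replica *r, const char *spare_host)
{
    if (r->restarts_on_host < MAX_SAME_HOST_RESTARTS) {
        r->restarts_on_host++;
        return r->host;                     /* restart in place */
    }
    /* Migrate: the checkpoint file must be copied to the spare machine,
     * which must be sufficiently similar to the failed machine. */
    printf("copy checkpoint file from %s to %s\n", r->host, spare_host);
    r->host = spare_host;
    r->restarts_on_host = 0;
    return spare_host;
}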
[Figure 1: NMR Time-line. Time-lines for backend1, backend2, and the frontend show: target program start (Tb1, Tb2), pause-and-send-data points (Tp1_1, Tp1_2, Tp2_1, Tp2_2), continue commands (Tc1_1, Tc1_2, Tc2_1, Tc2_2), the checkpoint taken by backend1 (Tckp1_1), the fault occurrence on backend2 (Tr2_1), target program completion (Te1, Te2), the frontend's unanimous and divergent votes (with the run beginning at Tf0/T0), the instructions to the backends to continue, the instruction to the non-faulty backend to take a checkpoint, the instruction to the faulty backend to restart with the checkpoint, and the overhead times to start up and to restart the target program.]
Figure 1 illustrates an example of a sequence of events for the fault tolerance implementation using the Prism tool. Three time-lines are shown to represent the events for two backends and the frontend. More than two backends are present in the actual configuration, but the only two displayed are the backend that experiences an error (backend2) and the backend that creates the checkpoint to be used for recovery (backend1). The subscripts associated with recurring events are of the form Txi_j, where x is the type of event, i identifies
the backend, and j distinguishes the events for each backend. Broken vertical lines are drawn for events that occur simultaneously. Also note that the figure is not drawn to scale.

Assume that a fault occurs on backend2 and that this fault affects the value of a data variable that has been selected for voting. This fault occurrence is shown on the backend2 time-line in Figure 1 at time Tr2_1. After the fault occurs, the backends all pause at time Tpi_j to send their data to the frontend. This time, the voting is not unanimous due to the fault. If at least three backends participated in the vote, then the two non-faulty backends should present the same data value to the frontend. In this manner, the frontend is able to determine which backend is erroneous.

The recovery process is then initiated. The frontend instructs one of the non-faulty backends to produce a checkpoint of its state in the form of a checkpoint file. In Figure 1, backend1 takes a checkpoint of its state at time Tckp1_1. The frontend waits for the checkpointing to be completed. If backend1 and backend2 reside on different machines, then the checkpoint file is copied from the backend1 machine to the backend2 machine. The frontend then instructs backend2 to terminate the faulty target program and restart a new target program using the checkpoint file. After the new target program has reached the same state as the non-faulty target programs, the frontend gives the continue command to all backends at time Tci_j. Depending on the restart policy, if the maximum restart threshold has already been reached, then the entire backend2 would be terminated and restarted on another machine. The new target program would also be restarted on the new machine.

If the fault that occurs on backend2 affects the target program such that it is unable to reach the voting breakpoint, the frontend will wait until a maximum wait threshold is reached and then declare backend2 to be erroneous. The same recovery process described above is then initiated. These types of faults can cause the target program (1) to crash, (2) to hang indefinitely, or (3) to continue execution without encountering the correct voting breakpoint. In the last case, if another voting breakpoint is encountered, then the frontend will still determine that an error has occurred.

This error detection and recovery scheme will tolerate all single occurrences of hardware faults, be they transient or permanent. Permanent faults will cause restarts to be migrated to a non-faulty machine. A "single fault occurrence" here means that a second fault does not occur until the recovery process initiated by the first fault has been completed. Some types of software faults can also be tolerated. These software faults cause errors in the environment external to the target program, which
in turn cause system calls in the target program either to return divergent values or to cause program termination. For instance, a memory leak in the target program may eventually cause a request for additional memory to fail, so that a subsequent access to the unallocated memory causes an illegal memory access error that is detected by the operating system.

Obviously, the selection of the voting parameters is very important. The data subject to voting should include the set of "critical" variables in the target program. The following types of variables should receive special consideration:

Control flow: variables involved in control flow, such as variables in the conditional part of a branch or a loop.

Program output: variables passed as parameters to output functions, including visual, file, and interprocess output.

Algorithm input and output: variables used as input or produced as output by algorithms and functions. An example is the input and output matrices of a matrix operation.

For ultra-reliable operation, several approaches can be taken to enhance the reliability of the frontend voter, since it is a potential single point of failure. Self-checking and recovery code can be added to the frontend, along with an external watchdog mechanism to ensure that the frontend remains alive. The single frontend process may be replicated with redundant voting. However, that would require a distributed consensus among the multiple voters. A distributed consensus-based vote among the backends would eliminate the need for the frontend voter altogether.
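The vote performed by the frontend is itself a straightforward comparison of the data buffers received from the backends. As an illustration only (not the actual Prism voter), the following hypothetical sketch returns the index of a divergent replica, assuming at least three replicas and at most one faulty one.

/* Hypothetical sketch of the frontend's majority vote: each backend sends
 * the serialized values of the selected critical variables as a byte buffer;
 * the frontend compares the buffers and reports the index of a divergent
 * replica, or -1 for a unanimous vote.  Assumes n >= 3 and at most one
 * faulty replica. */
#include <string.h>

int majority_vote(const unsigned char *data[], size_t len, int n)
{
    for (int i = 0; i < n; i++) {
        int agree = 0;
        for (int j = 0; j < n; j++)
            if (memcmp(data[i], data[j], len) == 0)
                agree++;
        if (agree <= n / 2)        /* replica i is outside the majority */
            return i;
    }
    return -1;                     /* unanimous */
}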
4 Performance
The NMR fault tolerance scheme described in Section 3 has been implemented and tested using fault injection. The fault injection testing demonstrated that the scheme is effective in tolerating the intended faults. However, one drawback was obvious: the impact on performance. From Figure 1, several sources of performance degradation can be seen:

Start-up time: the time to start up the backend software before the target program actually begins execution. This is shown as time Tbi - T0 in Figure 1.

Synchronization time at each vote: the time that each target program is stalled while waiting for the frontend to complete the vote tally and issue the continue command. In Figure 1, this time is shown as time Tci_j - Tpi_j for unanimous votes and time Tckpi_j - Tpi_j for non-unanimous votes.

Recovery time: the time incurred after an error is detected. This time is shown as time Tci_j - Tckpi_j in Figure 1.

Table 1: Measured Overheads for One Example

  Source of overhead   Figure 1 labels     Time (secs)
  Start-up             Tbi - T0            1.796
  Voting               Tci_j - Tpi_j       0.010
  Recovery             Tci_j - Tckpi_j     3.437

The performance overheads are caused by several factors. First, the backend controller program incurs an overhead because it must compete with the target program for processor time. Second, the controller program incurs an additional overhead for managing the execution of the target program, including the management of breakpoints. Third, the voting process imposes a form of barrier synchronization on the target programs. The waiting time is especially conspicuous if the processing speeds of the backend machines are not equal, because the faster machines will always wait for the slowest machine to reach the voting point. Fourth, because the frontend and backends are distributed, a communication overhead is incurred. Fifth, the recovery process requires some finite amount of time.

Table 1 shows some measured overheads for one specific example. The second column of the table relates the different overheads to those depicted in Figure 1 (shown as gray boxes). Since measurements of the overhead will vary depending on the actual target program used and the Prism configuration, the table is given as an indication of the typical overheads that can be expected. The times given are absolute measurements of a single instance of each type of overhead, not measurements of the total overhead as a percentage of the total execution time. The voting time in Table 1 includes the overhead of sending messages between the frontend and backend as well as the time spent waiting for all target programs to synchronize at the vote point. The voting overhead is on the order of a few milliseconds for a single vote, with an approximate range of 3 to 15 milliseconds.

Some actions can be taken to decrease the performance overhead of the Prism NMR implementation: (1) decrease the frequency of votes; (2) decrease the size of the data to vote on; (3) eliminate the requirement for target programs to stall execution until the outcome of the current vote has been determined by the frontend.
Action 1 would decrease the number of the Tci_j - Tpi_j overheads in Figure 1. Action 2 would decrease the average Tci_j - Tpi_j time, but probably only slightly. As long as the data for voting is not very large, most of the Tci_j - Tpi_j time is composed of synchronization and communication overheads. Action 3 would completely eliminate the Tci_j - Tpi_j overhead, but would increase the complexity of the voting and recovery process because the target programs would no longer be synchronized. Thus, vote data must be saved until used, and recovery would necessitate the partial cancellation of some voting data.
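Action 3 implies that vote data must be buffered until the corresponding vote completes, and that buffered data must be cancelled when a divergence is detected. The following is a hypothetical sketch of such buffering, with illustrative names that are not part of the current Prism implementation.

/* Hypothetical sketch of the buffering that Action 3 would require: each
 * vote point's data is tagged with a sequence number and queued so the
 * target program can continue without waiting; on a divergence at some
 * vote point, the entries from that point onward are cancelled. */
#include <stdlib.h>
#include <string.h>

struct pending_vote {
    unsigned long seq;          /* vote-point sequence number    */
    unsigned char *data;        /* serialized critical variables */
    size_t len;
    struct pending_vote *next;
};

static struct pending_vote *queue_head = NULL;

/* Called at each vote point: save the data and let the target continue. */
void enqueue_vote(unsigned long seq, const unsigned char *data, size_t len)
{
    struct pending_vote *v = malloc(sizeof *v);
    v->seq = seq;
    v->data = malloc(len);
    memcpy(v->data, data, len);
    v->len = len;
    v->next = queue_head;
    queue_head = v;
}

/* Called when the frontend reports a divergence at vote point bad_seq:
 * discard all vote data from that point onward (partial cancellation). */
void cancel_from(unsigned long bad_seq)
{
    struct pending_vote **p = &queue_head;
    while (*p) {
        if ((*p)->seq >= bad_seq) {
            struct pending_vote *dead = *p;
            *p = dead->next;
            free(dead->data);
            free(dead);
        } else {
            p = &(*p)->next;
        }
    }
}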
5 Conclusion
This paper described the concept of indirect instrumentation and its use in executing non-fault-tolerant programs in a fault-tolerant mode. Using the Prism indirect instrumentation tool, a fault tolerance scheme based on process replication and voting was implemented. This Prism fault tolerance scheme offers several advantages: (1) The integrity of critical variables is guaranteed because variable values within the target program's data space are subject to voting. (2) The scheme is applicable to most programs, with a few exceptions. (3) No source code editing or recompilation is needed. (4) The scheme can handle legacy applications or code for which the user has only limited expertise.

The Prism fault tolerance implementation was subjected to fault injection testing, and this testing demonstrated that the scheme was effective in detecting and recovering from the intended faults. One area of concern was the performance overhead imposed by the indirect instrumentation scheme. Several proposals were presented to reduce the performance penalties, including decreasing the frequency of votes and allowing the replicated processes to execute asynchronously.

Further research should be conducted to address some of the concerns associated with the use of process replication. For instance, how should replicated input and output be handled? One possible solution might be to copy input to all replicated target programs and sink the output from all but one backend that is designated as the primary. This type of I/O management could be performed by trapping all calls to the read() and write() system calls. However, this read()/write() scheme does not handle memory-mapped I/O and ioctl() calls.
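As one illustration of how such write() trapping might be approached for a dynamically linked UNIX target, the following hypothetical sketch interposes a write() wrapper through library preloading (e.g., LD_PRELOAD on systems that support it). The PRISM_PRIMARY environment variable is an assumed convention, not part of the Prism tool, and the sketch deliberately does not address memory-mapped I/O or ioctl().

/* Hypothetical sketch: an interposed write() wrapper, built as a shared
 * object and preloaded into the target.  Non-primary replicas sink ordinary
 * output (but let stderr through); the primary forwards everything to the
 * real write().  Compile with -ldl where required. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);
    if (real_write == NULL)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    /* Non-primary replicas pretend the write succeeded. */
    if (getenv("PRISM_PRIMARY") == NULL && fd != 2)
        return (ssize_t)count;

    return real_write(fd, buf, count);
}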
6 Acknowledgments
The author gratefully acknowledges the design and development effort of Reinhard Klemm and Navjot Singh, as well as suggestions and feedback from Chandra Kintala.
References
[1] Algirdas A. Avizienis. Software Fault Tolerance, chapter The Methodology of N-Version Programming, pages 23-46. John Wiley & Sons Ltd., 1995.
[2] V. Balasubramanian and P. Banerjee. Compiler assisted synthesis of algorithm-based checking in multiprocessors. IEEE Transactions on Computers, 39(4):436-447, April 1990.
[3] K. H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33(6):518-528, June 1984.
[4] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. FERRARI: A flexible software-based fault and error injection system. IEEE Transactions on Computers, 44(2):248-260, February 1995.
[5] J. Long, W. K. Fuchs, and J. A. Abraham. Forward recovery using checkpointing in parallel systems. In Proc. IEEE International Conference on Parallel Processing, pages 272-275, 1990.
[6] J. Long, W. K. Fuchs, and J. A. Abraham. Dependable Computing for Critical Applications 2, chapter Implementing Forward Recovery Using Checkpointing in Distributed Systems, pages 27-46. Springer-Verlag, 1991.
[7] D. K. Pradhan and N. H. Vaidya. Roll-forward and rollback recovery: Performance-reliability tradeoff. In Proc. 24th Fault-Tolerant Computing Symposium, pages 186-195, 1994.
[8] D. K. Pradhan and N. H. Vaidya. Roll-forward checkpointing scheme: A novel fault-tolerant architecture. IEEE Transactions on Computers, 34(10):1163-1174, October 1994.
[9] Richard M. Stallman and Cygnus Support. Debugging with GDB: The GNU Source-Level Debugger, 4.12 edition, January 1994.
[10] Timothy Tsai, Reinhard Klemm, and Navjot Singh. Distributed, automated, indirect program instrumentation with Prism. Technical memorandum, Lucent Technologies, Bell Laboratories, Murray Hill, NJ, 1997. Distributed Software Research Department.
[11] Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung, and Chandra Kintala. Checkpointing and its applications. In Proc. 25th Fault-Tolerant Computing Symposium, pages 22-31, 1995.