TFT: A Software System for Application–Transparent Fault Tolerance Thomas C. Bressoud Stratus Computer, Inc. 55 Fairbanks Blvd. Marlborough, MA 01752, USA
[email protected] Abstract An important objective of software fault tolerant systems should be to provide a fault-tolerance infrastructure in a manner that minimizes the effort required by the application developer. In the limit, the objective is to provide fault tolerance transparently to the application. TFT, the work presented in this paper, provides transparent fault-tolerance at a higher interface than prior solutions. TFT coordinates replicas at the system call interface, interposing a supervisor agent between the application and the operating system. Moving the replica coordination to this interface allows uncorrelated faults within the operating system and below to be tolerated and also admits the possibility of online operating system and hardware upgrades. To accomplish its task, TFT must enforce a deterministic computation above the system call interface. The potential sources of non-determinism addressed include non-deterministic system calls, delivery of asynchronous events, and the representation of operating system abstractions that differ between replicas.
1 Introduction
Most software solutions for building fault-tolerant applications require the application programmer to design with fault tolerance in mind and to conform to the particular programming paradigm provided by the fault-tolerance infrastructure. The fault-tolerance infrastructure may then provide support and mechanisms for coordinating replicas¹ of a computation on processors that fail independently. Unfortunately, correct design of such applications is difficult; errors can be disastrous; and the provided programming paradigm may not be the best paradigm for designing the application.
¹ A replica is an instance of a computation; a collection of replicas comprises a fault-tolerant realization of that computation.
An important objective of fault-tolerant systems should be to provide this fault-tolerance infrastructure in a manner that minimizes the effort required by the application developer. In the limit, the objective is to provide fault tolerance transparently to the application. This paper describes Transparent Fault Tolerance (TFT), wherein replica coordination is automatically provided for applications by interposing a software layer between the application software and the operating system. The result is a fault-tolerant application that does not require changes to the application source, nor to the underlying hardware or operating system.

Transparency must be defined relative to an existing interface. Previous work has achieved transparency relative to other existing interfaces:

• Transparent hardware fault tolerance may be attained by providing replica coordination at the system or I/O bus interface and operating the processors in instruction lockstep [13], [15].

• Hypervisor-based fault tolerance [4] provides transparent replica coordination at the instruction set interface, but requires a virtual machine monitor to implement the coordination.

For both of these approaches, all software above the selected interface executes identically at all replicas. The implication is that operating system design faults always result in (identically) correlated replica failures. In addition, the hardware replication has a recurring design cost for new processor and bus architectures. The hypervisor-based approach has a significant failure-free performance cost as a result of the virtual machine monitor overhead.

TFT provides transparent replica coordination above the operating system. Replica coordination is provided at the interface between the application software and the operating system. As a result, independent faults in the operating system and below can be tolerated. Further, for operating system and hardware upgrades that maintain the semantics of the application/operating system interface, the solution admits the possibility of online operating system and hardware upgrades.
To establish the feasibility of key aspects of the TFT solution, two prototypes were built. The first targeted the Unixware system call interface. The second targeted the Win32 interface of Windows NT.

The rest of this paper is organized as follows. In the remainder of this section, we describe the system model and fundamental assumptions of TFT. Section 2 discusses the general approach for providing transparent fault tolerance relative to a given interface. Section 3 maps the general approach to the application/operating system interface of TFT. In Section 4, we describe the two prototypes and the overheads measured from them. Section 5 describes related work; a summary and future research directions are presented in Section 6.
1.1 System Model
We assume the existence of multiple processors that fail independently. Each processor executes its own operating system kernel, and there exists a well-defined and complete interface between the application, running in user mode, and the operating system, running in kernel mode. Each processor may host a replica of a given application.

We further assume that there exists a communication channel between processors. Communication is assumed to be FIFO and eventually reliable. Message sends are non-blocking, but the communication subsystem provides a means for checking and/or blocking on the stable receipt of a particular message send; we call this a stability query. Message receives are blocking. We assume that processors are not partitioned by communications failures.

Different replicas must have the ability to interact with the I/O devices of the environment. We thus assume that any I/O operations possible from the application executing on one processor are possible from the application executing on any other processor.

The TFT approach is derivative of the primary/backup approach to fault tolerance [1]. One replica is designated the primary, and the others are designated as backups. The primary is responsible for interaction with the environment until a failure of the primary; at that point, some backup assumes its responsibilities. Unlike many other primary/backup schemes, the backups of TFT are actively executing the computation, although temporally just behind the primary.

Primary/backup approaches (including TFT) are generally designed to tolerate only those failures that can be detected before the application at the faulty processor performs an erroneous externally visible action. This is commonly known as the fail-stop failure model [16]. Byzantine behavior of the primary can cause corrupt data and failure of the system and is generally disallowed by this approach. Failure detection is not specifically addressed in this description of TFT. Our prototypes use timeouts and maximum retry counts on the communications from the primary to the backup and thus implicitly assume a bounded delay both in communication and in processor execution. A failure of the backup is assumed to release the blocking stability query described above.
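To make the communication assumptions concrete, the following C sketch gives one possible shape for the channel interface the supervisor relies on; the names (chan_send, chan_recv, chan_is_stable, chan_wait_stable) and signatures are hypothetical illustrations, not part of the TFT specification.

    #include <stddef.h>
    #include <stdbool.h>

    /* Hypothetical sketch of the FIFO, eventually reliable channel assumed
     * between the processors hosting the primary and the backup. */

    typedef struct channel channel_t;   /* opaque connection to the peer        */
    typedef unsigned long  msg_id_t;    /* identifies a particular message send */

    /* Non-blocking send: queues the message for FIFO delivery and returns
     * immediately with an identifier usable in later stability queries. */
    msg_id_t chan_send(channel_t *ch, const void *buf, size_t len);

    /* Blocking receive: returns the next message in FIFO order. */
    size_t chan_recv(channel_t *ch, void *buf, size_t maxlen);

    /* Stability query: true once the message identified by 'id' (and, by FIFO
     * order, all earlier sends) is known to have been received by the peer. */
    bool chan_is_stable(channel_t *ch, msg_id_t id);

    /* Blocking form of the stability query; a failure of the backup is assumed
     * to release this wait. */
    void chan_wait_stable(channel_t *ch, msg_id_t id);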
2 General Approach

2.1 Interfaces and Computations
The design of replica coordinating software must determine what defines the boundary of a replica and then manage the inputs and outputs crossing that boundary. Within the context of a hierarchically structured system, this can be restated as determining the interface at which to implement replica coordination. Figure 1 shows such a representation and the candidates for a replica coordination interface.

Figure 1. A hierarchically structured system (application, middleware, operating system, processor hardware, memory and I/O) and the candidate replica coordination interfaces: the middleware interface, the system call interface, the instruction set interface, and the system bus interface.

Achieving transparency can be accomplished by implementing replica coordination as a layer of software replacing an existing interface. Call that interface the replication management interface. The replica coordination layer must then provide the semantic equivalent of the usurped interface as well as perform the replica coordination responsibilities. In this way, the software above the selected interface cannot distinguish between the fault-tolerant and non-fault-tolerant realizations of the interface. Given a selected replication management interface, the levels above the selected interface become the managed replica, and it is the goal of replica coordination to ensure that replica instances behave identically in a manner rigorously described below. The levels below the selected interface may execute independently.

An interface is defined by the set of actions it supports. An action that is initiated by software above the interface is called an operation. An action that is initiated from below the interface (and is thus asynchronous to the sequence of operations at the interface) is called an
exception. A controlled action (either an operation or an exception) is one that can be implemented by the replica coordination software. The sequence of actions (operations and exceptions) at the replication management interface forms the computation of the replica. The action sequence of a computation defines a form of time that increases monotonically with respect to that action sequence. We call this logical time to distinguish it from physical time. Although logical time can be measured by counting the actions at the selected interface, a more sophisticated approach is necessary and will be described below.
2.2 State Machines, Replica Coordination, and Non-Determinism
The state machine is an essential model for architecting fault-tolerant systems [11], [17]. A state machine consists of state variables, which encode its state, and actions, which transform its state. Each action is deterministic, and execution of each action is atomic with respect to other actions of the state machine. Execution of each action modifies the state variables and/or produces some output. Given this definition of a state machine, the outputs are completely determined by the sequence of actions executed. Building fault-tolerant systems out of state machine replicas is typically achieved by having replica coordination software deliver the same actions to all state machine replicas in the same order and voting on or comparing the outputs.

Note the intentional overload of the term action. We equate actions performed at the replication management interface with transformations of the state of a replica. Mapping the state machine approach to our goal of transparent fault tolerance requires additional effort. The state of a replica (corresponding to the state variables of a state machine) should include all data and code above the replication management interface as well as the program counter and general-purpose registers that allow a computation to execute on real hardware. The state of software and/or hardware below the replication management interface only becomes part of the state of the replica when it may be visible (either explicitly or implicitly) across the replication management interface.

We want the actions of the selected interface to be the set of actions of a state machine replica. However, the set of actions supported by an existing interface is unlikely to be deterministic, and thus may not support the state machine model. Some of these sources of non-determinism are described below. The additional requirement of replica coordination in this setting is to enforce determinism of all actions.
More formally, the obligations of the replica coordination software are the following:

RC1: Ensure that all replicas begin in the same initial state.

RC2: Ensure that every non-faulty replica executes the same actions in the same relative order.

RC3: Ensure that the transformation in the state of the replica is identical for each action executed.

Together, these imply that all replicas execute the same computation and thus produce the same outputs in the same order. The environment must only see a single correct output, so replica coordination must also:

RC4: Ensure that a single correct output is communicated to I/O devices of the environment.

Finally, we must be able to continue operation, interacting consistently with both the environment and with the underlying operating system, even in the event of failures.

RC5: The effect of a failure must be transparent to the ongoing computation.

To satisfy RC2 and RC3, we must understand the potential classes of non-determinism across the replication management interface. Non-determinism affecting the state of a computation replica can be partitioned into three classes:

Timing non-determinism. Timing non-determinism captures the notion that execution of the same action from the same state takes a different amount of physical time at different replicas. This is typically due to differences in state below the replication management interface as well as differences in execution rates of the underlying levels of software and hardware. The use of the state machine approach, with replica coordination as defined above, abstracts out timing non-determinism. As long as the action sequence is identical (RC2) and the transformation in state is identical (RC3), the amount of physical time taken by the execution of an action on a processor replica does not affect correct operation, regardless of the time skew introduced.

Control-flow non-determinism. Given identical state at two different replicas, control-flow non-determinism is exhibited if the next action to be executed differs between the two computation replicas. This type of non-determinism is due to exceptions initiated from below the replication management interface. In the absence of exceptions, the next operation of the computation is determined by the previous operation (its successor operation, or the destination of a branch). To satisfy RC2, exceptions must be controlled actions and must be delivered by the replica coordination software at identical corresponding points in the action sequence at all replicas (i.e. at the same logical time).
Data non-determinism. Given identical state at two different replicas and the same next action to be executed, data non-determinism is exhibited if the transformation in state induced by execution of the action differs between the two replicas. Data non-determinism can be partitioned into two subclasses.

First, the semantics of an action may admit a non-deterministic transformation in state. An action that reads a clock or performance counter exemplifies this subclass. Likewise, an action that reads data from the environment (an input action) also falls into this subclass. This class of non-determinism is synchronous with the action sequence. To satisfy RC3, replica coordination must ensure that the transformation in state is identical. To do that, the action must be controlled, and a deterministic equivalent must be provided.

The second subclass of data non-determinism can occur with loads of ordinary memory locations if that memory is shared with another computation. This is caused by asynchronous activity outside the replica boundary. Such updates are analogous to exception delivery: the resulting replica state transitions must also occur consistently relative to the action sequence.
3 The TFT Solution

3.1 Architecture

The TFT solution provides replica coordination in a software level interposed between the application and the operating system, replacing that interface. We call the TFT replica coordination software the Replica Supervisor (supervisor). The interface presented by the supervisor to the application must have the same semantics as the operating system interface. The primary/backup method is employed to provide consistency among the replicas. The supervisor must ensure that:

1. If the primary has not failed, then backup application replicas generate no interactions with the environment.

2. In the event of a failure of the primary, exactly one backup application replica takes over and generates subsequent interactions with the environment. This is done in such a way that the environment is unaware of the primary's failure.

These then satisfy obligation RC4, and their enforcement will be covered in greater detail in Section 3.3. The solution described here employs a single backup and thus can tolerate a single failure. Generalization to the use of multiple backups is straightforward. The TFT architecture with a single backup is depicted in Figure 2.

Figure 2. The TFT architecture with a single backup: at both the primary and the backup, the application runs above the TFT Supervisor, which is interposed above the operating system and processor; the primary interacts with the environment.

Using the terminology presented in Section 2, we must now analyze the actions supported by the application/operating system interface and argue that the obligations of RC2 and RC3 can be satisfied. Recall that the actions of an interface are partitioned into operations and exceptions. The operations of the application/operating system interface consist of (i) the non-privileged instructions of the processor architecture and (ii) the set of system calls provided by the underlying operating system. The exceptions of the application/operating system interface are also dependent on the underlying operating system. Typical examples include asynchronous signals and non-blocking I/O completion notifications. The transformation in the state of the replica for exceptions must include delivery of any data associated with that exception.

The TFT architecture presupposes an interposition mechanism whereby the supervisor is in the path between any system call invoked by the application and the operating system itself. Example interposition mechanisms for our two prototypes are discussed in Section 4. Thus all system calls are controlled actions. We further assume (by appropriate implementation) that the supervisor can intercept all exceptions initiated by the operating system, making exceptions controlled actions as well.
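To make the interposition concrete, the following is a minimal C sketch of a supervisor entry point standing in for one non-deterministic system call, gettimeofday(); it anticipates the primary/backup protocol described next in Section 3.2. All of the helper names (tft_is_primary, tft_logical_time, tft_send_to_backup, tft_recv_from_primary) are hypothetical and are not the actual TFT implementation.

    #include <stdint.h>
    #include <sys/time.h>

    /* Hypothetical sketch: the supervisor presents an entry point with the same
     * signature and semantics as the call it replaces.  For a non-deterministic
     * call, the primary performs the real call and ships the resulting
     * transformation in state (return value plus output data), tagged with the
     * current logical time, to the backup; the backup blocks at the
     * corresponding point and applies the same transformation. */

    struct ndet_result {
        uint64_t       logical_time;   /* point in the action sequence         */
        long           return_value;   /* value returned at the primary        */
        struct timeval data;           /* output data associated with the call */
    };

    extern int      tft_is_primary(void);
    extern uint64_t tft_logical_time(void);
    extern void     tft_send_to_backup(const struct ndet_result *r);  /* non-blocking */
    extern void     tft_recv_from_primary(struct ndet_result *r);     /* blocking     */

    int tft_gettimeofday(struct timeval *tv)
    {
        struct ndet_result r;

        if (tft_is_primary()) {
            r.logical_time = tft_logical_time();
            r.return_value = gettimeofday(tv, NULL);   /* perform the real call */
            r.data = *tv;
            tft_send_to_backup(&r);                    /* primary continues     */
            return (int)r.return_value;
        }

        /* Backup: blocks here if it has "run ahead" of the primary. */
        tft_recv_from_primary(&r);
        *tv = r.data;                                  /* apply the same transformation */
        return (int)r.return_value;
    }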
3.2 Identical Replica Computations
The supervisor is responsible for ensuring RC2 and RC3. For the replication management interface of TFT, RC2 means that each replica must execute the same sequence of instructions, system calls, and exceptions, since these are the actions of the interface. We first consider how to satisfy RC2 and RC3 for instructions and system calls in the absence of exception delivery. We then present our solution for identical delivery of exceptions.
Operations. We assume that ordinary instructions (the non-privileged instructions of the processor architecture) are deterministic: given the same replica state and the same ordinary instruction and arguments, execution will result in the same transformation in state. Arithmetic instructions, load/store instructions, and control instructions all satisfy this assumption, even if they result in a synchronous trap. Thus ordinary instructions satisfy RC3 without additional work by the supervisor. Also, since part of the transformation in state of any ordinary instruction is to update the program counter, RC2 is satisfied as well.

Operating system calls are partitioned into deterministic system calls and non-deterministic system calls. Examples of deterministic system calls are varied and include heap creation and heap allocation/free operations as well as pure processing such as string manipulation. The deterministic system calls satisfy RC2 and RC3 by the same argument as ordinary instructions.

For all non-deterministic system calls, the supervisor at the primary invokes the system call on behalf of the computation. The transformation in state resulting from the non-deterministic system call is sent in a message, along with a logical time timestamp, from the primary to the backup. The message is sent along the FIFO communication channel and is non-blocking for the primary. The primary may then continue its execution. When the backup reaches the same system call, it retrieves the transformation in state conveyed by the primary on the communication channel. This would cause the backup to block if it had "run ahead" of the primary. The backup may then apply the same transformation in state, thereby satisfying RC3. Note that, in the absence of exception delivery, RC2 is satisfied in the same way as for ordinary instructions and deterministic system calls.

Examples of non-deterministic system calls include those that deal with physical time, such as gettimeofday(), calls that synchronously sense the environment, such as read(), and calls that perform output to the environment and return a status, such as write(). In all these cases, the primary performs the operation and the backup simulates the operation by returning the same transformation in state as the primary.

Exceptions.

Epochs. Execution of an application replica is partitioned into epochs. Each epoch contains one or more increments of logical time. By TFT design, corresponding epochs at the primary and backup application comprise the same sequence of operations. The supervisor is responsible for incrementing logical time and dividing the operation sequence into epochs. A simple solution would be to increment logical time at each system call invocation, since the supervisor has control at those points anyway.
However, this granularity of logical time is not sufficient for arbitrary applications. Consider, for example, an application executing a code sequence that loops and invokes no system calls. The application could design the loop to terminate when an exception (such as a signal) indicates completion of a prior I/O operation. Since no system calls are invoked, the supervisor never gains control, buffered exceptions are never delivered, and the loop never terminates. Even in the absence of intercepted system calls, the supervisors at the primary and at the backup must receive control from the application at the same point in logical time so that buffered exceptions may be delivered.

TFT uses a known technique of object code editing to transparently instrument application binaries with code to advance logical time and to transfer control to the supervisor at epoch boundaries. Epochs begin with an epoch counter set to the length of the epoch. The epoch counter represents logical time. The epoch counter is set identically by the supervisors of both the primary and the backup. To guarantee the liveness of logical time, each back branch of the application binary is replaced with instructions to decrement the epoch counter and to conditionally branch to supervisor code. The pseudo assembly code of Figure 3 illustrates the transformation.

Exception Delivery. The supervisor must deliver the same exceptions, including the transformation in state they induce, at the same point in the computation (i.e. at the same logical time). In TFT, the replica supervisor at the primary is responsible for intercepting exceptions, buffering them, and forwarding them to the supervisor at the backup. Any exceptions intercepted directly by the supervisor at the backup are ignored. Thus, by communicating with the primary's supervisor, the backup's supervisor learns what exceptions it must deliver to the backup application replica.

Exceptions are not delivered during an epoch. Instead they are buffered and are only delivered at epoch boundaries. We have only to ensure that the same exceptions (if any) are delivered at both the primary and the backup in the same order when each epoch ends. This is achieved by having the primary communicate the exception and the epoch of delivery to the backup. A flag can be added to the message to indicate the last exception to be delivered at an epoch boundary. A special message can be sent to indicate epoch boundaries with no exceptions to be delivered. The backup delays entering a subsequent epoch until communication from the primary indicates that all exceptions due to be delivered at the current epoch boundary have been received. The solution is depicted in Figure 4. This is similar to the policies for handling delivery of interrupts defined in Hypervisor-based Fault Tolerance [4], as well as the instruction counters of [19].
Before:

        load    r1,#read_sys_call
        trap    #os_trap
loop_top:
        ; potentially lengthy computation
        load    r1,signal_flag
        bz      loop_top
        load    r1,#write_sys_call
        trap    #os_trap

After:

        load    r1,#read_sys_call
        trap    #os_trap
loop_top:
        ; potentially lengthy computation
        load    r1,signal_flag
        push    condition_codes
        push    r1
        load    r1,epoch_counter
        dec     r1                      ; advance logical time
        store   r1,epoch_counter
        bnz     continue
        call    supervisor              ; epoch boundary reached
continue:
        pop     r1
        pop     condition_codes
        bz      loop_top
        load    r1,#write_sys_call
        trap    #os_trap

Figure 3. Back-branch instrumentation: the original loop (Before) and the transformed loop (After), which maintains the epoch counter and transfers control to the supervisor at an epoch boundary.
Figure 4. Exception delivery at epoch boundaries: exceptions A, B, and C intercepted at the primary during epoch i are forwarded to the backup by the replica coordination protocol; both the primary and the backup deliver A, B, and C at the end of epoch i, before entering epoch i+1.
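The following C sketch illustrates the epoch-boundary processing just described, under assumed helper names (tft_is_primary, tft_next_buffered_exception, tft_forward_to_backup, tft_forward_end_of_epoch, tft_recv_exception_or_end, tft_deliver_to_application); it is an illustration of the protocol, not the TFT implementation.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical sketch of epoch-boundary processing.  The instrumented
     * application transfers control to the supervisor when the epoch counter
     * reaches zero; buffered exceptions are then delivered identically at
     * both replicas before the next epoch begins. */

    #define EPOCH_LENGTH 4096            /* increments of logical time per epoch */

    struct exception_record {
        int      kind;                   /* which exception occurred             */
        uint64_t epoch;                  /* epoch boundary at which to deliver   */
        /* ... any data associated with the exception ... */
    };

    extern int  tft_is_primary(void);
    extern bool tft_next_buffered_exception(struct exception_record *e);   /* primary */
    extern void tft_forward_to_backup(const struct exception_record *e);   /* primary */
    extern void tft_forward_end_of_epoch(uint64_t epoch);                  /* primary */
    extern bool tft_recv_exception_or_end(struct exception_record *e);     /* backup: blocks,
                                                                              false at end marker */
    extern void tft_deliver_to_application(const struct exception_record *e);

    volatile long tft_epoch_counter;     /* decremented by instrumented back branches */

    void tft_epoch_boundary(uint64_t epoch)
    {
        struct exception_record e;

        if (tft_is_primary()) {
            /* Forward every exception buffered during this epoch, deliver the
             * same exceptions locally, then mark the end of the boundary. */
            while (tft_next_buffered_exception(&e)) {
                e.epoch = epoch;
                tft_forward_to_backup(&e);
                tft_deliver_to_application(&e);
            }
            tft_forward_end_of_epoch(epoch);
        } else {
            /* The backup delays entering the next epoch until the primary
             * indicates that all exceptions for this boundary have arrived;
             * exceptions intercepted locally at the backup are ignored. */
            while (tft_recv_exception_or_end(&e))
                tft_deliver_to_application(&e);
        }

        tft_epoch_counter = EPOCH_LENGTH;   /* begin the next epoch */
    }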
3.3 Interaction with the Environment
The system calls and exceptions of the application/operating system interface are the mechanism by which applications sense and affect the environment.
One of the obligations of the supervisor is to satisfy RC4, wherein the sequence of I/O operations seen by the environment is consistent with what could be observed were a single application in use, instead of a replicated application. In particular, the following scenario should not be possible. First, the application replica at the primary performs an I/O operation that causes the state of the environment to change. Then the primary fails. When the backup is promoted to the new primary, it has a state and performs actions that are inconsistent with the state of the environment. This could be the result of uncertainty at the backup (now primary) about the last I/O operation performed by the primary.

As in other systems of this type, we would like to achieve "exactly once" semantics for all I/O operations. Unfortunately, without an atomic operation that includes both the I/O and the communication to the backup, achieving exactly once semantics is impossible.

The protocol described thus far allows such inconsistency. Due to the failure of the primary, which may include the failure of the processor the primary is running on, some communication to the backup may be lost. By ensuring RC1 through RC3 as described above, the computation of the backup will be identical to that of the primary up through the last communication from the primary received reliably at the backup (i.e. the backup can perform the same set of non-deterministic system calls and exception delivery at epoch boundaries as the primary up to the point of failed communication). The backup must take over following a failure at the first non-deterministic system call or epoch boundary for which the backup does not have the corresponding message from the primary. Call this the promotion point. Once the backup is promoted, it must begin interacting with the environment and resolving non-deterministic system calls in a manner consistent with the assumptions of the interface (i.e. become the new primary).

We must guarantee that the backup is aware of all I/O operations that the primary may have attempted prior to its failure. The delivery of an exception at an epoch boundary or the transformation induced by a non-deterministic system call could dictate an execution path that includes some I/O operation initiated by the primary. If a failure results in the loss of a message that dictates that execution path, then the backup could resolve the non-determinism differently and take a different code path. An I/O operation initiated by the primary and affecting the environment might therefore never be seen by the backup. To solve this part of the problem, the primary supervisor executes a stability query immediately prior to each I/O operation that affects the environment. If prior messages from the primary to the backup are not yet stable, the primary blocks until they are. The I/O operation is then allowed to propagate out to the environment. This is similar to the output commit requirement before environment interaction in message-logging systems [8], [9].

The current algorithm can still result in outstanding I/O operations whose status is not known to the backup. At the promotion point, there may be one or more I/O operations that are outstanding. An I/O operation is outstanding if, (i) for blocking I/O operations, the system call message was not received by the backup, or (ii) for non-blocking I/O operations, the exception associated with asynchronous completion notification was not received by the backup. The backup can automatically resolve this uncertainty for two classes of I/O operations:

1. Idempotent operations: the backup may simply reissue the operation.

2. Testable operations: the backup can query the I/O device as to the last operation received by the device and its status, and either perform or not perform the operation depending on whether the primary had been successful.

For any other I/O operations, the backup must return a failure status completion and rely on the application to take corrective steps. Clearly, we would like as many I/O operations as possible to fall into the two categories where automatic corrective action is possible. In some cases, system software can be modified to exhibit this behavior. For example, TFT maintains its own current file offset so that file reads and writes using an implicit offset can be made idempotent. We also implement a virtual serial console for default standard input and output that adds sequence numbers to operation requests so that the virtual console is testable.
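A minimal C sketch of the output-commit rule and of the backup's non-interaction with the environment follows, assuming hypothetical helpers (tft_is_primary, tft_last_message_sent, tft_wait_stable, tft_record_result, tft_replay_result); it illustrates the protocol described above rather than the actual TFT code.

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical sketch: before an operation that affects the environment is
     * allowed to propagate, the primary blocks until all earlier messages to
     * the backup are stable, so the backup can never be unaware of an
     * execution path that produced visible output.  The backup itself never
     * touches the environment while the primary is alive. */

    typedef unsigned long msg_id_t;

    extern int      tft_is_primary(void);
    extern msg_id_t tft_last_message_sent(void);   /* id of most recent send   */
    extern void     tft_wait_stable(msg_id_t id);  /* blocking stability query */
    extern void     tft_record_result(long rc);    /* primary -> backup        */
    extern long     tft_replay_result(void);       /* backup: blocking         */

    ssize_t tft_supervisor_write(int fd, const void *buf, size_t count)
    {
        if (tft_is_primary()) {
            /* Output commit: earlier non-deterministic results and exception
             * messages must be stable at the backup before the write reaches
             * the environment. */
            tft_wait_stable(tft_last_message_sent());
            ssize_t rc = write(fd, buf, count);
            tft_record_result((long)rc);           /* the returned status is
                                                      itself non-deterministic */
            return rc;
        }

        /* Backup: generates no interaction with the environment; it simply
         * replays the status returned at the primary. */
        return (ssize_t)tft_replay_result();
    }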
3.4 Interaction with the Operating System
Portions of the state of the underlying operating system can become visible to the application as a result of system calls. If this information is allowed to "leak" into the application state across the application/operating system boundary, then, by RC2 above, the same state leaks into the backup as into the primary. When a failure occurs and the application replica at the backup begins interacting with the operating system, this leaked information may be inconsistent with the state of the underlying operating system at the backup. This is another source of non-determinism, but it must be resolved in such a way as to allow consistent operation across a failure.

To satisfy RC5, the replica supervisor must provide a virtualization of the set of abstractions supported by the operating system. We call this the Virtual Process Environment (VPE). The VPE presents identical virtual operating system abstractions to the application so that the state of the primary and backup is identical and will remain consistent across a failure. The VPE maps between the virtual abstractions and the actual operating system abstractions they represent.

Simple mapping is sufficient for large classes of operating system abstractions. For example, the VPE can maintain its own process identifiers and object handles that map 1:1 to those in the operating system. Another important example is the virtualization of the set of system calls returning physical time (or some derivative). The primary can retrieve the time, return it locally, and pass it along to the backup per the protocol described in Section 3.2, but after a failure, the backup must be able to return time values consistent with the expected semantics (e.g. time must monotonically increase).

More complex mapping may need to be employed for file system directory hierarchies and for network addresses and ports. For example, for the failure of a TFT server to be transparent to its clients, the IP address must migrate transparently to a new MAC address, and the state of the TCP/IP protocol stack must not reveal the failure.

The size and extent of the VPE is very dependent on the location sensitivity of the underlying system. For a true distributed operating system where location should not be visible (as in so-called single system image operating systems like LOCUS [21]), the VPE need not incorporate much functionality. Accommodating network operating systems involves incorporating significant functionality of a distributed operating system in the VPE. A complete discussion of the design of the VPE is ongoing research and is beyond the scope of the current paper.
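As one concrete illustration of a VPE mapping, the following hedged C sketch shows virtualized time that remains monotonic across a failover; the names (tft_vpe_time_usec, tft_last_time_returned) are hypothetical, and the real VPE is necessarily more involved.

    #include <stdint.h>
    #include <sys/time.h>

    /* Hypothetical sketch of time virtualization in the VPE.  The last value
     * returned to the application is part of the replicated state (it reaches
     * the backup via the protocol of Section 3.2), so a newly promoted backup
     * whose local clock lags the failed primary still returns monotonically
     * increasing time. */

    static uint64_t tft_last_time_returned;     /* replicated with the computation */

    static uint64_t local_clock_usec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (uint64_t)tv.tv_sec * 1000000u + (uint64_t)tv.tv_usec;
    }

    /* Called on the replica currently acting as primary. */
    uint64_t tft_vpe_time_usec(void)
    {
        uint64_t now = local_clock_usec();
        if (now < tft_last_time_returned)
            now = tft_last_time_returned;        /* never step backwards */
        tft_last_time_returned = now;
        return now;                              /* forwarded to the backup as the
                                                    result of the virtual call     */
    }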
4 TFT Prototypes

4.1 Unixware System Call Interface
Our first prototype targeted the application/operating system interface defined by the Unixware operating system executing on the Intel Pentium processor architecture. Interposition of the supervisor was achieved by replacing the operating system trap instances within the C run-time library with a procedure call to the supervisor. Application executables had to be relinked with this “supervisor-enhanced” C run-time library. The communication medium in the prototype was 10 Mbps Ethernet. Communication protocol between the primary and its single backup employed TCP/IP with an
explicit acknowledgment to implement the stability query. We did not attempt to optimize the communications protocols; we thus had the default message buffering and transmission decisions of the protocol layer. The Unixware prototype did not implement object code instrumentation of logical time due to a lack of general instrumentation tools available at that time. Nonetheless, we believe such instrumentation is very feasible. Logical time was incremented at system calls. We used the TFT prototype for two workloads: one focused on the robustness of the system call interposition and the supervisor incursion overhead, and the second focused on performance in a more common cycle of CPU and I/O bursts.
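As an illustration of the relinking approach described above, the following hedged C sketch contrasts an ordinary C run-time stub with its supervisor-enhanced counterpart; os_trap, SYS_read, TFT_ENHANCED_LIBC, and tft_supervisor_syscall are illustrative names rather than the actual Unixware symbols.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical before/after view of a C run-time stub under the
     * supervisor-enhanced library used by the Unixware prototype. */

    extern long os_trap(int number, ...);                /* issues the OS trap    */
    extern long tft_supervisor_syscall(int number, ...); /* enters the supervisor */
    #define SYS_read 3                                   /* illustrative number   */

    #ifdef TFT_ENHANCED_LIBC
    /* Relinked stub: the operating system trap is replaced by a procedure
     * call into the TFT supervisor, which decides how to handle the call. */
    ssize_t read(int fd, void *buf, size_t count)
    {
        return (ssize_t)tft_supervisor_syscall(SYS_read, fd, buf, count);
    }
    #else
    /* Original stub: traps directly into the operating system. */
    ssize_t read(int fd, void *buf, size_t count)
    {
        return (ssize_t)os_trap(SYS_read, fd, buf, count);
    }
    #endif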
The first workload was the AIM IX benchmark whose goal is to measure the number of system calls per unit time for a wide range of Unix system calls. Table 1 summarizes the results of the AIM IX benchmark. The first column describes the benchmark subtest. The second column shows the elapsed time for each subtest without the presence of the TFT supervisor. The subsequent pairs of columns show the elapsed time and normalized performance (relative to column 2 elapsed time) for both a primary-only system and a primary/backup system for each subtest. Not all AIM IX subtests are shown. Those that deal in deterministic operations show normalized performance of 1, as we would expect.
Table 1: AIM IX Benchmark Results

Test Description             Elapsed Time   Primary-Only   Primary-Only   Primary/Backup   Primary/Backup
                             (non-TFT)      Elapsed Time   Normalized     Elapsed Time     Normalized
                                                           Performance                     Performance
Program Loads (exec)              2.52           2.86         1.134            8.05            3.196
Task Creations (fork)             2.44           3.41         1.399            4.85            1.986
Random Disk Reads               161.36         160.94         0.997          348.72            2.161
Random Disk Writes              177.54         175.23         0.986          238.45            1.343
Sequential Disk Reads            51.70          57.31         1.108          665.10           12.864
Sequential Disk Write            75.01          79.96         1.066          282.17            3.762
Disk Copy                       128.61         139.90         1.088          964.40            7.499
Sync Random Disk Write          742.34         744.84         1.003          755.14            1.017
Sync Sequential Disk Write      698.27         700.91         1.004          711.89            1.020
Sync Disk Copy                  707.66         708.91         1.002          745.43            1.054
Directory Searches               57.11          58.86         1.031           61.12            1.070
By the nature of the benchmark design, no real work is being done with any disk blocks that are read or written. The normalized performance shown is representative of interposition overhead relative to the duration of the system call being interposed upon. Note the poorer performance of each class of disk read compared to the corresponding class of disk write. For instance, the random disk read has a normalized performance of 2.161 while the random disk write has a normalized performance of 1.343. This is because the read must communicate the data block from the primary to the backup, while the write need only communicate the return result. Because the duration of synchronous disk read/write operations is large compared to the interposition overhead, normalized performance is close to 1 for those calls. The comparison of primary-only vs. primary/backup results shows the large effect of communication overhead and synchronization delays on performance. Normalized performance for the primary-only case is uniformly close to 1, while normalized performance for the primary/backup case ranges from 1 to 12.8.

The second workload was the gzip compression utility. By changing the compression level for gzip, we could see the effects of different CPU to I/O ratios. The gzip benchmark can be summarized as a cycle of 'read block; compress' with a 'write block' performed when the compressed data fills the output block. Figure 5 shows a histogram of the normalized elapsed time performance of gzip compression at 5 different compression levels. Each bar is decomposed into the time spent in the application outside the operating system (i.e. performing the
compression), the time spent in the read system call, and other elapsed time. The time spent in the write system call was also calculated, but did not have large enough values to show on the graph. The 'other' category is the difference between the above categories and the elapsed time. It includes any other processes scheduled over the elapsed time interval as well as time spent in the supervisor not on behalf of the read or write system calls. The normalized performance ranges from 1.58 for the lowest level of compression to 1.23 for the highest level of compression.

Figure 5. Normalized elapsed-time performance of gzip compression at compression levels 1, 3, 5, 7, and 9, with each bar decomposed into application (compression) time, read system call time, and other time.
4.2 Win32 Interface on Windows NT
Our second prototype performed replica coordination at the Win32 interface on Windows NT, again executing on the Intel Pentium processor architecture. The Win32 interface was selected because of its public interface definition. Windows NT provided an interesting contrast to the Unix work and allowed us to experiment with our multi-threading solution. Interposition of the supervisor used known techniques for replacing the linkage between an executable or user dynamic link library (DLL) and the set of Win32 DLLs with linkage to the supervisor versions of the same Win32 calls. The supervisor then could call the Win32 DLLs directly. This is similar to techniques used for system call spy utilities. As in the first prototype, communication was over 10Mbps Ethernet and used TCP/IP with explicit acknowledgements. Given an application executable for Windows NT that contained function entry point symbols, we were able to construct a specialized object code editing tool for instrumenting the application binary with logical time ticks as described above. Our transformation tool could
also add the supervisor's DLL to the set of DLLs statically referenced by the executable.

One of the important lessons learned from our Win32 prototype is that Win32 is, at best, an approximation of the application/operating system interface of Windows NT. The actual full interface is proprietary to Microsoft and is embedded in the NT DLL that sits below the set of Win32 DLLs. With a proprietary interface, interposition is made difficult, and accurately providing identical semantics becomes impossible. Thus the Win32 interface was chosen instead. Unfortunately, the Win32 interface does not serve as a conduit for all interaction between an application and the operating system (i.e. the interface is not complete). An application or DLL may invoke the NT DLL interface directly if it knows the correct entry points and semantics. This is the case for many Microsoft-provided DLLs. The impact is that the Win32 interface is not adequate for transparent replica coordination for applications that invoke the NT DLL directly (such as some Microsoft server applications, including the Internet Information Server) or indirectly through Microsoft-provided DLLs.

Nonetheless, the Win32 prototype has been useful in corroborating the performance effects of the gzip workload seen in the Unix prototype and in serving as a testbed for our experimentation with kernel-based threads sharing a common address space.
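The following hedged C sketch illustrates the kind of linkage redirection used by the prototype: the application's import of a Win32 call is rebound (by the object code editing tool) to a supervisor entry point, which resolves and calls the real Win32 DLL directly. The names tft_supervisor_init and tft_CreateFileA are hypothetical, and the replica coordination itself is elided.

    #include <windows.h>

    /* Hypothetical sketch of Win32 interposition by linkage replacement. */

    typedef HANDLE (WINAPI *CreateFileA_t)(LPCSTR, DWORD, DWORD,
                                           LPSECURITY_ATTRIBUTES, DWORD,
                                           DWORD, HANDLE);

    static CreateFileA_t real_CreateFileA;

    /* Resolve the genuine Win32 entry point so the supervisor can call the
     * Win32 DLLs directly. */
    BOOL tft_supervisor_init(void)
    {
        HMODULE k32 = LoadLibraryA("kernel32.dll");
        if (k32 == NULL)
            return FALSE;
        real_CreateFileA = (CreateFileA_t)GetProcAddress(k32, "CreateFileA");
        return real_CreateFileA != NULL;
    }

    /* The application's import of CreateFileA is redirected here by the
     * editing tool; the signature and semantics match the Win32 call. */
    __declspec(dllexport) HANDLE WINAPI
    tft_CreateFileA(LPCSTR name, DWORD access, DWORD share,
                    LPSECURITY_ATTRIBUTES sa, DWORD disposition,
                    DWORD flags, HANDLE template_file)
    {
        /* Replica coordination (primary performs the call, backup replays the
         * returned handle through the VPE mapping) would be applied here. */
        return real_CreateFileA(name, access, share, sa, disposition,
                                flags, template_file);
    }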
5 Related Work
Transparent fault tolerance has been well explored for replication management interfaces below the operating system. Most of these solutions fall into the space of hardware fault tolerance and include systems such as the Stratus line, the VAXft, and designs from Tandem. The survey by Siewiorek and Swarz [18] details many of these designs. The hypervisor work of Bressoud and Schneider [4] mentioned previously is also below the operating system. These systems are all vulnerable to a large class of design faults in the operating system [5].

Other solutions have modified an operating system or restructured it while maintaining an existing application interface. The seminal work on the Auragen system [3] and even the restructuring of Novell's NetWare operating system [12] are examples of this approach.

Recent work by Tandem [7] also pursues the goal of transparency, but provides no framework for sources of non-determinism other than input and output operations. It also restricts the delivery of exceptions such that the general case (as described above) is not addressed. Finally, the virtualization provided by our VPE is not addressed.
6 Summary and Ongoing Research
The software system described in this paper provides application-transparent fault tolerance by interposing replica-coordination functionality above the operating system and below the application. The work presents a general framework for the problem of transparent interposition for replica coordination as well as the specific TFT solution, which addresses exception delivery and, through the VPE, provides consistent interface semantics across a failure. We have shown that the approach has sufficient potential to justify further research, implementation, and experimentation.

Nonetheless, much work remains to be done. The two major areas of current investigation are the complete functionality required of the VPE and its efficient implementation, and the exploration of supporting both (i) multiple kernel threads sharing the same process address space, and (ii) multiple processes communicating through shared memory. Finally, the described work has not addressed other components required for building fault-tolerant systems based on this approach. Still to be addressed are the design of fault-tolerant I/O subsystems and the reintegration of a failed replica.
Acknowledgements

I would like to thank the referees for their comments and suggestions. I must also thank other TFT collaborators, including Rick Harper, Brad Glade, Robert Cooper, Ken Birman, Ian Service, and Fred Schneider.
References

[1] Alsberg, P.A. and Day, J.D. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd International Conference on Software Engineering, pages 627-644, 1976.

[2] Bartlett, J.F. A nonstop kernel. In Proceedings of the 8th Symposium on Operating System Principles, pages 22-29, Dec. 1981.

[3] Borg, A., Blau, W., Graetsch, W., Herrmann, F., and Oberle, W. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 3(1): 63-75, Feb. 1985.

[4] Bressoud, T.C. and Schneider, F.B. Hypervisor-based fault-tolerance. ACM Trans. Comput. Syst. 14(1): 90-107, Feb. 1996.

[5] Castelli, L., Coan, B., Harbison, J.P., and Miller, E.L. Tradeoffs when integrating multiple software components into a highly available application. In Proceedings of the 16th Symposium on Reliable Distributed Systems, pages 121-128, Oct. 1997.

[6] Cutts, R.W., Nikhil, A.M., and Jewett, D.E. Multiple processor system having shared memory with private-write capability. U.S. Patent 4,965,717, Oct. 1990.

[7] Del Vigna, Jr., P. System and method for providing a fault tolerant computer program runtime support environment. U.S. Patent 5,621,885, Apr. 1997.

[8] Elnozahy, E.N., Johnson, D.B., and Wang, Y.M. A survey of rollback-recovery protocols in message passing systems. Carnegie Mellon University Computer Science Technical Report CMU-CS-96-181, Oct. 1996.

[9] Elnozahy, E.N., and Zwaenepoel, W. Manetho: Transparent rollback recovery with low overhead, limited rollback, and fast output commit. IEEE Trans. on Computers 41(5): 526-531, May 1992.

[10] Hopkins, Jr., A.L., Smith, III, T.B., and Lala, J.H. FTMP—A highly reliable fault-tolerant multiprocessor for aircraft. Proceedings of the IEEE 66(10): 1221-1239, Oct. 1978.

[11] Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7): 558-565, July 1978.

[12] Major, D., Powell, K., and Nelbaur, D. Fault tolerant computer system. U.S. Patent 5,157,663, Oct. 1992.

[13] Reid, R. Central processing apparatus for fault-tolerant computing. U.S. Patent 4,453,215, Jun. 1984.

[14] Russinovich, M., and Cogswell, B. Replay for concurrent non-deterministic shared-memory applications. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI), June 1995.

[15] Samson, J.E., Wolff, K.T., Reid, R., Hendrie, G.C., Falkoff, D.M., Dynneson, R.E., Clemson, D.M., and Baty, K.F. Digital data processor with high reliability. U.S. Patent 4,654,857, Mar. 1987.

[16] Schlichting, R., and Schneider, F.B. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst. 1(3): 222-238, Aug. 1983.

[17] Schneider, F.B. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22(4): 299-319, Dec. 1990.

[18] Siewiorek, D.P., and Swarz, R.S. Reliable Computer System Design and Evaluation. Digital Press, Bedford, MA, 1992.

[19] Slye, J.H., and Elnozahy, E.N. Supporting nondeterministic execution in fault-tolerant systems. In Proceedings of the 26th International Symposium on Fault Tolerant Computing Systems, pages 250-259, Jun. 1996.

[20] Smith, III, T.B. Fault tolerant processor concepts and operation. In Proceedings of the 14th International Symposium on Fault Tolerant Computing Systems, pages 158-163, Jun. 1984.

[21] Walker, B.J., Popek, G., English, B., Kline, C., and Thiel, G. The LOCUS distributed operating system. In Proceedings of the 9th Symposium on Operating System Principles, pages 49-75, 1983.