Programming Language Support for Writing Fault-Tolerant Distributed Software Richard D. Schlichting and Vicraj T. Thomas

Abstract Good programming language support can simplify the task of writing fault-tolerant distributed software. Here, an approach to providing such support is described in which a general high-level distributed programming language is augmented with mechanisms for fault tolerance. Unlike approaches based on sequential languages or specialized languages oriented towards a given fault-tolerance technique, this approach gives the programmer a high level of abstraction, while still maintaining flexibility and execution efficiency. The paper first describes a programming model that captures the important characteristics that should be supported by a programming language of this type. It then presents a realization of this approach in the form of FT-SR, a programming language that augments the SR distributed programming language with features for replication, recovery, and failure notification. In addition to outlining these extensions, an example program consisting of a data manager and its associated stable storage is given. Finally, an implementation of the language that uses the x-kernel and runs standalone on a network of Sun workstations is discussed. The overall structure and several of the algorithms used in the runtime are interesting in their own right.

1 Introduction

Programmers faced with choosing a programming language for writing fault-tolerant distributed software—that is, software that must continue to provide service in a multicomputer system despite failures in the underlying computing platform—often have few alternatives. At one end of the spectrum are relatively low-level choices such as assembly language or C, often coupled with a fault-tolerance library such as ISIS [1]. Such an approach can result in good execution efficiency, yet forces the programmer to deal with the complexities of distributed execution and fault-tolerance in a language that is fundamentally sequential. (This work was supported in part by the National Science Foundation under grant CCR-9003161 and the Office of Naval Research under grant N00014-91-J-1015.) At
the other end of the spectrum are high-level languages specifically intended for constructing fault-tolerant applications using a given technique. Examples here include Argus [2] and Plits [3], which support a programming model based on atomic actions. Such languages simplify the problems considerably, yet can be overly constraining if the programmer desires to use fault-tolerance techniques other than the one supported by the language [4]. The net result is that neither option provides the ideal combination of features.

In this paper, we advocate an intermediate approach based on taking a general high-level concurrent or distributed programming language such as Ada [5], CSP [6], or SR [7] and augmenting it with additional mechanisms to facilitate fault-tolerance. Starting with a language of this type offers a number of advantages. For example, unlike a low-level approach, such languages allow the programmer to deal with multiple processes and interprocess communication at a high level of abstraction, thereby simplifying the programming process. Moreover, given a well-designed set of fault-tolerance extensions, such a language can give the programmer a greater degree of flexibility than is found in current higher-level alternatives. Such flexibility allows, for instance, the use of multiple fault-tolerance techniques, something that can be important in certain types of software. In short, if done right, this approach can offer a language that preserves many of the positive attributes of both sets of alternatives.

The specific purpose of this paper is to elaborate on this approach, in two ways. First, we present a programming model based on the notion of fail-stop modules that captures the characteristics needed for a language oriented towards writing fault-tolerant distributed software.
Second, we describe a realization of this approach in the form of FT-SR, a programming language based on augmenting the SR distributed programming language with additional mechanisms for fault-tolerance. FT-SR has been implemented using the x-kernel, an operating system designed for experimenting with communication protocols [8], and runs standalone on a network of Sun workstations. The implementation structure and several of the algorithms used in the runtime system are also interesting in their own right.

We restrict our attention in this paper to failures suffered by processors with fail-silent semantics—that is, where the only failures are assumed to be a complete cessation of execution activity—although the approach generalizes to other failure models as well.

2 Fail-Stop Modules and Program Design

A fail-stop (or FS) module is an abstract unit of encapsulation. Such a module contains one or more threads of execution, which implement a collection of operations that are exported and made available for invocation by other FS modules. When such an invocation occurs, the operation normally executes to completion as an atomic unit, despite failures and concurrent execution. The failure resilience of an FS module is increased either by composing modules to form complex FS modules, or by using recovery techniques within the simple module itself. Replicating a module N times on separate processors to create a high-level abstract module that can survive N-1 failures is an example of the former [9], while including a recovery protocol that reads a checkpointed state from stable storage [10] is an example of the latter.

The other key aspect of FS modules is failure notification. Notification is generated whenever a failure exhausts the redundancy of a (simple or complex) FS module, resulting in complete failure of the abstraction being implemented. The notification can then be fielded by other modules that use the failed module so that they can react to the loss of functionality. For example, if N-fold replication is used to construct a complex FS module, notification would be generated should a failure destroy the Nth copy, assuming no recovery. We refer to a failure that exhausts redundancy in this way as a catastrophic failure. Notification is also generated if a module is explicitly destroyed by programmer action. Note that the analogy to fail-stop processors [11] implied by the term “fail-stop modules” is strong: in both cases, either the abstraction is maintained (processor or module) or notification is provided. FS modules are also similar in some respects to the “Ideal Fault-Tolerant Components” described in [12].

FS modules form the building blocks out of which a fault-tolerant distributed program can be constructed.
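The FS-module contract described above can be sketched in ordinary code. The following Python fragment is an illustrative model only, not FT-SR: the class names and the way failures are simulated are assumptions made for this sketch. It shows the two defining behaviors: an operation either completes atomically or, once redundancy is exhausted, a failure notification reaches the caller.

```python
# Hypothetical model of an FS module (not FT-SR): an operation either
# completes, or a FailureNotification is raised once all replicas are gone.

class FailureNotification(Exception):
    """Raised when a catastrophic failure exhausts a module's redundancy."""

class FSModule:
    def __init__(self, replicas):
        self.replicas = list(replicas)   # one entry per live replica

    def crash_one(self):
        # Simulate the failure of a single replica.
        if self.replicas:
            self.replicas.pop()

    def invoke(self, op, *args):
        # The operation completes as long as one replica survives; otherwise
        # the abstraction itself has failed and the caller is notified.
        if not self.replicas:
            raise FailureNotification("module redundancy exhausted")
        return op(*args)

# N-fold replication survives N-1 failures: with N = 3, two crashes are fine.
mod = FSModule(replicas=["r1", "r2", "r3"])
mod.crash_one()
mod.crash_one()
result = mod.invoke(lambda x: x + 1, 41)   # one replica left: still works
mod.crash_one()                            # Nth copy destroyed
try:
    mod.invoke(lambda x: x + 1, 41)
except FailureNotification:
    outcome = "notified"
```

In the real model the notification is fielded by dependent modules rather than by a local exception handler, but the either/or guarantee is the same.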
As an example, consider the simple distributed banking system shown in Figure 1. Each box represents an FS module, with the dependencies between modules represented by arrows [13].

[Figure 1: Fault-tolerant system structured using FS modules. A Transaction Manager (withdraw, deposit, transfer) uses a Stable Storage module (read, write) and, on each of Host 1 and Host 2, a Data Manager (startTransaction, prepareToCommit, read/write, commit, abort) with its own Stable Storage (read, write) and Lock Manager (lock, unlock).]

User accounts are assumed to be partitioned across two processors, with each data manager module managing the collection of accounts on its machine. The user interacts with the transaction manager, which in turn uses the data managers and a stable storage module to implement transactions using, for example, the two-phase commit protocol [14] and logging. The data managers export operations to read and write user accounts, and to implement the two-phase commit protocol. The stable storage modules are used to store the user data and to maintain key values for recovery purposes. The lock managers are used to control concurrent access.

To increase the overall system dependability, the constituent FS modules would be constructed using fault-tolerance techniques. For example, the transaction and data managers might use recovery protocols to ensure data consistency following failure. Similarly, stable storage might be replicated. The failure notification aspect of FS modules can be used to allow modules to react to the failures of modules upon which they depend. If such a failure cannot be tolerated, it may, in turn, be propagated up the dependency graph. At the top


level, this would be seen as the catastrophic failure of the transaction manager and hence of the system. This might occur, for example, should the redundant copies of the stable storage module all fail concurrently.

The failure notification and composability aspects of FS modules are what make this programming model so useful. Ideally, a fault-tolerant program behaves as an FS module: commands to the program are executed completely or a failure notification is generated. This assures users that, absent notification, their commands have been correctly processed. Such a program is much easier to develop if each of its components is in turn implemented by FS modules. Since component failures are detectable, other components do not have to implement complicated failure detection schemes or deal with erroneous results. These components may in turn be implemented by other FS modules, with this process continuing until the simplest components are implemented by simple FS modules. At each level, the guarantees made by FS modules simplify the composition process.

3 The FT-SR Language

The programming model presented in the previous section provides a framework and rationale to guide the design of fault-tolerance extensions for a high-level distributed programming language. Here, we present FT-SR, the result of following this design process for SR. To support the model, the language has provisions for encapsulation based on SR resources, resource replication, recovery protocols, and both synchronous and asynchronous failure notification. Familiarity with SR is assumed, although many of its constructs should be intuitive; details can be found in [7, 15].

3.1 Simple FS Modules

Most distributed programming languages, including SR, have module constructs that provide many of the properties needed to realize a simple FS module. In SR, these modules are called resources. Each resource is populated by a varying number of processes that implement operations exported for invocation from other resources. As an example, consider

resource lock_manager
  op get_lock(cap client) returns int
  op rel_lock(cap client; int)
body lock_manager
  var ... variable declarations ...
  process lock_server
    do true ->
      in get_lock(client_cap) and lock_available() ->
           ... mark lock_id as being held by client_cap ...
           return lock_id
      [] rel_lock(client_cap, lock_id) ->
           ... release lock ...
           return
      ni
    od
  end lock_server
end lock_manager

Figure 2: Lock Manager resource

the simple lock manager resource shown in Figure 2. This resource contains a single process that exports two operations, get_lock and rel_lock. If a client invokes the get_lock operation and the lock is available, a lock id is returned and the client can proceed. If the lock is unavailable, the client is blocked at the first guard of the input statement, a multiway receive with semantics similar to Ada’s Select statement. get_lock takes as its argument the capability of the invoking client, which is used as an identifier.

Given that resources export operations and contain multiple processes, the only aspect of simple FS modules that SR does not support directly is failure notification. Accordingly, FT-SR includes provisions for both generating and fielding such notifications. The language runtime is responsible for generating notifications when processor failures are detected, so further discussion of that part is deferred to Section 4. For fielding notifications, FT-SR supports two different models. The first is synchronous with respect to a call; in this case, the notification is fielded by an optional backup operation specified in the calling statement. The second is asynchronous; in this case, the programmer specifies a resource to be monitored and an operation to be invoked should the monitored resource fail.

To understand the need for these two kinds of failure notification, consider what might happen if the lock manager shown in Figure 2 or any of its clients fail. If the lock manager fails, all clients that are blocked on its input statement will remain blocked forever. To

resource client
  op ...
  op ...
body client()
  var lock_id: int
  op mgr_failed(cap client) returns int
  ...
  lock_id := call {lock_mgr_cap.get_lock, mgr_failed}(myresource())
  ...
  proc mgr_failed(client_cap) returns lock_err
    return LOCK_ERR
  end mgr_failed
end client

Figure 3: Outline of Lock Manager client

handle this situation, clients can use the synchronous failure notification facility to unblock themselves and take some recovery action. Figure 3 shows the outline of a client structured in this way. Bracketed with the normal invocation is the capability for a backup operation, mgr_failed. The backup is invoked should the original call fail, i.e., if the lock manager fails to reply within a certain amount of time (see Section 4 for details). In this example, the backup operation is implemented locally, although it could just as easily have been implemented in another resource. Note that the backup is called with the same arguments as the original operation, implying that the two operations must be type compatible.

Consider now the inverse situation, where a client fails while holding a lock. The server can use the FT-SR asynchronous failure notification facility to detect such a failure and release the lock, as shown in Figure 4. Here, monitor is used to enable monitoring of the client instance specified by the capability client_cap. If the client is down when the statement is executed, or should it subsequently fail, rel_lock will be implicitly invoked by the language runtime system with client_cap and lock_id as arguments. Monitoring is terminated by monitorend or by another monitor statement that specifies the same resource.
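The synchronous call-with-backup pattern of Figure 3 can be approximated outside FT-SR as a call that falls back to a type-compatible backup operation when the primary target fails to reply. The Python sketch below is an assumption-laden model, not FT-SR's runtime interface: an exception stands in for the runtime's timeout-based failure detection, and the operation names mirror the lock-manager example.

```python
# Sketch of FT-SR's synchronous failure notification: if the called
# resource has failed, the runtime invokes the backup operation with the
# same arguments. Here a raised exception models the failure detection.

LOCK_ERR = -1

def call_with_backup(primary, backup, *args):
    try:
        return primary(*args)
    except ConnectionError:          # stands in for "lock manager failed"
        return backup(*args)         # backup receives the same arguments

def get_lock(client_cap):
    # Model a failed lock manager that never replies.
    raise ConnectionError("lock manager is down")

def mgr_failed(client_cap):
    # The client's backup operation: unblock and report an error.
    return LOCK_ERR

lock_id = call_with_backup(get_lock, mgr_failed, "client-7")
```

The key property modeled here is the type compatibility noted in the text: because the backup is called with the original arguments, the caller is unblocked with a value of the expected result type.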


resource lock_manager
  op get_lock(cap client) returns int
  op rel_lock(int)
body lock_manager
  var ... variable declarations ...
  process lock_server
    do true ->
      in get_lock(client_cap) and lock_available() ->
           ... mark lock_id as being held by client_cap ...
           monitor client_cap send rel_lock(client_cap, lock_id)
           return lock_id
      [] rel_lock(client_cap, lock_id) ->
           ... release lock if held by client_cap ...
           monitorend client_cap
           return
      ni
    od
  end lock_server
end lock_manager

Figure 4: Lock Manager with client monitoring

3.2 Composition and other Fault-Tolerance Mechanisms

FT-SR provides mechanisms for using recovery techniques within simple FS modules and for composing simple FS modules using replication. The replication facility allows multiple copies of a resource to be created, with the language and runtime providing the illusion that the collection is a single resource instance exporting the same set of operations. The SR create statement has been generalized to allow for the creation of such replicated resources, which we call a resource group. For example, the statement

  lock_mgr_cap := create (i := 1 to N) lock_manager() on vm_caps[i]

creates a resource group with N identical instances of the resource lock_manager on the SR virtual machines specified by the array vm_caps. The value returned is a resource capability that provides access to the operations implemented by the new resource group. In particular, this capability is a resource group capability that allows multicast invocation of any of the group’s exported operations. In other words, using this capability in a call or a send causes the invocation to be multicast to each of the individual resource instances that make up the group.


A multicast invocation provides certain guarantees. One is that all such invocations are delivered to the runtime of each resource instance in a consistent total order, although the program may vary this if desired. This means, for example, that if two operations implemented by alternatives of an input statement are enabled simultaneously, the order in which they will be executed is consistent across all functioning replicas unless explicitly overridden. Moreover, the multicast is also done atomically, so that either all functioning replicas receive the invocation or none do. This combination of properties means that a multicast invocation is equivalent to an atomic broadcast, a facility that has proven useful for constructing many types of fault-tolerant distributed systems [16, 17, 18, 19].

Provisions are also made for coordinating outgoing invocations generated within a resource group. There are two kinds of invocations that can be generated by a group member. The first is a private invocation, which a member uses to communicate with a resource instance individually, without coordination with other group members. This can be used, for example, to allow each replica to have its own set of private resources. The other is a group invocation, which a group uses to generate a single outgoing invocation on behalf of the entire group. To distinguish between these two kinds of communication, FT-SR supports capability variables of type private_cap. Invocations made using a private capability are considered private communication and are not coordinated with invocations from other group members. Invocations using regular capability variables are, however, group invocations that generate exactly one invocation: the invocation is actually transmitted when the first member reaches the statement, with later instances being suppressed by the language runtime system.
Note that either type of invocation will be a multicast invocation if the capability is a resource group capability.

FT-SR also provides the programmer with the ability to restart a failed resource instance on a functioning virtual machine. The recovery code to be executed in this situation is denoted by the keywords recovery and end. Restart can be either explicit or implicit. An explicit restart is done by

  restart lock_mgr_cap() on vm_cap

which restarts the resource indicated by lock_mgr_cap and executes any specified recovery code. An entire resource group can be restarted using syntax similar to the create statement. In both cases, the restarted resource instance is, in fact, a re-creation of the failed instance and not a new instance. This means, for example, that its operations can be invoked using any capability values obtained prior to the failure.

Implicit restart is indicated by specifying backup virtual machines when a resource or resource group is created. For example, the final clause of

  create lock_mgr() on vm_cap backups on vm_caps_array

specifies that the lock manager be restarted on one of the backup virtual machines in vm_caps_array should the original instance fail. The backups on clause may also be used in conjunction with the group create statement; in this case, a group member is automatically restarted on a backup virtual machine should it fail. This facility allows a resource group to automatically regain its original level of redundancy following a failure.

Another issue concerning restart is determining when the runtime of the recovering resource instance begins accepting outside invocations. In general, the resource is in an indeterminate state while performing recovery, so messages are only accepted after the recovery code has completed. The one exception is if the recovering instance itself initiates an invocation during recovery; in this case, invocations are accepted starting from the time that particular invocation terminates. This facilitates a system organization in which the recovering instance retrieves state variables from other resources during recovery.
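The group-invocation rule of Section 3.2—each replica executes the same send statement, but the runtime forwards exactly one instance and suppresses the rest—can be sketched as follows. This Python model is a simplification under stated assumptions: the tagging scheme used to identify a statement instance across members is invented for the sketch, not taken from the FT-SR runtime.

```python
# Sketch of group invocations: three replicas all reach the same send
# statement, but only one invocation leaves the group. A per-statement tag
# (an assumption of this model) identifies duplicate instances.

class GroupRuntime:
    def __init__(self):
        self.seen = set()      # statement instances already forwarded
        self.delivered = []    # what the callee actually receives

    def group_send(self, tag, target, *args):
        if tag in self.seen:
            return             # duplicate from another member: suppressed
        self.seen.add(tag)
        self.delivered.append(target(*args))

rt = GroupRuntime()
log = []

def op(x):
    # The remote operation: record each invocation it services.
    log.append(x)
    return x

# All three replicas of the group execute the same statement instance:
for member in range(3):
    rt.group_send("stmt-17", op, "withdraw")
```

After the loop, the operation has been invoked exactly once, which is the "exactly one invocation" guarantee the text describes.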

3.3 Distributed Banking System Example

As an example of how the FT-SR collection of mechanisms can be used in concert to construct a fault-tolerant application, consider the data manager and stable storage modules from the distributed banking example outlined in Section 2. This example also illustrates the ease with which different fault-tolerance techniques can be used within the same program. The data manager controls concurrency and provides atomic access to data items on stable storage. For simplicity, we assume that all data items are of the same type and are referred

resource dataManager
  imports globalDefs, lockManager, stableStore
  op startTransaction(tid: int; dataAddrs: addrList; numDataItems: int)
  op read(tid: int; dataAddrs: addrList; data: dataList; numDataItems: int)
  op write(tid: int; dataAddrs: addressList; data: dataList; numDataItems: int)
  op prepareToCommit(tid: int), commit(tid: int), abort(tid: int)
body dataManager(dmId: int; lmcap: cap lockManager; ss: cap stableStore)
  type transInfoRec = rec(tid: int; transStatus: int; dataAddrs: addressList;
                          currentPointers: intArray; memCopy: ptr dataArray;
                          numItems: int)
  var statusTable[1:MAX_TRANS]: transInfoRec; statusTableMutex: semaphore
  initial
    # initialize statusTable
    ...
    monitor(ss) send failHandler()
    monitor(lmcap) send failHandler()
  end initial

  ... code for startTransaction, prepareToCommit, commit, abort, read/write ...

  proc failHandler()
    destroy myresource()
  end failHandler

  recovery
    ss.read(statusTable, sizeof(statusTable), statusTable)
    transManager.dmUp(dmId)
  end recovery
end dataManager

Figure 5: Outline of dataManager resource

to by a logical address. Stable storage is read by invoking its read operation, which takes as arguments the address of the block to be read, the number of bytes, and a buffer in which the values read are to be returned. Data is written to stable storage by invoking an analogous write operation.

Figure 5 shows an outline of such a data manager. As can be seen from its specification, the data manager imports stable storage and lock manager resources, and exports six operations. startTransaction is invoked by the transaction manager to access data held by the data manager; its arguments are a transaction identifier tid and a list of addresses of the data items used during the transaction. read and write are used to access and modify objects. prepareToCommit and commit are invoked in succession upon completion: the first commits any modifications made to the data items by the transaction, and the second completes the transaction. abort is used to abandon any modifications and terminate the

transaction; it can be invoked at any time up to the time commit is first invoked. All these operations are implemented as SR procs, which means that each invocation results in the creation of a new thread to service it. Finally, the data manager contains initial and recovery code, as well as a failure handler proc that deals with the failure of the lockManager and stableStore resources.

The data manager depends on the stable storage and lock manager resources to implement its operations correctly and so needs to be informed when they fail catastrophically. The data manager does this by establishing an asynchronous failure handler, failHandler, using the monitor statement. When invoked, failHandler terminates the data manager resource, thereby causing the failure to be propagated to the transaction manager. The failure of the data manager itself is handled by recovery code that retrieves the current contents of key variables from stable storage. It is the responsibility of the transaction manager to deal with transactions that were in progress at the time of the failure: those for which commit had not yet been invoked are aborted, while commit is reissued for the others. To handle this, the recovery code sends a message to the transaction manager notifying it of the recovery.

Stable storage is implemented in our example by creating a storage resource and replicating it to increase failure resilience, as shown in Figure 6. Replica failures are dealt with by restarting the resource on another machine; this is done automatically by specifying backup virtual machines when stableStore is created (see Figure 7). A replica’s recovery code starts by requesting the current state from the other group members. All replicas respond to this request; the first response is received, while the others remain queued at the recvState operation until the replica is either destroyed or fails. The newly restarted replica begins processing queued messages upon finishing recovery.
Since messages are queued from the point sendState is invoked, subsequent messages can be applied to the state normally to re-establish consistency. The main resource that starts up the entire system is shown in Figure 7. Resource main creates a virtual machine on each of three physical machines. Two replicas of the stable storage module are then created, with the third virtual machine being used as a backup


resource stableStore
  import globalDefs
  op read(address: int; numBytes: int; buffer: charArray)
  op write(address: int; numBytes: int; buffer: charArray)
  op sendState(sscap: cap stableStore)
  op recvState(objectStore: objList)
body stableStore
  var store[MEMSIZE]: char
  process ss
    do true ->
      in read(address, numBytes, buffer) ->
           buffer[1:numBytes] := store[address:address+numBytes-1]
      [] write(address, numBytes, buffer) ->
           store[address:address+numBytes-1] := buffer[1:numBytes]
      [] sendState(rescap) ->
           send rescap.recvState(store)
      ni
    od
  end ss
  recovery
    send mygroup().sendState(myresource())
    receive recvState(store)
    send ss
  end recovery
end stableStore

Figure 6: stableStore resource

machine. The two data managers are then created, followed by the transaction manager.

This banking example has been implemented and tested. In addition, a number of other examples have been programmed using FT-SR to test its appropriateness for writing a variety of fault-tolerant distributed programs, as well as the larger thesis that high-level distributed programming languages are suitable for software of this type. These include a fault-tolerant version of the Dining Philosophers problem that shows how a single monitor statement can be used to implement a group membership service [20, 21], and a distributed word game that exploits multiple processors for increased performance as well as fault-tolerance. A description of all these examples together with complete code can be found in [22].
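The state-transfer step of the stableStore recovery code in Figure 6—multicast a sendState request to the group, apply the first recvState reply, and leave the remaining replies queued—can be approximated by the following Python sketch. The data layout and function name are illustrative assumptions; FT-SR's actual message queueing is done by the runtime.

```python
# Approximation of stableStore recovery: all surviving replicas answer the
# restarted replica's sendState request; the first reply becomes the new
# store, while the rest stay queued (modeled here as a list).

def recover(replica_states):
    # replica_states: the 'store' contents of each surviving replica.
    replies = [state.copy() for state in replica_states]  # every replica replies
    store = replies[0]     # the first reply is received and applied...
    queued = replies[1:]   # ...the others remain queued at recvState
    return store, queued

survivors = [{"acct1": 100, "acct2": 250},
             {"acct1": 100, "acct2": 250}]
store, queued = recover(survivors)
```

Because messages are queued from the moment the request is issued, updates that arrive after the snapshot can simply be applied in order once recovery finishes, re-establishing consistency with the rest of the group.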

3.4 Language Design Issues

The fault-tolerance mechanisms of FT-SR are designed with two important considerations in mind. The first is that the mechanisms be orthogonal, so that any interplay between them does not result in unexpected behavior. The second is that, whenever possible, these mechanisms use or form natural extensions to existing SR mechanisms. These considerations

resource main
  imports transManager, dataManager, stableStore, lockManager
body main
  var virtMachines[3]: cap vm            # array of virtual machine capabilities
  var dataSS[2], tmSS: cap stableStore   # capabilities to stable stores
  var lm: cap lockManager; dm[2]: cap dataManager  # lock and data managers

  virtMachines[1] := create vm() on "host1"
  virtMachines[2] := create vm() on "host2"
  virtMachines[3] := create vm() on "host3"   # backup machine

  # create stable storage for use by the data managers and the transaction manager
  dataSS[1] := create (i := 1 to 2) stableStore() on virtMachines[i]
                 backups on virtMachines[3]
  dataSS[2] := create (i := 1 to 2) stableStore() on virtMachines[i]
                 backups on virtMachines[3]
  tmSS := create (i := 1 to 2) stableStore() on virtMachines[i]
                 backups on virtMachines[3]

  # create lock manager, data managers, and transaction manager
  lm := create lockManager() on virtMachines[2]
  fa i := 1 to 2 ->
    dm[i] := create dataManager(i, lm, dataSS[i]) on virtMachines[i]
  af
  tm := create transManager(dm[1], dm[2], tmSS) on virtMachines[1]
end main

Figure 7: System startup in resource main

preserve the semantic integrity of the language and at the same time keep it relatively simple and therefore easy to understand and use. We illustrate these points with several examples.

FT-SR provides mechanisms for monitoring, failure handling, restarts, and replication, all of which can be meaningfully combined to achieve different effects. For example, both the monitor statement and backup operations work with groups just as they do with resources. In either case, a failure notification is generated when no resource or resource group member is available to handle invocations. Similarly, the restart statement can be used to restart entire groups, group members, or individual resources, with the same rules for execution of recovery code and acceptance of new invocations being used in each case. Another example is that an operation implemented by a resource group can be used in the same way as one implemented by a single resource, since the two capability values are indistinguishable. In particular, group operations may be specified as failure handlers in monitor statements or as backup operations in call statements, as well as in normal invocations. The parallels between resource groups and resources also extend to invocations from a group; it is impossible to tell whether an invocation originated from a group or an individual

resource. The second aspect of good language design is that wherever possible, the fault-tolerance mechanisms of FT-SR are integrated into existing SR mechanisms. For example, the group create statement is a natural extension of the SR resource create statement, both in terms of its syntax and semantics. Furthermore, a failure handler is essentially an operation that is invoked as a result of a failure and is therefore expressed using existing language mechanisms. Emphasizing these two aspects of language design has numerous advantages. The orthogonality of the FT-SR mechanisms allows a small set of mechanisms to be combined in different ways to achieve different effects; the lack of restrictions or special cases governing this combination eliminates any programming pitfalls that can snare a novice programmer. The use of existing SR mechanisms keeps the language small and easy to learn, while allowing the fault-tolerance aspects of the language to be blended with its concurrency aspects. All these considerations lead to a logically and aesthetically integrated language design.

4 Implementation and Performance

4.1 Overview

The FT-SR implementation consists of two major components: a compiler and a runtime system. Both are written in C and borrow from the existing implementation of SR where possible. In fact, the FT-SR compiler is almost identical to the SR compiler, which is to be expected since FT-SR is syntactically close to SR. The compiler is based on lex and yacc, and consists of about 16,000 lines of code. It generates C code, which is in turn compiled by a C compiler and linked with the FT-SR runtime system.

The FT-SR runtime system, which is significantly different from that of SR, provides primitives for creating, destroying, and monitoring resources and resource groups, handling failures, restarting failed resources, invoking and servicing operations, and a variety of other miscellaneous functions. It consists of 9600 lines of code and is implemented using version 3.1 of the x-kernel. The major advantage of such a bare-machine implementation is that it facilitates experimentation with realistic fault-tolerant software systems when compared

to systems built, for example, on top of Unix. In addition, the x-kernel provides a flexible infrastructure for composing communication protocols, something that has proven to be very useful in building the variety of protocols required for the FT-SR runtime system. Figure 8 shows the organization of the FT-SR runtime system on a single processor. As shown, each FT-SR virtual machine exists in a separate x-kernel user address space. In addition to the user program, a virtual machine contains those parts of the runtime system that create and destroy resources, route invocations to operations on resources, and manage intra-virtual machine communication. This user resident part accounts for about 85% of the runtime system and the kernel resident part the remaining 15%. The important runtime system modules and communication paths are also illustrated in Figure 8. The Communication Manager consists of multiple communication protocols that provide point-to-point and broadcast communication services between processors. The VM Manager is responsible for creating and destroying virtual machines, and for providing communication services between virtual machines. The Processor Failure Detector (PFD) is a failure detector protocol; it monitors processors and notifies the VM manager when a failure occurs. In user space, the Resource Manager is responsible for creating, destroying and restarting resources, while the Group Manager is responsible for the analogous operations on groups, as well as intergroup communication. The Resource Failure Detector (RFD) detects resource failures.

4.2 Novel Features

Three interesting algorithms used within the FT-SR implementation are described in this section. The first is related to group communication and is interesting because it uses a variation of the primary-replica approach to sequence invocations to a group. The second is related to group reconfiguration and is interesting because no expensive election protocols are used. Both of these algorithms exploit a system parameter max_sf, the maximum number of simultaneous failures to be tolerated, to optimize performance. The third algorithm is the failure detection and notification algorithm. It is interesting because it is implemented by three modules at different levels of the system, with each module using the services provided

by the one below it.

[Figure 8: Organization of the FT-SR runtime system. The diagram shows, for each virtual machine in user space, the user program together with its Invocation Manager, Resource Manager, Group Manager, and RFD; the VM Manager, PFD, and Communication Manager reside in kernel space.]

Group Communication. Perhaps the most interesting aspect of replication is the algorithm used to implement multicast invocations. The technique we use is similar to [23, 24], where one replica is a primary through which all messages are funneled. Another max_sf replicas are designated as primary-group members, with the remaining being considered ordinary members. Upon receiving a message, the primary adds a sequence number and multicasts it to all replicas. Upon receipt, (only) primary-group members send acknowledgments. Once the primary gets these max_sf acknowledgments, it sends an acknowledgment to the original sender of the message; this action is appropriate since the receipt of this many acknowledgments guarantees that at least one replica will have the message even should max_sf failures actually occur.

The primary is also involved in outgoing group invocations. In such situations, the runtime system suppresses the invocation from all non-primary group members. When the primary receives an acknowledgment that its invocation has been received, it relays that information to the other group members.
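As a rough illustration, the sequencing and acknowledgment rule above can be sketched as follows. This is a synchronous Python sketch under simplifying assumptions: the class and method names (`Primary`, `Replica`, `on_invocation`) are ours rather than the FT-SR runtime's, and real message passing, timeouts, and failure handling are elided.

```python
# Hypothetical sketch of primary-sequenced group multicast with max_sf
# acknowledgments; names are illustrative, not from the FT-SR runtime.

class Replica:
    """A group member; only primary-group members acknowledge receipt."""
    def __init__(self):
        self.in_primary_group = False
        self.log = []                 # (sequence number, message) pairs

    def receive(self, seq, msg):
        self.log.append((seq, msg))
        return self.in_primary_group  # True means "send an acknowledgment"

class Primary:
    """The replica through which all incoming invocations are funneled."""
    def __init__(self, members, max_sf):
        self.members = members
        self.max_sf = max_sf
        self.seq = 0
        for m in members[:max_sf]:    # designate the primary-group members
            m.in_primary_group = True

    def on_invocation(self, sender, msg):
        # Sequence the message and multicast it to every replica.
        self.seq += 1
        acks = sum(m.receive(self.seq, msg) for m in self.members)
        # max_sf acknowledgments guarantee that at least one replica
        # retains the message even if max_sf replicas subsequently fail.
        if acks >= self.max_sf:
            sender.ack(msg)
```

Because only the primary-group members acknowledge, the acknowledgment cost is bounded by max_sf rather than by the group size, which is what makes the invocation times in Table 1 independent of the number of replicas.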

From      To        Group Size        Invocation Time (msec)
resource  group     1                 3.24
resource  group     2                 6.84
resource  group     3                 6.84
resource  group     3 (max_sf = 2)    8.35
group     resource  3                 7.19
group     group     3                 14

Table 1: Times (in msec) for invocation involving groups

Table 1 shows the cost of invocations to and from resource groups. As can be seen, for groups larger than max_sf + 1, the cost of an invocation to the group is independent of group size, a direct result of the above algorithm. This is especially significant given that a max_sf of one is sufficient for most systems [25]. This gives FT-SR a considerable advantage over systems such as ISIS, where the cost of an invocation grows linearly with the size of the group.

Group Reconfiguration after Failure. The Group Manager at each site is responsible for determining the primary and the members of the primary-group set. Specifically, it maintains a list of all group members and whether each is the primary, a primary-group member, or an ordinary member. This list is ordered consistently at all sites based on the order in which the replicas were specified in the group create statement. This ordering ensures that all Group Managers will independently pick the same primary and assign the same set of replicas to the primary-group set.

The Group Managers are also responsible for dealing with the failure of group members. If the primary fails, the first member of the primary-group is designated as the new primary. This action or the failure of a primary-group member will cause the size of the primary-group to fall below max_sf, so an appropriate number of ordinary members are added to

the primary-group to restore its original size. No special action is needed when an ordinary member fails. If backup virtual machines were specified for the group when it was created and such machines are available, failed replicas are restarted automatically. Restarted replicas join the group as ordinary members.

Failure Detection and Notification. Failure detection in FT-SR is done at three levels: at the processor level by the PFD, at the virtual machine level by the VM Manager, and at the resource level by the RFD. Each PFD monitors the other processors and notifies the local VM Manager of any failures. The VM Manager then maps these processor failures to virtual machine failures and notifies the RFD. The RFD in turn maps virtual machine failures to resource failures and passes this information on to any other runtime system module that has requested failure notification. To detect termination of a resource that is explicitly destroyed, the RFD sends a message to its peer on the appropriate virtual machine, asking to be notified when the resource is destroyed. Similarly, a VM Manager can ask another VM Manager to send a failure notification when a virtual machine is explicitly destroyed.
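The reconfiguration rule makes clear why no election protocol is needed: every Group Manager applies the same deterministic rule to the same consistently ordered member list. A minimal sketch (the function name `reconfigure` and its signature are ours, not the runtime's):

```python
def reconfigure(members, failed, max_sf):
    """Given the creation-ordered member list and the set of failed
    members, every site independently computes the same result: the
    first survivor is the primary, the next max_sf survivors form the
    primary-group, and the rest are ordinary members."""
    alive = [m for m in members if m not in failed]
    primary = alive[0]
    primary_group = alive[1:1 + max_sf]
    ordinary = alive[1 + max_sf:]
    return primary, primary_group, ordinary
```

Since all sites hold the same ordered list and learn of the same failures, each can run this rule locally and arrive at an identical assignment without exchanging any election messages.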
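The three-level mapping described above can be sketched as a chain of notifications, with each module translating failures into the vocabulary of the level above it. In this illustrative sketch the class names match the modules in Figure 8, but the method names and data structures are our own assumptions:

```python
# Hypothetical sketch of FT-SR's three-level failure detection chain:
# PFD (processors) -> VM Manager (virtual machines) -> RFD (resources).

class RFD:
    """Maps virtual machine failures to resource failures and notifies
    any module that has subscribed for a given resource."""
    def __init__(self, resources_on_vm):
        self.resources_on_vm = resources_on_vm  # vm -> [resource, ...]
        self.subscribers = {}                   # resource -> callback
    def monitor(self, resource, callback):
        self.subscribers[resource] = callback
    def vm_failed(self, vm):
        for res in self.resources_on_vm.get(vm, []):
            if res in self.subscribers:
                self.subscribers[res](res)

class VMManager:
    """Maps processor failures to virtual machine failures."""
    def __init__(self, vms_on_processor, rfd):
        self.vms_on_processor = vms_on_processor  # processor -> [vm, ...]
        self.rfd = rfd
    def processor_failed(self, proc):
        for vm in self.vms_on_processor.get(proc, []):
            self.rfd.vm_failed(vm)

class PFD:
    """Detects processor failures and reports them to the VM Manager."""
    def __init__(self, vm_manager):
        self.vm_manager = vm_manager
    def report_failure(self, proc):
        self.vm_manager.processor_failed(proc)
```

Layering the detectors this way means each module needs to understand only one kind of failure, and the same notification path serves both crash failures and explicit destruction of virtual machines or resources.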

5 Conclusions

Numerous programming languages with support for fault tolerance have been developed: some as entirely new languages, some as extensions to existing languages and systems, and some as libraries for existing languages. Examples of new languages include Argus [2], Aeolus [26], and Plits [3]. Examples of extensions include Fault-Tolerant Concurrent C (FTCC) [27], HOPS [28], and the languages described in [29], [30], and [31]. Finally, fault-tolerance library support is provided by Arjuna [32] for C++, and by Avalon [33] for C++, Common Lisp, and Ada.

A distinguishing feature of these languages is the programming model they support. For example, the transaction model is supported by Aeolus, Argus, Avalon, HOPS, Plits, and Arjuna, while the replicated state machine approach [9] is supported by HOPS and FTCC. FT-SR differs from all of the above languages in supporting a model based on FS modules, which


allows any of these other approaches to be programmed easily. Another difference is that FT-SR's design as a set of extensions to a high-level distributed programming language greatly enhances its usability: it simplifies the construction of fault-tolerant distributed programs by allowing the seamless integration of their distribution and fault-tolerance aspects.

Despite these efforts, developing enhanced language support for fault tolerance is, in some sense, a neglected area compared with the numerous efforts to develop new system libraries or network protocols. Our view, however, is that research in this area has the potential to yield significant benefits. By offering a high-level realization of important fault-tolerance abstractions, a language frees programmers from the need to learn implementation details or how a particular library can be used in a given context. The advantages of a single, coherent package for expressing the program should not be underestimated either, especially one based on a high-level distributed programming language that already offers a framework for writing multi-process programs.

This paper has presented such a language-based approach to writing fault-tolerant distributed programs. Although the specifics of our approach are based on extending the SR language, the FS module programming model and design principles could be applied equally well to any similar language. It is also important when designing a language for such applications to pay sufficient attention to the implementation, especially the design of an efficient runtime system. Although confirming experiments are continuing, our expectation is that the user will pay little, if any, performance penalty for the advantages of a high-level language.

Acknowledgments Thanks to G. Andrews, H. Bal, M. Hiltunen, D. Mosberger-Tang, R. Olsson, and the anonymous referees for reading earlier versions of this paper and providing valuable feedback.

References

[1] K. Birman, A. Schiper, and P. Stephenson, “Lightweight causal and atomic group multicast,” ACM Trans. Computer Systems, vol. 9, pp. 272–314, Aug 1991.


[2] B. Liskov, “The Argus language and system,” in Distributed Systems: Methods and Tools for Specification, LNCS, Vol. 190 (M. Paul and H. Siegert, eds.), ch. 7, pp. 343–430, Berlin: Springer-Verlag, 1985.

[3] C. Ellis, J. Feldman, and J. Heliotis, “Language constructs and support systems for distributed computing,” in ACM Symp. on Prin. of Dist. Comp., pp. 1–9, Aug 1982.

[4] H. Bal, “A comparative study of five parallel programming languages,” in Proc. EurOpen Conf. on Open Dist. Systems, May 1991.

[5] U.S. Dept. of Defense, Reference Manual for the Ada Programming Language. Washington, D.C., 1983.

[6] C. A. R. Hoare, “Communicating sequential processes,” Commun. ACM, vol. 21, pp. 666–677, Aug 1978.

[7] G. R. Andrews and R. A. Olsson, The SR Programming Language: Concurrency in Practice. Benjamin/Cummings, 1993.

[8] N. Hutchinson and L. L. Peterson, “The x-Kernel: An architecture for implementing network protocols,” IEEE Trans. Softw. Eng., vol. 17, pp. 64–76, Jan 1991.

[9] F. Schneider, “Implementing fault-tolerant services using the state machine approach: A tutorial,” ACM Computing Surveys, vol. 22, pp. 299–319, Dec 1990.

[10] B. Lampson, “Atomic transactions,” in Distributed Systems—Architecture and Implementation (B. Lampson, M. Paul, and H. Siegert, eds.), ch. 11, pp. 246–265, Springer-Verlag, 1981.

[11] R. Schlichting and F. Schneider, “Fail-stop processors: An approach to designing fault-tolerant computing systems,” ACM Trans. Computer Systems, vol. 1, pp. 222–238, Aug 1983.

[12] P. Lee and T. Anderson, Fault Tolerance: Principles and Practice. Vienna: Springer-Verlag, second ed., 1990.

[13] F. Cristian, “Understanding fault-tolerant distributed systems,” Commun. ACM, vol. 34, pp. 56–78, Feb 1991.

[14] J. Gray, “Notes on data base operating systems,” in Operating Systems, An Advanced Course (R. Bayer, R. Graham, and G. Seegmuller, eds.), ch. 3.F, pp. 393–481, Springer-Verlag, 1979.

[15] G. Andrews et al., “An overview of the SR language and implementation,” ACM Trans. Prog. Lang. and Systems, vol. 10, pp. 51–86, Jan 1988.

[16] F. Cristian, H. Aghili, R. Strong, and D. Dolev, “Atomic broadcast: From simple message diffusion to Byzantine agreement,” in Proc. 15th Fault-Tolerant Computing Symp., pp. 200–206, June 1985.

[17] H. Kopetz et al., “Distributed fault-tolerant real-time systems: The Mars approach,” IEEE Micro, vol. 9, pp. 25–40, Feb 1989.

[18] P. Melliar-Smith, L. Moser, and V. Agrawala, “Broadcast protocols for distributed systems,” IEEE Trans. on Parallel and Distributed Systems, vol. 1, pp. 17–25, Jan 1990.


[19] D. Powell, ed., Delta-4: A Generic Architecture for Dependable Computing. Springer-Verlag, 1991.

[20] F. Cristian, “Reaching agreement on processor-group membership in synchronous distributed systems,” Distributed Computing, vol. 4, pp. 175–187, 1991.

[21] H. Kopetz, G. Grunsteidl, and J. Reisinger, “Fault-tolerant membership service in a synchronous distributed real-time system,” in Dependable Computing for Critical Applications (A. Avižienis and J.-C. Laprie, eds.), pp. 411–429, Wien: Springer-Verlag, 1991.

[22] V. Thomas, FT-SR: A Programming Language for Constructing Fault-Tolerant Distributed Systems. PhD thesis, Dept. of CS, Univ. of Arizona, 1993.

[23] J. Chang and N. Maxemchuk, “Reliable broadcast protocols,” ACM Trans. Computer Systems, vol. 2, pp. 251–273, Aug 1984.

[24] M. F. Kaashoek, A. Tanenbaum, S. Hummel, and H. Bal, “An efficient reliable broadcast protocol,” Operating Systems Review, vol. 23, pp. 5–19, Oct 1989.

[25] J. Gray, “Why do computers stop and what can be done about it,” in Proc. 5th Symp. on Reliability in Dist. Software and Database Systems, pp. 3–12, Jan 1986.

[26] R. LeBlanc and C. T. Wilkes, “Systems programming with objects and actions,” in Proc. 5th Conf. on Distributed Computing Systems, (Denver), pp. 132–139, May 1985.

[27] R. Cmelik, N. Gehani, and W. D. Roome, “Fault Tolerant Concurrent C: A tool for writing fault tolerant distributed programs,” in Proc. 18th Fault-Tolerant Computing Symp., pp. 55–61, June 1988.

[28] H. Madduri, “Fault-tolerant distributed computing,” Scientific Honeyweller, Winter 1986-87, pp. 1–10.

[29] J. Knight and J. Urquhart, “On the implementation and use of Ada on fault-tolerant distributed systems,” IEEE Trans. Softw. Eng., vol. SE-13, pp. 553–563, May 1987.

[30] M. F. Kaashoek, R. Michiels, H. Bal, and A. Tanenbaum, “Transparent fault-tolerance in parallel Orca programs,” in Proc. USENIX Symp. on Exper. with Distributed and Multiprocessor Systems, pp. 297–311, Mar 1992.

[31] R. Schlichting, F. Cristian, and T. Purdin, “A linguistic approach to failure-handling in distributed systems,” in Dependable Computing for Critical Applications (A. Avižienis and J.-C. Laprie, eds.), pp. 387–409, Wien: Springer-Verlag, 1991.

[32] S. Shrivastava, G. Dixon, and G. Parrington, “An overview of the Arjuna distributed programming system,” IEEE Software, vol. 8, pp. 66–73, Jan 1991.

[33] M. Herlihy and J. Wing, “Avalon: Language support for reliable distributed systems,” in Proc. 17th Fault-Tolerant Computing Symp., pp. 89–94, July 1987.
