The Totem System

L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, C. A. Lingley-Papadopoulos, T. P. Archambault

Department of Electrical and Computer Engineering
University of California, Santa Barbara, CA 93106
Abstract

The Totem system supports fault-tolerant applications in which distributed processes cooperate to perform a common task and in which replicated data must be updated consistently in the presence of asynchrony and faults. Reliable totally ordered delivery of messages to processes within process groups is provided on a single local-area network or over multiple local-area networks interconnected by gateways. Message ordering is consistent across the entire network, despite processor and communication faults, without requiring all processes to deliver all messages. The Totem system handles processor failure and recovery, as well as network partitioning and remerging, and provides membership and topology maintenance services.
1 Introduction

The Totem system, developed at the University of California, Santa Barbara, supports applications in which information must be replicated to guard against faults and in which the consistency of information must be maintained as it is updated in the presence of faults. Totem provides reliable totally ordered delivery of messages to processes within process groups. This total order of messages simplifies the application programming needed to maintain the consistency of information and, thus, reduces the risk of errors in programming fault-tolerant applications. Easier programming results in lower costs, shorter development times, and higher reliability.

Causally ordered and unordered message delivery were previously advocated for distributed applications because of the poor performance of previous total ordering protocols, but are rendered unnecessary by the high throughput and low latency of Totem's total ordering protocol. The exceptional performance of Totem results from effective flow control mechanisms and from exploiting the locality of process groups by filtering messages at the gateways.

This work was supported by the National Science Foundation, Grant No. NCR-9016361, by the Advanced Research Projects Agency, Grant No. N00174-93-K-0097, and by Rockwell CMC through the State of California MICRO program, Grant No. 92-101.

Applications that can benefit from Totem's totally ordered message delivery service include air traffic control, industrial automation, transaction processing, banking, stock market trading, intelligent highway, medical monitoring, and replicated database applications. Other reliable ordered message delivery systems with similar objectives for similar applications include Isis [5, 7], Psync [16], Trans and Total [12, 14], Transis [3], Amoeba [9], Delta-4 [18] and Horus [17].

The Totem system operates on a single local-area network or over multiple local-area networks interconnected by gateways. Superimposed on each local-area network is a logical token-passing ring. The fields of the circulating token provide reliable delivery and total ordering of messages, confirmation that messages have been received by all processors, effective flow control, and detection of faults. Consistency of message ordering is provided across the entire system without the need for all processes to deliver all messages.

The membership and topology maintenance services provided by Totem handle processor failure and recovery, as well as network partitioning and remerging. Configuration Change and Topology Change messages delivered to each application process define a sequence of process group configurations. Each process receives all messages multicast to the process group by a member of a configuration, and timestamped within that configuration.

The Totem system hierarchy is shown in Figure 1. The bottom layer of the hierarchy is a collection of local-area networks with best-effort hardware broadcasts or multicasts. The single-ring protocol converts the best-effort multicasts into the service of reliable totally ordered delivery of messages and, in addition, provides fault detection and recovery on a single local-area network. The multiple-ring protocol uses the single-ring protocol to provide system-wide total ordering of messages, as well as system-wide topology and membership services. The multiple-ring protocol, using information from the process group interface above it, forwards messages through the gateways to the rings on which they are required. The process group interface delivers messages to application processes in the appropriate process groups, and provides process group membership services.

[Figure 1: The Totem system hierarchy. Layers from top to bottom, with the labels shown between adjacent layers in parentheses: Application Layer (ordered multicast to process group; process group membership changes), Process Group Interface (globally ordered reliable multicast; network-wide topology changes), Multiple-Ring Protocol (locally ordered reliable multicast; local configuration changes), Single-Ring Protocol (best-effort multicast; absence of messages and timeouts), Physical Medium.]

Detailed descriptions of the protocols that comprise the Totem system can be found in [2, 11], and descriptions of earlier versions of the protocols can be found in [1, 4, 13].

The Totem system has been implemented in the C programming language on Sun IPC workstations running SunOS 4.1, and on Sun SPARCstation 20s running Solaris 2.3, over a 10 Mbit/sec Ethernet. It uses standard UNIX facilities, in particular UNIX UDP sockets to broadcast messages and to transfer the token. The implementation has been ported to several other types of workstations with little modification to the source code.
2 The Model

The Totem system supports fault-tolerant applications within a distributed system in which processors are connected by local-area networks, possibly several local-area networks interconnected by gateways. Totem is designed to protect against communication faults, including message loss and network partitioning. It also protects against processor faults, including crash, omission, and timing faults but not Byzantine or software faults.

In Totem, we distinguish between receipt and delivery of a message as follows: a message is received from the next lower layer of the protocol hierarchy and is delivered in order to the next higher layer. When messages are received, they may not be in order and, thus, the lower layer may need to reorder them before delivering them to the upper layer.

Totem provides two levels of message delivery, agreed and safe, selected by the originator of the message. Delivery in agreed order for a configuration requires that (1) messages are delivered in a total order for that configuration, (2) messages are delivered in the same total order by all processors in the configuration, and (3) every message that precedes a message in the total order for the configuration is delivered before that message. The total order on messages respects Lamport's causal order [10]. Delivery in safe order for a configuration is delivery in agreed order for that configuration and, in addition, requires that (4) a processor knows that the message has been received by all of the other processors in the configuration. An important use of safe delivery is that it allows a processor to reclaim buffer space, because a message that is safe will never need to be retransmitted subsequently.

Continued operation when faults occur poses substantial challenges to maintaining the consistency of message ordering. When a processor fails or the system partitions, it is impossible to be certain which messages were delivered by the processor before it failed or by processors in other components of the partitioned system. Virtual synchrony [6] ensures that processors that are members of the same consecutive configurations deliver the same sequence of messages and configuration changes, but does not constrain the behavior of faulty or isolated processors.

For systems in which faulty processors can be repaired and resume operation with stable storage intact and for systems in which the network can partition and remerge, we have introduced the concept of extended virtual synchrony [4, 15]. Extended virtual synchrony requires the properties of delivery in agreed and safe order within a configuration, even if a processor fails and restarts or if the network partitions and remerges. More importantly, it requires that the total order of messages for a configuration is a subset of a global total order on all messages generated in the system. Extended virtual synchrony allows two processors in different components of a partitioned system to deliver different messages, but does not allow them to deliver the same messages in different orders. When faults occur, extended virtual synchrony is achieved by introducing a transitional configuration with a reduced membership, all members of which are able to honor the agreed and safe message delivery guarantees.
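The ordering constraint of extended virtual synchrony can be illustrated with a small check. The following Python sketch is an illustration only and is not taken from the Totem implementation; the function name `consistent_order` and the representation of a delivery history as a list of unique message identifiers are assumptions made here. It reports whether the messages common to two histories appear in the same relative order, which is the condition stated above for processors in different components of a partitioned system.

```python
def consistent_order(history_a, history_b):
    """Return True if the messages delivered by both processors appear
    in the same relative order in both delivery histories.

    Each history is a list of unique message identifiers in delivery
    order (a hypothetical representation chosen for this sketch).
    """
    common = set(history_a) & set(history_b)
    # Project each history onto the common messages, preserving order.
    proj_a = [m for m in history_a if m in common]
    proj_b = [m for m in history_b if m in common]
    return proj_a == proj_b


# Two components deliver different messages (m3 versus m4), but the
# messages they share are in the same order: this is permitted.
ok = consistent_order(["m1", "m2", "m3"], ["m1", "m2", "m4"])

# The same two messages delivered in opposite orders: not permitted.
bad = consistent_order(["m1", "m2"], ["m2", "m1"])
```

In this sketch `ok` is true and `bad` is false. The check covers only the ordering constraint; it does not model the agreed and safe delivery guarantees or the transitional configuration.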
3 The Totem Single-Ring Protocol

The Totem single-ring protocol provides reliable totally ordered delivery of messages using a logical token-passing ring superimposed on a local-area network, such as an Ethernet. The token circulates around the ring as a point-to-point message; only the processor holding the token can broadcast messages. A sequence number field in the token provides a single sequence of strictly increasing sequence numbers for all messages broadcast on the ring; messages are delivered in sequence number order. The single-ring protocol also provides membership services to handle processor failure and recovery, as well as network partitioning and remerging. To guard against token loss, a token retransmission mechanism has been implemented.
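The way a circulating token yields a single sequence of message numbers can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not the Totem implementation: the names `Token`, `rotate_once` and `pending` are hypothetical, every broadcast is assumed to reach every processor, and message loss, token loss and flow control are ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    seq: int = 0  # sequence number of the last message broadcast on the ring


def rotate_once(token, ring, pending):
    """Pass the token once around the ring.

    ring    -- list of processor ids in token-passing order
    pending -- dict mapping processor id to a deque of payloads to send
    Returns the broadcasts as (seq, sender, payload) tuples.
    """
    broadcasts = []
    for holder in ring:              # only the token holder may broadcast
        queue = pending.get(holder, deque())
        while queue:
            token.seq += 1           # one new sequence number per message
            broadcasts.append((token.seq, holder, queue.popleft()))
    return broadcasts


token = Token()
pending = {"A": deque(["a1", "a2"]), "B": deque(), "C": deque(["c1"])}
log = rotate_once(token, ["A", "B", "C"], pending)
```

Because the sequence number lives in the token and only the holder increments it, the numbers in `log` are strictly increasing across all senders, and a later rotation continues from `token.seq`.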
3.1 Message Ordering

The sequence numbers of the messages are derived from a sequence number field in the token, called seq. The seq field is incremented as each new message is broadcast. Processors recognize missing messages by detecting gaps in the sequence of message sequence numbers, and request retransmissions by inserting the sequence numbers of the missing messages into a retransmission request (rtr) field of the token. If a processor has received a message and all of its predecessors, as indicated by the message sequence numbers, it can deliver that message in agreed order.

The token also contains an all-received-upto (aru) field which enables a processor to determine a sequence number such that all messages with lower sequence numbers have been received by all processors on the ring. Messages with sequence numbers less than or equal to this sequence number can be delivered in safe order.

The Totem single-ring protocol provides effective flow control and, thereby, achieves high throughput and low latency. The flow control mechanisms are based on two limits: the number of messages that can be broadcast by any one processor during a single token visit and the total number of messages that can be broadcast by all processors during a single token rotation. The token also provides information about the aggregate message backlog of the various processors on the ring, which allows a fairer allocation of bandwidth to processors than is achieved by simpler schemes, such as FDDI.

Measurements for the Totem single-ring protocol, with low message loss rates, show a throughput that is two to five times higher than the throughput achieved by competing ordered multicast protocols using similar equipment, and that is comparable to the throughput achieved by TCP/IP for point-to-point communication. Low latency, from message origination to delivery, is maintained even under high message transmission rates.
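The receiver side of the ordering mechanism described in this section can be sketched as follows. The class `Receiver`, its attributes and the helper `ring_aru` are hypothetical names chosen for this illustration. In particular, computing the aru as a minimum over in-memory objects is a simplification assumed here; in Totem the value is carried in the token and updated as the token rotates.

```python
class Receiver:
    """Per-processor receive state (all names are hypothetical)."""

    def __init__(self):
        self.received = {}   # seq -> payload, kept for possible retransmission
        self.my_aru = 0      # all messages 1..my_aru have been received
        self.agreed = []     # payloads delivered in agreed order

    def receive(self, seq, payload):
        if seq <= self.my_aru:
            return                                # duplicate of a delivered message
        self.received.setdefault(seq, payload)
        while self.my_aru + 1 in self.received:   # no gap before the next message
            self.my_aru += 1
            self.agreed.append(self.received[self.my_aru])

    def missing(self, token_seq):
        """Sequence numbers this processor would place in the rtr field."""
        return [s for s in range(self.my_aru + 1, token_seq + 1)
                if s not in self.received]

    def reclaim(self, aru):
        """Discard buffered messages that are safe (seq <= aru)."""
        for s in [s for s in self.received if s <= aru]:
            del self.received[s]


def ring_aru(receivers):
    """Simplified aru: the lowest my_aru over all processors on the ring."""
    return min(r.my_aru for r in receivers)


p, q = Receiver(), Receiver()
for seq, msg in [(1, "a"), (2, "b"), (3, "c")]:
    p.receive(seq, msg)
q.receive(1, "a")
q.receive(3, "c")                  # message 2 was lost on the way to q

rtr = q.missing(token_seq=3)       # q requests retransmission of message 2
aru_before = ring_aru([p, q])      # only message 1 is safe so far
q.receive(2, "b")                  # the retransmission arrives
aru_after = ring_aru([p, q])       # messages 1..3 are now safe
```

Once the aru reaches 3, each processor can call `reclaim(3)` to free its buffers, which mirrors the use of safe delivery for reclaiming buffer space noted in Section 2.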
3.2 Processor Membership

To provide fault tolerance, the Totem single-ring ordering protocol is integrated with a membership protocol that provides membership services to reconfigure the system, including addition of new and recovered processors, deletion of faulty processors, handling of network partitioning, and remerging of components of a partitioned network. Timeouts are used to detect processor faults. New or restarted processors are detected by the appearance of messages on the local-area network from processors that are not members of the current ring.

The membership protocol uses heuristics based on timeouts to identify faulty processors, with a bias toward preserving the current membership. The protocol ensures consensus in that every member of the configuration agrees on the membership of the configuration, and termination in that every processor installs some configuration with an agreed membership within a bounded time unless it fails within that time. Subject to these consensus and termination requirements, the membership protocol aims to form a membership that is as large as possible. It then constructs a new ring on which the ordering protocol can resume operation, generates a new token, and recovers messages that have not been received by some of the processors when the fault occurred.

For each change in the membership within the local-area network, Totem delivers two Configuration Change messages to the application, rather than the one message that might have been expected. When a processor fails or the network partitions, the first Configuration Change message introduces a transitional configuration of reduced size that excludes the faulty or inaccessible processors. Delivery of this message informs the application that the delivery guarantees now apply only to the smaller transitional configuration. Within the transitional configuration, the remaining messages of the old configuration are delivered. After these messages are delivered, the second Configuration Change message is delivered, which introduces the new regular configuration.

These message ordering and membership services provide reliable totally ordered delivery of messages to the multiple-ring protocol described below.
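The sequence of two Configuration Change messages can be made concrete with a short sketch. The function `configuration_changes` and the tuple format of its events are hypothetical. The sketch also assumes that the transitional configuration consists of the processors that belong to both the old and the new regular configuration, which is one reading of "reduced size that excludes the faulty or inaccessible processors" rather than a statement of the actual algorithm.

```python
def configuration_changes(old_members, new_members, old_backlog):
    """Return the events a surviving processor delivers to the application
    when the membership changes from old_members to new_members.

    old_backlog -- messages of the old configuration not yet delivered
    """
    # Assumption: the transitional configuration is the set of processors
    # present in both the old and the new regular configuration.
    transitional = sorted(set(old_members) & set(new_members))
    events = [("config_change", "transitional", transitional)]
    # Remaining messages of the old configuration are delivered next.
    events += [("message", m) for m in old_backlog]
    # Only then is the new regular configuration introduced.
    events.append(("config_change", "regular", sorted(new_members)))
    return events


# Processor C fails and a new processor D joins.
events = configuration_changes({"A", "B", "C"}, {"A", "B", "D"}, ["m7", "m8"])
```

Here the application first learns that guarantees apply only to A and B, then receives m7 and m8, and finally sees the regular configuration containing A, B and D.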
4 The Totem Multiple-Ring Protocol

The Totem multiple-ring protocol is layered on top of the single-ring protocol, and delivers messages and topology changes to the application processes in timestamp order. Timestamp order guarantees global consistency of message ordering. The ability of a processor to deliver a message in timestamp order depends, however, on that processor's knowing that it has already received and delivered all relevant messages, from all of the connected rings, with timestamps that are less than the timestamp of the message to be ordered. The multiple-ring protocol exploits the services provided by the single-ring protocol for this knowledge.
4.1 Message Ordering

On each individual ring, messages are generated with increasing sequence numbers and timestamps. The single-ring protocol provides reliable delivery of messages in sequence number order on an individual ring. As the single-ring protocol delivers messages to a gateway, the gateway forwards the messages, in order, onto the other ring. On the new ring, a message retains its original timestamp but acquires a new sequence number so that it can be reliably delivered in sequence number order on that ring. Since the messages are forwarded in order, when the multiple-ring protocol receives a message that was originated on another ring, it has already received all relevant messages originated on that ring with earlier timestamps.

Each processor maintains a ring table with an entry for each ring in the network, containing a recv_msgs list of messages received from that ring. A processor can deliver a message in agreed order if the message is the lowest entry in all of the recv_msgs lists and if each recv_msgs list is nonempty. If the recv_msgs list of some ring is empty, then no further messages can be delivered until a message from that ring has been received, because the next message from that ring may have a lower timestamp than the other messages in those lists.

Messages, called Guarantee Vector messages, are broadcast periodically for each ring by the gateways to ensure that processors can continue to deliver messages in agreed order even if, for some ring, no processor on that ring originated a regular message recently. The Guarantee Vector messages also report which messages have been received on that ring and, thus, allow other processors to determine which messages can be delivered in safe order.

A filtering mechanism at each gateway ensures that messages addressed to a process group are forwarded only if there are members of that process group in the direction of the forwarding. This enables the Totem multiple-ring protocol to exploit process group locality and to operate efficiently in large networks, using the system-wide total ordering of messages to provide strict consistency of message delivery. The operation of the Totem message ordering protocol at a gateway is shown in Figure 2.

[Figure 2: The operation of Totem at a gateway. Labels recoverable from the figure: Application; Process Group Interface; Multiple Ring Protocol (ring table; totally ordered delivery in timestamp order); Single Ring Protocol (reliable delivery in sequence number order); Local-Area Network 1; Local-Area Network 2.]
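The agreed-order delivery rule based on the ring table can be sketched as follows. The class `RingTable` and its methods are hypothetical names for this illustration; the attribute name `recv_msgs` follows the text. Breaking timestamp ties by ring identifier is an assumption of the sketch, and Guarantee Vector messages, safe delivery and filtering are not modeled.

```python
from collections import deque


class RingTable:
    """One recv_msgs queue per ring. Each queue holds (timestamp, message)
    pairs in the order they were forwarded, that is, by increasing timestamp."""

    def __init__(self, ring_ids):
        self.recv_msgs = {ring: deque() for ring in ring_ids}

    def receive(self, ring, timestamp, message):
        self.recv_msgs[ring].append((timestamp, message))

    def deliverable(self):
        """Remove and return every message that can be delivered in agreed order."""
        out = []
        # Stop as soon as any list is empty: that ring might still
        # produce a message with a lower timestamp.
        while self.recv_msgs and all(self.recv_msgs.values()):
            # Pick the ring whose head has the lowest timestamp.
            # Ties are broken by ring id (an assumption of this sketch).
            ring = min(self.recv_msgs,
                       key=lambda r: (self.recv_msgs[r][0][0], r))
            out.append(self.recv_msgs[ring].popleft()[1])
        return out


table = RingTable(["ring1", "ring2"])
table.receive("ring1", 10, "x")
table.receive("ring1", 30, "z")
first = table.deliverable()      # nothing yet: ring2 has sent no message
table.receive("ring2", 20, "y")
second = table.deliverable()     # "x" then "y"; "z" must wait for ring2
```

After the second call the ring2 list is empty again, so "z" stays queued until ring2 contributes another message. This is the situation that the periodic Guarantee Vector messages are described as resolving.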
4.2 Topology Maintenance

The message ordering protocol described above depends on knowledge of the network topology. If messages are originated on a ring of which a processor is unaware, that processor will not wait for such messages during the ordering and may prematurely deliver other messages with higher timestamps. Similarly, if a ring becomes inaccessible and a processor is not informed, that processor will wait forever for a message from that ring and message ordering will stop.

Each gateway maintains a data structure, called topology, which contains its view of the current topology of the network, represented as a graph in which each node corresponds to a ring and each edge to a gateway. Gateways use topology for several purposes, including to decide which messages should be forwarded. In the event of a topology change, a processor receives the necessary topology information from the gateways on its ring.

Processor faults and network partitioning are detected by the single-ring protocol, which generates a Configuration Change message to report the change to the local ring. The gateways analyze the Configuration Change message to determine its effect on the network topology. When a ring becomes inaccessible, a processor or gateway generates a Topology Change message and removes that ring from its ring table, ending the need to wait for messages from that ring which will never arrive and allowing messages from other rings to be ordered.

A topology change must have the same effect for each of the processors that were previously able to, and can still, communicate with each other. Even though the processors learn of the topology change at different physical times, all agree on a common logical time for the topology change and on the same sets of messages delivered before and after the topology change. To achieve this, Configuration Change and Topology Change messages are timestamped, and are delivered to the application in timestamp order, along with the other messages. The Totem multiple-ring protocol is thus able to maintain, across a network of many rings, the extended virtual synchrony guarantees for agreed and safe delivery.
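A graph of rings and gateways, and the removal of inaccessible rings from a ring table, can be sketched as follows. The functions `reachable_rings` and `prune_ring_table` and the representation of each gateway as a pair of ring identifiers are assumptions of this illustration. The sketch applies a change immediately, whereas the protocol described above applies it at an agreed logical time through timestamped Topology Change messages.

```python
from collections import deque


def reachable_rings(gateways, start):
    """Breadth-first search over the topology graph.

    gateways -- iterable of (ring_a, ring_b) pairs, one per gateway (edge)
    start    -- the ring of the processor doing the computation
    """
    neighbours = {}
    for a, b in gateways:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, frontier = {start}, deque([start])
    while frontier:
        ring = frontier.popleft()
        for nxt in neighbours.get(ring, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def prune_ring_table(ring_table, gateways, my_ring):
    """Delete entries for rings that are no longer reachable.
    Returns the removed ring ids in sorted order."""
    live = reachable_rings(gateways, my_ring)
    lost = sorted(r for r in ring_table if r not in live)
    for r in lost:
        del ring_table[r]
    return lost


gateways = [("r1", "r2"), ("r2", "r3")]
ring_table = {"r1": [], "r2": [], "r3": []}
gateways.remove(("r2", "r3"))          # the gateway between r2 and r3 fails
lost = prune_ring_table(ring_table, gateways, "r1")
```

After pruning, a processor on r1 no longer waits for messages from r3, so ordering of messages from r1 and r2 can continue.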
5 The Totem Process Group Interface

The Totem system allows the fault-tolerant application to be structured as a set of process groups [8]. Each process group consists of a set of processes that cooperate to perform some task of the application. Messages within the Totem system are addressed to one or more process groups and are delivered to all of the processes in those groups. A typical application is structured as multiple process groups, where each process may be a member of several process groups. Maintaining the consistency of message ordering when process groups can intersect is a challenging problem. The only efficient and effective method known to us for solving this problem is to require a global total order over all messages for all process groups in the system.

The process group interface provides the services of creating, joining and leaving process groups and of sending and receiving messages. The interface establishes a socket for each application process through which the process communicates with Totem and which the process can poll to determine whether any messages are pending.

The process group interface passes messages from the application processes down to the multiple-ring protocol, which breaks large messages into small messages (packets). The process group interface receives messages from the multiple-ring protocol, which constructs large messages from small messages, and then determines the application processes, if any, to which to deliver the messages. Since messages are delivered to the process group interface in order, the interface does not need to be concerned with message ordering.

On each processor, the process group interface also maintains the current membership of any process group of which at least one process on that processor is a member. When a process joins or leaves a group, this fact is disseminated throughout the network to all of the other members of the group by the process group membership protocol.
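The splitting of large messages into packets and their reconstruction can be sketched as follows. The packet header `(msg_id, index, total, chunk)`, the function `fragment` and the class `Reassembler` are hypothetical; the source does not describe the packet format. The sketch relies on packets arriving in order, as the lower layers are described as delivering them.

```python
def fragment(msg_id, data, packet_size):
    """Split data (bytes) into packets of at most packet_size bytes.
    Each packet is (msg_id, index, total, chunk), a hypothetical header."""
    chunks = [data[i:i + packet_size]
              for i in range(0, len(data), packet_size)] or [b""]
    return [(msg_id, i, len(chunks), c) for i, c in enumerate(chunks)]


class Reassembler:
    def __init__(self):
        self.partial = {}  # msg_id -> list of chunks received so far

    def accept(self, packet):
        """Add one packet. Return the whole message once it is complete,
        otherwise None."""
        msg_id, index, total, chunk = packet
        chunks = self.partial.setdefault(msg_id, [])
        # Ordered delivery by the lower layers is assumed, not enforced.
        assert index == len(chunks), "packets must arrive in order"
        chunks.append(chunk)
        if len(chunks) == total:
            del self.partial[msg_id]
            return b"".join(chunks)
        return None


packets = fragment(7, b"flight plan UA12", 6)
r = Reassembler()
results = [r.accept(p) for p in packets]
```

The last element of `results` is the original message and the earlier elements are `None`, since the message is handed upward only when its final packet has arrived.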
6 Demonstration

The demonstration of the Totem system consists of a simulated air traffic control application on a network of workstations. In the airspace displayed on each of the workstation screens, aircraft travel from airfield to airfield, maintaining safe separation in flight and safe sequencing for takeoff and landing. Each aircraft is controlled by one of the workstations, as indicated by the color of the aircraft. The flight plans of the aircraft are replicated on all of the workstations, exploiting Totem's total ordering of messages to maintain the consistency of these replicas.

The aircraft control process on each workstation periodically generates a new flight, with a flight plan represented as a sequence of times and positions. The flight plan for the new flight is broadcast. On ordering this message, the aircraft control process checks the flight plan for conflicts with flight plans already in the database. If no conflict is detected, the flight plan is inserted into the database. The workstation controlling a flight, recorded in the database, periodically broadcasts the aircraft position to the display processes. The display process on each workstation displays the aircraft on the workstation screen.

Workstations can be stopped and restarted at arbitrary times, demonstrating the fault detection and reconfiguration capabilities of Totem. When a workstation fails, a Configuration Change message informs the aircraft control processes of the new membership and, thus, which of the workstations has failed. The remaining workstations assume control of the aircraft of the failed workstations in round-robin order and, on the workstation display, the colors of the aircraft change accordingly.
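The round-robin takeover of aircraft after a failure can be sketched as a pure function of the replicated state. The function `reassign_aircraft` and its data layout are hypothetical. Sorting the aircraft and the surviving workstations before assigning them is an assumption made here so that every replica computes the same result; the source states only that the order is round-robin.

```python
def reassign_aircraft(owner, failed, survivors):
    """Reassign aircraft controlled by failed workstations.

    owner     -- dict mapping aircraft id to controlling workstation
    failed    -- set of workstations missing from the new configuration
    survivors -- workstations in the new configuration
    """
    order = sorted(survivors)
    if not order:
        raise ValueError("no surviving workstation")
    # Sorted iteration is an assumption of this sketch; it makes the
    # outcome identical on every workstation that runs the function.
    orphans = sorted(a for a, w in owner.items() if w in failed)
    new_owner = dict(owner)
    for i, aircraft in enumerate(orphans):
        new_owner[aircraft] = order[i % len(order)]
    return new_owner


owner = {"AC1": "ws1", "AC2": "ws3", "AC3": "ws3", "AC4": "ws3", "AC5": "ws2"}
after = reassign_aircraft(owner, failed={"ws3"}, survivors=["ws2", "ws1"])
```

Because all surviving workstations receive the same Configuration Change message at the same point in the total order and hold the same replicated database, each can run such a function locally and arrive at the same assignment without further communication.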
7 Conclusion

The Totem system enables fault-tolerant applications in distributed systems to maintain the consistency of replicated information by providing reliable totally ordered multicasting of messages to processes within process groups. A hierarchy of protocols allows operation over a single local-area network or over multiple local-area networks interconnected by gateways. The message ordering strategy of Totem employs timestamps to define a total order on messages system-wide and sequence numbers to ensure reliable delivery by determining whether all messages with lower timestamps have been received on a ring. The strategy is computationally inexpensive and results in excellent performance.

Many issues remain to be considered, including more effective flow control for multiple-ring networks and better coupling between the routing and process group mechanisms. We are also planning to implement Totem over faster communication media, such as 100 Mbit/sec Ethernet and 155 Mbit/sec ATM.
References

[1] D. A. Agarwal, P. M. Melliar-Smith, and L. E. Moser, "Totem: A protocol for message ordering in a wide-area network," Proceedings of the First International Conference on Computer Communications and Networks, San Diego, CA, pp. 1-5, June 1992.
[2] D. A. Agarwal, Totem: A Reliable Ordered Delivery Protocol for Interconnected Local-Area Networks. PhD Thesis, University of California, Santa Barbara, August 1994.
[3] Y. Amir, D. Dolev, S. Kramer, and D. Malki, "Transis: A communication sub-system for high availability," Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing, Boston, MA, pp. 76-84, July 1992.
[4] Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella, "Fast message ordering and membership using a logical token-passing ring," Proceedings of the 13th IEEE International Conference on Distributed Computing Systems, Pittsburgh, PA, pp. 551-560, May 1993.
[5] K. P. Birman and T. A. Joseph, "Reliable communication in the presence of failures," ACM Transactions on Computer Systems, vol. 5, no. 1, pp. 47-76, February 1987.
[6] K. P. Birman and T. A. Joseph, "Exploiting virtual synchrony in distributed systems," Proceedings of the 11th Annual ACM Symposium on Operating Systems Principles, pp. 123-138, November 1987.
[7] K. P. Birman, A. Schiper, and P. Stephenson, "Lightweight causal and atomic group multicast," ACM Transactions on Computer Systems, vol. 9, no. 3, pp. 272-314, August 1991.
[8] D. R. Cheriton and W. Zwaenepoel, "Distributed process groups in the V kernel," ACM Transactions on Computer Systems, vol. 3, no. 2, pp. 77-107, May 1985.
[9] M. F. Kaashoek and A. S. Tanenbaum, "Group communication in the Amoeba distributed operating system," Proceedings of the 11th IEEE International Conference on Distributed Computing Systems, Arlington, TX, pp. 882-891, May 1991.
[10] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Communications of the ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[11] C. A. Lingley-Papadopoulos, The Totem Process Group Membership and Interface, Master's Thesis, University of California, Santa Barbara, August 1994.
[12] P. M. Melliar-Smith, L. E. Moser, and V. Agrawala, "Broadcast protocols for distributed systems," IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 1, pp. 17-25, January 1990.
[13] P. M. Melliar-Smith, L. E. Moser, and D. A. Agarwal, "Ring-based ordering protocols," Proceedings of the IEE International Conference on Information Engineering, Singapore, pp. 882-891, December 1991.
[14] L. E. Moser, P. M. Melliar-Smith, and V. Agrawala, "Processor membership in asynchronous distributed systems," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 5, pp. 459-473, May 1994.
[15] L. E. Moser, Y. Amir, P. M. Melliar-Smith, and D. A. Agarwal, "Extended virtual synchrony," Proceedings of the 14th IEEE International Conference on Distributed Computing Systems, Poznan, Poland, pp. 56-65, June 1994.
[16] L. L. Peterson, N. C. Buchholz, and R. D. Schlichting, "Preserving and using context information in interprocess communication," ACM Transactions on Computer Systems, vol. 7, no. 3, pp. 217-246, August 1989.
[17] R. van Renesse, T. M. Hickey, and K. P. Birman, "Design and performance of Horus: A lightweight group communications system," Technical Report 94-1442, Cornell University, Department of Computer Science, August 1994.
[18] P. Verissimo, L. Rodrigues, and J. Rufino, "The Atomic Multicast protocol (AMp)," in D. Powell, ed., Delta-4: A Generic Architecture for Dependable Distributed Computing, pp. 267-294, Springer-Verlag, 1991.