The Horus System

Robbert van Renesse, Brad Glade, Ken Birman, Robert Cooper, Patrick Stephenson

July 28, 1993

Abstract

Although group communication systems have become popular, software support for such computing remains complex and poorly integrated with modern operating system structures. We describe the Horus system, which brings microkernel design techniques to bear on this problem. Horus has proved lightweight and fast, is well suited for embedding into operating systems like Mach and Chorus, and is flexible enough to serve as a base for experiments in high-availability security technologies, real-time communication, and control of high-speed communication devices.

1 Introduction

The last five years have seen growing use of group communication primitives in distributed applications. Group communication and group membership protocols are used in fault-tolerant systems for propagating updates to replicas and collecting vote quorums for distributed decisions, in parallel systems for distributing jobs among slave worker processes, and in physically distributed systems for applications such as trading stocks and bonds, managing large telecommunication networks, and multi-media. Groups play roles that vary from pure communication (when data must be broadcast to a set of clients) to naming (as a way of identifying a set of cooperating or related processes), and the membership of a group is often used as a form of input to distributed algorithms, for example when a task is subdivided among a set of servers. Operating systems themselves can use group mechanisms for replicated file services, file caching, distributed shared memory, and distributed process schedulers.

Several distributed operating systems have responded to this trend by incorporating group communication mechanisms directly into the communication subsystem, in addition to more standard RPC/IPC interfaces and message streams (e.g., Amoeba [KTFHB89], Chorus [RAA+88], and V [CZ85]). Implementing communication primitives (of any kind) at a low level of the operating system usually results in better performance, due to a decrease in context switches and cross-address-space references, as demonstrated by the Amoeba operating system [vRvST89]. However, such an approach also increases the complexity of the communication subsystem. The alternative is to implement group protocols over point-to-point communication mechanisms, in user-space libraries and/or services. This approach permits the operating system to focus on "greasing the skids" for a simpler style of communication, normally streams or RPC. However, the resulting group programming tools may be subject to performance limitations stemming from the mismatch between these types of communication mechanisms and the uses group communication places upon them, and from the difficulty of making optimal use of the underlying technology from user-space software. For example, most networks support multicast primitives, and using these primitives can result in an n-fold speed-up for a broadcast to a set of n processes. However, taking advantage of these from a user library may require special privileges, and may preclude sharing the device with other users. Future networks, such as ATM [Lam92], will often require that the network interface be managed by a single agent on behalf of the entire node. Thus, by moving group communication out of the kernel, one may be condemned to poor performance.

Here, we report on the design and implementation of a portable group communication subsystem, called Horus, intended for modern distributed operating systems.[1] Our system makes several contributions. First, it demonstrates that high performance group communication can benefit from the same sorts of modularity techniques and optimizations as are used to enhance the performance of RPC and other communication methods. Second, it shows that efficient group communication support can exist side-by-side with other forms of IPC, without impacting IPC performance. Third, we demonstrate a valuable form of design flexibility: Horus is easily reconfigured to change the properties of the group communication protocols, or to employ an alternate implementation of some of its modules. This permits the system to be used as a base for experiments on systems requiring somewhat different functionality than the default for Horus.

The remainder of this paper starts with a more detailed discussion of the design of the system, in Section 2. Section 3 describes the Multicast Transport Service (MUTS). Section 4 deals with thread synchronization in Horus. Section 5 describes the most important subsystem, which implements group communication and group membership. Section 6 discusses security and real-time issues. Section 7 describes how we intend to customize Horus to host operating systems. Section 8 discusses how Isis applications may be supported by Horus. Finally, Section 9 presents initial performance results and concludes.

[1] These systems are usually based on microkernels, which provide excellent support for memory management, multiprocessors, multithreading, and intra-machine communication and synchronization, but may provide limited support for inter-machine communication and synchronization, and failure handling.

2 Design

The group communication and membership paradigm has emerged as a common basis for a broad variety of applications, including parallel processing, fault-tolerant systems (replication), data distribution (multi-media), object orientation (notably where objects are shared), distributed database transactions, and the implementation of other distributed paradigms (such as shared virtual memory). The group communication system clearly has to be flexible if it is to support all these applications efficiently. Desirable features include message delivery ordering, security, fault-tolerance, and support for different architectures, operating systems, and networks.

Previous work on group communication systems, notably Isis, can be criticized for seeking to be all things to all users. In the effort to support every possible feature and every conceivable form of ordering or fault-tolerance property, one arrives at a system which is too complex to be easily used by casual developers, and which is also difficult to maintain, extend, and optimize for high performance. Moreover, many applications do not need all of the features of a system like Isis, hence applications sometimes suffer overhead on behalf of mechanisms that are really intended for applications of an entirely different nature. On the other hand, it seems difficult to throw features out without losing the power that makes the group communication system useful. In particular, we are convinced that a group communication system built on a single protocol will either be very slow, or will not suffice for many needs.

In the design of Horus, we have sought to overcome this limitation by drawing on concepts that originated in the operating system community in response to similar concerns about "macrokernels." Horus explicitly separates mechanism from policy, implementing a group communication system as a general mechanism (framework) within which any of a number of policies can be supported. In our own work, we have made substantial use of this feature and found that it lets us specialize Horus for purposes such as security, without complicating the basic system for purposes in which security is not an issue. And security is just one example: we are finding similar benefits of this type of flexibility throughout our work. The result is highly reminiscent of a microkernel architecture: a core collection of mechanisms that form a skeleton for a generic group communication system, which can be filled out with policies, yielding a high performance "plug and play" group communication system specific to a desired use. In a similar manner, microkernel operating systems support some basic primitives at a low (kernel) level, and more sophisticated services at higher levels (in user space). This allows for a great deal of flexibility.

Our work was especially influenced by the x-kernel design [PHOA89], which provides a framework for implementing network protocols. In the x-kernel, each protocol implements a simple feature, and the protocols can be tied together in a graph to support the needs of the application. (Unfortunately, we and others have not found the x-kernel implementation itself well-suited for supporting the types of multicast protocols that interest us [HS92].) Rather than being structured as a protocol graph, the multicast protocols in the Horus design are layered on top of each other.

Figure 1 shows the structure of the Horus design. The bottommost layer hides interfacing issues, such as how the network protocols are driven, or how threads are created and synchronized. This layer, called the MUlticast Transport Service (MUTS), provides an asynchronous, reliable, one-to-many message-passing model, with preemptive threads and support for timers and multiple address spaces. MUTS can support multiple transport protocols, reliable or unreliable, point-to-point or multicast, and is easily portable to many different environments. It is designed to run either as part of the kernel, or in user space. The rest of the Horus system makes use, as much as possible, of the MUTS facilities alone, making Horus completely portable.

Over MUTS runs the Vsync (Virtual Synchrony) subsystem, which provides a full group communication environment. It is implemented as a collection of layers, each presenting the same interface, but each implementing different features. One layer, the Membership layer (or VIEWS), runs directly over MUTS and provides reliable group membership administration and, in its secure version, key distribution and message encryption. Over that, there are layers that provide different message delivery ordering guarantees, security models, fault-tolerance, or scalability. These layers may be stacked in different orders, and new layers can easily be added. This allows for maximal flexibility. The interface, called the Uniform Group Interface, provides a group model. Communication endpoints (machines, processes, ports, etc.) can join or leave groups, multicast messages, or fail. All these events are distributed to the existing members of the group. Currently we have a layer that multiplexes multiple endpoints on a single endpoint (like multiple ports on a single machine), two layers that provide causal message delivery (using the conservative and progressive implementations of Paper ??), and a layer that provides totally ordered message delivery. In addition, we are working on a layer that multiplexes multiple groups on a single group, much like light-weight threads in a single process. This allows for efficient, low-cost groups [GBCvR92]. We intend to provide other layers, and users can install layers of their own.

A number of exchangeable user-level services provide the policy-making aspects of Horus (the layers provide the mechanisms). Currently we have a failure analysis service that decides whether an endpoint has failed using the protocol of [RB91], a message logging service, a time service, and an authentication service that is in charge of key distribution [RBvR93]. All these services are fault-tolerant. In addition, we have a name service that maps ASCII names to group identifiers (which is not yet replicated, but will be soon). Users are free to use their own services, so that the group semantics can be tailored to particular applications.
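To make the layered structure concrete, the sketch below shows one way a stack of layers sharing a uniform interface might be expressed in C. The struct and function names are our own assumptions for exposition; the actual Uniform Group Interface is not given in this paper.

    /* Hypothetical rendering of a layer stack: every layer exports the
     * same operations and holds a pointer to the layer beneath it. */
    struct message;                   /* opaque in this sketch */
    struct endpoint;

    typedef struct layer {
        struct layer *below;          /* next layer down; MUTS at the bottom */
        int  (*join)(struct layer *self, struct endpoint *ep);
        int  (*cast)(struct layer *self, struct message *msg);
        void (*deliver)(struct layer *self, struct message *msg);  /* upcall */
    } layer;

    /* A total-order layer would stamp each multicast and pass it down,
     * leaving transport concerns to the layers beneath it. */
    static int total_order_cast(layer *self, struct message *msg) {
        /* ...assign a group-wide sequence number to msg here... */
        return self->below->cast(self->below, msg);
    }

Because each layer presents the same interface, reordering the stack or inserting a new layer (for security, say) requires no change to the layers above or below it.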

[Figure 1: The Horus design. From top to bottom: user-level services (failure analysis, logging, time, and authentication); the Vsync protocol layers (the membership layer VIEWS, with atomicity, flush protocol, event counts, encryption, and key distribution; the conservative and progressive causal protocols; total order/sequencing; multiplexing; and light-weight group management); and the Multicast Transport Service (MUTS), with its transport protocols, flow control and retransmission, connections and endpoints, memory allocation, authentication library, timer management, threads and synchronization, and address space management, over the host environment. The dotted line is the preferred O.S. boundary, except for the light-weight process group layer, which runs in user space.]

3 The Multicast Transport Service

MUTS has emerged as the most important part of Horus. MUTS can run inside the operating system kernel, in the user libraries, or as a service accessed by (local) RPC. It provides access to the services that the platform provides, and provides the services that the platform fails to provide itself. Over MUTS, layers can depend on reliable, FIFO point-to-point and multicast communication, timers, threads, synchronization, and access to multiple address spaces. MUTS provides its own multicast sliding window protocol, its own timer management, and its own threads package. These facilities are primitive, and only intended to fill in the gaps. An important design consideration is to provide access to the existing services without loss of performance.

The multicast service is based on a simple extension to traditional sliding window protocols. MUTS users can choose, per group, which underlying transport they wish to use (a default is provided). Optionally, the integrity of communication over these connections can be protected with message authentication codes (encryption is supported at a higher level). Unlike other systems, MUTS never decides to break a connection by itself. It reports communication problems, and it will slow down retransmission, but it will only break a connection when explicitly instructed to do so. In general it is a user-level failure analysis service that decides, on behalf of the users and/or applications, whether an endpoint has failed or not (see Figure 1).

Currently MUTS runs (simultaneously) over IP, UDP (and their Deering multicast extensions [DC90]), TCP, (raw) Ethernet, and Mach messages. Some of these protocols are unreliable, some reliable.[2] MUTS also provides basic flow and congestion control. As the communication model is asynchronous message passing, MUTS can pack messages together in the case of congestion. Thus a single network packet may contain many MUTS messages. This allows for an unusually high message throughput, as shown in the performance section.

MUTS provides a simple interface to multi-threading. Applications have to be able to run on both non-preemptively and preemptively scheduled threads. For synchronization, MUTS provides semaphores, locks, and event counts [RK79]. Multiple threads may run in an address space, and threads may be started in "foreign" address spaces (as long as they are on the same machine). Messages are vectors of data chunks which may contain references to foreign address spaces.

[2] Although TCP is "mostly" reliable, it sometimes decides to break a connection, even though it cannot be completely sure that the peer site is down. In this case MUTS will try to re-establish a connection until the failure analysis service instructs it explicitly to drop the connection. The point is that MUTS seeks to provide a completely controlled, error-free communication environment, and hence must attempt to overcome any sort of communication failure until instructed by higher layers to cease doing so. This has an explicit benefit: it allows us to enforce consistent failure handling through higher-level agreement protocols, thus keeping MUTS simple without requiring that the failure semantics of MUTS connections be simplistic.
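The message-packing idea can be illustrated with a small sketch. We assume a simple length-prefixed framing here; the actual MUTS wire format is not described in the paper.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PACKET_MAX 1400       /* assumed payload budget per packet */

    struct small_msg { const uint8_t *data; uint16_t len; };

    /* Copy as many queued messages as fit into one packet, each with a
     * two-byte length prefix; returns how many messages were consumed.
     * The receiver unpacks by walking the prefixes. */
    size_t pack_messages(const struct small_msg *q, size_t nqueued,
                         uint8_t *pkt, size_t *pkt_len) {
        size_t used = 0, i;
        for (i = 0; i < nqueued; i++) {
            if (used + 2 + q[i].len > PACKET_MAX)
                break;
            pkt[used++] = (uint8_t)(q[i].len >> 8);   /* big-endian length */
            pkt[used++] = (uint8_t)(q[i].len & 0xff);
            memcpy(pkt + used, q[i].data, q[i].len);
            used += q[i].len;
        }
        *pkt_len = used;
        return i;
    }

Under congestion the send queue grows, so each packet carries more messages and the per-packet overhead is amortized, which is what yields the high throughput reported in the performance section.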

These vectors may be passed between address spaces, and routines are available to provide (protected) cross-address-space data copying.

We have stressed the importance of architectures based on frameworks into which policy can be attached. Within MUTS, this principle is evident in several aspects of the design:

- The use of call-backs throughout the code allows the insertion of higher-level policies to detect the completion of each stage of message handling, such as adding timestamps. Rather than doing all things that all protocols might want, this gives MUTS the flexibility to do the specific things needed for a specific protocol while not burdening other applications or protocols with the same overhead (a sketch of this style appears at the end of this section).

- The separation of flow-control and memory management policy from transport-level considerations.

- The separation of decisions to act upon a failure (by breaking a connection) from the detection of a communication problem, which allows us to provide failure information to the application level in a consistent manner.

It should be noted that MUTS is at least superficially very similar to the x-kernel and Psync architectures. To reiterate, this was intentional: although we were unsuccessful in building our system directly on the x-kernel, we feel that MUTS is best viewed as an extension of the x-kernel, using similar ideas and a similar design philosophy, but with the goal of supporting primarily group-oriented and multicast software, in contrast with the x-kernel, which was initially developed with a focus on RPC and stream protocol stacks.
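To illustrate the call-back point above, the sketch below shows one way such stage hooks might be registered and invoked. The names (muts_set_pre_send_hook and friends) are our own assumptions; the paper does not give the actual MUTS interface.

    struct muts_msg;

    typedef void (*muts_hook)(struct muts_msg *m, void *env);

    /* One optional hook per stage of message handling; an unused stage
     * costs only a NULL test on the fast path. */
    static muts_hook pre_send_hook;
    static void     *pre_send_env;

    void muts_set_pre_send_hook(muts_hook h, void *env) {
        pre_send_hook = h;
        pre_send_env  = env;
    }

    /* Called by MUTS just before a message is handed to the transport;
     * a real-time layer might register a timestamping hook here, while
     * protocols that need nothing at this stage pay only the NULL check. */
    static void run_pre_send(struct muts_msg *m) {
        if (pre_send_hook)
            pre_send_hook(m, pre_send_env);
    }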

4 Multi-threading and Ordering The Horus system is multi-threaded. For each arriving message, a new thread is created. Isis was one of the rst available systems under the UNIX system that provided multi-threading, and this was one of the reasons for its success. Now that microkernels provide their own multi-threading, we wish to adopt the available threading mechanisms, rather than impose our own. Yet there is a problem to be resolved. If multiple messages become available at once, several threads will get started, one for each message. The threads will process the messages in parallel, and, as a consequence, ordering is lost. In Horus is solved by scheduling threads non-preemptively and in the correct order. However, if we want to be able to take full advantage of real parallelism, this approach no longer suces. Horus attaches a sequence number to each message to be delivered. Any ordering issues are now left to the application. To support this, the Horus library provides a construct called event counters , based on Reed and Kanodia's work 7

Function ec create ec acquire ec release ec destroy

Arguments | event counter, sequence number event counter event counter

Result event counter | | |

Table 1: The event counter interface.

f

upcall ( message, sequence number ) initial processing on message ; ec.acquire ( event counter, sequence main processing on message ; ec.release ( event counter, sequence final processing on message ; ;

number

);

number

);

g

Figure 2: How event counters are used. Several of these threads may run in parallel, yet the main processing on messages happens in a strict order. on synchronization [RK79]. An event counter is basically a lock, which can be acquired only in a certain order. The interface is presented in Figure 1. Ec.create creates a new event counter, and initializes it to zero. To acquire the event counter, a thread calls ec.acquire with a sequence number. If the sequence number does not correspond with the event counter, the procedure will block until the value of the event counter has reached the given value. Ec.release will release the event counter, and increment its associated value. Ec.destroy will destroy the event counter, and release the associated resources. Figure 2 demonstrates how an event counter may be used. Basically, an event counter implements a critical region for processing messages, so that the messages are processed in the right order. This simple interface is consistent with the microkernel philosophy underlying our work. Other researchers have proposed more elaborate interfaces for this type of event ordering. For example, the PSYNC system allows programs to detect and act upon very complex message ordering properties [PBS89], and the work of Liskov and Ladin also supports user-implemented message delivery orderings [LL86]. Experience with Isis, however, leads us to believe that while causal delivery is vital, other sorts of delivery orderings are rarely needed. The approach described above, which forces users to process messages in a xed order consistent with causality is less powerful than these other schemes, but it has 8

the bene t of being simple and highly concurrent. Single-threaded application may ignore event counters altogether.
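As a concrete illustration, event counters map naturally onto standard thread primitives. The sketch below implements the Table 1 interface using POSIX threads, following Figure 2's signature for ec_release (which also passes the sequence number); the internals are our assumption, not the Horus implementation.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  reached;
        unsigned long   value;    /* next sequence number admitted */
    } event_counter;

    event_counter *ec_create(void) {
        event_counter *ec = malloc(sizeof *ec);
        pthread_mutex_init(&ec->lock, NULL);
        pthread_cond_init(&ec->reached, NULL);
        ec->value = 0;            /* initialized to zero, as in the text */
        return ec;
    }

    /* Block until the counter reaches the caller's sequence number. */
    void ec_acquire(event_counter *ec, unsigned long seqno) {
        pthread_mutex_lock(&ec->lock);
        while (ec->value != seqno)
            pthread_cond_wait(&ec->reached, &ec->lock);
        pthread_mutex_unlock(&ec->lock);
    }

    /* Increment the counter, admitting the next sequence number. */
    void ec_release(event_counter *ec, unsigned long seqno) {
        pthread_mutex_lock(&ec->lock);
        ec->value = seqno + 1;
        pthread_cond_broadcast(&ec->reached);
        pthread_mutex_unlock(&ec->lock);
    }

    void ec_destroy(event_counter *ec) {
        pthread_mutex_destroy(&ec->lock);
        pthread_cond_destroy(&ec->reached);
        free(ec);
    }

Threads that arrive out of order simply wait in ec_acquire until their predecessor's ec_release admits them, which is exactly the critical-region behavior that Figure 2 relies on.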

5 The Vsync Subsystem

The function of Vsync is to extend MUTS into a full group communication environment, supporting fault-tolerant multicast. The model used is the same virtually synchronous execution model that Isis has employed since 1987, and which is the basis of the Isis Toolkit. Vsync is thus responsible for message ordering, group addressing, fault-handling, managing group membership information, and providing notifications when group membership changes occur. Two major issues needed to be addressed to provide all of this while achieving high performance. First, it was important that Vsync maintain the highest possible degree of parallelism, through multi-threading. Second, because Vsync will often execute within the operating system address space (as a service shared among many processes co-resident on the same node), it is important to minimize the amount of cross-address-space traffic that the architecture can initiate.

In developing Vsync, we also wished to retain the flexibility to explore various implementations of the basic group multicast protocols. Readers familiar with Isis will be aware that these protocols have evolved several times, and we are not convinced that a definitive implementation of the protocols has been reached. Accordingly, Vsync is structured to allow the multicast ordering mechanism to be added as a form of plug-in module. The advantage of this approach is that we retain the flexibility to experiment with various protocols, much as a microkernel operating system can be configured to include or exclude different virtual memory management policies.

Vsync embodies two innovations that proved important for achieving high performance. First, all interfaces permit lists of messages to be passed, as opposed to a single message at a time (a sketch of this idea follows at the end of this section). The rationale is that many Isis applications generate such high message transmission rates that cross-address-space calls on a per-message basis represent a prohibitive cost. For example, in work with the New York Stock Exchange, the Isis system has seen sustained rates of 500 quotes per second, and in the Hiper-D project (a prototype system to replace the Naval AEGIS radar system) radar contact frequencies of hundreds per second are anticipated. Even higher rates of message generation arise in telecommunications systems, factory-floor applications, and a number of other Isis application domains (with requirements of up to 10,000 messages per second). By packing multiple messages into a single larger object, and keeping the code path between the user's address space and the outgoing MUTS layer as short as possible, a high degree of asynchronous pipelining is achieved, resulting in high performance.

The second innovation is concerned with maintaining maximal parallelism by keeping the MUTS thread that first accepts an incoming message active, if possible, all the way into the user's address space. This goal initially seems to be at odds with the message ordering philosophy, which argues that messages should be delivered in an order consistent with the causal order in which they were sent, and perhaps in a stronger, atomic, order. However, Vsync uses the event counter scheme to delay synchronization of concurrent threads carrying messages until the last possible instant, and in most cases this means that no synchronization delays occur at all. The solution is such that a single-threaded application need take no action at all; messages arrive in the correct delivery order. A maximally parallel application, if it desired, could receive messages in parallel, completely losing the ordering properties, but not the benefits of failure atomicity. A more typical application would enter a critical section as it reaches the stage of processing an incoming message where order is significant, in this way preserving as much parallelism as possible for as long as possible, and keeping the sequential, single-threaded region of execution as short as possible.

Currently, the message ordering is implemented using the conservative CBCAST protocol of Paper ??. However, simulation studies suggest that if group overlap exceeds a threshold, the performance of the conservative scheme becomes poor, and a multi-vector-timestamp scheme will perform far better (also described in Paper ??). Results of these simulation studies appear in Figure 3. The modular structure of Vsync will allow us to experiment with that scheme, and perhaps with others, without encumbering the core system with undesired complexity. In addition, for increased performance, Vsync provides causal domains, which are sets of groups within which causality is preserved (across group boundaries). This architecture gives the developer control over when causal delivery orders will be enforced and when they can be violated.

Finally, as noted earlier, the Vsync layer is closely integrated with the failure analysis service. This service runs the protocol of [RB91] among a small set of processes, which in turn manage a complete list of processes that belong to the system. An inexpensive method is used to drop faulty processes from the general list; the scheme used for members of the service itself is more complex, but rarely invoked. In implementing Vsync, many optimizations were made to the protocols. These include a new and simplified flush protocol, the multiplexing of multiple processes onto a single "site" when several processes on the same site belong to a process group, and changes to the vector-timestamp compression scheme. Details on these optimizations and their impact on performance will be reported elsewhere.
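To make the message-list idea concrete, the following sketch shows a sender-side batching wrapper. vs_cast_many and the surrounding types are hypothetical names standing in for the actual list-based Vsync entry points.

    #define BATCH_MAX 64

    struct vs_msg;
    struct vs_group;

    /* Stand-in for the real list-based entry point. */
    static int vs_cast_many(struct vs_group *g, struct vs_msg *msgs[], int n) {
        (void)g; (void)msgs; (void)n;   /* a real Vsync would order and send */
        return 0;
    }

    struct batch { struct vs_msg *m[BATCH_MAX]; int n; };

    /* Accumulate outgoing messages; one downcall flushes the whole batch. */
    static void batch_add(struct batch *b, struct vs_group *g,
                          struct vs_msg *msg) {
        b->m[b->n++] = msg;
        if (b->n == BATCH_MAX) {        /* full: 64 messages, one crossing */
            vs_cast_many(g, b->m, b->n);
            b->n = 0;
        }
    }

    static void batch_flush(struct batch *b, struct vs_group *g) {
        if (b->n > 0) {
            vs_cast_many(g, b->m, b->n);
            b->n = 0;
        }
    }

At the Stock Exchange rates quoted above, batching of this kind turns hundreds of per-message boundary crossings per second into a handful.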

[Figure 3: Results of simulating delivery latency in a setting with 13 processes in three overlapping groups of 5 members each, sending short messages, at rates up to 800 messages/sec. The three lines display the performance of the compressed vector timestamp scheme, the average performance of the conservative protocol, and the worst-case performance of the conservative protocol (in the absence of failures). Different experiments give similar results. (Source: Michael Kalantar at Cornell University.)]

6 Security Layer and Real-Time Layer

The first major tests of the flexibility of Horus came early in development, with the implementation of a security architecture and a real-time communication architecture within the system. Each of these was intended to operate as optional functionality that extends the properties of Horus communication without introducing substantially more complexity into Horus itself.

The security architecture, discussed in [RBG92], is now fully implemented and operational. It extends the basic Horus halting-failure model to encompass a variety of malicious attacks, providing a security model best visualized as a secure process group within which Horus abstractions are preserved and communication is protected. Moreover, authentication and access control mechanisms enable group members to prevent untrusted processes or sites from joining the group.

To obtain good performance, it was necessary for the security architecture to intercept communication at several levels of Horus. Within MUTS, the security system authenticates packets using fast message authentication codes. Vsync supports the encryption of user data and the secure distribution of cryptographic keys. To minimize the cost of encryption, Vsync precomputes and caches strings of unpredictable pseudorandom bits, thereby reducing the latency of encryption to that of exclusive-oring the outgoing message against a precomputed string, and repeating this operation on the reception of the message.
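The effect of the precomputed cache is easy to see in code. The sketch below, with names of our own choosing, shows the per-message work once a pad of pseudorandom bytes has been agreed upon; generating and replenishing the pad (the expensive part) happens off the critical path.

    #include <stddef.h>
    #include <stdint.h>

    /* XOR the message against an agreed-upon segment of precomputed
     * pseudorandom bytes. The same call decrypts, since
     * (m XOR pad) XOR pad == m. Each pad segment must be used for
     * exactly one message. */
    void pad_crypt(uint8_t *msg, const uint8_t *pad, size_t len) {
        for (size_t i = 0; i < len; i++)
            msg[i] ^= pad[i];
    }

This also explains the latency jump visible in Figure 5(a): messages larger than the 2 Kbyte cache must wait for fresh pad bytes to be computed.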

Our experience in implementing the security mechanisms confirmed that Horus embodies the flexibility desired: the security architecture was fully operational after 9 months of development, and has had no performance impact on Horus when not in use. Moreover, the performance of the security system itself appears to be very good, dominated by unavoidable costs associated with protecting communication.

The Corto system, being developed over Horus by Carlos Almeida, Keith Marzullo, and Robbert van Renesse, will offer a further test of this topic. Corto seeks to integrate real-time communication functionality and periodic scheduling behaviors into the same sort of group programming abstractions that Vsync supports, but with a bias toward real-time properties rather than synchronization and ordering. To accomplish this, Corto needs to insert timestamps into messages, to exploit priority-driven scheduling techniques, and to guarantee that synchronization delays will be bounded and small. At the time of this writing, a detailed design for Corto exists and implementation is underway; preliminary progress suggests that all of these requirements can be accommodated within the framework of the basic Horus system.

7 Customizing Horus to a Port Interface

The original Isis system runs as a UNIX application. Of particular importance to us is getting the Horus system to run well on modern microkernel technology, notably Mach [ABG+86] and Chorus [RAA+88]. These systems provide their own communication mechanisms on top of an interface based on ports and messages. We wish to integrate the Horus system within this framework. The basic reasoning behind these plans is that microkernels appear to offer satisfactory support for memory management and communication between processes on the same machine, but that support for applications that run on multiple machines is weak. The current IPC mechanisms, with Remote Procedure Call as the most popular one, are adequate only for the simpler distributed applications, as they do not address any of the internal management issues of distribution [Ham84, BvRT87, TvR88]. Our goal is to add stronger functionality and semantics to the existing Mach and Chorus message interfaces, rather than defining a new interface. This functionality and semantics will take the form of Horus news groups. In this section we will first look at the existing Mach and Chorus port interfaces, and then discuss how we wish to integrate Horus within these interfaces.

7.1 Ports

Modern operating systems support ports with a wide range of semantics. As a minimum, as in the Amoeba system [TvRvS+90], a port is an address to which messages can be sent. In Chorus and Mach, a bounded queue is associated with the port, so the process holding the port need not listen continuously (in Amoeba this is done by having several threads wait for messages simultaneously). In Mach, messages are reliably delivered, and the sender may block if the port queue is full. Chorus messages to ports are unreliable, although a reliable port-level RPC interface is supported. Mach and Chorus allow only one receiver process on a port (although possibly multiple threads within that process), but the port may be migrated to a new receiver process.

Mach ports do not have user-visible global names and, in reality, have no global access. Ports are instead accessed using so-called rights, which can be compared to file descriptors or capabilities. Global access is simulated through a user-space server, the NetMsgServer. This server acts as an agent for remote ports: it creates a local proxy port, and forwards messages sent to that port to the remote NetMsgServer using TCP or another conventional protocol. Similarly, it delivers messages received from remote NetMsgServers to the local port. Mach users do not notice this and, in principle, local semantics are transparently maintained (currently, however, this is not the case).

Chorus ports, on the other hand, do have global names and a corresponding global implementation. Chorus provides a port group concept, with weak semantics. It is possible to allocate a group and add (local) ports to it. Messages can be sent to the group in the same way as to ports, and are unreliable. There is no membership information available. (Note: the Mach "port set" is a mechanism that allows receiving on multiple ports at once, much like UNIX select, and has no group communication role. Mach currently does not provide a group mechanism.)

7.2 Integrating Horus into a Port Interface

The news interface will look similar on both Chorus and Mach. Rather than having processes subscribe to news groups, ports may subscribe. Henceforth, messages sent to the port are multicast to the corresponding news group, and messages received on the news group are delivered through the port. We also want to extend the basic Chorus and Mach system calls so that we can transparently replace a conventional Chorus or Mach application with one that has been replicated for fault-tolerance. The users of such an application would continue to use the original application interface, but all communication would transparently be sent to a group and processed cooperatively by the group members.

This raises the question of how to integrate the Horus news group mechanisms into the Chorus and Mach microkernel interfaces. As described, Chorus currently provides an unreliable group interface. The Chorus group-allocate function will be implemented by creating a Horus news group. A new option will indicate whether membership information should be posted to the group or not. Adding a port to the group will be the same as subscribing the port to the news group [BBCT91].

Mach does not currently have a group concept. (As noted before, the "port set" provides only a select functionality, with no global semantics.) We intend to use the NetMsgServer to simulate a global port with multiple receivers, much in the same way as Mach already uses the NetMsgServer to simulate global ports. Instead of TCP, the NetMsgServers will use Horus to do their group communication (see Figure 4). Each application task will have two ports: one to receive incoming messages, and another to which it sends outgoing messages. The NetMsgServer forwards messages it receives from Horus to the first port. Messages sent to the second port are received by the NetMsgServer and forwarded to the corresponding news group [vRBGS92].

[Figure 4: Structure of the Mach/Horus implementation with four tasks (T1-T4), showing the microkernels, the NetMsgServers (NMS), and the ports.]

It may be useful to walk through this architecture in the case where a pre-existing application is replaced with a fault-tolerant group. This will work roughly the same way under Mach and Chorus. Under Mach, when a client thread looks up the name of a server to find its port, this request is sent to the NetMsgServer. The NetMsgServer can now generate a local port and return the send right to the client. The NetMsgServer subscribes the port to the Horus news group that implements the fault-tolerant service. When the client sends a request message to the port, the NetMsgServer will post it to the news group, the members of which cooperate to respond fault-tolerantly using one of several methods supported by Isis. After the NetMsgServer receives a reply to the request, it forwards it back to the client. The client can be kept completely unaware of this change. (Details of the comparable algorithm in Chorus are omitted for brevity.)
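A hypothetical rendering of the forwarding step may help. Everything here (port_receive, group_post, and the types) is an illustrative stand-in; the real NetMsgServer would use the Mach message primitives and the Horus interfaces.

    struct port;
    struct news_group;
    struct msg;

    /* Stand-ins for the receive/send and group primitives. */
    extern struct msg *port_receive(struct port *p);
    extern void        port_send(struct port *p, struct msg *m);
    extern void        group_post(struct news_group *g, struct msg *m);
    extern struct msg *group_recv(struct news_group *g);

    /* Outbound path: a client request arrives on the local proxy port
     * and is reposted to the news group implementing the service. */
    void forward_requests(struct port *proxy, struct news_group *g) {
        for (;;)
            group_post(g, port_receive(proxy));
    }

    /* Inbound path: replies from the group go back to the client port. */
    void forward_replies(struct news_group *g, struct port *client) {
        for (;;)
            port_send(client, group_recv(g));
    }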

8 The Isis Toolkit

Readers familiar with the Isis system will have noticed that several aspects of the system have not been described yet, in particular the tools. The current Isis system comes with a toolkit containing tools for applications that use RPC, atomic transactions, primary-backup replication, resource management, and monitoring and control [MCWB91]. The new system will continue to support these tools, but they will be implemented entirely as a set of user libraries and services. Other old Isis applications will continue to be supported through a compatibility library that offers the old interface. For brevity, we will not present details of this mapping. The key point, however, is that the CBCAST and group membership layer of our new system is sufficiently powerful to let us support the full range of functionality implemented by the old Isis system.

9 Initial Performance Results and Conclusion

In this section we present a sample of initial performance results. All our experiments were performed between user processes on Sun SPARC ELC workstations, running SunOS 4.1.1 and connected by a 10 Mbit/sec Ethernet. Note that we intend to run the MUTS and Vsync layers inside the operating system, rather than on top of it, which should result in a significant performance improvement. The MUTS protocol runs over both TCP and UDP; it dynamically chooses the protocol that it considers most appropriate. The graphs (Figures 5 and 6) are intended to be self-explanatory.

We have been unable to discuss a number of aspects of Horus that we consider innovative, but that would require a more detailed exposition of the system architecture to present. In particular, we have not discussed the protocol by which anonymous processes first contact a group, which is based on an idea originally proposed by John Warne for the ANSA system, adapting the RPC protocol of Wood [Woo93] to our environment, nor the Isis toolkit being ported to run over Horus. We will be able to simplify the toolkit, in large part because of better state transfer algorithms used for group join. In addition, our light-weight presentation of groups (which map down to heavy-weight groups) is being used to handle cases where a large number of groups coincide. Longer term, we are considering the possibility of integrating Horus directly into the communication model of Mach or Chorus, presenting our group mechanism as a form of "group port" that could be used transparently in existing Mach or Chorus applications. However, at the present stage of our effort, these steps have not been taken.

[Figure 5: Round-trip delay with varying message and group sizes, comparing the insecure, authenticated, and authenticated-plus-encrypted cases against Isis. Encryption is performed with a precomputed cache of size 2 Kbyte, which accounts for the large increase in latency between the 2 Kbyte and 3 Kbyte messages in (a). The bottom dotted line is the UDP round-trip delay, and the dotted line marked with plusses is the Isis round-trip delay. In (b), we used a fixed message size of 1 Kbyte.]

[Figure 6: Join delay with varying group size, excluding the new member (insecure case), comparing Horus against Isis. The cost of a secure join is approximately 2.5 seconds, of which 90% is spent in the modular exponentiation routines of our C implementation of the RSA public key cryptosystem. Horus uses a flush protocol of linear complexity, whereas Isis uses a protocol of quadratic complexity.]

References

[ABG+86] Mike Accetta, Robert Baron, David Golub, Richard Rashid, Avadis Tevanian, and Michael Young. Mach: A new kernel foundation for UNIX development. In Proceedings of the USENIX Summer '86 Conference, pages 93-112, Atlanta, GA, June 1986.

[BBCT91] Micah Beck, Ken P. Birman, Robert Cooper, and Sam Toueg. A fault tolerant extension of the Chorus nucleus. Technical report, Department of Computer Science, Cornell University, January 1991. Internal report.

[BvRT87] Henri E. Bal, Robbert van Renesse, and Andrew S. Tanenbaum. Implementing distributed algorithms using remote procedure call. In Proceedings of the 1987 National Computer Conference, pages 499-506, Chicago, IL, June 1987.

[CZ85] David Cheriton and Willy Zwaenepoel. Distributed process groups in the V kernel. ACM Transactions on Computer Systems, 3(2):77-107, May 1985.


[DC90] Stephen E. Deering and David R. Cheriton. Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(2):85-110, May 1990.

[GBCvR92] Brad B. Glade, Ken P. Birman, Robert C. Cooper, and Robbert van Renesse. Light-weight process groups. In Proceedings of the OpenForum '92 Technical Conference, pages 323-336, Utrecht, The Netherlands, November 1992.

[Ham84] K. G. Hamilton. A Remote Procedure Call System. PhD thesis, Computing Laboratory, University of Cambridge, Cambridge, England, December 1984. TR 70.

[HS92] M. A. Hiltunen and R. D. Schlichting. Modularizing fault-tolerant protocols. In Proceedings of the Fourth European SIGOPS Workshop, Rennes, France, September 1992. ACM.

[KTFHB89] M. Frans Kaashoek, Andrew S. Tanenbaum, Susan Flynn-Hummel, and Henri E. Bal. An efficient reliable broadcast protocol. Operating Systems Review, 23(4):5-19, October 1989.

[Lam92] C. Lamb. Speeding to the ATM. UNIX Review, 10(10):29-36, October 1992.

[LL86] Barbara Liskov and Rivka Ladin. Highly-available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the Fifth ACM Symposium on Principles of Distributed Computing, pages 29-39, Calgary, Alberta, August 1986. ACM SIGOPS-SIGACT.

[MCWB91] Keith Marzullo, Robert Cooper, Mark Wood, and Ken P. Birman. Tools for distributed application management. IEEE Computer, August 1991.

[PBS89] Larry L. Peterson, Nick C. Bucholz, and Richard Schlichting. Preserving and using context information in interprocess communication. ACM Transactions on Computer Systems, 7(3):217-246, August 1989.

[PHOA89] Larry L. Peterson, Norm Hutchinson, Sean O'Malley, and Mark Abbott. RPC in the x-Kernel: Evaluating new design techniques. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 91-101, Litchfield Park, Arizona, November 1989. ACM SIGOPS.

[RAA+88] Marc Rozier, Vadim Abrossimov, Francois Armand, Ivan Boule, Michel Gien, Marc Guillemont, Frederic Herrmann, Claude Kaiser, Sylvain Langlois, Pierre Leonard, and Will Neuhauser. CHORUS distributed operating systems. Computing Systems Journal, 1(4):305-370, December 1988. Chorus systemes Technical Report CS/TR-88-7, revised in CS/TR-90-25.1, Overview of CHORUS Distributed Operating Systems.

[RB91] Aleta Ricciardi and Ken P. Birman. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the Eleventh ACM Symposium on Principles of Distributed Computing, pages 341-351, Montreal, Quebec, August 1991. ACM SIGOPS-SIGACT.

[RBG92] Michael Reiter, Ken P. Birman, and Li Gong. Integrating security in a group-oriented distributed system. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 18-32, Oakland, California, May 1992.

[RBvR93] Mike Reiter, Ken P. Birman, and Robbert van Renesse. Fault-tolerant key distribution. Technical report, Department of Computer Science, Cornell University, 1993. In preparation.

[RK79] D. P. Reed and R. K. Kanodia. Synchronization with eventcounts and sequencers. Communications of the ACM, 22(2):115-123, February 1979.

[TvR88] Andrew S. Tanenbaum and Robbert van Renesse. A critique of the remote procedure call paradigm. In R. Speth, editor, Proceedings of the EUTECO '88 Conference, pages 775-783, Vienna, Austria, April 1988.

[TvRvS+90] Andy Tanenbaum, Robbert van Renesse, Hans van Staveren, Greg Sharp, Sape Mullender, Jack Jansen, and Guido van Rossum. Experiences with the Amoeba distributed operating system. Communications of the ACM, 33(12):46-63, December 1990.

[vRBGS92] Robbert van Renesse, Ken P. Birman, Brad Glade, and Patrick Stephenson. Options for adding group semantics to ports. Technical report, Department of Computer Science, Cornell University, January 1992. Internal report.

[vRvST89] Robbert van Renesse, Hans van Staveren, and Andrew S. Tanenbaum. The performance of the Amoeba distributed operating system. Software: Practice and Experience, 19(3):223-234, March 1989.

[Woo93] Mark D. Wood. Replicated RPC using Amoeba closed group communication. In Proceedings of the Thirteenth International Conference on Distributed Computing Systems, Pittsburgh, PA, 1993.
