Comparing Kernel-Space and User-Space Communication Protocols on Amoeba

Marco Oey, Koen Langendoen, Henri E. Bal

Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

This research was supported in part by a PIONIER grant from the Netherlands Organization for Scientific Research (N.W.O.).

Abstract

Most distributed systems contain protocols for reliable communication, which are implemented either in the microkernel or in user space. In the latter case, the microkernel provides only low-level, unreliable primitives and the higher-level protocols are implemented as a library in user space. This approach is more flexible but potentially less efficient. We study the impact on performance of this choice for RPC and group communication protocols on Amoeba. An important goal in this paper is to look at overall system performance. For this purpose, we use several (communication-intensive) parallel applications written in Orca. We look at two implementations of Orca on Amoeba, one using Amoeba’s kernel-space protocols and one using user-space protocols built on top of Amoeba’s low-level FLIP protocol. The results show that comparable performance can be obtained with user-space protocols.

1 Introduction

Most modern operating systems are based on a microkernel, which provides only the basic functionality of the system. All other services are implemented in servers that run in user space. Mechanisms provided by a microkernel usually include low-level memory management, process creation, I/O, and communication.

An important design issue for microkernels is which communication primitives to provide. Some systems provide primitives of a high abstraction level, such as reliable message passing, Remote Procedure Call [5], or reliable group communication [10]. An alternative approach, in the spirit of microkernels, is to provide only low-level, unreliable “send” and “receive” primitives, and to build higher-level protocols as a library in user space. The trade-offs are similar to those of other design choices for microkernels: implementing functionality in user space is more flexible but potentially less efficient.

Several researchers have implemented protocols like TCP/IP and UDP/IP in user space [12, 16]. They describe many advantages of this approach. For example, it eases maintenance and debugging, allows the co-existence of multiple protocols, and makes it possible to use application-specific protocols.

In this paper we compare kernel-space and user-space protocols for the Amoeba distributed operating system [15]. We consider the RPC and group communication protocols Amoeba provides. One of the main goals of this paper is to study the impact of user-space protocols on overall system performance. In general, implementing functionality outside the kernel entails some overhead, but in practice other factors often dominate and the overhead may well become negligible [6]. Therefore we study not only the performance of the protocols, but also that of applications.

Our study is based on parallel applications written in the Orca programming language [3]. We have built two implementations of Orca on Amoeba. The first one uses Amoeba’s high-level communication protocols. The second implementation makes calls to Amoeba’s low-level datagram protocol, FLIP [11], and uses its own communication protocols in user space. We have done extensive performance measurements at different levels of both systems, including the low-level primitives, the high-level protocols, and many Orca applications. These measurements were done on a large-scale Ethernet-based distributed system. Since most parallel Orca programs do a significant amount of communication, Orca is a good platform for a meaningful comparison.

The outline of the rest of the paper is as follows. In Section 2 we present the general structure of our Orca system. In Section 3, we describe the two Orca implementations on Amoeba. In Section 4, we discuss the performance of the communication primitives. In Section 5, we look at the performance of Orca applications. Finally, Section 6 contains a discussion and conclusions.

2 The Orca/Panda system

Orca is a language for writing parallel applications that run on distributed-memory systems (collections of workstations and multicomputers) [3]. Its programming model is based on shared data-objects, which are instances of abstract data types that can be shared among different processes. The shared data encapsulated in an object can only be accessed through the operations defined by the abstract data type. Each operation is applied to a single object and is guaranteed to be executed indivisibly. The model can be regarded as an object-based distributed shared memory with sequential consistency semantics [14].
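
As a rough illustration of this model (in Python rather than Orca, and greatly simplified; the class and method names are ours, not part of the Orca system), a shared data-object can be pictured as an abstract data type whose operations are serialized so that each one executes indivisibly. The placement and replication decisions discussed below are then made per object by the runtime system.

```python
import threading

class SharedObject:
    """Simplified analogue of an Orca shared data-object: the encapsulated
    state is only reachable through its operations, and a single lock makes
    each operation execute indivisibly (names here are illustrative)."""

    def __init__(self, value=0):
        self._lock = threading.Lock()
        self._value = value

    def write(self, value):           # a write operation
        with self._lock:
            self._value = value

    def read(self):                   # a read operation
        with self._lock:
            return self._value

# In the real system the RTS additionally decides, per object, whether to
# keep one copy (remote operations become RPCs) or to replicate it (writes
# are broadcast with totally ordered group communication, reads stay local).
```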

[Figure 1: Structure of the Orca/Panda system. The Orca runtime system runs on top of Panda, which consists of an interface layer (threads, RPC, group communication) and a system layer on top of the operating system.]

The Orca runtime system (RTS) takes care of object management. To reduce communication overhead, it may decide to store an object on one processor (preferably the one that accesses the object most frequently) or it may replicate the object and store copies of it on multiple processors. The decision of which strategy to use is made by the RTS, based on heuristic information provided by the compiler [2]. Different strategies may be used for different objects. In general, the RTS will replicate objects that are expected to be read frequently (based on the compiler-generated information). Objects with a low (expected) read/write ratio are stored on a single processor.

The Orca RTS is implemented on top of an abstract machine called Panda [4]. Panda provides threads, Remote Procedure Call, and group communication. Group communication is totally ordered, which means that all messages will be received by all processors in the same total order. The Orca RTS is fully independent of the operating system and architecture. It uses Panda RPC to implement remote invocations on nonreplicated objects. Read-only operations on replicated objects are executed locally, without doing any communication. Write operations on replicated objects are implemented by broadcasting the operation and its parameters to each processor holding a copy of the object, using Panda group communication. Each processor applies the operation to its local copy. Since the group communication is totally ordered, all copies remain consistent.

Recently, the Orca RTS has been improved to handle condition synchronization on nonreplicated shared objects more efficiently. Orca operations may specify that a certain condition, called a guard, should hold before the operation is started. If an operation on a remote object blocks, the Orca RTS no longer blocks the RPC server thread, but queues a continuation [7] at the object. Later, when another operation modifies the state of the object, the guard of the blocked operation is checked. If it evaluates to true, the blocked operation is retrieved from the continuation and resumed. This optimization reduces the number of blocked threads and saves a context switch, since the thread that modifies the state sends back the reply of the blocked operation itself instead of waking up a blocked server thread. In the next section it is shown that only the flexible user-space protocols can fully exploit this improvement in the Orca RTS.
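
The continuation mechanism can be sketched as follows. This is an illustrative Python sketch under our own naming (GuardedObject, invoke, and modify are not the RTS API); it only shows how a blocked guarded operation is queued at the object and later completed by the thread that changes the state, rather than by a dedicated server thread.

```python
import threading

class GuardedObject:
    """Sketch of continuation-based guarded operations (illustrative only)."""

    def __init__(self, state):
        self._lock = threading.Lock()
        self._state = state
        self._continuations = []      # blocked operations queued at the object

    def invoke(self, guard, operation, send_reply):
        """Run `operation` if `guard(state)` holds; otherwise queue a
        continuation instead of blocking the calling (server) thread."""
        with self._lock:
            if guard(self._state):
                send_reply(operation(self._state))
            else:
                self._continuations.append((guard, operation, send_reply))

    def modify(self, mutator):
        """A state-changing operation re-checks the queued guards and sends
        the pending replies itself, saving a context switch to a blocked
        server thread."""
        with self._lock:
            self._state = mutator(self._state)
            still_blocked = []
            for guard, operation, send_reply in self._continuations:
                if guard(self._state):
                    send_reply(operation(self._state))
                else:
                    still_blocked.append((guard, operation, send_reply))
            self._continuations = still_blocked
```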

Internally, Panda consists of two layers (see Figure 1). The system layer is partly operating system dependent and provides point-to-point communication and multicast (not necessarily reliable). The interface layer uses these primitives to provide a higher-level interface to the RTS. An important issue is in which layer the reliability and ordering semantics should be implemented. Panda has been designed to be flexible and allows both layers to implement any of these semantics. For example, on most multicomputers the hardware communication primitives are reliable, so it is most efficient to have the system layer provide reliable communication. In this case the interface layer will be simple. If the communication primitives provided by the system layer are unreliable, the interface layer uses protocols to make point-to-point communication reliable. Likewise, Panda’s interface layer has a protocol for making group communication reliable and totally ordered.

The Panda RPC protocol is a 2-way stop-and-wait protocol. The client sends a request message to the server. The server executes the request and sends back a reply message, which also acts as an implicit acknowledgement for the request. Finally, the client needs to send back an acknowledgement for the reply message. If possible, the Panda RPC protocol piggybacks this acknowledgement on another request message, and only sends an explicit acknowledgement message after a certain timeout. This optimization is the major difference from Amoeba’s 3-way RPC protocol, which always sends back an explicit acknowledgement message.

The protocol for totally ordered group communication is similar to the Amoeba protocol of Kaashoek [10]. It uses a sequencer to order all messages. To broadcast a message, the sender passes the message to the sequencer, which tags it with the next sequence number and then does the actual broadcast. The receivers use the sequence numbers to determine if they have missed a message, in which case they ask the sequencer to send the message again. For this purpose, the sequencer keeps a history of messages that may not have been delivered at all machines yet. The protocol has several mechanisms to prevent overflow of the history buffer. Also, for large messages, both Amoeba and Panda use a more efficient protocol, in which the senders broadcast messages themselves and the sequencer broadcasts (small) acknowledgement messages [10]. (A sketch of the sequencer scheme is given at the end of this section.)

The Orca/Panda system was originally developed on top of SunOS running on a collection of Sun-4 workstations [4]. The system has been ported to several operating systems (including Amoeba, Solaris, Horus, and Parix) and machines (including the CM-5 and the Parsytec transputer grid). In this paper, we will only look at the Amoeba implementations.
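
To make the sequencer-based group protocol concrete, the following is a minimal Python sketch under assumed names (Sequencer, Member, and the callbacks passed to them are ours, not Panda’s): the sequencer tags and broadcasts messages and keeps a bounded history for retransmission, while members deliver messages strictly in sequence-number order and ask for the holes they detect. The history-overflow protection and the large-message variant mentioned above are omitted.

```python
import threading
from collections import deque

class Sequencer:
    """Illustrative sketch of sequencer-based totally ordered multicast."""

    def __init__(self, broadcast, history_size=128):
        self._broadcast = broadcast          # unreliable multicast primitive
        self._lock = threading.Lock()
        self._next_seqno = 0
        self._history = deque(maxlen=history_size)

    def sequence(self, message):
        """Tag a message with the next sequence number and broadcast it."""
        with self._lock:
            seqno = self._next_seqno
            self._next_seqno += 1
            self._history.append((seqno, message))
        self._broadcast((seqno, message))
        return seqno

    def retransmit(self, seqno, send_to):
        """Called when a member reports a gap in the sequence numbers."""
        with self._lock:
            for s, message in self._history:
                if s == seqno:
                    send_to((s, message))
                    return True
        return False

class Member:
    """Receiver side: deliver messages strictly in sequence-number order."""

    def __init__(self, request_retransmit, deliver):
        self._expected = 0
        self._pending = {}
        self._request_retransmit = request_retransmit
        self._deliver = deliver

    def receive(self, tagged):
        seqno, message = tagged
        if seqno < self._expected:
            return                            # duplicate, already delivered
        self._pending[seqno] = message
        if seqno > self._expected:            # gap: ask the sequencer for it
            for missing in range(self._expected, seqno):
                if missing not in self._pending:
                    self._request_retransmit(missing)
        while self._expected in self._pending:
            self._deliver(self._pending.pop(self._expected))
            self._expected += 1
```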

3 Kernel-space and user-space communication protocols

Panda has been implemented in two different ways on Amoeba. The first implementation uses Amoeba’s kernel-space RPC and group communication protocols. The second implementation uses Amoeba’s low-level, unreliable communication primitives and runs the Panda protocols on top of them in user space. Both implementations use Amoeba threads, which are created and scheduled (preemptively) by the kernel. Mutexes are provided for synchronization between the threads of one process. Panda uses them to implement its own mutexes and condition variables, which are provided to the Panda user (i.e., the Orca RTS). This section describes both implementations of Panda.

[Figure 2: Two implementations of Panda on Amoeba. Left, the kernel-space implementation: the interface layer consists of wrapper routines over the RPC and group communication protocols in the Amoeba microkernel; the system layer contains no communication code. Right, the user-space implementation: the Panda RPC and Panda group communication protocols run in user space on top of unreliable communication provided by FLIP.]

3.1 Kernel-space protocols

Our first implementation of Panda on Amoeba uses the Amoeba RPC and totally ordered group communication protocols to implement the corresponding Panda primitives. The goal of this implementation is to wrap the Amoeba primitives to make them directly usable by the Orca RTS.

The communication primitives provided by Amoeba differ slightly from the Panda primitives. Amoeba expects a server thread to block waiting for a request. Server threads in Amoeba therefore repeatedly call get_request to wait for the next request. Panda, on the other hand, is based on implicit message receipt, in which a new (pop-up) thread is created for each incoming RPC request. The difference between the two models, however, is easy to overcome. The Panda model is implemented on top of the Amoeba model by using daemon threads. An RPC daemon waits for an incoming RPC request (using get_request) and then does an upcall to a procedure that handles the request and computes the reply.

A restriction of Amoeba’s RPC primitives is that the reply message has to be sent back by the same thread that issued the get_request primitive. As explained in Section 2, the Orca RTS may send back a reply message from a different thread when handling guarded operations. The Panda RPC interface function pan_rpc_reply is implemented on top of the Amoeba kernel by signaling the original thread so it can send back the message using the put_reply system call. Note that this solution, which works around the inflexible kernel RPC, undoes the Orca RTS optimizations and re-introduces an additional context switch and increased memory usage because of the blocked server thread.

The structure of this implementation is shown in the left part of Figure 2. The Panda system layer does not contain any code related to communication. The interface layer mainly consists of wrapper routines that make the Amoeba primitives look like the Panda primitives. The overhead introduced by this is small. We refer to this system as the kernel-space implementation.
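
The wrapper can be pictured with the following Python sketch (the real code is C against the Amoeba kernel interface; get_request and put_reply are passed in here as plain callables, and the structure is ours). It shows both the daemon-based upcall and the workaround for pan_rpc_reply: a reply produced by another thread can only be handed to the daemon, which costs the extra context switch noted above.

```python
import threading

def rpc_daemon(get_request, put_reply, handle_request):
    """Sketch (not the actual Panda code) of the kernel-space wrapper:
    a daemon thread turns Amoeba-style explicit receipt into Panda-style
    upcalls.  Because Amoeba requires put_reply to be issued by the thread
    that called get_request, a reply produced asynchronously by another
    thread can only signal this daemon."""
    while True:
        request = get_request()               # block for the next RPC request
        reply_ready = threading.Event()
        reply_box = {}

        def pan_rpc_reply(message):
            # May be called from another thread, e.g. the one that later
            # resumes a blocked guarded operation; it cannot reply directly.
            reply_box['msg'] = message
            reply_ready.set()

        handle_request(request, pan_rpc_reply)  # upcall into the Orca RTS
        reply_ready.wait()                      # the extra context switch
        put_reply(reply_box['msg'])             # reply from the same thread
```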

3.2 User-space protocols

A radically different implementation of Panda is to bypass the Amoeba protocols and directly access the low-level communication primitives (see the right part of Figure 2). These primitives are provided by FLIP (Fast Local Internet Protocol), which is a network layer protocol. FLIP differs from the more familiar Internet Protocol (IP) in many important ways.

In particular, FLIP supports location transparency, group communication, large messages, and security [11]. FLIP offers unreliable communication primitives for sending a message to one process (unicast) or a group of processes (multicast).

For this implementation, we have used Panda’s RPC and group-communication protocols described in Section 2. These protocols were initially developed on UNIX (see Section 2). They remained unchanged during the port to Amoeba. We only had to implement Panda’s system-level primitives on top of FLIP. All modifications to the system thus were localized in the system-dependent part of Panda (see Figure 2). In summary, this implementation treats Amoeba as a system providing unreliable and unordered communication. We refer to this system as the user-space implementation.

The system layer uses one daemon thread to receive FLIP packets. FLIP fragments large messages, so the daemon reassembles packets into a complete message, checks if the message is to be delivered to the group or RPC module, and makes an upcall to the corresponding handler in the interface layer. These handlers run their protocols (e.g., to order group messages) and make upcalls to the message-handler procedures registered by the Panda user. Panda requires these upcall procedures to run to completion quickly, so that messages can be processed completely by the system-level daemon without context switching to intermediate threads. The Orca RTS uses continuations to handle long-term blocking of operation invocations (see Section 2), so its message handlers fulfill this requirement (since Orca operations normally take little time).

Unlike the kernel-space RPC protocol, no workarounds are needed to support the asynchronous pan_rpc_reply call. The flexible user-space protocol can take full advantage of the Orca RTS optimizations. In comparison to a previous Panda version that included daemon threads at the interface layer to support blocking Orca message handlers, the latency of RPC and group messages has dropped by 300 μs. The Amoeba high-level RPC and group protocols could not take advantage of the change in the Orca RTS and still incur a context switch when handling blocked object invocations.
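
A minimal sketch of such a receive daemon, in Python with assumed packet fields (msg_id, frag_no, nr_frags, is_group, data) rather than the real FLIP interface, is shown below. The point is that reassembly and dispatch happen in a single thread, and the handlers are invoked as run-to-completion upcalls.

```python
def receive_daemon(recv_packet, rpc_handler, group_handler):
    """Sketch of the user-space system-layer daemon (illustrative only; the
    real code is C on top of FLIP).  Packets are reassembled into messages
    and dispatched by a direct upcall, so no intermediate thread is needed
    as long as the handlers run to completion."""
    partial = {}                                   # message id -> fragments
    while True:
        pkt = recv_packet()                        # blocking receive
        frags = partial.setdefault(pkt['msg_id'], [None] * pkt['nr_frags'])
        frags[pkt['frag_no']] = pkt['data']
        if any(f is None for f in frags):
            continue                               # message not complete yet
        message = b''.join(frags)
        del partial[pkt['msg_id']]
        handler = group_handler if pkt['is_group'] else rpc_handler
        handler(message)                           # run-to-completion upcall
```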

4 Performance of the communication protocols

We measured the performance of the protocols on Amoeba version 5.2, running on SPARC processors. Each processor runs at 50 MHz and is on a single board (a Tsunami board, equivalent to the Sun Classic), which contains 32 MB of local memory. The boards are connected by a 10 Mbit/s Ethernet. Below, we describe the performance of the Panda system-layer primitives and of the RPC and group communication protocols. All reported measurements are average values of 10 runs with little variation: less than 1%.

4.1 Performance of the system-layer primitives

Table 1 shows the latency for the unicast and multicast primitives that are provided by the Panda system layer in the user-space implementation (see Figure 2). These library routines do system calls to invoke the corresponding FLIP primitives. All measurements are from user process to user process.

message size   unicast user   multicast user   RPC user   RPC kernel   group user   group kernel
0 Kb           0.53 ms        0.62 ms          1.56 ms    1.27 ms      1.67 ms      1.44 ms
1 Kb           1.50 ms        1.58 ms          2.53 ms    2.23 ms      3.59 ms      3.38 ms
2 Kb           2.50 ms        2.55 ms          3.60 ms    3.40 ms      3.67 ms      3.44 ms
3 Kb           3.72 ms        3.74 ms          4.77 ms    4.48 ms      4.84 ms      4.56 ms
4 Kb           4.18 ms        4.23 ms          5.27 ms    5.06 ms      5.35 ms      5.25 ms

Table 1: Communication Latencies.

The two primitives are almost equally expensive, because Ethernet provides multicast in hardware. Unicast latency was measured with a simple “ping-pong” program, in which two machines repeatedly send messages of various sizes to each other. Multicast latency was measured with a similar program in which two machines repeatedly multicast a message to the “group” on receipt of a message from the other machine. Since the replies are sent directly from within the upcall issued by the Panda layer, the latency figures do not include any context switching overhead. This holds for both the unicast and multicast latency.

The nonlinear relation between latency and message length is due to the fragmentation performed by the low-level FLIP primitives in the Amoeba kernel. Messages are broken down into maximum-length Ethernet packets of 1500 bytes, which are reassembled in user space at the receiving side. Receiving a packet incurs some overhead costs, as can be seen from the difference in latency between 2 Kb, 3 Kb, and 4 Kb messages. A 2 Kb message can be transmitted in two packets, while both 3 Kb and 4 Kb messages take three packets, hence the relatively small latency difference between sending 3 Kb and 4 Kb messages.
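
The packet counts behind this explanation are easy to reproduce; the small calculation below ignores packet headers and simply divides the message size by the 1500-byte maximum packet length mentioned above.

```python
import math

ETHERNET_MTU = 1500          # maximum Ethernet packet used by FLIP (bytes);
                             # header overhead is ignored in this sketch

def nr_packets(message_bytes):
    return max(1, math.ceil(message_bytes / ETHERNET_MTU))

for kb in range(5):
    print(kb, "Kb ->", nr_packets(kb * 1024), "packets")
# 2 Kb needs two packets, while 3 Kb and 4 Kb both need three, which matches
# the small latency difference between 3 Kb and 4 Kb messages in Table 1.
```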

4.2 Performance of the Remote Procedure Call protocols

RPC latency (see Table 1) was measured using RPC requests of various sizes and empty reply messages. The RPC throughput (see Table 2) was measured by sending requests of 8000 bytes each and sending back empty replies. As can be seen, the kernel-space implementation has a lower RPC latency and, consequently, a higher throughput than the user-space implementation. For null messages, for example, the kernel-space protocols are 0.3 ms faster (1.27 ms versus 1.57 ms). We think it is very important to determine how much of this difference is due to fundamental limits of a user-space protocol and how much is due to implementation differences between Amoeba and Panda. Therefore, we give an analysis of this difference. The Amoeba and Panda protocols are both tuned fairly well, although more effort was spent on optimizing the Amoeba protocol.

The most important difference between the two implementations is that the user-space protocol incurs additional context switches, even though Panda makes upcalls directly from within the system-level receive daemon. The user-space protocol needs extra context switches at the client side when handling the reply message. First the system-level daemon has to be scheduled when the reply arrives, and then the reply has to be passed on to the client thread.

        user-space   kernel-space
RPC     825 Kb/s     897 Kb/s
group   941 Kb/s     941 Kb/s

Table 2: Communication Throughputs.

Hence, two context switches are needed to process the reply message. With the kernel-space protocol, on the other hand, Amoeba immediately delivers the reply message to the blocked client thread; no context switches are needed since no other thread was scheduled between sending the request and receiving the reply. We measured inside the Amoeba kernel that the total overhead of the two context switches is about 140 μs, which already explains half the difference in latency between the user-space and kernel-space implementation.

Thread handling in Amoeba is expensive since only kernel-level threads are provided, so each operation that involves threads (e.g., signaling) potentially has to go through the kernel. Consequently, the user-space implementation needs four additional crossings between kernel and user address spaces for each RPC: two for waking up the client thread, and two for switching from the daemon thread to the client thread. The costs of an address space crossing are not fixed but depend on the depth of the call stack, because the kernel has to save and restore register windows. Our SPARC processors use six register windows of fixed size. A new window is allocated during each procedure call. When the user invokes a system call, the Amoeba kernel first saves all register windows in use, performs the system call, and then restores the single topmost register window, before returning to user space. When the user program continues, the register windows deeper in the call stack are faulted in through underflow traps. These traps are handled in software by the operating system, hence they are rather expensive: about 6 μs per trap.

The user-space implementation suffers from the Amoeba policy of only restoring the topmost register window. When the daemon thread enters the kernel to signal the client thread about the arrival of its reply message, the daemon thread’s stack is using all register windows. Consequently, the daemon suffers six additional underflow traps when returning down the call stack after the system call has finished. The combined overhead of crossing the address space boundary and underflow traps is about 50 μs.

At the server machine, both the user-space and kernel-space implementations cause one context switch and two address space crossings.

A related difference that has little to do with running protocols in user space or kernel space is due to the fact that (for software engineering reasons) procedure calls in Panda are more deeply nested than in Amoeba. This not only results in more function call overhead for the user-space implementation, but also causes extra register window overflow and underflow traps. An important case where the user-space implementation suffers from additional overflow traps is at the server side of the RPC latency test. When sending back the reply, the user-space implementation goes down the protocol layers by stacking additional procedure invocations, hence generating overflow traps. In contrast, the kernel-space implementation just records a pointer to the reply message, returns from the protocol stack because it has finished processing the request, and finally at the bottom layer sends back the reply by invoking the mandatory put_reply system call.

Another difference is that the user-space implementation suffers from more locking overhead. Profiling data shows that it does seven times more lock() calls than the kernel-space implementation. The kernel-space protocol rarely needs to lock shared data, because internal kernel threads are scheduled nonpreemptively in Amoeba. Fortunately, acquiring and releasing locks in user space can be done cheaply if no other thread is holding the lock or waiting for it. Therefore the overhead is negligible in comparison to context switching and trapping costs.

Two more differences have a noticeable impact on the latency. First, the user-space implementation uses slightly larger headers (64 bytes vs. 56 bytes), which amounts to 2 × 8 = 16 μs latency on the Ethernet. Second, the user-space implementation includes portable fragmentation code to handle large messages that are broken up into small fragments. The kernel-space implementation relies on the FLIP layer in the Amoeba kernel to do fragmentation. Note that the user-space implementation therefore includes two layers that provide fragmentation, which results in an overhead of about 20 μs per message, i.e. 40 μs per RPC. At the moment it is impossible to leave out the RPC fragmentation code without completely changing the software structure of Panda.

In summary, the only essential component of the performance difference is the two context switches, accounting for 140 μs. The fact that Amoeba currently provides only kernel threads causes an additional overhead of 50 μs for register window traps and address space crossings. Increased functionality (fragmentation) takes 40 μs. The larger header size of the user-space RPC protocol is an implementation detail; it accounts for 16 μs. This leaves a gap of 54 μs in the performance difference. This can largely be attributed to the fact that the Amoeba extensions that make the low-level FLIP interface available to the user-space implementation have not yet been optimized: for instance, user-to-kernel address translation can be sped up considerably.
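
The cost accounting of this summary can be tallied as follows (all values in microseconds, taken directly from the text; the split is approximate).

```python
# Rough bookkeeping of the null-RPC latency gap discussed above.
measured_gap = 1570 - 1270               # user-space vs. kernel-space latency (us)

components = {
    "two extra context switches": 140,   # the only essential cost
    "kernel-thread signalling (traps + crossings)": 50,
    "double fragmentation (request + reply)": 40,
    "8 bytes larger header, twice per RPC": 16,
}

explained = sum(components.values())
print(explained, "us explained,", measured_gap - explained, "us unexplained")
# -> 246 us explained, 54 us attributed to the untuned user-level FLIP interface
```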

4.3 Performance of the group communication protocols

Group communication latency is measured by creating a group of two members and having one of them send group messages.

The sending member waits until it gets its own message back from the sequencer (which is on the other processor). We used messages of different sizes for measuring the latency; see Table 1. The difference between user-space and kernel-space latency is about 0.23 ms. Throughput is measured by creating a group of multiple members sending messages of 8000 bytes in parallel. This effectively saturates the Ethernet, so the user-space and kernel-space implementations achieve equal throughput.

In contrast to the RPC protocol, the group protocol in user space does not incur an additional context switch at the client machine. Both the kernel-space and user-space implementations receive the ordered message in a daemon thread, which signals the client thread that it can send the next message. Therefore both implementations have a context switch in the critical path at the machine of the sending member. There is, however, a subtle difference that causes 40 μs overhead in the case of the user-space implementation. When the ordered message from the sequencer arrives, the client thread is sleeping on a condition variable and has to be notified by the daemon thread. This requires a system call and causes a number of underflow traps when returning from the kernel. In the kernel-space implementation the grp_send call is blocking in the sense that the calling thread is suspended until the message has returned from the sequencer. Hence, unblocking the sleeping client thread does not require an expensive address space crossing.

Like the RPC protocol, the user-space group communication protocol includes code to fragment and reassemble large messages even though the FLIP interface of Amoeba is already capable of doing so. Note that double fragmentation occurs only at the sending member, because the sequencer is written to order group messages at the fragment level. Therefore the user-space group protocol only incurs a 20 μs overhead, whereas the RPC implementation lost 40 μs to fragment both the request and reply message. In total, the user-space group protocol incurs 60 μs overhead at the sending member machine; the remaining overhead is incurred at the sequencer machine.

In the case of the kernel-space implementation, the sequencer runs entirely inside the Amoeba kernel, so no time is wasted in crossing the user-kernel address space boundary. In the case of the user-space implementation, however, each message has to cross this boundary twice. The sequencer issues two system calls: one to fetch a message from the network and another to multicast the message including the sequence number describing the global ordering. These additional system calls increase the latency by about 40 μs, because of address space crossings and additional underflow traps. Another source of overhead is that both the unordered incoming message and the ordered outgoing message have to be transferred from one address space to the other. This requires additional user-to-kernel address translations and data copying in comparison to the kernel-space implementation. The extra overhead accounts for 30 μs.

Another major difference between the user-space and kernel-space group protocols is that the Amoeba group code is invoked from within the (software) interrupt handler, whereas the sequencer in the user-space implementation is a separate thread. Consequently, the user-space implementation does an additional thread switch, which takes about 110 μs.

This switch is so expensive because the interrupt handler first runs to completion, then the scheduler is invoked, and finally the context of the current thread can be saved, so the sequencer thread can be resumed. At the sequencer node, an extra thread runs to deliver the group message to the user. Since this thread has run last to deliver the previous message, a full context switch is needed. The performance of the user-space protocol can be improved by running the sequencer on a dedicated machine such that no member thread needs to run to process group messages. This effectively reduces the context switch time to 60 μs, since the sequencer context is still loaded when a message arrives.

The user-space implementation performs better when considering the Ethernet network latency, because it works with small headers of 40 bytes, whereas the kernel-space implementation prepends each data message with a 52-byte header. At this point the user-space implementation saves about 2 × 12 = 24 μs per sequenced message.

In summary, the difference in performance of the group protocols has as its essential components one context switch, accounting for 110 μs, and one address space crossing, accounting for 40 μs. Thus the total essential overhead for the user-space implementation is 150 μs. The fact that Amoeba only provides kernel threads causes an additional overhead of 50 μs for register window traps and address space crossings at the sending member machine. Increased functionality (fragmentation) takes another 20 μs, but the smaller header size yields an improvement of 24 μs. In total, this leaves a gap of about 30 μs in the performance difference. This can again be attributed to the untuned part of the Amoeba kernel making the FLIP interface available to user programs.

5 Performance of parallel applications

In this section we give performance measurements of six parallel Orca applications on Amoeba. The measurements were done on a processor pool of SPARC processors as described in the previous section. The pool consists of several Ethernet segments connected by an Ethernet switch. Each segment connects eight processors by a 10 Mbit/s Ethernet. The six Orca applications selected for the measurements together solve a wide range of problems with different characteristics. Some of the applications are described in [1]. The applications were run on an Orca implementation using kernel-space communication protocols, and on an Orca implementation using user-space protocols. The two implementations use exactly the same runtime system and compiler.

The absolute execution times of the applications and maximum speedup (relative to the single-processor case) are given in Table 3. (Most applications achieve higher speedups for larger input problems, but this is not relevant for the current comparison.) For each application and for both protocols we give the elapsed time on different numbers of processors. The reported measurements are the mean values obtained in 10 runs. The variation was less than 0.2%.

To make a fair comparison between both implementations, we have gone to considerable effort to eliminate caching effects. Caching effects played a significant role in earlier measurements, since the SPARC processors are equipped with small direct-mapped caches: a 4 Kb instruction cache and a 2 Kb data cache.

We have observed a factor of two difference in performance due to conflicts in the instruction cache. By linking the code fragments common to both Orca implementations (e.g., the runtime system) at fixed positions, we succeeded in achieving almost equal execution times for both implementations when running on a single processor. The measured differences are typically quite small: less than 2%.

For the Travelling Salesman Problem (TSP), the kernel-space implementation marginally outperforms the user-space implementation. Both implementations achieve superlinear speedups because of a different search order and a reduction in cache misses in the data cache. The coarse-grain nature of this application limits the advantage of the kernel-space implementation’s communication primitives. The frequently accessed data object holding the shortest path is replicated by the Orca RTS, so it can be read locally. The only communication that takes place is needed for operations to fetch jobs from a central queue object, but the number of jobs is small: 2184.

The figures for the All-Pairs Shortest Paths program (ASP) also show only a marginal difference between the user-space and kernel-space protocols. Again this is a consequence of the infrequent usage of communication; the program sends 768 group messages to coordinate an iterative process. The moderate speedup is caused by the relatively high latency that each group message of 3200 bytes incurs: about 5 ms per message, which sums to a delay of almost 4 seconds. The summed latency difference between the user-space and kernel-space implementations is 200 ms, which is reflected in the slightly better speedup attained by the kernel-space implementation.

The Alpha-Beta Search program (AB) has also been written in a coarse-grained style and does not communicate a lot. The poor speedups are caused by the search overhead the parallel algorithm incurs; efficient pruning in parallel alpha-beta search is a known hard problem.

The Region Labeling (RL) and Successive Overrelaxation (SOR) programs are both finite-element methods that iteratively update all elements of a matrix. At the end of each iteration, processors exchange boundary elements with their neighbors by means of shared buffer objects. This exchange causes the kernel-space implementation to perform worse than the user-space implementation if the grain size of the application is small. For example, Region Labeling on the kernel-space implementation takes six seconds longer to run to completion on 32 processors than its user-space counterpart. The kernel-space implementation suffers from an additional context switch per remote guarded BufGet operation that blocks until the buffer is filled by its owning processor. Likewise, the BufPut operation blocks if the buffer is full. The kernel-space implementation needs an additional context switch because the Amoeba RPC implementation demands that a matching get_request and put_reply are issued by the same thread (see Section 3.1), while the operation that sets the guard to true is necessarily executed by another thread.

The execution times for the Region Labeling and Successive Overrelaxation programs show that the performance of both implementations flattens out at 16 processors. This behavior is caused by the limited bandwidth of the Ethernet; apparently, on both implementations the programs cause network saturation for a large number of processors and therefore achieve poor speedups.
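
As a quick consistency check on the ASP figures above, using only the numbers quoted in the text:

```python
# Back-of-the-envelope check of the ASP figures quoted above.
messages = 768            # group messages sent by ASP
latency_ms = 5.0          # approximate latency of one 3200-byte group message

print(messages * latency_ms / 1000)   # ~3.8 s: the "almost 4 seconds" delay

# A summed difference of 200 ms between the two implementations corresponds to
print(200 / messages)                  # ~0.26 ms per message, close to the
                                       # ~0.23 ms group-latency gap of Section 4.3
```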

Orca Application              Implementation          1     8    16    32   max. speedup
Travelling Salesman Problem   Kernel-space          790    87    44    23       34.2
                              User-space            783    92    46    24       32.2
All-Pairs Shortest Paths      Kernel-space          213    30    17    11       18.4
                              User-space            216    31    18    11       19.0
Alpha-Beta Search             Kernel-space          565   106    78    60        9.3
                              User-space            567   106    78    59        9.5
Region Labeling               Kernel-space          759   132   115   114        6.6
                              User-space            767   133   119   108        7.1
Successive Overrelaxation     Kernel-space          118    20    14    13        9.0
                              User-space            118    19    13    11       10.3
Linear Equation Solver        Kernel-space          521   102    91   127        5.7
                              User-space            527   113   112   164        4.7
                              User-space-dedicated  527   116    94   128        5.6

Table 3: Performance of Orca applications; execution times [sec] on 1, 8, 16, and 32 processors, and maximum speedup.

The Linear Equation Solver (LEQ) is the only application that shows a clear advantage for the kernel-space protocol. The poor performance on the user-space implementation is due to the sequencer’s machine. This processor handles broadcast requests from all other machines, but it must also process all incoming update messages and run an Orca process (as all machines do). With 32 processors, this machine becomes overloaded and slows down the iterative application. The solution is to dedicate one processor to just sequencing; this not only reduces the latency of a group message by 50 μs (see Section 4.3), but also avoids the overhead of preempting the Orca process at the sequencer for each incoming message. The kernel-space implementation does not suffer from the overhead of context switching between two threads in user space, because the sequencer is run as part of the interrupt handler inside the kernel.

The row labeled “User-space-dedicated” lists the timings for the user-space implementation that sacrifices one processor to run only the sequencer part of the group protocol. On eight processors the loss of one worker process does not outweigh the benefits, but at 16 and 32 processors the dedicated sequencer clearly pays off. For example, on 16 processors the 15 workers now solve the linear equations in 94 seconds instead of 112 seconds. Note that the execution times increase when going from 16 processors to 32 for all three implementations. This is a consequence of the parallel algorithm, which then sends twice the number of group messages of half the size. The decrease in computation time does not outweigh the increase in message handling overhead.

For the coarse-grained applications (TSP, ASP, and AB), the performance figures listed in Table 3 show no significant advantage for either of the two protocols. Apparently, the poorer results for the user-space implementation on the latency tests reported in Section 4 do not influence these Orca applications. For the finer-grained applications (RL, SOR, LEQ), the performance results show two effects.

First, the kernel-space implementation does better when a lot of group communication occurs, because the sequencer in the user-space implementation is overloaded due to additional context switches. Second, the user-space implementation does better when a lot of remote object invocations block in guarded operations, because the kernel-space implementation needs an additional context switch in this case to conform to Amoeba’s RPC requirements.

6 Discussion

This paper has shown that user-space communication protocols can be implemented on top of Amoeba with an acceptable performance loss. The communication performance is somewhat lower than that of kernel-space protocols. The great advantage of user-space protocols is increased flexibility. For example, we have incorporated continuations [7] in our Orca RTS to reduce the context switching overhead inside the Panda RPC protocol and to save on stack space for blocking Orca operations. The Amoeba RPC protocol does not support the asynchronous transmission of the reply message, which leads to an additional context switch when handling blocking Orca operations. As another example, we intend to add nonblocking broadcast messages to our system, where the sending thread does not have to wait for its message to come back from the sequencer. For some write operations, nonblocking broadcasts can be used without violating Orca’s sequential consistency semantics. With the Amoeba broadcast protocol this optimization would require modifications to the kernel.

Our work is related to several other systems. Implementations of user-space protocols are also studied in [12, 16]. These studies use a reliable byte-stream protocol (TCP/IP) and an unreliable datagram protocol (UDP/IP). Our work is based on the same motivations. A significant contribution of our work is that we have written several parallel applications on top of the kernel-space and user-space protocols. We have used them as example applications to study the impact on overall performance.

The Mach system also has a user-space protocol for communication between different machines, but this protocol is implemented by separate servers and consequently has a high overhead [13]. Protocol decomposition is studied in the x-kernel [13], which allows protocols to be connected in a graph, depending on the needs of the application. User-space group communication protocols are studied in the Horus project [18]. Horus uses a multicast transport service (MUTS) that can run in kernel space or in user space. Application-specific protocols in user space are supported by Tempest, which eases the implementation of multiple coherence protocols for distributed shared memory [9].

The performance of our user-space implementation could be improved significantly if user-level access to the network were allowed, since such access would eliminate many system calls. An alternative optimization is to have the kernel execute user code. Since Orca is a type-secure language, it is possible to have the kernel execute operations on shared objects in a secure way. Several groups are working on this idea [17, 8].

Acknowledgements

Raoul Bhoedjang implemented parts of the Orca runtime system and Panda RPC. Tim Rühl also implemented parts of Panda RPC and the Panda system layer. Rutger Hofman implemented Panda’s broadcast protocol, and helped in analyzing its performance characteristics. Ceriel Jacobs wrote the Orca compiler. Frans Kaashoek was one of the main designers of the Panda system. Kees Verstoep provided valuable assistance when tracking down time spent inside the Amoeba kernel. We would like to thank Raoul Bhoedjang, Leendert van Doorn, Saniya Ben Hassen, Rutger Hofman, Ceriel Jacobs, Frans Kaashoek, Tim Rühl, and the anonymous referees for their useful comments on the paper.

References

[1] H.E. Bal. Programming Distributed Systems. Prentice Hall Int’l, Hemel Hempstead, UK, 1991.

[2] H.E. Bal and M.F. Kaashoek. Object distribution in Orca using compile-time and run-time techniques. In Conference on Object-Oriented Programming Systems, Languages and Applications, pages 162–177, Washington D.C., September 1993.

[3] H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18(3):190–205, March 1992.

[4] R. Bhoedjang, T. Rühl, R. Hofman, K. Langendoen, H.E. Bal, and M.F. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, pages 213–226, September 1993.

[5] A.D. Birrell and B.J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2(1):39–59, February 1984.

[6] F. Douglis, J.K. Ousterhout, M.F. Kaashoek, and A.S. Tanenbaum. A comparison of two distributed systems: Amoeba and Sprite. Computing Systems, 4(4):353–384, 1991.

[7] R.P. Draves, B.N. Bershad, R.F. Rashid, and R.W. Dean. Using continuations to implement thread management and communication in operating systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 122–136. ACM SIGOPS, October 1991.

[8] D. Engler, M.F. Kaashoek, and J. O’Toole. The operating system kernel as a secure programmable machine. In Proceedings of the 6th SIGOPS European Workshop, Wadern, Germany, September 1994. ACM SIGOPS.

[9] B. Falsafi, A.R. LeBeck, S.K. Reinhardt, I. Schoinas, M.D. Hill, J.R. Larus, A. Rogers, and D.A. Wood. Application-specific protocols for user-level shared memory. In Supercomputing ’94, pages 380–389, November 1994.

[10] M.F. Kaashoek. Group Communication in Distributed Computer Systems. PhD thesis, Vrije Universiteit, Amsterdam, December 1992.

[11] M.F. Kaashoek, R. van Renesse, H. van Staveren, and A.S. Tanenbaum. FLIP: An internet protocol for supporting distributed systems. ACM Transactions on Computer Systems, 11(1):73–106, January 1993.

[12] C. Maeda and B.N. Bershad. Protocol service decomposition for high-performance networking. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 244–255. ACM SIGOPS, December 1993.

[13] L.L. Peterson, N. Hutchinson, S. O’Malley, and H. Rao. The x-kernel: A platform for accessing internet resources. IEEE Computer, 23(5):23–33, May 1990.

[14] A.S. Tanenbaum, M.F. Kaashoek, and H.E. Bal. Parallel programming using shared objects and broadcasting. IEEE Computer, 25(8):10–19, August 1992.

[15] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G.J. Sharp, S.J. Mullender, A.J. Jansen, and G. van Rossum. Experiences with the Amoeba distributed operating system. Communications of the ACM, 33(12):46–63, December 1990.

[16] C.A. Thekkath, T.D. Nguyen, E. Moy, and E.D. Lazowska. Implementing network protocols at user level. In Proceedings of the SIGCOMM ’93 Symposium, September 1993.

[17] L. van Doorn and A.S. Tanenbaum. Using active messages to support shared objects. In Proceedings of the 6th SIGOPS European Workshop, Wadern, Germany, September 1994. ACM SIGOPS.

[18] R. van Renesse, K. Birman, R. Cooper, B. Glade, and P. Stephenson. The Horus system. In K.P. Birman and R. van Renesse, editors, Reliable Distributed Computing with the Isis Toolkit, pages 133–147. IEEE Computer Society Press, September 1993.
