Proceedings of the 29th Annual Hawaii International Conference on System Sciences - 1996
Friendly and Efficient Message Handling*

R. A. F. Bhoedjang
K. G. Langendoen

Dept. of Mathematics and Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
*This research is supported by a PIONIER grant from the Netherlands Organization for Scientific Research (N.W.O.).

Abstract

Since communication software spends a significant amount of time on handling incoming messages, it is desirable that message handlers avoid expensive context switches on frequently executed paths. High-performance Active Message systems demand that handlers run to completion without blocking. Unfortunately, disallowing all blocking in handlers makes it hard to integrate them into large, preemption-based systems, because each potentially blocking action, including library calls, must be rewritten. We have implemented a portable, hybrid upcall mechanism that is easier to use than Active Messages yet avoids unnecessary thread switching. The key idea is that message handlers are only allowed to block on locks protecting shared data. Inside message handlers, blocking on synchronous communication and condition variables is not allowed. This restriction allows most messages to be processed without unnecessary thread switching on the critical path. When a message handler has to suspend its work, it explicitly creates a continuation. We have added hybrid upcalls to the runtime system of Orca, an object-based Distributed Shared Memory system. By removing a thread switch from the critical path, remote object invocation latencies dropped by 300 µs. By building application-specific continuations rather than blocking a thread, we significantly reduced memory consumption by the Orca RTS. Finally, fewer undesired thread preemptions occur, because most messages are handled by a single thread.

1 Introduction

Since communication software spends a significant amount of time on handling incoming messages, it is desirable that message handlers avoid expensive context switches on frequently executed paths. High-performance Active Message systems demand that handlers run to completion without blocking and thus provide an efficient and usable processing model for small and simple actions. However, since they disallow all blocking, Active Messages are less suited for complex message processing. Each potentially blocking action that a message handler might perform, including library calls, must be rewritten, which makes it hard to integrate Active Messages into large, preemption-based systems. Other approaches, like popup threads, put no restrictions on the actions that are allowed in a message handler. Our experience with the Orca [2] parallel programming language, however, indicates that popup threads introduce memory and thread scheduling overheads. For example, while porting Orca to the CM-5 we found that remote object invocation was an order of magnitude slower than Active Messages [19]. Much of this overhead could be attributed to context switches to threads in the runtime system (RTS) while processing a message. Although thread switching on the CM-5 suffers from the SPARC-specific register windows [15], it is unlikely that thread switching will soon become cheap on stock hardware, because of architectural features such as deep pipelines and large register files.

To improve the performance of the Orca RTS, we implemented an upcall model that allows most messages to be processed by a single thread. This model, which we call hybrid upcalls, avoids unnecessary thread switching in the common case by putting some restrictions on the programming model. Inside message handlers, blocking on synchronous communication and condition variables is not allowed. To deal with cases where such blocking is needed, we explicitly build continuations. In contrast with Active Messages, we do allow message handlers to block on locks that protect critical sections. We believe this makes hybrid upcalls easier to use than Active Messages.

Measurements on 50 MHz SPARCClassic clones connected by 10 Mbps Ethernet show that using hybrid upcalls in the Orca RTS has several beneficial effects. First, by removing a thread switch from the critical path, remote object invocation latencies dropped by 300 µs. Second, by building application-specific continuations rather than blocking a thread, we significantly reduced memory consumption by the Orca RTS. Third, fewer undesired thread preemptions occur,
because most messages are handled by a single thread. In our current hardware setting (Ethernet), the effect on the execution times of applications is small, partly because most of our applications are coarse-grained and partly because our base communication times are large relative to the cost of a thread switch. On emerging networks, however, message latencies and thread switching costs are of the same magnitude. Our first experiments with Myrinet [4] confirm that thread switches form a significant bottleneck.

This paper is organized as follows. Section 2 motivates our work: it describes the Orca parallel programming language and points out efficiency problems in the implementation of the Orca RTS. In Section 3, we explain how hybrid upcalls and continuations can be used to reduce message processing overheads. Section 4 compares the performance of the old and the improved Orca RTS. Section 5 discusses related work and Section 6 concludes.
2 Orca

Orca [2] is a high-level, parallel programming language in which user-defined processes communicate and synchronize through operations on passive, shared objects. The programmer is responsible for process placement, but object management is left to the compiler and runtime system. Using hints from the compiler and dynamic information about object access patterns, the runtime system decides where to place objects and how to replicate them. All objects are either replicated on all processors that can reference them or stored on a single processor.

To minimize the effort involved in porting Orca to different parallel machines, the RTS does not run directly on the native hardware and operating system (OS), but is built on top of a virtual machine, which we call Panda [3]. Orca programs are compiled to ANSI C and linked with two libraries, the Orca RTS and Panda. Below, we describe both Panda and the Orca RTS in more detail.

2.1 Panda

Panda is a portable communication layer that provides threads, messages, Remote Procedure Call (RPC), and group communication to the Orca RTS. Panda uses native communication primitives to send and receive messages. Using Panda, we have ported the Orca system to several systems (Amoeba, Solaris, HORUS, Parix, and Fast Messages) and machines (SPARCs, CM-5, Parsytec GCel and PowerXplorer, and CS-2). In this article, we focus on the Amoeba configuration of Panda, which runs on SPARCs connected by Ethernet. This Panda implementation uses Amoeba's unreliable datagram protocol, FLIP [12], to implement RPC and group communication.

When Panda sends a message, it hands it to FLIP. A small part of the message (less than 100 bytes) is copied into the kernel; the remaining data is copied straight to the Ethernet board (using DMA). At the receiving side, incoming Ethernet packets are copied into the kernel, where FLIP assembles them. Panda contains an upcall thread that waits for incoming messages. When FLIP has assembled a complete fragment, it copies it to the message buffer provided by Panda's upcall thread. The upcall thread is then unblocked and starts processing the message. Thus, in the best case, two copies are needed to receive the data in a Panda message. For large, fragmented messages, however, more copying may occur.

2.2 The Orca runtime system

The most important task of the Orca RTS is to manage shared objects used for high-level communication between Orca processes. To minimize communication overhead, the RTS employs object migration and replication. Operations on an object go through the RTS, possibly resulting in communication if the object is located on another processor or replicated on multiple processors. Consequently, the RTS has a significant influence on application performance.

Each Orca process is implemented as a Panda thread. Read and write operations on remote, non-replicated objects are implemented with Panda RPC, while write operations on replicated objects use Panda's totally ordered group communication. Since totally ordered group communication delivers all messages in the same order at all processors, it greatly simplifies the task of keeping replicated copies consistent [2, 11]. Reads on replicated objects are always performed locally, without communication.

An important property of Orca operations is that they sometimes block on a boolean condition (guard) associated with the object. For example, a programmer may specify that a dequeue operation on a shared queue object blocks when the queue is empty. If the queue object is accessed with an RPC, then the guard may cause the upcall that processes the RPC to block. If the dequeue operation were executed completely by Panda's single upcall thread, we could get into a deadlock, because no more incoming messages are processed until the guard becomes true.

To avoid such deadlocks, the portable Orca implementation was originally structured as shown in Figure 1. Each Panda upcall to the RTS stores a pointer to the received message in either the group or the RPC buffer. RTS-level server threads extract the messages from the buffers, unmarshal them, and process them further. In contrast with Panda's upcall thread, these RTS threads do not listen to the network and may freely block. To retain the total ordering of group messages, we use only one thread to serve the group buffer. Unlike the threads that serve the RPC buffer, this thread does not block on guards; instead it queues messages when guards fail. In some other cases, however, this thread can block;
during Orca process creation, for example, this thread uses RPC to fetch copies of shared objects.

[Figure 1: The structure of the Orca implementation. Orca processes run on top of the RTS, which in turn runs on top of Panda.]

2.3 Efficiency problems

Experience with Orca on various platforms has shown that the system structure outlined in Figure 1 is not optimal, for the following reasons:

Long call chains. Layering for portability costs additional function calls when communicating. The resulting long call chains have a significant impact on performance for SPARC-based systems because of the register window overflow and underflow traps. Even though the C compiler inlines functions, the modular software structure requires two or three function calls per layer.

Blocking threads. Orca operations sometimes block, e.g., when waiting for a barrier. If many operations block simultaneously, the RTS is forced to create additional threads, each with its own stack. While porting Orca to a 512-node Transputer machine, we found that this can lead to excessive memory consumption. Moreover, as we discuss in Section 4, having many threads can lead to scheduling anomalies that affect performance.

Thread switching overhead. The major source of overhead is thread switching in the RTS when processing operation invocations. Each incoming message from the Panda layer is handed to an RTS server thread through a FIFO queue. On systems with a low-latency network and fast access to the network, such a thread switch in the critical path of a remote object invocation accounts for a large part of the invocation latency. On the CM-5, for example, we measured that this thread switch accounts for more than 30% of the Panda RPC latency.

At present, thread switching overhead has the largest impact on performance. Unfortunately, architectural trends like deep pipelines and large register files indicate that thread switches on commodity processors will remain expensive in the near future. Therefore, the only way to significantly improve the performance of Orca, without giving up portability, is to avoid thread switches whenever possible.

3 Hybrid upcalls

To reduce the number of thread switches, we have implemented an Orca RTS that uses a restricted upcall model. This section describes the model and relates it to other upcall models.

3.1 Existing upcall models

Various systems (e.g., the x-kernel [10] and HORUS [17]) use popup threads to handle incoming messages. Popup threads have the benefit that they are first-class threads that are allowed to do anything that other threads can do. In particular, they may block for any amount of time: they may contend for locks, wait on condition variables, and perform synchronous communication. Managing multiple popup threads, however, introduces additional synchronization. The Orca RTS as shown in Figure 1 implements popup threads (the RTS server threads) on top of Panda. This adds an expensive thread switch to the critical path of all message handlers. Also, if many popup threads block, the system is forced to create a large number of threads, each with its own stack.

Active Messages (AMs) are at the other end of the spectrum. Active Messages are usually received through polling but can also be delivered asynchronously. With polling, AM handlers execute on the stack of the polling thread. With interrupts, they execute on the stack of some runnable thread. In both cases, if an AM handler blocks, then the stack it runs on can no longer be used by the thread that owns that stack, which can lead to deadlock. For this reason, AM handlers are restricted to performing small, simple, nonblocking actions (e.g., reading or storing a word). If one wants to perform more complicated tasks, like executing a user-defined Orca operation, then the likelihood of blocking increases. For example, with interrupt-driven Active Messages, it is necessary to use locks that protect shared data. Or, in the case of Orca, operations can block on boolean conditions (guards). Since blocking is disallowed in AM handlers, it is necessary to write or generate exception code for all potentially blocking actions. Especially in the case of locks, which may be hidden in thread-safe library code, this is awkward, unless one uses compiler support [9]. Rewriting the Orca RTS to Active Message code is a major undertaking, since exception code must be added at each call to lock a mutex; identifying those calls is difficult in itself when using libraries that include thread-safe functions like malloc.
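To make the contrast concrete, the sketch below shows the kind of exception path an Active Message handler needs around a lock, and what a handler that is allowed to block may do instead. This is an illustrative sketch, not code from either system; apply_operation and defer_request are hypothetical helpers.

    #include <pthread.h>

    void apply_operation(void *request);   /* hypothetical */
    void defer_request(void *request);     /* hypothetical */

    static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Active Message style: the handler may not block, so a busy lock
     * forces an explicit exception path that retries the work later. */
    void am_handler(void *request)
    {
        if (pthread_mutex_trylock(&obj_lock) != 0) {
            defer_request(request);        /* cannot wait inside the handler */
            return;
        }
        apply_operation(request);
        pthread_mutex_unlock(&obj_lock);
    }

    /* Hybrid upcall style: the upcall runs on a first-class thread and may
     * simply block on the lock, which a local thread will release soon. */
    void hybrid_upcall(void *request)
    {
        pthread_mutex_lock(&obj_lock);
        apply_operation(request);
        pthread_mutex_unlock(&obj_lock);
    }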
3.2 Hybrid upcalls

The hybrid upcall model strikes a middle ground between popup threads and Active Messages. Hybrid upcalls are easier to use than Active Messages yet more efficient than popup threads. The underlying premise of hybrid upcalls is that all messages are received and processed by a single thread. If this thread blocks, then no other messages will be processed until the thread continues. Consequently, if the thread can only be unblocked by another message, then deadlock ensues.

Note that blocking the upcall by itself does not cause deadlock. We assume that the upcall is executed by a first-class thread with its own stack; this is where hybrid upcalls diverge from Active Messages. Thus, upcalls can block like any other thread. Mutual exclusion synchronization in an upcall is therefore perfectly legal. The upcall may block on a lock, but we know that this lock is held by a local thread and will be released soon.¹ It is when the upcall thread can only be unblocked by another message that deadlock occurs. It is the programmer's responsibility to avoid such deadlocks. In practice, the programmer must follow three rules:

- No synchronous communication must take place within the context of an upcall.
- No condition synchronization is allowed within the context of an upcall.
- Upcalls must terminate quickly.

Synchronous communication is not allowed for two reasons. First, one cannot generally know when a matching communication action will be issued: some types of synchronous communication (e.g., RPC) can thus cause an upcall to block for an arbitrary amount of time. Second, since the upcall would be blocked, the matching message cannot be received without using another thread.

Condition synchronization can lead to similar problems. In general, the waiting time can be unbounded, and the condition can be such that it can only be satisfied by processing another message. For example, consider fetching a job from a job queue at a remote processor. The upcall that is executed in response to the fetch-job request is blocked when it finds the queue empty; it remains blocked until a new job is added to the queue.

The rationale behind the three rules should be clear: applying the first two rules avoids situations in which the upcall thread has to block and wait for another message. Violating the third rule can lead to bad performance, even though we assume that some form of flow control will stall senders when they send data faster than the upcall thread can process it. When a network is not drained for a long time, messages may be dropped and other processes may be delayed.

¹ We assume that locks are only used to protect critical sections, i.e., they should not be misused for condition synchronization.

struct cont {
    char          c_buf[CONT_SIZE];
    cont_queue_t *c_queue;
    int         (*c_func)(void *state);
    struct cont  *c_next;
};

Figure 2: The continuation structure.

cont_init(cont_queue, lock)
cont_clear(cont_queue)
state_ptr = cont_alloc(cont_queue, size, cont_func)
cont_save(state_ptr)
cont_resume(cont_queue)

Figure 3: The continuations interface.
3.3 Handling blocking upcalls with continuations

The hybrid upcall model forbids blocking on condition variables (or similar synchronization primitives) and synchronous communication within upcalls. In the Orca runtime system, however, condition synchronization occurs frequently. For example, Orca operations may block on a guard; for remote objects that are accessed with RPCs, this results in a blocked upcall. Synchronous communication, on the other hand, happens only when processes are created or when the RTS decides to migrate an object. Compared to the number of operations performed by most Orca programs, these are rare events.

Using continuations as in [5], we have been able to remove all but one RTS server thread from the Orca RTS. Most messages are now processed entirely by Panda's upcall thread, without thread switching. The upcall thread neither blocks on condition variables nor performs synchronous communication, but creates continuations instead.

Continuations provide a general and flexible mechanism to handle blocking explicitly. A continuation is a data structure that holds user-defined state and a pointer to a user-supplied continuation function (see Figure 2). Instead of calling a blocking primitive, an upcall creates a continuation that describes how its work should later be continued. The continuation is stored on a continuation queue. When a logically blocked upcall should be resumed, the continuation function is called with the saved state as an argument. This function inspects its state and either completes the work or re-queues the continuation; in both cases it returns normally. Interestingly, queues of continuations resemble condition variables, which are essentially queues of thread descriptors. We found that this similarity makes replacing condition variables with continuations quite easy.

Figure 3 shows the interface to the continuation mechanism. Cont_init initializes a continuation queue; initially, the queue is empty. Like a condition variable, each continuation queue has an associated mutex (lock) that ensures that accesses to the queue are atomic. Cont_clear is used to destroy a continuation queue. Cont_alloc heap-allocates a continuation structure (see Figure 2) and associates it with a continuation queue. Continuations cannot be allocated on the stack, because we reuse the stack. Cont_alloc initializes the c_queue and c_func fields. C_queue refers to the associated continuation queue; c_func is the continuation function. Cont_alloc returns a pointer to a buffer c_buf of at least size bytes in which the client saves its state. (Our implementation uses small, fixed-size buffers for fast allocation.) After saving its state, the client calls cont_save, which appends the continuation to the queue. Together, cont_alloc and cont_save correspond to a wait operation on a condition variable. Note that in the case of condition variables, the system knows what state to save (the client's thread stack), so no separate allocation call is needed. Cont_resume resembles a broadcast on a condition variable; it traverses the queue and resumes all continuations.
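As an illustration of how this interface can be used, consider a guarded dequeue handled in an upcall. This is a minimal sketch, not code from the Orca RTS; the request and shared-queue types and the queue helpers are hypothetical, the cont_* prototypes are assumed to be those of Figure 3, and we assume cont_save on the state pointer re-queues the continuation it belongs to.

    typedef struct request request_t;        /* hypothetical request handle */

    typedef struct {
        void         *items;                 /* hypothetical item storage */
        cont_queue_t  blocked;               /* continuations of failed guards */
    } shared_queue_t;

    /* Hypothetical helpers for the queue contents and for sending a reply. */
    int   queue_empty(shared_queue_t *q);
    void *queue_get(shared_queue_t *q);
    void  queue_put(shared_queue_t *q, void *item);
    void  reply_with_item(request_t *req, void *item);
    void *request_item(request_t *req);

    struct dequeue_state {                   /* saved in the continuation's c_buf */
        request_t      *req;
        shared_queue_t *q;
    };

    /* Continuation function: invoked by cont_resume with the saved state. */
    static int retry_dequeue(void *state)
    {
        struct dequeue_state *s = state;

        if (queue_empty(s->q))
            cont_save(s);                    /* guard still false: re-queue */
        else
            reply_with_item(s->req, queue_get(s->q));   /* complete the work */
        return 0;                            /* return normally in both cases */
    }

    /* Upcall for a remote dequeue: never blocks on the guard. */
    void handle_dequeue(request_t *req, shared_queue_t *q)
    {
        if (queue_empty(q)) {
            struct dequeue_state *s =
                cont_alloc(&q->blocked, sizeof *s, retry_dequeue);
            s->req = req;
            s->q   = q;
            cont_save(s);                    /* the "wait": park the request */
        } else {
            reply_with_item(req, queue_get(q));
        }
    }

    /* Upcall (or local write) that makes the guard true. */
    void handle_enqueue(request_t *req, shared_queue_t *q)
    {
        queue_put(q, request_item(req));
        cont_resume(&q->blocked);            /* the "broadcast": retry blocked operations */
    }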
3.4 Hybrid upcalls in the Orca runtime system

We restructured the Orca RTS to use hybrid upcalls; the new structure is shown in Figure 4. The new RTS contains only one RTS server thread (not shown), which is used only in exceptional cases. In the common case, Panda's upcall thread processes all messages (i.e., operations on shared objects). The upcall thread must follow the rules in Section 3.2 to avoid long-term blocking. To achieve this, we modified the RTS in two ways.

[Figure 4: Orca implementation with hybrid upcalls. Orca processes run on the RTS, which runs on Panda on top of the OS (Amoeba).]

First, we replaced condition variables that were used in upcalls with continuation queues. (This occurred four times.) An important example is blocking on a guarded operation, which is now implemented by creating a continuation that describes the operation. Each shared object has a continuation queue on which blocked operations can be placed. When the object is modified, the writing thread invokes cont_resume to re-evaluate the guards of blocked operations.

Second, continuations are used to deal with the problem of synchronous communication in the context of an upcall. To solve this problem, one must either switch to using asynchronous communication primitives or revert to using an extra thread. Since synchronous communication in upcalls occurs infrequently, we accept the costs of using a single extra thread that is allowed to block while calling the synchronous communication primitive. When an upcall needs synchronous communication, it creates a continuation that describes the communication to be performed and stores it on a continuation queue. It then awakens the extra thread, which fetches and executes continuations from the queue. We only need one extra thread, since we know that no unbounded blocking will occur when the nested communication is processed at its destination.
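One plausible way to organize this extra thread is as a small server loop. The following is again a sketch, not Orca RTS code; the comm_request type, its fields, and the condition-variable handshake are assumptions about how the queue and the wake-up described above might be realized.

    #include <pthread.h>

    typedef struct comm_request {          /* hypothetical description of a */
        struct comm_request *next;         /* synchronous call to perform */
        void (*perform)(void *arg);        /* e.g., wraps a blocking RPC */
        void *arg;
    } comm_request;

    static comm_request    *pending;       /* queued by upcalls */
    static pthread_mutex_t  comm_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t   comm_wake = PTHREAD_COND_INITIALIZER;

    /* Called from an upcall: never blocks on communication itself; it only
     * records the work to be done and wakes the dedicated thread. */
    void comm_defer(comm_request *r)
    {
        pthread_mutex_lock(&comm_lock);
        r->next = pending;
        pending = r;
        pthread_mutex_unlock(&comm_lock);
        pthread_cond_signal(&comm_wake);
    }

    /* The single extra thread: the only thread allowed to block in a
     * synchronous communication primitive. */
    void *comm_thread(void *unused)
    {
        (void)unused;
        for (;;) {
            comm_request *r;

            pthread_mutex_lock(&comm_lock);
            while (pending == NULL)
                pthread_cond_wait(&comm_wake, &comm_lock);
            r = pending;
            pending = r->next;
            pthread_mutex_unlock(&comm_lock);

            r->perform(r->arg);            /* may block, e.g., on an RPC */
        }
        return NULL;
    }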
4 Performance

In this section we show that combining hybrid upcalls with the use of continuations has four beneficial effects on the performance of the Orca RTS:

1. We have removed a thread switch from the critical path of all remote Orca operations. For simple operations this almost halves the RTS overhead.
2. Fewer undesired thread preemptions occur, because all messages are handled by a single thread.
3. The RTS does not waste memory in the form of blocked RTS server threads.
4. By using continuations instead of blocked threads, guards of blocked operations can be re-evaluated without thread switching.

4.1 Experimental environment

We measured the performance of two Orca runtime systems. Both RTSs use Panda 2.0 running on Amoeba [16]. RTS-threads follows the structure of Figure 1 and uses server threads, whereas RTS-cont follows the structure of Figure 4 and uses continuations. For both systems, we present the results of two benchmark programs and two parallel applications.

All measurements were done on 50 MHz SPARCClassic clones, each equipped with 32 Mbyte of memory and on-chip, direct-mapped caches (2 Kbyte data, 4 Kbyte instruction). The processor boards are interconnected by 10 Mbit/s Ethernet segments. Each segment connects 8 boards; 10 such segments are linked to one Kalpana Ethernet switch.

Running the Panda portability layer on top of Amoeba introduces some overhead. Table 1 summarizes the base performance of Panda on Amoeba and of native Amoeba. Communication on Panda is slower, because Panda runs its protocols in user space; the difference in performance is caused by extra thread switching, more user-kernel crossings, and more address translations in the Panda implementation. However, the detailed comparison of the Panda user-level protocols and Amoeba kernel-level protocols given by Oey et al. [14] shows that typical Orca applications are hardly affected by this overhead because of their coarse-grained nature.

Table 1: Base performance.

                    Amoeba    Panda
  null RPC          1.20 ms   1.58 ms
  null group msg.   1.31 ms   1.62 ms
  thread switch     145 µs    150 µs

4.2 Remote object invocation

Table 2 gives the results of two object invocation benchmarks that measure Orca-level base communication costs. The times listed are averages over 10,000 invocations. Orca initialization, process creation, and object placement are not included in the measurements. In both benchmarks, processor 1 executes 10,000 increment operations on a shared integer object. Roi (RPC object invocation) is a 2-processor benchmark that places the object on processor 2. Each increment is performed by means of a Panda RPC from processor 1 to processor 2. Goi (group object invocation) is a 3-processor benchmark that replicates the integer object on processors 1 and 2. Each increment results in a group message to processors 1 and 2. Processor 3 is a dedicated sequencer [11] which orders all group messages.

Table 2: Object invocation costs.

         RTS-threads   RTS-cont   improvement
  roi    2.19 ms       1.90 ms    13%
  goi    2.21 ms       1.94 ms    12%

Table 2 shows that RTS-cont reduces the latency of simple Orca object invocations by up to 13%. The differences between RTS-threads and RTS-cont are mainly due to the extra thread switch in RTS-threads. Note, however, that both for roi and goi the difference (290 and 270 µs) is larger than the cost of a thread switch (150 µs, see Table 1). The reason is that the thread switch from Panda's upcall thread to an RTS server thread does not take place immediately. Due to its high priority, the upcall thread continues to run even after it has signaled the RTS thread. Additionally, some time is lost due to queuing overhead. In RTS-threads, the upcall thread and the pool threads communicate through a shared job queue. The upcall thread must copy a request descriptor (4 to 12 bytes) into the job queue.

A closer look at the work performed by the RTSs during these benchmarks reveals that the extra thread switch in RTS-threads accounts for a large part of the RTS processing overhead at the receiving processors. The measurements in Tables 1 and 2 show that RTS-cont adds approximately 320 µs to the latency of a Panda null RPC (1.90 ms versus 1.58 ms). Most of this overhead can be attributed to the items in Table 3. The timings were obtained with a memory-mapped timer with a 0.5 µs granularity. Reading this timer costs approximately 1.5 µs. As all items in Table 3 are also performed by RTS-threads, we consider the timings to be representative for both RTSs. All items are discussed below, where we describe in detail how the RTS executes a remote object invocation in the roi benchmark. For brevity we do not present a similar analysis of goi.

Table 3: Breakdown of RTS overhead.

                                     µs
  Client RTS overhead
    local write attempt               7
    marshaling request               40
    adding request time stamp         7
    reply time stamp check            9
    unmarshaling reply               24
  Server RTS overhead
    request time stamp check         16
    unmarshaling + object lookup     36
    operation execution              37
    marshaling reply                 13
    adding reply time stamp           -
  Network overhead                    -

At the start of each operation invocation, code generated by the Orca compiler checks whether the operation can be executed without going through all of the RTS. In the case of roi, this local write attempt always fails, because no local copy of the object is available. Therefore, the RTS is invoked. The RTS marshals an operation request and sends it to the processor holding the object. To maintain consistency, requests and replies are tagged with a time stamp (a sequence number) that is checked by the receiving side [7]. Messages that arrive too early are queued. In the simple benchmarks roi and goi this never occurs.

At the server side, the receiving RTS checks the request time stamp, unmarshals the rest of the request, looks up the integer object, and performs the operation. The increment operation does not block, so a reply message is built and tagged with a time stamp.

The final source of overhead is network overhead. The operation request contains 60 bytes of RTS data. This data contains the time stamp, an operation descriptor, and object
access statistics for object placement. The RTS maintains such statistics so that it can dynamically revise its object placement and replication decisions. Any actual parameters would also have been marshaled into the request and would have added more bytes. Similarly, each reply contains at least 16 bytes, comprising a time stamp, a return status, and the size of any return values. Each extra byte adds approximately 1 µs to the network latency of a message.

Note that the times in Table 3 do not sum to the measured overhead of 320 µs. We have statically verified that parts of the Orca RTS and Panda clash in the (direct-mapped) instruction cache while executing code that is on the critical path. Also, with roi, the client is deeper in its call stack when invoking Panda's RPC call than our RPC benchmark program. This results in additional register window underflows when the client receives the reply. Timings performed inside the Amoeba kernel show that these two effects account for about 40 µs, which completes the cost breakdown.

At the server side, RTS-cont takes 107 µs to process the increment operation (in the RTS). The extra thread switch in RTS-threads adds no less than 290 µs. Comparing the total RTS overhead, we find that RTS-threads adds 2.19 - 1.58 = 0.61 ms overhead to a Panda null RPC, whereas RTS-cont adds only 0.32 ms (see Tables 1 and 2). Removing the extra thread switch has thus almost halved the total RTS overhead.
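A rough estimate can be derived from the byte counts and the per-byte cost quoted above (the figure below is simple arithmetic, not a separately reported measurement): the RTS data alone adds about

    (60 + 16) bytes x 1 µs/byte ≈ 76 µs

of network latency, which, together with the client-side items of Table 3, the 107 µs of server-side RTS processing, and the roughly 40 µs of cache and register-window effects, accounts for essentially all of the 320 µs that RTS-cont adds to a Panda null RPC.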
4.3 Barrier synchronization

The previous section reported on the performance of non-synchronizing operations. Here we present the results of a performance test in which processors repeatedly synchronize through a barrier object. Each processor executes two potentially blocking operations on the barrier object, one to signal its presence and one to await the arrival of the last processor. The await operation clears the barrier for the next iteration; the signal operation blocks until the barrier has been cleared. No work is done between successive arrivals at the barrier. We present results for the case in which the barrier object is placed on one processor and the case in which it is replicated across all processors. The non-replicated barrier object is placed on a separate processor that does not participate in the barrier. The reported times were obtained by executing 1000 barrier synchronizations in a loop (i.e., 2000 operations per participating process) and dividing the elapsed time by 1000. Orca startup, process creation, and object placement are not included in these timings.

[Figure 5: Performance of a non-replicated barrier. RTS-threads and RTS-cont, plotted against the number of processors (2-64).]

The results for the non-replicated barrier are depicted in Figure 5. Both for RTS-cont and for RTS-threads the time per iteration increases linearly with the number of processors (P). This is as expected, since the processor holding the barrier receives and processes 2P RPCs per iteration. We found that RTS-threads incurs approximately three more thread switches per RPC serviced than RTS-cont. One thread switch is needed to hand the request from the upcall thread to the server thread. Another thread switch occurs when new messages arrive while an RTS server thread processes a previous message. The new message activates the (high-priority) upcall thread, thus preempting the server thread. The third thread switch is caused by Amoeba's round-robin scheduling policy. After it has preempted an active server thread, the upcall thread queues the new message, signals a sleeping server thread, and waits for the next message. Now Amoeba runs the signaled server thread rather than the preempted server thread. When this thread tries to access the barrier object it immediately blocks, because the preempted server thread still has the object locked.

[Figure 6: Performance of a replicated barrier. RTS-threads and RTS-cont, plotted against the number of processors (2-64).]

The results for a replicated barrier are shown in Figure 6. In this case, the advantage of RTS-cont is much smaller. The
overhead of RTS-threads is again linear in the number of processors, but the overhead per operation (100 µs) is less than the time of a thread switch (150 µs). Due to the additional server thread in RTS-threads, we expected an overhead of at least one thread switch per message. However, since all processes are informed at the same time about the arrival of the last process, they all start the next operation at the same time, which causes a burst of messages. The upcall thread can process messages immediately after one another, so it queues multiple messages before the server thread is scheduled. In the best case, all messages of an iteration are queued before the server thread is run; then we only switch once to the server thread, which can then process all messages. The sequencer, however, which orders all messages, introduces a delay per message that is close to the time the upcall thread needs to queue a message, so the combining effect is rather small.
4.4 Application-level performance

Most Orca applications are medium- to coarse-grained and communicate infrequently. We expect that Orca's improved communication performance will have little effect on the execution time of such applications. Fine-grained programs, however, should benefit from the improved latency of Orca operations, especially on emerging low-latency communication systems [18, 4].

4.4.1 Retrograde analysis. To verify our expectations we measured the performance of a realistic application, parallel Retrograde Analysis (RA). The Orca program awari applies a parallel RA algorithm to Awari, a two-player board game [1]. In contrast with top-down search techniques like alpha-beta search, RA searches bottom-up by making unmoves, starting with the end positions of Awari. Awari creates an endgame database DB_n, which can be used by a game-playing program. Subscript n refers to the maximum number of stones that are left on the board; DB_n contains game-theoretical values for all boards that have at most n stones left on the board. These game-theoretical values represent the outcome of the game in the case that both players make best moves only.

When building DB_n, each processor allocates space for its part of the database.² The parents of a board are the boards that can be reached by applying a legal unmove to the board. Whenever a processor updates a board's game-theoretical value it must also update all parents. Since the boards are randomly distributed across all processors, it is likely that a parent is located on another processor, so a single update may result in several remote update operations. To avoid excessive communication overhead, remote updates are delayed and stored in a queue for their destination processor. As soon as a reasonable number of updates has accumulated, they are transferred to their destination with a single RPC. Although this strategy greatly improves performance, the program remains communication-intensive.

[Figure 7: Performance of Awari for DB_10. RTS-threads and RTS-cont, plotted against the number of processors (2-64).]

Figure 7 shows that the difference between RTS-threads and RTS-cont is initially small, but increases with the number of processors and the amount of communication. Unfortunately, the gains do not become visible until the efficiency of awari begins to drop; at that point overall performance cannot be improved significantly by adding more processors.

² For n > 16, DB_n no longer fits into the memory of a single processor and must be distributed.

4.4.2 The Traveling Salesman Problem. This Orca program, tsp14, uses a branch-and-bound algorithm to find the shortest route that visits 14 cities exactly once. The
program generates 1716 jobs and uses two objects: a (centralized) job queue to distribute the jobs to other processors and a (replicated) integer that holds the value of the best route found thus far. The global integer bound is mostly read and updated infrequently.

[Figure 8: Performance of TSP for 14 cities. RTS-threads and RTS-cont, plotted against the number of processors (2-64).]

RTS-cont performs much better than RTS-threads (see Figure 8). With 32 processors and at the processor holding the job queue, RTS-threads incurs approximately 10,000 more thread switches than RTS-cont. Since tsp14 mainly communicates when fetching one of the 1716 jobs, this number (10,000) is unexpectedly large. We believe it results from the way RTS-threads handles blocked operations. When an RTS server thread fetches a job on behalf of a remote Orca process and finds the job queue empty, then RTS-threads blocks this thread. When a job is added to the queue, all blocked server threads are woken up to retry their fetch operations. Since the server threads run at a high priority, they are all run before the Orca thread can generate more jobs. As a result, only one server thread finds a job; all others go to sleep again, until the next job arrives.

With RTS-cont, in contrast, there are no blocked server threads at the queue, just passive continuations (see Section 3.3). Each failed request for work results in a continuation that is queued with the job queue. After an Orca thread has added a job, it calls cont_resume and resumes the pending continuations without thread switching.

The performance of RTS-threads for tsp14 can be improved by lowering the priority of the server threads, so the Orca thread can continue generating jobs. This solution, however, will certainly decrease the performance of other applications, because Orca threads are no longer preempted when operations from outside arrive. RTS-cont really benefits from its reduced resource consumption. For example, having only one thread handle incoming messages avoids the risk of scheduling anomalies as with RTS-threads.
5 Related work

By exposing the hardware's messaging capabilities, Active Messages [19] achieve very low latencies. To make these low latencies available to Orca programs, we would have to process all incoming messages in an AM handler. AM handlers, however, are awkward to use because they are not allowed to block. Our upcalls are allowed to block on locks and are therefore easier to program with. As a consequence, however, our upcalls cannot be implemented as an AM handler, even when AMs are available. (AMs are not supported on all platforms that Orca runs on.)

Draves and Bershad used continuations inside an operating system kernel to speed up thread switching [5]. Instead of blocking a kernel thread, a continuation is created and the same stack is used to run the next thread. We use the same technique in user space for the same reasons: to reduce thread switching overhead and memory consumption. Note, however, that we use continuations in the context of upcalls and create them through a condition-variable-like interface.

Whereas our approach has been to avoid expensive thread switches by hand, Optimistic Active Messages [9] (OAMs), Lazy Task Creation [13], and Lazy Threads [8] all rely on compiler support. OAMs transform an AM handler that runs into a locked mutex into a true thread. The overhead of creating a thread is thus only paid when the lock operation fails. On the CM-5, OAMs reduced the latency of Orca object invocation by an order of magnitude. In contrast with our approach, OAMs use compiler support and require that all locks be known in advance, which makes it hard to use them in conjunction with thread-safe libraries.

Lazy Task Creation and Lazy Threads allow the user to create many threads. Both use compiler support to inline threads, optimistically assuming that a newly forked thread can run to completion on the stack of its parent. These techniques allow any task to be inlined and rely on the compiler to take appropriate action when blocking occurs. Saving the state of a blocked thread in a small buffer, like we do, is hard in the general case. Instead, before inlining a new task on the parent's stack, Lazy Task Creation and Lazy Threads save enough state on the stack to be able to resume the parent when the child blocks. Our hand-crafted continuations, in contrast, depend heavily on the fact that the state of an upcall thread is well-defined when an upcall may have to block for a long time. This makes it easier to save state by hand.

In their work on Shared Filaments [6], Engler et al. use three types of very lightweight threads (filaments). Filaments only synchronize through locks, barriers, and by joining forked threads. This allows the Filaments runtime to inline forked filaments without compiler support; filaments do not require an individual stack to run. Filaments allow inlining and locking, but provide no mechanism for condition synchronization. We use continuations to handle condition synchronization.
6 Conclusions

Thread switching overhead has been a major performance problem for the portable Orca RTS. Restructuring the RTS to avoid thread switches when handling incoming operations on shared objects led to the development of the hybrid upcall model, which combines properties of Active Messages and traditional popup threads. Unlike most approaches that attack thread switching overhead, hybrid upcalls are implemented in user space and do not require compiler support.

Using this restricted upcall model has been profitable. First, on Amoeba, we have almost halved the total RTS overhead for simple operations and reduced the latency of operation invocations by 300 µs (13%). Although this is a modest improvement, it is independent of the network technology we used. Our first experiments with Panda on
Myrinet [4] clearly show that thread switches on the critical path form a severe bottleneck. Second, memory usage has been reduced, because unbounded blocking is now done by saving and restoring small continuations instead of blocked upcall threads. Third, using continuations instead of blocked threads avoids thread switching when guards are re-evaluated. Finally, scheduling anomalies that result from having many runnable upcall threads have disappeared.
Acknowledgments

We thank Henri Bal, Saniya Ben Hassen, Rutger Hofman, Ceriel Jacobs, Tim Rühl, and the anonymous referees for their valuable comments on draft versions of this paper.
References
[1] H.E. Bal and V. Allis. Parallel Retrograde Analysis on a Distributed System. Technical Report IR-384, Dept. of Mathematics and Computer Science, Vrije Universiteit Amsterdam, April 1995.

[2] H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum. Orca: A Language for Parallel Programming of Distributed Systems. IEEE Transactions on Software Engineering, 18(3):190-205, March 1992.

[3] R.A.F. Bhoedjang, T. Rühl, R. Hofman, K. Langendoen, H.E. Bal, and M.F. Kaashoek. Panda: A Portable Platform to Support Parallel Programming Languages. In Proc. of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, pages 213-226, San Diego, September 1993.

[4] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W. Su. Myrinet: A Gigabit-per-second Local Area Network. IEEE Micro, 15(1):29-36, February 1995.

[5] R.P. Draves, B.N. Bershad, R.F. Rashid, and R.W. Dean. Using Continuations to Implement Thread Management and Communication in Operating Systems. In Proc. of the 13th ACM Symposium on Operating Systems Principles, pages 122-136. ACM SIGOPS, October 1991.

[6] D.E. Engler, G.R. Andrews, and D.K. Lowenthal. Filaments: Efficient Support for Fine-Grain Parallelism. Technical Report TR 93-13, Dept. of Computer Science, University of Arizona, April 1993.

[7] A. Fekete, M.F. Kaashoek, and N. Lynch. Implementing Sequentially Consistent Shared Objects using Broadcast and Point-To-Point Communication. In Proc. of the 15th International Conference on Distributed Computing Systems, Vancouver, May 1995.
[8] S.C. Goldstein, K.E. Schauser, and D. Culler. Lazy Threads, Stacklets, and Synchronizers: Enabling Primitives for Parallel Languages. Submitted for publication, University of California, Berkeley, November 1994.

[9] W.C. Hsieh, K.L. Johnson, M.F. Kaashoek, D.A. Wallach, and W.E. Weihl. Efficient Implementation of High-Level Languages on User-Level Communication Architectures. Technical Report MIT/LCS/TR-616, MIT, May 1994.

[10] N.C. Hutchinson and L.L. Peterson. The x-Kernel: An Architecture for Implementing Network Protocols. IEEE Transactions on Software Engineering, 17(1):64-76, January 1991.

[11] M.F. Kaashoek. Group Communication in Distributed Computer Systems. PhD thesis, Vrije Universiteit Amsterdam, 1992.

[12] M.F. Kaashoek, R. van Renesse, H. van Staveren, and A.S. Tanenbaum. FLIP: an Internet Protocol for Supporting Distributed Systems. ACM Transactions on Computer Systems, 11(1):73-106, January 1993.

[13] E. Mohr, D.A. Kranz, and R.H. Halstead. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264-280, July 1991.

[14] M. Oey, K. Langendoen, and H.E. Bal. Comparing Kernel-Space and User-Space Communication Protocols on Amoeba. In Proc. of the 15th International Conference on Distributed Computing Systems, Vancouver, May 1995.

[15] The SPARC Architecture Manual, Version 8, 1992.

[16] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G.J. Sharp, S.J. Mullender, A.J. Jansen, and G. van Rossum. Experiences with the Amoeba Distributed Operating System. Communications of the ACM, 33(12):46-63, December 1990.

[17] R. van Renesse, K. Birman, R. Cooper, B. Glade, and P. Stephenson. Reliable Multicast between Microkernels. In Proc. of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, pages 269-283, April 1992.

[18] T. von Eicken, A. Basu, and V. Buch. Low-Latency Communication over ATM Networks Using Active Messages. IEEE Micro, 15(1):46-53, February 1995.

[19] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. In Proc. of the 19th Annual International Symposium on Computer Architecture, pages 256-266, Gold Coast, May 1992.