Implementing Object-Based Distributed Shared Memory on Transputers*

Heinz-Peter Heinzle†    Henri E. Bal    Koen Langendoen

Dept. of Mathematics and Computer Science
Vrije Universiteit, Amsterdam, The Netherlands

* This research is partly supported by a PIONIER grant from the Dutch Organisation for Scientific Research (N.W.O.).
† On visit from the Technical University Graz, Austria, through the Erasmus exchange program.

Abstract

Object-based distributed shared memory systems allow processes on different machines to communicate through passive shared objects. This paper describes the implementation of such a system on a transputer grid. The system automatically takes care of placement and replication of objects. The main difficulty in implementing shared objects is updating replicated objects in a consistent way. We use totally-ordered group communication (broadcasting) for this purpose. We give four different algorithms for ordering broadcasts on a grid and study their performance. We also describe a portable runtime system for shared objects. Measurements for three parallel applications running on 128 T800 transputers show that good performance can be obtained.

Keywords: Distributed Shared Memory, shared objects, Orca, parallel programming languages, transputers, Parix, totally-ordered group communication, broadcasting.

1 Introduction

Distributed shared memory (DSM) is an attractive alternative to message passing for programming distributed-memory parallel machines. In contrast to message passing, DSM offers the programmer the illusion that all processors in the system have access to a shared memory. This model eases parallel programming, since it allows sharing of state information between processes on different processors, which need not be connected by a physical shared memory.

The DSM model is most popular in the distributed systems community, where collections of workstations connected by a Local Area Network (e.g., Ethernet) are used as parallel machines. Many DSM systems have been implemented [Bennett91, Bershad93, Fleisch89, Li89b], most of which use a page-based implementation. The address space is partitioned into fixed-size pages, which are moved or copied between local memories when needed. Accesses to pages that are not in local memory are trapped by the hardware Memory Management Unit (MMU) and are handled in software. Page-based DSM is also referred to as shared virtual memory, since it is similar in concept and implementation to virtual memory.

On transputer systems, however, DSM has received hardly any attention. With a few exceptions (e.g., [Raina94]), programming systems on transputers are based on message passing between processes that have only local memory [Burns88, Clarke91, Par93, Sunderam90]. There are good reasons for this choice. First, the T800 transputer does not provide the hardware MMU support required to implement shared virtual memory. Also, shared virtual memory moves fixed-size pages over the network: if a processor tries to read a single word, the entire page on which the word resides is fetched from a remote processor (unless the page is already in local memory). Sending a page of, say, 4 Kbytes over a network just to read one word is clearly inefficient on any system. On a transputer, this problem is even more severe than on an Ethernet, because (unlike an Ethernet) transputer networks can handle small messages very efficiently.

We advocate the use of a DSM system that is not page-based but object-based. An object-based DSM deals with user-defined objects rather than system-defined pages. An object is an instance of an abstract data type. It can contain any data structure, from a single integer to a complex tree or graph. Both its internal representation and the operations used to access it are defined by the user. An object-based DSM allows objects to be shared by processes on different machines. It does not simulate physical shared memory, but provides a new object-based programming model. The implementation of shared objects is done entirely in software, without any need for hardware MMU support. The system will only move the data that are actually needed, so it can be more efficient than page-based DSM. Also, object-based DSM does not suffer from the problem of false sharing [Li89b], which arises if shared variables written by different processors are allocated on the same page.

We have designed and implemented such a model, called the shared data-object model [Bal91]. The model is supported in a parallel programming language called Orca [Bal92]. There are at least two major advantages of using shared objects for programming transputers (and other parallel machines). First, the model provides the programmer with an abstract view of the machine. Programmers no longer have to worry about how to get data from one processor to another, since data movement is done by the system. Hence, the programming model is high-level and easy to use. A second and related advantage is portability. Orca programs can run on different network topologies. Architecture-dependent optimizations are done in the implementation of the model, not by the programmer. The Orca language also is operating system independent. We have implementations of Orca on a collection of Unix workstations, on the Amoeba [Tanenbaum90] processor pool, the CM-5 (running CMOST), and the transputer (running Parix). A program written in Orca can run unmodified on each of these platforms.

The ease of programming and portability also have their price. The performance of programs using shared objects is lower than that of message-passing programs hand-coded for a specific architecture. We believe that, in the long run, portability is more important than squeezing the last drop of performance out of the system, as was demonstrated years ago by high-level languages. The goal of this paper is to demonstrate that shared data-objects can be implemented on transputers with reasonable efficiency.

This paper is organized as follows. Section 2 describes the shared data-object model. The implementation makes use of a portable compiler and runtime system, as explained in Section 3.
The difficult problem on transputers is how to implement totally-ordered multicast, which the runtime system needs in order to keep replicated shared objects consistent. Various multicast algorithms and their performance are discussed in Section 4. Section 5 describes the performance of several example Orca applications that run on a network of 128 T800 processors. Section 6 discusses related work, and Section 7 contains our conclusions.


2 The Shared Data-Object Model

The key idea in the shared data-object model is that processes may communicate through shared variables of abstract data types, called shared data-objects. (For brevity, we also use the term 'object' instead of 'data-object'.) Even if the processors do not have a common shared memory, processes may share such objects. Programming with shared objects is similar to programming with shared variables, except that all accesses to shared data are through operations defined by an abstract data type (ADT). Unlike shared variables, the model guarantees that all these operations execute atomically (indivisibly). This greatly simplifies programming, since users do not have to worry about mutual exclusion synchronization. While all operations on a given object are conceptually serialized, the implementation may execute operations in parallel, provided that this has the same effect as serialized operations.

Each operation is always applied to a single object. This restriction makes it possible to implement the model efficiently, even on a distributed-memory machine. Programmers can build more complicated atomic actions involving multiple objects on top of the simple model, although this involves explicit synchronization.

Condition synchronization is integrated in the model by allowing operations to block. To implement a bounded buffer, for example, one can define a Put operation that blocks if the buffer is full and a Get operation that blocks if it is empty. An operation is only allowed to block initially. A blocking operation consists of one or more guarded statements, each containing a condition (Boolean expression) and a statement list. The operation blocks until at least one of the conditions is true. It then chooses one of the guarded statements whose condition is true and executes its statement list, without blocking again.

The shared data-object model is used as the basis for Orca. Orca has been designed to simplify the programming of distributed-memory systems. For example, it allows any data structure (e.g., a list or a graph) to be passed as a parameter in an operation, and automatically marshals parameters in the runtime system. Also, Orca is a type-secure language [Hoare81]. All violations of the type rules (e.g., array-bound errors) are detected by the compiler or the runtime system.

An example ADT (object type) in Orca is shown in Figure 1, which declares and implements a type IntObject. The specification part is the interface to the type. It lists the atomic operations that can be applied to objects of this type. The implementation part contains the representation (internal data) of the type and the implementation of the operations. Most of the code in Figure 1 is straightforward and similar to abstract data type definitions in sequential languages. The operation AwaitValue illustrates condition synchronization in Orca. The operation blocks until the expression after the keyword guard evaluates to True. It then executes the (empty) statement part between the keywords do and od.

Parallelism is expressed in Orca through the explicit creation of processes. A process can be created dynamically by a fork statement, which has the following form:

    fork process-name(parameters) [ on (CPU) ];

This statement creates a process and passes parameters to it; optionally, the processor on which the process is to run may be specified. By default, a process is created on the same CPU as its parent. Processes communicate through shared data-objects. An object may be passed as a shared parameter to a child process, as indicated by the declaration of the child process. For example, assume a process child has been declared as follows:

    process child(X: shared IntObject);
    begin
        ...
    end;


    object specification IntObject;
        operation Value(): integer;          # return value
        operation Assign(v: integer);        # assign new value
        operation Inc();                     # indivisibly increment value
        operation AwaitValue(v: integer);    # wait for certain value
    end;

    object implementation IntObject;
        x: integer;                          # internal data

        operation Value(): integer;
        begin
            return x;                        # return current value
        end;

        operation Assign(v: integer);
        begin
            x := v;                          # assign new value
        end;

        operation Inc();
        begin
            x +:= 1;                         # increment
        end;

        operation AwaitValue(v: integer);
        begin
            guard x = v do od;               # block until value equals v
        end;

    begin
        x := 0;                              # initialize objects to zero
    end;

Figure 1: Example abstract data type definition in Orca.

We can now declare an object MyObj of abstract data type IntObject and pass this object as a shared parameter when creating a new child process:

    MyObj: IntObject;
    ...
    fork child(MyObj);

This is similar to calling a conventional procedure and passing a call-by-reference parameter to it, except that the parent and child execute in parallel. Each process that can access a given object can apply operations to it, for example:

    MyObj$Assign(25);        # apply Assign(25) to MyObj
    tmp := MyObj$Value();    # apply Value() and store result in tmp

All processes sharing the object observe the effects of these operations. Any number of child processes can be created in this way, and the children can pass the shared objects on to their children, and so on. A hierarchy of processes communicating through shared

objects can thereby be created. If the processes sharing an object do not have access to a shared memory, the compiler and runtime system take care of object distribution, for example by replicating objects in the local memories. With replicated objects, read-only operations (such as Value and AwaitValue) are executed on the local copy and write operations (e.g., Assign and Inc) are broadcast.

3 Implementing Shared Data Objects on Transputers

The implementation of the shared data-object model uses an integrated approach involving compiler, runtime system, and operating system. Recently, the initial Orca implementation [Bal92] running on top of the Amoeba operating system has been redesigned as a layered structure (Figure 2) to improve portability across a wide range of hardware architectures [Bhoedjang93].

    1  Compiler
    2  Runtime system (object management)
    3  Panda (group communication, RPC, and threads)
    4  Operating System/Hardware (including network)

Figure 2: Layers in the implementation of the shared data-object model.

The compiler (layer 1) has been rewritten to generate ANSI C [ANSI89] augmented with calls to the runtime system (RTS) for handling shared data objects. While porting Orca to the transputer, we did not make any changes to the Orca compiler, because the target system, Parix [Par93], provides an ANSI C compiler. The RTS (layer 2) could also be used without modification, because all machine dependencies are put in the third layer (Panda), which is a small shell around the native operating system providing only the functionality needed by the RTS. Both the RTS and the Panda layer will be discussed in detail below.

3.1 The runtime system

The RTS is responsible for managing objects. Its prime concern is whether or not to replicate a shared object. Replication is effective if processes frequently read the object (i.e., perform operations that do not change its state), because read operations can then be done locally without any communication. Writes on replicated objects are expensive, because all copies have to be kept consistent. Hence, for frequently written objects it is profitable to store a single instance at the CPU holding the most active process. The decisions about replication and migration are made by the RTS based on information provided by the compiler. Whenever a new process is forked, the RTS uses the compiler's estimate of how many read and write operations this process will invoke on its shared objects. This static access information is combined with knowledge of communication costs to dynamically decide the best representation for each shared object. A detailed description can be found in [Bal93].

Invoking operations on shared objects that are not replicated is easy. If the object is stored remotely, the RTS does a Remote Procedure Call (RPC) to the owning node; otherwise, the operation is performed locally. Invoking operations on replicated objects is more complex: performing a read operation is simple since the data are available locally, but performing a write operation is difficult because all copies have to be kept consistent. The RTS broadcasts such write operations so that all copies will be updated. This update policy requires that the underlying broadcast facility orders messages system-wide, as shown in the following example (Figure 3).

Figure 3: Illustration of the consistency problem. Two processors A and B each hold a replica of a shared object (x = 0); A broadcasts Assign(1) while B broadcasts Assign(2).

Consider the case in which two processors A and B share an object of type IntObject that is replicated on both nodes. If both processes invoke an Assign operation on the shared object, then the respective RTSes will each send out a broadcast message containing the operation to be performed (Assign) and a parameter value (say 1 for A and 2 for B); we denote these messages as MA and MB. Whenever a broadcast message arrives at a node, the local RTS performs the operation as specified in the message. If the broadcast facility does not order messages, then the following scenario leads to an inconsistency. At node A, message MA is delivered first, followed by message MB, while the messages at node B are delivered in the reverse order (MB followed by MA). The result is that after the RTS has processed both messages, the shared IntObject has the value 2 at node A, while at node B the value is 1. This would clearly violate the semantics of the language.

We solve the problem by using totally-ordered group communication [Kaashoek92], which guarantees that all broadcast messages are received by all processors in the same order. The Panda layer offers this communication primitive to the RTS. The problem is addressed in the Panda layer, because our experience showed that handling it in the RTS is cumbersome and less efficient.
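
The dispatch logic described in this section can be summarized in a short C sketch. The object descriptor fields and the panda_* calls below are our own stand-ins for illustration; they mirror the description above, not the actual Orca RTS source.

    /* Sketch of the RTS dispatch for an operation on a shared object.
     * All identifiers are hypothetical; the real Orca RTS differs in detail. */
    typedef struct object {
        int   replicated;   /* is a copy present on every node?        */
        int   owner;        /* owning node when not replicated         */
        void *state;        /* local object state, if present          */
    } object_t;

    extern int  my_cpu(void);
    extern void apply_op(object_t *obj, int op, void *args);          /* run locally    */
    extern void panda_rpc(int node, object_t *obj, int op, void *args);
    extern void panda_group_send(object_t *obj, int op, void *args);  /* totally ordered */

    void rts_invoke(object_t *obj, int op, int is_write, void *args)
    {
        if (!obj->replicated) {
            /* Single copy: execute locally, or RPC to the owning node. */
            if (obj->owner == my_cpu())
                apply_op(obj, op, args);
            else
                panda_rpc(obj->owner, obj, op, args);
        } else if (!is_write) {
            /* Replicated object, read-only operation: use the local copy. */
            apply_op(obj, op, args);
        } else {
            /* Replicated object, write operation: broadcast it; every node
             * (including this one) applies the operation when the message is
             * delivered, so all replicas see updates in the same order. */
            panda_group_send(obj, op, args);
        }
    }

Because every node applies write operations in the single system-wide delivery order, the scenario of Figure 3 cannot arise: all replicas end up with the value of whichever Assign is ordered last.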

3.2 The Panda layer

Since we want the RTS to be machine independent, it is not implemented directly on top of the operating system, but on top of Panda. Panda is a virtual machine that offers general and flexible support for implementing runtime systems for parallel programming languages [Bhoedjang93]. Panda provides a subset of POSIX threads (without the real-time support) [POSIX92], RPC, and totally-ordered group communication [Kaashoek92]. To achieve portability across a wide range of parallel machines, Panda assumes that the underlying operating system provides nothing more than unreliable point-to-point communication. If the underlying hardware also supports multicast or broadcast communication, then Panda will use that, instead of sending multiple point-to-point messages, to improve performance.

Implementing Orca on transputers therefore amounts to porting the Panda layer to the Parix operating system. Since Parix offers threads and reliable point-to-point communication, little effort was required to implement Panda threads and RPC. It was considerably more difficult to implement the totally-ordered group communication part of Panda efficiently, since neither the transputer hardware nor Parix provides any form of broadcasting. The next section discusses the various totally-ordered multicast algorithms we have developed for grid-like topologies.
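
To give an impression of what the RTS programs against, the following C declarations sketch the kind of interface Panda might present; all names and signatures are our own guesses reconstructed from the description above, not the real Panda API.

    /* Hypothetical Panda-style interface sketch (names are assumptions). */

    /* Threads: a subset of POSIX threads, without the real-time extensions. */
    typedef struct panda_thread panda_thread_t;
    panda_thread_t *panda_thread_create(void (*func)(void *arg), void *arg);
    void            panda_thread_join(panda_thread_t *t);

    /* RPC: used by the RTS for operations on remote, single-copy objects. */
    void panda_rpc_call(int node, const void *request, int req_len,
                        void *reply, int *reply_len);

    /* Totally-ordered group communication: every group member receives all
     * messages in one system-wide order; used for writes on replicated
     * objects. The handler is invoked once per delivered message. */
    typedef void (*panda_group_handler_t)(const void *msg, int len);
    void panda_group_join(panda_group_handler_t deliver);
    void panda_group_send(const void *msg, int len);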


4 Totally ordered multicast algorithms

While porting Panda to the transputer, we implemented four different totally-ordered multicast algorithms. The first two of these algorithms are taken from the Amoeba operating system, which supports totally-ordered group communication and runs on a collection of processors connected by a local area network such as Ethernet [Kaashoek92]. The other two algorithms were developed especially for communication on networks of point-to-point connected processors (e.g., a transputer grid). Note that we are not developing new multicast algorithms for point-to-point networks, but focus on making multicast algorithms totally ordered.

4.1 Protocol descriptions

All four multicast algorithms use a single node, called the sequencer [Kaashoek92], to order the messages system-wide. Although the sequencer becomes a bottleneck on very large machines, this approach is much simpler than decentralized protocols that use timestamp vectors and the like. The four totally-ordered multicast protocols work as follows.

The PB protocol. The sender of a multicast message first sends it as a point-to-point message to the sequencer. On receipt, the sequencer tags the message with a global sequence number and broadcasts the message to all nodes in the grid. Each node delivers incoming messages to the user (i.e., the RTS) in strict order, as determined by the sequence numbers. The broadcast is implemented as a sequence of point-to-point messages along a spanning tree. We call this the PB protocol (Point-to-point followed by a Broadcast).

The BB protocol. A disadvantage of the PB protocol is that the throughput is severely limited, since all data has to pass through the sequencer node. The BB protocol overcomes this disadvantage by having the sender broadcast the data message itself, tagged with a unique message id; when the sequencer receives the broadcast, it broadcasts a corresponding accept message containing the global sequence number for that message id. All nodes use these sequence numbers to order the broadcast messages originating from different sources. We call this the BB protocol (Broadcast followed by a Broadcast).

The EBB protocol. To improve the latency of sending totally-ordered multicast messages with the BB protocol, the EBB protocol sends a small Express message to the sequencer prior to broadcasting the data message. On receipt of the express message, long before the data message arrives, the sequencer sends out the accept as with the BB protocol. For large messages, the EBB protocol reduces the latency considerably in store-and-forward networks such as a transputer mesh.

The GSB protocol. Although handing out sequence numbers is a computationally inexpensive task, the sequencer is likely to become a bottleneck when scaling to large numbers of processors, because of all the message traffic to and from it. To stretch the limits, the GSB protocol avoids the broadcasts issued by the sequencer in the previous three protocols. Sending a multicast message consists of requesting a global sequence number by sending a point-to-point message to the sequencer, which replies with another point-to-point message. The sender then broadcasts the data message, tagged with its sequence number, along a spanning tree. Note that the sequencer only sends out control messages point-to-point. We call this the GSB protocol (Get Sequence number then Broadcast).
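
To make the protocols concrete, here is a minimal C sketch of a GSB-style sender and the in-order delivery logic at a receiver. All primitives and constants (send_to, recv_from, broadcast_tree, deliver_to_rts, SEQUENCER, WINDOW) are hypothetical stand-ins, not the actual Panda/Parix code, and error handling is omitted.

    #include <stddef.h>

    #define SEQUENCER 0      /* node id of the sequencer (an assumption)   */
    #define WINDOW    64     /* size of the reorder buffer (an assumption) */

    typedef struct msg { unsigned seqno; /* ... payload omitted ... */ } msg_t;

    /* Hypothetical communication primitives of the Parix-based layer. */
    extern void   send_to(int node, msg_t *m);   /* reliable point-to-point      */
    extern msg_t *recv_from(int node);
    extern void   broadcast_tree(msg_t *m);      /* unordered spanning-tree bcast */
    extern void   deliver_to_rts(msg_t *m);

    /* Sender side: get a sequence number point-to-point, then broadcast the
     * tagged data message along the spanning tree. */
    void gsb_multicast(msg_t *data)
    {
        msg_t req = { 0 };
        send_to(SEQUENCER, &req);                /* request a sequence number */
        msg_t *reply = recv_from(SEQUENCER);     /* point-to-point reply      */
        data->seqno = reply->seqno;
        broadcast_tree(data);
    }

    /* Receiver side: buffer out-of-order arrivals and deliver strictly in
     * global sequence order. The sliding-window flow control of Section 4.2
     * ensures that seqno always fits in the window. */
    static msg_t  *window[WINDOW];
    static unsigned next_expected = 1;

    void gsb_on_arrival(msg_t *m)
    {
        window[m->seqno % WINDOW] = m;
        while (window[next_expected % WINDOW] != NULL) {
            msg_t *ready = window[next_expected % WINDOW];
            window[next_expected % WINDOW] = NULL;
            deliver_to_rts(ready);               /* strict global order */
            next_expected++;
        }
    }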

In summary, all four protocols are based on a single sequencer that orders messages system-wide. The protocols, however, differ in the size, number, and nature (point-to-point vs. broadcast) of control messages used to accomplish the total ordering. A comparison of the performance of the protocols is provided in Section 4.3. The next section discusses the effort undertaken to arrive at efficient implementations of these algorithms running on top of the Parix operating system.

4.2 Implementation considerations

The four multicast protocols use two types of communication: point-to-point communication with the sequencer, and (unordered) spanning-tree broadcast. Unfortunately, the Parix operating system does not provide high-level communication primitives that can be used directly: there is no support for broadcast, while the virtual-link primitives would use too much memory in the central sequencer node. Hence, we implemented a suitable communication layer on top of Parix's reliable send/receive functions.

The spanning-tree broadcast, which can originate at any node, has been implemented from scratch using one routing daemon per node. Parix is used only to send messages between immediate neighbors. A broadcast starts by sending out the message on all available links. When a broadcast message is received on some link, the daemon delivers the message to the application and forwards the message to its neighbors down the spanning tree. These neighbors are determined by the relative position of the intermediate node to the source node that initiated the broadcast. The forwarding of messages to multiple links is done in parallel, without copying the message. The danger of using a fixed spanning tree per node is that, for example, in the case of a single sender (e.g., the sequencer in the PB protocol), only half of the communication links in the grid are used. Therefore, we alternate between two spanning trees per node. Another important optimization is that a broadcast message is forwarded asynchronously by placing it on multiple output queues, so that in case of congestion on one link the message can still be sent out immediately on other branches of the spanning tree.

A hard problem for any broadcast protocol is how to do flow control, to avoid deadlocks caused by running out of message buffers. Fortunately, the ordering of messages helps out for the PB and GSB protocols. In the case of PB, the sequencer does all broadcasts, so the messages always flow along the same spanning trees, which automatically guarantees that no cyclic deadlock can arise. Flow control for GSB is more difficult, since messages are broadcast from different sources, but all messages include a global sequence number. This sequence number is used to implement a kind of sliding-window protocol. Each node allocates a window of message buffers, and keeps its neighbors informed about the highest sequence number it is willing to accept by piggybacking it on ordinary broadcast messages. This limit changes whenever the message with the lowest sequence number in the window has been forwarded to the neighbors along its spanning tree and has been processed by the application. This assures that the message with the system-wide lowest sequence number can always be forwarded. Thus, deadlock cannot occur if the application eventually consumes all messages. For BB and EBB, messages are broadcast from different sources without globally unique sequence numbers, so it is impossible to use the sliding-window flow control mechanism of the GSB protocol. There is no simple solution, so BB and EBB simply run without any flow control.

To implement point-to-point communication with the sequencer, we would like to use Parix virtual links, which provide low-latency, connection-oriented communication between arbitrary nodes. These virtual links, however, consume too much memory for buffering (almost 1 Kbyte per link). The sequencer cannot afford to have connections with all members, because it has only 4 Mbyte of local memory, which constrains the problem sizes that Orca applications can handle.

Parix offers both synchronous and asynchronous primitives for connectionless communication between arbitrary nodes. We would of course like to use the fast asynchronous primitives, but unfortunately Parix throws messages away without notification when the destination mailbox overflows. Since we do not want to add reliability to Parix's 'reliable' communication primitives, we use the synchronous primitives. Their high latency turned out to be a problem, especially for the GSB protocol, where the sequencer has to wait for an acknowledgement each time it sends out a sequence number. This is a significant problem, because the sequencer is a critical resource. For the other multicast protocols, the sequencer does not send point-to-point messages to distant nodes, so the problem is less severe. For GSB, the problem is solved by having the sequencer send back the requested sequence number asynchronously; there is no danger of buffer overflow, since each node can have only one outstanding request.

Given the spanning-tree broadcast and point-to-point communication mechanisms, the four totally-ordered multicast protocols are rather straightforward to implement. The remaining decision is where to place the sequencer. Parix offers the opportunity to configure the transputer machine either as a two-dimensional grid or as a torus. In the case of the grid topology, the sequencer is best placed in the center, to minimize the average distance to the sequencer. In the case of the torus, its placement does not matter, since any node is conceptually in the center. The torus configuration, however, is handled in software by spreading the 'logical' nodes over the underlying hardware transputer grid, which gives equal, but much higher, communication times between any two neighbors. Hence, we use the grid topology and explicitly place the sequencer in the center.
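
The forwarding rule used by the routing daemons can be illustrated in C. The paper does not spell out the exact tree shape (and the implementation alternates between two trees per source), so the row-column tree below is an assumption for illustration only, not the real code.

    /* One plausible spanning tree for broadcasting on a grid: the message
     * travels along the source's row, and every node on that row feeds it
     * into its own column. Convention: x grows eastwards, y grows southwards;
     * the router is assumed to drop forwarding requests on links that do not
     * exist at grid edges. */
    typedef enum { WEST, EAST, NORTH, SOUTH, NLINKS } link_t;

    /* Mark in out[] the links on which node (x, y) must forward a broadcast
     * that originated at (sx, sy). The arrival link always points towards the
     * source, so it is never selected and each node receives the message
     * exactly once. */
    void spanning_tree_children(int x, int y, int sx, int sy, int out[NLINKS])
    {
        for (int l = 0; l < NLINKS; l++)
            out[l] = 0;

        if (y == sy) {
            /* On the source's row: keep moving away from the source along
             * the row, and start the message off into this node's column. */
            if (x >= sx) out[EAST] = 1;
            if (x <= sx) out[WEST] = 1;
            out[NORTH] = 1;
            out[SOUTH] = 1;
        } else if (y > sy) {
            out[SOUTH] = 1;    /* below the source's row: keep moving south */
        } else {
            out[NORTH] = 1;    /* above the source's row: keep moving north */
        }
    }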

4.3 Comparison

To assess the quality of the four totally-ordered multicast algorithms, we measured two important properties of communication protocols: latency and throughput. All experiments were carried out on an idle segment of the transputer grid with varying numbers of processors (up to 128) and varying message sizes (up to 8 Kbyte). The Parsytec machine we used contains T800 transputers with 4 Mbyte of local memory, connected by 20 Mbit/sec links in a two-dimensional grid.

The latency of a broadcast message heavily depends on the position of the sending node in the grid. The worst-case latency for all four protocols is the time needed for a broadcast message from one corner to arrive at the opposite corner of the grid. Figure 4 presents the average results measured by sending 10,000 broadcast messages back and forth between opposite corners; each of the two processors repeatedly sends a message and then waits for the other processor to reply. Clearly the PB protocol outperforms the other three. This is a consequence of the position of the node that broadcasts the data message: in the case of PB, the sequencer in the middle issues the 'real' broadcast, while for the three other protocols the broadcast starts in the corner. Thus for PB, the data is sent to the middle of the grid as a point-to-point message, and from then on it is forwarded by a broadcast. The timings show that sending data point-to-point via Parix is cheaper than forwarding it in software along a spanning tree. The results for larger messages (up to 8 Kbyte) are similar to those presented for 1 Kbyte messages, but for small messages of 128 bytes the difference between PB and the other protocols is somewhat smaller.

Figure 4: Worst-case latency for ordered broadcast messages of 1 Kbyte. (Plot: time in msec against the number of CPUs, 4 to 128, for the PB, BB, EBB, and GSB protocols.)

The throughput of the multicast protocols has been measured by running a program in which each node repeatedly broadcasts messages. To avoid flooding the system completely, each node waits for its own message to be ordered and delivered locally. The throughput is computed by dividing the total amount of data sent out over the links by the execution time for a multicast protocol. To see how well the protocols perform, we compare throughputs to the maximum bandwidth Parix offers. This maximum is measured by a test program that runs on each CPU and sends out messages of a certain size as fast as possible on all links; incoming messages are accepted and immediately thrown away. The results are given in Figure 5.

Figure 5: Maximal throughput for ordered broadcast messages of 1 Kbyte. (Plot: effective percentage of the Parix bandwidth against the number of CPUs, 4 to 128, for the PB, BB, EBB, and GSB protocols.)

The ratios show that GSB performs best and manages to achieve almost 25% of the bandwidth offered by Parix on 128 transputers. The overhead stems from the additional control messages to request and return sequence numbers, and from congestion on the communication links close to the sequencer. The other three protocols perform significantly worse than GSB, for various reasons. In the case of PB, the sequencer has to broadcast all data messages, which leads to heavy contention on its links. In the case of BB, broadcasting the sequence numbers leads to many more messages in the grid than the point-to-point messages of GSB. This also holds for EBB, which sends an additional express message, and which runs out of buffer space on 128 transputers. The throughput numbers for other message sizes show that GSB always performs best: for smaller messages its relative performance is even better than for the 1 Kbyte messages presented here.

Based on the latency and throughput numbers of the four multicast protocols, we have decided to use GSB for implementing the Orca RTS. GSB performs best in terms of throughput and only suffers from latency problems for large messages. The alternative PB protocol has the lowest worst-case latency, but performs very poorly when multiple senders are active, because the sequencer has to process all the data.
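
For reference, the corner-to-corner ping-pong used for the latency figures can be sketched as follows in C. The primitives ordered_multicast(), await_delivery_from(), and wall_time() are hypothetical stand-ins for the benchmark's actual interface to the multicast layer.

    #include <stdio.h>

    #define ROUNDS 10000

    extern void   ordered_multicast(const void *buf, int len);  /* any of PB/BB/EBB/GSB */
    extern void   await_delivery_from(int node);  /* block until that node's bcast arrives */
    extern double wall_time(void);                /* wall-clock time in seconds */

    void pingpong(int peer, int am_initiator, int msg_len)
    {
        static char buf[8192];                    /* up to 8 Kbyte, as in the experiments */
        double start = wall_time();

        for (int i = 0; i < ROUNDS; i++) {
            if (am_initiator) {
                ordered_multicast(buf, msg_len);  /* send, then wait for the reply */
                await_delivery_from(peer);
            } else {
                await_delivery_from(peer);        /* wait, then reply */
                ordered_multicast(buf, msg_len);
            }
        }
        if (am_initiator)                         /* each round contains two broadcasts */
            printf("average latency: %.3f msec\n",
                   (wall_time() - start) * 1000.0 / ROUNDS / 2.0);
    }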

5 Application performance

In this section we look at the performance of three example applications (TSP, ASP, and SOR). The Orca programs for these applications have been described elsewhere [Bal91, Bal92]. The speedups for the three programs are given in Figure 6. The speedups are computed by comparing the execution times against the runtime of the parallel code on a single processor. All measurements were carried out on an idle segment of the transputer grid.

Figure 6: Performance of Orca applications. (Plot: speedup of TSP, ASP, and SOR against the number of CPUs, 1 to 128, with the linear speedup line for reference.)

The Traveling Salesman Problem (TSP) uses a parallel branch-and-bound algorithm to find the shortest route visiting 15 cities. The program obtains a speedup of 92 on 128 transputers. It uses

two shared objects: one holding the minimum length of the best tour found so far, and another storing the job queue with partial tours. The minimum object is replicated by the RTS, since it is read frequently (to prune the search tree) and written infrequently (to install a shorter tour). The job queue is not replicated, because both insert and delete operations modify the queue; if the queue were replicated, each operation would result in a broadcast message. With 128 CPUs, the object containing the current minimum length is updated only a few times, so the sub-linear speedup (92) is not caused by the GSB multicast protocol, but is the result of distributing the work through a single job queue. The program is based on the replicated-workers model, and the centralized job queue becomes a bottleneck when a large number of processors is used.

The All-pairs Shortest Paths (ASP) program uses an iterative algorithm. At the beginning of each iteration, one process puts a pivot row in a shared object, which is subsequently read by all processes. The RTS takes advantage of this behaviour and replicates the shared object containing the pivot rows, which results in one broadcast per iteration. Because the number of broadcasts does not depend on the number of processors, we expected ASP to scale well, but this expectation is not supported by the speedup curve in Figure 6. The problem with ASP is that the input graph with 500 nodes is partitioned over all CPUs, which results in a grain size that is too small when many processors are used. Running the ASP program with a bigger input graph was not possible because of the small amount of memory per transputer (4 Mbyte).

The Successive Over-Relaxation (SOR) program uses many buffer objects for transferring data between neighboring processors at the end of each iteration. Since both insert and delete operations modify such an object, the RTS does not replicate the buffers. Instead, each buffer is stored on one of the two processors that access it. The SOR program exchanges a lot of data at each iteration, but since all communication is between immediate neighbors, we expected good speedups. The results in Figure 6, however, show that SOR achieves a modest speedup of 53 on 128 transputers. Again, this is a consequence of using an input problem that is too small (a 512 by 512 grid). For this program, however, we were able to use larger data sets, because the data are partitioned among the different processors. We therefore also ran the program with a grid of 8192 by 256 cells. On 128 processors the program now was 7.15 times faster than on 16 processors, which is close to the maximum speedup of 8. The program could not be run on 1 processor with this input problem, because it then needed more than 4 Mbyte of memory. The speedup of the SOR program is also limited by a barrier object, which determines at the end of each iteration whether the computation is finished or not. This barrier object is replicated by the RTS. The object uses only small operations (which broadcast a few bytes), so the performance of the SOR program could probably be improved by using the PB protocol.

For comparison, the same algorithms have also been implemented in ANSI C with explicit calls to Parix's message-passing routines. The sequential Orca code is about a factor of two slower than sequential C, mainly because the C compiler performs more global optimizations than the Orca compiler. (The C compiler is ineffective at optimizing the output of the Orca compiler.) As a result, the communication overhead relative to computation time is lower for Orca than for C, which results in better speedups for Orca. (We are currently implementing global optimizations in the Orca compiler, to allow better comparisons.) The parallel C program for the TSP problem achieves a speedup of 77 on 128 nodes. Like the Orca program, it suffers from using a sequential job queue. The Orca TSP program achieves a better relative speedup (92), probably because it has less relative communication overhead. For ASP, the C program obtains a maximum speedup of 66 (compared to 61 for Orca). For SOR, the maximum speedup is only 38 on 128 nodes. Even though these comparisons are not entirely fair, they at least show that the Orca performance is quite acceptable.

In conclusion, we think the results show that the shared data-object model can be implemented efficiently on a transputer system. The example programs achieve good speedups up to 64 processors. Even on 128 nodes, the performance is not limited by the broadcast protocol, but by the grain size and the synchronization behaviour of the applications.

6 Related Work

Much research has been done on broadcast and multicast algorithms for multicomputers. For example, Tiny [Clarke91] is a message router implemented on T800 transputers that supports broadcasting. Its broadcast algorithm uses a tree structure: each processor forwards broadcast messages to a predetermined subset of its neighbors. Each message is forwarded one hop at a time, so processors at the edge of the grid experience a long delay before they receive the message. On multicomputers that support wormhole routing, a message can be forwarded to processors that are further away with roughly the same delay as sending it to a neighbor. This approach leads to more efficient broadcast algorithms [Barnett91]. Our Orca system could easily be adapted to use such algorithms if ported to machines with wormhole routing (e.g., the T9000).

Results on broadcast algorithms for multicomputers typically do not take message ordering into account. At best, most algorithms guarantee FIFO ordering. For distributed systems, on the other hand, many protocols with stronger ordering semantics have been proposed [Birman91, Chang84, Kaashoek92, Peterson89]. Most of these assume unreliable networks (e.g., Ethernet) and can deal with lost messages and processor failures. These protocols often add a large amount of state information to each message (e.g., vectors with sequence numbers in ISIS [Birman91]), which makes them less efficient for multicomputers.

Most of the work on implementing distributed shared memory models also takes place in the distributed systems community. The most popular approach is page-based shared virtual memory, which simulates physical shared memory [Bennett91, Bershad93, Fleisch87, Li89b]. A few SVM systems have also been implemented on multicomputers, for example Shiva [Li89a] and KOAN [Priol92]. Experience with KOAN indicates that the programmer needs to do several difficult optimizations (e.g., page alignment) to get acceptable performance, which makes programming complicated [Priol92]. This is also a problem with memory coherence protocols implemented in hardware, like DASH [Lenoski92], Alewife [Chaiken90], and the Data Diffusion Machine [Raina94], because such systems cannot transfer large data blocks efficiently or take advantage of static communication patterns [Kranz93]. The Data Diffusion Machine is not implemented in hardware yet, but has been simulated on a parallel transputer machine. All data references are trapped in software, which results in a large overhead that makes the transputer system unsuitable as a shared memory system.

Since simulating shared memory efficiently on transputers is hard, an alternative is to provide a shared-data programming model different from shared memory. One such model is the Linda Tuple Space [Carriero89], which is a globally-shared associative store of tuples. A possible implementation of Tuple Space on transputers is described in [MacDonald89]. One of the main problems in a distributed implementation is locating the processor that contains a requested tuple. Since the Tuple Space uses associative addressing, this may involve searching many processors, which introduces communication overhead. The shared data-object model does not have this problem. Also, the model allows programmers to define operations of arbitrary complexity, whereas the Linda model has a fixed number of operations built in. More complicated operations can be built out of multiple Tuple Space operations, but this again incurs communication overhead, since each Tuple Space operation usually results in one or more messages being sent.

7 Conclusions

We have implemented an object-based distributed shared memory system on a grid of transputers. The advantages of object-based DSM over page-based DSM are three-fold: no MMU support is needed, no fixed-size pages are used, and the problem of false sharing is avoided. Our implementation uses a portable runtime system, which is suitable for networks of workstations as well as multicomputers. Only a small part of this RTS is system-dependent and had to be adapted to the transputer environment (Parix).

The main issue was how to implement totally-ordered group communication on a transputer grid. We have implemented four different multicast algorithms, which use a single sequencer to order messages system-wide. Latency and throughput measurements show that the simple GSB (Get Sequence number then Broadcast) algorithm is in general the best choice.

We have also looked at the performance of three example applications written in Orca. The programs achieve good speedups on 64 nodes. Further performance improvements on larger systems were limited by the maximum problem size we could use. Even on 128 nodes, the sequencer still was not the bottleneck. Our conclusion is that it is possible to implement object-based DSM efficiently on transputers. The greatest advantages for programmers are that programming is easier than with message passing (because processes can share state information) and that programs are portable. The same Orca program can run unmodified on transputers and on Ethernet-based distributed systems, for example. The price paid for these advantages is a performance penalty. Our experience thus far indicates that this penalty is small and that Orca can achieve performance competitive with message passing.

Acknowledgements

We would like to thank Oswald Elmont for coding the C applications and testing them on the transputer machine, which was kindly made available by the University of Amsterdam (UvA). Saniya Ben Hassen, Raoul Bhoedjang, Ceriel Jacobs, Rutger Hofman, Tim Rühl, and Greg Wilson made valuable comments on draft versions of this paper.

References

[ANSI89] ANSI. ANS X3.159-1989: Programming Language C. American National Standards Institute, 1989.

[Bal91] H. Bal. Programming Distributed Systems. Prentice Hall Int'l, Hemel Hempstead, UK, 1991.

[Bal92] H. Bal, M. Kaashoek, and A. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18(3):190–205, 1992.

[Bal93] H. Bal and M. Kaashoek. Data distribution in Orca through compiler optimization. In Conference on Object-Oriented Programming Systems, Languages and Applications, pages 162–177, Washington D.C., 1993.

[Barnett91] M. Barnett, D. G. Payne, and R. van de Geijn. Optimal broadcasting in mesh-connected architectures. Technical Report TR-91-38, University of Texas, Computer Science, 1991.

[Bennett91] J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Implementation and performance of Munin. In Proc. of the Thirteenth ACM Symposium on Operating System Principles, pages 152–164, 1991.

[Bershad93] B. Bershad, M. Zekauskas, and W. Sawdon. The Midway distributed shared memory system. In COMPCON, 1993.

[Bhoedjang93] R. Bhoedjang, T. Rühl, R. Hofman, K. Langendoen, H. Bal, and M. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, pages 213–226, 1993.

[Birman91] K. Birman, A. Schiper, and P. Stephenson. Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems, 9(3):272–314, 1991.

[Burns88] A. Burns. Programming in occam 2. Addison-Wesley, Wokingham, England, 1988.

[Carriero89] N. Carriero and D. Gelernter. How to write parallel programs: A guide to the perplexed. ACM Computing Surveys, 21(3):323–357, 1989.

[Chaiken90] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal. Directory-based cache coherence in large-scale multiprocessors. IEEE Computer, pages 49–58, 1990.

[Chang84] J. Chang and N. Maxemchuk. Reliable broadcast protocols. ACM Transactions on Computer Systems, 2(3):251–273, 1984.

[Clarke91] L. Clarke and G. Wilson. Tiny: An efficient routing harness for the INMOS Transputer. Concurrency: Practice and Experience, 3(3):221–245, 1991.

[Fleisch87] B. Fleisch. Distributed shared memory in a loosely coupled distributed system. In Computer Communication Review (SIGCOMM '87 Workshop on Frontiers in Computer Communications Technology, Stowe, Vermont), volume 17(5), pages 317–327, 1987.

[Fleisch89] B. Fleisch and G. Popek. Mirage: A coherent distributed shared memory design. In Proc. of the 12th ACM Symposium on Operating System Principles, pages 211–223, Litchfield Park, AZ, 1989.

[Hoare81] C. Hoare. The emperor's old clothes. Communications of the ACM, 24(2):75–83, 1981.

[Kaashoek92] M. Kaashoek. Group Communication in Distributed Computer Systems. PhD thesis, Vrije Universiteit, Amsterdam, 1992.

[Kranz93] D. Kranz, K. Johnson, and A. Agarwal. Integrating message-passing and shared-memory: Early experience. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 54–63, 1993.

[Lenoski92] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH multiprocessor. IEEE Computer, pages 63–79, 1992.

[Li89a] K. Li and R. Schaefer. A hypercube shared virtual memory system. In Proc. 1989 Int. Conf. on Parallel Processing (Vol. I), pages 125–132, St. Charles, Ill., 1989.

[Li89b] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321–359, 1989.

[MacDonald89] N. MacDonald. A distributed Linda kernel. Technical Report ECSP-TN-33, Edinburgh Parallel Computing Centre, 1989.

[Par93] Parix 1.2, Reference Manual. Parsytec, 1993.

[Peterson89] L. Peterson, N. Buchholz, and R. Schlichting. Preserving and using context information in interprocess communication. ACM Transactions on Computer Systems, 7(3):217–246, 1989.

[POSIX92] POSIX. Threads Extensions for Portable Operating Systems, P1003.4a. IEEE Standards Project, draft 6, 1992.

[Priol92] T. Priol and Z. Lahjomri. Experiments with shared virtual memory on an iPSC/2 hypercube. In Proc. 1992 Int. Conf. on Parallel Processing (Vol. II), pages 145–148, St. Charles, Ill., 1992.

[Raina94] S. Raina. Emulation of a Virtual Memory Architecture. PhD thesis, Bristol University, U.K., 1994.

[Sunderam90] V. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315–339, 1990.

[Tanenbaum90] A. Tanenbaum, R. van Renesse, H. van Staveren, G. Sharp, S. Mullender, A. Jansen, and G. van Rossum. Experiences with the Amoeba distributed operating system. Communications of the ACM, 33(2):46–63, 1990.