High-Performance Distributed Shared Memory Substrate for Workstation Clusters

Arindam Banerji, Dinesh C. Kulkarni, John M. Tracey, Paul M. Greenawalt, David L. Cohn
Distributed Computing Research Laboratory
Department of Computer Science and Engineering
University of Notre Dame, Notre Dame, IN 46556

Technical Report 93-1

January, 1993

To be published in the Proceedings of the Second International Symposium on High Performance Distributed Computing, Spokane, WA, July 1993.

High-Performance Distributed Shared Memory Substrate for Workstation Clusters

Arindam Banerji, Dinesh Kulkarni, John Tracey, Paul Greenawalt and David Cohn
Distributed Computing Research Laboratory
University of Notre Dame, Notre Dame, IN 46556

Abstract

In order to exploit the latest advances in hardware technology, application developers need high-performance, easy-to-use cooperation tools that span interconnections of standard hardware. Distributed shared memory has been proposed as such a cooperation tool, but performance problems have limited its usefulness. This paper argues that a new approach to distributed shared memory implementation can make it an effective tool in its own right and a foundation for other tools. It describes a prototype implementation that allows sharing of memory resources in a workstation cluster. The prototype is based on an innovative, low-overhead messaging protocol which utilizes the high bandwidth of the underlying hardware while adding very little latency overhead. Finally, the interface exported by this software is designed to function effectively as a substrate for a variety of cooperation tools.

1: Introduction

Clustered workstations offer the same potential peak processing power as a modern multiprocessor, but this power is difficult to harness. Gigabit fibers provide bandwidth comparable to that of multiprocessor backplanes, and latency within an order of magnitude. Thus, the hardware is available, and this paper presents a new system software approach designed to exploit it. It proposes coupling a distributed shared memory implementation with efficient communication to form a canonical substrate that can serve as the foundation for a variety of cooperation tools. Early experimental results from a prototype implementation are reported.

The use of shared memory as a foundation for cooperation tools has its origins in the design of multiprocessors [Li86]. Some multiprocessors use physically shared memory while others provide shared memory emulation. Applications use this (emulated) shared memory either directly or through system software tools built on top of

it. The physical interconnection bus in a shared memory multiprocessor (SMMP) hides the distinction between remote and local memory; locality affects only latency, not access method. Users have generally found shared memory a good cooperation tool for SMMPs, but performance problems have limited its usefulness in clustered systems. Clustered workstations do not have a bus and normally do distinguish between local and remote memory. However, system software that emulates an SMMP bus also reduces locality differences to latency differences and makes the cluster the logical equivalent of a multiprocessor.

Our prototype system software has been designed to make a set of workstations interconnected with high-bandwidth optical fiber channels look like a shared-memory multiprocessor. An extension to the operating system, called the pager, allows a portion of each machine's address space to be logically shared throughout the cluster. Classic distributed shared memory implementation techniques have been modified, based on the speed and reliability of the optical links, to offer high performance. This efficiency rests upon three factors:
• The characteristics of a new generation of communication hardware
• A thin and extremely efficient communication protocol
• An implementation with minimal overhead compared to message passing
In addition, the software architecture of the pager provides extensibility and an easy-to-use programming interface.

This paper begins with a brief overview of the pager design, followed by a description of the hardware resources used for the construction and evaluation of this prototype. The performance of the pager and its various components is described in Section 4. This is followed by a discussion of the prototype pager implementation and an analysis of the performance results. Section 6 details the pager software architecture and its use as an operating substrate. The paper ends with a discussion of the

advantages of this approach and describes some of the future work already under way.

2: Pager Overview

The pager substrate provides location-independent access to the memory resources of all machines within a cluster. Pager instances communicate with each other through a special-purpose message protocol that is a client of the device driver for an optical fiber communication link. Figure 1 illustrates the relationship between the pager, the message protocol and the device driver. This section summarizes the pager services and semantics; a more detailed discussion is presented later.

[Figure 1 - Implementation Relationships. Kernel calls, system calls and page faults enter the pager; the pager calls the message protocol, which maintains a transmit queue and message handlers; the protocol reaches the link controller and its fibers through the device driver's IOCTL and FASWRT interfaces and its interrupts.]

The pager substrate is implemented as a distributed virtual file system. This allows the pager to install its own page fault handler for each virtual memory segment that it uses. It defines a name space that spans all of the machines in a cluster. Each vnode in this file system corresponds to a uniquely named DSM virtual memory segment, or memory object. The pager augments the traditional vnode interface with a system call interface for manipulating these segments. Clients of the pager use the name of the memory segment as a memory object handle. They map a DSM segment into their address space by allocating an object with a given name. Sharing is achieved when multiple clients allocate an object with the same name. Once the segment has been allocated, clients may allocate its pages by using a page allocation interface that closely resembles the malloc interface. Once a page has been allocated by any client of an object, all other clients have access to the page. Page sharing between DSM objects is driven by page faults, generated by client access of the pages. In addition, system calls are provided to allow explicit movement of pages between objects.

In order to provide complete DSM functionality, the pager provides two synchronization mechanisms. Clients that need guaranteed continuous access to a set of pages may choose to pin those pages to the local machine. This acts much like the locking protocols supported by other DSM implementations [Li86] [Ra89] [Co92]. In addition, support is provided for event notification: a receiving client can be notified whenever a sending client moves pages into the segment. Notification is based on installable callback routines, which are normally defined during segment allocation. These routines are then activated on the occurrence of specified events. A sketch of both mechanisms follows.
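The following fragment sketches how a client might use these two mechanisms. The pin and unpin calls and the callback argument are illustrative assumptions patterned on the allocation fragment of Section 6.3; the paper does not show these particular calls.

    // Hypothetical use of pinning and event notification. The pin and
    // unpin calls and the callback argument are assumptions, patterned
    // on the allocation fragment shown in Section 6.3.
    extern int pager_object_alloc(long size, char *name, char **address,
                                  int *object_id, int flags,
                                  void (*callback)(int, long, int),
                                  long *callback_routine_id);
    extern int pager_pin_pages(char *name, int object_id,
                               long first_page, int num_pages);
    extern int pager_unpin_pages(char *name, int object_id,
                                 long first_page, int num_pages);

    // Activated by the pager when a sender moves pages into the segment
    void pages_arrived(int object_id, long offset, int num_pages)
    {
        // e.g., wake a thread waiting for data in this segment
    }

    void client_example(char *object_name)
    {
        char *address;
        int object_id, ret_code;
        long callback_id;

        // Callbacks are normally defined during segment allocation
        ret_code = pager_object_alloc(16 * 4096L, object_name, &address,
                                      &object_id, 0, pages_arrived,
                                      &callback_id);

        // Pin pages [0, 4) for guaranteed continuous local access
        ret_code = pager_pin_pages(object_name, object_id, 0, 4);
        // ... operate on the pinned pages, then release them ...
        ret_code = pager_unpin_pages(object_name, object_id, 0, 4);
    }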

A specialized message protocol handles all cooperation traffic between instances of the pager. This is a thin layer that implements an end-to-end protocol on top of a standard optical link device driver. It relies on the controller hardware to detect errors and failures and provides interrupt service routines to handle these conditions. It allows clients, such as the pager, to define other interrupt service routines which handle normal messages. When a message is received, the device driver transfers it from the hardware to memory and notifies the communication subsystem. The subsystem then checks the message type and activates the appropriate service routine. To transmit, a client calls a routine that places the message on a transmit queue and then activates the subsystem.
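To make this client interface concrete, a minimal sketch follows. The mp_install_handler and mp_enqueue_transmit names, the type code and the message layout are assumptions; the paper describes the operations but not their signatures.

    // Sketch of a message protocol client: install a handler for a
    // message type, then queue a message for transmission. The names,
    // type code and message layout are assumptions.
    struct message {
        int  type;               // dispatch key checked on receipt
        int  length;
        char payload[4096];
    };

    extern int mp_install_handler(int type,
                                  void (*handler)(struct message *));
    extern int mp_enqueue_transmit(struct message *msg);

    #define MSG_PAGE_REQUEST 7   // hypothetical message type code

    // Runs from the receive interrupt path: keep it short, non-blocking
    void page_request_handler(struct message *msg)
    {
        // locate the requested page and queue a reply ...
    }

    void send_page_request(struct message *req)
    {
        mp_install_handler(MSG_PAGE_REQUEST, page_request_handler);
        req->type = MSG_PAGE_REQUEST;
        mp_enqueue_transmit(req);   // enqueue, then activate the
                                    // subsystem to transmit
    }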

3: The Interconnection

The pager prototype was implemented on a three-machine cluster of IBM RISC System/6000 machines. The 500 series of the RS/6000 supports interconnection via high-bandwidth fiber optic channels. These machines feature an expansion slot on their CPU planar board which accommodates a serial optical channel converter (SOCC). The SOCC supports two fiber optic channels; each channel provides full-duplex point-to-point communication at approximately 220 Mbits per second with a maximum packet size of 60 KBytes. The resulting communications channel supports data transfer between workstations with bandwidth comparable to that of processor-to-memory transfers.

4: Pager Performance

The performance of the pager subsystem is influenced by three primary factors:
• The capabilities of the serial optical link hardware
• The behavior of the message protocol
• The pager's software overhead

This section explains how each of these factors was measured. A number of experiments were conducted to measure the communication performance of the pager software. These generally involved sending data from one machine to another and then sending the same amount of data back. The resulting transfer times were then plotted against message size and the following three parameters were computed: • Latency - The time it takes for the first bit of a short message to arrive. • Incremental Bandwidth - The fastest rate at which data can be transferred, essentially the incremental slope of the time vs message size curve. • Asymptotic Bandwidth - The average data transfer rate including "end of packet effects" which tend to slow things down.
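To make the three parameters concrete, the following sketch derives them from measured (size, round-trip time) pairs; the code, units and sample layout are ours, not the paper's.

    // Deriving the three parameters from (size, round-trip time)
    // samples under the linear model t(s) = 2 * (L + s/B). A minimal
    // sketch with an assumed sample layout; a least-squares fit over
    // the straight portion of the curve would be more robust.
    struct sample { double kbytes; double msec; };   // one message size

    void derive_params(struct sample *s, int n, double *latency_ms,
                       double *incremental_mbs, double *asymptotic_mbs)
    {
        // Latency: half the round-trip time of the smallest message
        *latency_ms = s[0].msec / 2.0;

        // Incremental bandwidth: slope between two large adjacent
        // samples; bits / microseconds gives Mbits/sec
        double dbits = 2.0 * (s[n-1].kbytes - s[n-2].kbytes) * 1024 * 8;
        double dt_us = (s[n-1].msec - s[n-2].msec) * 1000.0;
        *incremental_mbs = dbits / dt_us;

        // Asymptotic bandwidth: all bits moved (out and back) over the
        // total time, so end-of-packet effects are averaged in
        double bits = 2.0 * s[n-1].kbytes * 1024 * 8;
        *asymptotic_mbs = bits / (s[n-1].msec * 1000.0);
    }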

4.1: SOL Hardware Capabilities

Our first experiment was designed to determine the actual data carrying capacity of the serial optical link (SOL) hardware. We measured the round trip transfer time between two RS/6000 model 530s running AIX Version 3.2. Messages of various sizes were sent. The smallest consisted of just a 44-byte header; larger ones grew in increments of 4 KBytes up to 56 KBytes (just less than the 60 KByte maximum packet size). For each size, transfer times were averaged over 60 to 100 trials.

[Figure 2 - Hardware Round Trip Times. Round-trip time (ms) vs. message size (KBytes) for the SOL hardware.]

The Y intercept in Figure 2 shows that for very small packets, the round trip time is approximately 1.4 msec. Thus, the latency of the SOL device driver and hardware is half of this, or 0.7 msec. This is the minimal start-up cost for each transaction across the SOL. The small discontinuities, or jumps, in the graph are due to the optical link's maximum packet size of 60 KBytes. Sending more than the maximum packet size requires multiple invocations of the SOL device driver. The associated packet overhead averages 0.8 msec, although the first two invocations took only 0.4 msec. The slope of the straight portions of the graph implies an incremental bandwidth of 150 Mbits/sec in each direction; discounting the initial latency, the average slope gives an asymptotic bandwidth of 130 Mbits/sec. The graph is essentially linear over a wide range of transmission sizes (we ran as high as 560 KBytes), indicating that the hardware can sustain a high throughput rate.

4.2: Message Protocol Behavior

The next experiment assessed the performance of the message protocol. Again, a series of round-trip data transfers was used. This time, communication was between two user processes, not between kernel extensions; thus, a penalty is incurred each time the boundary between user space and kernel space is crossed. The message protocol was designed to accept requests from user processes so that its performance could be easily evaluated. For comparison, we ran a set of four tests. In addition to evaluating the message protocol over the serial optical link, we evaluated UDP over an Ethernet, UDP over the serial optical link and PVM [Su90] over an Ethernet. UDP provides services similar to the message protocol, and PVM offers some additional features. The transfer times for different sized messages are shown in Figure 3.

[Figure 3 - Protocol Round Trip Times. Round-trip time (ms) vs. message size (KBytes) for PVM on Ethernet, UDP sockets on Ethernet, UDP sockets on SOL and the message protocol (PMP) on SOL.]

The three communication parameters (latency, incremental bandwidth and asymptotic bandwidth) were calculated and are listed in Table I. It is clear from the table that the user-space to kernel-space transition is costly, particularly in terms of bandwidth. However, it is also clear that the special-purpose message protocol performs significantly better than UDP, which is a light-weight, general-purpose member of the TCP/IP communication suite. UDP over the serial optical link is better than over the Ethernet, but not by as much as the raw bandwidth would imply. Finally, the increasingly popular PVM system is further limited by its daemon-based implementation.

4.3: Pager Shared Memory Performance

The main purpose of the pager is to implement distributed shared memory using the well-known page fault mechanism [Fl89]. Pages are moved between machines in response to page faults. Since the pager is implemented as a virtual file system kernel extension, it can handle page faults efficiently. It communicates directly with the device driver and is notified of page faults by the AIX kernel. The thin software veneer that provides the paging services adds minimal overhead.

A good measure of DSM performance is page load time. This is the time between the generation of a page request and the availability of the required page. For a remote page, this has two important components: communication time and paging software overhead. The communication involves a request indicating which page is required and the response containing the required page; the paging software overhead involves locating the page, managing housekeeping information and providing other checks and services.

Table I - User-Level Latency & Bandwidth

Experiment          Latency    Incremental    Asymptotic
SOL Hardware        0.68 ms    159 Mbs        131 Mbs
Message Protocol    1.09 ms    64.9 Mbs       63.9 Mbs
UDP on SOL          1.59 ms    38.0 Mbs       37.7 Mbs
UDP on Ethernet     1.56 ms    8.69 Mbs       8.52 Mbs
PVM on Ethernet     5.00 ms    **             5.80 Mbs

Repeated measurements showed that, for the 4 KByte pages of AIX 3.2, the page load time is approximately 1.90 msec. Since previous experiments showed that the message protocol requires 1.57 msec to send a short message followed by a 4 KByte message, we conclude that the paging software overhead is 0.33 msec, or only 17% of the total page load time.

5: Pager Implementation

The pager prototype is implemented as an AIX distributed virtual file system, with a pager instance appearing on each machine in the cluster. The instances cooperate through a specialized message protocol to create a single common substrate view. This protocol is a separate component and exists independently of the pager. The pager virtual file system uses the vnode interface and the AIX memory management interface. It maintains and manipulates all information pertaining to the DSM segments and the pages contained therein.

5.1: Message Protocol

The interface exported by the message protocol allows clients, such as the pager, to transmit packets over the link and to install handlers for the receipt of specific message types. The protocol's interface is available to user-level clients and is the basis for the performance tests described in Section 4.2. However, it has been designed for trusted clients. Prior to discussing the protocol implementation, it is necessary to look at the link controller and serial link device driver interfaces.

The serial optical link controller [Ir90] contains buffers, a sequencing engine, an optical receiver and control registers. The controller transmits messages as a sequence of 256-byte frames with an appended 16-bit cyclic redundancy check. When message transmission or reception completes or an unexpected event happens, the controller fires an interrupt to inform the device driver. The device driver exports three kinds of interfaces:
• The IOCTL interface allows clients to perform controller-specific actions such as obtaining information about the processor, transmitting a message, etc.
• The FASWRT interface provides kernel-level clients a fast-path data transmission facility that goes directly to the controller.
• Interrupt handlers installed by device driver clients are used to notify the clients of hardware failures, data reception, status changes, etc.

[Figure 4 - Message Protocol Implementation. The protocol comprises a transmit queue manager, an installable message handler component, receive and status-change interrupt handlers, timer-based status checks, and initialization and termination components, sitting between the pager above and the device driver's IOCTL and FASWRT interfaces and interrupts below.]

The message protocol seen in Figure 4 provides two interfaces. One allows clients to install message handlers for various message types. A client defines a message type by using this interface to notify the message protocol of the callback routine to be associated with that type. When such a message arrives, the protocol activates this routine with a pointer to the message as a parameter. The other interface allows clients to place messages on the protocol's transmit queue. These interfaces are designed for minimum overhead by giving clients responsibility for any advanced services. The major components of the protocol implementation are:
• Receive Interrupt Handler - called by the device driver on the arrival of any message that uses this protocol. This component performs some initial checks and then calls the appropriate message handler for the received message type.
• Status Change Interrupt Handler - called by the device driver whenever a hardware error occurs or a change in hardware status is detected.
• Timer-Based Status Checks - periodically ensure that the status of the link is appropriate and that all neighbors can still be reached. This component also responds when a neighboring machine becomes reachable.
• Initialization and Termination Components - open the link, perform initial sanity checks, set up the reachability matrix, bring down the link methodically and perform all cleanup operations.
• Installable Message Handler Component - manipulates the data structures that maintain the information pertaining to the callbacks for various message types.
• Transmit Queue Manager - manipulates the queue, allowing clients to add messages, and then removing and transmitting them.
An installable error handler, which allows clients to deal with certain specific error situations, has been designed but not yet implemented.
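A minimal sketch of this installable-handler dispatch follows; the table layout and names are assumptions, since the paper specifies only the behavior.

    // Sketch of the installable-handler dispatch described above; the
    // table layout and names are assumptions.
    #define MAX_MSG_TYPES 64

    struct message;                          // defined by the protocol
    typedef void (*msg_handler_t)(struct message *);
    static msg_handler_t handler_table[MAX_MSG_TYPES];

    // Installable Message Handler component: record the callback
    int mp_install_handler(int type, msg_handler_t fn)
    {
        if (type < 0 || type >= MAX_MSG_TYPES)
            return -1;
        handler_table[type] = fn;
        return 0;
    }

    // Receive Interrupt Handler: called by the device driver on the
    // arrival of any protocol message; checks, then dispatches
    void mp_receive_interrupt(struct message *msg, int type)
    {
        if (type >= 0 && type < MAX_MSG_TYPES && handler_table[type])
            handler_table[type](msg);        // activate the callback
    }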

5.2: Pager Virtual File System

The pager virtual file system (VFS) [IB90] shown in Figure 5 exports a set of system calls that allow the manipulation of shared memory segments, the pages contained in those segments, movement of pages, synchronization and event notification. These calls may be used by either kernel-level or user-level clients, with only marginally different semantics.

[Figure 5 - Pager Implementation. Kernel calls enter from kernel level and system calls from user level; a system call handler, remote system call handler, page fault handler, master/slave resolver, memory manager, directory manager, timer-based services and pager-specific message handlers cooperate, with the message handlers connecting to the message protocol's transmit queue.]

This section describes the functionality of the major components of the pager VFS.

The directory manager maintains per-page information for all the segments that are handled by the VFS. This includes hints about a page's current location, if the page is not local; hints about its previous location, if it is local; and page-control bit status, such as whether the page is pinned. In essence, the directory manager maintains a complete map of all the pages that it handles.

The memory manager handles the allocation and deallocation of pages within each DSM segment. Free pages are maintained in a heap and allocated ones in a sparse hash table. After each allocation and deallocation, this component also ensures that fragmentation is minimized. It uses a best-fit algorithm to allocate pages within a segment.

The page fault handler forms the core of the pager VFS. All page faults generated on segments managed by this VFS are passed to this handler by the AIX kernel. If the page is not available locally, the handler finds its location from the directory manager and makes a PAGE-FAULT request. When the page is returned by the remote machine, the handler pins it locally for a short period and then completes the page fault process. The pinning avoids unnecessary thrashing [Fl89] [Ni91].
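The remote-fault path might look like the following sketch; the helper names stand in for the directory manager and message protocol interfaces described above and are assumptions.

    // Sketch of the remote page-fault path just described. All helper
    // names are assumptions for illustration.
    #define LOCAL_NODE 0

    extern int  dir_lookup_location(int object_id, long page_index);
    extern void send_page_fault_request(int node, int object_id,
                                        long page_index);
    extern void wait_for_page_arrival(int object_id, long page_index);
    extern void pin_page_briefly(int object_id, long page_index);

    int handle_page_fault(int object_id, long page_index)
    {
        int node = dir_lookup_location(object_id, page_index); // hint

        if (node == LOCAL_NODE)
            return 0;                 // page is here; kernel completes

        // Ask the remote pager instance for the page (PAGE-FAULT
        // request), then wait for the reply carrying the page
        send_page_fault_request(node, object_id, page_index);
        wait_for_page_arrival(object_id, page_index);

        // Pin briefly on arrival to avoid thrashing [Fl89] [Ni91]
        pin_page_briefly(object_id, page_index);
        return 0;
    }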

The master/slave resolver manages and controls the name space of the segments within the cluster. Whenever a memory object is allocated, this component uses a two-phase commit protocol to ensure that only one object with that name exists in the cluster. Thus, only one machine within the cluster holds the actual memory object pertaining to the DSM segment; the other machines maintain a proxy for this object.

The system call handler implements all the calls exported by the pager VFS. It determines whether a call should be forwarded to a remote machine or resolved locally, and calls the appropriate routine within the memory manager, the directory manager or the master/slave resolver to handle the system call. It also checks return codes and takes appropriate actions.

The remote system call handler implements all system calls that originate on foreign machines. It performs essentially the same function as the system call handler, except that error recovery and completion are done differently.

The pager-specific message handler implements all message handlers specific to this VFS and installs them with the message protocol. On receipt of a message from the message protocol layer, this component retrieves the appropriate information from the message and invokes the target component that handles the message.

The timer-based services handle the reliability features of the pager VFS. They ensure that system calls do not hang, that messages are resent, etc. In addition, they are used for housekeeping and cleaning up data structures.
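The system call handler's local-versus-remote decision described above might look like the following sketch; the call structure and routine names are assumptions.

    // Sketch of the system call handler's dispatch decision; the call
    // structure and routine names are assumptions.
    #define LOCAL_NODE 0

    struct pager_call { char *object_name; int opcode; /* args ... */ };

    extern int master_slave_resolve(char *object_name);
    extern int forward_syscall(int node, struct pager_call *call);
    extern int dispatch_local(struct pager_call *call);

    int pager_syscall_dispatch(struct pager_call *call)
    {
        int master = master_slave_resolve(call->object_name);

        if (master != LOCAL_NODE)             // remote system call
            return forward_syscall(master, call);

        // Local: route to the memory manager, directory manager or
        // master/slave resolver, then check the return code
        return dispatch_local(call);
    }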

6: An Operating System Substrate

The pager uses an object-based definition of shared memory. Clients allocate a memory object consisting of multiple pages and can then allocate individual pages within that object. These pages can be moved between machines, forming the basis for other cooperation mechanisms. This section describes the memory object services and the kinds of cooperation mechanisms that can be built on top of them.

6.1: DSM Substrate - Architecture

The pager represents regions of memory as memory objects with several associated resources. The pager guarantees that the name of each memory object is unique within the cluster. Each object has a set of pages whose maximum number is set at allocation. Finally, each has one or more associated callback routines to handle event notifications.

The pager interface exports five kinds of location-independent, re-entrant operations:
• Object allocation and deallocation - An allocation call specifies an object name and maximum size, and returns a handle to either an existing object with that name or a newly created one. The initial allocation call causes the pager to globally define a memory segment of the maximum size and initialize a descriptor; every call causes it to map the segment into the client's address space and return a valid segment descriptor. Thus, multiply allocated objects become shared in the sense of classic DSM.
• Page allocation and deallocation - When a client adds pages to a memory object, a pointer to the initial offset within the new pages is returned. The pager ensures that all page allocation calls to the same object are serialized. If a client refers to a location someone else has allocated, a page fault causes the page to be mapped into the client's address space. If the page has never been allocated, an exception is generated. This limited protection is acceptable since the pager is a substrate for trusted services.
• Page movement between objects - A client can name source and destination objects and ask that a set of pages be copied or moved. This is useful, for example, in implementing 4.3 BSD sockets: sending a packet between two processes involves just moving a set of pages between the source pager object and the destination pager object (a sketch appears after this list).
• Callback routine installation and manipulation - Events, such as the arrival of a page, may require that a memory object notify a client. For example, a process may be waiting on the socket mentioned above and must be awakened when the pages arrive.
• Control flag manipulation - Sometimes the normal, or default, behavior of a memory object may not allow a client to use it successfully. For example, the page-fault mechanism for moving pages can lead to thrashing and starvation. Therefore, a client can control the behavior of a memory object through control flags.
As we have noted, the pager is a flexible distributed shared memory substrate on which a number of cooperation tools can be built.
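As promised in the page movement item above, a sketch of socket-style packet delivery follows; pager_move_pages, its flag and its signature are assumptions, since the paper describes the operation but shows no call.

    // Packet delivery via page movement between objects. The call
    // name, flag and signature are assumptions.
    #define PAGER_MOVE 1   // hypothetical flag: move rather than copy

    extern int pager_move_pages(int src_object_id, long src_offset,
                                int dst_object_id, long dst_offset,
                                int num_pages, int flags);

    int socket_send(int src_object_id, long src_offset,
                    int dst_object_id, long dst_offset, int num_pages)
    {
        // The receiver's callback fires on arrival, waking any process
        // blocked on the socket
        return pager_move_pages(src_object_id, src_offset,
                                dst_object_id, dst_offset,
                                num_pages, PAGER_MOVE);
    }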

6.2: Cooperation Tools

A significant community of parallel application developers favors the UNIX socket interface, others feel comfortable with the Linda tuple space, and still others prefer PVM messaging. The pager's software architecture allows it to easily support each of these cooperation tools, as well as shared memory. The pager's low overhead ensures good performance for these higher-level tools. At the same time, using the pager as a building block makes tool implementation considerably simpler, significantly reducing development and maintenance costs. The range of tools which can be implemented on the pager includes:
• BSD Sockets - Sockets are a classic communication mechanism which provides a reliable byte stream between unrelated processes. A socket implementation based on the messaging protocol of Section 5.1 can provide good performance to clients.
• Linda - Linda [Ah88] is a language-level view of a shared data space. This data space, known as the tuple space, maps directly to a memory object. Additional metadata is maintained on a per-tuple basis to hold name and type information. Tuples are put into or removed from the space by allocating and deallocating the required pages (a sketch appears after this list).
• File Systems - Many UNIX-based file systems optimize performance by mapping active files into virtual memory. Extending this by mapping the files into a memory object automatically provides cluster-wide sharing of open files. This suggests that it may also be possible to share devices throughout the cluster: if devices, such as disks, are mapped to the substrate, they will be globally visible.
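As an illustration of the Linda item above, a Linda-style out() built on the pager's page allocation call might look like the following sketch; pages_for and write_tuple are assumed helpers.

    // A Linda-style out() on the pager: the tuple space is a memory
    // object; out() allocates pages and writes the tuple plus its
    // per-tuple metadata. pages_for and write_tuple are assumptions.
    extern int  pager_alloc_page(char *name, int object_id,
                                 int num_pages, long *offset);
    extern int  pages_for(int bytes);
    extern void write_tuple(int object_id, long offset,
                            void *tuple, int bytes);

    int linda_out(char *ts_name, int ts_object_id,
                  void *tuple, int tuple_bytes)
    {
        long offset;
        int ret_code = pager_alloc_page(ts_name, ts_object_id,
                                        pages_for(tuple_bytes), &offset);
        if (ret_code != 0)
            return ret_code;

        write_tuple(ts_object_id, offset, tuple, tuple_bytes);
        return 0;
    }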

6.3: Ease of Use

As discussed in previous sections, the pager is designed as a very thin and efficient software veneer in order to achieve good performance within existing operating systems. Since the pager was designed both for direct use and as a substrate for operating system services, one might assume that it would have a complex interface. However, special care has been taken to ensure that the interface is easy to use. Its few calls have well-defined semantics, as illustrated in the following small code segment:

    // A code fragment illustrating use of pager
    // Allocate obj, get obj & callback ids
    // Callback optional, allows greater page movement control
    ret_code = pager_object_alloc(size, object_name, &address,
                                  &object_id, flags, &callback,
                                  &callback_routine_id);
    ..
    // Allocate a few pages
    ret_code = pager_alloc_page(object_name, object_id, num_pages,
                                &offset_within_object);
    ..
    // Use pages, then de-allocate them
    ret_code = pager_free_page(object_id, object_name, offset);

    // Deallocate object and then we are done
    ret_code = pager_object_dealloc(object_name, object_id, address,
                                    callback_routine_id);

These calls are further simplified by their resemblance to the standard malloc interface.
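For example, a malloc-style veneer over the page allocation call might look like this sketch; dsm_malloc and object_base are assumed names.

    // A malloc-style veneer over pager_alloc_page; names other than
    // pager_alloc_page are assumptions.
    #define DSM_PAGE_SIZE 4096   // AIX 3.2 page size (Section 4.3)

    extern int   pager_alloc_page(char *name, int object_id,
                                  int num_pages, long *offset);
    extern char *object_base(int object_id);  // mapped segment base

    void *dsm_malloc(char *object_name, int object_id, long bytes)
    {
        long offset;
        int num_pages = (int)((bytes + DSM_PAGE_SIZE - 1) / DSM_PAGE_SIZE);

        if (pager_alloc_page(object_name, object_id,
                             num_pages, &offset) != 0)
            return 0;

        // Translate the object-relative offset to a client address
        return object_base(object_id) + offset;
    }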

7: Discussion

At this point, the pager is still in the prototype stage. Our early experiments indicate that it is capable of providing the connectivity needed to join workstations into a valuable distributed computing structure. The pager design is also a valuable tool for assessing operating systems and cooperation structures.

7.1: The Value of Distributed Shared Memory

Distributed shared memory has long been recognized as a convenient cooperation tool. However, performance considerations have limited its usefulness. Our prototype pager has shown that it is possible to reduce page load times by a factor of five relative to previous work [An90]. By reducing software overhead to only 17% of communication time, we have shown that DSM can be almost as efficient as raw message passing. Future experiments will investigate ways to further reduce this overhead. We will exploit locality of reference and transfer multiple pages whenever we encounter a page fault. Also, we will evaluate pre-paging when the communication channel is idle. We are in the process of evaluating the impact of DSM on large, compute-intensive applications. Early results are promising, particularly for asynchronous algorithms and those which exploit the pager's facility to pin pages.

7.2: The Substrate Structure

Perhaps the most important contribution of the pager work is the substrate approach to building cooperation mechanisms in clustered environments. Hardware advances, especially in communication links, are exploited through the high-performance message protocol, which has been designed especially for the pager prototype. Also, the architecture of the pager itself allows interesting opportunities to dynamically extend the very nature of the substrate.

DSM implementations have traditionally been geared towards application-level clients. By contrast, the pager's clients are operating system services. This critically affects the design, implementation and performance of the prototype. With system-level clients, the pager need not maintain any per-process information or perform general and extensive error checking. This keeps the implementation small and reduces the DSM overhead to a minimum. Clients maintain their own per-process information and perform specific error checking to ensure security. This allows the prototype to be built far closer to low-level kernel interfaces, such as that of the device driver, and avoids expensive context change operations such as those encountered by daemon-based implementations of cooperation protocols.

Since optical communication links have far greater bandwidth and much better performance robustness under increasing traffic than LANs, a page transfer operation is relatively inexpensive. In addition, their error rates are remarkably low, so recovery can be left to the pager's clients. Thus, the message protocol achieves high performance by offering basic services and letting higher-level entities provide more specialized guarantees.

The initial pager prototype has only been tested on a small workstation cluster. By limiting the number of machines, we have limited communication latencies. In our experiments, all machines are within one hop of each other. If a message packet has to be routed through switches, latency will increase, and if intermediary hosts are used, the latencies become disastrous. Future work will consider a hierarchical structure, or a cluster of clusters, to mitigate the latency problem.

References

[Ah88] Ahuja, S., et al., Linda and Friends, IEEE Computer, Vol. 19, No. 8, August 1986, pp. 26-34.
[An92] Ananthanarayanan, R., et al., Application Specific Coherence Control for High Performance Distributed Shared Memory, Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, 1992.
[An90] Ananthanarayanan, R., et al., On the Integration of Distributed Shared Memory with Virtual Memory Management, Technical Report GIT-CC-90/40, Georgia Institute of Technology, 1990.
[Ba89] Bal, H., The Shared Data-Object Model as a Paradigm for Programming Distributed Systems, Ph.D. Dissertation, Vrije Universiteit, 1989.
[Bl92] Blount, M., DSVM6K: Distributed Shared Virtual Memory on the Risc System/6000, IBM Research Draft, 1992.
[Ca91] Carter, J., et al., Implementation and Performance of Munin, Proceedings of the 13th ACM Symposium on Operating Systems Principles, ACM, 1991, pp. 152-164.
[Co92] Cohn, D., et al., A Universal Distributed Programming Paradigm for Multiple Operating Systems, Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, USENIX, 1992.
[Fl89] Fleisch, B. and Popek, G., Mirage: A Coherent Distributed Shared Memory Design, Proceedings of the 12th ACM Symposium on Operating Systems Principles, December 1989, pp. 211-223.
[Fo88] Forin, A., Barrera, J., Young, M. and Rashid, R., Design, Implementation, and Performance Evaluation of a Distributed Shared Memory Server for Mach, Technical Report CMU-CS-88-165, Carnegie-Mellon University, August 1988.
[IB90] AIX Kernel Extensions and Device Support Programming Concepts for IBM RISC System/6000, IBM, Part No. SC23-2207-00, 1990.
[Ir90] Irwin, J. and Mathis, J., Serial I/O Architecture and Implementation, RISC System/6000 Technology, IBM, Part No. SA23-2619-00, 1990.
[Li86] Li, K., Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Dissertation, Yale University, YALEU/DCS/RR-492, September 1986.
[Ni91] Nitzberg, B. and Lo, V., Distributed Shared Memory: A Survey of Issues and Algorithms, IEEE Computer, Vol. 24, No. 8, August 1991, pp. 52-60.
[Ra89] Ramachandran, U., Ahamad, M. and Khalidi, M., Coherence of Distributed Shared Memory: Unifying Synchronization and Data Transfer, Proceedings of the 1989 International Conference on Parallel Processing, Volume II, August 1989, pp. 160-169.
[Sc92] Scott, M. and Garrett, W., Shared Memory Ought to be Commonplace, The Third Workshop on Workstation Operating Systems (WWOS-III), IEEE, 1992.
[Su90] Sunderam, V., PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice and Experience, Vol. 2, No. 4, 1990.