Global Management of Coherent Shared Memory on an SCI Cluster
Povl T. Koch, Emmanuel Cecchet, Xavier Rousset de Pina
P. Koch, E. Cecchet, and X. Rousset de Pina are with the SIRAC Laboratory, INRIA Rhône-Alpes, ZIRST - 655, avenue de l'Europe, 38330 Montbonnot Saint-Martin (France). E-mail: [email protected]. P. Koch is also associated with the University of Copenhagen (Denmark).
Abstract: The I/O-based implementations of the SCI standard allow cost-efficient use of shared memory on a wide range of cluster architectures. These implementations have typically been used for message-passing interfaces, but we are exploiting I/O-based SCI as a way to create NUMA architectures from commodity components. A major issue is that data placement, and especially data consistency, becomes the responsibility of the programmer. We propose an interface, called SciOS, that provides a coherent shared memory abstraction in which all the physical memory in the cluster is managed as in a NUMA architecture. We use a relaxed memory consistency model, dynamic page migration, page replication, and remote memory instead of swap to disk to lower the average memory access times. We describe our prototype based on Dolphin's PCI-SCI adapter, with focus on the global management of the shared memory.
Keywords: Coherent distributed shared memory, global memory management, IEEE Scalable Coherent Interface (SCI), NUMA architectures.
I. Introduction
Currently, the most successful parallel architectures in the commercial marketplace are the SMP and CC-NUMA multiprocessors. SMPs are inexpensive and easy to program, but they typically have inherent scalability problems because they are based on a shared bus that connects the processors to the memory system. CC-NUMA architectures have more scalable memory interconnects, but while accesses to local memory are inexpensive, accesses to remote memory are 2-3 times more expensive than local accesses. This can hurt performance for applications with dynamic sharing patterns because it can be hard for an application programmer to optimally place shared data in the "right" physical memory. Both the SMP and CC-NUMA architectures have longer time-to-market than uniprocessors because the memory interconnect has to be redesigned each time new processor architectures are introduced or faster processors need higher memory bandwidths. This has led researchers to propose distributed shared memory implemented in software on networks of workstations (NOW), thereby making it possible to construct parallel architectures from the latest and fastest off-the-shelf processors and networks. Many NOW architectures use traditional network interfaces, e.g., ATM and Myrinet, to transfer shared data between local memories. Although they can provide high bandwidths, they suffer from high software overheads and high message latencies.
This limits the use of a NOW as a parallel architecture because only problems with coarser-grained parallelism can be solved efficiently. With the advent of I/O-based SCI, e.g., Dolphin's PCI-SCI adapter [6] and similar interfaces [8], [18] with very low latencies, it is now possible to solve much finer-grained problems using a shared memory abstraction. With the appropriate techniques, we believe that a NOW with an I/O-based SCI interface, an SCI cluster, can perform almost as well as an SMP or CC-NUMA for a wide range of parallel and distributed applications. Our approach is to use an I/O-based SCI interface as a means to build a non-coherent NUMA architecture from a network of workstations. To avoid the hardware limitations of such an architecture (especially the lack of cache coherency, the reordering of writes, and the fixed placement of pinned physical memory), we propose to provide a coherent shared memory system through a tight integration with the operating system's virtual memory mechanisms. The main contribution of our approach is a combination of distributed virtual memory algorithms that allows an efficient use of the global physical memory and minimizes the overall cost of accesses to shared data.
The remainder of this paper is organized as follows: Section II describes some approaches to distributed memory management and Section III describes our approach to global memory management on an SCI cluster in detail. Section IV describes the SciOS prototype and discusses some implementation issues. We present some very early performance results in Section V. Section VI describes work related to ours. Finally, we summarize and discuss some of the issues in Section VII.
II. Distributed Virtual Memory Management
On a traditional network of workstations, a shared memory abstraction can be implemented using the virtual memory system's protection mechanisms. This is known as Shared Virtual Memory (SVM). The IVY system [17] provides demand-replication of shared pages. The system provides a sequentially consistent memory which is implemented using a single-writer/multiple-reader protocol. Replicated pages are protected so that a write access generates a page fault; all replicas can then be invalidated. To avoid the ping-pong effects of falsely shared pages, later systems have implemented consistency protocols that allow multiple writers to a page [2], [13]. Before a node is allowed to write to a replicated page, a twin of the page is first made. Later, the modifications to the page are encoded in a diff which can be distributed and applied to other replicas of the page at certain synchronization points.
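As an illustration of the twin/diff technique used by these multiple-writer protocols, the following is a minimal sketch (not taken from any of the cited systems; the page size and the byte-granular encoding are assumptions): a twin is a private copy of the page taken before the first local write, and a diff is the list of (offset, new value) pairs obtained by comparing the dirty page with its twin.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumption: 4 KB pages as on x86 */

/* One diff entry: which byte changed and its new value. */
struct diff_entry { uint16_t offset; uint8_t value; };

/* Take a twin (a private copy) of a page before allowing local writes. */
uint8_t *make_twin(const uint8_t *page)
{
    uint8_t *twin = malloc(PAGE_SIZE);
    if (twin)
        memcpy(twin, page, PAGE_SIZE);
    return twin;
}

/* Encode the modifications as a diff by comparing the page with its twin.
 * Returns the number of entries written to 'out' (at most PAGE_SIZE). */
size_t make_diff(const uint8_t *page, const uint8_t *twin,
                 struct diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (page[i] != twin[i])
            out[n++] = (struct diff_entry){ (uint16_t)i, page[i] };
    return n;
}

/* Apply a diff to another replica of the page at a synchronization point. */
void apply_diff(uint8_t *replica, const struct diff_entry *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        replica[d[i].offset] = d[i].value;
}
```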
Although NUMA architectures provide shared memory in hardware, dynamic page migration and replication based on virtual memory pages have been shown to lower the average memory access time and reduce the load on the interconnect. To avoid ping-pong effects, the PLATINUM system [4] introduced a mechanism that freezes frequently migrating pages. The pages are later defrosted based either on the time since they were frozen or on whether there are more remote than local accesses to the page [15]. The migration and replication techniques have been shown to benefit even modern CC-NUMA architectures [1], [22] where the ratio between the cost of local and remote accesses is as low as 2-3.
With modern network technologies, access to a block cached in a remote memory can be many times faster than access to a disk. On a 155 Mbit/s ATM network, Feeley et al. [7] report speedups of up to 7, depending on the disk access pattern. Thus, mechanisms such as file system caching and virtual memory swapping can gain substantially from using remote memory instead of disk. This has been exploited in several systems. With the remote memory server model, the large memories of dedicated nodes are used as swap space to lower the load of workstations with heavy paging activity [3]. With the N-Chance Forwarding algorithm [5], remote memory improves distributed file system performance through coordination of the contents of client caches. When a node runs low on physical memory, and after all replica blocks have been removed, the locally oldest block is forwarded to a randomly picked node. In the Global Memory Service (GMS) system [7], the choice of a remote node for page replacement is based on approximate knowledge about the state and age of all physical pages in the cluster (a "global LRU").
We are experimenting with combinations of existing distributed memory management algorithms for use on an SCI cluster. The hardware and performance characteristics of an I/O-based SCI adapter make the design space different from the above systems. Compared with the SVM systems, which use traditional networks, the load/store interface of an I/O-based SCI adapter allows us to directly access a remote page (as on NUMA architectures) without first having to copy the whole page. But remote memory accesses on an SCI cluster cost 2-5 microseconds, which is more expensive than on a NUMA architecture. This favors even more aggressive replication. The advantage of using remote memory as a cache instead of accessing the disk is much bigger for an SCI cluster than on a traditional network. E.g., on our SCI cluster, a page can be fetched from a remote memory more than 50 times faster than from a local disk.
III. The SciOS Approach
With SciOS, we use a network of workstations connected by I/O-based SCI interfaces. The current operating system support on such architectures is based on fixed placement of pinned physical memory. Remote nodes can map this memory through the I/O bus, and application programmers have to deal with hardware limitations such as the lack of cache coherency, the reordering of writes, and buffering in the SCI adapter. To relieve the programmers of these burdens,
SciOS provides a coherent shared memory system which is based on a global protocol for managing the shared memory: all the individual physical memories are treated as one big memory to be shared in the most efficient way. Initially, pages are allocated on a first-touch basis. According to the sharing pattern, pages are then either migrated or replicated. A freeze/defrost mechanism reduces the ping-pong effect for pages that are falsely shared or frequently written on different nodes. When nodes run low on physical memory, remote memory is used to cache pages. Only the globally oldest pages are placed on disk, and only when all the physical memories are full.
The main abstraction in SciOS is a memory mapped file which is created and deleted independently of the processes. A process can open a file, map it into its virtual address space, and use normal load and store instructions on the mappings, thereby reading and modifying the file's contents. Multiple processes, possibly on different nodes, can map the same file and thus share data. The programming model is based on release consistency [9], [16], which allows many low-level optimizations such as the delaying and reordering of writes. Only at synchronization points, known to SciOS, is the shared memory guaranteed to be coherent. Programmers need to write data-race-free programs using the predefined synchronization primitives, i.e., the locks and barriers provided by SciOS.
We envisage SciOS being tightly integrated with the operating system's virtual memory management. Below, we describe three components that implement SciOS' global memory management protocol:
1. A page fault handler is called on access to an unmapped page or when writing to a replicated page (marked read-only). The page fault handler makes decisions about freezing, migration, and replication of pages.
2. A daemon on each node collects information about how the nodes access the shared pages (used by the swap handler) and updates global data structures about idle memory and the age distribution of pages. The daemon also defrosts pages that have been frozen by the page fault handler.
3. A swap handler decides where to place old pages that need to be swapped out and where to find pages that later need to be swapped in.
We now describe each of the components in turn.
A. The Page Fault Handler
A physical page is allocated on the very first page fault to a page (first touch). During subsequent page faults to a page not currently mapped on a node, three possibilities exist: establish a remote mapping, migrate the page, or replicate the page to allow local accesses. When the node is low on idle memory, there are even more possibilities. The decision tree for a page fault is shown in Fig. 1 (a code sketch of this logic follows the figure).
First, we treat the cases where there is sufficient memory to allocate a new page. A page that is already allocated in the local physical memory, e.g., by another process, is mapped directly. If the page is placed on disk, either locally or remotely, we simply allocate a new physical page and read the page from disk.
Fig. 1. Decision tree for handling page faults in SciOS. The tree branches on whether the page is frozen, the time since the last invalidation or last write, whether the page is on disk, the access type (read or write), and the amount of locally idle memory; the outcomes are a remote mapping, a disk swap in, a migration, an exchange, or a replication, possibly preceded by a swap out.
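The following is a hedged sketch of the decision logic summarized in Fig. 1 and detailed in the rest of this subsection. All helper names are hypothetical stand-ins for SciOS-internal operations (the real handler is integrated with Linux's fault path and with the swap handler of Section III-C), and the treatment of a read access to a recently written page as a migration is our reading of Fig. 1.

```c
/* Hedged sketch of the Fig. 1 decision logic; the helpers below are
 * hypothetical prototypes standing in for SciOS-internal operations. */
struct page_meta;                         /* per-page global state (Section IV-C) */
enum access { READ_ACCESS, WRITE_ACCESS };

int  is_local(struct page_meta *pg, int node);
int  is_on_disk(struct page_meta *pg);
int  is_frozen(struct page_meta *pg);
int  recently_invalidated(struct page_meta *pg);
int  recently_written(struct page_meta *pg);
int  has_idle_memory(int node);
void map_local(struct page_meta *pg, int node);
void map_remote(struct page_meta *pg, int node);
void freeze(struct page_meta *pg);
void migrate(struct page_meta *pg, int node);
void replicate(struct page_meta *pg, int node);
void exchange_with_oldest(struct page_meta *pg, int node);
void swap_out_oldest(int node);
void swap_in_from_disk(struct page_meta *pg, int node);

void scios_page_fault(struct page_meta *pg, int node, enum access acc)
{
    if (is_local(pg, node)) {             /* already in local physical memory */
        map_local(pg, node);
        return;
    }
    if (is_on_disk(pg)) {                 /* on disk, locally or remotely */
        if (!has_idle_memory(node))
            swap_out_oldest(node);        /* Section III-C decides where */
        swap_in_from_disk(pg, node);
        return;
    }
    /* The page is valid in a remote memory. */
    if (is_frozen(pg) || recently_invalidated(pg)) {
        freeze(pg);                       /* freeze (or keep frozen) and map remotely */
        map_remote(pg, node);
        return;
    }
    if (acc == WRITE_ACCESS || recently_written(pg)) {
        /* Migrate the page; with little idle memory this becomes an
         * exchange with the locally oldest page (Case 1, Section III-C). */
        if (has_idle_memory(node))
            migrate(pg, node);
        else
            exchange_with_oldest(pg, node);
    } else {
        /* Read access to a page not written recently: replicate so that
         * subsequent reads are local. */
        if (!has_idle_memory(node))
            swap_out_oldest(node);
        replicate(pg, node);
    }
}
```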
If the page is in a remote memory, we first check whether the page is or should be frozen. This decision is based on when the page was last invalidated (either by a migration or by a write to a replicated page). A page that has already been frozen is simply mapped remotely. If the page has been invalidated "recently", the page fault handler freezes the page and establishes a remote mapping. A page that is not frozen can be either migrated or replicated. On a write access, the page is always migrated, and any remote mappings are invalidated so that correct mappings can be established. On a read access, and if the page has not been written recently, the page can be replicated instead. For a replicated page, the master copy and all the replicas are marked read-only to the virtual memory system. On a write access to a replicated page, a page fault is generated and all other replicas must be discarded. Then, the page is marked read-write and the process can continue the write. With release consistency, the discarding of replicas can be done lazily, delayed until the next time a synchronization primitive on the remote node is acquired.
In the following cases, there might not be enough local physical memory on a node to allocate a new page: swapping in a page from disk, a migration of a page towards the local node, and making a replica of a page. In these cases, the swap handler is invoked (see Section III-C).
B. The Daemon
A daemon on each node periodically collects hardware information about how each physical page is used. We use this information to determine the local age of each physical page along with a classification of each page. We classify all physical pages in the following way. The idle pages can be allocated for local use or allocated by remote nodes for swapped-out pages. The mapped pages are mapped locally and possibly remotely. These pages are either private or shared and thus controlled by the replication/migration mechanisms. Pages that are marked replica are mapped locally and marked read-only (the master copy is marked as mapped). A frozen page cannot migrate and is most
likely mapped both locally and remotely. The unmapped pages are guaranteed not to be mapped locally, but they may be mapped remotely. If a page is unmapped, it is always considered among the oldest pages on the local node. Pages marked mapped, replica, and frozen age differently. We let replica and frozen pages age more quickly than mapped pages. A replica page can easily be discarded by the swap handler, and if a frozen page is not accessed locally, it should be defrosted and migrated to a node that accesses it.
After the new age information has been computed for each physical page, a summary of the local memory's state is communicated to all the other nodes. The summary consists of (1) the number of idle pages, (2) the number of unmapped pages, and (3) the distribution of the page ages for all the other pages. Based on all the summaries, a node can calculate a weight for each node. The weight is used by the swap handler when deciding which node to select for a page replacement (a sketch of this selection is given at the end of Section III-C). Some systems try to limit the number of times the summary information is exchanged because their solution requires a synchronization of all nodes [7]. On our target architecture, an SCI cluster, the summary information can efficiently be distributed to all the other nodes each time the daemon runs (multiple times per second).1 Also, we do not find it necessary to globally synchronize the daemons. The weights for each node in SciOS are thus calculated from the last known summary information. Since the swap-out decisions for selecting a remote node are probabilistic, this should be more than sufficient.
The pages frozen by a page fault handler are later defrosted by the daemon. We use a simple timeout algorithm for unfreezing based on the elapsed time since the page was frozen [4]. As mentioned above, a page can also be defrosted by the swap handler if the amount of idle memory is running low.
C. The Swap Handler
The swap handler is called whenever a physical page needs to be freed (swap out) and whenever a swapped-out page needs to be retrieved (swap in).2 The decision about where to place a page that needs to be swapped out depends on the type of page fault event that caused the swap out:
Case 1: To make room for a page migrating from node Q, a node P needs to swap out its locally oldest page. This can be optimized by an exchange operation. Because a migration means that node Q frees a page, node P can simply exchange its locally oldest page with the migrating page from node Q.
Case 2: A node P needs to free a page because it is fetching a page from disk or replicating a remote page.
1 On larger-scale SCI clusters, a hierarchical collection and distribution of summary information is needed.
2 We only describe a swap-in handler for pages on disk. A page swapped out to a remote memory is still marked as valid. On a subsequent access to the page, the normal migration mechanism in the page fault handler is used to migrate the page to the faulting node.
Fig. 2. Example: Node P needs to make room to replicate a page on node Q. First, node R is asked to put the globally oldest page to disk. Then, P moves its locally oldest page to R. Finally, P can replicate the page.
If the locally oldest page on node P is not a replica or a frozen page, we ask the node with the globally oldest page, say node R, to swap out a page. When R has finished, we swap out the locally oldest page from node P to node R and then fetch the needed page from disk or replicate the page.
In Case 2, the node with the globally oldest page is determined using a random number and an algorithm that ensures that a node is chosen with a probability corresponding to the weight calculated by the daemon. This approximates a global LRU [7]. When a node swaps out its locally oldest page, it can discard replica pages. Pages that are mapped but have replicas can also be discarded once a replica page has been made the master copy. Pages that are marked frozen are migrated to a node that maps the page remotely. All other pages are placed on disk. Fig. 2 shows a worst-case example based on Case 2 (frequently, a node can instead discard a replica or migrate a frozen page).
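The exact weight formula is not given in the paper; as a hedged sketch of the probabilistic selection just described, assume each node's summary yields a non-negative weight (e.g., derived from its counts of idle and old pages) and pick a target node with probability proportional to that weight:

```c
#include <stddef.h>
#include <stdlib.h>

/* Pick a node index with probability proportional to weight[i].
 * Returns -1 if all weights are zero. The weights are assumed to be
 * derived from the per-node summaries distributed by the daemons. */
int pick_target_node(const double *weight, size_t nodes)
{
    double total = 0.0;
    for (size_t i = 0; i < nodes; i++)
        total += weight[i];
    if (total <= 0.0)
        return -1;

    /* Draw a uniform random number in [0, total) and return the node
     * whose cumulative weight interval contains it. */
    double r = ((double)rand() / ((double)RAND_MAX + 1.0)) * total;
    double acc = 0.0;
    for (size_t i = 0; i < nodes; i++) {
        acc += weight[i];
        if (r < acc)
            return (int)i;
    }
    return (int)(nodes - 1);   /* guard against rounding */
}
```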
IV. The SciOS Prototype Implementation
We are currently implementing the SciOS prototype on a small cluster of Intel uniprocessor nodes running the Linux operating system, version 2.0. The nodes each have a 200 MB/s PCI-SCI adapter. Below we briefly describe the PCI-SCI adapter and how we implement SciOS using a virtual file system. Finally, we describe the main data structures that we are using and some implementation issues.
A. Dolphin's PCI-SCI Adapter
Dolphin's PCI-SCI adapter is based on an IEEE standard called the Scalable Coherent Interface (SCI) [12], which specifies the implementation of a coherent shared memory abstraction and a low-level message passing interface. SCI specifies a large, shared 64-bit address space. The high-order 16 bits of an SCI address designate a node and the remaining 48 bits allow addressing of physical memory within each node. The PCI-SCI adapter [6] provides a number of mechanisms to map remote memory, to send and receive low-level messages, to generate remote interrupts, and to provide atomic operations on shared memory.
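To make the address layout concrete, a 64-bit SCI address can be composed from a node ID and a within-node physical offset as follows. This is only a minimal sketch based on the 16/48-bit split stated above, not code from the SCI standard or the driver:

```c
#include <stdint.h>

/* Compose a 64-bit SCI address from a 16-bit node ID and a 48-bit
 * physical offset within that node, following the split described above. */
static inline uint64_t sci_address(uint16_t node_id, uint64_t offset)
{
    return ((uint64_t)node_id << 48) | (offset & 0xFFFFFFFFFFFFULL);
}

static inline uint16_t sci_node(uint64_t addr)   { return (uint16_t)(addr >> 48); }
static inline uint64_t sci_offset(uint64_t addr) { return addr & 0xFFFFFFFFFFFFULL; }
```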
The PCI-SCI adapter does not implement the cache consistency mechanisms of the SCI standard. Each application therefore has to deal with issues such as when to flush or invalidate processor caches, buffers in the PCI-SCI adapter, etc.
To share physical memory between nodes, a process can allocate local physical memory. The physical memory is accessed locally through a normal virtual-to-physical memory mapping. For a process on a remote node to access the shared memory, the process must set up two mappings using the operating system. (1) The remote memory is mapped into the address space of the local PCI bus. This is done by updating an address translation table (ATT) in the PCI-SCI adapter with the node ID and physical address of the remote memory. (2) The operating system then maps the specific address range of the PCI bus into the process's virtual address space through manipulation of its page tables. This mapping also enforces protection since the process can only access mapped memory and only in the right mode (read/write/execute). The PCI-SCI adapter provides special atomic fetch-and-increment mappings which allow the implementation of synchronization primitives.3 Currently, we only use the load/store interface of the PCI-SCI adapter, and for remotely mapped memory, the processor's cache is disabled.
B. The SciOS File System
Dolphin's PCI-SCI driver runs as a Linux kernel extension, called a module, which can be dynamically loaded and unloaded. SciOS is being implemented as another kernel module that uses only a few basic functions in the driver module. For SciOS's shared memory, we use the normal UNIX file abstraction, which can be implemented using Linux's Virtual File System (VFS).4 After the SciOS module has been loaded and a directory mounted, an application can use the standard UNIX system calls open, ioctl, mmap, munmap, and close on the SciOS files. The Linux kernel directs these system calls, along with other kernel events such as page faults, swap in, and swap out, for all SciOS files to our SciOS module. This way, we can implement SciOS at the kernel level with full control of the shared memory and the associated mappings and physical memory. The structure is shown in Fig. 3.
3 To the user, SciOS provides a simple ticket-based lock primitive and barriers with a variable number of participants.
4 A virtual file system facility can be found in many modern operating systems, e.g., most UNIX variants and Windows NT.
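The exact mount point and ioctl commands are not given in the paper; the following user-level sketch assumes only the standard calls listed above, plus a hypothetical SCIOS_LOCK/SCIOS_UNLOCK ioctl pair standing in for the ticket-lock primitive of footnote 3:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SCIOS_LOCK   0x1001   /* hypothetical ioctl commands, for illustration */
#define SCIOS_UNLOCK 0x1002

int main(void)
{
    /* Open a shared SciOS file under an assumed mount point and map 1 MB. */
    int fd = open("/scios/shared.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 1 << 20;
    int *data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Under release consistency, updates are only guaranteed to become
     * visible to other nodes at synchronization points, so the update is
     * bracketed by the (hypothetical) lock primitive. */
    ioctl(fd, SCIOS_LOCK, 0);
    data[0] += 1;                      /* plain load/store on the mapping */
    ioctl(fd, SCIOS_UNLOCK, 0);

    munmap(data, len);
    close(fd);
    return 0;
}
```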
Fig. 3. Structure of the prototype implementation of SciOS: applications in user space go through the Linux Virtual File System (VFS) to the SciOS memory management module (create/unlink, open/close, mmap/munmap, read/write, nopage, swapin/swapout), which in turn uses the Dolphin PCI-SCI driver and the PCI-SCI adapter.
C. Main SciOS Data Structures
The state of all the SciOS files, and of the nodes having mounted the SciOS file system, is described in a global data structure called the global table, which is fixed on a master node. A file has an entry in the top-level table which holds the file's symbolic name. We use an intermediate directory table that points to page tables. A page table entry contains information about the physical memory allocated for a virtual page. The global table also holds the summary information used by the swap events. It is updated independently by each node when the node's daemon has finished updating all its local page ages.
On page faults and swap-out/swap-in events, a lookup in the global table is needed to get the global information for a virtual page: its state, its physical location, and which nodes are mapping it or hold replicas. The states are: invalid for pages that have not yet been allocated, valid for allocated pages (possibly replicated), frozen for pages that cannot migrate, and swapped for pages that are placed on disk. A SciOS page table entry is protected by a lock to avoid competing accesses. Virtual pages are uniquely identified in the cluster by a pageid, which is a tuple (file id, offset). The file id is assigned when the file is created.
On each node, a map table describes which local processes have mapped each SciOS file and at what virtual address. Given the map table and a pageid, we can easily find the Linux page table entries locally on each node, e.g., when invalidating remote mappings while migrating a page or when discarding remote replica pages on a write access to the page.
To manage the physical memory on each node, we use an inverted page table. For each physical page, the table describes the local state (e.g., replica and frozen), the pageid if it is mapped either locally or remotely, the number of local mappings, and the age. It also has two timestamps. The first is a local timestamp of the last write (set by the daemon). The other is used in the following way. (1) For valid pages, it contains the time of the last invalidation, caused by either a migration or a write access to a replicated page. (2) For frozen pages, it describes when the page was frozen. It is used by the daemon for deciding when to defrost the page. A sketch of these per-page structures is given below.
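A hedged sketch of the per-page structures just described follows; the field names and types are assumptions, not the actual SciOS declarations:

```c
#include <stdint.h>
#include <time.h>

/* Cluster-wide unique identifier of a virtual page: (file id, offset). */
struct pageid {
    uint32_t file_id;     /* assigned when the SciOS file is created */
    uint32_t offset;      /* page-aligned offset within the file */
};

/* Global state kept per virtual page in the global table. */
enum page_state { PAGE_INVALID, PAGE_VALID, PAGE_FROZEN, PAGE_SWAPPED };

/* One entry of the per-node inverted page table (one per physical page). */
struct phys_page {
    struct pageid id;              /* which virtual page it holds, if any */
    int           is_replica;
    int           is_frozen;
    int           local_mappings;  /* number of local processes mapping it */
    int           age;             /* aged by the daemon; replicas and frozen
                                      pages age faster than mapped pages */
    time_t        last_write;      /* set by the daemon from the dirty bit */
    time_t        last_event;      /* valid: last invalidation; frozen: freeze time */
};
```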
Remote mappings are also managed by an inverted page table which holds the pageid of every remotely mapped virtual page. A mapping in the PCI-SCI adapter covers a contiguous block of 512 KB of physical memory that must be boundary aligned. When mapping a remote physical page (only 4 KB on an Intel architecture) that is already covered by an existing ATT entry, we simply use that mapping. Otherwise, we have to initialize a new ATT entry in the PCI-SCI adapter. If no more ATT entries are available (due to limitations on the space used on the PCI bus [14]), we must free one by invalidating all the remote mappings covered by the least recently used ATT entry (the daemon also keeps ages for each ATT entry). For this invalidation, we use the pageids stored for each ATT entry.
When writing to a replicated page, we need to discard all other replica pages. This does not have to be done immediately, but can be postponed until a remote node that holds a replica performs a synchronizing acquire operation [13], [21]. We are experimenting with different approaches to this. The basic data structure is a queue on each node that holds the pageids of pages that must be discarded on the next acquire (a sketch appears at the end of this section).
D. Implementation Issues
The main feature of SciOS is its tight integration with the virtual memory management of the operating system. During our initial tests, we have found several problems with Linux's swap system, especially during heavy swap activity. Thus, for the initial versions of the SciOS prototype, we implemented our own kernel daemon that deals with the aging of the physical pages used by SciOS. It is our intent to merge the two daemons later on. Our daemon runs four times per second and scans the processes that have mapped SciOS files (found using the map table). We check the dirty and accessed bits set by the Intel hardware [19] in Linux's page tables and reset the bits after each scan. If the dirty bit was set for a valid page, we set the "last write" timestamp used when the page fault handler needs to determine whether a page can be replicated.
We are working with three timestamps in the implementation. The defrosting of a page is handled by the same node that froze the page. This makes it easy to compute the elapsed time because it is a local operation. For the "last invalidation" and "last write" timestamps used by the page fault handler, multiple solutions exist. The last invalidation timestamp is set by the node that either migrated the page or discarded replicas when writing to the page, but it is another node that later needs to check whether the invalidation happened recently. The same problem exists for the last write timestamp, which is set by the daemon. Since we do not have a global time in the cluster, we must resort to either (1) sending a low-level message to the node that handled the event to have it calculate an approximate elapsed time or (2) keeping clocks in shared memory that are kept up-to-date by the daemons. We currently use the low-level message approach, but expect to be able to optimize this.
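As a hedged sketch of the lazy-discard queue mentioned in Section IV-C (the queue layout, its fixed size, and the helper names are assumptions): when a node learns that one of its replicas has been written elsewhere, the pageid is queued, and the queue is drained at the node's next synchronizing acquire.

```c
#include <stdint.h>

/* (file id, offset) tuple identifying a virtual page, as in Section IV-C. */
struct pageid { uint32_t file_id; uint32_t offset; };

/* Hypothetical helper that unmaps and frees the local copy of a page. */
void discard_local_replica(const struct pageid *id);

#define DISCARD_QUEUE_LEN 256

/* Per-node queue of replicas to be discarded lazily at the next acquire. */
struct discard_queue {
    struct pageid pending[DISCARD_QUEUE_LEN];
    int           count;
};

/* Called when this node is notified (e.g., by a low-level message) that one
 * of its replicas was written on another node; the discard is deferred. */
void queue_discard(struct discard_queue *q, struct pageid id)
{
    if (q->count < DISCARD_QUEUE_LEN)
        q->pending[q->count++] = id;
    /* A full queue could fall back to discarding immediately (not shown). */
}

/* Called at the start of a synchronizing acquire on this node. */
void drain_discards_on_acquire(struct discard_queue *q)
{
    for (int i = 0; i < q->count; i++)
        discard_local_replica(&q->pending[i]);
    q->count = 0;
}
```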
V. Early Performance Evaluation
Unfortunately, we are not yet able to present a complete performance evaluation of the SciOS prototype. With the current implementation, we can only show some early and very basic measurements of the remote swap capability on two 200 MHz Pentium uniprocessor nodes. We present the results for two benchmarks. The numbers should be considered preliminary.
We have measured the performance of Dolphin's PCI-SCI interface when the nodes are connected back-to-back. The PCI bus has a theoretical bandwidth of 132 MB/s, but a maximum of 70 MB/s is obtained using a series of processor store instructions to remote memory. The PCI chipset (Intel TX97) also limits the performance of remote load instructions to 10 MB/s. These measurements have been performed with all optimizations in the PCI-SCI adapter enabled, e.g., write combining and aggressive prefetching. The latencies have been measured to 2.4 microseconds for stores and 4.6 microseconds for loads, which is roughly 10-20 times higher than local memory accesses.
To evaluate the performance of the swap functions in SciOS, we use a synthetic benchmark that we call swaptest. It is a simple program that accesses a 64 MB table sequentially multiple times. First, the table is initialized and then read four times. We only measure the read phases to avoid comparing the initial allocation of the physical memory. To make swaptest exercise the swap system, we only make 32 MB of physical memory available to SciOS. The results are shown in Table I. First we report the basic costs of getting and putting a page (4 KB) to and from either a local disk (EIDE) or the memory of the other node. We find that writing a page to remote memory is 18 times faster than to local disk. For getting pages, the disk slows down further and remote memory is 84 times faster. Because the load bandwidth is 7 times lower than the store bandwidth on the PCI-SCI adapter, we get a page from a remote node by sending an interrupt to that node, asking it to write the page back. This reduces the time for a remote page get from more than 400 to only 100 microseconds.

TABLE I. Basic timings for SciOS's swap functions.

Operation/application   Local disk   Remote memory   Speedup
Basic page put            1,200 us          65 us         18
Basic page get            8,400 us         100 us         84
Swap out                  1,255 us         113 us         11
Swap in                   8,426 us         123 us         69
swaptest                   860.8 s         29.8 s         29
FFT                        454.4 s         28.6 s         16
We also report in Table I the kernel costs of the swap out and swap in operations for swaptest. The overhead of the swap out and swap in functions is low (basically updates in SciOS's page table), which confirms the pattern seen with the basic page put/get: a speedup of 11 for swap out and 69 for swap in. However, the overhead starts to become important when the costs of a basic page put/get are as low as they are for remote memory: it is 42% for swap out and 19% for swap in. The execution time at user level of swaptest shows a speedup of 29.
We have also tested SciOS's remote swap functions using the Fast Fourier Transform (FFT) application from the SPLASH suite [20] with a problem size of 52 MB. The execution time is 12.6 seconds when the application has enough
physical memory. The results for FFT, when given only 32 MB by SciOS, are shown in Table I. We see that the application is very sensitive to the cost of the swap system because SciOS performs more than 27,000 swap out and 23,000 swap in operations. We note that the basic disk read operations have increased to more than 13 milliseconds, compared to the 8.4 milliseconds for swaptest, because the disk accesses are no longer sequential. This also slightly hurts the cost of setting up mappings for remote memory: ATT entries have to be reinitialized frequently, and the average cost of creating a remote mapping increases from 4 microseconds in swaptest to 65 microseconds for FFT. That is only a modest increase compared to the increase in the cost of reading from disk.
VI. Related Work
SciOS draws on research in multiple areas. The Global Memory Service (GMS) [7] implements a global swap mechanism based on approximate information about the cluster-wide state of the physical memory. GMS is implemented as a low-level operating system service that can be used by many types of clients, e.g., the paging system. Shared pages are replicated and are simply discarded if swapped out, as in SciOS. Consistency of shared pages is the responsibility of the GMS clients, but there are no mechanisms provided to handle consistency, e.g., primitives to invalidate remote replicas. If a GMS client replicates pages itself, GMS will treat the replicas as distinct pages, and two replicas can thus end up on the same node. The N-Chance Forwarding algorithm [5] coordinates caches for a distributed file system. The node used for placing swapped-out pages is randomly picked. The cache blocks are kept coherent using a single-writer/multiple-reader protocol, but this can mean a ping-pong effect for pages that are write-shared or are subject to false sharing.
The NUMA migration, freeze/defrost, and replication techniques used in SciOS are based on the results of PLATINUM [4], the studies by LaRowe et al. [15], and later CC-NUMA studies. Verghese et al. [22], like SciOS, take the amount of idle memory into consideration when replicating pages on a CC-NUMA architecture. Their system stops replicating pages when memory pressure is experienced on a node. With SciOS, we replicate pages even if it means that another page needs to be swapped out. Because the gap between local and remote access costs on an SCI cluster is much larger than on CC-NUMA, and because SciOS presently does not replicate at the cache block level for remotely mapped pages, we believe that page replication is an important optimization even under memory pressure. We do age replica pages faster than normal pages, thereby allowing a node to quickly free the physical page, e.g., for another replicated page.
The Cashmere-2L system [21] is implemented on a cluster of SMP nodes and uses the write-only based Memory Channel [10]. Memory coherency is maintained in hardware within each SMP node (first level) and is based on a home-based lazy release consistency model between nodes (second level). Cashmere-2L is, like TreadMarks [13], based on the twin/diff technique which allows multiple writers
to a shared page. Experiments show that the twin/diff technique can perform better than "write doubling" on a one-level version of Cashmere. The write-doubling technique tries to emulate a load/store interface on the write-only network by making writes not only to a remote master copy but to a local copy as well. In SciOS, the twin/diff technique could enhance performance, compared to a remote mapping to a frozen write-shared page, for applications with much cache locality on frozen pages. But the twin/diff technique uses extra physical memory. We plan to study the twin/diff technique together with the possibility of enabling the processor caches for remote load/store accesses on the PCI-SCI adapter.
Ibel et al. have implemented a global address space at the user level using the Split-C language on an SCI cluster [11]. Because SciOS is implemented in the kernel, we can share the physical memory and remote mappings between applications/processes. SciOS can be used from many different languages and does not depend on pointer indirections.
VII. Conclusions
Instead of using an I/O-based SCI interface as a means to provide efficient message passing, we use it as a non-coherent NUMA architecture. We have presented SciOS, a prototype which provides a programming model based on a coherent shared memory that facilitates the programming of an SCI cluster. Mechanisms such as migration and replication of shared pages lower the average memory access times. A contribution of SciOS is a global memory protocol that tightly integrates the shared memory abstraction with the swap functionality of the virtual memory system. The initial results from the remote swap capability of the SciOS prototype are very encouraging. We are continuing our implementation and experiments in order to present a detailed evaluation of the possibilities for an efficient implementation of a coherent shared memory on top of the I/O-based SCI technology.
Acknowledgments
Kåre Løchsen and Hugo Kohmann from Dolphin Interconnect Solutions (Norway) willingly granted us access to the source code for the PCI-SCI adapter and answered all of our questions. Roger Butenuth from Universität Paderborn (Germany) kindly gave us access to his Linux port of the PCI-SCI driver. Jean-Philippe Fassino assisted in the early implementation phase. We are grateful for the discussions with the members of INRIA's research groups who use SCI clusters.
References
[1] Edouard Bugnion, Scott Devine, and Mendel Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 143-156, October 1997.
[2] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating System Principles, October 1991.
[3] D. Comer and J. Griffioen. A new design for distributed systems: The remote memory model. In Proceedings of the 1990 USENIX Conference, June 1990.
[4] A. Cox and R. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceedings of the 12th ACM Symposium on Operating System Principles, December 1989.
[5] Michael D. Dahlin, Randolph Y. Wang, Thomas E. Anderson, and David A. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 267-280, November 1994.
[6] Dolphin Interconnect Solutions. PCI-SCI cluster adapter specification, May 1996. Version 1.2. See also http://www.dolphinics.com.
[7] Michael J. Feeley, William E. Morgan, Frederic H. Pighin, Anna R. Karlin, Henry M. Levy, and Chandramohan A. Thekkath. Implementing global memory management in a workstation cluster. In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 201-212, December 1995.
[8] Marco Fillo and Richard B. Gillett. Architecture and implementation of MEMORY CHANNEL. Digital Technical Journal, 9(1):27-41, 1997.
[9] Kourosh Gharachorloo, Sarita V. Adve, Anoop Gupta, John L. Hennessy, and Mark D. Hill. Programming for different memory consistency models. Journal of Parallel and Distributed Computing, 15(4):399-407, August 1992.
[10] Richard B. Gillett. Memory Channel network for PCI. IEEE Micro, 16:12-18, February 1996.
[11] Maximilian Ibel, Klaus E. Schauser, Chris J. Scheiman, and Manfred Weis. High-performance cluster computing using SCI. In Hot Interconnects Symposium V, August 1997.
[12] IEEE. IEEE Standard for Scalable Coherent Interface (SCI), 1992. Standard 1596.
[13] Peter Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter USENIX Conference, pages 115-132, January 1994.
[14] Povl T. Koch and Xavier Rousset de Pina. Flexible operating system support for SCI clusters. In Proceedings of the Euro-Par'98 Conference, Lecture Notes in Computer Science. Springer-Verlag, 1998. To appear.
[15] Richard P. LaRowe Jr., Carla Schlatter Ellis, and Laurence S. Kaplan. The robustness of NUMA memory management. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 137-151, October 1991.
[16] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the 17th International Symposium on Computer Architecture, pages 148-159, May 1990.
[17] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.
[18] Evangelos P. Markatos and Manolis G. H. Katevenis. Telegraphos: High-performance networking for parallel processing on workstation clusters. In Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA), February 1996.
[19] Tom Shanley, MindShare Inc. Pentium Pro and Pentium II System Architecture, 2nd ed. Addison-Wesley, 1998.
[20] J. Singh, W. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared memory. Computer Architecture News, 20(1):5-44, March 1992.
[21] Robert Stets, Sandhya Dwarkadas, Nikolaos Hardavellas, Galen Hunt, Leonidas Kontothanassis, Srinivasan Parthasarathy, and Michael Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 170-183, October 1997.
[22] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 279-289, October 1996.