SNOW: A State-Less Low-Overhead All Software Distributed Shared Memory System

Arif M. Bhatti

[email protected]

Computer Science Department Boston University

March 31, 1997

Abstract

Research in user-level communication and the recent increase in network speeds require a re-evaluation of traditional distributed shared memory (DSM) protocols for workstation networks. Our experience with a release consistency (RC) DSM system shows that the system has: (1) high protocol processing overhead, (2) high coherence-related overhead, and (3) high software access control overhead. To reduce these overheads, in this paper we present a state-less DSM system based on the following features: (1) fixed ownership, (2) data layout, (3) a read-only data cache, (4) a write-buffer for updates, and (5) a hybrid coherence protocol. Our results were obtained by executing a set of four benchmark applications that we ported to run on the SNOW DSM system. The system executes the applications on a simulated architecture, and the applications differ from each other in sharing pattern, computation granularity and amount of synchronization. In this paper we show an execution graph and a comparison graph for each application. We compared the performance of SNOW with a traditional update-based system, and our results show that the traditional system can copy up to 6 times more data and send up to 5 times more communication messages. The performance gains are due to the fixed ownership protocol, which reduces the software access detection, write-shared protocol processing, and consistency-related communication overheads.

Keywords: distributed shared memory, parallel programming, NOW



Contact Address: 111 Cummington Street, Boston, MA 02215, voice: 617-353-5227, fax: -6457


1 Introduction

The popularity of the shared memory programming model for parallel programs is due to its simple interface, which hides all architectural details and allows programmers to concentrate on algorithmic details. A multiprocessor system may contain multiple memory modules, but for the shared memory model the system provides programmers with the abstraction of a single memory module; the system supports this abstraction in hardware by providing a global address space and coherent caches. Distributed shared memory (DSM) is an abstraction that supports the shared memory programming model in software on distributed memory (DM) machines, which do not provide a global address space in hardware. Each node in a DM machine has a private address space, and an access to the memory module of a remote node requires an exchange of messages with that node. The DSM abstraction on a distributed memory machine is implemented by a runtime system that supports the global address space in software. A DSM system can be based entirely on a runtime system [13, 12]. The use of special compilers [9, 16, 3, 14] and dedicated hardware [7] for the DSM abstraction, in addition to the runtime system, changes the design parameters.

Most DSM systems treat the local memory of the nodes as coherent caches and detect shared memory accesses in the node MMUs. Access detection in the MMU is atomic and costs zero processing cycles, but this technique has two drawbacks: it binds the data block size to the hardware page size, which restricts the DSM system to coarse-grain applications, and it requires interaction with the operating system. The large data block size problem can be solved by separating the data block from the hardware page; there are two possible options with different tradeoffs. First, implement the DSM system on a relaxed consistency (RC) model and use a significantly more complex programming interface; this approach requires a significant amount of work from programmers and is against the basic philosophy of the shared memory model as a simple interface. Second, implement the access control mechanism in software instead of using the MMU; this approach imposes a fixed extra cost on each shared memory access, and that cost depends on the consistency model and the communication mechanism used.

To run medium- and fine-grain applications on DSM systems, recent research trends are toward all-software systems based on sequential consistency (SC) [17, 16] and on relaxed consistency models such as release consistency [12]. Traditional release consistent systems have high overhead due to protocol processing and coherence-related communication. For all-software implementations, these systems also have a high software access detection cost due to the dynamic ownership protocol. This paper presents the design and performance of SNOW, a state-less DSM system that uses fixed ownership, data layout and a hybrid coherence protocol to reduce the above mentioned overheads.

Section 2 discusses the motivation and background, while Section 3 explains the design of SNOW, including its data structures, synchronization primitives and coherence protocols. Section 4 briefly describes the experimental methodology and the implementation of the runtime system. Section 5 presents the results for the selected applications, and the last section summarizes the paper and discusses future research directions.

2 Motivation and Background

Traditional DSM systems treat the local memory of the nodes as coherent caches. These systems use a directory at each node that maintains the state of all locally mapped pages, and they use a distributed ownership protocol as part of the coherence mechanism.


Figure 1: SNOW shared memory (left) and shared memory in DSM systems (right)

Early SC-based DSM systems like IVY [15] did not receive much popularity due to the large data block size, the false sharing problem and the strict orderings of the consistency model. Relaxed memory consistency models, such as RC, have received much more popularity because they impose strictly fewer ordering restrictions. Munin [9], a release consistent DSM system, introduced the write-shared protocol that allows multiple nodes to concurrently update the same data block. This technique reduces the impact of the false sharing problem, but may send more data due to the large data block size.

A page-based DSM system treats a part of the virtual address space as shared and detects shared memory accesses in the node MMUs. At each shared memory access the system performs a lookup on the state of the page that contains the referenced data object and invokes an action if the page is not in an appropriate state. The use of MMUs to detect shared memory accesses forces the data block size to be the hardware page size, but a large data block size may degrade the system's performance. Selecting a smaller data block size for a DSM system requires more assistance from programmers, which may change the programming model [3, 14, 12]. Implementing an access control mechanism [17] in software, without any assistance from the node MMUs, allows a DSM system to use a data block smaller than a hardware page. The memory lookups in software cost a fixed number of CPU cycles on each memory access, while this cost is zero cycles in a DSM system that performs memory lookups in the MMU.

Slow networks, such as Ethernet, have high protocol processing overhead, unreliable communication and packet collisions due to their shared nature. These networks provide low bandwidth, and message latency increases with the load and with the number of nodes in the network. Sending a large message over a slow network is almost as expensive as sending a small message, due to the high communication latency and protocol processing overhead [9]. New local area networks, such as ATM and AN2 [1], provide high speed, high bandwidth and low communication latency. These networks perform point-to-point communication and can deliver small messages with low latency; they are reliable and can deliver messages in order. Recent research in user-level communication [2, 19, 10] on these networks has made them more interesting for parallel computing. Existing DSM protocols are designed for slow networks; the new network technology and user-level communication require a re-evaluation of the existing DSM protocols, and our results show that new DSM protocols designed for point-to-point networks can outperform the traditional ones.

We evaluated the effect of data block size on the performance of an SC invalidation-based system, CDSM [5], and an RC write-shared update-based system, UDSM [4]. A directory protocol is used to maintain the state of the shared data blocks at each node, and a distributed ownership protocol is used to migrate or replicate the shared data blocks in local memories. The dynamic ownership protocol reduces the remote access latency by changing the ownership of a data block according to its usage, as dictated by the coherence protocol. For software access detection, this state information, the dynamic ownership protocol and the coherence protocol can influence the cost of memory lookups. The dynamic ownership protocol can have a high communication cost for locating a page due to the dynamic changes in ownership. For fine-grain computation, update-based write-shared DSM systems increase the protocol processing overhead required to maintain the state information for updates; the write-shared protocol removes the restriction on concurrency, but it has to detect the updated words in a data block in order to merge them with updates performed by other nodes on the same block. The distributed ownership protocol may even increase the number of communication messages, because of falsely shared data blocks [4] and ownership transfers.

The goal of this research is to reduce the number of stale updates, to reduce the coherence-related communication by not maintaining the state information, and to reduce the software access control overhead by not performing lookups on the state information. In this paper we present SNOW, an RC-based state-less DSM system that supports a conventional programming model like Munin [9]. Unlike all other DSM systems, the coherence protocol of SNOW uses a fixed ownership protocol and a data layout technique. SNOW is designed to take advantage of user-level communication, the availability of reliable networks, and the ability of a node to process communication messages in order of arrival.

3 Design and Architecture of SNOW

SNOW provides the abstraction of a global shared address space in software on a distributed memory architecture. The system uses a fixed-owner approach and distributes the shared address space among the participating nodes, as shown in Figure 1 (left), compared to traditional systems where each node may maintain state for all of the shared address space, as shown in Figure 1 (right). The SNOW architecture supports a shared address space of NxM bytes, where N is the number of nodes in the system and M is the shared address space at each node; N and M are exact powers of two. Each node maps its part of the shared address space to a contiguous segment of its local memory.

The software access control mechanism requires that all shared memory accesses be visible to the runtime system. The shared address space can be a part of the virtual address space for software access control, but a compiler [16, 3] or an editing tool [17] is then needed to insert access detection code before each shared memory access. We do not use any special compiler or editing tool to insert the access detection code; instead we rely completely on the runtime system and expect that a pre-processor will replace shared memory accesses with the appropriate memory access routines. Instead of treating a part of the virtual address space as the shared address space like traditional DSM systems, we map the shared address space on top of the virtual address spaces of the nodes.

The main features of SNOW are: fixed ownership and distribution of the shared address space among the nodes, data layout, read-only cached data, write-buffers for updates, eager propagation of updates, and implicit invalidation of cached data at the release synchronization operation. These features reduce the processing overhead of write-shared protocols and also reduce the cost of shared memory access detection in software. SNOW does not treat the local memory of the nodes as coherent caches like traditional systems.

Unlike other write-shared systems [9, 4, 13], SNOW does not make a copy of a data block and does not compute diffs. Instead, SNOW keeps the caches in read-only mode and stores the updates in a separate write buffer.
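Because N and M are powers of two, the owner of a shared address and its offset inside the owner's segment can be computed with a shift and a mask; no directory lookup or request forwarding is needed. The following is a minimal sketch of that translation, assuming hypothetical parameters (16 nodes, 1 MB per node) and helper names that are not part of the paper's interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: 16 nodes, 1 MB of shared memory per node.
 * Both are powers of two, as the SNOW design requires. */
#define LOG2_SEG_SIZE 20                        /* log2(M), M = 1 MB */
#define SEG_SIZE      (1u << LOG2_SEG_SIZE)     /* M bytes per node  */
#define NUM_NODES     16                        /* N                 */

/* Owner of a shared address: the node whose segment contains it. */
static inline unsigned owner_of(uint32_t shared_addr)
{
    return shared_addr >> LOG2_SEG_SIZE;
}

/* Offset of a shared address inside the owner's local segment. */
static inline uint32_t segment_offset(uint32_t shared_addr)
{
    return shared_addr & (SEG_SIZE - 1);
}

int main(void)
{
    uint32_t addr = 5u * SEG_SIZE + 0x1234;     /* falls in node 5's segment */
    printf("addr 0x%08x -> owner %u, offset 0x%x\n",
           (unsigned)addr, owner_of(addr), (unsigned)segment_offset(addr));
    return 0;
}
```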


Figure 2: The read cache (a), the write buffer (b)

The runtime system at each node maintains a shared read-cache of fixed size to reduce remote communication and to offset the effect of remote access latency. The system is parameterized; before running a parallel program, a user can select an appropriate data block size and the number of blocks at each node for the program. This organization differs from traditional systems, because those systems treat the local memory of a node as a coherent cache and at any moment a node is allowed to cache all of shared memory. The read cache is implemented as a hash table and the shared address is used as the key of the hash function; the organization of the read-cache is shown in Figure 2(a). Each data block in the cache has three attributes: C-address, C-size and C-data. The C-data attribute stores the cached data block and the C-address attribute records the shared address of the block. The C-size attribute plays a dual role: if a cached data block is valid, it contains the size of the block; otherwise it contains a negative number to indicate that the block holds inconsistent data. The read-cache, in addition to the above mentioned attributes, has another attribute, C-lock, which is used by the cache coherence protocol; its usage is explained in the coherence protocol section. The read-cache at a node is used for read-only purposes by the computation thread running at the node, while all updates performed by the node are recorded in a write buffer. A node can drop any data block from its local read-cache without any communication with other nodes.

The runtime system propagates local updates to the node that is responsible for the updated shared memory. The write-buffer, as shown in Figure 2(b), also maintains some state information in addition to the updates performed by the node. The buffer can store a fixed number of words in its data field, and the W-bitmap attribute maintains one bit for each updated word in the buffer. The W-type attribute maintains the data type of the updates; this information is used by the runtime system to propagate local updates. The W-node attribute keeps track of the owner node of the updated shared memory, and that owner eventually receives these updates. The W-address attribute maintains the shared address of the first word in the write-buffer. The mechanism for propagating the updates in the write-buffer is discussed later in the coherence protocol section.
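As one concrete reading of Figure 2, the per-node structures might be declared as below. The field and constant names mirror the attributes in the text (C-address, C-size, C-data, C-lock, W-address, W-node, W-type, W-bitmap, W-data), but the sizes and the C declarations themselves are illustrative assumptions, not the paper's actual implementation; the hash-table organization of the cache is reduced to a plain array of entries.

```c
#include <stdint.h>

#define BLOCK_SIZE   512          /* bytes per cached data block (user-selectable)  */
#define NUM_BLOCKS   128          /* read-cache entries per node (as in Section 5)  */
#define WBUF_WORDS   32           /* 128-byte write buffer = 32 four-byte words     */

/* One entry of the per-node read cache (keyed on c_address in the real hash table). */
struct cache_block {
    uint32_t c_address;           /* shared address of the block                    */
    int32_t  c_size;              /* block size if valid, negative if inconsistent  */
    uint8_t  c_data[BLOCK_SIZE];  /* cached copy, read-only to the computation thread */
};

struct read_cache {
    int                c_lock;    /* lock id recorded at a release, -1 if none      */
    struct cache_block blocks[NUM_BLOCKS];
};

/* The per-node write buffer; local updates to remote segments accumulate here and
 * are flushed to the owner (w_node) when the buffer fills or at a release. */
struct write_buffer {
    uint32_t w_address;           /* shared address of the first word               */
    unsigned w_node;              /* owner node that will receive these updates     */
    int      w_type;              /* data type of the buffered updates              */
    uint32_t w_bitmap;            /* one bit per updated word in w_data             */
    uint32_t w_data[WBUF_WORDS];  /* the buffered update values                     */
};
```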

3.1 Synchronization

In the shared memory model, parallel programs require synchronization primitives to control and order memory accesses by different nodes, and most DSM systems implement lock and barrier synchronization primitives. SNOW implements the lock primitive based on a distributed queue-based algorithm. A thread at a node can acquire a lock by invoking the acquire lock operation and can enter its critical section if the lock is available at the local node. If the lock is not available locally, the operation sends a request to the lock owner, suspends the computation thread and transfers control to the local scheduler. Each node in the system has a lock request server, a user-defined interrupt handler, to handle all lock requests from remote nodes. On each lock request, the server chooses one of the following: service the request, insert the requesting node in the lock queue, or forward the request to the current owner of the lock. A thread invokes the release lock operation to exit from its critical section; the operation, after performing some coherence-related actions, sends a lock-reply to the next node waiting for the lock. The runtime system at a node, after receiving a lock-reply, resumes the suspended thread.

The barrier primitive is implemented in a centralized fashion where a barrier-manager collects all requests from the participating nodes. A thread can enter a barrier by calling the barrier operation. The operation, after performing some coherence-related actions, sends a request to the barrier-manager and transfers control to the local scheduler. The barrier-manager waits for one request from each of the participating nodes; after collecting all requests, it sends a barrier-reply message to the nodes. The runtime system at each node, after receiving the barrier-reply message, resumes the suspended computation thread.
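The lock-request server's three-way choice (service, queue, or forward) can be sketched as follows. The dlock descriptor, the probable-owner field and the messaging stubs are hypothetical stand-ins for the runtime's real data structures and communication layer, and queue handling is simplified to a fixed array; this is a sketch of the idea, not the paper's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_WAITERS 16

struct dlock {
    bool am_owner;            /* does this node currently own the lock?        */
    bool held;                /* is the local thread inside its critical section? */
    int  probable_owner;      /* best guess of the owner, used for forwarding  */
    int  queue[MAX_WAITERS];  /* nodes waiting for the lock                    */
    int  nwait;
};

/* Stubs standing in for the real runtime communication layer. */
static void send_lock_reply(int node)  { printf("lock-reply -> node %d\n", node); }
static void forward_lock_request(int node, int requester)
                                       { printf("forward req(%d) -> node %d\n", requester, node); }

/* Lock-request server: invoked by the interrupt handler for each remote request. */
void lock_request_server(struct dlock *l, int requester)
{
    if (!l->am_owner) {
        forward_lock_request(l->probable_owner, requester);  /* not the owner any more */
    } else if (!l->held && l->nwait == 0) {
        l->am_owner = false;                                 /* free and uncontended:  */
        l->probable_owner = requester;                       /* hand the lock over now */
        send_lock_reply(requester);
    } else {
        l->queue[l->nwait++] = requester;                    /* busy: remember requester */
    }
}

/* Release: after the coherence actions (flush write buffer, mark read cache),
 * pass the lock to the next waiter, if any. */
void release_lock(struct dlock *l)
{
    l->held = false;
    if (l->nwait > 0) {
        int next = l->queue[0];
        for (int i = 1; i < l->nwait; i++) l->queue[i - 1] = l->queue[i];
        l->nwait--;
        l->am_owner = false;
        l->probable_owner = next;
        send_lock_reply(next);
    }
}
```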

3.2 Coherence Protocol

The protocol is based on fixed ownership instead of the dynamic distributed ownership used by other DSM systems. Because of the fixed ownership there is no message forwarding for read or update requests, which reduces the number of messages required to locate a shared page. We will show that the amount of forwarded requests due to a dynamic ownership protocol is a significant fraction of all communication. SNOW is based on the release consistency model [6] and maintains the consistency of shared memory with a hybrid coherence protocol.

The shared address space of SNOW on an n-node system is divided into n segments and each node is the owner of exactly one segment. A node that updates its local segment does not send its updates to any other node, but the node forwards all updates on a remote segment to the owner node. Unlike update protocols on RC DSM systems, SNOW does not wait for the next release operation to propagate its updates. SNOW maintains a small write-buffer for updates and propagates the updates when the buffer is full. The buffer is considered full if it cannot accommodate a new update or if the thread has invoked the release operation to exit from its critical section.

At the release lock operation, a node virtually invalidates its local read-cache and records the lock in the C-lock attribute of the read-cache. If the node re-acquires the lock without transferring it to another node, then the node considers the read-cache valid and continues reading from the cache. If the lock is requested by another node, then the local node invalidates the read-cache before servicing the request. A node always implicitly invalidates its local read-cache on a barrier request after flushing the write-buffer. Figure 3 explains the interaction of synchronization and the cache coherence protocol.


Figure 3: Relationship of updates and synchronization requests

In Figure 3(a), node B sends a lock request to another node A. Node A, before exiting its critical section by invoking the release lock operation, sends its remaining updates to the home node (owner) and invalidates its local read-cache before sending a lock-reply to node B. Node B, after acquiring the lock, may request some data from the home node. In-order delivery of messages and in-order processing of delivered messages guarantee that the home node applies the updates to its local shared memory before servicing any request from node B; this sequence of messages ensures that node B gets the most up-to-date values from shared memory.

Figures 3(b) and (c) show the coherence-related actions at a barrier request. A node enters a barrier by invoking the barrier operation; the operation sends all updates in the local write-buffer to the owner node, invalidates the local read-cache and sends a request to the barrier-manager. The operation suspends the thread that invoked it and transfers control to the local scheduler. In-order delivery by the network and in-order processing by the nodes guarantee that each node applies all its updates to local shared memory before processing the barrier-reply message issued by the barrier-manager.
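A minimal sketch of how these coherence actions might look in the runtime is given below. The read cache, write buffer and helper routines (flush_write_buffer, invalidate_read_cache, send_barrier_request) are simplified, hypothetical stand-ins, not the paper's code; the point is the ordering of the flush, the "virtual" invalidation via C-lock, and the unconditional invalidation at barriers and lock handoffs.

```c
/* Simplified stand-ins for the runtime state and helpers used in this sketch. */
struct read_cache   { int c_lock; };          /* lock recorded at the last release, -1 if none */
struct write_buffer { int pending_words; };

static void flush_write_buffer(struct write_buffer *wb)
{ wb->pending_words = 0; /* would send the buffered updates to the owner node */ }

static void invalidate_read_cache(struct read_cache *rc)
{ (void)rc; /* would mark every cached block's C-size negative */ }

static void send_barrier_request(void)
{ /* would suspend the thread and hand control to the scheduler */ }

/* Release: flush pending updates, then invalidate the cache only "virtually"
 * by remembering which lock was released (the C-lock attribute). */
void snow_release_lock(struct read_cache *rc, struct write_buffer *wb, int lock_id)
{
    flush_write_buffer(wb);
    rc->c_lock = lock_id;
}

/* Acquire: if the same lock is re-acquired before any handoff, the cached
 * blocks are still usable; otherwise the cache is invalidated conservatively. */
void snow_acquire_lock(struct read_cache *rc, int lock_id)
{
    if (rc->c_lock == lock_id) { rc->c_lock = -1; return; }
    invalidate_read_cache(rc);
}

/* Called by the lock-request server before the lock is handed to another node. */
void snow_on_lock_handoff(struct read_cache *rc)
{
    invalidate_read_cache(rc);
    rc->c_lock = -1;
}

/* Barrier: updates are flushed and the cache is always invalidated before the
 * request is sent, so in-order delivery makes them visible at the owners first. */
void snow_barrier(struct read_cache *rc, struct write_buffer *wb)
{
    flush_write_buffer(wb);
    invalidate_read_cache(rc);
    send_barrier_request();
}
```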

4 Methodology

We use an execution-driven simulation technique to study the DSM abstraction on a network of workstations; the simulated architecture is based on the Proteus architecture simulator [8]. The parallel architecture simulated for this research is a collection of homogeneous nodes, as shown in Figure 4 (left). Nodes in the system are connected by a high-speed local area network. Like a workstation, each node in the system has a processor, a cache, a memory module and a network interface. The system does not have any special hardware such as a protocol processor [7] or a memory-mapped network interface [11].

A parallel program has several computation threads, and these threads execute concurrently to perform the required computation. A node in a parallel system may have multiple computation threads, and a scheduler module is needed at the node to ensure orderly execution of these threads. A computation thread executes until its completion, and no other thread can preempt an executing thread. A node may block a computation thread on a memory fault or on a call to a synchronization operation. A memory fault occurs when a thread accesses a data object that is not present in local memory; the faulting thread sends a message to a remote node and transfers control to the local scheduler.

The runtime system of SNOW has some user-defined interrupt handlers, or servers, at each node to handle all communication messages. A server executes until its completion to service a request sent by a remote node. A server always executes in its critical section, and no other server can interrupt the execution of the currently executing server.


Figure 4: System architecture (left) and one-way communication cost for a message (right)

A node records all new messages that arrive during the execution of a server in a local interrupt-queue. After completion of the executing server, the node executes the server for the next message in the queue. If there is no message in the queue, the node resumes the interrupted computation thread. The servers are state-less and, on interrupts, execute in the node's address space to reduce the state-saving overhead.
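The dispatch discipline just described can be sketched as a small loop: servers run to completion, messages that arrive meanwhile queue up, and the computation thread resumes only when the interrupt queue is empty. The message, queue and thread primitives below are hypothetical placeholders, not the simulator's interface.

```c
#include <stddef.h>

struct message   { int type; /* read request, read reply, write request, lock, barrier, ... */ };
struct msg_queue { struct message buf[64]; size_t head, tail; };

static int             queue_empty(struct msg_queue *q) { return q->head == q->tail; }
static struct message *queue_pop(struct msg_queue *q)   { return &q->buf[q->head++ % 64]; }

static void run_server(struct message *m)   { (void)m; /* dispatch to read/write/lock server */ }
static void resume_computation_thread(void) { /* return control to the interrupted thread    */ }

/* Low-level receive path: called when the network interface delivers a message. */
void on_message_arrival(struct msg_queue *q, struct message m) { q->buf[q->tail++ % 64] = m; }

/* Invoked when the network interface raises an interrupt; also re-entered after
 * every server completes, so delivered messages are processed in arrival order. */
void handle_interrupts(struct msg_queue *q)
{
    while (!queue_empty(q)) {
        struct message *m = queue_pop(q);
        run_server(m);               /* runs to completion, never preempted         */
    }
    resume_computation_thread();     /* no pending messages: back to the application */
}
```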

4.1 System Architecture

The simulated architecture is a distributed memory machine where each node has its own private memory and can access its local memory without any network communication. To access a remote memory, a node must send a request message, and the message causes an interrupt at the remote node. Upon receiving an interrupt, the node invokes a user-defined interrupt handler, or server, to service the request. The server processes the request according to the coherence protocol of SNOW and sends a reply message to the requesting node. The simulated system does not have any hardware support for a global address space, and the runtime system supports this abstraction in software.

The nodes of the system are connected by a point-to-point high-speed LAN switch, and inter-node communication is done via message passing. The simulated network is reliable and delivers packets in order, and each node processes all incoming messages in the order of their arrival. Figure 4 (right) shows the factors involved in the one-way communication latency of a message from one node to another. The simulator assumes a fixed one-way communication latency of 700 processing cycles.

4.2 Runtime System

In this research, instead of keeping a large data block size and changing the programming model, we chose a conventional programming model similar to the model used by Munin [9]. The runtime system implements the access control mechanism in software and, instead of treating a part of the local address space as shared, maps the shared address space on top of the local address spaces of the nodes; it requires that all shared memory accesses be visible to the system.

There is one computation thread and three servers at each node. The computation thread executes the code of the application program, while the servers implement the cache coherence protocol of the system and deal with all consistency-related communication.

Our simulated architecture has a reliable network that delivers packets in order, and nodes process received messages in the order of their arrival. Traditional DSM systems usually initialize all shared data blocks at the root node (node 0), and a dynamic ownership protocol distributes these data blocks to other nodes according to the memory access pattern of the application program. SNOW uses the fixed ownership protocol and relies on the data layout technique to distribute shared data among the nodes. For example, the SNOW runtime system allocates a two-dimensional matrix in shared memory such that each node receives an equal number of rows. A node can perform a memory access to its part of shared memory without causing any communication activity.
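For instance, under a block-row layout each row of a matrix lands inside its owner's segment, so the shared address of an element follows directly from the row index. The function below is an illustrative sketch of one such layout, using the same assumed constants as the earlier owner_of() sketch; it is not the paper's allocator, which the text says currently distributes shared data on a round-robin basis.

```c
#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE  (1u << 20)        /* M: per-node share of the address space (assumed) */
#define NUM_NODES 16                /* N (assumed)                                      */

/* Shared address of element (r, c) of an R x C matrix of doubles laid out
 * block-row-wise, starting at offset `base` inside each owner's segment. */
static uint32_t matrix_elem_addr(uint32_t base, int r, int c, int R, int C)
{
    int rows_per_node = R / NUM_NODES;            /* assume R divides evenly   */
    int owner         = r / rows_per_node;        /* node that owns row r      */
    int local_row     = r % rows_per_node;        /* row index inside the node */
    uint32_t offset   = base + (uint32_t)(local_row * C + c) * sizeof(double);
    return (uint32_t)owner * SEG_SIZE + offset;   /* lands in owner's segment  */
}

int main(void)
{
    /* Element (300, 7) of a 512 x 512 matrix: row 300 belongs to node 9. */
    printf("addr = 0x%08x\n", (unsigned)matrix_elem_addr(0, 300, 7, 512, 512));
    return 0;
}
```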


Figure 5: Runtime servers to handle remote communication

The core of the runtime system is based on user-defined servers, and the servers handle all consistency-related communication among the nodes. For the SNOW runtime system, Figure 5 shows the servers and the protocol actions involved in servicing remote requests; dotted boxes represent the nodes, rectangles show the servers, and oval shapes show the actions involved in remote communication. On a read memory access to a shared address, the runtime system first tries to read from local shared memory; if the referenced data is not in local shared memory, it tries to read from the local read-cache; if the request cannot be serviced locally, the read operation sends a request to the owner of the referenced data. In SNOW there are two servers, a read request server and a read reply server, to deal with read-related communication among the nodes. On a write memory access to a shared address, the system either updates local shared memory or stores the new value in the local write-buffer. Storing a new value in the local write-buffer may require that the system propagate the current updates before storing the new update. The write request server receives all update requests and applies the updates to local shared memory.
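The read and write paths just described can be condensed into the following sketch. The helper names (cache_lookup, send_read_request, wbuf_can_accept, wbuf_store) and the word-granularity interface are assumptions made for illustration; the stub bodies stand in for the real runtime and communication layer.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_SIZE (1u << 20)                              /* per-node segment size (assumed) */

/* Hypothetical runtime state and helpers; the bodies here are placeholders. */
static unsigned my_node = 0;
static uint32_t local_segment[SEG_SIZE / sizeof(uint32_t)];

static unsigned owner_of(uint32_t addr)                  { return addr / SEG_SIZE; }
static bool cache_lookup(uint32_t addr, uint32_t *val)   { (void)addr; (void)val; return false; }
static uint32_t send_read_request(uint32_t addr)         { (void)addr; return 0; /* blocks on reply */ }
static bool wbuf_can_accept(uint32_t addr)               { (void)addr; return true; }
static void flush_write_buffer(void)                     { /* send buffered updates to the owner */ }
static void wbuf_store(uint32_t addr, uint32_t val)      { (void)addr; (void)val; }

/* Read path: local segment, then the read cache, then a request to the owner. */
uint32_t shared_read(uint32_t addr)
{
    if (owner_of(addr) == my_node)                       /* own segment: no communication  */
        return local_segment[(addr % SEG_SIZE) / sizeof(uint32_t)];

    uint32_t val;
    if (cache_lookup(addr, &val))                        /* hit in the read-only cache     */
        return val;
    return send_read_request(addr);                      /* miss: ask the owner node       */
}

/* Write path: the owner updates memory directly; everyone else buffers the update. */
void shared_write(uint32_t addr, uint32_t val)
{
    if (owner_of(addr) == my_node) {
        local_segment[(addr % SEG_SIZE) / sizeof(uint32_t)] = val;
        return;
    }
    if (!wbuf_can_accept(addr))                          /* full, or targets a new region  */
        flush_write_buffer();                            /* propagate pending updates now  */
    wbuf_store(addr, val);                               /* record the update for the owner */
}
```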

5 Results

The application suite and the runtime system of SNOW are discussed in detail in [6]; this section presents the performance of selected applications.

SNOW allows programmers to choose the size and the number of data blocks for the read-cache at each node, and a programmer can also choose the size of the write-buffer according to the granularity of the application. To study the effect of data block size on application performance, we selected data block sizes of 0.125K, 0.5K and 1.0K bytes. For all the results in this paper, we use a read-cache of 128 data blocks and a write-buffer of 128 bytes at each node.

In addition to the performance of SNOW for different data block sizes, we also compare its performance with a write-shared DSM system, a variant of Munin [9], that maintains memory consistency with an update-based dynamic ownership protocol. The performance and implementation details of UDSM are discussed in [6, 4]; our results showed that UDSM has high software access detection overhead, high coherence-related overhead and high protocol processing overhead for medium-to-fine grain applications. The goal of our research is to reduce the coherence-related overhead, which involves the communication required by the dynamic ownership protocol to maintain the state information at each node. We also want to reduce the protocol processing overhead caused by the write-shared protocol for fine-grain applications, and to reduce the software access control overhead. For each application we present two graphs: the first shows the effect of data block size on the parallel execution time of the application, while the second compares the application's performance on the SNOW and UDSM systems.


Figure 6: Parallel execution times on SNOW and comparison with UDSM system for MATMULT

An imperfect distribution of work among the nodes may degrade the performance of an application. SNOW uses a fixed-owner protocol instead of the dynamic ownership protocol used by traditional DSM systems, and the fixed ownership protocol requires that either the programmer or the compiler distribute the shared data among the participating nodes. A mismatch between the data layout and the work an application assigns to a node may increase the processing overhead, which involves extra communication and data copying. Currently, in the absence of compiler support, we distribute the shared data among the participating nodes on a round-robin basis.

5.0.1 MATMULT

This application computes the product of two matrices and stores the result in a third matrix, for a problem size of 256x256 doubles.

The algorithm completes in two phases: a computation phase followed by an update phase. Instead of repeatedly reading from shared memory, the application caches shared data in local memory, which results in a low access rate to shared memory. For this application the number of read memory accesses increases with the number of nodes in the system. The application has little sharing and requires little synchronization.


Figure 7: Data copied, number of messages, and coherence related communication for MATMULT

The graph in Figure 6 (left) shows that for coarse-grain applications a very small data block size, such as 128 bytes, actually increases the communication overhead and decreases the application's performance by increasing the idle time of the nodes. The system with 6, 10, 12 or 14 nodes shows the effect of imperfect work distribution among the nodes, but this effect is insignificant because of the little synchronization required by the application. The parallel execution time of the application on SNOW and UDSM is shown in the right graph of Figure 6; the graph shows that SNOW outperforms UDSM. The improvement for two-node systems is due to the low software access detection overhead of SNOW. With an increase in the number of nodes, the application's performance improves further because of the fixed ownership and data layout techniques used by SNOW. Other DSM systems use a dynamic ownership protocol, and for these systems the amount of communication increases with the number of nodes; most of the communication is coherence-related and is due to the dynamic ownership protocol. Figure 7 shows that UDSM does 39% more data copying than SNOW and sends 6 times more messages, and 77% of these messages are coherence related.

5.0.2 SOR I

The application is from the Munin [9] application suite and computes the successive over-relaxation algorithm for a problem size of 512x512 integers for 25 iterations. The computation is based on a scratch matrix approach where each node computes new values for each element assigned to it and stores the results in a scratch matrix; all nodes then synchronize at a barrier before starting the second phase. In the second phase, each node reads its updates from the scratch matrix and updates the shared memory. The application has the following characteristics: a high access rate to shared memory, pair-wise sharing among nodes, fixed granularity and frequent barrier synchronization.

The parallel execution graph in Figure 8 shows that the size of a data block is not significant for this application because of the high reuse of shared data. Each node is assigned a block of contiguous rows of the matrix, and the node shares only the boundary rows with its neighboring nodes.


Figure 8: Parallel execution times on SNOW and comparison with UDSM system for SOR I


Figure 9: Data copied, number of messages, and coherence related communication for SOR I


The graph also shows the effect of work imbalance and of the mismatch between the assigned work and the data layout; the mismatch and the imbalanced work degrade the application's performance on 12-node and 14-node systems. The comparison graph in Figure 8 shows that SNOW outperforms UDSM for this application even with an imbalanced work assignment; this improvement is because of the low access detection overhead due to the fixed ownership protocol. Figure 9 shows the amount of data copied and the communication required during the computation. The leftmost graph in the figure shows that the UDSM system copied 70% more data and sent 47% more messages than SNOW. The coherence-related communication of the UDSM system, due to the dynamic ownership protocol, was 17% of the total communication.

5.0.3 GAUSS

The application is from the TreadMarks [13] application suite and iteratively computes the Gaussian elimination method. All nodes synchronize with each other on a barrier after each iteration. The application has a high access rate to shared memory, and the computation granularity of the application decreases in each iteration.


Figure 10: Parallel execution times on SNOW and comparison with UDSM system for GAUSS


Figure 11: Data copied, number of messages, and coherence related communication for GAUSS

The parallel execution graph of the application in Figure 10 shows the effect of data block size.

The performance of the system with a small data block improves with an increase in the number of nodes, and the graph shows that, for 16 nodes, a system with a data block size of 128 bytes performs better than a system with a larger data block size. The comparison graph in the same figure shows that SNOW outperformed UDSM for all data block sizes, due to the lower overhead of software access detection. Figure 11 shows the amount of data copied and the number of messages sent over the network during the execution of this application on UDSM and SNOW. The leftmost graph in the figure shows that 39% more data was copied and five times more messages were sent during the execution of the application on UDSM as compared to the execution on SNOW. The coherence-related communication was 75% of the total communication on UDSM.

5.0.4 WATER

This application simulates a system of water molecules in the liquid state and is taken from the SPLASH suite [18]. The application has medium granularity, a high frequency of synchronization and a good reuse of shared data. We used a problem size of 128 molecules for two iterations.

The size of the data block becomes important for medium and fine grain applications. The execution graph in Figure 12 shows that a small data block size reduces the protocol processing overhead and reduces the execution time of the application. The application is sensitive to imbalance in the work assignment among the nodes, and the graph also shows the performance degradation due to work imbalance on systems with 10, 12 and 14 nodes. On a 16-node system, SNOW with a 1K data block size outperforms the UDSM system with a 0.5K data block size.


Figure 12: Parallel execution times on SNOW and comparison with UDSM system for WATER

The comparison graph in Figure 12 compares the application's performance on the SNOW and UDSM systems. SNOW outperforms UDSM if the work assigned to all nodes is balanced, but its performance for an imbalanced work load is worse than UDSM. The performance of SNOW with a 1K-byte data block is better than the performance of the UDSM system with a 0.5K-byte data block. In summary, SNOW outperforms UDSM for the 16-node system. The amount of data copied and the communication performed by the application on both systems are shown in Figure 13. The UDSM system copied six times more data and sent 4 times more messages than SNOW.


Figure 13: Data copied, number of messages, and coherence related communication for WATER

The coherence-related communication in the UDSM system is 62% of the total communication performed by the system during the execution of the algorithm.

6 Conclusion and Future Work

Our results show that the fixed ownership protocol on an RC DSM system can reduce the amount of data copied and the amount of data sent over the network. Contrary to intuition, the fixed ownership protocol reduces the total number of messages required during the computation. For an all-software system, the fixed ownership protocol has lower access control overhead than the dynamic ownership protocol, and it also reduces the need for coherence-related communication. The separation of the read cache and the write-buffer at a node reduces the protocol processing overhead. Coarse-grain applications usually have high reuse of shared data.

An imbalanced work load and a mismatch between the assigned work and the assigned data can degrade an application's performance. Currently SNOW distributes the shared data on a round-robin basis, but an application could benefit from compile-time analysis for efficient and automatic data initialization. In this implementation, SNOW eagerly invalidates the local read cache on release operations; we would like to evaluate the effect of selective invalidation of the read-cache.

Acknowledgements. I would like to thank my advisor Dr. Abdelsalam Heddaya for his help and guidance. I would also like to thank Ms. Ghazaleh Nahidi for proof-reading an early draft of the paper.

References

[1] Thomas E. Anderson, Susan S. Owicki, James B. Saxe, and Charles P. Thacker. High Speed Switch Scheduling for Local Area Networks. Technical Report UCB CSD-94-803, Computer Science Division, University of California, Berkeley, CA 94720, 1994.

[2] Anindya Basu, Vineet Buch, Werner Vogels, and Thorsten von Eicken. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Technical report, Department of Computer Science, Cornell University, 1995.

[3] Brian N. Bershad, Matthew J. Zekauskas, and Wayne A. Sawdon. The Midway Distributed Shared Memory System. In 1993 IEEE CompCon Conference, pages 528-537, February 1993.

[4] Arif M. Bhatti. Data Block Size, Software Access Control and Release Consistent Distributed Shared Memory on Reliable Networks. Technical report, Boston University, 1997. Submitted to PDPTA'97, also available at http://www.cs.bu.edu/students/grads/tahir/pdpta97.ps.


[5] Arif M. Bhatti. Evaluation of All-Software Conventional Distributed Shared Memory on NOWs based on High Speed Networks. In Cluster Computing Conference, March 1997.

[6] Arif M. Bhatti. Evaluation of Distributed Shared Memory Systems for Network of Workstation Without any Hardware Assistance. PhD thesis, Boston University, 1997. (In preparation).

[7] R. Bianchini, L. I. Kontothanassis, R. Pinto, M. De Maria, M. Abud, and C. L. Amorim. Hiding Communication Latency and Coherence Overhead in Software DSMs. In 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.

[8] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl. Proteus: A High-Performance Parallel-Architecture Simulator. In ACM SIGMETRICS and PERFORMANCE Conference, June 1992.

[9] John B. Carter. Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency. PhD thesis, Rice University, Houston, Texas, September 1993.

[10] A. Fahmy and A. Heddaya. BSPk: Low Overhead Communication Construct and Logical Barriers for Bulk Synchronous Parallel Programming. Bulletin of IEEE TCOS, 8(2):27-32, August 1996.

[11] Edward W. Felten, Richard D. Alpert, Angelos Bilas, Matthias A. Blumrich, Douglas W. Clark, Stefanos Damianakis, Cezary Dubnicki, Liviu Iftode, and Kai Li. Early Experience with Message-Passing on the SHRIMP Multicomputer. Technical Report TR-510-96, Department of Computer Science, Princeton University, 1996.

[12] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-Performance All-Software Distributed Shared Memory. In Fifteenth Symposium on Operating Systems Principles, December 1995.

[13] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Technical Report Rice COMP TR93-214, Department of Computer Science, Rice University, P. O. Box 1892, Houston, Texas 77251-1892, November 1993.

[14] J. William Lee. Concord: Re-Thinking the Division of Labor in a Distributed Shared Memory System. Technical Report 93-12-05, Department of Computer Science and Engineering, University of Washington, December 1993.

[15] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[16] Daniel J. Scales, Kourosh Gharachorloo, and Chandramohan A. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.

[17] Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, James R. Larus, and David A. Wood. Fine-Grain Access Control for Distributed Shared Memory. In ASPLOS-VI. ACM, October 1994.

[18] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford Parallel Applications for Shared Memory. Computer Architecture News, pages 5-44, March 1992.

[19] Chandramohan A. Thekkath. System Support for Efficient Network Communication. Technical Report 94-07-02, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, July 1994.
