SNOW: A State-Less, Low-Overhead, All-Software Distributed Shared Memory System

Arif M. Bhatti
[email protected]
Computer Science Department Boston University
March 31, 1997

Abstract
Research in user-level communication and the recent increase in network speeds require a reevaluation of traditional distributed shared memory (DSM) protocols for workstation networks. Our experience with a release consistency (RC) DSM system shows that the system has: (1) high protocol processing overhead, (2) high coherence-related overhead, and (3) high software access control overhead. To reduce these overheads, in this paper we present a state-less DSM system based on the following features: (1) fixed ownership, (2) data layout, (3) a read-only data cache, (4) a write-buffer for updates, and (5) a hybrid coherence protocol. Our results were obtained by executing a set of four benchmark applications that we ported to run on the SNOW DSM system. The system executes the applications on a simulated architecture, and the applications differ from each other in sharing pattern, computation granularity, and amount of synchronization. In this paper we show an execution graph and a comparison graph for each application. We compared the performance of SNOW with a traditional update-based system; our results show that the traditional system can copy up to 6 times more data and can send up to 5 times more communication messages. The performance gains come from the fixed ownership protocol, which reduces the software access detection, write-shared protocol processing, and consistency-related communication overheads.
Keywords: distributed shared memory, parallel programming, NOW
Contact Address: 111 Cummington Street, Boston, MA 02215, voice: 617-353-5227, fax: -6457
1 Introduction

The popularity of the shared memory programming model for parallel programs is due to its simple interface, which hides all architectural details and allows programmers to concentrate on algorithmic details. A multiprocessor system may contain multiple memory modules, but for the shared memory model the system provides programmers with the abstraction of a single memory module; the system supports this abstraction in hardware by providing a global address space and coherent caches. Distributed shared memory (DSM) is an abstraction that supports the shared memory programming model in software on distributed memory (DM) machines, which do not provide a global address space in hardware. Each node in a DM machine has a private address space, and an access to a memory module of a remote node requires an exchange of messages with that node. The DSM abstraction on a distributed memory machine is implemented by a runtime system that supports the global address space in software. A DSM system can be based entirely on a runtime system [13, 12]. The use of special compilers [9, 16, 3, 14] or dedicated hardware [7] for the DSM abstraction, in addition to the runtime system, changes the design parameters.

Most DSM systems treat the local memory of nodes as coherent caches and detect shared memory accesses in node MMUs. Access detection in the MMU is atomic and costs zero processing cycles, but it binds the data block size to the hardware page size, which restricts the DSM system to coarse-grain applications. This technique has two drawbacks: large data block size and interaction with the operating system. The large data block size problem can be solved by separating the data block from the hardware page; there are two possible options with different tradeoffs. First, implement the DSM system based on a relaxed consistency (RC) model and use a significantly more complex programming interface. This approach requires a significant amount of work from programmers and is against the basic philosophy of shared memory as a simple interface. Second, implement the access control mechanism in software instead of using the MMU; this approach has a fixed extra cost for each shared memory access, and the cost depends upon the consistency model and the communication mechanism used.

To compute medium- and fine-grain applications on DSM systems, recent research trends are toward all-software systems based on sequential consistency (SC) [17, 16] and relaxed consistency models such as release consistency [12]. Traditional release consistent systems have high overhead due to protocol processing and coherence-related communication. For all-software implementations, these systems also have a high software access detection cost due to the dynamic ownership protocol. This paper presents the design and performance of SNOW, a state-less DSM system that uses fixed ownership, data layout, and a hybrid coherence protocol to reduce the above-mentioned overheads.

Section 2 discusses the motivation and background, while Section 3 explains the design of SNOW, including its data structures, synchronization primitives, and coherence protocols. Section 4 briefly describes the experimental methodology and the implementation of the runtime system. Section 5 presents the results for selected applications, and the last section summarizes the paper and discusses future research directions.
2 Motivation and Background

Traditional DSM systems treat the local memory of nodes as coherent caches. These systems use a directory at each node that maintains the state of all locally mapped pages, and they use a distributed ownership protocol as part of the coherence mechanism. Early SC-based DSM systems like IVY [15] did not receive much popularity due to the large data block size, the false sharing problem, and the strict orderings of the consistency model.
Figure 1: SNOW shared memory (left) and shared memory in DSM systems (right)

Relaxed memory consistency models, such as RC, have received much more popularity because they impose strictly fewer ordering restrictions. Munin [9], a release consistent DSM system, introduced the write-shared protocol that allows multiple nodes to concurrently update the same data block. This technique reduces the impact of the false sharing problem but may send more data due to the large data block size. A page-based DSM system treats a part of the virtual address space as shared and detects shared memory accesses in node MMUs. At each shared memory access the system performs a lookup on the state of the page that contains the referenced data object and invokes an action if the page is not in an appropriate state. The use of MMUs to detect shared memory accesses forces the data block size to be the hardware page size, but a large data block size may degrade the system's performance. Selecting a smaller data block size for a DSM system requires more assistance from programmers, which may change the programming model [3, 14, 12]. Implementing an access control mechanism [17] in software, without any assistance from node MMUs, allows a DSM system to use a data block smaller than a hardware page. The memory lookups in software cost a fixed number of CPU cycles on each memory access, while this cost is zero cycles in a DSM system that performs memory lookups in the MMU.

Slow networks, such as Ethernet, due to their shared nature have high protocol processing overhead, unreliable communication, and packet collisions. These networks provide low bandwidth, and message latency increases with load and with the number of nodes in the network. Sending a large message over a slow network is almost as expensive as sending a small message, due to the high communication latency and protocol processing overhead [9]. New local area networks, such as ATM and AN2 [1], provide high speed, high bandwidth, and low communication latency. These networks perform point-to-point communication and can deliver small messages with low latency; they are reliable and can deliver messages in order. Recent research in user-level communication [2, 19, 10] on these networks has made them more interesting for parallel computing. Existing DSM protocols are designed for slow networks; the new network technology and user-level communication require a re-evaluation of these protocols, and our results show that new DSM protocols designed for point-to-point networks can outperform the traditional ones.

We evaluated the effect of data block size on the performance of an SC invalidation-based system, CDSM [5], and an RC write-shared update-based system, UDSM [4]. A directory protocol is
used to maintain the state of the shared data blocks at each node, and a distributed ownership protocol is used to migrate or replicate the shared data blocks in local memories. The dynamic ownership protocol reduces remote access latency by changing the ownership of a data block, according to its usage, as dictated by the coherence protocol. For software access detection, this state information, the dynamic ownership protocol, and the coherence protocol can influence the cost of memory lookups. The dynamic ownership protocol can also have a high communication cost for locating a page because the ownership changes dynamically. For fine-grain computation, update-based write-shared DSM systems increase the protocol processing overhead required to maintain the state information for updates; the write-shared protocol removes the restriction on concurrency, but it has to detect the updated words in a data block in order to merge them with updates performed by other nodes on the same block (illustrated in the sketch below). The distributed ownership protocol may even increase the number of communication messages, because of falsely shared data blocks [4] and ownership transfers.

The goal of this research is to reduce the number of stale updates, reduce the coherence-related communication by not maintaining the state information, and reduce the software access control overhead by not performing lookups on the state information. In this paper we present SNOW, an RC-based state-less DSM system that supports a conventional programming model like Munin [9]. The coherence protocol of SNOW, unlike that of other DSM systems, uses a fixed ownership protocol and a data layout technique. SNOW is designed to take advantage of user-level communication, the availability of reliable networks, and the ability of a node to process communication messages in order of arrival.
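To make concrete the write-shared protocol processing that SNOW avoids, the following is a minimal sketch of the twin-and-diff update detection used by traditional write-shared protocols such as Munin and UDSM; the helper names, block size, and word size are illustrative assumptions, not code from either system.

/* Sketch (not part of SNOW): how a traditional write-shared protocol
   detects the words a node has modified in a data block.  On the first
   write to a block the runtime saves a "twin" copy; at a release it
   compares the block with the twin word by word and transmits only the
   differing words (the "diff") to the other copies of the block. */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define BLOCK_WORDS 128                     /* assumed 512-byte block */

typedef struct { uint32_t addr, value; } diff_entry_t;

/* Called on the first write fault to a write-shared block. */
uint32_t *make_twin(const uint32_t *block)
{
    uint32_t *twin = malloc(BLOCK_WORDS * sizeof(uint32_t));
    memcpy(twin, block, BLOCK_WORDS * sizeof(uint32_t));
    return twin;
}

/* Called at a release: collect the words that differ from the twin. */
size_t make_diff(const uint32_t *block, const uint32_t *twin,
                 uint32_t base_addr, diff_entry_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < BLOCK_WORDS; i++) {
        if (block[i] != twin[i]) {
            out[n].addr  = base_addr + i * sizeof(uint32_t);
            out[n].value = block[i];
            n++;
        }
    }
    return n;                               /* updated words to send */
}

The per-block copy and the word-by-word comparison at every release are the protocol processing costs, and the per-block directory state is the bookkeeping, that the fixed-ownership design described next tries to eliminate.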
3 Design and Architecture of SNOW

SNOW provides the abstraction of a global shared address space in software on a distributed memory architecture. The system uses a fixed-owner approach and distributes the shared address space among the participating nodes as shown in Figure 1 (left), in contrast to traditional systems where each node may maintain state for the entire shared address space as shown in Figure 1 (right). The SNOW architecture supports a shared address space of NxM bytes, where N is the number of nodes in the system and M is the shared address space at each node; N and M are exact powers of two. Each node maps its part of the shared address space to a contiguous segment of its local memory.

A software access control mechanism requires that all shared memory accesses be visible to the runtime system. The shared address space can be a part of the virtual address space for software access control, but a compiler [16, 3] or an editing tool [17] is then needed to insert access detection code before each shared memory access. We do not use any special compiler or editing tool to insert the access detection code; instead we rely completely on the runtime system and expect that a pre-processor will replace shared memory accesses with the appropriate memory access routines. Instead of treating a part of the virtual address space as the shared address space like traditional DSM systems, we map the shared address space on top of the virtual address spaces of the nodes.

The main features of SNOW are: fixed ownership and distribution of the shared address space among nodes, data layout, read-only cached data, write-buffers for updates, eager propagation of updates, and implicit invalidation of cached data at the release synchronization operation. These features reduce the processing overhead of write-shared protocols and also reduce the cost of shared memory access detection in software. SNOW does not treat the local memory of nodes as coherent caches like traditional systems. Unlike other write-shared systems [9, 4, 13], SNOW does not make a copy of a data block and does not compute diffs. Instead, SNOW keeps the cache in read-only mode and stores updates in a separate write buffer.
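The fixed-ownership mapping can be computed directly from a shared address. The following is a minimal sketch under assumed names and sizes (LOG2_M, owner_of, offset_of are ours, not SNOW's API): with N nodes and M bytes of shared memory per node, both powers of two, the owner node and the offset within its segment fall out of a shift and a mask, so no directory lookup or ownership state is needed.

#include <stdint.h>

#define LOG2_M 20                      /* assumed M = 1 MB of shared memory per node */
#define M      (1u << LOG2_M)

typedef uint32_t shared_addr_t;        /* address in the NxM-byte shared space */

static inline unsigned owner_of(shared_addr_t a)
{
    return a >> LOG2_M;                /* node that owns this address */
}

static inline uint32_t offset_of(shared_addr_t a)
{
    return a & (M - 1);                /* offset inside the owner's segment */
}

static inline int is_local(shared_addr_t a, unsigned my_node)
{
    return owner_of(a) == my_node;     /* local accesses need no messages */
}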
Figure 2: The read cache (a), the write buffer (b)

The runtime system at each node maintains a shared read-cache of fixed size to reduce remote communication and to offset the effect of remote access latency. The system is parameterized, and a user, before running a parallel program on this system, can select the appropriate data block size and number of blocks at each node for the program. This organization differs from traditional systems, which treat the local memory of a node as a coherent cache and at any moment allow a node to cache all of shared memory. The read cache is implemented as a hash table, with the shared address as the key of the hash function; the organization of the read-cache is shown in Figure 2(a). Each data block in the cache has three attributes: C-address, C-size, and C-data. The C-data attribute stores the cached data block, and the C-address attribute records the shared address of the block. The C-size attribute plays a dual role: if a cached data block is valid, it contains the size of the block; otherwise it contains a negative number to indicate that the block holds inconsistent data. The read-cache has one additional attribute, C-lock, which is used by the cache coherence protocol; its usage is explained in the coherence protocol section. The read-cache at a node is used for read-only purposes by the computation thread running at the node, while all updates performed by the node are recorded in a write buffer. A node can drop any data block from its local read-cache without any communication with other nodes.

The runtime system propagates local updates to the node that is responsible for the updated shared memory. The write-buffer, as shown in Figure 2(b), also maintains some state information in addition to the updates performed by the node. The buffer can store a fixed number of words in its data field, and the W-bitmap attribute maintains one bit for each updated word in the buffer. The W-type attribute records the data type of the updates; this information is used by the runtime system to propagate local updates. The W-node attribute keeps track of the owner node for the updated shared memory; this owner will eventually receive the updates. The W-address attribute maintains the shared address of the first word in the write-buffer. The mechanism for propagating the updates in the write-buffer is discussed later in the coherence protocol section.
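A minimal sketch of the two per-node structures, using the attribute names of Figure 2; the field widths and the buffer size are assumptions for illustration (a 128-byte buffer holds 32 four-byte words, matching the configuration used in Section 5).

#include <stdint.h>

#define WBUF_WORDS 32                  /* assumed: 128-byte write buffer of 4-byte words */

typedef struct {                       /* one entry of the read cache, Figure 2(a) */
    uint32_t c_address;                /* shared address of the cached block       */
    int32_t  c_size;                   /* block size, or negative if inconsistent  */
    int32_t  c_lock;                   /* lock recorded at a release (Section 3.2) */
    uint8_t *c_data;                   /* the cached, read-only data               */
} cache_block_t;

typedef struct {                       /* the write buffer, Figure 2(b)            */
    uint32_t w_address;                /* shared address of the first word         */
    uint32_t w_node;                   /* owner node that will receive the updates */
    uint32_t w_type;                   /* data type of the buffered updates        */
    uint32_t w_bitmap;                 /* one bit per updated word                 */
    uint32_t w_data[WBUF_WORDS];       /* the buffered update values               */
} write_buffer_t;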
3.1 Synchronization
In the shared memory model, parallel programs require synchronization primitives to control and order memory accesses by different nodes, and most DSM systems implement lock and barrier synchronization primitives. SNOW implements the lock primitive with a distributed queue-based algorithm. A thread at a node acquires a lock by invoking the acquire lock operation and can enter its critical section if the lock is available at the local node. If the lock is not available locally, the operation sends a request to the lock owner, suspends the computation thread, and transfers control to the local scheduler. Each node in the system has a lock request server, a user-defined interrupt handler, to handle all lock requests from remote nodes. On each lock request, the server chooses one of the following: service the request, insert the requesting node into the lock queue, or forward the request to the current owner of the lock. A thread invokes the release lock operation to exit from its critical section; the operation, after performing some coherence-related actions, sends a lock-reply to the next node waiting for the lock. The runtime system at a node, after receiving a lock reply, resumes the suspended thread.

The barrier primitive is implemented in a centralized fashion: a barrier-manager collects requests from the participating nodes. A thread enters a barrier by calling the barrier operation. The operation, after performing some coherence-related actions, sends a request to the barrier-manager and transfers control to the local scheduler. The barrier-manager waits for one request from each of the participating nodes, and after collecting all requests it sends a barrier-reply message to the nodes. The runtime system at each node, after receiving the barrier-reply message, resumes the suspended computation thread. A sketch of both primitives follows.
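The following sketch illustrates the control flow of the two primitives with assumed runtime helpers (send_msg, suspend_thread, flush_write_buffer, invalidate_read_cache) and assumed message types; it is an illustration of the description above, not the SNOW implementation. The coherence actions performed at release and barrier are explained in the next section.

extern int  MY_NODE, BARRIER_MANAGER;
extern void send_msg(int node, int type, int arg);
extern void suspend_thread(void);            /* resumed when a reply arrives   */
extern void flush_write_buffer(void);        /* coherence action (Section 3.2) */
extern void invalidate_read_cache(void);     /* coherence action (Section 3.2) */
enum { LOCK_REQUEST, LOCK_REPLY, BARRIER_REQUEST, BARRIER_REPLY };

typedef struct {
    int available_here;   /* lock token is at this node and not in use   */
    int probable_owner;   /* node to which remote requests are forwarded */
    int next_waiter;      /* next node in the distributed queue, or -1   */
} lock_t;

void acquire_lock(lock_t *l)
{
    if (l->available_here) {                 /* lock available locally      */
        l->available_here = 0;               /* enter the critical section  */
        return;
    }
    send_msg(l->probable_owner, LOCK_REQUEST, MY_NODE);
    suspend_thread();                        /* the lock request server at
                                                the owner queues or replies */
}

void release_lock(lock_t *l)
{
    flush_write_buffer();                    /* propagate remaining updates */
    if (l->next_waiter >= 0) {
        invalidate_read_cache();             /* lock leaves this node       */
        send_msg(l->next_waiter, LOCK_REPLY, MY_NODE);
        l->next_waiter = -1;
    } else {
        l->available_here = 1;               /* keep the lock cached locally */
    }
}

void barrier(void)
{
    flush_write_buffer();                    /* push all buffered updates   */
    invalidate_read_cache();
    send_msg(BARRIER_MANAGER, BARRIER_REQUEST, MY_NODE);
    suspend_thread();                        /* resumed on BARRIER_REPLY    */
}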
3.2 Coherence Protocol
The protocol is based on fixed ownership instead of the dynamic distributed ownership used by other DSM systems. Because of the fixed ownership there is no message forwarding for read or update requests, which reduces the number of messages required to locate a shared page. We will show that the forwarded requests caused by a dynamic ownership protocol are a significant fraction of all communication. SNOW is based on the release consistency model [6] and maintains the consistency of shared memory with a hybrid coherence protocol. The shared address space of SNOW on an n-node system is divided into n segments, and each node is the owner of exactly one segment. A node that updates its local segment does not send its updates to any other node, but it forwards all updates on a remote segment to the owner node. Unlike update protocols on RC DSM systems, SNOW does not wait for the next release operation to propagate its updates. SNOW maintains a small write-buffer for updates and propagates the updates when the buffer is full. The buffer is considered full if it cannot accommodate a new update or if the thread has invoked the release operation to exit its critical section. At the release lock operation, a node virtually invalidates its local read-cache by recording the lock in the C-lock attribute of the read-cache. If the node re-acquires the lock without transferring it to another node, the node considers the read-cache valid and continues reading from it. If the lock is requested by another node, the local node invalidates the read-cache before servicing the request. A node always implicitly invalidates its local read-cache on a barrier request, after flushing the write-buffer.

Figure 3 explains the interaction of synchronization and the cache coherence protocol. In Figure 3(a), node B sends a request to node A. Node A, before exiting its critical section by invoking the release lock operation, sends its remaining updates to the home node (owner) and invalidates its local read-cache before sending a lock-reply to node B.
Figure 3: Relationship of updates and synchronization requests

Node B, after acquiring the lock, may request some data from the home node. In-order delivery of messages and in-order processing of delivered messages guarantee that the home node will apply the updates to its local shared memory before servicing any request from node B; this sequence of messages ensures that node B gets the most up-to-date values from shared memory. Figures 3(b) and (c) show the coherence-related actions at a barrier request. A node enters a barrier by invoking the barrier operation; the operation sends all updates in the local write-buffer to the owner node, invalidates the local read-cache, and sends a request to the barrier-manager. The operation suspends the thread that invoked it and transfers control to the local scheduler. In-order delivery by the network and in-order processing by the nodes guarantee that each node will apply all its updates to local shared memory before processing the barrier-reply message issued by the barrier-manager.
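The update path can be summarized in a short sketch. The helpers and the message type below are assumed names, and the W-type attribute is omitted for brevity: a write to the local segment is applied directly, while a write to a remote segment is recorded in the write-buffer, which is flushed to the owner when it targets a different owner, cannot accommodate the new word, or (as described above) at a release or barrier.

#include <stdint.h>

extern int  MY_NODE;
extern void send_msg(int node, int type, const void *buf, int len);
extern unsigned owner_of(uint32_t shared_addr);        /* fixed ownership mapping */
extern void write_local_segment(uint32_t addr, uint32_t value);
enum { WRITE_REQUEST };

#define WBUF_WORDS 32
typedef struct {
    uint32_t w_address, w_node, w_bitmap;
    uint32_t w_data[WBUF_WORDS];
} write_buffer_t;

static write_buffer_t wbuf;

void flush_write_buffer(void)              /* also invoked at release and barrier */
{
    if (wbuf.w_bitmap != 0)
        send_msg(wbuf.w_node, WRITE_REQUEST, &wbuf, sizeof wbuf);
    wbuf.w_bitmap = 0;
}

void shared_write(uint32_t addr, uint32_t value)
{
    unsigned owner = owner_of(addr);
    if (owner == MY_NODE) {                       /* local segment: no messages */
        write_local_segment(addr, value);
        return;
    }
    uint32_t slot = (addr - wbuf.w_address) / sizeof(uint32_t);
    if (wbuf.w_bitmap != 0 &&
        (owner != wbuf.w_node || slot >= WBUF_WORDS))
        flush_write_buffer();                     /* wrong owner or buffer full */
    if (wbuf.w_bitmap == 0) {                     /* (re)anchor an empty buffer */
        wbuf.w_address = addr;
        wbuf.w_node = owner;
        slot = 0;
    }
    wbuf.w_data[slot] = value;                    /* record the update          */
    wbuf.w_bitmap |= 1u << slot;
}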
4 Methodology

We use execution-driven simulation to study the DSM abstraction on a network of workstations; the simulated architecture is based on the Proteus architecture simulator [8]. The simulated parallel architecture is a collection of homogeneous nodes, as shown in Figure 4 (left). Nodes in the system are connected by a high-speed local area network. Like a workstation, each node in the system has a processor, a cache, a memory module, and a network interface. The system does not have any special hardware such as a protocol processor [7] or a memory-mapped network interface [11].

A parallel program has several computation threads that execute concurrently to perform the required computation. A node in a parallel system may have multiple computation threads, and a scheduler module is needed at the node to ensure orderly execution of these threads. A computation thread executes until its completion, and no other thread can preempt an executing thread. A node may block a computation thread on a memory fault or on a call to a synchronization operation. A memory fault occurs when a thread accesses a data object that is not present in local memory; the faulting thread sends a message to a remote node and transfers control to the local scheduler.

The runtime system of SNOW has several user-defined interrupt-handlers, or servers, on each node to handle all communication messages. A server executes until its completion to service a request sent by a remote node. A server always executes in its critical section, and no other server can interrupt the execution of the currently executing server. A node records all new messages that arrive during the execution of a server in a local interrupt-queue.
Figure 4: System architecture (left) and one-way communication cost for a message (right)

After completion of an executing server, the node executes the server for the next message in the queue. If there is no message in the queue, the node resumes the interrupted computation thread. The servers are state-less and, on interrupts, execute in the node's address space to reduce state-saving overhead. A sketch of this message-handling discipline follows.
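A minimal sketch of this discipline, with assumed helper names; it illustrates the behaviour of the simulated runtime described above and is not simulator code.

typedef struct msg msg_t;                       /* an arrived message (opaque)  */
extern msg_t *dequeue_interrupt(void);          /* NULL when the queue is empty */
extern void   run_server_for(msg_t *m);         /* appropriate server; runs to
                                                   completion, no preemption    */
extern void   resume_computation_thread(void);

void handle_interrupt(msg_t *first)
{
    msg_t *m = first;
    while (m != NULL) {                         /* in-order processing          */
        run_server_for(m);                      /* messages arriving meanwhile
                                                   are queued, not dispatched   */
        m = dequeue_interrupt();
    }
    resume_computation_thread();                /* only when the queue is empty */
}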
4.1 System Architecture
The simulated architecture is a distributed memory machine in which each node has its own private memory and can access its local memory without any network communication. To access remote memory, a node must send a request message, and the message causes an interrupt at the remote node. Upon receiving an interrupt, the node invokes a user-defined interrupt-handler, or server, to service the request. The server processes the request according to the coherence protocol of SNOW and sends a reply message to the requesting node. The simulated system does not have any hardware support for a global address space; the runtime system supports this abstraction in software. Nodes of the system are connected by a point-to-point high-speed LAN switch, and inter-node communication is done via message passing. The simulated network is reliable and delivers packets in order, and each node processes all incoming messages in the order of their arrival. Figure 4 (right) shows the factors involved in the one-way communication latency of a message from one node to another. The simulator assumes a fixed one-way communication latency of 700 processing cycles.
4.2 Runtime System
In this research, instead of keeping a large data block size and changing the programming model, we chose a conventional programming model similar to the one used by Munin [9]. The runtime system implements the access control mechanism in software and, instead of treating a part of the local address space as shared, maps the shared address space on top of the local address spaces of the nodes; it requires that all shared memory accesses be visible to the system. There is one computation thread and three servers at each node. The computation thread executes the code of the application program, while the servers implement the cache coherence protocol of the system and deal with all consistency-related communication. Our simulated architecture has a reliable network that delivers packets in order, and nodes process the received messages in order of their arrival.

Traditional DSM systems usually initialize all shared data blocks at the root node (node 0), and a dynamic ownership protocol distributes these data blocks to other nodes according to the memory access pattern of the application program. SNOW uses the fixed ownership protocol and uses the data layout technique to distribute shared data among the nodes. For example, the SNOW runtime system allocates a two-dimensional matrix in shared memory such that each node receives an equal number of rows, as in the sketch below. A node can perform a memory access to its part of shared memory without causing any communication activity.
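A minimal sketch of this layout rule under assumed names (shared_alloc_on is hypothetical): each node's band of rows is placed in that node's own segment, so the owner of a row is a simple division. The actual distribution in our experiments is round-robin (Section 5); this sketch only illustrates the idea.

#include <stdint.h>

extern uint32_t shared_alloc_on(unsigned node, uint32_t nbytes);  /* hypothetical */

/* Allocate an n x n matrix of doubles so that each of nnodes owns an
   equal band of rows inside its own shared-memory segment. */
unsigned matrix_alloc(unsigned nnodes, unsigned n, uint32_t row_base[])
{
    unsigned rows_per_node = n / nnodes;        /* assume nnodes divides n */
    for (unsigned node = 0; node < nnodes; node++)
        row_base[node] = shared_alloc_on(node,
                             rows_per_node * n * (uint32_t)sizeof(double));
    return rows_per_node;
}

/* Owner of row r under this layout: accesses by that node stay local. */
unsigned row_owner(unsigned r, unsigned rows_per_node)
{
    return r / rows_per_node;
}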
Figure 5: Runtime servers to handle remote communication

The core of the runtime system is based on user-defined servers, and the servers handle all consistency-related communication among the nodes. Figure 5 shows the servers of the SNOW runtime system and the protocol actions involved in servicing remote requests; dotted boxes represent nodes, rectangles show servers, and ovals show the actions involved in remote communication. On a read access to a shared address, the runtime system first tries to read from local shared memory; if the referenced data is not in local shared memory, it tries to read from the local read-cache; if the request cannot be serviced locally, the read operation sends a request to the owner of the referenced data. In SNOW there are two servers, the read request server and the read reply server, that deal with read-related communication among the nodes. On a write access to a shared address, the system either updates local shared memory or stores the new value in the local write-buffer. Storing a new value in the local write-buffer may require that the system first propagate the current updates. The write request server receives all update requests and applies the updates to local shared memory.
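The read path just described can be sketched as follows, again with assumed helper names (owner_of, cache_lookup, and the reply-driven cache_fill_and_read are ours); the actual SNOW routines may differ.

#include <stdint.h>

extern int      MY_NODE;
extern unsigned owner_of(uint32_t shared_addr);           /* fixed ownership     */
extern uint32_t read_local_segment(uint32_t addr);
extern int      cache_lookup(uint32_t addr, uint32_t *v); /* 1 on read-cache hit */
extern void     send_msg(int node, int type, uint32_t addr);
extern void     suspend_thread(void);               /* resumed by read reply server */
extern uint32_t cache_fill_and_read(uint32_t addr); /* after the reply is cached    */
enum { READ_REQUEST };

uint32_t shared_read(uint32_t addr)
{
    uint32_t value;

    if (owner_of(addr) == MY_NODE)                 /* local segment: no messages */
        return read_local_segment(addr);

    if (cache_lookup(addr, &value))                /* hit in the read-only cache */
        return value;

    send_msg(owner_of(addr), READ_REQUEST, addr);  /* miss: ask the owner node   */
    suspend_thread();                              /* the read reply server fills
                                                      the cache and resumes us   */
    return cache_fill_and_read(addr);
}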
5 Results

The application suite and the runtime system of SNOW are discussed in detail in [6]; this section presents the performance of selected applications. SNOW allows programmers to choose the size and the number of data blocks for the read-cache at each node, and a programmer can also choose the size of the write-buffer according to the granularity of the application. To study the effect of data block size on application performance, we selected data block sizes of 0.125K, 0.5K, and 1.0K bytes. For all the results in this paper, we use a read-cache of 128 data blocks and a write-buffer of 128 bytes at each node. In addition to the performance of SNOW for different data block sizes, we also compare its performance with a write-shared DSM system, a variant of Munin [9], that maintains memory consistency with an update-based dynamic ownership protocol. The performance and implementation details of UDSM are discussed in [6, 4]; our results showed that UDSM has high software access detection overhead, high coherence-related overhead, and high protocol processing overhead for medium-to-fine-grain applications. The goal of our research is to reduce the coherence-related overhead, which involves the communication required by the dynamic ownership protocol to maintain the state information at each node. We also want to reduce the protocol processing overhead caused by the write-shared protocol for fine-grain applications, and the software access control overhead. For each application we present two graphs: the first shows the effect of data block size on the parallel execution time of the application, while the second compares the application performance on the SNOW and UDSM systems.
Figure 6: Parallel execution times on SNOW and comparison with UDSM system for MATMULT

An imperfect distribution of work among the nodes may degrade the performance of an application. SNOW uses a fixed-owner protocol instead of the dynamic ownership protocol used by traditional DSM systems, and the fixed ownership protocol requires that either the programmer or the compiler distribute the shared data among the participating nodes. A mismatch between the data layout and the work assigned to a node may increase the processing overhead, which involves extra communication and data copying. Currently, in the absence of compiler support, we distribute the shared data among the participating nodes on a round-robin basis.
5.0.1 MATMULT
This application computes the product of two matrices and stores the result in a third matrix, for a problem size of 256x256 doubles. The algorithm completes in two phases: a computation phase followed by an update phase. The application, instead of repeatedly reading from shared memory, caches shared data in local memory, which results in a low access rate to shared memory. Also, for this application the number of read memory accesses increases with the number of nodes in the system. The application has little sharing and requires little synchronization.
Figure 7: Data copied, number of messages, and coherence related communication for MATMULT

The graph in Figure 6 (left) shows that for coarse-grain applications a very small data block size, such as 128 bytes, actually increases the communication overhead and decreases the application performance by increasing the idle time of nodes. The systems with 6, 10, 12, or 14 nodes show the effect of imperfect work distribution among the nodes, but this effect is insignificant because of the application's small synchronization requirement. The parallel execution time of the application on SNOW and UDSM is shown in the right graph of Figure 6; the graph shows that SNOW outperforms UDSM. The improvement for two-node systems is due to the low software access detection overhead of SNOW. With an increase in the number of nodes, the application performance improves further because of the fixed ownership and data layout techniques used by SNOW. Other DSM systems use a dynamic ownership protocol, and for these systems the amount of communication increases with the number of nodes; most of the communication is coherence related and is due to the dynamic ownership protocol. Figure 7 shows that UDSM does 39% more data copying than SNOW and sends 6 times more messages, 77% of which are coherence related.
5.0.2 SOR I

The application is from the Munin [9] application suite and computes the successive over-relaxation algorithm for a problem size of 512x512 integers over 25 iterations. The computation is based on a scratch-matrix approach in which each node computes new values for each element assigned to it and stores the results in a scratch matrix; then all nodes synchronize at a barrier before starting the second phase. In the second phase, each node reads its updates from the scratch matrix and updates the shared memory. The application has the following characteristics: a high access rate to shared memory, pair-wise sharing among nodes, fixed granularity, and frequent barrier synchronization.

The parallel execution graph in Figure 8 shows that the size of a data block is not significant for this application because of the high reuse of shared data. Each node is assigned a block of contiguous rows of the matrix, and the node shares only the boundary rows with its neighboring nodes. The graph also shows the effect of work imbalance and of the mismatch between the assigned work and the data layout.
Figure 8: Parallel execution times on SNOW and comparison with UDSM system for SOR I
Figure 9: Data copied, number of messages, and coherence related communication for SOR I
The mismatch and the work imbalance degrade the application performance on 12-node and 14-node systems. The comparison graph in Figure 8 shows that SNOW outperforms UDSM for this application even with an imbalanced work assignment; this improvement is because of the low access detection overhead due to the fixed ownership protocol. Figure 9 shows the amount of data copied and the communication required during the computation. The leftmost graph in the figure shows that the UDSM system copied 70% more data and sent 47% more messages than SNOW. The coherence-related communication of the UDSM system, due to the dynamic ownership protocol, was 17% of its total communication.
5.0.3 GAUSS

The application is from the TreadMarks [13] application suite, and the algorithm iteratively computes the Gauss elimination method. All nodes synchronize with each other on a barrier after each iteration. The application has a high access rate to shared memory, and the computation granularity of the application decreases in each iteration.
Figure 10: Parallel execution times on SNOW and comparison with UDSM system for GAUSS
Figure 11: Data copied, number of messages, and coherence related communication for GAUSS

The parallel execution graph of the application in Figure 10 shows the effect of data block size. The performance of the system with a small data block improves with an increase in the number of nodes, and the graph shows that, for 16 nodes, a system with a data block size of 128 bytes performs better than a system with a larger data block size. The comparison graph in the same figure shows that SNOW outperformed UDSM for all data block sizes due to its lower software access detection overhead. Figure 11 shows the amount of data copied and the number of messages sent over the network during the execution of this application on UDSM and SNOW. The leftmost graph in the figure shows that 39% more data was copied and five times more messages were sent during the execution of the application on UDSM as compared to the execution on SNOW. The coherence-related communication was 75% of the total communication on UDSM.
5.0.4 WATER

This application simulates a system of water molecules in the liquid state and is taken from the SPLASH suite [18]. The application has medium granularity, a high frequency of synchronization, and good reuse of shared data. We used a problem size of 128 molecules for two iterations. The size of the data block becomes important for medium- and fine-grain applications. The execution graph in Figure 12 shows that a small data block size reduces the protocol processing overhead and reduces the execution time of the application. The application is sensitive to imbalance in the work assignment among the nodes, and the graph also shows the performance degradation due to work imbalance on systems with 10, 12, and 14 nodes. On a 16-node system, SNOW with a 1K data block size outperforms the UDSM system with a 0.5K data block size.
Figure 12: Parallel execution times on SNOW and comparison with UDSM system for WATER

The comparison graph in Figure 12 compares the application performance on the SNOW and UDSM systems. SNOW outperforms UDSM if the work assigned to all nodes is balanced, but its performance for an imbalanced workload is worse than UDSM. The performance of SNOW with a 1K-byte data block is better than the performance of the UDSM system with a 0.5K-byte data block. In summary, SNOW outperforms UDSM on a 16-node system. The amount of data copied and the communication performed by the application on both systems are shown in Figure 13. The UDSM system copied six times more data and sent 4 times more messages than SNOW. The coherence-related communication in the UDSM system is 62% of the total communication performed by the system during the execution of the algorithm.
Figure 13: Data copied, number of messages, and coherence related communication for WATER
6 Conclusion and Future Work

Our results show that the fixed ownership protocol on an RC DSM system can reduce the amount of data copied and the amount of data sent over the network. Contrary to intuition, the fixed ownership protocol reduces the total number of messages required during the computation. For an all-software system, the fixed ownership protocol actually has lower access control overhead than the dynamic ownership protocol. The fixed ownership protocol also reduces the need for coherence-related communication. Separation of the read cache and the write-buffer at a node reduces the protocol processing overhead. Coarse-grain applications usually have high reuse of shared data. An imbalanced workload and a mismatch between assigned work and assigned data can degrade an application's performance. Currently SNOW distributes the shared data on a round-robin basis, but an application could benefit from compile-time analysis for efficient and automatic data initialization. In this implementation, SNOW eagerly invalidates the local read cache on release operations; we would like to evaluate the effect of selective invalidation of the read-cache.
Acknowledgements. I would like to thank my advisor Dr. Abdelsalam Heddaya for his help and guidance. I would also like to thank Ms. Ghazaleh Nahidi for proof-reading an early draft of the paper.
References
[1] Thomas E. Anderson, Susan S. Owicki, James B. Saxe, and Charles P. Thacker. High Speed Switch Scheduling for Local Area Networks. Technical Report UCB CSD-94-803, Computer Science Division, University of California, Berkeley, CA 94720, 1994.

[2] Anindya Basu, Vineet Buch, Werner Vogels, and Thorsten von Eicken. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Technical report, Department of Computer Science, Cornell University, 1995.

[3] Brian N. Bershad, Matthew J. Zekauskas, and Wayne A. Sawdon. The Midway Distributed Shared Memory System. In 1993 IEEE CompCon Conference, pages 528-537, February 1993.

[4] Arif M. Bhatti. Data Block Size, Software Access Control and Release Consistent Distributed Shared Memory on Reliable Networks. Technical report, Boston University, 1997. Submitted to PDPTA'97, also available at http://www.cs.bu.edu/students/grads/tahir/pdpta97.ps.
[5] Arif M. Bhatti. Evaluation of All-Software Conventional Distributed Shared Memory on NOWs based on High Speed Networks. In Cluster Computing Conference, March 1997.

[6] Arif M. Bhatti. Evaluation of Distributed Shared Memory Systems for Network of Workstation Without any Hardware Assistance. PhD thesis, Boston University, 1997. (In preparation).

[7] R. Bianchini, L. I. Kontothanassis, R. Pinto, M. De Maria, M. Abud, and C. L. Amorim. Hiding Communication Latency and Coherence Overhead in Software DSMs. In 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.

[8] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl. Proteus: A High-Performance Parallel-Architecture Simulator. In ACM SIGMETRICS and PERFORMANCE Conference, June 1992.

[9] John B. Carter. Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency. PhD thesis, Rice University, Houston, Texas, September 1993.

[10] A. Fahmy and A. Heddaya. BSPk: Low Overhead Communication Construct and Logical Barriers for Bulk Synchronous Parallel Programming. Bulletin of IEEE TCOS, 8(2):27-32, August 1996.

[11] Edward W. Felten, Richard D. Alpert, Angelos Bilas, Matthias A. Blumrich, Douglas W. Clark, Stefanos Damianakis, Cezary Dubnicki, Liviu Iftode, and Kai Li. Early Experience with Message-Passing on the SHRIMP Multicomputer. Technical Report TR-510-96, Department of Computer Science, Princeton University, 1996.

[12] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-Performance All-Software Distributed Shared Memory. In Fifteenth Symposium on Operating Systems Principles, December 1995.

[13] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Technical Report Rice COMP TR93-214, Department of Computer Science, Rice University, P. O. Box 1892, Houston, Texas 77251-1892, November 1993.

[14] J. William Lee. Concord: Re-Thinking the Division of Labor in a Distributed Shared Memory System. Technical Report 93-12-05, Department of Computer Science and Engineering, University of Washington, December 1993.

[15] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[16] Daniel J. Scales, Kourosh Gharachorloo, and Chandramohan A. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.

[17] Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, James R. Larus, and David A. Wood. Fine-Grain Access Control for Distributed Shared Memory. In ASPLOS-VI. ACM, October 1994.

[18] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford Parallel Applications for Shared Memory. Computer Architecture News, pages 5-44, March 1992.

[19] Chandramohan A. Thekkath. System Support for Efficient Network Communication. Technical Report 94-07-02, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, July 1994.