Fast Locks in Distributed Shared Memory Systems

Appeared in HICSS 1994

Gudjon Hermannsson and Larry Wittie

Computer Science Department, SUNY Stony Brook, NY 11794-4400

Abstract

Synchronization and remote memory access delays cause staggering inefficiency in most shared memory programs if run on thousands of processors. This paper introduces efficient lock synchronization using the combination of group write consistency, which guarantees write ordering within groups of processors, and eagersharing distributed memory, which sends newly written data values over fast network links whenever shared data are written locally. This fast locking method uses queue-based locks and a group root as a lock manager. Write ordering allows lock grants and releases to immediately follow final shared data writes. Most other consistency models need shared writes to be completed globally before lock release. Program execution times are much shorter using group write consistency than weak, release, or entry consistency.

1 Introduction

Distributed shared memory (DSM) systems are networks of computers that transparently share variable values among processors. They are distributed systems that hide the underlying message passing mechanism from programmers and allow a shared memory programming paradigm. DSM systems keep the conceptual advantages of shared memory and yet have the potential for scaling efficiently to highly parallel systems with thousands of processors. Remote memory accesses and synchronizations are the two major activities, besides load imbalance, that avoidably lengthen execution times for parallel programs running on DSM machines. Factors that determine the length of delays are: the way the system handles remote memory accesses to the logically shared data space, the consistency model employed to relax constraints on memory access order, and the synchronization methods available.

* This research has been supported in part by Department of Energy/Superconducting Super Collider contract SSC91W09964; by National Science Foundation grants for equipment (CDA88-22721, CDA90-22388, and CDA93-03181) and research grant MIP89-22353; by National Aeronautics and Space Administration grant NAG-1-249; and by Office of Naval Research grant N00014-88-K-0383.

Remote access mechanisms for logically shared distributed memory form a spectrum. At one end are demand driven mechanisms, which delay accesses to remote data until they are actually needed and then halt the processor until each can be fetched. Network traffic is minimized by passing only needed data and data requests. At the other end are eagersharing mechanisms built on the principle that as soon as a shared datum changes, it is sent over the network to all processors that may need it in the future. The main purpose of eager sharing is to have data present in the local memory of each sharing processor whenever they are needed. Ideally, processors are never idled by delays, and eager sharing can successfully overlap communication delays with useful computations.

Eager sharing and cache update are similar in effect, but determine data sharing at different times: at compilation versus at execution. Dynamically enabled eagersharing mechanisms activated at run time can use compile time analyses to determine when sharing will first and last be needed. Eager sharing can avoid network latency even on first reading a variable newly written by another processor. Dynamic disabling can avoid post-need sharing costs incurred by update. Eager sharing is roughly equivalent to cache update combined with initial prefetch and software controlled flushing of no longer needed cache lines. Eager sharing is producer-initiated and does not require impossibly precise timing estimates for optimal latency hiding, as consumer-initiated prefetch does. Demand-fetch protocols do not scale well; for many important parallel algorithms, they do not allow efficient execution on networks larger than a few dozen processors[21, 22]. By masking most interprocessor delays, eager sharing allows efficient execution in much larger networks than demand-fetch access.

Consistency models place specific requirements on the order in which shared memory accesses from one processor may be observed by other processors in a multiprocessor system [7, 17]. The programmer of a parallel computer must be familiar with its consistency model to design correctly functioning and efficient programs scalable to large numbers of processors. The strictest model is sequential consistency[14], which requires memory accesses to appear as some interleaving of the execution of the parallel processes on a sequential machine. A weaker model is processor consistency[8], which allows reads to bypass pending writes to gain more efficiency. Total store ordering[11] ensures ordering of all writes to memory. However, the use of a central memory controller to arbitrate all writes from different processors is not viable for large distributed memory networks. Partial store ordering[11] allows any order of writes between explicit storage synchronization markers, but completes all writes in one marker section before the next starts. Weak consistency[4] guarantees consistency only at synchronization points. Release consistency[7] relaxes weak consistency further by using knowledge of the type of lock access to allow more pipelining and buffering. Entry consistency[3] goes further by associating data with data guards and requiring consistency only when entering a guarded data section. For the weaker models, memory accesses may well be performed out of order between special synchronization points, but data must be consistent at those points.

The group write consistency model implemented by the hardware interfaces for Sesame (Scalable Eagerly ShAred MEmory) networks[21] is a consistency model for parallel computers that produces total store ordering within each small group of processors. It gives more precisely structured data changes than processor consistency, but is much faster. Under group write consistency, shared writes arrive at all processors of a multicast group in the same relative order. Adding processor groups to total store ordering overcomes its central arbitration bottleneck and creates systems with a high potential for scalability. This paper shows how Sesame's combination of group write consistency and eager sharing gives greater efficiency for synchronizations in highly parallel programs than either release or entry consistency.

Synchronization primitives make programs easier to understand and write, but processors waste time when waiting for locks. Locks are used in nearly every parallel program. Lessening synchronization delays is a major goal for efficient parallel program execution. The strict write ordering of group write consistency reduces lock delays by allowing ordinary shared variables to have special meanings, for example, to be reader-writer locks for shared data structures. Ordering can eliminate most locking penalties when there is only one writer. Mutual exclusion can also be very efficient for multiple writers, since lock permission can safely accompany each last shared write.

Section 2 is an overview of hardware synchronization primitives proposed for parallel computers. Section 3 is a discussion of group write consistency, listing its advantages. Section 4 includes diagrams of idle times incurred by processors for simple one-writer locking with group write consistency versus weaker consistency models. Section 5 has similar comparisons for mutual exclusion. Section 6 shows program speedups possible using eager sharing and fast locks.

2 Overview of Lock Synchronization

Hardware synchronization primitives originally evolved on shared memory multiprocessors. In distributed memory systems, where local caches hold copies of remote data, early primitives are prone to heavy network traffic and memory contention. Atomic, sequentially consistent loads and stores can be used for locking. However, faster lock primitives using weaker consistency models have been proposed.

Test-and-set and reset are popular primitives[5]. Test-and-set tests a lock and leaves it set. It is repeated until a lock variable tests "unset" before entering a mutual exclusion section. The lock is reset at section exit. Test-and-set across a network causes excessive traffic if processors spin-lock, repeatedly testing a lock until it is acquired. Test-and-set and compare-and-swap, for the IBM 370[5], are read-modify-write operations. As opposed to spin-lock, suspend-lock employs interprocessor interrupts. If its first test-and-set fails, a processor waits for an interrupt[5].

Less general than read-modify-write is the full/empty tag on each word in HEP memory[12]. The tag can be tested before a producer-consumer write or read operation: only a full word can be read; only an empty one, written. If the test succeeds, the tag value is reversed and the operation is performed. An expanded set of memory tags is included for fast synchronization in the new Tera supercomputer[1].

Test-test-and-set[19] extends test-and-set by spin testing a local copy of the lock whenever the first atomic test-and-set of the global lock fails. After each lock reset, all local copies are invalidated and every waiting processor does a new test-and-set. Only one gets the lock. To reduce mis-trials before rechecking a lock, an exponential backoff delay after release of a lock with test-test-and-set lessens contention[2]. The delay is locally doubled whenever the lock is unlocked but an attempt to obtain it fails.

Queue-based locks[9, 10, 2, 15, 16, 3, 18] are alternatives to retested locks. A lock request is sent to a lock owner. If the lock is free, permission is granted; if busy, the request is queued. When the lock is freed, the next queued process gets permission. Lock queues can be supported in hardware[9, 15] or in software[10, 2, 16, 3, 18]. Test-and-set with hardware support[20] is adequate on tightly coupled multiprocessors. Queue-based locks are needed in distributed memory systems to minimize network traffic after lock release. On multicomputers connected by networks, locating the lock owner is an issue. Distributed directory schemes[15] allow a lock request to go directly to the lock manager or through the manager to the current owner. Other schemes[3, 18] use a distributed algorithm to guess the current lock owner, p. If the guess is wrong, there are two possibilities: p is waiting for the lock and the request is queued at p, or the request is forwarded to a new guess supplied by p.

Fast, efficient synchronization under group write consistency employs queue-based locks and a group root as a lock manager. In most cases, the lock is managed by the network interface on the same processor that is using the lock, and lock requests are local. The write ordering of group write consistency allows lock grants and releases to immediately follow the last shared data in a set, enhancing efficiency compared to other systems[15] with queue-based locks, where shared writes must be globally completed before lock release.
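To make the retested-lock idea concrete, here is a minimal test-test-and-set lock with exponential backoff in the style of [19, 2], sketched with C11 atomics for a single shared-memory node. The names are ours; in a DSM setting, the relaxed load would spin on a locally held lock copy while only the exchange generates traffic toward the lock's home.

    #include <stdatomic.h>
    #include <sched.h>           /* sched_yield() as a stand-in pause */

    static atomic_int lock;      /* 0 = free, 1 = held */

    /* Test-test-and-set with exponential backoff, after [19] and [2]. */
    static void ttas_lock(void)
    {
        unsigned delay = 1;
        for (;;) {
            /* Spin on a local (cached) copy first: no remote traffic. */
            while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
                sched_yield();
            /* One atomic test-and-set attempt against the global lock. */
            if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
                return;                    /* got the lock */
            /* Failed just after a release: back off to cut contention. */
            for (unsigned i = 0; i < delay; i++)
                sched_yield();
            if (delay < 1024)
                delay *= 2;                /* double locally, capped */
        }
    }

    static void ttas_unlock(void)
    {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }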

3 Group Write Consistency Locks

Group write consistency for parallel computations looks almost like simple sequential consistency in most parallel programs and yet provides much higher efficiency than sequential consistency. Group write consistency is built on the principle of guaranteeing the same relative arrival order of shared writes at each participating processor. Safe, rapid shared data access can usually rely on local write ordering rather than global synchronization, which is needed periodically with other consistency models and is very slow in huge networks with long message delays.

Group write consistency has been developed for the Sesame[21] eagersharing distributed shared memory system. There is a reliable tree-based multicast protocol implemented in hardware by the collection of memory sharing interfaces that link Sesame workstations. Each group of processors sharing a variable determines a multicast group within the network. One of the processors that writes to the variable is selected as root for the spanning tree used to route, to sequence, and to retransmit all hidden sharing messages within the group. Data sharing packets within each processor group are sequenced. Each processor accepts shared data only in the order set by the group root. Each processor asks the root for a copy of any lost packet. At worst, strict write ordering doubles average propagation delays for data sharing between pairs of processors.

Group write consistency guarantees the order of writes within a sharing group whether the writes are from a single source or multiple sources. Writes from a single origin addressed to the same multicast group arrive at all destinations in the group in the same relative order. When a single source processor writes to all processors in the network, single source ordering is closest to processor consistency[8]. However, processor consistency is not efficient at fast write rates in large networks. Group write consistency also supports multiple source ordering. Writes from different origins addressed to the same multicast group will arrive at all sites in the group in the same relative order.

Group write consistency could also guarantee ordering between overlapping groups, as is done in an independently developed, similar distributed algorithm for ordering messages[6]. However, for many coding applications, complete ordering is not needed. Combining overlapping groups into one global group can prevent scaling in large networks by overloading the global root and greatly reducing performance. For this reason, Sesame does not automatically combine overlapping groups. Instead, explicit mutual exclusion can enforce ordering for overlapping groups in the rare cases when it is needed.

The programmer of a group write system can take advantage of the write ordering within each group. If processors p1 and p2 are both waiting for data a and b, associated with different groups, changes to a and b can be reflected in two monotonic counters (or set-once flags), one per group. To get consistent values of a and b, processors p1 and p2 locally observe the values of the counters. If they are consistent, so is the data, as sketched at the end of this section.

One major difference distinguishes group write consistency from other models. Others force each processor to wait at synchronization points, or possibly at every remote memory access, until the previous write has changed memory on all other processors before continuing. Instead, Sesame hardware[21] guarantees the consistent local ordering of writes. This difference is extremely important in networks of thousands of computers. All eagerly shared writes are intercepted by memory sharing hardware and will be performed in the same order on all sharing processors. For each synchronized write, a computer using older consistency models encounters the delays of a round-trip. Using the write ordering of group write consistency, a processor can immediately perform the next instruction, even if it is another shared write.
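One plausible rendering of the counter idiom above, as a sketch only: the variable names are ours, the eager-sharing hardware is modeled by ordinary atomics, and a reader simply retries until it observes the same counter value for each group before and after reading its data.

    #include <stdatomic.h>

    /* Hypothetical eagerly shared state: each counter is written, in
     * group write order, after the data of its own sharing group.     */
    static atomic_int group_a_count, group_b_count;
    static int a, b;

    /* Locally observe the counters to get consistent values of a and
     * b: no global synchronization, only rereads of local copies.     */
    static void read_consistent(int *va, int *vb)
    {
        int ca, cb;
        do {
            ca = atomic_load(&group_a_count);
            cb = atomic_load(&group_b_count);
            *va = a;
            *vb = b;
        } while (ca != atomic_load(&group_a_count) ||
                 cb != atomic_load(&group_b_count));
    }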

3.1 Single Writer Many Readers Locking

Lock synchronization is efficient under group write consistency. Since writes are ordered, the case for one writer is simple; an ordinary variable can lock a data structure awaited by one or many readers. If each shared datum is written only once, each reader can just spin locally on its lock copy. When the writer sets its own lock copy to indicate that the data structure is safe to read, the new lock value with its implicit message is sent to each reader. The readers may safely receive and access the updates at different times. Many applications that can use this form of synchronization show great performance increases[21] compared to demand fetch locking.

In the case of multiple writes by the same writer, each reader must also enable a local interrupt that is triggered by any change in the status of the lock. The interrupt is disabled after the whole locked data structure has been read successfully. If it is triggered, the local processor just rereads the data to get consistent values. This option is especially desirable when an application can read slightly old values without significantly changing execution results. For all readers, the interrupt and lock checking are local, avoiding remote network accesses to obtain a mutual exclusion lock.
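A polling rendering of this reread scheme follows, as a sketch with invented names: the paper's version uses a local interrupt, whereas this sketch rechecks the lock word after reading. The writer bumps an ordinary shared lock word around each set of writes, and a reader rereads whenever the word changed under it.

    #include <stdatomic.h>

    #define N 64

    static atomic_int lock_word;   /* odd while an update is in progress */
    static int data[N];            /* eagerly shared data structure      */

    /* Writer: bump the lock word before and after each set of writes.
     * Group write ordering delivers these writes to readers in order.  */
    static void writer_update(const int *src)
    {
        atomic_fetch_add(&lock_word, 1);         /* now odd: writing */
        for (int i = 0; i < N; i++) data[i] = src[i];
        atomic_fetch_add(&lock_word, 1);         /* even again: safe */
    }

    /* Reader: all checks are on the local lock copy, so no network
     * round-trip is ever needed to obtain a mutual exclusion lock.     */
    static void reader_snapshot(int *dst)
    {
        int v;
        do {
            while ((v = atomic_load(&lock_word)) & 1)
                ;                                /* writer mid-update */
            for (int i = 0; i < N; i++) dst[i] = data[i];
        } while (atomic_load(&lock_word) != v);  /* changed: reread   */
    }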

3.2 Mutual Exclusion in One Group

Mutual exclusion guarantees atomic access, usually for one processor to write. The ordering of group write consistency allows a new, very efficient mutual exclusion algorithm for eagersharing systems. Compiler tools can aggregate related variables and locks into the same sharing group. Each lock is initially set to a unique negative number, say -99..99, that does not match the negated ID number of any processor, meaning free. When the variable is positive, the processor with that unique identification (ID) number has exclusive access.

A processor wanting exclusive access writes the lock variable with the negated value of its own processor number. The write is copied by the local eagersharing monitor and sent to the group root. The root checks if the lock is free. If not free, the processor ID number is queued. If free, the root writes the positive processor ID into the lock variable to grant permission. When the original node sees its own positive ID arrive in the lock value, it can continue execution. As each processor frees the lock by writing -99..99 in its local copy, the root checks whether any nodes are queued awaiting exclusive access. If so, the next queued number is written as the new lock value. If not, the free value (-99..99) is propagated to all group memories. A processor always receives exclusive access within one half to one round-trip time of the lock being freed. There is no network traffic except three one-way messages to request, grant, and release the lock.

An advantage unique to group write consistency occurs for heavily requested locks. The last exclusive write is followed by local lock release. The group root can immediately append the next lock grant to the shared data written by the previous processor. On each node, the writes complete before the lock changes. This mutual exclusion method needs little space, only one reservation queue of processor IDs per lock.
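This grant logic can be sketched in C as follows. All names are ours, and the messages to and from the group root, carried in Sesame by the eager-sharing hardware, are collapsed into direct function calls; in the real system each grant arrives later over the network.

    #define FREE     (-999999)    /* unique negative "free" value (-99..99) */
    #define MAX_WAIT 64

    static int lock_value = FREE; /* the eagerly shared lock variable       */
    static int waitq[MAX_WAIT];   /* FIFO of waiting processor IDs          */
    static int qhead, qtail;

    /* Hypothetical helper: the root multicasts a new lock value.           */
    static void propagate(int v) { lock_value = v; }

    /* Root logic, run when a write to the lock variable reaches the root.  */
    static void root_on_lock_write(int written)
    {
        if (written < 0 && written != FREE) {     /* request: negated ID    */
            if (lock_value == FREE)
                propagate(-written);              /* grant: positive ID     */
            else {
                waitq[qtail] = -written;          /* queue the requester    */
                qtail = (qtail + 1) % MAX_WAIT;
            }
        } else if (written == FREE) {             /* release                */
            if (qhead != qtail) {
                propagate(waitq[qhead]);          /* grant to next in line  */
                qhead = (qhead + 1) % MAX_WAIT;
            } else
                propagate(FREE);                  /* propagate free value   */
        }
    }

    /* Requester on processor `me`: write the negated ID, then spin on the
     * local lock copy until its own positive ID arrives from the root.     */
    static void acquire(int me)    { root_on_lock_write(-me);
                                     while (lock_value != me) ; }
    static void release_lock(void) { root_on_lock_write(FREE); }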

3.3 Mutual Exclusion Among Groups

When dealing with mutual exclusion across multiple sharing groups, an additional step is needed. A processor requiring exclusive access to a collection of variables not in the same group sends a request to the group roots of all variables in the collection. Any one group root does not have enough information to give the requestor exclusive access to the whole collection, but it does reserve exclusive access to the group it controls. When all groups accept, the processor can continue. The release is sent the same way, to each group root involved. Routing corresponding synchronization messages and data changes on the same paths through group roots guarantees a consistent view of variable updates. All data changes are confined to one group.
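A sketch of the cross-group step: the helper names and message plumbing are invented, and each send stands for one message to a group root.

    #define NGROUPS 3            /* groups holding the variable collection */

    /* Hypothetical per-group state: set when group g's root has reserved
     * exclusive access for this processor.                                */
    static volatile int granted[NGROUPS];

    static void send_reserve(int g, int me) { (void)g; (void)me;
        /* message: ask the root of group g for exclusive access */ }
    static void send_release(int g, int me) { (void)g; (void)me;
        /* message: release our reservation of group g */ }

    /* Exclusive access to variables spread over several groups: no single
     * root can grant the whole collection, so every root must accept.     */
    static void acquire_all(int me)
    {
        for (int g = 0; g < NGROUPS; g++)
            send_reserve(g, me);
        for (int g = 0; g < NGROUPS; g++)
            while (!granted[g])
                ;                /* continue only when all groups accept   */
    }

    static void release_all(int me)
    {
        for (int g = 0; g < NGROUPS; g++)
            send_release(g, me); /* routed like the data changes           */
    }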

4 Producer-Consumer Comparisons

The various data consistency models produce very different delays for producer-consumer interactions in a parallel program coded to run correctly regardless of timings.

In Figure 1, group write consistency, using Sesame[21] eager sharing, is contrasted to three less strict consistency models: weak[5, 4], release[7], and entry consistency[3]. The timings for group write consistency assume eager sharing; those for the weak and release models assume cache update sharing; and for entry consistency, updates are communicated only when the corresponding lock is requested. Group write consistency is markedly more efficient. The code for this scenario is very simple:

    Producer                      Consumer(s)
    --------------                -------------
    ...                           ...
    Write(data);                  While(flag == 0) ;
    Write(flag = 1);              Read(data);
    ...                           ...

[Figure 1: Producer-Consumer Delays for Group-Write versus Weak, Release, and Entry Consistency. Three timing diagrams on a common time axis: (a) Sesame group write consistency, where the writer's data updates and flag write reach each spinning reader in order; (b) weak and release consistency, where the release waits for update acknowledgments from all readers before the flag changes; (c) entry consistency, where a reader waiting for the lock receives the changed data together with the lock grant.]

The flag is initially zero. In Sesame, if the flag and data are in the same sharing region (group), their changes are routed in sequence to the same multicast group of processors. Since group write consistency guarantees that write orders are preserved within each group, the local data will be valid before the copy of the flag changes to one on each reader processor. In the weak and release consistency models, the same assurance can be guaranteed by labeling the flag accesses as a lock acquire by the readers and a lock release by the writer. By treating flag changes as synchronization accesses, writes to the data will complete on all computers before the release completes. For entry consistency, the code must be modified, since the data must be guarded by a lock. The code must use explicit lock requests and releases, as sketched below. When the writer releases the write lock, both the data and lock permission are sent to a requesting reader.
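For concreteness, the entry consistency version might look like the sketch below; the acquire and release calls are hypothetical stand-ins for Midway-style primitives[3], and only their placement matters here.

    /* Hypothetical entry-consistency primitives: acquiring a lock blocks
     * until the grant, and updates to the guarded data arrive with it.   */
    void acquire(int lock);
    void release(int lock);

    enum { DATA_LOCK };
    int data;                    /* guarded by DATA_LOCK                  */

    void producer(void)
    {
        acquire(DATA_LOCK);      /* exclusive mode                        */
        data = 42;               /* change the guarded data               */
        release(DATA_LOCK);      /* the next grant carries the new data   */
    }

    void consumer(void)
    {
        acquire(DATA_LOCK);      /* waits for lock permission plus data   */
        int v = data;            /* consistent on arrival                 */
        (void)v;
        release(DATA_LOCK);
    }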

In Figure 1, the reader spins testing the flag until the writer sets it and data read access is safe. Sesame group write consistency is shown in Figure 1(a). Immediately after the last datum is written, the release (flag) is written and safely follows the last write. For weak and release consistency, Figure 1(b), the release cannot be done until update acknowledgments return from all processors receiving copies of the data. Reader access to the data is one round-trip later than for Sesame. Figure 1(b) does not show yet another round-trip delay that occurs under weak and release consistency for readers that do not already have a copy of the flag. Using cache invalidation instead of a cache update policy with weak or release consistency would also cost another round-trip for the reader to fetch the data. Figure 1(b) shows the best possible times for the weaker consistency models. Much longer delays are possible. The times shown for group write consistency will never be longer.

Figure 1(c) shows the most rapid release for entry consistency. A read request has already been queued at the writer before or during the update. Entry consistency is more efficient than release consistency. The writer stores the data locally. After the lock is released, the data and the lock permission are sent to the reader. Since data can more nearly fill whole cache lines, data transfer may be more efficient than for the individual updates in Figure 1(b). However, all data transfers must follow lock release, unlike Sesame. Whenever the reader requests a lock that is already available but only at the writer, entry consistency has two one-way delays plus the data transmission times. Sesame has no network delays for this case. At its best, entry consistency gives times close to those for group write consistency, but still takes longer by the slight additional time for data transmission.

Group write consistency always lets readers get new data as soon as possible and sooner than other consistency models. In the best case for Sesame, data written by a remote node are already present locally when needed, and there are no network delays at all.


5 Mutual Exclusion Comparison

Figures 2, 3, and 4 compare wasted idle times for three successive sets of mutually exclusive accesses under Sesame group write consistency, weak[5, 4], release[7], and entry consistency[3]. They show times for contending requests to the same lock. Again, eager sharing or cache update sharing is used to minimize data access delays in all cases except entry consistency, which has the policy of not updating until a lock is requested. Weak and release consistency behave the same since each processor locks, reads or updates, and releases only once. In all models, only one processor at a time is allowed to access locked data, even though (possibly inconsistent) copies may be present on the other processors.

In the figures, three processors (CPU1, CPU2, CPU3) each start with a local memory copy of the shared data that will later be updated. CPU2 is the current lock owner, which arbitrates lock access requests. At the top of each figure, two processors, CPU1 and CPU3, contend to acquire the lock for exclusive access to the data. Each must write the data when no other processor is writing or trying to read. The request from CPU1 reaches CPU2 slightly sooner and is granted first. After CPU1 has finished its exclusive accesses, and while CPU3 still has the lock, CPU2 requests exclusive access, which is granted later.

In Figure 2 for group write consistency, the Sesame interface for CPU1 copies its data changes without slowing CPU1. The dashed lines indicate shared data being sent to the group root, and the solid lines data from the root to all destinations. When CPU1 finishes its last update, it immediately releases the lock. The shared writes reach group root CPU2 and are redistributed via its tree to all (three) sharing CPUs. The lock release reaches CPU2 after the last written datum, as guaranteed by write order. The release immediately becomes a permission forwarded to queued processor CPU3. When CPU3 receives the lock permission, it can read the shared data locally and perform its own updates. After the last update, the lock release is sent back to CPU2 for its updates. If sharing is enabled, eagerly shared variables are copied whenever changed. Local adherence to the lock protocol prevents untimely read or write accesses to shared data.

[Figure 2: Fast Group Write Consistency Locking. Timing diagram: CPU1 and CPU3 send lock requests to group root and lock owner CPU2; Acquire1 is granted permission first; CPU1's shared writes pass through the root to all sharers, and its release immediately becomes a permission forwarded to the queued CPU3, whose shared writes and release follow.]

Sesame is efficient even under heavy contention. It is not possible to release the lock sooner than the last datum update, nor to get permission before the last changes to shared data reach the requesting processor. If there is no contention, at most one round-trip is needed to acquire the lock.

Similar lock requests under weak and release consistency are shown in Figure 3. Again two CPUs request the lock, CPU1 is given permission, and the request by CPU3 is forwarded to CPU1 and queued. This method may need three one-way messages to get a lock[13]: for example, the request by CPU3 to the lock manager (CPU2) is forwarded to the current lock owner (CPU1), which must eventually grant the lock to CPU3. When CPU1 finishes, lock release to CPU3 is blocked until the updates reach all nodes. After CPU3 gets permission, it writes the shared data and blocks lock release until its updates are completed everywhere. Comparing Figures 2 and 3, weak and release consistency need an extra one-way trip to release the lock even with an update policy. The weaker consistencies get updates to non-root processors (CPU1, CPU3) faster, since the updates do not have to pass through the root (CPU2) for sequencing, but the values cannot safely be used until after lock release, which is much slower than for group write consistency.

[Figure 3: Weak and Release Consistency Locking. Timing diagram: lock requests from CPU1 and CPU3 reach lock owner CPU2; Acquire2 is forwarded to the current owner; after each set of writes, updates must complete on all nodes before the release, and only then is permission passed to the next requester.]

[Figure 4: Entry Consistency Synchronization. Timing diagram: before granting Acquire1, lock owner CPU2 invalidates copies held in non-exclusive mode; each grant carries the lock permit plus the changed data; CPU2's own later request is forwarded via guessed owner CPU1 to current owner CPU3.]

If an invalidation policy is used, another round-trip, not shown in Figure 3, is needed to move the shared data to each new processor as it acquires exclusive access.

Lock requests under entry consistency are shown in Figure 4. Again two CPUs request the lock. Locks under entry consistency can be requested in either non-exclusive mode or exclusive mode, and invalidation is used to transfer from non-exclusive to exclusive mode. Before CPU1 is given permission, the lock owner sends an invalidation to the processors holding the data in non-exclusive mode. Data changes and lock permission are sent to CPU1. The request by CPU3 is forwarded to CPU1 and queued. When CPU1 finishes, it releases the lock, and both the changed data and lock permit are sent to CPU3. After CPU3 gets permission, it writes the shared data and releases the lock locally. When CPU2 requests the lock, it sends the lock request to its best guess for the lock owner, CPU1, which forwards the lock request to CPU3. The data changes and lock permission are sent to CPU2, which can now perform its updates.

The same time scale is used in Figures 2, 3, and 4. Entry consistency shows quicker results than weak and release consistency, but is not so rapid as Sesame. A round-trip invalidation is needed to move data from non-exclusive mode to exclusive mode. If several processors are contending heavily to acquire the lock, entry consistency performs nearly as well as Sesame, except for the extra data transmission time needed after each local lock release. Data changes are sent to the next processor requesting the lock just before the lock permit is sent. Group write consistency avoids this extra propagation delay. Under light contention, entry consistency may not perform as well, since a new requestor may often guess the wrong lock owner and have to wait for its request to be forwarded. Using an update policy, weak and release consistency take much longer than group write consistency, and take even longer if invalidation is used. For very large systems, the disparity between group write consistency and the other models will be significantly larger than shown, since network delays will be much longer than local update times.

[Figure 5: Gaussian Elimination Speedup, 400 Equations. Log-log plot of speedup versus number of CPUs (1 to 400) for eager share, prefetch, and demand fetch.]

6 Simulation Results

Single writer synchronization is highly efficient, as shown on logarithmic scales in Figure 5 [22]. The figure plots speedups using Gaussian elimination on a 400 x 400 matrix to solve 400 linear equations. Network size varies from 1 to 400 computers. Speedup is average processor efficiency times network size. Efficiency is the percentage of peak processor speed.

Figure 5 shows the speedup resulting from using weak or release consistency for demand fetch and prefetch, similar to DASH[15]. The consistency scenario for demand (pre)fetch is the same one shown in Figure 1(b) except: (1) sending data does not incur any network costs other than sharing contention on the memory of the sending processor; and (2) instead of a release after a shared write taking three half trips, demand fetch needs only two half trips to request and to receive the lock and its associated data values. Because of these cost simplifications, simulated speedups for demand fetch and prefetch are slightly higher than in a real system. Figure 5 also shows how much prefetch improves performance over demand fetch, by asking for each lock value and its data slightly ahead of time. The results are the best for any single prefetch time kept constant throughout execution.

The differences in scaling ability of the three sharing methods are clear. The demand sharing methods provide speedups of at most 4.2 and 13.7, and for large networks run slower than on one processor. On the other hand, eager sharing extracts ever-increasing network power from more processors, reaching a speedup of 126.8 (4 gigaFLOPS) for 400 processors. Results are similar for Fast Fourier Transform [22], with a speedup 140 times greater using eager sharing rather than demand fetch among 2048 processors.

This section also shows system speedups for a task management application that uses eager sharing combined with group write consistency for mutual exclusion. Eager sharing is compared to a fast version of entry consistency. The version of entry consistency used in the simulations assumes that the lock owner is always known, so no time is ever lost in relaying requests to find the lock owner. All releases in entry consistency are local.

In this application, one processor produces tasks for the others to execute. If a processor is free, it locks the shared tail index of a common idle-processor queue, adds itself to the tail of the queue, and then releases the lock. It then waits for the producer to inform it of a task. When the producer creates a new task, it locks the shared head index for the idle queue, removes the next processor id from the queue, starts to inform the processor of the task, and releases the lock. If there is no idle processor, the producer waits for one to enter itself onto the idle queue. The time to produce a task is assumed to be much shorter than the time to process a task: 1/100 of it for Figure 6 and 1/2000 for Figure 7.
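The simulated queue protocol can be sketched as follows; the names are ours, lock and unlock stand for the fast mutual exclusion locks of Section 3.2, and each spin is on an eagerly shared local copy.

    #define NPROC 128

    enum { HEAD_LOCK, TAIL_LOCK };  /* two fast locks, as in Section 3.2  */
    static void lock(int l)   { (void)l; /* acquire mutual exclusion */ }
    static void unlock(int l) { (void)l; /* release it               */ }

    static int idleq[NPROC];             /* shared idle-processor queue   */
    static volatile int qhead, qtail;    /* guarded by the two locks      */
    static volatile int task_for[NPROC]; /* nonzero: processor p has work */

    /* A free processor: append itself to the idle queue, then wait for
     * the producer to inform it of a task.                               */
    static void worker_idle(int p)
    {
        lock(TAIL_LOCK);
        idleq[qtail] = p;  qtail = (qtail + 1) % NPROC;
        unlock(TAIL_LOCK);
        while (task_for[p] == 0) ;       /* local spin on an eager copy   */
    }

    /* The producer: remove the next idle processor and hand it a task.
     * With no idle processor, it waits for one to enqueue itself.        */
    static void dispatch(int task)
    {
        lock(HEAD_LOCK);
        while (qhead == qtail) ;         /* workers use TAIL_LOCK, so
                                            this wait does not block them */
        int p = idleq[qhead];  qhead = (qhead + 1) % NPROC;
        unlock(HEAD_LOCK);
        task_for[p] = task;
    }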

[Figure 6: Speedup for Task Management, production-to-execution time ratio 1/100. Log-log plot of speedup versus number of CPUs (up to 1024): maximum possible speedup, eager sharing with 1024 and 128 tasks, and entry consistency with 1024 and 128 tasks.]

Figure 6 illustrates effective network power when the producer generates a total of 128 or 1024 tasks and waits for the last to be executed before stopping. The top line shows the maximum speedup possible if network delays were zero. Since the time to generate 128 or 1024 tasks is negligible compared to the execution time, the producer is effectively an idle processor. For 2 processors, the maximum efficiency is minutely more than 50%, resulting in an effective speedup of 1. The number of processors in Figure 6 is a power of two plus one (3, 5, 9, ...) to eliminate load balancing effects. The extra one produces tasks. As can be seen in Figure 6, the extra time for entry consistency to send the changed data with the lock, and the waits for updated read copies of values protected by a lock, become significant for larger networks.

For 128 tasks, SESAME eager sharing shows the maximum possible efficiency up to 17 processors, resulting in an effective network speedup of 16. When SESAME reaches 17 processors, mutual exclusion request delays become more visible. For 65 processors, average processor efficiency drops to about 72%, which is 74% of the maximum possible efficiency. For 128 processors and 128 tasks, average efficiency is about 40%, but overall network speedup still rises slightly. The reason for the drop in efficiencies at the end is clear. The assumption of a 1/100 time ratio for task production versus task execution places a limit on efficiency. With more than 100 processors, there are not enough tasks produced to keep all processors busy. A processor finishing a task has to wait more than one task execution time to get a new task to execute. For 128 tasks, the peak speedup is 17.9 with 33 processors for entry consistency and 53.3 with 129 processors for group write consistency. A total of 128 tasks can run 3.0 times faster using fast locks under group write consistency than under entry consistency.

Similar curves for 1024 tasks are also shown in Figure 6. For 1024 tasks, SESAME reaches a peak speedup of 84.1 with 129 processors. The efficiency drops after the network reaches the 100 processor mark. For entry consistency, peak speedup is only 22.5 with 33 processors. Group write consistency gives 3.7 times faster performance with 1024 tasks. Eager sharing allows much more efficient locked queue management than entry consistency.

For entry consistency, the maximum possible network power is never reached if there are more than 2 processors. For more processors, the gap between the realized and maximum attainable power widens slowly up to 17 processors and very rapidly after that. Entry consistency provides less computing power than SESAME group write consistency for two reasons: it takes extra time to send the data just before each lock grant, and the processors must fetch and test a variable written by the producer to check if the processor queue is full, causing network traffic and delays. With eager sharing, the test variable is immediately sent to all processors whenever it changes.

Figure 7 shows similar results for 128 and 1024 tasks, when the ratio of task production to task processing times is 1/2000. In this case, the 1025 or fewer processors never wait idly. For larger networks, network speedup is roughly twice as great as in Figure 6. For 1024 tasks, peak speedups are 216.6 (with 257 processors) for SESAME locking and only 42.2 (with 65 processors) for entry consistency.

[Figure 7: Speedup for Task Management, production-to-execution time ratio 1/2000. Log-log plot of speedup versus number of CPUs (up to 1024): eager sharing with 1024 and 128 tasks, and entry consistency with 1024 and 128 tasks.]

Group write consistency improves overall task execution by a factor of 5.1 over entry consistency.

7 Conclusions

This paper has shown fast lock synchronization methods under group write consistency, which guarantees to maintain the order of all shared writes within multicomputer processor groups. Lock synchronization methods made very efficient by group write consistency include producer-consumer constraints and mutual exclusion within a group and across groups. The write ordering guaranteed within each sharing group minimizes processor wait times for producer-consumer relations, and allows for efficient mutual exclusion since the lock can be released immediately after the last shared write protected by the lock. For the two cases of producer-consumer and mutual exclusion, graphical comparisons in this paper show that program idle times are much shorter for synchronization under group write consistency than under weak, release, or entry consistency. Producer-consumer locking is all the synchronization that is needed for many parallel programs.

Simulation results show high efficiencies for eager sharing compared to demand fetch and prefetch using a simplified version of release consistency[22]. Simulation results for mutual exclusion also show high efficiency for group write consistency compared to entry consistency, even when the results for entry consistency are slightly inflated by using global knowledge about lock ownership to route requests perfectly. Simulations show that for Gaussian elimination with up to 400 processors, eager sharing gives 30 times greater speedup than demand fetch, and 9 times better than demand fetch augmented by prefetch. Other results show that fast locking under group write consistency allows managed tasks to complete 3 to 5 times faster than with entry consistency. The interleaving of data accesses with synchronization accesses within groups is a promising idea worth exploring further.

References

[1] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. Int. Conf. on Supercomputing, 1-6, June 1990.
[2] T.E. Anderson. The Performance Implications of Spin-Waiting Alternatives for Shared-Memory Multiprocessors. Int. Conf. on Parallel Processing, II:170-174, Aug. 1989.
[3] B.N. Bershad and M.J. Zekauskas. Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors. Technical Report CMU-CS-91-170, Carnegie Mellon University, Sept. 1991.
[4] M. Dubois, C. Scheurich, and F.A. Briggs. Memory Access Buffering in Multiprocessors. 13th Int. Symp. on Comp. Arch., 434-442, June 1986.
[5] M. Dubois, C. Scheurich, and F.A. Briggs. Synchronization, Coherence and Event Ordering in Multiprocessors. IEEE Computer, 21(2):9-21, Feb. 1988.
[6] H. Garcia-Molina and A. Spauster. Ordered and Reliable Multicast Communication. ACM Trans. on Computer Systems, 9(3):242-271, Aug. 1991.
[7] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. 17th Int. Symp. on Comp. Arch., 15-26, May 1990.
[8] J.R. Goodman. Cache Consistency and Sequential Consistency. TR #1006, Computer Science, University of Wisconsin, Madison, Feb. 1991.
[9] J.R. Goodman, M.K. Vernon, and P.J. Woest. Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors. 3rd Int. Conf. on Arch. Support for Prog. Lang. and Op. Sys. (ASPLOS), 3:64-75, April 1989.
[10] G. Graunke and S. Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, 23(6):60-69, June 1990.
[11] SPARC International Inc. The SPARC Architecture Manual, Version 8. Prentice Hall, 1992.
[12] H.F. Jordan. Performance Measurements on HEP - a Pipelined MIMD Computer. 10th Int. Symp. on Comp. Arch., 207-212, June 1983.
[13] P. Keleher, A.L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. 19th Int. Symp. on Comp. Arch., 20(2):13-21, May 1992.
[14] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. on Computers, C-28(9):690-691, Sept. 1979.
[15] D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. Design of Scalable Shared-Memory Multiprocessors: The DASH Approach. Spring COMPCON 90, 62-67, Feb. 1990.
[16] J.M. Mellor-Crummey and M.L. Scott. Synchronization Without Contention. 4th Int. Conf. on Arch. Support for Prog. Lang. and Op. Sys. (ASPLOS), 269-278, April 1991.
[17] B. Nitzberg and V. Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. IEEE Computer, 24(8):52-60, Aug. 1991.
[18] U. Ramachandran, M. Ahamad, M. Yousef, and A. Khalidi. Coherence of Distributed Shared Memory: Unifying Synchronization and Data Transfer. Int. Conf. on Parallel Processing, II:160-169, Aug. 1989.
[19] L. Rudolph and Z. Segall. Dynamic Decentralized Cache Schemes for MIMD Parallel Processors. 11th Int. Symp. on Comp. Arch., 340-347, June 1984.
[20] S. Thakkar, P. Gifford, and G. Fielland. The Balance Multiprocessor System. IEEE Micro, 8(1):57-69, Feb. 1988.
[21] L.D. Wittie, G. Hermannsson, and A. Li. Eager Sharing for Efficient Massive Parallelism. Int. Conf. on Parallel Processing, II:251-255, Aug. 1992.
[22] L.D. Wittie, G. Hermannsson, and A. Li. Evaluation of Distributed Memory Systems for Parallel Numerical Applications. 6th SIAM Conf. on Parallel Processing for Scientific Computing, 561-568, March 1993.
