Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999
Evaluation of Two Optimized Protocols for Sequential Consistency

Gabriel Girard
Dept. de mathematiques et d'informatique, Universite de Sherbrooke, Sherbrooke, Quebec, Canada J1K 2R1
email: [email protected]

Hon F. Li
Department of Computer Science, Concordia University, Montreal, Quebec, Canada H3G 1M8
email: h [email protected]

Abstract

Sequential consistency is a well-known consistency requirement for distributed shared memory. However, most of the algorithms that implement sequential consistency involve some expensive blocking. In this paper, we propose algorithms that reduce blocking. First, we propose a synchronous algorithm using a 3-phase protocol and, when possible, a more efficient 2-phase protocol. We also propose a semi-synchronous algorithm in which all processes proceed asynchronously until a specific operation, called the flush, is invoked. The processes are then synchronized by this flush at a particular point in space and time. Subsequently, all processes resume their asynchronous progress. Simulation results show significant performance improvements for our protocols, especially in the case of the semi-synchronous protocol.

1 Introduction

Distributed shared memory (DSM) is an important paradigm for programming a distributed system. It is easier to use than pure message passing. However, DSM often suffers in performance because maintaining consistency incurs long access latencies that cannot be overlapped with other operations in a process. Sequential consistency [12] is the most general consistency requirement. Quite a few algorithms that implement sequential consistency have been proposed, but they usually require blocking of processes [1, 2, 5, 6, 7, 15] or special labeling of operations [1, 5, 8, 10]. Other, less restrictive consistency requirements [3, 9, 13] have also been proposed, but they are difficult to use in general. In this paper, we present two novel protocols that implement sequential consistency in a distributed memory system with replication at reader sites. We introduce a new strategy to minimize synchronization cost and maximize the hiding of synchronization delays in a process. The strategy is based on the knowledge of spatial locality in the sharing of memory objects. An access graph is used to capture the sharing relationship among processes via the shared objects. This is assumed to be static knowledge available at compilation/design time of the program. Section 2 introduces the idea of an access graph and the various views of an execution. Section 3 relates view cycles to violations of sequential consistency and to the presence of access cycles; hence sequential consistency can be implemented by synchronizing every access cycle. Section 4 describes the two types of protocols that aim to synchronize each access cycle. The first type synchronizes the two neighbors in each access cycle involved in an operation, and the second type synchronizes all processes in an access cycle only when one special operation, called the flush operation, in the access cycle is performed. Section 5 describes a simulation experiment conducted to compare the effectiveness of these protocols against the general synchronous protocol.

2 Access graph

An access graph is used to model the sharing of memory objects among the processes. These objects may not be uniformly shared among all processes. For example, to some processes, a memory object may be readable but not writable, while to others, it is both readable and writable. If process i can write into object x and process j can read from object x, then a directed edge with x in its label connects node i to node j. Figure 1 shows such an example. In this example, x is a write-only object to process P3 but a read/write object to processes P1 and P2; y is a write-only object to process P2 and a read-only object to processes P1 and P3, while z is a read/write object to both processes P3 and P4. Hence the spatial locality of an object is captured by the access restrictions imposed on the processes. A directed access graph becomes undirected if the direction of each edge is omitted.


Figure 1: Access graph. [Diagram: processes P1-P4 connected by edges labeled with the objects they share ({x}, {y}, {z}, {x,y}).]

Figure 2a (operation sequences of the example execution):
P1 : w(x,1)
P2 : w(t,1) ; w(y,1) ; r(z,1) ; r(x,1) ; w(y,2)
P3 : w(x,2) ; r(y,3) ; r(t,1)
P4 : w(y,3) ; w(t,2) ; w(z,1) ; r(y,3)


We will use the term access graph to refer to the undirected access graph and a cycle in the access graph will be called an access cycle. The usefulness of the access graph lies in the minimization of synchronization delay in each process. We will show that synchronization of object accesses to maintain sequential consistency can be minimized by focussing on the synchrony of operations in every access cycle; indeed if every access cycle is properly `synchronized', sequential consistency will be guaranteed.
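The following Java sketch (ours, not from the paper; all class and method names are invented) shows one way to represent an access graph and detect whether it contains any access cycle, using a depth-first search that treats the graph as undirected and simple:

import java.util.*;

// Minimal illustrative sketch: an undirected access graph whose nodes are
// process ids and whose edges carry the shared objects. hasAccessCycle()
// reports whether any access cycle exists.
public class AccessGraph {
    private final Map<Integer, Map<Integer, Set<String>>> adj = new HashMap<>();

    // Record that `writer` can write an object that `reader` can read.
    public void addEdge(int writer, int reader, String object) {
        adj.computeIfAbsent(writer, k -> new HashMap<>())
           .computeIfAbsent(reader, k -> new HashSet<>()).add(object);
        adj.computeIfAbsent(reader, k -> new HashMap<>())
           .computeIfAbsent(writer, k -> new HashSet<>()).add(object);
    }

    public boolean hasAccessCycle() {
        Set<Integer> visited = new HashSet<>();
        for (Integer start : adj.keySet())
            if (!visited.contains(start) && dfs(start, -1, visited)) return true;
        return false;
    }

    // Returns true if a back edge (a cycle) is found in this component.
    private boolean dfs(int node, int parent, Set<Integer> visited) {
        visited.add(node);
        for (Integer next : adj.get(node).keySet()) {
            if (next == parent) continue;             // skip the edge we arrived on
            if (visited.contains(next)) return true;  // back edge closes a cycle
            if (dfs(next, node, visited)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AccessGraph g = new AccessGraph();
        g.addEdge(1, 2, "x");
        g.addEdge(2, 3, "y");
        g.addEdge(3, 1, "x");                          // closes the cycle P1-P2-P3
        System.out.println(g.hasAccessCycle());        // prints true
    }
}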


2.1 Views of an execution


An execution of the system results in a set of linear traces, one per process. Each trace contains the sequence of program-ordered memory operations performed by the process and the values associated with them. In particular, wi(x, v) represents the writing of value v into object x by process i, and ri(x, v) represents the reading of value v from object x by process i. For ease of explanation, we will assume the values written to an object are distinct. We may omit the process label i whenever the context is clear or the value v is insignificant. From an execution, we could derive the local view and the other global views, as will be explained next.

2.1.1 Local view


The local view of process i, denoted by Vi, is its local trace augmented with the writes by other processes whose values it has read. The augmentation transforms the totally ordered operations of process i into a partial order; wj(x, v) → ri(x, v) is introduced whenever process i reads the value written by process j. The '→' is the order relation used in the view model. Figure 2a shows an execution and Figure 2b shows the local views Vi of the various processes.

Property 1 : Vi is acyclic (a partial order).

Figure 2: Execution views. [Diagram with panels: a) the operation sequences listed earlier; b) the local views V1-V4; c) the global view; d) the necessary view; e) an acyclic possible view of the execution.]

2.1.2 Global view

The (set) union of the local views forms the global view. Figure 2c shows the global view of the example execution. The global view is used to represent the ordering of operations deducible from the program order and the direct coupling when a reader picks up the value written by a writer. If the global view is cyclic, then we conclude immediately that the execution is not sequentially consistent. There cannot exist a linear order of the operations that is consistent with individual program orders without contradicting the direct coupling between a writer and a reader. However, the converse is not true.


2.1.3 Necessary view

The coupling between readers and writers must also ensure that every object is atomic through indirect coupling among the processes. Hence there are additional orderings that must be satisfied in the global view, and they are captured by the following augmentation rule.

Augmentation Rule (AR): wi(x, v) ; rj(x, v')/wj(x, v') ⇒ rk(x, v) → wj(x, v') for every rk(x, v).

Rule AR orders every read of x that returns the value v to appear before the write of v' into x if the write of the former is ordered before some read of the latter or the latter itself. The symbol ';' indicates ordering that could be direct (such as program order or writer/reader coupling) or transitively induced via other operations. For the example, Figure 2d shows the necessary view of the execution. It includes r(y,3) → w(y,2) because w(y,3) ; w(y,2). The necessary view is the maximal ordering among the operations that we could deduce from an execution. If the necessary view is cyclic, by the same reasoning as in Section 2.1.2, the execution is not sequentially consistent. Unfortunately, the converse is still not true. However, it is easily verifiable that the necessary view can be derived from an execution in polynomial time and is unique up to transitive closure (i.e., reachability relation between nodes) in the representation.

2.1.4 Possible view

For an execution to be sequentially consistent, the acyclic necessary view must satisfy an additional property. Suppose two writes, say wi(x, v) and wj(x, v'), at least one of whose values has been read by some process, are unordered in the acyclic necessary view. These writes are called concurrent writes.

Definition 1 : A possible view is obtained from the necessary view by ordering every pair of concurrent writes, say wi(x, v) and wj(x, v'), such that wi(x, v) → wj(x, v') implies every rk(x, v) → wj(x, v').

The existence of an acyclic possible view is related to sequential consistency, which is defined as follows.

Definition 2 : An execution is sequentially consistent iff there exists a total ordering of all the operations such that (i) it is consistent with each program order, and (ii) in the ordering, w(x, v) must appear before r(x, v) and no other w(x, v') or r(x, v') appears between w(x, v) and r(x, v).

Figure 3: Acyclic necessary view without acyclic possible view.
a) Operation sequences:
P1 : w(a,1) ; w(b,1) ; w(c,1)
P2 : w(x,1) ; r(a,1) ; r(b,2)
P3 : w(b,2) ; w(a,2) ; r(c,1) ; r(x,2)
P4 : w(x,2) ; r(d,1) ; r(e,1)
P5 : w(d,1) ; w(e,2) ; w(f,1)
P6 : w(e,1) ; w(d,2) ; r(f,1) ; r(x,1)
b) [Diagram: a cyclic possible view for this execution.]

We state the following lemma.

Lemma 1 : An execution is sequentially consistent iff it possesses an acyclic possible view.

Proof: Given an acyclic possible view, we could obtain a total order of the execution by iteratively selecting, as the next operation in the total order, an operation that is not preceded by any other operation in the remaining possible view. Because of the property of the possible view, this total order satisfies the requirement of Definition 2. Conversely, the reverse direction is immediate, as a total order is itself an acyclic possible view.

For the example, Figure 2e shows an acyclic possible view of the execution. Hence, from the lemma, we conclude that the execution is sequentially consistent. Figure 3a shows another execution which has an acyclic necessary view but which does not possess an acyclic possible view. Figure 3b shows a cyclic possible view for this execution. Thus it is not sequentially consistent. Notice that an acyclic possible view may contain unordered wi(x, v) and wj(x, v'). This is allowed when both values, v and v', have not been read by any process. Hence our possible view model is more general than that proposed by [15].
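The extraction of a total order used in the proof of Lemma 1 is exactly a topological sort. A minimal Java sketch (ours; names are invented) over operation ids:

import java.util.*;

// Given an acyclic possible view as a DAG over n operation ids and a list of
// directed edges, repeatedly emit an operation with no remaining predecessor
// (Kahn's algorithm). The emitted sequence is a legal total order.
public class TotalOrder {
    static List<Integer> linearize(int n, List<int[]> edges) {
        List<List<Integer>> succ = new ArrayList<>();
        int[] pred = new int[n];                        // count of unseen predecessors
        for (int i = 0; i < n; i++) succ.add(new ArrayList<>());
        for (int[] e : edges) { succ.get(e[0]).add(e[1]); pred[e[1]]++; }

        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < n; i++) if (pred[i] == 0) ready.add(i);

        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int op = ready.remove();                    // preceded by no remaining operation
            order.add(op);
            for (int next : succ.get(op)) if (--pred[next] == 0) ready.add(next);
        }
        if (order.size() != n) throw new IllegalStateException("view is cyclic");
        return order;
    }
}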

2.1.5 Necessary ordering vs possible ordering

It has been proved that deciding if an execution is sequentially consistent is an NP-complete problem.


Hence given an arbitrary execution, even though we could easily derive the necessary ordering and hence the necessary view (in polynomial time), we do not think it is possible to derive an acyclic possible view without enumeration. So the necessary ordering is the maximal information we could deduce which must exist among the operations in order that the read/write semantics of memory objects are not violated. The possible ordering includes runtime choices which are not apparently deducible from the execution trace and is often not unique.

3 View cycle and access cycle

A distributed protocol correctly implements sequential consistency only if it guarantees that every execution has an acyclic possible view. If all possible views are cyclic, even if the necessary view is acyclic, the execution is not sequentially consistent. In this paper, our protocols attempt to make use of the knowledge of spatial locality, particularly in the form of an access graph, to reduce synchronization cost/delay and to allow processes to tolerate memory latency. Consider any possible view. If it is cyclic, this means there exists a view cycle formed by a set of operations of the form:

op11 → op12 → ... → opi1 → opi2 → ... → opk1 → opk2 → op11

where opi1 and opi2 are the first and last operations in process i traversed in the view cycle. Without loss of generality, the above representation assumes that the processes traversed in the view cycle are processes 1, 2, ..., i, ..., k, and that opi1 and opi2 may be a same operation. In the latter case, the view cycle includes a single operation in process i. The construction of the necessary view and possible view directly infers that whenever opi2 → op(i+1)1, these are operations on a same memory object. We show that there is a path in the access graph between process i and process (i+1) under these circumstances. If these two operations form a read/write pair, the result is immediate. If both of them are write operations to a same memory object, then there must be a reader that reads the value of opi2 and orders the read before op(i+1)1. Hence there is a path from process i to process (i+1) through the intermediate reader node. Similar arguments apply to the case when both the processes are readers. Hence we state the following lemma.

Lemma 2 : Existence of a view cycle in a possible view implies existence of a corresponding access cycle in the access graph.

Proof: Suppose the view cycle consists of the following events: e11; e12; e21; e22; ...; ek1; ek2; e11, where ei1 is the first event and ei2 is the last event involving Pi in the cycle; ei1 and ei2 need not be distinct events. Since e(i-1)2 and ei1 must be conflicting operations on a same object, this view cycle also immediately induces an access cycle formed by these conflicting operations in the chain. Hence the claim.

4 Protocol design

A correctly designed distributed protocol for sequential consistency in the shared memory must ensure that at least one acyclic possible view exists, or that a view cycle cannot occur. This could be achieved in a number of ways. Two different approaches lead to the two different protocols presented in this section.

4.1 Access cycle based synchronous protocol

We will first introduce the theoretical basis before describing the synchronous protocol. We use opi to denote operation i (in view space), st opi to denote the starting event of operation i, and end opi to denote the termination event of operation i (in protocol space). In the protocol space, events are ordered according to the happens-before relation (↦). The view space is what we deduce using the results of Section 3. The protocol space, on the other hand, contains events that actually occur in the message passing distributed system. The protocol that implements the distributed shared memory tries to enforce the following correspondences between ordering relations in the view space and the ordering relations in the protocol space.

Relation 1 : PO - Program Order in an Access Cycle

opi ; opj ⇒ end opi ↦ end opj

In other words, if opi and opj are program ordered and appear in a same access cycle, then end opi must happen before end opj.

Relation 2 : CO - Conflict Order in a Shared Object

This models the atomicity of each shared object with respect to conflicting read/write operations.
1. All write operations on a same object are totally ordered, i.e., end w(x,v1) ↦ end w(x,v2) ↦ end w(x,v3) ↦ ...


2. Writing of a value must precede any reading of that value, i.e., end w(x,v) ↦ end r(x,v) for every r(x,v).
3. Every read must return the most recent value written, i.e., end w(x,v) ↦ end w(x,v') implies end r(x,v) ↦ end w(x,v') for every r(x,v).

In other words, every pair of read and write operations on a same object must be ordered in the protocol space. We state the following important result.

Theorem 1 : An execution involving a protocol that satisfies PO and CO must possess an acyclic possible view. Hence such a protocol implements sequential consistency.

Proof: An abstract proof argument is given here because of space limitations. Suppose the assertion is false, i.e., every possible view has a view cycle. From PO and CO, the events in the view cycle form a cyclic happens-before order and this directly contradicts the principle of causality. Hence the claim.

Based on the above abstract results, we could focus on designing a protocol that synchronizes every access cycle while maintaining that every read or write of an object is locally atomic (to the object). To illustrate this aspect, let us consider the simple access graph in Figure 4. In this graph, there are only two access cycles. Suppose process P1 performs the following sequence of operations: w(x,1); w(y,2); r(z,?); r(t,?); ... Then Theorem 1 asserts that w(x,1) does not delay w(y,2) but must be completed before r(z,?), as the two writes are in two different access cycles and cannot form a view cycle. Hence end w(x,1) ↦ end w(y,2) does not have to be enforced. Similarly, end w(y,2) ↦ end r(z,?) does not have to be enforced.

Figure 4: Access graph with 2 independent access cycles. [Diagram: process P1 and neighboring processes sharing objects including x, y, z, t, u, v, a and b.]

Access cycle based synchronous protocol

In this protocol, objects are separated into 2-phase and 3-phase objects respectively. Single-reader objects are synchronized using a 2-phase protocol. Others are synchronized using a 3-phase protocol. Synchronization delay between operations in a process is confined to operations in an access cycle. Operations not lying in an access cycle are not synchronized and do not incur delay between them. The serialization of concurrent writes in 3-phase objects is achieved by using a logical timestamp augmented with the process id. It is assumed that a process contains a sequence of memory operations to be invoked. This sequence is the program order of these operations. The invocation of a memory operation spawns a child thread from the parent process thread. The end of an operation is delayed if a preceding operation in a common access cycle has not ended. When an operation finishes, the child thread disappears. In addition, there is a kernel thread that is responsible for receiving and updating the values of objects readable by that process. The details of the protocol are given in Figure 5. It is noteworthy that the above protocol satisfies both PO and CO and hence is correct. The details of its correctness proof, which involves showing the causal relationship among the ending of all operations, are beyond the scope of this paper and will be omitted.
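As a small illustration of the version numbers used to serialize concurrent 3-phase writes, the following Java sketch (ours, not the paper's code; names are invented) orders writes by a logical timestamp with the process id as tie-breaker:

// Logical timestamp plus process id, so that all sites agree on a single
// total order of the writes to an object.
public record Version(long clock, int processId) implements Comparable<Version> {
    @Override
    public int compareTo(Version other) {
        int byClock = Long.compare(clock, other.clock);   // higher clock wins
        return byClock != 0 ? byClock : Integer.compare(processId, other.processId);
    }

    // On receipt of an update carrying ts', a site advances its clock:
    // ti := max(ti, ts'), mirroring the kernel-thread rule in Figure 5.
    public static long advance(long localClock, long receivedClock) {
        return Math.max(localClock, receivedClock);
    }
}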

4.2 Flush protocol

The flush protocol is a novel protocol designed to replace the tight synchronization imposed by PO and CO. In particular, we wish to allow conflicting operations to proceed asynchronously. In terms of the synchronous protocol described in Section 4.1, it means a child thread no longer waits for its older brother threads to return before proceeding with its termination operation. The parent thread can start each child thread in program order. These child threads move asynchronously with respect to their brothers as well as with respect to other threads in other processes. The only synchronization imposed is atomically triggered whenever a special write operation in each access cycle is invoked by a parent thread. This special write operation is called a flush write. The flush write serves to synchronize all asynchronous operations in an access cycle that could potentially lead to a view cycle. Intuitively, a view cycle involving the operations in a given access cycle cannot be formed unless every operation in the access cycle has been invoked since the last (flush) synchronization. Hence synchronization is completely hidden/ignored until the flush write occurs. When it does, each writer process in the same access cycle is checked such that all operations in the access cycle that have started before the flush must have ended before the flush is allowed to end.


Process i:

Suppose ti is the logical clock shared by all the threads in process i and txi is the timestamp/version number of the most recent version of x established at process i. We assume that each statement in the protocol is atomically executed in a thread and that a parent thread spawns child threads in program order.

(i) w(x,v) thread:
  case of
    3-phase write:
      procedure write thread 3(x,v,ts)*;  local: ts, wait;  global: ti, txi;
        ts := inc(ti);                                                     (st w(x,v))
        broadcast update3(x,v,i,ts) to all readers of x;
        wait := {j | process j is a reader of x};
        repeat until (wait = empty or killed[x,ts])
              and (all active older brother threads in a same access cycle have returned)
          { upon receipt of ack(x,i,ts) from process j do wait := wait - {j} };
        if not killed[x,ts] then broadcast commit(x,i,ts) to all readers of x;   (end w(x,v))
        return;
    2-phase write:
      procedure write thread 2(x,v);
        repeat until all active older brother threads in a same access cycle have returned;
        send update2(x,v) to the reader;                                   (st w(x,v))
        wait for acknowledge from the reader;
        return;
  * For simple presentation, it is assumed here that a writer is also a reader.

(ii) r(x,?) thread:
  procedure read thread(x);
    repeat until all older brother threads in a same access cycle have returned;
    case of
      3-phase object:
        repeat until readable[x];
        return value[x];                                                   (st/end r(x,?))
      2-phase object:
        return value[x];                                                   (st/end r(x,?))

(iii) Kernel thread:
  repeat forever
    case of
      receipt of update3(x,v,j,ts') from process j:
        ti := max(ti,ts');
        if txi < ts' then
          readable[x] := false; value[x] := v; txi := ts';
        send ack(x,j,ts');
        for all active write thread(x,v,ts) with ts' > ts do killed[x,ts] := true;
      receipt of commit(x,j,ts'):
        if txi = ts' then readable[x] := true;
      receipt of update2(x,v) from process j:
        value[x] := v;
        send acknowledge(x,v) to process j;                                (end w(x,v)) (2-phase write)
  end repeat forever

Figure 5: Synchronous protocol.

The theory allows the use of a read operation as a flush, but for our design and simulation, we have used a write for this purpose. The details of the protocol are given below. It is assumed that the protocol described in Section 4.1 is used with the removal of all delays (waiting) caused by the return of older brother threads (the "repeat until ... older brother threads ... have returned" lines in Figure 5, which are italicized in the original). The changes to the base protocol then include the case of a 'flush write' in the write thread and the case of receipt of a flush write in the kernel thread:

w(x,v) thread:
  case of
    ...
    flush write(x,v):
      procedure flush write(x,v);
        broadcast flush write(x,v) to each process in the access cycle;
        repeat until receipt of flush ack(x,v) from each process;    (st flush)
        broadcast commit flush(x,v) to each process;                 (end flush)
        return;

Kernel thread:
  case of
    ...
    receipt of flush write(x,v) from process j:
      atomically perform
        wait until all child threads in the same access cycle as the flush operation have returned;
        send flush ack(x,v) to process j;
        if process i is a reader of x then value[x] := v;

In addition, the parent thread is changed so that it delays spawning a child thread if the latter is in a same access cycle as an ongoing flush write which has not yet committed, i.e., the commit flush message has not been received. The correctness of the flush protocol is based on the following flush order, enforced between an arbitrary operation (op) and a flush operation (flush) in a same access cycle. We use st op to represent the starting event of op, as marked explicitly in the detailed protocol earlier.

Relation 3 : FO - Flush Order in an Access Cycle:
1. op and flush in a same access cycle ⇒ st op ↦ st flush or st flush ↦ st op, and
2. st op ↦ st flush ⇒ end op ↦ end flush.

We will state the following theorem without presenting its complete proof.

Theorem 2 : An execution involving a protocol that satisfies FO must have an acyclic possible view.

Proof: The argument is identical to that in the proof of Theorem 1.


5 Simulation


A simulation is performed to evaluate the performance of the protocols described in Section 4 under different systems. The simulation is written in Java and built around a discrete event simulation package called Javasim [14]. It consists of three basic components: the shared memory kernel, the network simulator, and the application simulator. This section will present some relevant details of these before analyzing the simulation results.
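For readers unfamiliar with discrete event simulation, the following bare-bones engine sketches the general idea; it is our illustration only, Javasim's actual API differs and all names here are invented:

import java.util.PriorityQueue;

// A minimal discrete-event loop: events are processed in timestamp order and
// may schedule further events.
public class DiscreteEventEngine {
    public interface Event { double time(); void fire(DiscreteEventEngine engine); }

    private final PriorityQueue<Event> agenda =
            new PriorityQueue<>((a, b) -> Double.compare(a.time(), b.time()));
    private double clock = 0.0;                 // current simulated time

    public void schedule(Event e) { agenda.add(e); }
    public double now() { return clock; }

    public void run() {
        while (!agenda.isEmpty()) {
            Event e = agenda.poll();
            clock = e.time();                   // advance simulated time to the event
            e.fire(this);                       // the event may schedule further events
        }
    }
}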

5.1 The shared memory kernel simulator

Five versions of different protocols are simulated as the kernel support. They are respectively:

- General 3-phase protocol: Each write operation is a 3-phase write that is broadcast to all processes, without taking advantage of the information of the access graph of the user application. Hence this is expected to be the worst performing protocol and serves as an upper bound to the execution time of the simulated system.
- Restricted-synchronous protocol: This is the synchronous protocol presented in Section 4 without making full use of the neighbor information. In particular, the 3-phase handshaking is restricted to only those processes actually sharing a same object. The writer always delays later operations until the current operation has ended.
- Synchronous (neighbor) protocol: This is the synchronous protocol presented in Section 4. It differs from the restricted-synchronous protocol in that a writer delays a later operation only if the latter lies in a same access cycle as one of its readers which has not yet acknowledged.
- Flush protocol: This is the protocol described in Section 4.
- Asynchronous protocol: The asynchronous protocol is formed by removing all handshaking among readers and writers. In particular, a writer simply broadcasts its new value, and a reader reads the local copy at all times, asynchronously. This protocol obviously does not implement sequential consistency, but the resulting performance will serve as the floor (lower bound) to the execution time of the simulated system.

5.2 The network simulator

The shared memory protocols in Section 5.1 are simulated on a simulated network environment. A single performance metric, the total execution time of the simulated application, is chosen for analyzing the system performance. Hence we do not need a detailed simulator such as that in [16]. To account for realistic communication delay, we use the same approach as that in [11]. The sending and receiving of a message incurs a delay D. Hence a 2-phase handshake between a writer and a reader incurs a communication delay of 2D. A 3-phase operation involves the writer broadcasting a message and receiving acknowledgements before broadcasting a commit message. The total communication delay is therefore (n+2)D, assuming that an ethernet-like broadcast channel exists and each broadcast incurs a delay of D. These assumptions are similar to those used in [11] when the size of a packet is small. We have made a simplifying assumption by ignoring congestion and retransmission. The latter could be modeled and simulated as well but is unlikely to make a difference in comparing the performance of different protocols.
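A minimal sketch of this delay model (ours, not simulator code), with the per-message delay D as a parameter:

// Each message send/receive costs D, so a 2-phase handshake costs 2D and a
// 3-phase update with n acknowledging readers costs (n + 2)D
// (update broadcast + n acks + commit broadcast).
public final class DelayModel {
    private final double d;                    // per-message delay D

    public DelayModel(double d) { this.d = d; }

    public double twoPhaseDelay()        { return 2 * d; }
    public double threePhaseDelay(int n) { return (n + 2) * d; }

    public static void main(String[] args) {
        DelayModel m = new DelayModel(5.0);
        System.out.println(m.twoPhaseDelay());     // 10.0
        System.out.println(m.threePhaseDelay(4));  // 30.0
    }
}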

5.3 The application simulator

The application simulator drives the shared memory kernel based on (i) the choice of protocol used, (ii) the behavior of the application being simulated, and (iii) the static access graph supplied with the application.

Behavior of the application and access graph

The behavior of an application process consists of a sequence of read and write operations to be performed. These are either synthetically generated or derived from some known applications.

1. Synthetic applications. A pure synthetic application generator is used to generate different behaviors to be tested. Each process repeatedly executes a computation phase followed by a read/write operation chosen randomly. During a computation phase, a process can perform any operation except shared memory access. The duration of a computation phase is normally distributed with a mean of 5 time units. Figure 6 shows some of the access graphs used in the synthetic applications. (A sketch of such a generator is given after this list.)

2. Mutual exclusion. Lamport's bakery algorithm [4] for critical sections is simulated here. In its general form, the access graph is a fully connected graph.


Figure 6: Synthetic application's access graphs. [Diagrams: 1) Access Graph 1; 2) Access Graph 2; 3) Access Graph 3.]

The variable parameters in this application include the computation delays associated with the critical and noncritical sections.

3. Dining philosopher. A distributed mutual exclusion algorithm for the dining philosopher problem [4] is simulated. The access graph here is a single ring. As in the previous case, the computation delay of a process is a variable parameter in the simulation. By varying this, we achieve different degrees of concurrency in computation among the processes.
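The following Java sketch (ours, not the simulator's code) illustrates the synthetic behavior generator described in item 1; the standard deviation of the computation phase and the 50/50 read/write choice are assumptions, as the paper only specifies the mean of 5 time units:

import java.util.List;
import java.util.Random;

// Each process alternates a computation phase (normally distributed duration,
// mean 5 time units) and a randomly chosen read or write on an accessible object.
public class SyntheticWorkload {
    private final Random rng = new Random();

    record Step(double computeTime, char kind, String object) {}   // kind: 'r' or 'w'

    // Objects readable/writable by this process, taken from its access-graph edges.
    // Assumes at least one of the two lists is non-empty.
    Step nextStep(List<String> readable, List<String> writable) {
        double compute = Math.max(0.0, 5.0 + rng.nextGaussian());  // mean 5, std dev 1 assumed
        boolean doWrite = !writable.isEmpty() && (readable.isEmpty() || rng.nextBoolean());
        List<String> pool = doWrite ? writable : readable;
        String obj = pool.get(rng.nextInt(pool.size()));
        return new Step(compute, doWrite ? 'w' : 'r', obj);
    }
}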

5.4 Analysis of results

The results of various simulation runs are illustrated in Figures 7, 8 and 9. Figure 7 shows the simulated performances of the protocols for the synthetic applications. Figure 8 shows a typical result for the mutual exclusion application and Figure 9 shows that of the dining philosopher application. In general, we expect the restricted-synchronous, synchronous (neighbor), and flush protocols to outperform the general 3-phase protocol because of their avoidance of unnecessary synchronization and their ability to hide long access latency within the computation or among accesses. In the case of the restricted-synchronous protocol, it reduces synchronization cost by restricting reader/writer synchronization to the relevant processes. Hence each acknowledgement phase will be faster. In the synchronous neighbor protocol, two accesses from a process may overlap if they do not lie in a same access cycle. Hence synchronization delays of program-ordered accesses can overlap among themselves as well as with the computation phase of the process. In the flush protocol, not only is synchronization restricted to those processes that are related, but each access also does not delay subsequent accesses except in the case of the flush. Hence, all synchronization delays except the flush are hidden. The cost of synchronization will surface in the latter case and it is localized to the access cycles it controls.

The results of the synthetic applications more or less substantiate the above expectations. Many different access graphs were simulated but only the most representative ones are included here. Access graph 3 in Figure 6 contains more access cycles, whereas the two other access graphs are rather simple. Generally, the flush protocol outperforms the other protocols except in graph 2, which contains relatively more processes in a single access cycle. In that case, the synchronous neighbor protocol gives the best result, as synchronization between two neighbors is more effective than invoking a 3-phase flush involving a relatively large set of processes when that operation is performed. The effectiveness of the synchronous neighbor protocol is also noteworthy in Figure 7, demonstrating that using knowledge of access cycles to hide access latency is an effective strategy.

The mutual exclusion simulation is performed with some small changes from the synthetic simulation. As the access graph for this application is a complete graph, there is no difference between the general 3-phase protocol and the restricted-synchronous protocol. To ensure progress, the asynchronous protocol is not meaningful, so it was not included in the evaluation. Simulation is performed for different combinations of communication delay and number of processes. The results for different communication delays are very similar, and a typical comparison is plotted in Figure 8. In general, the flush protocol outperforms both the general 3-phase and the synchronous neighbor protocols, and the effects are more significant as the number of processes increases. This is understandable as the frequency of synchronization and program-ordered delays becomes more significant with an increase in the number of processes.

The dining philosopher problem is at the opposite end of the spectrum when compared with the general mutual exclusion problem, as object sharing is more localized and precise access cycles exist. Hence, as the results in Figure 9 confirm, all three protocols that make use of the results of this paper perform well, and the synchronous neighbor protocol is the best of the three. The access graph of the dining philosopher problem contains cycles between two neighbors as well as global cycles involving all processes. Hence flush synchronization in various access cycles may result in more non-hidden delays than in the synchronous neighbor protocol.


Figure 7: Synthetic application simulation results (access graphs 1-3). [Plots of total simulation time versus communication delay for the asynchronous, synchronous (neighbor), restricted-synchronous, flush and three-phase protocols.]

Figure 8: Mutual exclusion simulation results for two configurations. [Plots of total simulation time versus the number of processes for the synchronous (neighbor), flush and three-phase protocols.]

Figure 9: Dining philosopher simulation results. [Plot of total simulation time versus communication delay for the synchronous (neighbor), restricted-synchronous, flush and three-phase protocols.]

6 Related works

There are two types of protocols that are used to implement sequential consistency: update-based and invalidation-based protocols. Our protocols are update-based protocols like those presented by [1, 2, 5, 6, 7] but do not systematically use atomic broadcast or a 3-phase exchange on each update. Indeed, the protocols introduced by [2, 6, 5, 7] all use atomic broadcast to implement update operations or strong writes [5]. These operations are expensive and, in order to reduce this cost, Fekete et al. [7] have combined atomic broadcast and point-to-point communication. However, this involves remote reads. Some protocols [15] implement sequential consistency without the use of atomic broadcast or 3-phase protocols. However, these protocols are invalidation-based protocols which use the owner concept or a centralized server. They also involve non-local read operations. In most of the preceding protocols, all update operations are blocking. However, some protocols [2, 15] have introduced some form of asynchrony between write operations issued by the same process. Our flush protocol allows asynchronous operations not only between writes but also between write and read operations issued by the same process. All other protocols that allow asynchronous operations implement a relaxed memory model [1, 5, 8, 10] in which only specially labeled operations are synchronized.


7 Conclusion

In this paper, we have presented two types of algorithms that exploit the knowledge of spatial locality, in the form of access graphs, in implementing sequential consistency. There is a direct relationship between the static access graph and the potential violations of sequential consistency in a dynamic execution. In particular, non-sequential consistency means that there is a view cycle in the dynamic execution. The latter could occur only if there is a corresponding access cycle in the sharing of objects. By focussing on the synchronization of each access cycle, we could ensure sequential consistency with minimal synchronization and maximal overlap of synchronization delays. The access graph of a distributed program can be either specified by the programmer or derived by the compiler through a static analysis of the source code. In particular, an edge of the access graph from process i to process j is formed whenever process i can write into an object that is readable by process j. The two protocols presented here accomplish this in two fundamentally different ways. The synchronous neighbor protocol ensures neighbors in an access cycle are synchronized in each operation. The flush protocol allows them to be asynchronous and therefore hides all synchronization delays, except when a flush operation in the cycle is performed. When the latter happens, a full 3-phase synchronization is invoked. Hence the synchronization delay, if necessary, is postponed to that last minute. The simulation experiments demonstrate the effectiveness of these strategies with the use of synthetic applications and other common distributed applications. Ongoing efforts are devoted to the design and implementation of the run-time and compilation support that go with such a system.

References

[1] Adve, S.V. and Gharachorloo, K. Shared Memory Consistency Models: A Tutorial. IEEE Computer, Vol. 29, No. 12, December 1996, Pages 66-76.
[2] Afek, Y., Brown, G., and Merritt, M. Lazy Caching. ACM Transactions on Programming Languages and Systems, Vol. 15, No. 1, January 1993, Pages 182-205.
[3] Ahamad, M., Neiger, G., Burns, J.E., Kohli, P. and Hutto, P.W. Causal Memory: Definitions, Implementation, and Programming. Distributed Computing, Vol. 9, No. 1, 1995, Pages 37-49.
[4] Andrews, G. Concurrent Programming: Principles and Practice. Benjamin-Cummings, 1991.
[5] Attiya, H., Chaudhuri, S., Friedman, R. and Welch, J.L. Shared Memory Consistency Conditions for Non-sequential Execution: Definitions and Programming Strategies. SIAM Journal on Computing, Vol. 27, No. 1, February 1998, Pages 65-89.
[6] Attiya, H. and Welch, J.L. Sequential Consistency versus Linearizability. ACM Transactions on Computer Systems, Vol. 12, No. 2, May 1994, Pages 91-122.
[7] Fekete, A., Kaashoek, M.F. and Lynch, N. Implementing Sequentially Consistent Shared Objects Using Broadcast and Point-To-Point Communication. Journal of the ACM, Vol. 45, No. 1, January 1998, Pages 35-69.
[8] Gharachorloo, K., Adve, S.V., Gupta, A., Hennessy, J. and Hill, M.D. Programming for Different Memory Consistency Models. Journal of Parallel and Distributed Computing, Vol. 15, No. 4, August 1992, Pages 399-407.
[9] Hutto, P.W. and Ahamad, M. Slow Memory: Weakening Consistency to Enhance Concurrency in Distributed Shared Memories. Proceedings of the 10th International Conference on Distributed Computing Systems (ICDCS-10), Paris, France, May/June 1990, Pages 302-311.
[10] Iftode, L., Singh, J.P. and Li, K. Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Theory of Computing Systems, Vol. 31, No. 4, July/August 1998, Pages 451-473.
[11] Kessler, R.E. and Livny, M. An Analysis of Distributed Shared Memory Algorithms. Proceedings of the 9th International Conference on Distributed Computing Systems (ICDCS-9), Newport, CA, June 1989, Pages 498-505.
[12] Lamport, L. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, Vol. C-28, No. 9, September 1979, Pages 690-691.
[13] Lipton, R.J. and Sandberg, S. PRAM: A Scalable Shared Memory. Technical Report CS-TR-180-88, Dept. of Computer Science, Princeton University, September 1988.
[14] McNab, R. and Howell, F.W. Using Java for Discrete Event Simulation. Proceedings of the Twelfth UK Computer and Telecommunications Performance Engineering Workshop (UKPEW), Univ. of Edinburgh, 1996, Pages 219-228.
[15] Mizuno, M., Raynal, M. and Zhou, J. Sequential Consistency in Distributed Systems. Proc. of the Int. Workshop on Theory and Practice in Distributed Systems, K. Birman, F. Mattern, A. Schiper (Eds.), LNCS 938, Springer-Verlag, July 1995, Pages 224-241.
[16] Stumm, M. and Zhou, S. Algorithms Implementing Distributed Shared Memory. IEEE Computer, Vol. 23, No. 5, May 1990, Pages 54-64.
