Transaction-Based Causally Ordered Protocol for Distributed Replicated Objects

Tomoya Enokido, Takayuki Tachikawa, and Makoto Takizawa
Dept. of Computers and Systems Engineering, Tokyo Denki University
Email

feno,tachi,[email protected]

Abstract

In group communications, causally ordering all the messages transmitted in the network incurs large computation and communication overhead. Transactions in clients manipulate objects in servers by sending read and write requests to the servers. In this paper, we define significant messages by using the relation among the transactions. We propose an object vector to causally order only the significant messages. The scheme of the object vector is invariant under changes of the group membership. We also show a TBCO (transaction-based causally ordered) protocol which adopts the object vector, by which the number of messages to be causally ordered is reduced.

1 Introduction

Distributed applications are realized by a group of multiple objects which cooperate by exchanging messages through the communication network. Here, the messages have to be reliably delivered to the destination objects in the group. Many kinds of group protocols [2,7-9] have been discussed to causally order all the messages transmitted in the network. However, these group protocols imply $O(n)$ to $O(n^2)$ computation and communication overhead. How to reduce this overhead is a critical problem. Cheriton [3] points out that, from the application point of view, it is meaningless to causally order all the messages transmitted. Agrawal [5] discusses a way where only processes receiving significant request messages are required to be rolled back if the senders of the significant messages are rolled back. Tachikawa and Takizawa [13] define the object-based causality among the messages based on the conflicting relation among the abstract operations supported by the objects in the presence of nested operations. Here, each object is not replicated. Raynal [11] discusses a way to relax the traditional message-based causality based on the write semantics for a group of replicas.

In this paper, we introduce a transaction [1] concept to define the causality among the messages which are significant for the applications. A transaction is an atomic sequence of read and write operations. A transaction $T$ in a client processor issues read and write requests to server processors. Thus, the subtransactions of $T$ are computed in the server processors. Data read from an object may be written to another object if the read and the write are computed in $T$. A message $m_1$ may carry information in $m_2$ if $T$ sends $m_1$ after receiving or sending $m_2$. Thus, the messages received and sent by one transaction are related. However, suppose one transaction $T_1$ sends $m_2$ after another transaction $T_2$ receives $m_1$ in the same processor. $m_1$ and $m_2$ are not related unless $T_1$ and $T_2$ manipulate the same objects. In order to define the messages to be causally ordered, we have to consider which subtransactions send and receive the messages. Hence, the group is required to be composed of multiple subtransactions and servers. In this paper, we define how read and write messages are causally related by introducing the transaction concept. In addition, each object is replicated in order to increase the reliability, availability, and performance of the system. We define significant messages to be causally ordered based on the read and write semantics among the replicas and on the transaction concept.

In most group protocols, messages are causally ordered by using the vector clock [6]. Since transactions are randomly initiated in the applications, the size of the vector clock is required to be dynamically changed depending on the number of subtransactions. In this paper, we propose an object vector in order to causally order the messages. The size of the object vector is given by the number of objects. We discuss a novel group protocol named a TBCO (transaction-based causally ordered) protocol which uses the object vector. In the TBCO protocol, an object can neglect insignificant messages if they are delayed, i.e. significant messages can be delivered earlier without waiting for the insignificant messages preceding them.

We present a system model in section 2. In section 3, we define significant messages. In section 4, we present the object vector and the TBCO protocol. In section 5, we evaluate the TBCO protocol.

2 System Model

2.1 System configuration

A system is composed of a collection $P$ of processors $p_1, \ldots, p_n$ ($n \geq 1$) interconnected by the communication network. Objects $o_1, \ldots, o_u$ ($u \geq 1$) are distributed in the processors. Let $O$ be $\{o_1, \ldots, o_u\}$. Each object $o_i$ is stored in one processor. $o_i$ is manipulated by read and write operations. On receipt of a request message $m$ with $op_i$, $o_i$ computes $op_i$. On completion of $op_i$, $o_i$ sends back the response message to the sender of $m$. A read and a write conflict, and two writes conflict. Objects are locked in a read (Rlock) mode and a write (Wlock) mode before read and write operations, respectively, are computed.

We assume that each $o_i$ supports only read and write operations and that $o_i$ is replicated. Each replica is allocated in a processor. Let $o_i^t$ denote a replica of $o_i$ stored in a processor $p_t$. Let $r_t(x)$ and $w_t(x)$ be read and write operations on a replica $x$ in a processor $p_t$, respectively. A processor $p_s$ sends a read request to one replica of $x$, say in $p_t$. $p_t$ sends back the response with the data derived. $p_s$ sends a write request to all the replicas of $x$. That is, the read-one write-all principle is adopted.

The communication network supports each pair of processors $p_s$ and $p_t$ with a bidirectional communication channel. No message is lost but messages may be delivered out of order in the channel.

2.2 Vector clock

The computation on an object $o_i$ is modeled as a sequence of events. There are internal events and communication events, i.e. sending and receipt events. Let $snd_t[m]$ and $rcv_t[m]$ denote a sending event and a receipt event of a message $m$ in a processor $p_t$, respectively. Lamport [4] defines that an event $e_1$ happens before $e_2$ ($e_1 \rightarrow e_2$) iff one of the following conditions holds:
(1) $e_1$ precedes $e_2$ in some object.
(2) $e_1 = snd_s[m]$ and $e_2 = rcv_t[m]$.
(3) $e_1 \rightarrow e_3 \rightarrow e_2$ for some event $e_3$.
A message $m_1$ is said to causally precede $m_2$ ($m_1 \rightarrow m_2$) iff $snd_s[m_1] \rightarrow snd_t[m_2]$ [2]. $p_t$ has to receive $m_1$ before $m_2$ if $m_1 \rightarrow m_2$.

The vector clock [6] $V = \langle V_1, \ldots, V_n \rangle$ is used to causally order the messages. Each $V_u$ is initially 0. $p_t$ increments $V_t$ by one each time $p_t$ sends a message $m$. $m$ carries the vector clock $m.V$ ($= V$). On receipt of $m$, $p_t$ changes $V$ as $V_u = \max(V_u, m.V_u)$ for $u = 1, \ldots, n$ and $u \neq t$. $m_1 \rightarrow m_2$ iff $m_1.V < m_2.V$. Thus, the size of the vector clock depends on the number $n$ of the processors.
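The vector clock rules above can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the class name `VectorClock`, the helper `causally_precedes`, and the 0-based processor indices are assumptions made for the example, and $m.V < m'.V$ is read as "componentwise less than or equal, and not equal".

```python
from typing import List


class VectorClock:
    """Vector clock of one processor p_t among n processors (index t is 0-based)."""

    def __init__(self, n: int, t: int) -> None:
        self.v: List[int] = [0] * n  # each V_u is initially 0
        self.t = t

    def send(self) -> List[int]:
        """Increment V_t and return the copy m.V carried by the message."""
        self.v[self.t] += 1
        return list(self.v)

    def receive(self, m_v: List[int]) -> None:
        """Merge m.V into V: V_u = max(V_u, m.V_u) for every u != t."""
        for u, value in enumerate(m_v):
            if u != self.t:
                self.v[u] = max(self.v[u], value)


def causally_precedes(v1: List[int], v2: List[int]) -> bool:
    """m1 -> m2 iff m1.V < m2.V (componentwise <=, and the vectors differ)."""
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2


# Two processors: p0 sends m1, p1 receives it and sends m2, p0 sends m3.
p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
m1 = p0.send()
p1.receive(m1)
m2 = p1.send()
m3 = p0.send()  # sent without knowledge of m2, so m2 and m3 are concurrent
```

Here `m1` causally precedes `m2`, while `m2` and `m3` are unordered; this is the situation in which a message-based protocol still has to carry one vector entry per group member.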

2.3 Transactions

A transaction $T_i$ is initiated in one client processor $p_s$. $T_i$ manipulates objects in $O$ by issuing read and write requests to the server processors which store the replicas. Here, $op_1 \rightarrow_{T_i} op_2$ iff an operation $op_1$ precedes $op_2$ in $T_i$. $T_i$ computes commit ($c_i$) or abort ($a_i$) at the end of $T_i$. $T_i$ commits if all the operations invoked by $T_i$ complete successfully. If at least one operation fails in $T_i$, $T_i$ aborts, i.e. all the changes of the objects done by writes in $T_i$ are removed. Thus, $T_i$ is an atomic sequence of read and write operations. Let $\mathbf{T}$ be a set $\{T_1, \ldots, T_m\}$ of transactions computed in the system. Let $op_t^i(x)$ denote an operation $op$ of a transaction $T_i$ on a replica $x$ in $p_t$. A subsequence of operations of $T_i$ on the replicas in a processor $p_t$ is named a subtransaction $T_i^t$ of $T_i$ in $p_t$. $op_1 \rightarrow_{T_i^t} op_2$ iff $op_1$ precedes $op_2$ in $T_i^t$.

The interleaved computation of the subtransactions $T_1^t, \ldots, T_m^t$ in $p_t$ is a local history $H_t$ of $p_t$. For every pair of operations $op_1$ and $op_2$ computed in $p_t$, $op_1$ precedes $op_2$ in $H_t$ ($op_1 \rightarrow_{H_t} op_2$) iff $op_1$ is computed before $op_2$ in $p_t$. $op_1 \rightarrow_{H_t} op_2$ if $op_1 \rightarrow_{T_i^t} op_2$.

We define the meaningful precedence of the operations in the processor $p_t$.

[Definition] An operation $op_1$ meaningfully precedes $op_2$ in $p_t$ ($op_1 \Rightarrow_{H_t} op_2$) iff $op_1 \rightarrow_{H_t} op_2$, and (1) $op_1$ and $op_2$ are in some transaction $T_i$ or (2) $op_1$ and $op_2$ conflict. □

Here, $op_1 \rightarrow_{H_t} op_2$ if $op_1 \Rightarrow_{H_t} op_2$. $op_1$ and $op_2$ are concurrent in $p_t$ ($op_1 \parallel_{H_t} op_2$) iff neither $op_1 \Rightarrow_{H_t} op_2$ nor $op_2 \Rightarrow_{H_t} op_1$. The read-from relation [1] is defined as follows.

[Definition] $op_2$ reads from $op_1$ in $p_t$ ($op_1 >_{H_t} op_2$) iff (1) $op_1 \Rightarrow_{H_t} op_2$, (2) $op_1 = w_t^i(x)$ and $op_2 = r_t^i(x)$ for a replica $x$ in $p_t$, (3) there is no $w_t(x)$ such that $op_1 \rightarrow_{H_t} w_t(x) \rightarrow_{H_t} op_2$, and there is no abort $a_i$ such that $op_1 \rightarrow_{H_t} a_i \rightarrow_{H_t} op_2$. □

A global history $H$ for $\mathbf{T}$ is a collection of the local histories $H_1, \ldots, H_n$. Since a write is issued to all the replicas of $o_a$, $r_t(o_a^t)$ can be considered to read $o_a^u$ written by $w_u(o_a^u)$ in $p_u$.

[Definition] $op_2$ reads from $op_1$ in $H$ ($op_1 >_H op_2$) iff (1) $op_1 >_{H_t} op_2$ in some $p_t$, or (2) there is some operation $op_3$ such that $op_1 = w_t^i(x^t)$, $op_3 = w_u^i(x^u)$, and $op_3 >_{H_u} op_2$. □

If $op_1 >_H op_2$, $op_2$ reads an object written by $op_1$.

[Definition] For every pair of operations $op_1$ and $op_2$, $op_1$ meaningfully precedes $op_2$ in $H$ ($op_1 \Rightarrow_H op_2$) iff (1) $op_1 \rightarrow_{T_i} op_2$ in some transaction $T_i$, (2) $op_1 \Rightarrow_{H_t} op_2$ in some processor $p_t$, (3) $op_1 >_H op_2$, or (4) $op_1 \Rightarrow_H op_3 \Rightarrow_H op_2$ for some $op_3$. □

$op_1$ and $op_2$ are concurrent in $H$ ($op_1 \parallel_H op_2$) iff neither $op_1 \Rightarrow_H op_2$ nor $op_2 \Rightarrow_H op_1$. If $op_1$ and $op_2$ are concurrent, $op_1$ and $op_2$ can be computed in any order. For example, $r_t(x)$ and $r_u(x)$ are concurrent.
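The local meaningful precedence can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: the names `Op`, `conflict`, `meaningfully_precedes`, and `concurrent` are hypothetical, a local history is modeled as a list in computation order, and only the direct local relation is checked (the transitive closure of the global definition is not computed).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    kind: str  # "r" for read, "w" for write
    tx: str    # identifier of the issuing transaction
    obj: str   # replica manipulated by the operation


def conflict(a: Op, b: Op) -> bool:
    """A read and a write conflict, and two writes conflict, on the same replica."""
    return a.obj == b.obj and "w" in (a.kind, b.kind)


def meaningfully_precedes(history: List[Op], i: int, j: int) -> bool:
    """history[i] meaningfully precedes history[j] in the local history."""
    a, b = history[i], history[j]
    return i < j and (a.tx == b.tx or conflict(a, b))


def concurrent(history: List[Op], i: int, j: int) -> bool:
    """Neither operation meaningfully precedes the other."""
    return not meaningfully_precedes(history, i, j) and not meaningfully_precedes(history, j, i)


# A hypothetical local history: Ti reads x, Tj reads x, Ti writes y, Tk writes y.
history = [Op("r", "Ti", "x"), Op("r", "Tj", "x"), Op("w", "Ti", "y"), Op("w", "Tk", "y")]
```

In this history the two reads of `x` are concurrent, the read and write of `Ti` are related because they belong to one transaction, and the two writes of `y` are related because they conflict.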


Figure 1: Concurrent operations.

[Example] Figure 1 shows three processors $p_s$, $p_t$, and $p_u$. $p_t$ has replicas $x^t$ and $y^t$, $p_u$ has $x^u$ and $y^u$, and $p_s$ has $x^s$. Here, each dotted arc shows the read-from relation $>_H$. Three subtransactions $T_i^t$, $T_j^t$, and $T_k^t$ are computed in $p_t$ where $T_j^t$ and $T_k^t$ are interleaved.

Here, $r_t^i(x^t)$ reads $x$ from $w_s^h(x^s)$, i.e. $w_s^h(x^s) >_H r_t^i(x^t)$. Since the two operations $r_t^i(x^t)$ and $w_t^i(y^t)$ are issued by a transaction $T_i$, $r_t^i(x^t) \Rightarrow_H w_t^i(y^t)$. Suppose $r_u^l(y^u)$ reads $y$ from $w_t^i(y^t)$, i.e. $w_t^i(y^t) >_H r_u^l(y^u)$. Here, $w_u^i(y^u) \Rightarrow_H r_u^l(y^u)$, $r_t^j(x^t)$ reads $x$ from $w_s^h(x^s)$ and $r_u^l(y^u)$ reads $y$ from $w_t^k(y^t)$. □

A transaction $T_i$ precedes $T_j$ ($T_i \rightarrow T_j$) iff $op_i(x^t) \rightarrow_{H_t} op_j(x^t)$ for every pair of conflicting operations $op_i$ and $op_j$ on an object $x$ in $T_i$ and $T_j$. $H$ is serializable iff $\mathbf{T}$ is totally ordered in $\rightarrow$ [1]. Each transaction $T_i$ is given a unique identifier $t(T_i)$ in order to totally order the transactions, i.e. either $t(T_i) < t(T_j)$ or $t(T_i) > t(T_j)$ for every pair of $T_i$ and $T_j$. If an operation $op_i$ of $T_i$ conflicts with $op_j$ of $T_j$ in $p_t$ and $t(T_i) < t(T_j)$, $op_i \rightarrow_{H_t} op_j$. The timestamp ordering (TO) concurrency control [1] is discussed in distributed database systems, where every pair of conflicting operations is computed in the order of the transaction identifiers, i.e. timestamps. If the following conditions hold, the global log $H$ can be serializable.

[Serializability condition]
(1) For every pair of writes $w_1$ and $w_2$ on an object $x$, $w_1^i(x^t) \rightarrow_{H_t} w_2^j(x^t)$ iff $w_1^i(x^u) \rightarrow_{H_u} w_2^j(x^u)$ for every pair of processors $p_t$ and $p_u$ storing the replicas of $x$.
(2) Suppose $r_t^i(x^t) \rightarrow_{T_i} w_t^i(y^t)$ in $p_t$. If $w_s^j(x^s) >_H r_t^i(x^t)$ and $w_u^i(y^u) >_H r_v^k(y^v)$, $w_v^j(x^v) >_{H_v} r_v^k(y^v)$. □

(1) means that the write operations on the replicas have to be totally ordered. (2) means that data in $x$ flows to $y$ if $T_i$ reads $x$ before writing $y$.

3 Significant Messages

Due to message loss and unexpected delay in the network, a processor $p_t$ may not receive messages or may receive messages out of order. If $p_t$ loses a message $m$ sent by $p_u$, $p_u$ is required to resend $m$ to $p_t$. If $p_t$ receives $m$, $p_t$ has to wait for all messages causally preceding $m$. We discuss a method where $p_t$ can deliver the messages without waiting for all the messages causally preceding them. First, we discuss what kinds of operations can be omitted in each processor.

[Example] Suppose $w_t^1(x^t)$, $w_s^1(x^s)$, and $w_u^1(x^u)$ are initiated by a request message $m_1$, $r_t^2(x^t)$ is by $m_2$, $w_s^3(x^s)$, $w_t^3(x^t)$, and $w_u^3(x^u)$ are by $m_3$, and $r_u^4(x^u)$ is by $m_4$ as shown in Figure 2. $w_t^3(x^t)$ is computed after $r_t^2(x^t)$ in $p_t$. Here, suppose $p_u$ receives $m_3$ before $m_1$. In the traditional group protocols, $p_u$ has to wait for $m_1$ without delivering $m_3$. That is, $p_u$ has to compute $w_u^1(x^u)$, $w_u^3(x^u)$, and $r_u^4(x^u)$ in this sequence. $r_u^4(x^u)$ reads $x$ from $w_u^3(x^u)$ but not from $w_u^1(x^u)$. Here, suppose the requests $w_u^1(x^u)$, $w_u^3(x^u)$, and $r_u^4(x^u)$ are still stored in the receipt queue $RQ_u$ of $p_u$ due to the communication delay although the operations complete in $p_s$ and $p_t$. Since $r_u^4(x^u)$ reads from $w_u^3(x^u)$, $w_u^1(x^u)$ does not need to be computed, i.e. $w_u^1(x^u)$ can be omitted. □


Figure 2: Significant messages.

This example shows that some write operations can be omitted. Suppose $w_t^i(x) \Rightarrow_H w_t^j(x)$. If there is no read $r_t^k(x)$ such that $w_t^i(x) \Rightarrow_H r_t^k(x) \Rightarrow_H w_t^j(x)$, $p_t$ does not need to receive $w_t^i(x)$. This means $p_t$ can compute $w_t^j(x)$ without waiting for $w_t^i(x)$. $p_t$ can further reject $w_t^i(x)$ if $p_t$ receives $w_t^i(x)$ after computing $w_t^j(x)$.

[Example] Suppose $p_t$ receives read requests $r_t^i(x)$, $r_t^j(x)$, and $r_t^k(x)$ from transactions $T_i$, $T_j$, and $T_k$, respectively. $p_t$ usually reads a value from the replica $x$ and sends back the response with the data each time $p_t$ receives a read request. Here, $p_t$ computes read operations on $x$ three times. If $p_t$ reads $x$ once and sends back the response to $T_i$, $T_j$, and $T_k$, the number of operations computed can be reduced. That is, $r_t^j(x)$ and $r_t^k(x)$ can be removed from $RQ_t$ if $r_t^i(x)$ is computed. □

$p_t$ stores the messages received in the receipt queue $RQ_t$ in the order of "$\Rightarrow_H$". We discuss in Section 4.3 how to order messages in $\Rightarrow_H$. If messages $m_1$ and $m_2$ are concurrent, $m_1$ and $m_2$ are stored in $RQ_t$ in the receipt order.

[Definition] A message $m$ is insignificant iff
(1) if $m = r_t^i(x)$, there is some read $r_t^j(x)$ in $RQ_t$ such that $r_t^j(x)$ precedes $r_t^i(x)$ and there is no write $w_t^k(x)$ between $r_t^j(x)$ and $r_t^i(x)$ in $RQ_t$.
(2) if $m = w_t^i(x)$, there is some write $w_t^j(x)$ in $RQ_t$ such that $w_t^i(x)$ precedes $w_t^j(x)$ and there is no read $r_t^k(x)$ between $w_t^i(x)$ and $w_t^j(x)$ in $RQ_t$. □

In Figure 2, suppose $p_u$ receives request messages $m_1 = w_u^1(x^u)$, $m_3 = w_u^3(x^u)$, and $m_4 = r_u^4(x^u)$ in the receipt queue $RQ_u$. Here, $m_3$ and $m_4$ are significant in $RQ_u$ but $m_1$ is insignificant. Insignificant messages can be omitted from the receipt queue.

[Theorem] Let $R_t$ be a sequence of messages in the receipt queue $RQ_t$ of $p_t$. Let $R_t'$ be a subsequence of $R_t$ which includes only the significant messages. Here, the state of $p_t$ and the sequence of output values of read requests obtained by computing the operations in $R_t'$ are the same as those obtained by computing $R_t$. □
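The definition of insignificant messages can be sketched as a filter over an ordered receipt queue. The following Python fragment is only an illustration: the tuple layout `(label, kind, object)` and the function names `is_insignificant` and `significant` are assumptions of this sketch, and the queue is taken to be already ordered. It relies on the observation that, with only reads and writes, a read is insignificant exactly when the nearest earlier request on the same object is a read, and a write is insignificant exactly when the nearest later request on the same object is a write.

```python
from typing import List, Tuple

Msg = Tuple[str, str, str]  # (label, kind, object); kind is "r" or "w"


def is_insignificant(rq: List[Msg], k: int) -> bool:
    """Apply conditions (1) and (2) of the definition to the k-th queue entry."""
    _, kind, obj = rq[k]
    # Reads look backwards for an earlier read, writes look forwards for a later write.
    neighbours = reversed(rq[:k]) if kind == "r" else rq[k + 1:]
    for _, other_kind, other_obj in neighbours:
        if other_obj == obj:
            # The nearest request on the same object decides the outcome.
            return other_kind == kind
    return False


def significant(rq: List[Msg]) -> List[str]:
    """Labels of the messages that remain after insignificant ones are omitted."""
    return [m[0] for k, m in enumerate(rq) if not is_insignificant(rq, k)]


# Receipt queue of p_u in the Figure 2 example: m1 = w(x), m3 = w(x), m4 = r(x).
rq_u = [("m1", "w", "x"), ("m3", "w", "x"), ("m4", "r", "x")]
kept_u = significant(rq_u)

# Three reads of x with no write in between: only the first has to be computed.
reads = [("ri", "r", "x"), ("rj", "r", "x"), ("rk", "r", "x")]
kept_reads = significant(reads)
```

For the Figure 2 queue the filter keeps `m3` and `m4` and drops `m1`, which matches the example in the text.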

4 TBCO Protocol


We present a transaction-based causally ordered (TBCO) protocol for supporting the transaction-based causally ordered delivery of messages.

4.1 Object vector

In this paper, we propose an object vector to causally order the messages based on the read and write semantics in the transactions. Each transaction $T_i$ is given a unique identifier $t(T_i)$. For every pair of different transactions $T_i$ and $T_j$, $t(T_i) \neq t(T_j)$. If $T_i$ starts before $T_j$ in the same processor, $t(T_i) < t(T_j)$. In addition, if a processor $p_t$ initiates $T_i$ after receiving a message from $T_j$, $t(T_i) > t(T_j)$.

The transaction identifier is given by the linear clock mechanism [4]. $p_t$ manipulates a variable $tid$ showing the linear clock, whose initial value is 0. When $T_i$ is initiated in $p_t$, $t(T_i)$ is given as a concatenation of $tid$ and the processor number of $p_t$. Here, let $tid(T_i)$ denote the $tid$ given to $T_i$. $p_t$ manipulates $tid$ as follows:
(1) $tid := tid + 1$ if a transaction $T_i$ is initiated in $p_t$.
(2) On receipt of a message from $T_j$, $tid := \max(tid, tid(T_j))$.
$tid(T_i) := tid$ in $p_t$ if $T_i$ is initiated in $p_t$.

$T_i$ issues read and write requests to the processors. Each read or write event $e$ in $T_i$ is given an event number $no(e)$ as follows. Here, let $e_1$ and $e_2$ be events occurring in $T_i$, where $e_1 \rightarrow_{T_i} e_2$.
(1) If there is no event $e_3$ such that $e_3 \rightarrow_{T_i} e_1$, i.e. $e_1$ is the initial event of $T_i$, $no(e_1) = 0$.
(2) If $e_2$ is a write event, $no(e_1) < no(e_2)$.
(3) If $e_2$ is a read event and there is no write event $e_3$ such that $e_1 \rightarrow_{T_i} e_3 \rightarrow_{T_i} e_2$, $no(e_1) = no(e_2)$.
$T_i$ has a variable $no_i$ where $no_i = 0$ when $T_i$ is initiated. $no_i$ is incremented by one each time a write event $e$ occurs. $no(e) := no_i$ when $e$ occurs in $T_i$. Each event $e$ in $T_i$ is given a global event number $tno(e)$ as the concatenation of $t(T_i)$ and $no(e)$.

Every replica $o_a^j$ of $o_a$ has the same version number $v(o_a^j) = v(o_a)$. The version number $v(o_a^j)$ of $o_a^j$ is updated as $v(o_a^j) := tno(w)$ each time a write $w$ is computed in $o_a^j$. Here, if $tno(op) > v(o_a^j)$, $op$ can be computed. Otherwise, $op$ aborts. Suppose $T_h$ reads $o_a^j$. $T_h$ issues a read $r$ to $o_a^j$ and receives the response with the data derived and the version number $v(o_a^j)$ from $o_a^j$. $T_h$ has a vector of variables $\langle V_1, \ldots, V_u \rangle$ where each $V_a$ is initially 0 ($a = 1, \ldots, u$). On receipt of $v(o_a^j)$, $V_a := v(o_a^j)$ if $V_a < v(o_a^j)$. Otherwise, $T_h$ aborts.
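The identifier and event-number rules can be sketched as follows. This is a minimal Python illustration, not the paper's code. The classes `Processor`, `Transaction`, and `Replica` are hypothetical, the concatenations are modeled as tuples compared lexicographically, and the sketch assumes that the initial event of a transaction keeps number 0 even when it is a write, which is how rule (1) and the examples of Section 4.3 (tno 1s0 for a first write) read.

```python
from typing import Tuple

Tno = Tuple[Tuple[int, str], int]  # tno(e) = (t(T_i), no(e)), with t(T_i) = (tid, processor)


class Transaction:
    def __init__(self, tid: int, proc: str) -> None:
        self.t = (tid, proc)   # concatenation of tid and the processor number
        self.no = 0            # no_i = 0 when the transaction is initiated
        self.started = False

    def event(self, kind: str) -> Tno:
        """Number a read ("r") or write ("w") event and return tno(e)."""
        if self.started and kind == "w":
            self.no += 1       # a later write gets a strictly larger number
        self.started = True    # the initial event keeps number 0 (assumption)
        return (self.t, self.no)


class Processor:
    def __init__(self, name: str) -> None:
        self.name = name
        self.tid = 0           # linear clock, initially 0

    def initiate(self) -> Transaction:
        self.tid += 1          # rule (1)
        return Transaction(self.tid, self.name)

    def on_receive(self, sender_tid: int) -> None:
        self.tid = max(self.tid, sender_tid)  # rule (2)


class Replica:
    def __init__(self) -> None:
        self.version: Tno = ((0, ""), 0)

    def write(self, tno: Tno) -> bool:
        """Compute the write only if tno(w) > v(o); otherwise it aborts."""
        if tno <= self.version:
            return False
        self.version = tno
        return True


ps, pu = Processor("s"), Processor("u")
t1 = ps.initiate()
wx, wy = t1.event("w"), t1.event("w")   # numbered 0 and 1 within t1
pu.on_receive(t1.t[0])                  # p_u receives a message from t1
t2 = pu.initiate()
x = Replica()
first, stale, newer = x.write(wx), x.write(wx), x.write(t2.event("w"))
```

Because `p_u` has seen a message of `t1`, the transaction it initiates afterwards gets a larger identifier, and a replayed write with an old `tno` is rejected by the version check.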

4.2 Message format

A message $m$ is composed of the following fields:
$m.src$ = sender processor of $m$, i.e. $p_s$.
$m.dst$ = set of destination processors.
$m.op$ = type $op$ of operation, i.e. read, write, commit, abort, or response.
$m.tno$ = global event number $\langle m.t, m.no \rangle$, i.e. $tno(op)$.
$m.o$ = object to be manipulated by $m.op$.
$m.V$ = object vector $\langle V_1, \ldots, V_u \rangle$.
$m.SQ$ = vector of sequence numbers $\langle sq_1, \ldots, sq_n \rangle$.
$m.d$ = data.

If $m.op$ = read, $m.dst$ denotes one processor which has a replica $o_a^i$ of $o_a$. If $m.op$ = write, $m.dst$ shows all the processors which have the replicas of $o_a$. A processor $p_s$ constructs a request message $m$ with $op_h$ from $T_h$ as follows:
$m.t := t(T_h)$; $m.no := no(op_h)$; $m.op := op_h$; $m.o := o_a$; $m.src := p_s$;
$m.V_i := V_i$ ($i = 1, \ldots, u$);
$sq_t := sq_t + 1$ for every $p_t$ in $m.dst$;
$m.sq_j := sq_j$ ($j = 1, \ldots, n$);

$p_s$ manipulates variables $sq_1, \ldots, sq_n$. Each time $p_s$ sends a message to $p_t$, $sq_t$ is incremented by one. Then, $p_s$ sends $m$ to every destination $p_t$ in $m.dst$. $p_t$ can detect a gap between messages sent by $p_s$ by checking the sequence number $sq_t$. $p_t$ has variables $rsq_1, \ldots, rsq_n$, which are initially 0. On receipt of $m$, if $m.sq_t = rsq_s$, there is no gap, i.e. $p_t$ expects to receive $m$, and then $rsq_s := rsq_s + 1$. If $m.sq_t > rsq_s$, there is a gap, i.e. $p_t$ has not received some message $m'$ sent from $p_s$ where $m.sq_t > m'.sq_t \geq rsq_s$.

On receipt of $m$ from $p_s$, $p_t$ enqueues $m$ in the receipt queue $RQ_t$. $p_t$ manipulates the object vector $V$ as $V_a = \max(V_a, m.V_a)$ for $a = 1, \ldots, u$. A replica $o_a^t$ sends back a response $m$ to $p_s$ on receipt of a read request $m'$ from $p_s$. Here, $m.t := m'.t$; $m.no := m'.no$; $m.op$ := response; $m.o := o_a$. $m$ carries the version number $v(o_a)$, i.e. $m.V_a := v(o_a)$.
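The per-destination sequence numbers and the gap check can be sketched in Python as below. The classes `Sender` and `Receiver` and their method names are hypothetical. The sketch also makes one numbering assumption explicit: because $sq_t$ is incremented before it is copied into the message, the first message to a destination carries 1, so the receiver here expects $rsq_s + 1$; a comparison against $rsq_s$ itself would correspond to numbering from 0.

```python
from typing import Dict, List


class Sender:
    """Processor p_s keeping one counter sq_t per possible destination p_t."""

    def __init__(self, peers: List[str]) -> None:
        self.sq: Dict[str, int] = {p: 0 for p in peers}

    def stamp(self, dst: List[str]) -> Dict[str, int]:
        """Increment sq_t for every destination and return the vector m.SQ."""
        for t in dst:
            self.sq[t] += 1
        return dict(self.sq)


class Receiver:
    """Processor p_t keeping rsq_s, the last sequence number seen from each p_s."""

    def __init__(self, name: str, peers: List[str]) -> None:
        self.name = name
        self.rsq: Dict[str, int] = {p: 0 for p in peers}

    def check(self, src: str, m_sq: Dict[str, int]) -> List[int]:
        """Return the sequence numbers from src that this message shows to be missing."""
        got = m_sq[self.name]
        missing = list(range(self.rsq[src] + 1, got))
        self.rsq[src] = max(self.rsq[src], got)
        return missing


s = Sender(["t", "u"])
m1 = s.stamp(["t", "u"])   # a write sent to both replicas
m2 = s.stamp(["t"])        # a read sent to p_t only
m3 = s.stamp(["t", "u"])   # another write sent to both

pt, pu = Receiver("t", ["s"]), Receiver("u", ["s"])
gaps_t = [pt.check("s", m1), pt.check("s", m3)]  # m2 is delayed on its way to p_t
gaps_u = [pu.check("s", m1), pu.check("s", m3)]  # p_u was never a destination of m2
```

Keeping one counter per destination is what lets `p_u` see no gap even though it never receives `m2`, while `p_t` detects that the message numbered 2 is missing.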

4.3 Message delivery

Messages $m_1$ and $m_2$ received can be ordered in $p_t$ according to the following rule.

[Ordering rule] $m_1$ precedes $m_2$ ($m_1 \rightarrow m_2$) (1) if $m_1.V < m_2.V$, or (2) otherwise, i.e. if $m_1.V = m_2.V$ or $m_1.V$ and $m_2.V$ are not comparable, if $m_1.t < m_2.t$. □
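The ordering rule can be sketched as a comparison function. This Python fragment is illustrative only: the names `vec_less` and `precedes` are hypothetical, a version is modeled as a tuple `(tid, processor, event number)` with `ZERO` standing for the initial value 0, and the fallback to transaction identifiers is applied whenever neither object vector is smaller than the other, which is an interpretation of the rule based on the two cases of the Figure 3 example.

```python
from typing import Sequence, Tuple

Version = Tuple[int, str, int]   # tno: (tid, processor, event number)
TxId = Tuple[int, str]           # t(T): (tid, processor)
ZERO: Version = (0, "", 0)       # initial version of every object


def vec_less(v1: Sequence[Version], v2: Sequence[Version]) -> bool:
    """v1 < v2: every entry is <= the matching entry and the vectors differ."""
    return all(a <= b for a, b in zip(v1, v2)) and tuple(v1) != tuple(v2)


def precedes(m1_v: Sequence[Version], m1_t: TxId,
             m2_v: Sequence[Version], m2_t: TxId) -> bool:
    """m1 -> m2 by the ordering rule on object vectors, then transaction identifiers."""
    if vec_less(m1_v, m2_v):
        return True               # case (1)
    if vec_less(m2_v, m1_v):
        return False              # m2 precedes m1 instead
    return m1_t < m2_t            # case (2): equal or incomparable vectors


# Case (1) of the example: both messages come from T_1s and the vectors are comparable.
a1 = ((1, "s", 0), ZERO)
a2 = ((1, "s", 0), (1, "s", 1))
case1 = precedes(a1, (1, "s"), a2, (1, "s"))

# Case (2): <1s1, 2v1> and <2u0, 0> are incomparable, so identifiers 1s and 2u decide.
b1 = ((1, "s", 1), (2, "v", 1))
b2 = ((2, "u", 0), ZERO)
case2 = precedes(b1, (1, "s"), b2, (2, "u"))
```

Both calls report that the first message precedes the second, and swapping the arguments in either case reports the opposite.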

Figure 3: Message ordering.

[Example] Figure 3 shows that transactions manipulate replicas of objects $x$ and $y$. In Figure 3, $\langle \alpha, \beta \rangle$ shows an object vector where $\alpha$ and $\beta$ are the values of variables $V_x$ and $V_y$ showing the object versions of $x$ and $y$, respectively. Initially, the object vector is $\langle 0, 0 \rangle$ in every processor.

$x$ and $y$ are in every processor in Figure 3(1). A transaction $T_{1s}$ is computed in $p_s$ where $t(T_{1s}) = 1s$. $T_{1s}$ first sends a write $m_1$ with $w(x)$ and then sends $m_2$ with $w(y)$ to $p_s$, $p_t$, and $p_u$. $tno(w(x)) = 1s0$. $m_1.V = \langle 1s0, 0 \rangle$ and $m_1.t = 1s$. Then, $w(y)$ is computed in $p_s$, $p_t$, and $p_u$. Here, $tno(w(y)) = 1s1$. $T_{1s}$ sends $m_2$ with $m_2.V = \langle 1s0, 1s1 \rangle$ and $m_2.t = 1s$. $p_t$ receives $m_2$ after $m_1$ but $p_u$ receives $m_1$ after $m_2$. Since $m_1.V$ ($= \langle 1s0, 0 \rangle$) $< m_2.V$ ($= \langle 1s0, 1s1 \rangle$), $m_1 \rightarrow m_2$. $p_t$ and $p_u$ deliver $m_1$ and $m_2$ in the order "$\rightarrow$". After $w(x)$ is computed, the object vector is $\langle 1s0, 0 \rangle$ in every processor, and it is $\langle 1s0, 1s1 \rangle$ in every processor after $w(y)$.

In Figure 3(2), $T_{2u}$ is computed after $T_{1u}$ is completed in $p_u$. $x$ is in every processor but $y$ is in $p_s$ and $p_t$. Before $T_{1s}$ is started, $v(y)$ is assumed to be $2v1$. $T_{1s}$ reads $y$ and gets the version number $2v1$ of $y$ from $p_t$. $T_{1s}$ and $T_{2u}$ send a write $m_1$ with $w_s(x)$ and a write $m_2$ with $w_u(x)$, respectively, to every processor. $tno(w_s(x)) = 1s1$ in $T_{1s}$ and $tno(w_u(x)) = 2u0$ in $T_{2u}$. Hence, $m_1.V = \langle 1s1, 2v1 \rangle$ and $m_2.V = \langle 2u0, 0 \rangle$. $m_1.V$ and $m_2.V$ are not comparable but $m_1.t$ ($= 1s$) $< m_2.t$ ($= 2u$). Hence, $m_1 \rightarrow m_2$. □

On receipt of a message $m$ from $p_u$, $p_t$ stores $m$ in the receipt queue $RQ_t$. Here, the messages in $RQ_t$ are ordered in "$\rightarrow$" by the message ordering rule. $m$ is a top message in $RQ_t$ iff there is no message $m_1$ in $RQ_t$ such that $m_1 \rightarrow m$. A top message $m_1$ in $RQ_t$ cannot be delivered at once because there might be some message $m_2$ causally preceding $m_1$ which $p_t$ has not yet received due to unexpected network delay. Here, suppose $p_t$ receives a message $m$ from $p_s$. $m$ is correctly received by $p_t$ if $p_t$ receives every message $m'$ where $m'.sq_t < m.sq_t$. That is, $p_t$ receives every message which $p_s$ sends before $m$. We discuss what messages can be removed from $RQ_t$.

[Definition] Let $m$ be a message sent by $p_u$ to $p_t$ and stored in the receipt queue $RQ_t$. $m$ is stable iff (1) there is a message $m_1$ from $p_u$ in $RQ_t$ where $m_1.sq_t = m.sq_t + 1$, and (2) for each processor $p_v$, there is a message $m_1$ in $RQ_t$ from $p_v$ where $m \rightarrow m_1$ and $p_t$ correctly receives $m_1$. □

The top message $m$ of $RQ_t$ can be delivered if $m$ is stable, because it is sure that every message causally preceding $m$ from the transaction point of view has already been stored in $RQ_t$.

[Definition] A top message $m$ in $RQ_t$ is ready if no operation conflicting with $m.op$ is being computed in the replica denoted by $m.o$. □

In addition, only the significant messages in $RQ_t$ are delivered from $RQ_t$ in order to reduce the time for delivering messages. While the top message $m$ in $RQ_t$ is stable and ready, $m$ is dequeued from $RQ_t$. $m$ is delivered to the application if $m$ is significant, otherwise $m$ is neglected. Each $p_t$ has variables $D_1, \ldots, D_u$ where each $D_a$ shows the version number of the object $o_a$ ($a = 1, \ldots, u$). Each time a message $m$ is delivered, $D_a = m.t$ for $m.o = o_a$. Here, if $m.t < D_a$, $m$ is omitted.
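The last step of the delivery rule, omitting a request whose transaction identifier is older than the version already delivered for its object, can be sketched as follows. This Python fragment is a simplified illustration: the function `deliver`, the tuple layout of a request, and the scenario are hypothetical, and the stability and readiness checks are assumed to have been passed before a request reaches this step.

```python
from typing import Dict, List, Tuple

TxId = Tuple[int, str]             # t(T): (tid, processor)
Request = Tuple[str, TxId, str]    # (label, m.t, m.o); layout assumed for the sketch


def deliver(dequeued: List[Request], D: Dict[str, TxId]) -> List[str]:
    """Deliver dequeued requests in order, omitting those older than D_a."""
    delivered = []
    for label, t, obj in dequeued:
        if obj in D and t < D[obj]:
            continue               # m.t < D_a: the request is omitted
        D[obj] = t                 # D_a = m.t for m.o = o_a
        delivered.append(label)
    return delivered


# Hypothetical scenario at one processor: a write from transaction 3t on x is
# delivered first, then delayed writes from the older transactions 1s and 2s arrive.
D: Dict[str, TxId] = {}
arrivals = [("w3t(x)", (3, "t"), "x"), ("w1s(x)", (1, "s"), "x"), ("w2s(x)", (2, "s"), "x")]
result = deliver(arrivals, D)
```

Only the write of the newest transaction is computed; the two late writes are dropped without affecting the final state of `x`.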

If some processor $p_u$ does not send any message to $p_t$, the messages in $RQ_t$ cannot be stable. In order to resolve this problem, each $p_u$ sends a message including only the header information to every other processor $p_t$ if $p_u$ has not sent any message to $p_t$ for some predetermined $\delta$ time units. $p_t$ considers that $p_t$ loses a message from $p_u$ if (1) $p_t$ does not receive any message in $\delta$ time units or (2) $p_t$ detects that there is a gap in the receipt sequence of messages. If $p_t$ detects that $m$ is lost, $p_t$ requires $p_u$ to resend $m$. $p_u$ considers that $p_t$ may lose $m$ if $p_u$ does not receive the receipt confirmation of $m$ from $p_t$ in $2\delta$ time units after $p_u$ sends $m$ to $p_t$. Finally, the following property is obtained.

[Theorem] The global log for a set of transactions is serializable if each processor $p_t$ computes requests according to the delivery rule. □

[Example] In Figure 4, two transactions $T_{1s}$ and $T_{2s}$ are computed in $p_s$, and $T_{3t}$ and $T_{4t}$ are computed in $p_t$. $T_{1s}$, $T_{2s}$, and $T_{3t}$ issue write requests $w_{1s}(x)$, $w_{2s}(x)$, and $w_{3t}(x)$. $w_{1s}(x)$ arrives at neither $p_t$ nor $p_u$. $w_{2s}(x)$ does not arrive at $p_u$. In $p_u$, since $w_{1s}(x)$ and $w_{2s}(x)$ are not significant, they are omitted. □

Suppose $p_t$ receives multiple read requests $r_1(x), \ldots, r_h(x)$ ($h > 1$) which are concurrent in $RQ_t$. In order to reduce the number of operations computed in $p_t$, $p_t$ computes one read operation $r(x)$ and then sends the response to all the source processors of $r_1(x), \ldots, r_h(x)$. Since $r(x)$ is computed only once, the number of operations can be reduced.

Figure 4: Omission of write operations.

5 Evaluation

The TBCO protocol adopts the object vector, whose size is the number $u$ of objects in the system. On the other hand, the size of the vector clock is given by the number of processes, i.e. servers and subtransactions in the group. Messages communicated in a transaction are related but ones communicated among transactions may not be related. Hence, it is critical to consider subtransactions as the members of the group in order not to order all the messages exchanged among the processors. Since the transactions are initiated randomly, the scheme of the vector clock has to be dynamically changed. In order to change the scheme, all the processes, i.e. subtransactions and servers, have to be synchronized. Since the object vector depends only on the number of objects, the protocol can be applied to transaction-oriented applications.

The TBCO protocol does not causally order all the messages transmitted in the network. We evaluate the TBCO protocol in terms of the number of messages causally ordered and the delay time, compared with the traditional message-based protocol. The TBCO protocol is realized in threads on Ultra Sparc stations interconnected by 100base Ethernet. TCP is used to exchange messages among the processors, i.e. stations. Each processor randomly creates transactions, each of which randomly manipulates objects.

[Figure 5 plot: vertical axis, number of operations (0.30 to 1.00); horizontal axis, write / (write + read) (0.00 to 1.00); curves labeled Traditional and TBCO.]

Figure 5: Evaluation of the TBCO protocol.

We measure how many requests can be omitted in each processor. First, suppose three objects $o_1$, $o_2$, and $o_3$ are fully replicated in three processors $p_1$, $p_2$, and $p_3$. In each processor, twenty transactions are randomly initiated and each transaction issues ten requests of arbitrary kinds, i.e. reads and writes to arbitrary objects. Figure 5 shows the total number of requests computed in the objects for each ratio of write requests issued by the transactions. The dotted line shows the TBCO protocol and the solid line indicates the message-based protocol. The more write requests are issued, the more messages are sent to the replicas. The write ratio 1.0 means that only write requests are issued. On the other hand, 0.0 shows that only read requests are issued. The figure shows that the TBCO protocol can reduce the number of messages causally ordered. The more write messages are issued, the more messages can be omitted. For example, only about 68% of the messages transmitted in the message-based protocol are transmitted in the TBCO protocol when all the messages are writes, i.e. the write ratio is 1.0, while the figure is 50% for the write ratio 0.4 and 90% for the write ratio 0.0.

6 Concluding Remarks

This paper has discussed what messages have to be causally ordered in replicated objects with read and write operations from the application point of view. We have proposed the novel object vector for causally ordering messages based on the transaction concept. In the TBCO protocol, only the messages which have to be causally ordered are causally ordered. We have also discussed a way of omitting messages which are not significant for the applications. We have shown that the TBCO protocol implies a smaller number of operations computed than the protocols which order all the messages transmitted in the network.

References

[1] Bernstein, P. A., Hadzilacos, V. and Goodman, N., "Concurrency Control and Recovery in Database Systems," Addison-Wesley, 1987, pp.25-45.
[2] Birman, K., "Lightweight Causal and Atomic Group Multicast," ACM Trans. on Computer Systems, 1991, pp.272-290.
[3] Cheriton, D. R. and Skeen, D., "Understanding the Limitations of Causally and Totally Ordered Communication," Proc. of the ACM SIGOPS'93, 1993, pp.44-57.
[4] Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, Vol.21, No.7, 1978, pp.558-565.
[5] Leong, H. V. and Agrawal, D., "Using Message Semantics to Reduce Rollback in Optimistic Message Logging Recovery Schemes," Proc. of IEEE ICDCS-14, 1994, pp.227-234.
[6] Mattern, F., "Virtual Time and Global States of Distributed Systems," Parallel and Distributed Algorithms (Cosnard, M. and Quinton, P. eds.), North-Holland, 1989, pp.215-226.
[7] Melliar-Smith, P. M., Moser, L. E., and Agrawala, V., "Broadcast Protocols for Distributed Systems," IEEE Trans. on Parallel and Distributed Systems, Vol.1, No.1, 1990, pp.17-25.
[8] Nakamura, A. and Takizawa, M., "Reliable Broadcast Protocol for Selectively Ordering PDUs," Proc. of IEEE ICDCS-11, 1991, pp.239-246.
[9] Nakamura, A., Tachikawa, T., and Takizawa, M., "Causally Ordering Broadcast Protocol," Proc. of IEEE ICDCS-14, 1994, pp.48-55.
[10] Object Management Group Inc., "The Common Object Request Broker Architecture and Specification," Rev. 2.0, 1995.
[11] Raynal, M. and Ahamad, M., "Exploiting Write Semantics in Implementing Partially Replicated Causal Objects," IRISA Research Report, PI1080, 1997.
[12] Tachikawa, T. and Takizawa, M., "Distributed Protocol for Selective Intra-group Communication," Proc. of IEEE ICNP-95, 1995, pp.234-241.
[13] Tachikawa, T. and Takizawa, M., "Object-Based Message Ordering in Group Communication," Proc. of the IEEE 3rd Int'l Workshop on Object-oriented Real-time Dependable Systems (WORDS'97), 1997, pp.315-322.