Selective Total-Ordering Group Communication on Single High-Speed Channel
Takayuki Tachikawa and Makoto Takizawa
Dept. of Computers and Systems Engineering, Tokyo Denki University
Ishizaka, Hatoyama, Hiki-gun, Saitama 350-03, JAPAN
e-mail {tachi, taki}@takilab.k.dendai.ac.jp
Abstract
In group communication, multiple processes have to receive messages in some order. In this paper, we discuss a group communication protocol which supports selective totally ordered (ST) and atomic delivery of messages to the destinations in a group of processes interconnected by a high-speed channel, where the processes may fail to receive messages due to buffer overruns. That is, each process receives the messages destined to it in the sending order, and any two common destinations of messages receive them in the same order. Its execution is controlled in a distributed scheme, i.e. there is no master controller.
1 Introduction
In distributed applications like groupware [6], group communication among multiple processes is required in addition to the conventional one-to-one communication provided by OSI [8] and TCP/IP [5]. In group communication, messages sent by one process have to be delivered to either all the destinations or none in the group, i.e. atomic delivery. In addition, each process has to receive the messages sent by the processes in some order. Group communications have been studied in [3, 4], [7], [9], [11, 12], [13]-[16], and [17]-[21]. [17] presents a reliable broadcast protocol which uses one-to-one communication. An important problem in group communication is which process coordinates the cooperation of the multiple processes in the group. Most approaches [4, 7, 17] adopt the centralized control scheme, where one master process decides on the atomic and ordered delivery of messages. ISIS [3] uses a decentralized scheme where the sender of each message controls it. We adopt the distributed approach, where every process makes the decision by itself. [18]-[21] present distributed protocols.

Let us consider a group $D$ composed of clients $C_1$ and $C_2$ and database servers $S_1$, $S_2$, and $S_3$, where data object $x$ is stored in $S_1$ and $S_2$, and $y$ is stored in $S_2$ and $S_3$. Suppose that $C_1$ would write $x$ and $y$, and $C_2$ would write only $x$. $C_1$ sends a write operation $op_1$ to $S_1$, $S_2$, and $S_3$, and $C_2$ sends a write operation $op_2$ to $S_1$ and $S_2$. Thus, each process sends each message to any subset of the group at any time. [13] discusses a selective order-preserving (SP) protocol where each server receives operations from each client in the sending order, while $S_1$ may receive $op_1$ after $op_2$ and $S_2$ may receive $op_1$ before $op_2$. If $S_1$ and $S_2$ could receive $op_1$ and $op_2$ in the same order, it is easy to realize serializability [2]. This is a selective total-ordering (ST) service, where all the common destinations of messages receive the messages in the same order.

In this paper, we discuss a distributed protocol which provides the ST service for the group by using a high-speed network [1] where each process may fail to receive messages due to buffer overrun. In addition, every process can send messages asynchronously. In ISIS [3], each message $p$ is sent to a predefined group and all the processes in the group receive $p$. The processes in the intersection of multiple groups can receive messages in the ST order. In the ST protocol, each process can send messages to any subset of the group at any time. In section 2, we model broadcast communication services. In section 3, we present a data transmission procedure. In section 4, failure recovery mechanisms are discussed. Finally, we discuss the evaluation of the ST protocol in section 5.
2 Service Model
2.1 Service properties
A communication system is composed of application, system, and network layers [Figure 1]. A cluster $C$ [18, 19] is a set of $n$ ($\geq 2$) system service access points (SAPs), i.e. $\{S_1, \ldots, S_n\}$. Each application process $A_i$ takes the group communication service through $S_i$, which is supported by a system process $E_i$ ($i = 1, \ldots, n$). $E_1, \ldots, E_n$ cooperate with each other to support the group communication service for $C$ by using the network layer. $C$ is referred to as supported by $E_1, \ldots, E_n$ ($C = \langle E_1, \ldots, E_n \rangle$) and as supporting $A_1, \ldots, A_n$. The network layer supports the system layer with high-speed data transmission. Since the transmission speed is faster than the processing speed of $E_i$, $E_i$ may fail to receive messages sent in the network layer. In this paper, we assume that each $E_i$ loses the whole message if $E_i$ fails to receive it, i.e. unconditional loss.

Let $T_i$ denote a process at some layer. The underlying service used by $T_i$ is modeled as a set of logs. A log $L$ is a sequence of messages $\langle p_1 \cdots p_m]$, where $p_1$ and $p_m$ are the top ($top(L)$) and the last ($last(L)$), respectively. $p_h$ precedes $p_k$ ($p_h \rightarrow_L p_k$) in $L$ if $h < k$. Each $T_i$ has a sending log $SL_i$ and a receipt log $RL_i$, which are sequences of messages sent and received by $T_i$, respectively.

Figure 1: Layers (application processes $A_1, \ldots, A_n$ in the application layer use system processes $E_1, \ldots, E_n$ in the system layer through the system SAPs $S_1, \ldots, S_n$; the system processes use the high-speed network in the network layer through the network SAPs $N_1, \ldots, N_n$)

$RL_i$ is order-preserved iff $p \rightarrow_{RL_i} q$ if $p \rightarrow_{SL_j} q$ for some $E_j$. $RL_i$ is information-preserved iff $RL_i$ includes all the messages in $SL_1, \ldots, SL_n$. $RL_i$ is preserved iff $RL_i$ is order- and information-preserved. $RL_i$ and $RL_j$ are order-equivalent iff for every pair of $p$ and $q$ in both $RL_i$ and $RL_j$, $p \rightarrow_{RL_i} q$ iff $p \rightarrow_{RL_j} q$. $RL_i$ and $RL_j$ are information-equivalent iff they include the same messages. $RL_i$ and $RL_j$ are equivalent iff they are order- and information-equivalent.

In a high-speed network, system processes may fail to receive messages due to buffer overrun because the transmission speed is faster than the processing speed [1]. A one-channel (1C) service is a model of the high-speed network where every pair of receipt logs $RL_i$ and $RL_j$ is order-equivalent. Figure 2 shows an example of the 1C service for $C = \langle T_1, T_2, T_3 \rangle$, where all the processes receive messages in the same order while $RL_2$ and $RL_3$ are not information-preserved because $T_2$ and $T_3$ fail to receive $c$ and $q$, respectively.
$SL_1$: $\langle a\ b\ c\ d]$   $SL_2$: $\langle p\ q]$   $SL_3$: $\langle x\ y\ z]$
$RL_1$: $\langle a\ x\ b\ c\ p\ y\ d\ z\ q]$
$RL_2$: $\langle a\ x\ b\ p\ y\ d\ z\ q]$
$RL_3$: $\langle a\ x\ b\ c\ p\ y\ d\ z]$
Figure 2: 1C service

An order-preserving (OP) service is one where every $RL_i$ is preserved. A total ordering (TO) service is an OP service where every pair of receipt logs is order-equivalent. In the OP service, every $T_i$ receives messages from each $T_j$ in the sending order without message loss. In the TO service, every $T_i$ receives messages in the same order.

2.2 Selective broadcast service

In a cluster $C$ supporting application processes $A_1, \ldots, A_n$, each $A_i$ sends a message to the destinations (not necessarily all the processes) in $C$. $RL_i$ is selectively information-preserved iff $RL_i$ includes all the messages sent to $E_i$. $RL_i$ is selectively preserved iff $RL_i$ is order-preserved and selectively information-preserved. A selective broadcast (SB) service is one where every $RL_i$ is selectively information-preserved.

$SL_1$: $\langle a_{\{2,3\}}\ b_{\{2\}}\ c_{\{1,3\}}]$   $SL_2$: $\langle p_{\{1\}}\ q_{\{2,3\}}]$   $SL_3$: $\langle x_{\{1,2,3\}}\ y_{\{2\}}\ z_{\{1,3\}}]$
(a) SP service: $RL_1$: $\langle c\ x\ z\ p]$   $RL_2$: $\langle x\ a\ q\ b\ y]$   $RL_3$: $\langle a\ x\ c\ z\ q]$
(b) ST service: $RL_1$: $\langle x\ c\ p\ z]$   $RL_2$: $\langle a\ x\ b\ y\ q]$   $RL_3$: $\langle a\ x\ c\ z\ q]$
Figure 3: SP and ST services
[Definition] An SB service $S$ is selectively order-preserving (SP) iff every $RL_i$ is selectively preserved. $S$ is selectively total ordering (ST) iff every $RL_i$ is selectively preserved and the receipt logs are order-equivalent. □

Here, every $RL_i$ in the ST service is referred to as selectively totally ordered. Let $p.DST$ ($\subseteq C$) be the set of destination application processes $\{A_{d_1}, \ldots, A_{d_m}\}$ of $p$. Here, $p$ can be written as $p_{\{d_1, \ldots, d_m\}}$. $p^i$ denotes that $p$ is sent by $A_i$. In the SP service [13], if $p \rightarrow_{SL_j} q$, then $p \rightarrow_{RL_i} q$ for every $A_i \in p.DST \cap q.DST$. For $p$ and $q$ sent by different processes, i.e. $p$ is sent by $A_i$ and $q$ is sent by $A_j$ ($i \neq j$), every pair of $A_h$ and $A_k$ in $p.DST \cap q.DST$ receive both $p$ and $q$ in the same order in the ST service. For example, $a_{\{2,3\}}$ and $x_{\{1,2,3\}}$ are received by the common destinations $A_2$ and $A_3$ in the same order in Figure 3(b). A logical precedence relation ($\Rightarrow$) among $p$ and $q$ is defined as follows.

[Definition] $p$ logically precedes $q$ ($p \Rightarrow q$) iff (1) $p \rightarrow_{RL_i} q$ and $p.DST \cap q.DST \neq \emptyset$, or $p \rightarrow_{SL_i} q$ for some $E_i$, (2) $p = q$, or (3) for some message $r$, $p \Rightarrow r \Rightarrow q$. □

$\Rightarrow$ is reflexive and transitive. Since $x \Rightarrow c$ from $x \rightarrow_{RL_1} c$ and $c \Rightarrow z$ from $c \rightarrow_{RL_3} z$ in Figure 3(b), $x \Rightarrow z$. If $\Rightarrow$ is anti-symmetric, $\Rightarrow$ is consistent. Otherwise, $\Rightarrow$ is inconsistent. $\Rightarrow$ is complete iff either $p \Rightarrow q$ or $q \Rightarrow p$ if $p.DST \cap q.DST \neq \emptyset$ for every $p$ and $q$. In the ST service, $\Rightarrow$ has to be consistent and complete.
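The order-equivalence part of the ST definition can be checked mechanically on the logs of Figure 3. The following Python sketch is only an illustration: the function names are hypothetical, each message is represented by a single character, and selective preservation (no loss, sending order) is not checked.

```python
from itertools import combinations

def order_equivalent(log_a, log_b):
    """Return True if the messages common to both logs appear in the same order."""
    common = set(log_a) & set(log_b)
    return [m for m in log_a if m in common] == [m for m in log_b if m in common]

def all_order_equivalent(logs):
    """Return True if every pair of receipt logs is order-equivalent."""
    return all(order_equivalent(a, b) for a, b in combinations(logs, 2))

# Receipt logs RL1, RL2, RL3 of Figure 3, one character per message.
sp_logs = [list("cxzp"), list("xaqby"), list("axczq")]   # Figure 3(a), SP service
st_logs = [list("xcpz"), list("axbyq"), list("axczq")]   # Figure 3(b), ST service

print(all_order_equivalent(sp_logs))  # False: E1 sees c before x, E3 sees x before c
print(all_order_equivalent(st_logs))  # True
```

Restricting both logs to their common messages and comparing the results is equivalent to the pairwise definition as long as no message occurs twice in a log.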
2.3 Distributed atomic receipt
In this paper, we adopt the distributed control scheme to realize the atomic receipt of each message $p$ among all the destinations in the system cluster $C = \langle E_1, \ldots, E_n \rangle$. There are three criteria levels [18, 19] by which each $E_i$ decides whether to atomically receive $p$ from $E_k$ in $C$: (1) Acceptance: $E_i$ receives $p$. (2) Pre-acknowledgment: $E_i$ knows that every destination of $p$ has accepted $p$. (3) Acknowledgment: $E_i$ knows that $p$ has been pre-acknowledged by every destination of $p$. Even if $E_i$ pre-acknowledges $p$, another $E_j$ may think that some $E_k$ has not accepted $p$ because $E_j$ fails to receive the acceptance confirmation of $p$ from $E_k$. Hence, the third level is required.
3 ST Protocol
We discuss a data transmission procedure of the ST protocol for a cluster $C = \langle E_1, \ldots, E_n \rangle$ by using the 1C service.
3.1 Variables
There are two kinds of messages, i.e. DT (data) and RR (receive ready) ones. DT is used to send data in $C$. If $E_i$ has no data, $E_i$ transmits RR to all the processes every predefined period. Each DT message $p$ has the following fields ($j = 1, \ldots, n$):
  $p.SRC$ = source process $E_i$ which transmits $p$.
  $p.DST$ = set of destination processes of $p$.
  $p.TSEQ$ = total sequence number of $p$.
  $p.PSEQ_j$ = partial sequence number for $E_j$.
  $p.ACK_j$ = $TSEQ$ of a message which $E_i$ expects to receive next from $E_j$.
  $p.BUF$ = number of buffers available in $E_i$.
  $p.DATA$ = data.
$E_i \in p.DST$ means that $E_i$ is a destination of $p$. For every pair of $p$ and $q$ from $E_i$, $p.TSEQ < q.TSEQ$ if $p \rightarrow_{SL_i} q$. If $E_j \in p.DST \cap q.DST$ and $p \rightarrow_{SL_i} q$, $p.PSEQ_j < q.PSEQ_j$. If $p \rightarrow_{SL_i} q$, $E_j \notin p.DST$, and there is no message $r$ such that $p \rightarrow_{SL_i} r \rightarrow_{SL_i} q$ and $E_j \in r.DST$, then $p.PSEQ_j = q.PSEQ_j$. $p.ACK_j$ informs every process in $C$ that $E_i$ has accepted every message $q$ from $E_j$ where $q.TSEQ < p.ACK_j$. RR has the same fields as DT except that RR does not have the $DST$ and $DATA$ fields, because RR is sent to all the processes in $C$ without data.

Each $E_i$ has the following variables ($j = 1, \ldots, n$):
  $TSEQ$ = total sequence number of a message which $E_i$ expects to broadcast next.
  $PSEQ_j$ = partial sequence number of a message which $E_i$ expects to send to $E_j$ next.
  $TREQ_j$ = $TSEQ$ of a message which $E_i$ expects to receive next from $E_j$.
  $PREQ_j$ = $PSEQ_i$ of a message which $E_i$ expects to receive next from $E_j$.
  $AL_{hj}$ = $TSEQ$ of a message which $E_i$ knows that $E_j$ expects to receive next from $E_h$ ($h = 1, \ldots, n$).
  $PAL_{hj}$ = $TSEQ$ of a message which $E_i$ knows that $E_j$ expects to pre-acknowledge next from $E_h$ ($h = 1, \ldots, n$).
  $BUF_j$ = number of buffers in $E_j$ which $E_i$ knows.
$E_i$ obtains the initial $TSEQ$, $PSEQ_j$, and $BUF_j$ for every $E_j$ in the cluster establishment [18, 19]. Let $minAL_j$ and $minBUF$ denote the minimum ones in $AL_{j1}, \ldots, AL_{jn}$ and in $BUF_1, \ldots, BUF_n$, respectively.
$E_i$ has a sending log $SL_i$ and three receipt logs $RRL_i$, $PRL_i$, and $ARL_i$ to store messages accepted, pre-acknowledged, and acknowledged, respectively.
3.2 Transmission and acceptance
$E_i$ transmits a message $p$ by the following action. Here, $W$ gives the window size and $H$ is a constant bounded in terms of $W$.

[Transmission action]
if $minAL_i \leq TSEQ < minAL_i + \min(W, minBUF / (H \times n))$ {
    $p.TSEQ := TSEQ$; $TSEQ := TSEQ + 1$;
    for ($j = 1, \ldots, n$) {
        $p.PSEQ_j := PSEQ_j$;
        if ($E_j$ is a destination of $p$) {
            $PSEQ_j := PSEQ_j + 1$; $p.DST := p.DST \cup \{E_j\}$;
        }
    }
    $p.ACK_h := TREQ_h$ ($h = 1, \ldots, n$);
    $p$ is broadcast at the network SAP $N_i$.
} □

On receipt of $p$ from $E_j$, $E_i$ performs the following ACC action.

[Acceptance (ACC) action]
if (1) $p.TSEQ = TREQ_j$ or (2) $p.PSEQ_i = PREQ_j$ and $E_i \in p.DST$ {
    $TREQ_j := p.TSEQ + 1$; $AL_{hj} := p.ACK_h$ ($h = 1, \ldots, n$);
    if ($E_i \in p.DST$) { $PREQ_j := p.PSEQ_i + 1$; $p$ is enqueued into $RRL_i$; }
    else $p$ is discarded;
} □

Even if $E_i$ fails to receive $g$ from $E_j$, $E_i$ does not need to receive $g$ unless $E_i \in g.DST$. For example, suppose that $E_j$ broadcasts $a$, $b$, and $c$, and $E_i$ accepts $a$ as shown in Figure 4. Here, $TREQ_j = 4$ and $PREQ_j = 3$ in $E_i$. On receipt of $c$, $E_i$ detects a loss of $b$ because $TREQ_j$ (= 4) < $c.TSEQ$ (= 5). Since $c.PSEQ_i = PREQ_j$ (= 3) and $E_i \in c.DST$, $E_i$ knows that $E_i \notin b.DST$ [Figure 4 (a)]. If $E_i \notin c.DST$ and $c.PSEQ_i = 4$, there must be some message $b$ destined to $E_i$, where $b.PSEQ_i = 3$ [Figure 4 (b)].
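The sequence numbering of the transmission action and the acceptance test of the ACC action can be traced on the Figure 4 scenario. The following Python sketch is a simplification under stated assumptions: the class and function names are hypothetical, the receiver tracks a single sender $E_j$, and the window condition and the ACK/AL bookkeeping are omitted.

```python
from dataclasses import dataclass

@dataclass
class Message:
    src: int          # index of the sending process E_j
    dst: frozenset    # indices of the destination processes
    tseq: int         # total sequence number
    pseq: dict        # pseq[k]: partial sequence number for E_k

class Sender:
    def __init__(self, ident, n, tseq=1, pseq=None):
        self.ident = ident
        self.tseq = tseq
        self.pseq = dict(pseq) if pseq else {k: 1 for k in range(1, n + 1)}

    def transmit(self, dst):
        # Stamp the current counters, then advance TSEQ and the PSEQ of each destination.
        p = Message(self.ident, frozenset(dst), self.tseq, dict(self.pseq))
        self.tseq += 1
        for k in p.dst:
            self.pseq[k] += 1
        return p

class Receiver:
    def __init__(self, ident, treq=1, preq=1):
        self.ident, self.treq, self.preq = ident, treq, preq
        self.rrl = []  # receipt log of accepted messages

    def accept(self, p):
        destined = self.ident in p.dst
        if p.tseq == self.treq or (destined and p.pseq[self.ident] == self.preq):
            self.treq = p.tseq + 1
            if destined:
                self.preq = p.pseq[self.ident] + 1
                self.rrl.append(p)
            return True
        return False  # a message destined to this process was lost

def figure4(b_dst, c_dst):
    """E_j = E_2 sends a, b, c; E_i = E_1 accepts a, loses b, then receives c."""
    e_j = Sender(ident=2, n=3, tseq=3, pseq={1: 2, 2: 1, 3: 1})
    e_i = Receiver(ident=1, treq=3, preq=2)
    a = e_j.transmit({1, 3})
    e_j.transmit(b_dst)            # b is never delivered to E_i
    c = e_j.transmit(c_dst)
    e_i.accept(a)
    return e_i.accept(c), c.tseq, c.pseq[1]

print(figure4({3}, {1, 3}))   # case (a): (True, 5, 3)
print(figure4({1, 3}, {3}))   # case (b): (False, 5, 4)
```

In case (a) the partial sequence number shows that the lost $b$ was not destined to $E_i$, so $c$ is accepted; in case (b) the gap in the partial sequence number reveals the loss.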
3.3 Pre-acknowledgment procedure

Let $minAL_j(p)$ be the minimum number in $\{ AL_{jh} \mid E_h \in p.DST \}$. $E_i$ knows that every destination of $p$ has accepted $p$ if $p.TSEQ < minAL_j(p)$. Hence, $p$ is pre-acknowledged in $E_i$ by the PACK action.

[Pre-acknowledgment (PACK) action]
while (for $p$ (= $top(RRL_i)$), $p.TSEQ < minAL_j(p)$ where $p.SRC = E_j$) {
    $p$ is dequeued from $RRL_i$; $p$ is enqueued into $PRL_i$;
    $PAL_{hj} := p.ACK_h$ ($h = 1, \ldots, n$);
} □

3.4 Acknowledgment procedure

Here, let $minPAL_j(p)$ be the minimum number in $\{ PAL_{jh} \mid E_h \in p.DST \}$. $PAL_{jh}$ means that $E_i$ knows that $E_h$ has pre-acknowledged the messages from $E_j$ whose $TSEQ < PAL_{jh}$. Hence, $E_i$ knows that every destination of $p$ has pre-acknowledged $p$, i.e. $p$ is acknowledged in $E_i$, if $p.TSEQ < minPAL_j(p)$.

[Acknowledgment (ACK) action]
while (for $p$ (= $top(PRL_i)$), $p.TSEQ < minPAL_j(p)$ where $p.SRC = E_j$) {
    $p$ is dequeued from $PRL_i$; $p$ is enqueued into $ARL_i$;
} □
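The PACK and ACK actions share one pattern: the head of a log is moved to the next log while its $TSEQ$ is below the minimum confirmation number over its destinations. A minimal Python sketch of that pattern follows; the dictionary layout `table[(source, destination)]` and the function name are assumptions made for illustration, and the update of $PAL$ from $p.ACK$ is left out.

```python
from collections import deque

def promote(source_log, target_log, table):
    """Move head messages from source_log to target_log while every destination
    is known to expect a TSEQ larger than the message's TSEQ.

    table[(j, h)] plays the role of AL_jh (for PACK) or PAL_jh (for ACK)."""
    moved = []
    while source_log:
        p = source_log[0]
        if not p["tseq"] < min(table[(p["src"], h)] for h in p["dst"]):
            break
        target_log.append(source_log.popleft())
        moved.append(p["name"])
    return moved

# Two messages from E_1 held in RRL_i (hypothetical values).
rrl = deque([
    {"name": "a", "src": 1, "tseq": 1, "dst": {2, 3}},
    {"name": "b", "src": 1, "tseq": 2, "dst": {2}},
])
prl, arl = deque(), deque()

al = {(1, 2): 3, (1, 3): 1}          # E_3 has not yet confirmed a
print(promote(rrl, prl, al))         # []
al[(1, 3)] = 2                       # E_3 now expects TSEQ 2 from E_1
print(promote(rrl, prl, al))         # ['a', 'b']

pal = {(1, 2): 2, (1, 3): 2}
print(promote(prl, arl, pal))        # ['a']; b waits for E_2's pre-acknowledgment
```

Because only the head of the log is examined, a message that is not yet confirmed blocks the messages behind it, which keeps the order of the receipt log intact.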
Figure 4: Acceptance condition. In both cases $E_j$ sends $a$, $b$, and $c$ in this order; $E_i$ accepts $a$ ($TREQ_j$ changes from 3 to 4 and $PREQ_j$ from 2 to 3) and fails to receive $b$.
(a) $E_i \notin b.DST$: $a_{\{i,\ldots\}}$ $\langle TSEQ = 3, PSEQ_i = 2 \rangle$, $b_{\{\ldots\}}$ $\langle TSEQ = 4, PSEQ_i = 3 \rangle$, $c_{\{i,\ldots\}}$ $\langle TSEQ = 5, PSEQ_i = 3 \rangle$.
(b) $E_i \in b.DST$: $a_{\{i,\ldots\}}$ $\langle TSEQ = 3, PSEQ_i = 2 \rangle$, $b_{\{i,\ldots\}}$ $\langle TSEQ = 4, PSEQ_i = 3 \rangle$, $c_{\{\ldots\}}$ $\langle TSEQ = 5, PSEQ_i = 4 \rangle$.

If $p$ is timed out in $PRL_i$, $p$ is moved to $ARL_i$, i.e. acknowledged. The application process $A_i$ receives the messages by dequeuing $ARL_i$.

4 Failure Recovery
4.1 Detection of message loss

We assume that processes never malfunction and that the cluster $C$ is aborted if any process stops by failure. In the 1C service, each $E_i$ may fail to receive messages due to buffer overrun. $E_i$ has to receive only the messages destined to $E_i$. As presented in the acceptance procedure, on receipt of $p$ from $E_j$, if $PREQ_j < p.PSEQ_i$, $E_i$ finds that it has lost $g$ from $E_j$ such that $PREQ_j \leq g.PSEQ_i < p.PSEQ_i$. On receipt of $q$ from $E_h$, for some $j$ ($\neq h$), if $TREQ_j < q.ACK_j$, $E_i$ has not received $g$ from $E_j$ such that $TREQ_j \leq g.TSEQ < q.ACK_j$. However, $g$ may not be destined to $E_i$. $E_i$ waits for messages from $E_j$. If $g$ is not pre-acknowledged in a predefined time, $E_i$ finds that some process fails to receive $g$. If $E_i$ is the sender of $g$, $E_i$ rebroadcasts $g$.

4.2 Recovery by insertion

Suppose that $E_i$ finds the loss of $g$ (from $E_j$). There are go-back-n and selective retransmission ways to recover from the message loss. In the go-back-n scheme, all the messages following $g$ are removed from all the receipt logs and are retransmitted. We adopt selective retransmission, since only $g$ is retransmitted. On receipt of $g$, $E_i$ has to keep $RL_i$ selectively totally ordered after putting $g$ in $RL_i$, i.e. $\Rightarrow$ has to remain consistent. One way is that $E_i$ finds where to insert $g$ in $RL_i$ and then puts $g$ there.

[Definition] Suppose that $E_i$ fails to receive $g$. A failure section $f_i(g)$ is an ordered pair $\langle p, q \rangle$ where (1) $p \rightarrow_{RL_i} q$, (2) $p \Rightarrow g$ and there is no message $r$ in $RL_i$ such that $p \Rightarrow r \Rightarrow g$, and (3) $g \Rightarrow q$ and there is no $r$ in $RL_i$ such that $g \Rightarrow r \Rightarrow q$. □

[Proposition 1] Suppose that $E_i$ fails to receive $g$ and $g$ is retransmitted. If $g$ is inserted only in $f_i(g) = \langle p, q \rangle$, $RL_i$ is selectively totally ordered. □
[Proof] From the definition, $p \Rightarrow g \Rightarrow q$. Hence, if $g$ is inserted between $p$ and $q$, $\Rightarrow$ is consistent. If $g$ is inserted outside $\langle p, q \rangle$, say before $p$, $\Rightarrow$ is inconsistent because $g \Rightarrow p$ and $p \Rightarrow g$. □

Figure 5 shows that $E_3$ fails to receive $c$ in Figure 3(b). Here, since $x \Rightarrow z \Rightarrow q$ in $RL_3$ and $x \Rightarrow c \Rightarrow p \Rightarrow z$ in $RL_1$, $x \Rightarrow c \Rightarrow z$. $f_3(c)$ is $\langle x, z \rangle$. This means that $c$ can be inserted between $x$ and $z$ in $RL_3$. If $E_3$ puts $c$ outside $f_3(c)$, e.g. $c \rightarrow_{RL_3} x$, $\Rightarrow$ is inconsistent since $x \rightarrow_{RL_1} c$, i.e. $RL_3$ is not selectively totally ordered.

$RL_1$: $\langle x\ c\ p\ z]$   $RL_2$: $\langle a\ x\ b\ y\ q]$   $RL_3$: $\langle a\ \langle x\ z \rangle\ q]$
Figure 5: Failure section of $c$

Here, let $a$ and $b$ be messages destined to $E_i$. $a \Rightarrow b$ is self-decidable in $E_i$ if (1) $a.SRC = b.SRC$ and $a.SEQ < b.SEQ$, or (2) there is some $c$ in $RL_i$ such that $a \Rightarrow c \Rightarrow b$. This means that $E_i$ can decide on the receipt order of $a$ and $b$ by using the sequence numbers if $E_i$ could receive $a$ and $b$. $a \Rightarrow b$ is decidable if (1) it is self-decidable in some process or (2) there is some $c$ such that $a \Rightarrow c$ and $c \Rightarrow b$ are decidable. If $E_i$ fails to receive $g$ from $E_j$, $E_j$ retransmits $g$ and $E_i$ receives $g$. As stated before, the problem is where to put $g$ in $RL_i$. If $E_i$ can decide where to put $g$ in $RL_i$ without communicating with another process, $E_i$ is independently recoverable from the loss of $g$. $E_i$ can independently recover from the loss of $g$ if either $p \Rightarrow g$ or $g \Rightarrow p$ is self-decidable for every $p$ in $RL_i$. Otherwise, $E_i$ is dependently recoverable.

[Example 1] Let $a_{\{1,2\}}$, $b_{\{2,3\}}$, and $c_{\{1,3\}}$ be messages. (1) Suppose that $E_1$, $E_2$, and $E_3$ fail to receive $a$, $b$, and $c$, respectively. If $a$, $b$, and $c$ are sent by the same process in this sequence, each $E_i$ can independently recover from the message loss by using the sequence number. For example, on receipt of $c$, $b \Rightarrow c$ is decidable in $E_3$ since $b.SEQ < c.SEQ$. (2) Suppose that $a$, $b$, and $c$ are sent by different processes. Suppose $E_1$ receives $a$ and $b$ in $a \rightarrow_{RL_1} b$, $E_2$ receives $b$ and $c$ in $b \rightarrow_{RL_2} c$, and $E_3$ fails to receive $c$ and $c$ is retransmitted. $a \Rightarrow c$ is decidable in $E_3$ because $E_3$ could obtain $a \Rightarrow b$ from $E_1$ and $b \Rightarrow c$ from $E_2$ by communicating with $E_1$ and $E_2$. Suppose that $E_1$, $E_2$, and $E_3$ fail to receive $a$, $b$, and $c$, respectively. Here, $a$, $b$, and $c$ are retransmitted, and are received by $E_1$, $E_2$, and $E_3$, respectively. Suppose that $E_1$ puts $a$ after $c$, i.e. $c \Rightarrow a$, $E_2$ puts $b$ after $a$, i.e. $a \Rightarrow b$, and $E_3$ puts $c$ after $b$, i.e. $b \Rightarrow c$. As a result, $\Rightarrow$ is inconsistent because $c \Rightarrow a \Rightarrow b \Rightarrow c$. Here, $c \Rightarrow a$, $a \Rightarrow b$, and $b \Rightarrow c$ are not decidable. Some additional synchronization mechanism among $E_1$, $E_2$, and $E_3$ is required to reach an agreement on a consistent $\Rightarrow$ among $a$, $b$, and $c$. □

4.3 Recovery by sorting
Another way of recovering from message loss is to sort $RL_i$ after putting $g$ in $RL_i$. In Figure 5, $c$ is retransmitted and $E_3$ puts $c$ at the end of $RL_3$ on receipt of $c$. $\Rightarrow$ is inconsistent since $z \rightarrow_{RL_3} c$ and $c \rightarrow_{RL_1} z$. In order to resolve the inconsistency, the messages can be sorted, e.g. by the process number and $PSEQ$. Here, $E_1$ and $E_3$ obtain the same $c \Rightarrow z$.

Suppose that some process fails to receive $g$. Let $T(g)$ be the set $\{ r \mid g \Rightarrow r$, or $q \Rightarrow r$ if $f_i(g) = \langle p, q \rangle$ for every $E_i \in q.DST \}$, which gives the messages that have to be sorted if $\Rightarrow$ with $g$ is changed. For example, $T(c) = \{c, p, q, z\}$ since $c \Rightarrow p \Rightarrow z$ and $z \Rightarrow q$ in Figure 5. Here, a sort point $sort_i(g)$ of $E_i$ for $g$ means the message in $RL_i$ from which to the last of $RL_i$ the messages are sorted. $sort_i(g)$ is defined to be $p$ in $RL_i$ such that $p$ is in $T(g)$ and there is no $r$ in $RL_i$ such that $r \Rightarrow p$ and $r$ is in $T(g)$. For example, $sort_1(c) = c$, $sort_2(c) = q$, and $sort_3(c) = z$ in Figure 5. A tuple $\langle sort_1(g), \ldots, sort_n(g) \rangle$ is a sort line $sline(g)$ for $g$. Figure 6 shows that $sline(c)$ is $\langle c, q, z \rangle$ for $c$, where $\|$ is placed just before the sort point.

Let $l_1$ be $sline(g_1) = \langle s_{11}(g_1), \ldots, s_{1n}(g_1) \rangle$ and $l_2$ be $sline(g_2) = \langle s_{21}(g_2), \ldots, s_{2n}(g_2) \rangle$. A join of $l_1$ and $l_2$, $l_1 \wedge l_2$, is $\langle s_1, \ldots, s_n \rangle$ where each $s_i$ is $s_{1i}$ if $s_{1i} \rightarrow_{RL_i} s_{2i}$, and $s_{2i}$ otherwise. If there are multiple messages $g_1, \ldots, g_n$ lost by some processes, the sort line $sline(\{g_1, \ldots, g_n\})$ is $sline(g_1) \wedge \cdots \wedge sline(g_n)$. From the definition of the sort line, the following proposition is straightforward.
[Proposition 2] A sublog including the messages preceding the sort point in $RL_i$ is selectively totally ordered. □

The sublogs $\langle x]$, $\langle a\ x\ b\ y]$, and $\langle a\ x]$ of $RL_1$, $RL_2$, and $RL_3$ in Figure 6 are selectively totally ordered.
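The join of two sort lines picks, for each process, whichever sort point comes earlier in that process's receipt log. The sketch below uses the logs and $sline(c) = \langle c, q, z \rangle$ from Figure 6; the second sort line $\langle p, y, z \rangle$ is a made-up value used only to exercise the function, and the function name is hypothetical.

```python
def join(line1, line2, logs):
    """Join two sort lines: per process, keep the sort point that is earlier in RL_i."""
    joined = []
    for s1, s2, log in zip(line1, line2, logs):
        joined.append(s1 if log.index(s1) < log.index(s2) else s2)
    return joined

# Receipt logs RL1, RL2, RL3 of Figure 6 (E3 has lost c).
logs = [list("xcpz"), list("axbyq"), list("axzq")]

sline_c = ["c", "q", "z"]        # sort line of c from Figure 6
sline_other = ["p", "y", "z"]    # hypothetical second sort line, for illustration only

print(join(sline_c, sline_other, logs))   # ['c', 'y', 'z']
```

Taking the earlier point per process means the joined sort line never excludes a message that either loss requires to be sorted.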
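$E_1$  $RL_1$: $\langle x_{\{1,2,3\}}\ \|\ c_{\{1,3\}}\ p_{\{1\}}\ z_{\{1,3\}}]$
$E_2$  $RL_2$: $\langle a_{\{2,3\}}\ x_{\{1,2,3\}}\ b_{\{2\}}\ y_{\{2\}}\ \|\ q_{\{2,3\}}]$
$E_3$  $RL_3$: $\langle a_{\{2,3\}}\ x_{\{1,2,3\}}\ \|\ z_{\{1,3\}}\ q_{\{2,3\}}]$
Figure 6: Example of sort line

4.4 Recovery procedure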
Since the high-speed network is more reliable, in most cases a single message $g$ is lost by a single process $E_i$. In this case, $g$ is retransmitted, and only $E_i$ inserts $g$ just before the sort point. This is referred to as a single-loss. Thus, $E_i$ can independently recover from the loss of $g$. On the other hand, if multiple messages are lost, i.e. multi-loss, some additional synchronization protocol is required to make the logical precedence relation $\Rightarrow$ consistent after the lost messages are inserted in the logs by the processes losing them, i.e. dependent recovery is required as presented in Example 1. There are two ways of dependent recovery, synchronization and sorting. Synchronization requires more communication than the sorting method. Hence, the following strategy is used in this paper: (1) first, all the processes agree on the sort line, and (2) then the insertion method is used for a single-loss, otherwise the sort method is used. First, we discuss how all the processes can agree on the sort line. Suppose that $E_i$ fails to receive $p$ from $E_j$.
[Agreement procedure]
(1) $E_i$ finds that $E_i$ fails to receive $p$ (from $E_j$) such that $p.TSEQ > TREQ_j$ and $p.PSEQ_i > PREQ_j$. $E_i$ broadcasts an RST (reset) message $r_i$ where $r_i.TREQ_h := TREQ_h$ ($h = 1, \ldots, n$).
(2) On receipt of $r_i$, each $E_k$ stops the data transmission. $E_k$ finds a set $P_h = \{ q \mid q \in RRL_k$, $q.SRC = E_h$, and $q.TSEQ \geq r_i.TREQ_h \}$ for each $E_h$. Let $P$ be $P_1 \cup \cdots \cup P_n$. $E_k$ finds $f$ in $P$ such that $f \rightarrow_{RRL_k} q$ for every $q$ in $P$. If $P$ is $\emptyset$, let $f$ be $\bot$. For each $E_h$, let $TREQ_h$ be the minimum value in $T_h = \{ t.TSEQ \mid f \rightarrow_{RRL_k} t$ and $t.SRC = E_h \}$ if $f \neq \bot$. $E_k$ broadcasts an RST PAK $rp_k$ where $rp_k.TREQ_h := TREQ_h$ ($h = 1, \ldots, n$).
(3) On receipt of $rp_j$, if $rp_i$ arrives before $rp_j$, $E_k$ stores the arrival order of the RST PAKs, i.e. $E_i < E_j$ in $ORDR$. $TREQ_h := rp_j.TREQ_h$ if $TREQ_h > rp_j.TREQ_h$ ($h = 1, \ldots, n$).
(4) On receipt of $rp_1, \ldots, rp_n$, $E_k$ finds a set $Q_h = \{ q \mid q \in RRL_k$, $q.SRC = E_h$, and $q.TSEQ \geq TREQ_h \}$ for each $E_h$. Let $Q$ be $Q_1 \cup \cdots \cup Q_n$. $E_k$ finds $f$ in $Q$ such that $f \rightarrow_{RRL_k} q$ for every $q$ in $Q$. If $Q$ is $\emptyset$, let $f$ be $\bot$. $E_k$ broadcasts an RST ACK $ra_k$ where $ra_k.TREQ_h := TREQ_h$, $ra_k.PREQ_h := PREQ_h$, $ra_k.PSEQ_h := PSEQ_h$, and $ra_k.ORDR := ORDR$ ($h = 1, \ldots, n$).
(5) On receipt of $ra_i$, $PR_{ij} := ra_i.PREQ_j$ and $PS_{ij} := ra_i.PSEQ_j$ ($j = 1, \ldots, n$). If $ra_i.TREQ_h \neq TREQ_h$ for some $h$ or $ra_i.ORDR \neq ORDR$, $E_k$ broadcasts an ABORT message to abort $C$.
(6) On receipt of $ra_1, \ldots, ra_n$, each $E_k$ has the same $TREQ_1, \ldots, TREQ_n$. Then, $f$ denotes a sort point of $E_k$. □
[Example 2] In Figure 6, $E_3$ fails to receive $c$. Suppose that the $TSEQ$s of $a$, $b$, and $c$ are 1, 2, and 3, the $TSEQ$s of $p$ and $q$ are 1 and 2, and the $TSEQ$s of $x$, $y$, and $z$ are 1, 2, and 3, respectively. Here, $E_1$, $E_2$, and $E_3$ have the following $TREQ = \langle TREQ_1, TREQ_2, TREQ_3 \rangle$.
  $E_1$: $TREQ = \langle 4, 3, 4 \rangle$.
  $E_2$: $TREQ = \langle 4, 3, 4 \rangle$.
  $E_3$: $TREQ = \langle 3, 3, 4 \rangle$.
First, $E_3$ broadcasts RST $r_3$ where $TREQ = \langle 3, 3, 4 \rangle$ when $E_3$ finds the loss of $c$. On receipt of $r_3$, $E_1$ obtains $P_1 = \{c\}$ since $c.TSEQ = r_3.TREQ_1 = 3$, and $P_2 = P_3 = \emptyset$, i.e. $P = \{c\}$. $E_2$ and $E_3$ obtain $P = \emptyset$. Here, each $E_i$ has the following $f_i$ and $RRL_i$, where $\|$ denotes $f_i$.
  $RRL_1$: $\langle x\ \|\ c\ p\ z]$, $f_1 = c$.
  $RRL_2$: $\langle a\ x\ b\ y\ q\ \|]$, $f_2 = \bot$.
  $RRL_3$: $\langle a\ x\ z\ q\ \|]$, $f_3 = \bot$.
$E_1$ broadcasts RST PAK $rp_1$ whose $TREQ = \langle 3, 1, 3 \rangle$ because $c \rightarrow_{RRL_1} p \rightarrow_{RRL_1} z$, $c.TSEQ = 3$, $p.TSEQ = 1$, and $z.TSEQ = 3$. $E_2$ broadcasts $rp_2$ whose $TREQ = \langle 4, 3, 4 \rangle$. $E_3$ broadcasts $rp_3$ whose $TREQ = \langle 3, 3, 4 \rangle$. On receipt of $rp_1$, $rp_2$, and $rp_3$, each $E_i$ obtains $TREQ = \langle 3, 1, 3 \rangle$. In $E_2$, $f_2$ is $q$ since $q.TSEQ \geq 1$. In $E_3$, $f_3$ is $z$ since $z.TSEQ \geq 3$, $q.TSEQ \geq 1$, and $z \rightarrow_{RRL_3} q$. In $E_1$, $f_1 = c$. Here, the sort line $sline(c)$ is $\langle f_1, f_2, f_3 \rangle = \langle c, q, z \rangle$ as shown in Figure 6. Each $E_i$ broadcasts RST ACK $ra_i$ whose $TREQ = \langle 3, 1, 3 \rangle$. On receipt of $ra_1$, $ra_2$, and $ra_3$, the agreement procedure terminates. Here, every $E_i$ has the same $TREQ$. Hence, all the messages preceding the sort point $f_i$ are acknowledged in $E_i$. That is, $x$ in $RRL_1$, $a$, $x$, $b$, and $y$ in $RRL_2$, and $a$ and $x$ in $RRL_3$ are acknowledged. Finally, the following $RRL_i$ are obtained.
  $RRL_1$: $\langle c\ p\ z]$   $RRL_2$: $\langle q]$   $RRL_3$: $\langle z\ q]$. □

[Proposition 3] By the agreement procedure, every process obtains the sort point for the messages lost. □
[Proof] Suppose that $E_i$ fails to receive $p$. $E_i$ broadcasts RST $r_i$ and every $E_k$ receives $r_i$. In step (2) of the agreement procedure, $E_k$ finds $q_h$ preceding every message from $E_h$ in $RRL_k$ whose $TSEQ \geq r_i.TREQ_h$ ($h = 1, \ldots, n$). Let $f$ denote some $q_h$ such that $q_h \rightarrow_{RRL_k} q_j$ for every $E_j$. This means that $p \Rightarrow f \Rightarrow q_j$ for every $E_j$, and $p$ can be inserted in $RRL_k$ such that $p \rightarrow_{RRL_k} f$. However, there may be some $q$ in $RRL_k$ such that $p \Rightarrow q \rightarrow_{RRL_k} p$ because multiple messages may be lost. Then, $E_k$ broadcasts $rp_k$ where $rp_k.TREQ_h = TREQ_h = q_h.TSEQ$ ($h = 1, \ldots, n$). In step (3), $E_k$ receives $rp_1, \ldots, rp_n$. In (4), $TREQ_h$ denotes the minimum one in $\{ rp_1.TREQ_h, \ldots, rp_n.TREQ_h \}$ ($h = 1, \ldots, n$). Since every $E_k$ sends $ra_k$ and receives $ra_1, \ldots, ra_n$, $E_k$ has the same $TREQ$. $E_k$ finds $q_h$ preceding every message from each $E_h$ in $RRL_k$ such that $q_h.TSEQ \geq TREQ_h$ ($h = 1, \ldots, n$). Then, let $f$ denote some $q_h$ in $RRL_k$ such that $q_h \rightarrow_{RRL_k} q_j$ for every $E_j$. This means that $p \Rightarrow f$ and there is no message $t$ in $RRL_k$ such that $p \Rightarrow t \Rightarrow f$. That is, $f$ denotes the sort point of $E_k$ for all the messages lost. □

[Retransmission (RET) procedure]
(1) $E_j$ retransmits $p$ if $E_j$ has sent $p$ and $p.PSEQ_i \geq PR_{ij}$ for some $E_i \in p.DST$.
(2) On receipt of $p$ where $E_i \in p.DST$, if $PREQ_j \leq p.PSEQ_i$, $E_i$ puts $p$ just before the sort point $spoint_i(p)$ in $RRL_i$ if $FAIL = 1$, otherwise $E_i$ puts $p$ at the tail of $RRL_i$. □

Which recovery method is used, insertion or sorting, depends on whether the loss is a single-loss or not. Let $FAIL$ be the cardinality of $\{ PR_{ij} \mid PR_{ij} < PS_{ij}$ ($i, j = 1, \ldots, n$)$\}$. A single-loss occurs if $FAIL = 1$, and a multi-loss if $FAIL > 1$.
In the agreement procedure, each $E_j$ learns which messages $E_j$ has failed to receive and which messages sent by $E_j$ another process has failed to receive. Unless $E_j$ receives $p$ from $E_i$ such that $p.PSEQ_j < PS_{ij}$ within a predefined time, $E_j$ requests $E_i$ to retransmit $p$. If multiple messages are lost, each process sorts $RRL_i$ by the following sorting rule.
[Sorting rule]
(1) If $p.SRC = E_i$, $q.SRC = E_k$, and $E_i < E_k$ in $ORDR$, then $p \rightarrow_{RRL_j} q$.
(2) If $p.SRC = q.SRC$ and $p.PSEQ_j < q.PSEQ_j$, then $p \rightarrow_{RRL_j} q$. □
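The sorting rule amounts to sorting by the pair (rank of the source in $ORDR$, $PSEQ_j$). The Python sketch below applies it to the logs that remain after the sort line in the multi-loss case, with $ORDR$ = $E_3 < E_1 < E_2$. The tuple layout, the function name, and the $PSEQ$ values are assumptions for illustration; the sources of the messages follow the sending logs of Figure 3.

```python
def sort_rrl(rrl, ordr):
    """Sort a receipt log by (rank of the source process in ORDR, PSEQ).

    Each message is a tuple (name, source process, PSEQ for this receiver)."""
    rank = {proc: k for k, proc in enumerate(ordr)}
    return sorted(rrl, key=lambda m: (rank[m[1]], m[2]))

ordr = [3, 1, 2]   # RST PAKs arrived from E3, then E1, then E2

# Logs after the retransmitted messages are appended (PSEQ values are hypothetical).
rrl1 = [("c", 1, 1), ("p", 2, 1), ("z", 3, 2)]
rrl2 = [("q", 2, 1), ("y", 3, 2)]
rrl3 = [("z", 3, 2), ("q", 2, 1), ("c", 1, 1)]

for rrl in (rrl1, rrl2, rrl3):
    print([name for name, _, _ in sort_rrl(rrl, ordr)])
# ['z', 'c', 'p']
# ['y', 'q']
# ['z', 'c', 'q']
```

Every process uses the same $ORDR$, so any two processes that hold the same pair of messages order them identically.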
[Example 3] (1) Insertion: $RRL_1 = \langle c^1\ p^2\ z^3]$, $RRL_2 = \langle q^2]$, and $RRL_3 = \langle z^3\ q^2]$ are obtained by applying the agreement procedure to Figure 6 as presented in Example 2. Since it is a single-loss, i.e. $E_3$ loses $c$, $E_1$ retransmits $c$ and $E_3$ inserts $c$ before $z$ in $RRL_3$, i.e. $RRL_3 = \langle c\ z\ q]$. (2) Sorting: Suppose that $E_2$ and $E_3$ fail to receive $y$ and $c$, respectively, in Figure 3(b), i.e. a multi-loss. The sort line $sline(\{c, y\}) = \langle c, q, z \rangle$ as shown in Figure 6 is obtained by applying the agreement procedure. Then, $RRL_1 = \langle c^1\ p^2\ z^3]$, $RRL_2 = \langle q^2]$, and $RRL_3 = \langle z^3\ q^2]$. $c$ and $y$ are retransmitted and are appended to the end of $RRL_3$ and $RRL_2$, respectively. Suppose that the RST PAKs are received from $E_3$, $E_1$, and then $E_2$, i.e. $E_3 < E_1 < E_2$ in $ORDR$. Each $E_i$ obtains $RRL_i$ as shown in Figure 7 by sorting $RRL_i$. They are selectively totally ordered. □
RRL1: < c1f13g p2f1g zf313g ] RRL2: < qf223g yf32g ] RRL3: < zf313g qf223g c1f13g ]
)
< zf313g c1f13g p2f1g ] < yf32g qf223g ] < zf313g c1f13g qf223g ]
F 01 (1 0 F ) (l
1). The expected number of messages retransmitted for l is F l01 (1 0 F )(L 0 l + 1) (1 l L). Hence, the expected number RG of messages retransmitted is given as follows. l
Figure 7: Example of sorting
R = G
X F 0 (1 0 F )(L 0 l + 1): L
l
1
(2)
=1
l
[Proposition 4] A sublog of every RL obtained by i
sorting messages following the sort point after putting all the messages lost by Ei on the last of RLi is selectively totally ordered. 2
[Theorem 5] The ST protocol provides the ST ser-
vice by using the 1C service. [Proof] It is clear if there is no message loss. We consider a case that some messages are lost. It is sure that every message loss is detected by the failure condition. From Proposition 3, every process agrees on the sort line by the agreement procedure. There are two cases, i.e. single-loss and multi-loss. (1) Suppose that the single-loss occurs, say, Ei loses a message g . From Proposition 1, the theorem holds since only Ei inserts g in the sort point. (2) Suppose that the multi-loss occurs. From Proposition 2 and 3, every receipt log obtained by the sorting method is selectively totally ordered. 2
Figure 8 shows RS =L and RG =L for f where ST means d=n = 0:5 and TO means d=n = 1. Figure 9 shows RS =L and RG =L for d=n and f = 0:005. Following both gures, the ST protocol with the selective retransmission implies less number of messages retransmitted than the go-back-n. The number of messages retransmitted by the selective scheme is almost O(d=n) while O((d=n)c ) in the go-back-n one. 0.2
Retransmitted PDUs [R/L]
From Proposition 2, it is clear for the following proposition to hold.
TO ST TO ST
0.15
: : : :
go-back-n go-back-n selective selective
0.1
0.05
0 0
0.005 f
0.01
Figure 8: Ratios of retransmitted messages for f
5 Evaluation
Next, let us consider how many messages are retransmitted in the go-back-n scheme. In the go-back-n scheme, if Ei fails to receive p which is destined to Ei , Ei removes all the messages following p in RRLi. The messages removed are retransmitted by their source processes. If a message q removed by Ei is accepted by Ej , Ej removes all the messages following q from RLj . Thus, the removal of messages in one process is propagated to another process. The probability that p = RL[l] is lost by one destination while every RL[h](h < l) is received by every destination is
0.2
Retransmitted PDUs [R/L]
We would like to evaluate the ST protocol for a cluster C = hE1 ; :::; En i in terms of the number of messages retransmitted in the presence of the message loss. Let d ( n) be an average number of destination processes of each message. Let f be a probability that each message is lost by one process. Let RL be a sequence of messages transmitted in the 1C network. Let L be a number of messages in RL. Each Ej accepts only messages destined to Ej from RL. Let F be (1 0 f (d=n))d , i.e. probability that each message is received by every destination. The probability that each message is lost by at least one destination is (1 0 F ). Since the selective retransmission is used in the ST protocol, only messages lost are retransmitted. The expected number RS of messages retransmitted in RL is given as follows. RS = L(1 0 F ): (1)
go-back-n selective
0.1
0 0
0.2
0.4
0.6
0.8
1
d/n
Figure 9: Ratios of retransmitted messages for d/n

Next, let us consider the processing time $C$ to recover from the message loss. We assume that the time to insert one message in the receipt log is 1 and that the time to sort $h$ messages is $h$. We also assume that the time for comparing the sequence numbers of the messages in the log is negligible compared with the time to insert the messages in the log. The probability $F_0$ that only one message is lost by only one process is $f(d/n)(1 - f(d/n))^{nL-1}$. The probability $F_1$ that multiple messages are lost by more than one process is $1 - F_0 - F^{L}$. The expected time $C$ of the ST recovery per process is $F_0/n + F_1 R_G$. Here, let $D$ be the cost for each process to process the messages in $RL$ without message loss. From the assumption above, $D = L$. Figure 10 shows the ratios of $C$ to $D$ for $f$ and $d/n$. For example, if $f$ is smaller than 0.001, the overhead for recovering from the message loss is below 20% of the normal processing time.
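The quantities above can be combined into a small Python sketch that computes the ratio $C/D$. This is only an illustration under stated assumptions: the function name is hypothetical, and the go-back-n retransmission count $R_G$ is passed in as a plain parameter (`r_g`) rather than derived, because its closed form is not given in this passage.

```python
def recovery_cost_ratio(f: float, d: float, n: int, L: int, r_g: float) -> float:
    """Return C/D using the formulas as they read in the text.

    r_g stands for R_G; it is supplied by the caller (an assumption
    of this sketch), not computed here.
    """
    p = f * (d / n)                      # the term f(d/n) used in the text
    F = (1.0 - p) ** d                   # every destination receives a message
    F0 = p * (1.0 - p) ** (n * L - 1)    # exactly one loss by exactly one process
    F1 = 1.0 - F0 - F ** L               # multiple messages lost
    C = F0 / n + F1 * r_g                # expected recovery time per process
    D = L                                # cost without loss: one insert per message
    return C / D


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    for f in (0.0, 0.001, 0.005, 0.01):
        ratio = recovery_cost_ratio(f, d=5, n=10, L=100, r_g=10.0)
        print(f"f = {f:.3f}: C/D = {ratio:.4f}")
```

With `f = 0` the ratio is zero, since no message is lost and no recovery work is needed. For `f > 0` the result depends directly on the value supplied for `r_g`.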
[Figure 10 plot: processing time (C/D) versus f (0 to 0.01), with curves for d/n = 1.0, 0.8, 0.5, and 0.3]
Figure 10: Ratio of the processing time for f
6 Concluding Remarks
In this paper, we have presented a group communication protocol which provides a group of processes, i.e. a cluster, with the selective totally-ordering (ST) service by using the high-speed 1C network. The protocol is based on distributed control. In the ST service, each message can be delivered to any subset of the processes in the group, and different messages are received by their common destinations in the same order even in the presence of message loss. Furthermore, we have evaluated the ST protocol in terms of the number of retransmitted messages and the processing time needed to recover from message loss. By using the ST protocol, applications such as teleconferencing and cooperative work can be easily realized.
References
[1] Abeysundara, B. W. and Kamal, A. E., "High-Speed Local Area Networks and Their Performance: A Survey," ACM Computing Surveys, Vol.23, No.2, 1991, pp.221-264.
[2] Bernstein, P. A., Hadzilacos, V., and Goodman, N., "Concurrency Control and Recovery in Database Systems," Addison-Wesley, 1987.
[3] Birman, K. P., Schiper, A., and Stephenson, P., "Lightweight Causal and Atomic Group Multicast," ACM TOCS, Vol.9, No.3, 1991, pp.272-314.
[4] Chang, J. M. and Maxemchuk, N. F., "Reliable Broadcast Protocols," ACM TOCS, Vol.2, No.3, 1984, pp.251-273.
[5] Defense Communications Agency, "DDN Protocol Handbook," Vol.1-3, NIC 50004-50005, 1985.
[6] Ellis, C. A., Gibbs, S. J., and Rein, G. L., "Groupware," Comm. ACM, Vol.34, No.1, 1991, pp.38-58.
[7] Garcia-Molina, H. and Kogan, B., "An Implementation of Reliable Broadcast Using an Unreliable Multicast Facility," Proc. of the 7th IEEE Symp. on Reliable Distributed Systems, 1988, pp.428-437.
[8] International Standards Organization, "OSI - Connection Oriented Transport Protocol Specification," ISO 8073, 1986.
[9] Kaashoek, M. F. and Tanenbaum, A. S., "Group Communication in the Amoeba Distributed Operating System," Proc. of the IEEE ICDCS-11, 1991, pp.222-230.
[10] Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, Vol.21, No.7, 1978, pp.558-565.
[11] Luan, S. W. and Gligor, V. D., "A Fault-Tolerant Protocol for Atomic Broadcast," IEEE Trans. on Parallel and Distributed Systems, Vol.1, No.3, 1990, pp.271-285.
[12] Melliar-Smith, P. M., Moser, L. E., and Agrawala, V., "Broadcast Protocols for Distributed Systems," IEEE Trans. on Parallel and Distributed Systems, Vol.1, No.1, 1990, pp.17-25.
[13] Nakamura, A. and Takizawa, M., "Reliable Broadcast Protocol for Selectively Ordering PDUs," Proc. of the IEEE ICDCS-11, 1991, pp.239-246.
[14] Nakamura, A. and Takizawa, M., "Design of Reliable Broadcast Communication Protocol for Selectively Partially Ordering PDUs," Proc. of the IEEE COMPSAC91, 1991, pp.673-679.
[15] Nakamura, A. and Takizawa, M., "Priority-Based Total and Semi-Total Ordering Broadcast Protocols," Proc. of the IEEE ICDCS-12, 1992, pp.178-185.
[16] Nakamura, A. and Takizawa, M., "Causally Ordering Broadcast Protocol," to appear in Proc. of the IEEE ICDCS-14, 1994.
[17] Schneider, F. B., Gries, D., and Schlichting, R. D., "Fault-Tolerant Broadcasts," Science of Computer Programming, Vol.4, No.1, 1984, pp.1-15.
[18] Takizawa, M., "Cluster Control Protocol for Highly Reliable Broadcast Communication," Proc. of the IFIP Conf. on Distributed Processing, 1987, pp.431-445.
[19] Takizawa, M., "Design of Highly Reliable Broadcast Communication Protocol," Proc. of IEEE COMPSAC87, 1987, pp.731-740.
[20] Takizawa, M. and Nakamura, A., "Partially Ordering Broadcast (PO) Protocol," Proc. of the IEEE INFOCOM90, 1990, pp.357-364.
[21] Takizawa, M. and Nakamura, A., "Reliable Broadcast Communication," Proc. of IPSJ Int'l Conf. on Information Technology (InfoJapan), 1990, pp.325-332.
[22] Tanenbaum, A. S., "Computer Networks (2nd ed.)," Englewood Cliffs, NJ: Prentice-Hall, 1989.