Group Communication Protocols: Properties and Evaluation

Akihito NAKAMURA, Takayuki TACHIKAWA, and Makoto TAKIZAWA

Dept. of Computers and Systems Engineering
Tokyo Denki University
Ishizaka, Hatoyama, Hiki-gun, Saitama 350-03, JAPAN
e-mail fnaka,tachi,[email protected]

Abstract

In distributed systems, group communication among multiple entities is required in addition to conventional one-to-one communication. Group communication protocols provide multiple entities with a reliable data transmission service, i.e. messages are delivered to all the destination entities in the group. It is also important to guarantee that every application entity receives messages in a well-defined order when multiple entities send messages. This paper discusses logical properties of group communication. We present communication protocols that provide various kinds of group communication services and evaluate the protocols.

1 Introduction

In distributed applications like groupware [6], group communication among multiple entities is required in addition to conventional one-to-one communication like TCP/IP [5]. Group communication protocols support a many-to-many communication service for a group of entities. These protocols provide the group with atomic and ordered delivery of messages. Here, a unit of message exchanged among the entities is named a protocol data unit (PDU).

[23] presents a reliable broadcast protocol which uses one-to-one communication. [15] presents a broadcast protocol which provides totally ordered receipt of PDUs based on a majority-consensus decision. [7] characterizes message ordering properties in reliable broadcast protocols using the one-to-one network. [24] presents a cluster concept, which is an extension of the connection [11] to multiple service access points (SAPs), and shows how to establish the cluster by using the Ethernet [10]. Here, a cluster is considered to be a group of entities.

In group communication, an important problem is which entity coordinates the cooperation of the multiple entities in the cluster. Most approaches [4, 13, 23] adopt the centralized control scheme, where one master entity decides on the atomic and ordered receipt in the group. Here, the entities have to block until the decision of the master entity is delivered. In this paper, we adopt the distributed control scheme, where every entity makes the decision by itself. [24, 26] present a TO (total ordering broadcast) protocol on the Ethernet, by which every application entity can receive the PDUs in the same order without PDU loss. [25, 26] present an OP (order-preserving broadcast) protocol by which all the entities receive PDUs in the sending order but may not receive the PDUs in the same order. In these protocols, every PDU is broadcast to all the entities in the cluster. [17, 18] present an SP (selectively order-preserving broadcast) protocol which provides selective broadcast to a subset of the group. The group communication protocols have to tolerate communication and entity failures.

In this paper, we present a formal model of the data transmission aspect of group communication. We define various kinds of broadcast communication services required to implement distributed applications, and present group communication protocols supporting the services by using high-speed networks [1]. While bit error rates in high-speed networks are low ($10^{-9}$), entities may fail to receive PDUs due to buffer overrun because the processing speed of the entity is slower than the transmission speed [1]. Hence, PDU loss, i.e. omission fault, is the major failure in the high-speed network.

In section 2, we show a model of the group communication service. In section 3, we define the group communication services. In section 4, we discuss the distributed atomic receipt concept. In section 5, we present various distributed group communication protocols. The protocols are evaluated in section 6.

2 Model

In this section, we present a model of the communication service for multiple entities.

2.1 System model

A communication system is composed of application, system, and network layers as shown in Figure 1. Each application entity $A_i$ takes communication service through a service access point (SAP) $S_i$ supported by a system entity $E_i$ ($i = 1, \ldots, n$). A cluster $C$ [24] is a set of $n$ ($\geq 2$) system SAPs, i.e. $\{S_1, \ldots, S_n\}$. $E_1, \ldots, E_n$ cooperate with each other to support some broadcast service for $C$ by using the underlying network service. The cooperation of the system entities is coordinated by a system protocol. Here, $C$ is said to be supported by $E_1, \ldots, E_n$ (written as $C = \langle E_1, \ldots, E_n \rangle$) and to support $A_1, \ldots, A_n$. A data unit, i.e. a message, exchanged among entities at the same layer is a protocol data unit (PDU).

[Figure 1: System model. The application layer holds the application entities $A_1, \ldots, A_n$. Each $A_i$ accesses the system layer through a system SAP $S_i$ supported by a system entity $E_i$. Each $E_i$ accesses the network layer (high-speed network) through a network SAP $N_i$.]

2.2 Ordered delivery

Let $T_i$ be entities which use the broadcast communication service ($i = 1, \ldots, n$). $T_i$ means either a system or an application entity. Each $T_i$ is modeled as a sequence of sending and receipt events. Let $s_i[p]$ and $r_i[p]$ denote the sending and receipt events of a PDU $p$ in $T_i$, respectively. A happened-before relation [14] $\rightarrow$ on the events is defined as follows.

[Definition] For every pair of events $e_1$ and $e_2$, $e_1 \rightarrow e_2$ ($e_1$ happens-before $e_2$) iff (1) $e_1$ occurs before $e_2$ in $T_i$, (2) for some (not necessarily different) $T_i$ and $T_j$, there exists some PDU $p$ such that $e_1 = s_i[p]$ and $e_2 = r_j[p]$, or (3) for some event $e_3$, $e_1 \rightarrow e_3$ and $e_3 \rightarrow e_2$. □

We model a broadcast communication service as a set of logs. A log $L$ is a sequence of PDUs $\langle p_1 \ldots p_m]$, where $p_1$ and $p_m$ are the top and the last PDUs, denoted by $top(L)$ and $last(L)$, respectively. $p_h$ precedes $p_k$ in $L$ ($p_h \rightarrow_L p_k$) if $h < k$. Each $T_i$ has a sending log $SL_i$ and a receipt log $RL_i$, which are sequences of PDUs sent and received by $T_i$, respectively ($i = 1, \ldots, n$). If $T_i$ receives $q$ after $p$, $p \rightarrow_{RL_i} q$. If $T_i$ sends $q$ after $p$, $p \rightarrow_{SL_i} q$. $SL_{ij}$ and $RL_{ij}$ are sublogs of $SL_i$ and $RL_i$ which include PDUs destined to $T_j$ and PDUs received from $T_j$, respectively ($j = 1, \ldots, n$).

• $RL_i$ is local-order-preserved iff for every $T_j$, $p \rightarrow_{RL_i} q$ if $p \rightarrow_{SL_j} q$. $RL_i$ is information-preserved iff $RL_i$ includes all the PDUs in $SL_1, \ldots, SL_n$.
• $RL_i$ and $RL_j$ are order-equivalent iff for every pair of $p$ and $q$ included in both $RL_i$ and $RL_j$, $p \rightarrow_{RL_i} q$ iff $p \rightarrow_{RL_j} q$. $RL_i$ and $RL_j$ are information-equivalent iff they include the same PDUs.

If $RL_i$ is local-order-preserved, $T_i$ receives PDUs from each entity in the sending order. $T_i$ receives all the PDUs sent in $C$ if $RL_i$ is information-preserved. If $RL_i$ and $RL_j$ are order-equivalent, $T_i$ and $T_j$ receive every two PDUs in the same order if they receive the PDUs. $T_i$ and $T_j$ receive the same PDUs if $RL_i$ and $RL_j$ are information-equivalent.
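The four log properties above can be checked mechanically. The following is a minimal sketch in Python, assuming that a log is a list of PDU names and that each PDU occurs at most once in a log; the function names and the example logs (taken from the sender-based ordering example of Figure 3) are illustrative and are not part of the protocols.

```python
from itertools import combinations

def precedes(log, p, q):
    """p ->_L q: p appears before q in the log."""
    return log.index(p) < log.index(q)

def local_order_preserved(rl, sending_logs):
    """PDUs from each sender appear in rl in that sender's sending order."""
    for sl in sending_logs:
        received = [p for p in sl if p in rl]
        # combinations() yields pairs in sending order, so p was sent before q
        if any(not precedes(rl, p, q) for p, q in combinations(received, 2)):
            return False
    return True

def information_preserved(rl, sending_logs):
    """rl includes every PDU of every sending log."""
    return all(p in rl for sl in sending_logs for p in sl)

def order_equivalent(rl_i, rl_j):
    """Every two PDUs found in both logs appear in the same relative order."""
    common = [p for p in rl_i if p in rl_j]
    return all(precedes(rl_j, p, q) for p, q in combinations(common, 2))

def information_equivalent(rl_i, rl_j):
    """Both logs include the same PDUs."""
    return set(rl_i) == set(rl_j)

# Example logs; every PDU is a one-letter name (assumed encoding)
SL = [list("abcd"), list("pq"), list("xyz")]
LO_LOGS = [list("axbcpydzq"), list("pxabyczdq"), list("axpybqczd")]
TO_LOGS = [list("axbcpydzq")] * 3
```

With these logs, every log in `LO_LOGS` is local-order-preserved and information-preserved, but the logs are not pairwise order-equivalent; the logs in `TO_LOGS` satisfy all of the properties.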

A service where each entity $T_i$ can send PDUs to a subset of the cluster rather than to all the entities is named selective broadcast communication [17, 18].

• $RL_i$ is selectively information-preserved iff $RL_i$ includes all the PDUs in $SL_{1i}, \ldots, SL_{ni}$, i.e. $T_i$ receives all and only the PDUs destined to $T_i$.

There are applications where higher-priority PDUs have to be delivered earlier than lower-priority PDUs. For example, PDUs which carry real-time data like voice have higher priorities than those of file transfer. Let $p.PRI$ denote the priority of $p$. If $p$ has a higher priority than $q$, $p.PRI > q.PRI$.

• $RL_i$ is priority-based ordered iff for every pair of $p$ and $q$ in $RL_i$, (1) if $p.PRI > q.PRI$, then $p \rightarrow_{RL_i} q$, and (2) if $p.PRI = q.PRI$, $p$ and $q$ are sent by $E_j$, and $s_j[p] \rightarrow s_j[q]$, then $p \rightarrow_{RL_i} q$.

To incorporate application semantics into the communication system, it is important to consider the cause-effect relationship among the events. If an application entity $A_k$ sends $q$ after receiving $p$, all the common destinations of $p$ and $q$ have to receive $p$ before $q$. The happened-before relation $\rightarrow$ reflects this cause-effect relationship. The causal order [3] $\prec$ among PDUs is defined as follows.

[Definition] For every pair of $p$ and $q$, $p$ causality-precedes $q$ ($p \prec q$) iff $s_i[p] \rightarrow s_j[q]$. □

The relation $\prec$ is transitive but not symmetric.

• $RL_i$ is causality-preserved iff for every pair of $p$ and $q$ in $RL_i$, $p \rightarrow_{RL_i} q$ if $p \prec q$.

It is straightforward that $RL_i$ is local-order-preserved if it is causality-preserved. In Figure 2, $RL_3 = \langle g\ p\ q]$ is causality-preserved: $g \prec p \prec q$, i.e. $s_1[g] \rightarrow s_1[p] \rightarrow s_2[q]$ and $r_3[g] \rightarrow r_3[p] \rightarrow r_3[q]$. If $A_3$ receives $q$ before $p$ as shown by the dotted line, $RL_3' = \langle g\ q\ p]$ is not causality-preserved (but is local-order-preserved).

[Figure 2: Causality-preserved receipt. Time diagram for $A_1$, $A_2$, and $A_3$: $A_1$ sends $g$ and then $p$, and $A_2$ sends $q$ after receiving $p$. $RL_3 = \langle g\ p\ q]$; $RL_3' = \langle g\ q\ p]$ (dotted line).]
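The example of Figure 2 can be replayed with a small sketch. It assumes that each entity is given as a list of local events, `("s", pdu)` for sending and `("r", pdu)` for receipt, and that every PDU is sent once. The function names, the event encoding, and the assumption that $A_2$ also receives $g$ are illustrative choices made here, not part of the paper.

```python
def happened_before(histories):
    """Return the set of event pairs (e1, e2) with e1 -> e2.

    histories maps an entity name to its local event sequence; an event is
    ("s", pdu) for sending or ("r", pdu) for receipt.
    """
    events = [(t, kind, p) for t, h in histories.items() for kind, p in h]
    hb = set()
    for t, h in histories.items():                 # rule (1): local order
        for a in range(len(h)):
            for b in range(a + 1, len(h)):
                hb.add(((t, *h[a]), (t, *h[b])))
    for e1 in events:                              # rule (2): sending before receipt
        for e2 in events:
            if e1[1] == "s" and e2[1] == "r" and e1[2] == e2[2]:
                hb.add((e1, e2))
    changed = True                                 # rule (3): transitivity
    while changed:
        changed = False
        for a, b in list(hb):
            for c, d in list(hb):
                if b == c and (a, d) not in hb:
                    hb.add((a, d))
                    changed = True
    return hb

def causality_preserved(rl, histories):
    """True if p precedes q in rl whenever the sending of p happened before that of q."""
    hb = happened_before(histories)
    send = {p: (t, "s", p) for t, h in histories.items() for kind, p in h if kind == "s"}
    return all(
        rl.index(p) < rl.index(q)
        for p in rl for q in rl
        if p != q and (send[p], send[q]) in hb
    )

# Senders of Figure 2 (A2 receiving g is an assumption of this sketch)
HISTORIES = {
    "A1": [("s", "g"), ("s", "p")],
    "A2": [("r", "g"), ("r", "p"), ("s", "q")],
}
```

Calling `causality_preserved(["g", "p", "q"], HISTORIES)` accepts the receipt log $RL_3$, while `causality_preserved(["g", "q", "p"], HISTORIES)` rejects $RL_3'$ because $p \prec q$.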

3 Group Communication Services

3.1 System services

We consider services supported by the system layer.

A. Sender-based ordering services

• Locally Ordering Broadcast (LO) service: Every receipt log $RL_i$ is information-preserved and local-order-preserved.
• Totally Ordering Broadcast (TO) service: Every $RL_i$ is information-preserved, local-order-preserved, and order-equivalent.

In the LO service, every entity receives PDUs from each entity in the sending order. For example, in Figure 3 (a), every entity receives $q$ after $p$ from $A_2$. In the TO service, every entity receives all the PDUs in the same order. For example, $RL_i = \langle a\ x\ b\ c\ p\ y\ d\ z\ q]$ in Figure 3 (b) ($i = 1, 2, 3$).

SL1: <a b c d]
SL2: <p q]
SL3: <x y z]

(a) LO service
RL1: <a x b c p y d z q]
RL2: <p x a b y c z d q]
RL3: <a x p y b q c z d]

(b) TO service
RL1: <a x b c p y d z q]
RL2: <a x b c p y d z q]
RL3: <a x b c p y d z q]

Figure 3: Sender-based ordering services

B. Priority-based ordering services

• Priority-Based Ordering Broadcast (PriO) service: Every $RL_i$ is priority-based ordered and information-preserved.
• Priority-Based Totally Ordering Broadcast (PriTO) service: Every $RL_i$ is priority-based ordered, information-preserved, and order-equivalent.

Let $p[r]$ denote that $p.PRI = r$. Figure 4 shows an example of the PriO service. In the PriTO service, the PDUs with the same priority are received in the same order, e.g. $RL_i = \langle c\ b\ y\ p\ a\ x\ q\ d\ z]$ for the PDUs in Figure 4 ($i = 1, 2, 3$).

SL1: <a[1] b[2] c[3] d[1]]
SL2: <p[2] q[1]]
SL3: <x[1] y[2] z[1]]

RL1: <c[3] b[2] y[2] p[2] a[1] x[1] q[1] d[1] z[1]]
RL2: <c[3] y[2] b[2] p[2] x[1] q[1] z[1] a[1] d[1]]
RL3: <c[3] p[2] b[2] y[2] q[1] x[1] z[1] a[1] d[1]]

Figure 4: PriO service

C. Causality-based ordering service

• Causally Ordering Broadcast (CO) service: Every $RL_i$ is information-preserved and causality-preserved.

In the CO service, if $s_j[p] \rightarrow s_k[q]$ for $p$ and $q$, $r_i[p] \rightarrow r_i[q]$ in every $E_i$.

D. Selective broadcast services

The LO and TO services are suitable for applications where every application entity executes the same program. On the other hand, in applications where several different programs are executed by one or more entities, each application entity needs to send each PDU to a subset of $C$ rather than to all the entities in $C$.

• Selectively Locally Ordering Broadcast (SLO) service: Every $RL_i$ is local-order-preserved and selectively information-preserved.
• Selectively Totally Ordering Broadcast (STO) service: Every $RL_i$ is local-order-preserved, selectively information-preserved, and order-equivalent.
• Selectively Causally Ordering Broadcast (SCO) service: Every $RL_i$ is selectively information-preserved and causality-preserved.

Let $p.DST$ be a set $\{A_{d_1}, \ldots, A_{d_m}\}$ of destination application entities of a PDU $p$, i.e. $p.DST \subseteq C$. Here, $p$ can be written as $p_{\{d_1, \ldots, d_m\}}$. In the SLO service, $p \rightarrow_{RL_{ij}} q$ in every $E_i \in p.DST \cap q.DST$ if $p \rightarrow_{SL_j} q$. In Figure 5 (a), every $RL_i$ is local-order-preserved and selectively information-preserved, e.g. $A_1$ receives $y$ after $x$ from $A_3$. In the STO service, every common destination of PDUs receives the PDUs in the same order. In Figure 5 (b), $a$ and $p$ are received by $E_2$ and $E_3$ in the same order, i.e. $a \rightarrow_{RL_i} p$ ($i = 2, 3$).

SL1: <a{2,3} b{3} c{1,3} d{1,2}]
SL2: <p{1,2,3} q{1}]
SL3: <x{1,2} y{1,2,3} z{3}]

(a) SLO service
RL1: <x c p y d q]
RL2: <p x a d y]
RL3: <a p y b c z]

(b) STO service
RL1: <x c p y d q]
RL2: <a x p y d]
RL3: <a b c p y z]

Figure 5: SBC services

3.2 Network services

Next, we define the services provided by the underlying network layer.

• One-Channel (1C) service: Every $RL_i$ is local-order-preserved and order-equivalent.
• Multi-Channel (MC) service: Every $RL_i$ is local-order-preserved.

In the 1C service, PDUs are delivered to the entities in the same order, but some PDUs may be lost. The 1C service is a model of a high-speed channel [1, 12]. In the MC service, every entity receives PDUs from each entity in the sending order, but some entities may fail to receive some PDUs. The MC service can be provided by a system where computers are fully connected by logical or physical point-to-point links. The services defined in this paper are summarized in Table 1.
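The priority-based and selective properties used in the service definitions of section 3.1 can be checked in the same mechanical way as the sender-based ones. The sketch below assumes logs are lists of one-letter PDU names; the dictionaries `PRI` and `DST` transcribe the priorities of Figure 4 and the destination sets of Figure 5, and all function names are illustrative.

```python
def priority_based_ordered(rl, pri, sending_logs):
    """Check conditions (1) and (2) of the priority-based ordered property."""
    # (1) no PDU is received after a PDU of strictly lower priority
    for idx, p in enumerate(rl):
        if any(pri[q] > pri[p] for q in rl[idx + 1:]):
            return False
    # (2) PDUs of one sender with equal priority keep the sending order
    for sl in sending_logs:
        for idx, p in enumerate(sl):
            for q in sl[idx + 1:]:
                if (pri[p] == pri[q] and p in rl and q in rl
                        and rl.index(p) > rl.index(q)):
                    return False
    return True

def selectively_information_preserved(i, rl, dst, sending_logs):
    """RL_i holds all and only the PDUs destined to entity i."""
    expected = {p for sl in sending_logs for p in sl if i in dst[p]}
    return set(rl) == expected

SL = [list("abcd"), list("pq"), list("xyz")]

# Priorities of Figure 4 and its three receipt logs
PRI = {"a": 1, "b": 2, "c": 3, "d": 1, "p": 2, "q": 1, "x": 1, "y": 2, "z": 1}
PRIO_LOGS = [list("cbypaxqdz"), list("cybpxqzad"), list("cpbyqxzad")]

# Destination sets of Figure 5 and the receipt logs of Figure 5 (a)
DST = {"a": {2, 3}, "b": {3}, "c": {1, 3}, "d": {1, 2}, "p": {1, 2, 3},
       "q": {1}, "x": {1, 2}, "y": {1, 2, 3}, "z": {3}}
SLO_LOGS = {1: list("xcpydq"), 2: list("pxady"), 3: list("apybcz")}
```

Every log in `PRIO_LOGS` passes `priority_based_ordered`, and every log in `SLO_LOGS` passes `selectively_information_preserved` for its own entity index.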

Table 1: Group communication services

service         LO    TO    CO    SLO   STO   SCO   PriO  PriTO MC    1C
i-preserved     yes   yes   yes   *     *     *     yes   yes   no    no
i-equivalent    yes   yes   yes   no    no    no    yes   yes   no    no
lo-preserved    yes   yes   yes   yes   yes   yes   no    no    yes   yes
o-equivalent    no    yes   no    no    yes   no    no    yes   no    yes
c-preserved     -     -     yes   -     -     yes   -     -     -     -
p-ordered       -     -     -     -     -     -     yes   yes   -     -

* selectively information-preserved. i = information, lo = local-order, o = order, c = causality, p = priority.

4 Atomic Receipt Concept

In this section, we consider how PDUs are atomically received in a cluster $C = \langle E_1, \ldots, E_n \rangle$. There are three approaches to deciding on the atomic receipt, i.e. centralized, decentralized, and distributed ones. In the centralized approach, one master controller decides on it based on the two-phase commitment protocol [9]. In the decentralized one, the sender of each PDU is the controller of the PDU. In this paper, we discuss the distributed control [24], where each entity decides on the atomic receipt.

Every PDU from each $E_j$ carries the receipt confirmation of the PDUs which $E_j$ has already received. On receipt of $q$ from $E_j$, $E_i$ knows that $E_j$ has received every $p$ where $r_j[p] \rightarrow s_j[q]$. Here, let $p$ and $q$ be PDUs sent by $E_k$ and $E_j$, respectively. $q$ is said to pre-acknowledge $p$ for $E_j$ in $E_i$ ($p \Rightarrow_{ji} q$) iff $s_k[p] \rightarrow r_i[p]$ and $s_k[p] \rightarrow r_j[p] \rightarrow s_j[q] \rightarrow r_i[q]$. In Figure 6, $E_2$ sends $c$ after receiving $a$. On receipt of $c$, $E_3$ knows that $E_2$ has received $a$. Here, $s_1[a] \rightarrow r_2[a] \rightarrow s_2[c] \rightarrow r_3[c]$, i.e. $c$ pre-acknowledges $a$ for $E_2$ in $E_3$ ($a \Rightarrow_{23} c$).

There are three criteria levels [24] of $E_i$'s atomic receipt for every PDU $p$ in $C$ in the distributed way:

(1) Acceptance: $E_i$ receives $p$.
(2) Pre-acknowledgment: $E_i$ knows that every destination of $p$ has accepted $p$. That is, for every $E_j$, there exists $q$ such that $p \Rightarrow_{ji} q$.
(3) Acknowledgment: $E_i$ knows that $p$ has been pre-acknowledged by every destination of $p$. That is, for every $E_j$ and $E_h$, and $q$ where $p \Rightarrow_{hj} q$, there exists $g$ such that $q \Rightarrow_{ji} g$.

The acknowledgment of $p$ by $E_i$ means that $E_i$ knows that every destination of $p$ knows that every destination has accepted $p$. Even if $p$ is pre-acknowledged in $E_i$, $E_i$ cannot decide whether the other entities also know that $p$ has been accepted by all the destinations in $C$. If $p$ is acknowledged in $E_i$, $E_i$ knows that $p$ is pre-acknowledged by every destination, i.e. every other $E_k$ knows that $p$ is at least accepted by every destination. That is, $E_i$ considers that $p$ is atomically received by every destination. In Figure 6, suppose that every $E_i$ sends $g_i$ after receiving $b$, $c$, $d$, and $e$. If $E_4$ accepts $g_i$ from every $E_i$, $a$ is acknowledged by $E_4$.
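The three criteria levels can be illustrated with a simplified sketch. It assumes that each PDU explicitly carries the names of the PDUs its sender had accepted before sending it, instead of the sequence numbers used by the protocols of section 5. The class `PDU`, the field `confirms`, and the function names are hypothetical; the data follow the Figure 6 scenario under the assumption that $b$, $c$, $d$, and $e$ are sent by $E_1$, $E_2$, $E_3$, and $E_4$, respectively.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PDU:
    name: str
    src: int              # index j of the sending entity E_j
    confirms: frozenset   # names of the PDUs the sender had accepted before sending

def accepted(view, p):
    """Level (1): p itself is among the PDUs E_i has received."""
    return any(q.name == p for q in view)

def pre_acknowledged(view, p, n):
    """Level (2): for every E_j, the view holds some q from E_j that confirms p."""
    return all(any(q.src == j and p in q.confirms for q in view)
               for j in range(1, n + 1))

def acknowledged(view, p, n):
    """Level (3): for every E_j, the view holds some g from E_j showing that
    E_j had itself pre-acknowledged p before sending g."""
    by_name = {q.name: q for q in view}

    def shows_preack(g):
        seen_by_sender = [by_name[x] for x in g.confirms if x in by_name]
        return pre_acknowledged(seen_by_sender, p, n)

    return all(any(g.src == j and shows_preack(g) for g in view)
               for j in range(1, n + 1))

# Scenario of Figure 6 with n = 4 (senders of b, d, and e are assumed)
a = PDU("a", 1, frozenset())
first_round = [PDU(name, j, frozenset({"a"})) for j, name in enumerate("bcde", start=1)]
second_round = [PDU(f"g{j}", j, frozenset("abcde")) for j in range(1, 5)]
```

In this scenario, an entity holding `a` and `first_round` has pre-acknowledged `a`, and it acknowledges `a` only once it also holds all of `second_round`.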

5 Group Communication Protocols

In this section, we present group communication protocols for a cluster $C = \langle E_1, \ldots, E_n \rangle$, which provide the system services by using the 1C or MC service.

[Figure 6: Pre-acknowledgment and acknowledgment. Time diagram for $E_1, \ldots, E_4$ showing the sending and receipt events of the PDUs $a$, $b$, $c$, $d$, $e$, and $f$ ($= g_3$).]

5.1 Variables

Each PDU $p$ exchanged among the system entities consists of the following fields ($j = 1, \ldots, n$).

• $p.CID$ = cluster identifier.
• $p.SRC$ = entity $E_i$ which transmits $p$.
• $p.DST$ = set of destination entities of $p$.
• $p.PRI$ = priority of $p$.
• $p.TSEQ$ = total sequence number of $p$.
• $p.PSEQ_j$ = partial sequence number for $E_j$.
• $p.ACK_j$ = total sequence number of a PDU which $E_i$ expects to receive next from $E_j$.
• $p.BUF$ = number of buffers available in $E_i$.
• $p.DATA$ = data to be broadcast.

DST and PSEQ are used in the selective protocols. PRI is used in the priority-based ordering protocols.

Each entity $E_i$ has the following local variables ($j = 1, \ldots, n$):

• $TSEQ$ = total sequence number of a PDU which $E_i$ expects to broadcast next.
• $PSEQ_j$ = partial sequence number of a PDU which $E_i$ expects to send to $E_j$ next.
• $TREQ_j$ = total sequence number of a PDU which $E_i$ expects to receive next from $E_j$.
• $PREQ_j$ = partial sequence number of a PDU which $E_i$ expects to receive next from $E_j$.
• $AL_{hj}$ = total sequence number of a PDU which $E_i$ knows that $E_j$ expects to receive next from $E_h$ ($h = 1, \ldots, n$).
• $PAL_{hj}$ = total sequence number of a PDU which $E_i$ knows that $E_j$ expects to pre-acknowledge next from $E_h$ ($h = 1, \ldots, n$).
• $BUF_j$ = available buffer size in $E_j$ which $E_i$ knows of.

Let $ISS_j$ and $IBF_j$ be an initial total sequence number and an initial available buffer size in $E_j$, respectively. $E_i$ obtains $ISS_j$ and $IBF_j$ of every $E_j$ in the cluster establishment procedure [24]. Initially, $TSEQ = PSEQ_j = ISS_i$, $TREQ_j = PREQ_j = AL_{jh} = ISS_j$, and $BUF_j = IBF_j$ ($j, h = 1, \ldots, n$) in $E_i$.

5.2 Common procedures

There are some procedures commonly used in the group communication protocols.

A. Transmission

$E_i$ broadcasts a PDU $p$ by one of the following actions. In the non-selective protocols, $E_i$ executes BC1. BC2 is executed after BC1 in the selective ones.

BC1. (1) $p.TSEQ := TSEQ$, (2) $TSEQ := TSEQ + 1$, (3) $p.ACK_k := TREQ_k$ ($k = 1, \ldots, n$), and (4) $p.BUF$ := available buffer size.
BC2. (1) $p.PSEQ_j := PSEQ_j$ ($j = 1, \ldots, n$) and (2) if $E_j$ is a destination of $p$, $PSEQ_j := PSEQ_j + 1$ and $p.DST := p.DST \cup \{E_j\}$ ($j = 1, \ldots, n$).

B. Acceptance

On receipt of $p$ from $E_j$, $E_i$ accepts $p$ by the ACC1 action if $p$ satisfies AC1 in the non-selective protocols.

AC1. $p.TSEQ = TREQ_j$.
ACC1. (1) $TREQ_j := p.TSEQ + 1$, (2) $BUF_j := p.BUF$, and (3) $AL_{kj} := p.ACK_k$ ($k = 1, \ldots, n$).

Unless AC1 holds, $E_i$ finds loss of some PDU. In the selective protocols, even if $E_i$ fails to receive a PDU $g$ from $E_j$, if $E_i \notin g.DST$, the loss of $g$ is not a failure. AC2 is used for such a case. If $p$ satisfies AC1 or AC2, $E_i$ executes ACC1 and ACC2.

AC2. $p.PSEQ_i = PREQ_j$ and $E_i \in p.DST$.
ACC2. If $E_i \in p.DST$, $PREQ_j := p.PSEQ_i + 1$. Otherwise, $E_i$ discards $p$.

Suppose that $E_j$ broadcasts $a$, $b$, and $c$, and $E_i$ accepts $a$ as shown in Figure 7. Here, $TREQ_j = 4$ and $PREQ_j = 3$ in $E_i$. $E_i$ receives $c$ where $c.TSEQ = 5$ and $c.PSEQ_i = 3$. Here, AC1 does not hold. However, since $c.PSEQ_i = PREQ_j$ and $E_i \in c.DST$, $E_i$ knows that $E_i \notin b.DST$ [Figure 7 (a)]. If $E_i \notin c.DST$, there must be some PDU $b$ where $b.PSEQ_i = 3$ and $E_i \in b.DST$ [Figure 7 (b)].

[Figure 7: Acceptance condition. Time diagrams in which $E_j$ broadcasts $a$, $b$, and $c$ to $E_i$: (a) $E_i \notin b.DST$, (b) $E_i \in b.DST$.]

C. Failure detection

On receipt of $p$ from $E_j$, $E_i$ detects PDU loss by checking the sequence numbers.

FC1. $TREQ_j < p.TSEQ$.
FC2. $TREQ_k < p.ACK_k$ for some $k$ ($\neq j$).

If FC1 holds, $E_i$ has not received $g$ from $E_j$ such that $TREQ_j \leq g.TSEQ < p.TSEQ$. If FC2 holds, $E_i$ has not received $g$ from $E_k$ such that $TREQ_k \leq g.TSEQ < p.ACK_k$. The selective protocols use FC3 instead of FC1.

FC3. $PREQ_j < p.PSEQ_i$, or $PREQ_j = p.PSEQ_i$ and $E_i \notin p.DST$.

D. Pre-acknowledgment

If the PC condition holds for $p$ ($p.SRC = E_j$), the $p.ACK$s are recorded in $PAL$ by the PACK action.

PC. $p.TSEQ < \min \{AL_{jk} \mid k = 1, \ldots, n\}$.
PACK. $PAL_{kj} := p.ACK_k$ ($k = 1, \ldots, n$).

E. Acknowledgment

$p$ (from $E_j$) is acknowledged if AC holds.

AC. $p.TSEQ < \min \{PAL_{jk} \mid k = 1, \ldots, n\}$.
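The sequence-number checks of the non-selective case can be exercised with a short sketch. It assumes 0-based entity indices and an initial sequence number of 0, and it leaves out the BUF, AL, and PAL bookkeeping; `PDU`, `State`, and the lower-case function names are illustrative stand-ins for the fields and for the BC1, AC1, ACC1, FC1, and FC2 rules.

```python
from dataclasses import dataclass

@dataclass
class PDU:
    src: int     # index j of the sender E_j
    tseq: int    # p.TSEQ
    ack: list    # ack[k] = p.ACK_k

@dataclass
class State:
    tseq: int    # TSEQ: next total sequence number to broadcast
    treq: list   # treq[j] = TREQ_j: next TSEQ expected from E_j

def bc1(i, st):
    """BC1 (without BUF): stamp the PDU with TSEQ and the current TREQ vector."""
    p = PDU(i, st.tseq, list(st.treq))
    st.tseq += 1
    return p

def ac1(st, p):
    """AC1: p is the PDU expected next from its sender."""
    return p.tseq == st.treq[p.src]

def acc1(st, p):
    """ACC1 (1): advance TREQ_j past the accepted PDU."""
    st.treq[p.src] = p.tseq + 1

def fc1(st, p):
    """FC1: a PDU from the sender of p is missing."""
    return st.treq[p.src] < p.tseq

def fc2(st, p):
    """FC2: p confirms a PDU from some other E_k that this entity has not received."""
    return any(st.treq[k] < p.ack[k] for k in range(len(st.treq)) if k != p.src)

# Three entities; E0 broadcasts a and b, and b is lost on its way to E2
E0, E1, E2 = State(0, [0, 0, 0]), State(0, [0, 0, 0]), State(0, [0, 0, 0])
a, b = bc1(0, E0), bc1(0, E0)
for p in (a, b):
    if ac1(E1, p):
        acc1(E1, p)
if ac1(E2, a):
    acc1(E2, a)
c = bc1(0, E0)   # next PDU from E0
q = bc1(1, E1)   # PDU from E1 whose ACK confirms a and b
```

In this run, `E2` detects the loss of `b` either through FC1 when `c` arrives or through FC2 when `q` arrives, whichever PDU it receives first.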

F. Reset

The RST (reset) action is invoked in order to resynchronize the entities.

RST. (1) $E_i$ broadcasts an RST PDU $r$ where $r.ACK_j := REQ_j$ ($j = 1, \ldots, n$). (2) On receipt of $r$, $REQ_j := r.ACK_j$ if $REQ_j > r.ACK_j$ ($j = 1, \ldots, n$) in $E_k$. If RST is received from every entity, $E_k$ broadcasts an RST PK $rp$ where $rp.ACK_j := REQ_j$. (3) On receipt of all the RST PKs, $E_k$ broadcasts RST AK. Here, every $E_i$ has the same $REQ$s.

5.3 TO protocol

The TO protocol [24, 26] provides the TO service by using the 1C service.

A. Data transmission

Each $E_i$ accepts, pre-acknowledges, and acknowledges each PDU by the following three-phase procedure. Each $RL_i$ consists of three sublogs $RRL_i$, $PRL_i$, and $ARL_i$, which are composed of accepted, pre-acknowledged, and acknowledged PDUs, respectively.

(1) Transmission and acceptance: (1-1) $E_j$ broadcasts a PDU by the BC1 action. (1-2) On receipt of $p$ from $E_j$, $E_i$ accepts $p$ if AC1 holds. $E_i$ executes ACC1 and appends $p$ to the tail of $RRL_i$.
(2) Pre-acknowledgment: If $p = top(RRL_i)$ satisfies PC, $E_i$ removes $p$ from $RRL_i$, appends $p$ to $PRL_i$, and executes PACK.
(3) Acknowledgment: If $p = top(PRL_i)$ satisfies AC, $E_i$ removes $p$ from $PRL_i$ and appends $p$ to $ARL_i$.

Every application entity $A_i$ receives PDUs in the same order by taking $top(ARL_i)$.

B. Failure detection and recovery

PDU loss can be detected by checking FC1 and FC2. The lost PDUs are retransmitted by go-back-n retransmission. That is, all the entities agree on which PDUs are lost by using RST, and then all the PDUs following the lost PDUs are rebroadcast. If selective retransmission were used, some additional mechanism would be required to order the retransmitted PDUs.
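The three-phase data transmission procedure of the TO protocol can be sketched as a small loss-free simulation. It assumes 0-based entity indices, an initial sequence number of 0, and a 1C network that delivers every PDU to all entities (the sender included) in the same order; loss handling by RST and go-back-n, BUF, and flow control are left out. The class `ToEntity`, the dictionary-based PDU, and `run` are illustrative names, not part of the protocol specification.

```python
class ToEntity:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.tseq = 0                               # TSEQ
        self.treq = [0] * n                         # treq[j] = TREQ_j
        self.al = [[0] * n for _ in range(n)]       # al[h][j] = AL_hj
        self.pal = [[0] * n for _ in range(n)]      # pal[h][j] = PAL_hj
        self.rrl, self.prl, self.arl = [], [], []   # accepted / pre-acked / acked

    def broadcast(self, data):
        """BC1: stamp TSEQ and the ACK vector."""
        p = {"src": self.i, "tseq": self.tseq, "ack": list(self.treq), "data": data}
        self.tseq += 1
        return p

    def receive(self, p):
        """Phase (1): accept p if AC1 holds, execute ACC1, append to RRL."""
        j = p["src"]
        if p["tseq"] != self.treq[j]:
            return                                  # PDU loss; recovery is omitted here
        self.treq[j] = p["tseq"] + 1
        for k in range(self.n):
            self.al[k][j] = p["ack"][k]
        self.rrl.append(p)
        self._advance()

    def _advance(self):
        """Phases (2) and (3): move the top PDUs while PC or AC holds."""
        moved = True
        while moved:
            moved = False
            if self.rrl and self.rrl[0]["tseq"] < min(self.al[self.rrl[0]["src"]]):
                p = self.rrl.pop(0)                 # PC holds for top(RRL)
                for k in range(self.n):             # PACK
                    self.pal[k][p["src"]] = p["ack"][k]
                self.prl.append(p)
                moved = True
            if self.prl and self.prl[0]["tseq"] < min(self.pal[self.prl[0]["src"]]):
                self.arl.append(self.prl.pop(0))    # AC holds for top(PRL)
                moved = True

def run(n=3, rounds=3):
    """Every entity broadcasts once per round; PDUs reach all entities in one order."""
    entities = [ToEntity(i, n) for i in range(n)]
    for r in range(rounds):
        for sender in entities:
            p = sender.broadcast(f"m{r}.{sender.i}")
            for e in entities:
                e.receive(p)
    return entities
```

With `n = 3`, two rounds leave the first three PDUs pre-acknowledged but not yet acknowledged; after the third round every entity holds the same three PDUs in `arl`, in the same order.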

C. Flow control

In group communication, every entity controls its sending of PDUs so that every PDU can be received without buffer overflow. Each $E_i$ notifies every entity of its available buffer size $BUF$. Let $minBUF$ be the minimum among $BUF_1, \ldots, BUF_n$. $E_i$ can send $minBUF / n$ PDUs continuously, i.e. $minBUF / n$ gives the window size.

It is clear that the TO protocol on the 1C service provides the CO service. On the MC service, the TO protocol provides not the TO service but the LO service.

5.4 LO protocol

The LO protocol [25, 26] provides the LO service by using the MC service.

A. Data transmission

Each $E_i$ has $n$ receipt sublogs $RL_{i1}, \ldots, RL_{in}$. PDUs from each $E_j$ are stored in $RL_{ij}$ ($j = 1, \ldots, n$). $RL_{ij}$ consists of three sublogs $RRL_{ij}$, $PRL_{ij}$, and $ARL_{ij}$. The LO protocol adopts the same three-phase procedure as the TO protocol.

B. Failure detection and recovery

PDU loss can be detected by checking FC1 and FC2. The lost PDUs are retransmitted by using selective retransmission. If $E_i$ detects PDU loss from $E_j$, $E_i$ requests $E_j$ to retransmit the lost PDUs. The retransmitted PDUs from $E_j$ are inserted into $RRL_{ij}$ in ascending order of $TSEQ$. Even if the 1C service is used, the LO protocol does not provide the TO service since selective retransmission is used.

5.5 SLO protocol

The SLO protocol [17, 18] provides the SLO service on the MC service. The data transmission procedure is the same as that of the LO protocol. AC1 and AC2 are used as the acceptance condition.

5.6 PriO and PriTO protocols

The PriO [19] and PriTO [19, 20] protocols provide the PriO and PriTO services by using the 1C service, respectively.

(1) Acceptance: On receipt of $p$ from $E_j$, if $p$ satisfies AC1, $E_i$ inserts $p$ between $q_1$ and $q_2$ in $PRL_i$ where $q_1.PRI \geq p.PRI > q_2.PRI$. $E_i$ creates a pseudo PDU $p^*$ and appends $p^*$ to the tail of $RRL_i$. $p^*$ is the same as $p$ except that $p^*$ has no data. For example, $PRL_i = \langle b[4]\ a[3]\ d[3]\ c[1]]$ and $RRL_i = \langle a^*\ b^*\ c^*\ d^*]$ are obtained by inserting $d[3]$ into $PRL_i = \langle b[4]\ a[3]\ c[1]]$ and $RRL_i = \langle a^*\ b^*\ c^*]$. The sequences of real PDUs and pseudo PDUs denote the priority-based order and the receipt order of the PDUs, respectively.
(2) Pre-acknowledgment: If $p^* = top(RRL_i)$ satisfies PC, $p^*$ is moved from $RRL_i$ to the tail of $PRL_i$.
(3) Acknowledgment: If $p = top(PRL_i)$ satisfies AC, $p$ is moved to $ARL_i$ and $p^*$ is deleted.

When a lost PDU $g$ is detected in the PriTO protocol, all the PDUs following $g$ in the sequence of pseudo PDUs are removed from $RL_i$ and are retransmitted by go-back-n and RST. Selective retransmission is adopted in the PriO protocol because the same-priority PDUs do not need to be totally ordered.

In the priority-based service, a problem is starvation, i.e. lower-priority PDUs may stay indefinitely in the receipt log. One solution is to partition the receipt log into runs [20]. A run is a priority-based ordered subsequence of PDUs. When starvation is detected, the current run is ended and a new run is started. The run partition is synchronized among the entities to provide the PriTO service.

5.7 CO protocol

The CO protocol [21] provides the CO service by using the MC service. The CO protocol uses the same procedure as the TO protocol except that selective retransmission is adopted and pre-acknowledged PDUs are ordered in the causality-precedence relation $\prec$. If a PDU $p$ is pre-acknowledged in $E_i$, $p$ is inserted into $PRL_i$ so that the receipt log is causality-preserved. In the CO protocol, $p \prec q$ if the following CO rule holds. Here, $p.SRC = E_j$.

CO1. $p.SEQ < q.SEQ$ if $p.SRC = q.SRC$.
CO2. $p.SEQ < q.ACK_j$ if $p.SRC \neq q.SRC$.

The CO rule is simpler than that of ISIS [3]. The CO protocol can not only order the PDUs in $\prec$ but also find the lost PDUs since the CO rule uses the sequence numbers.
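The CO rule lends itself to a direct sketch. The code below assumes that a PDU is a dictionary with the fields `src` (0-based sender index), `seq`, and `ack`, where `ack[k]` is the sequence number the sender expects next from $E_k$. The function `insert_causally` is one possible insertion strategy chosen for this sketch (place the PDU before the first PDU it causality-precedes), not necessarily the one used in [21]; the sample PDUs follow the $g$, $p$, $q$ example of Figure 2.

```python
def causally_precedes(p, q):
    """CO rule: decide p < q (causality-precedence) from sequence numbers."""
    if p["src"] == q["src"]:
        return p["seq"] < q["seq"]              # CO1
    return p["seq"] < q["ack"][p["src"]]        # CO2

def insert_causally(prl, p):
    """Insert p before the first PDU that p causality-precedes (assumed strategy)."""
    pos = len(prl)
    for idx, q in enumerate(prl):
        if causally_precedes(p, q):
            pos = idx
            break
    prl.insert(pos, p)

# g and p are sent by E0; q is sent by E1 after it received g and p
g = {"name": "g", "src": 0, "seq": 0, "ack": [0, 0, 0]}
p = {"name": "p", "src": 0, "seq": 1, "ack": [1, 0, 0]}
q = {"name": "q", "src": 1, "seq": 0, "ack": [2, 0, 0]}

def order_after(arrivals):
    """Names in the log after the given PDUs are inserted in arrival order."""
    prl = []
    for pdu in arrivals:
        insert_causally(prl, pdu)
    return [pdu["name"] for pdu in prl]
```

Even if `q` arrives before `p`, as on the dotted line of Figure 2, `order_after([g, q, p])` places `p` ahead of `q`.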

6 Evaluation

The group communication protocols are characterized by the following aspects [Table 2]:

• System service: Each protocol provides some kind of system service defined in section 3.
• Network service: 1C, MC, or reliable one.
• Control scheme: centralized, decentralized, and distributed ones.
• Destination: selective and non-selective.
• Communication mode: synchronous and asynchronous ones. In the synchronous mode, no entity sends a PDU until the PDU sent before is atomically received. In the asynchronous mode, each entity may send PDUs without waiting for the atomic receipt of previous PDUs.
• Retransmission: go-back-n and selective schemes.
• Performance: The performance is measured in terms of the number of PDUs transmitted and the delay time of PDUs among application entities. The parameters are the number $n$ of entities in the group (the number $m$ ($\leq n$) of destinations in the selective broadcast protocols) and the propagation delay time $T$ among entities.

[13] presents a centralized protocol (referred to as KTHB) which provides the TO service by using the 1C network. The data transmission is composed of two phases: (1) the source entity sends a PDU $p$ to the master entity named a sequencer and (2) the sequencer broadcasts $p$ to the receivers. When a PDU loss is detected, selective retransmission is used. The sequencer supports synchronous communication.

[8] presents a decentralized protocol (referred to as GS) based upon tree-structured routing. Each node shows an entity and each path denotes a route of a PDU to the destinations. Here, $h$ denotes the height of the tree, and the PDU count of GS in Table 2 is $n$ plus the number of entities in the path which are not destinations. If an entity receives $p$ from the source or the parent, $p$ is relayed to its children until all the destinations receive $p$. The parent decides on the atomic receipt among the children. The lost PDUs are selectively retransmitted.

ISIS [3] supports the CBCAST and ABCAST protocols (referred to as BSS), which provide the CO and TO services, respectively. In these protocols, each entity has a virtual clock to causally order the received PDUs. In ABCAST, a decentralized procedure like two-phase commitment is used. Since the LO service is used as the underlying service, there is no PDU loss.

Delta-4 [22] supports an atomic multicast protocol (AMp). AMp provides multiple qualities of service (QOS) including the CO and TO services. It adopts the decentralized control. AMp and ISIS discuss how to tolerate stop-failure of the entities.

We have implemented the protocols discussed in this paper as processes of SunOS 4.1 (SunOS is a trademark of Sun Microsystems, Inc.) on Sparc2 workstations interconnected by Ethernets. Each workstation has only one protocol entity. The program size is about 5K steps in the C language, and the size of the executable object code is about 50K bytes. Figure 8 illustrates the average delay time of PDUs for the number $n$ of system entities. AP-AP-delay shows the time from a DT request submission of an application entity until the receipt by all the destinations. SP-delay means how long it takes each PDU $p$ to be acknowledged after $p$ is accepted. Ethernet-delay shows how long it takes to transmit $p$ by using the Ethernet MAC service. According to Figure 8, the delay time is $O(n)$.

[Figure 8: Delay time versus the number of entities. Horizontal axis: number of entities ($n$), 3 to 8. Vertical axis: delay time [msec], 0 to 600. Curves: AP-AP-delay, SP-delay, Ethernet-delay.]

7 Concluding Remarks

In this paper, we have presented a formal model of the group communication service from the data transmission point of view, assuming no entity failure. The service is modeled as a set of logs which denote the sending and receipt sequences of PDUs in each entity. We have defined various kinds of group communication services based on this model. We have also shown group communication protocols which provide atomic and well-defined ordered delivery of the PDUs.

Table 2: Group communication protocols

protocol  system service  network service  cntl.   dst.     mode    recov.   PDU     delay     notes
GS        TO              1-to-1/MC        decnt.  group    async.  select.  n + *   (h + 1)T  tree
BSS       TO/CO           1-to-1/OP        decnt.  group    sync.   -        3n      3T        2 phase
KTHB      TO              broadcast/1C     cnt.    group    sync.   select.  n + 2   2T        2 phase
AMp       TO/CO           broadcast/MC     decnt.  group    a./s.   ?        n + 2   3T        2 phase
LO        OP              broadcast/MC     dist.   group    async.  select.  2n + 1  3T        3 phase
TO        TO/CO           broadcast/1C     dist.   group    async.  go-back  2n + 1  3T        3 phase
CO        CO              broadcast/MC     dist.   group    async.  select.  2n + 1  3T        3 phase
SLO       SLO             broadcast/MC     dist.   select.  async.  select.  2m + 1  3T        3 phase
STO       STO             broadcast/1C     dist.   select.  async.  select.  2m + 1  3T        3 phase
PriO      PriO            broadcast/1C     dist.   group    async.  select.  2n + 1  3T        3 phase
PriTO     PriTO           broadcast/1C     dist.   group    async.  go-back  2n + 1  3T        3 phase

* number of entities in the path which are not destinations.

References

[1] Abeysundara, B. W. and Kamal, A. E., "High-Speed Local Area Networks and Their Performance: A Survey," ACM Computing Surveys, Vol.23, No.2, 1991, pp.221-264.
[2] Amir, Y., Dolev, D., Kramer, S., and Malki, D., "Transis: A Communication Sub-System for High Availability," Proc. of IEEE 22nd Annual Int'l Symposium on Fault-Tolerant Computing, 1993, pp.76-84.
[3] Birman, K., Schiper, A., and Stephenson, P., "Lightweight Causal and Atomic Group Multicast," ACM Trans. Computer Systems, Vol.9, No.3, 1991, pp.272-314.
[4] Chang, J. M. and Maxemchuk, N. F., "Reliable Broadcast Protocols," ACM Trans. Computer Systems, Vol.2, No.3, 1984, pp.251-273.
[5] Defense Communications Agency, "DDN Protocol Handbook," Vol.1-3, NIC 50004-50005, 1985.
[6] Ellis, C. A., Gibbs, S. J., and Rein, G. L., "Groupware," Comm. ACM, Vol.34, No.1, 1991, pp.38-58.
[7] Garcia-Molina, H. and Spauster, A., "Message Ordering in a Multicast Environment," Proc. of IEEE ICDCS-9, 1989, pp.354-361.
[8] Garcia-Molina, H. and Spauster, A., "Ordered and Reliable Multicast Communication," ACM Trans. Computer Systems, Vol.9, No.3, 1991, pp.242-271.
[9] Gray, J., "Notes on Database Operating Systems," Operating Systems: An Advanced Course (Bayer, R. ed.), Springer-Verlag, 1978.
[10] IEEE, "Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specification," ANSI/IEEE Standard 802.3, IEEE, 1985.
[11] ISO, "OSI - Basic Reference Model," ISO 7498, 1984.
[12] Jain, R., "FDDI: Current Issues and Future Plans," IEEE Communication Magazine, Vol.31, No.9, Sep. 1993, pp.98-105.
[13] Kaashoek, M. F., Tanenbaum, A. S., Hummel, S. F., and Bal, H. E., "An Efficient Reliable Broadcast Protocol," ACM Operating Systems Review, Vol.23, No.4, 1989, pp.5-19.
[14] Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, Vol.21, No.7, 1978, pp.558-565.
[15] Luan, S. W. and Gligor, V. D., "A Fault-Tolerant Protocol for Atomic Broadcast," IEEE Trans. Parallel and Distributed Systems, Vol.1, No.3, 1990, pp.271-285.
[16] Mattern, F., "Virtual Time and Global States of Distributed Systems," Parallel and Distributed Algorithms (Cosnard, M. and Quinton, P. eds.), North-Holland, Amsterdam, The Netherlands, 1989, pp.215-226.
[17] Nakamura, A. and Takizawa, M., "Reliable Broadcast Protocol for Selectively Ordering PDUs," Proc. of IEEE ICDCS-11, 1991, pp.239-246.
[18] Nakamura, A. and Takizawa, M., "Design of Reliable Broadcast Communication Protocol for Selectively Partially Ordered PDUs," Proc. of IEEE COMPSAC'91, 1991, pp.673-679.
[19] Nakamura, A. and Takizawa, M., "Priority-Based Total and Semi-Total Ordering Broadcast Protocols," Proc. of IEEE ICDCS-12, 1992, pp.178-185.
[20] Nakamura, A. and Takizawa, M., "Starvation-Free Priority-Based Total Ordering Broadcast Protocol on High-Speed Single Channel Network," Proc. of 2nd Int'l Symp. on High Performance Distributed Computing (HPDC-2), 1993, pp.281-288.
[21] Nakamura, A. and Takizawa, M., "Causally Ordering Broadcast Protocol," to appear in Proc. of IEEE ICDCS-14, 1994.
[22] Powell, D., Chereque, M., and Drackley, D., "Fault-Tolerance in Delta-4," ACM Operating Systems Review, Vol.25, No.2, 1991, pp.122-125.
[23] Schneider, F. B., Gries, D., and Schlichting, R. D., "Fault-Tolerant Broadcasts," Science of Computer Programming, Vol.4, No.1, 1984, pp.1-15.
[24] Takizawa, M., "Cluster Control Protocol for Highly Reliable Broadcast Communication," Proc. of IFIP Conf. on Distributed Processing, 1987, pp.431-445.
[25] Takizawa, M. and Nakamura, A., "Partially Ordering Broadcast (PO) Protocol," Proc. of IEEE INFOCOM'90, 1990, pp.357-364.
[26] Takizawa, M. and Nakamura, A., "Reliable Broadcast Communication," Proc. of IPSJ Int'l Conf. on Information Technology (InfoJapan), 1990, pp.325-332.
[27] Tanenbaum, A. S., "Computer Networks (2nd ed.)," Englewood Cliffs, NJ: Prentice-Hall, 1989.
[28] Verissimo, P., Rodrigues, L., and Baptista, M., "AMp: A Highly Parallel Atomic Multicast Protocol," Proc. of the ACM SIGCOMM'89, 1989, pp.83-93.