Group Communication Protocols: Properties and Evaluation
Akihito NAKAMURA, Takayuki TACHIKAWA, and Makoto TAKIZAWA
Dept. of Computers and Systems Engineering Tokyo Denki University Ishizaka, Hatoyama, Hiki-gun, Saitama 350-03, JAPAN e-mail
fnaka,tachi,
[email protected]
Abstract
In distributed systems, group communication among multiple entities is required in addition to the conventional one-to-one communication. Group communication protocols provide multiple entities with a reliable data transmission service, i.e. messages are delivered to all the destination entities in the group. It is also important to guarantee that every application entity can receive messages in a well-defined order in the presence of multiple entities sending messages. This paper discusses logical properties of group communication. We present communication protocols which provide various kinds of group communication services and evaluate the protocols.
1 Introduction
In distributed applications like groupware [6], group communication among multiple entities is required in addition to conventional one-to-one communication like TCP/IP [5]. Group communication protocols support a many-to-many communication service for a group of entities. These protocols provide the group with atomic and ordered delivery of messages. Here, a unit of message exchanged among the entities is named a protocol data unit (PDU).

[23] presents a reliable broadcast protocol which uses one-to-one communication. [15] presents a broadcast protocol which provides the totally ordered receipt of PDUs based on majority-consensus decision. [7] characterizes message ordering properties in reliable broadcast protocols using the one-to-one network. [24] presents a cluster concept, which is an extension of the connection [11] to multiple service access points (SAPs), and shows how to establish the cluster by using the Ethernet [10]. Here, a cluster is considered to be a group of entities.

In group communication, an important problem is which entity coordinates the cooperation of multiple entities in the cluster. Most approaches [4, 13, 23] adopt the centralized control scheme where one master entity decides on the atomic and ordered receipt in the group. Here, the entities have to block until the decision of the master entity is delivered. In this paper, we adopt the distributed control scheme where every entity makes the decision by itself. [24, 26] present a TO (total ordering broadcast) protocol on the Ethernet, by which every application entity can receive the PDUs in the same order without PDU loss. [25, 26] present an OP (order-preserving broadcast) protocol by which all the entities receive PDUs in the sending order but may not receive the PDUs in the same order. In these protocols, every PDU is broadcast to all the entities in the cluster. [17, 18] present an SP (selectively order-preserving broadcast) protocol which provides the selective broadcast to a subset of the group.

The group communication protocols have to tolerate communication and entity failures. In this paper, we present a formal model of the data transmission aspect of group communication. We define various kinds of broadcast communication services required to implement distributed applications, and present group communication protocols supporting the services by using high-speed networks [1]. While bit error rates in the high-speed networks are low (about $10^{-9}$), entities may fail to receive PDUs due to buffer overrun because the processing speed of the entity is slower than the transmission speed [1]. Hence, PDU loss, i.e. omission fault, is the major failure in the high-speed network.

In section 2, we show a model of the group communication service. In section 3, we define the group communication services. In section 4, we discuss the distributed atomic receipt concept. In section 5, we present various distributed group communication protocols. The protocols are evaluated in section 6.
2 Model
In this section, we present a model of the communication service for multiple entities.
2.1 System model
A communication system is composed of application, system, and network layers as shown in Figure 1. Each application entity $A_i$ takes communication service through a service access point (SAP) $S_i$ supported by a system entity $E_i$ ($i = 1,\ldots,n$). A cluster $C$ [24] is a set of $n$ ($\geq 2$) system SAPs, i.e. $\{S_1,\ldots,S_n\}$. $E_1,\ldots,E_n$ cooperate with each other to support some broadcast service for $C$ by using the underlying network service. The cooperation of the system entities is coordinated by a system protocol. Here, $C$ is referred to as supported by $E_1,\ldots,E_n$ (written as $C = \langle E_1,\ldots,E_n \rangle$), and as supporting $A_1,\ldots,A_n$. A data unit, i.e. a message, exchanged among entities at the same layer is a protocol data unit (PDU).
Figure 1: System model (application entities $A_1,\ldots,A_n$ in the application layer use the system SAPs $S_1,\ldots,S_n$ of the system entities $E_1,\ldots,E_n$ in the system layer, which use the network SAPs $N_1,\ldots,N_n$ of the high-speed network in the network layer)
2.2 Ordered delivery
Let $T_i$ be entities which use the broadcast communication service ($i = 1,\ldots,n$). $T_i$ means either a system or an application entity. Each $T_i$ is modeled as a sequence of sending and receipt events. Let $s_i[p]$ and $r_i[p]$ denote the sending and receipt events of a PDU $p$ in $T_i$, respectively. A happened-before relation [14] $\rightarrow$ on the events is defined as follows.

[Definition] For every pair of events $e_1$ and $e_2$, $e_1 \rightarrow e_2$ ($e_1$ happens before $e_2$) iff (1) $e_1$ occurs before $e_2$ in $T_i$, (2) for some (not necessarily different) $T_i$ and $T_j$, there exists some PDU $p$ such that $e_1 = s_i[p]$ and $e_2 = r_j[p]$, or (3) for some event $e_3$, $e_1 \rightarrow e_3$ and $e_3 \rightarrow e_2$. $\Box$

We model a broadcast communication service as a set of logs. A log $L$ is a sequence of PDUs $\langle p_1 \ldots p_m ]$, where $p_1$ and $p_m$ are the top and the last PDUs, denoted by $top(L)$ and $last(L)$, respectively. $p_h$ precedes $p_k$ in $L$ ($p_h \rightarrow_L p_k$) if $h < k$. Each $T_i$ has a sending log $SL_i$ and a receipt log $RL_i$, which are the sequences of PDUs sent and received by $T_i$, respectively ($i = 1,\ldots,n$). If $T_i$ receives $q$ after $p$, $p \rightarrow_{RL_i} q$. If $T_i$ sends $q$ after $p$, $p \rightarrow_{SL_i} q$. $SL_{ij}$ and $RL_{ij}$ are the sublogs of $SL_i$ and $RL_i$ which include the PDUs destined to $T_j$ and the PDUs received from $T_j$, respectively ($j = 1,\ldots,n$).

$RL_i$ is local-order-preserved iff for every $T_j$, $p \rightarrow_{RL_i} q$ if $p \rightarrow_{SL_j} q$.
$RL_i$ is information-preserved iff $RL_i$ includes all the PDUs in $SL_1,\ldots,SL_n$.
$RL_i$ and $RL_j$ are order-equivalent iff for every pair of $p$ and $q$ included in both $RL_i$ and $RL_j$, $p \rightarrow_{RL_i} q$ iff $p \rightarrow_{RL_j} q$.
$RL_i$ and $RL_j$ are information-equivalent iff they include the same PDUs.

If $RL_i$ is local-order-preserved, $T_i$ receives PDUs from each entity in the sending order. $T_i$ receives all the PDUs sent in $C$ if $RL_i$ is information-preserved. If $RL_i$ and $RL_j$ are order-equivalent, $T_i$ and $T_j$ receive every two PDUs in the same order whenever both receive them. $T_i$ and $T_j$ receive the same PDUs if $RL_i$ and $RL_j$ are information-equivalent.
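The log properties can be checked mechanically. The following is a minimal Python sketch, not part of the protocols themselves: logs are represented as lists of PDU names, and the function names are hypothetical. The sample logs are the ones used in the LO and TO examples of Figure 3.

```python
from itertools import combinations

def position(log):
    """Map each PDU in a log to its index in the log."""
    return {pdu: idx for idx, pdu in enumerate(log)}

def local_order_preserved(rl, sending_logs):
    """True iff rl keeps the sending order of every sender."""
    pos = position(rl)
    for sl in sending_logs:
        received = [p for p in sl if p in pos]
        # combinations() yields (p, q) with p sent before q
        if any(pos[p] > pos[q] for p, q in combinations(received, 2)):
            return False
    return True

def information_preserved(rl, sending_logs):
    """True iff rl includes every PDU of every sending log."""
    return all(p in rl for sl in sending_logs for p in sl)

def order_equivalent(rl_a, rl_b):
    """True iff the common PDUs appear in the same relative order."""
    pa, pb = position(rl_a), position(rl_b)
    common = [p for p in rl_a if p in pb]
    return all((pa[p] < pa[q]) == (pb[p] < pb[q])
               for p, q in combinations(common, 2))

def information_equivalent(rl_a, rl_b):
    """True iff both logs include the same PDUs."""
    return set(rl_a) == set(rl_b)

# Sending logs and receipt logs of the LO example.
SLS = [list("abcd"), list("pq"), list("xyz")]
LO_LOGS = [list("axbcpydzq"), list("pxabyczdq"), list("axpybqczd")]
# In the TO example every entity has the same receipt log.
TO_LOGS = [list("axbcpydzq")] * 3
```

For instance, `order_equivalent(LO_LOGS[0], LO_LOGS[1])` is false because $a$ and $p$ are received in different orders, while every log in `LO_LOGS` is local-order-preserved.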
A service where each entity $T_i$ can send PDUs to a subset of the cluster rather than to all the entities is named a selective broadcast communication [17, 18]. $RL_i$ is selectively information-preserved iff $RL_i$ includes all the PDUs in $SL_{1i},\ldots,SL_{ni}$, i.e. $T_i$ receives all and only the PDUs destined to $T_i$.

There are applications where higher-priority PDUs have to be delivered earlier than lower-priority PDUs. For example, PDUs which carry real-time data like voice have higher priorities than those of file transfer. Let $p.PRI$ denote the priority of $p$. If $p$ has a higher priority than $q$, $p.PRI > q.PRI$. $RL_i$ is priority-based ordered iff for every pair of $p$ and $q$ in $RL_i$, (1) if $p.PRI > q.PRI$, then $p \rightarrow_{RL_i} q$, and (2) if $p.PRI = q.PRI$, $p$ and $q$ are sent by $E_j$, and $s_j[p] \rightarrow s_j[q]$, then $p \rightarrow_{RL_i} q$.

To incorporate application semantics into the communication system, it is important to consider the cause-effect relationship among the events. If an application entity $A_k$ sends $q$ after receiving $p$, all the common destinations of $p$ and $q$ have to receive $p$ before $q$. The happened-before relation $\rightarrow$ reflects this cause-effect relationship. The causal order [3] among PDUs is defined as follows.

[Definition] For every pair of $p$ and $q$ sent by $T_i$ and $T_j$, respectively, $p$ causality-precedes $q$ ($p \prec q$) iff $s_i[p] \rightarrow s_j[q]$. $\Box$

The relation $\prec$ is transitive but not symmetric. $RL_i$ is causality-preserved iff for every pair of $p$ and $q$ in $RL_i$, $p \rightarrow_{RL_i} q$ if $p \prec q$. It is straightforward that $RL_i$ is local-order-preserved if it is causality-preserved. In Figure 2, $RL_3 = \langle g\ p\ q ]$ is causality-preserved: $g \prec p \prec q$, i.e. $s_1[g] \rightarrow s_1[p] \rightarrow s_2[q]$, and $r_3[g] \rightarrow r_3[p] \rightarrow r_3[q]$. If $A_3$ receives $q$ before $p$, as shown by the dotted line, $RL_3' = \langle g\ q\ p ]$ is not causality-preserved (but it is local-order-preserved).

Figure 2: Causality-preserved receipt (time diagram: $A_1$ sends $g$ and then $p$; $A_2$ sends $q$ after receiving $p$; $A_3$ obtains $RL_3 = \langle g\ p\ q ]$, or $RL_3' = \langle g\ q\ p ]$ if $q$ arrives along the dotted line)
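The example of Figure 2 can be reproduced by computing the happened-before relation directly from its three rules. The sketch below is illustrative only: the event histories are an assumed encoding of Figure 2 (entity numbers 1, 2, 3 stand for $A_1$, $A_2$, $A_3$), and all function names are hypothetical.

```python
def happened_before(histories):
    """Return the set of event pairs (e1, e2) with e1 -> e2.

    histories maps an entity to its local event sequence; an event is
    (kind, pdu) with kind 's' (sending) or 'r' (receipt)."""
    events = [(ent, kind, pdu)
              for ent, h in histories.items() for kind, pdu in h]
    edges = {e: set() for e in events}
    for ent, h in histories.items():          # rule (1): local order
        for (k1, p1), (k2, p2) in zip(h, h[1:]):
            edges[(ent, k1, p1)].add((ent, k2, p2))
    for e in events:                          # rule (2): send -> receipt
        if e[1] == 's':
            for f in events:
                if f[1] == 'r' and f[2] == e[2]:
                    edges[e].add(f)
    closure = set()                           # rule (3): transitivity
    for e in events:
        stack, seen = list(edges[e]), set()
        while stack:
            f = stack.pop()
            if f not in seen:
                seen.add(f)
                stack.extend(edges[f])
        closure |= {(e, f) for f in seen}
    return closure

def causally_precedes(hb, sender, p, q):
    """p causality-precedes q iff s[p] -> s[q]."""
    return ((sender[p], 's', p), (sender[q], 's', q)) in hb

def causality_preserved(rl, hb, sender):
    """No PDU in rl is preceded by a PDU that it causally precedes."""
    return all(not causally_precedes(hb, sender, later, earlier)
               for i, earlier in enumerate(rl) for later in rl[i + 1:])

# Assumed encoding of Figure 2.
HISTORIES = {
    1: [('s', 'g'), ('s', 'p')],
    2: [('r', 'g'), ('r', 'p'), ('s', 'q')],
    3: [('r', 'g'), ('r', 'p'), ('r', 'q')],
}
SENDER = {'g': 1, 'p': 1, 'q': 2}
HB = happened_before(HISTORIES)
```

With this encoding, `causality_preserved(['g', 'p', 'q'], HB, SENDER)` holds, whereas the log `['g', 'q', 'p']` violates $p \prec q$ although it still keeps the sending order of each sender.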
3 Group Communication Services
3.1 System services
We consider services supported by the system layer.
A. Sender-based ordering services
Locally Ordering Broadcast (LO) service: Every receipt log $RL_i$ is information- and local-order-preserved.
Totally Ordering Broadcast (TO) service: Every $RL_i$ is information-preserved, local-order-preserved, and order-equivalent.

In the LO service, every entity receives PDUs from each entity in the sending order. For example, in Figure 3 (a), every entity receives $q$ after $p$ from $A_2$. In the TO service, every entity receives all the PDUs in the same order. For example, $RL_i = \langle a\ x\ b\ c\ p\ y\ d\ z\ q ]$ in Figure 3 (b) ($i = 1, 2, 3$).

(a) LO service
RL1: < a x b c p y d z q ]
RL2: < p x a b y c z d q ]
RL3: < a x p y b q c z d ]
(b) TO service
RL1: < a x b c p y d z q ]
RL2: < a x b c p y d z q ]
RL3: < a x b c p y d z q ]
Sending logs
SL1: < a b c d ]
SL2: < p q ]
SL3: < x y z ]
Figure 3: Sender-based ordering services

B. Priority-based ordering services
Priority-Based Ordering Broadcast (PriO) service: Every $RL_i$ is priority-based ordered and information-preserved.
Priority-Based Totally Ordering Broadcast (PriTO) service: Every $RL_i$ is priority-based ordered, information-preserved, and order-equivalent.

Let $p[r]$ denote that $p.PRI = r$. Figure 4 shows an example of the PriO service. In the PriTO service, the PDUs with the same priority are received in the same order, e.g. $RL_i = \langle c\ b\ y\ p\ a\ x\ q\ d\ z ]$ for the PDUs in Figure 4 ($i = 1, 2, 3$).

RL1: < c[3] b[2] y[2] p[2] a[1] x[1] q[1] d[1] z[1] ]
RL2: < c[3] y[2] b[2] p[2] x[1] q[1] z[1] a[1] d[1] ]
RL3: < c[3] p[2] b[2] y[2] q[1] x[1] z[1] a[1] d[1] ]
SL1: < a[1] b[2] c[3] d[1] ]
SL2: < p[2] q[1] ]
SL3: < x[1] y[2] z[1] ]
Figure 4: PriO service

C. Causality-based ordering service
Causally Ordering Broadcast (CO) service: Every $RL_i$ is information- and causality-preserved.

In the CO service, if $s_j[p] \rightarrow s_k[q]$ for $p$ and $q$, $r_i[p] \rightarrow r_i[q]$ in every $E_i$.

D. Selective broadcast services
The LO and TO services are suitable for applications where every application entity executes the same program. On the other hand, in applications where several different programs are executed by one or more entities, each application entity needs to send each PDU to a subset of $C$ rather than to all the entities in $C$.

Selectively Locally Ordering Broadcast (SLO) service: Every $RL_i$ is local-order- and selectively information-preserved.
Selectively Totally Ordering Broadcast (STO) service: Every $RL_i$ is local-order-preserved, selectively information-preserved, and order-equivalent.
Selectively Causally Ordering Broadcast (SCO) service: Every $RL_i$ is selectively information- and causality-preserved.

Let $p.DST$ be a set $\{A_{d_1},\ldots,A_{d_m}\}$ of destination application entities of a PDU $p$, i.e. $p.DST \subseteq C$. Here, $p$ can be written as $p_{\{d_1,\ldots,d_m\}}$. In the SLO service, $p \rightarrow_{RL_{ij}} q$ in every $E_i \in p.DST \cap q.DST$ if $p \rightarrow_{SL_j} q$. In Figure 5 (a), every $RL_i$ is local-order- and selectively information-preserved, e.g. $A_1$ receives $y$ after $x$ from $A_3$. In the STO service, every common destination of PDUs receives the PDUs in the same order. In Figure 5 (b), $a$ and $b$ are received by $E_2$ and $E_3$ in the same order, i.e. $a \rightarrow_{RL_i} b$ ($i = 2, 3$).

(a) SLO service
RL1: < x c p y d q ]
RL2: < p x a d y ]
RL3: < a p y b c z ]
(b) STO service
RL1: < x c p y d q ]
RL2: < a x p y d ]
RL3: < a b c p y z ]
Sending logs (destinations in braces)
SL1: < a{2,3} b{3} c{1,3} d{1,2} ]
SL2: < p{1,2,3} q{1} ]
SL3: < x{1,2} y{1,2,3} z{3} ]
Figure 5: SBC services
3.2 Network services
Next, we define the services provided by the underlying network layer.
One-Channel (1C) service: Every $RL_i$ is local-order-preserved and order-equivalent.
Multi-Channel (MC) service: Every $RL_i$ is local-order-preserved.

In the 1C service, PDUs are delivered to entities in the same order, but some PDUs may be lost. The 1C service is a model of a high-speed channel [1, 12]. In the MC service, every entity receives PDUs from each entity in the sending order, but some entity may fail to receive some PDUs. The MC service can be provided by a system where computers are fully connected by logical or physical point-to-point links. The services defined in this paper are summarized in Table 1.
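As an illustration of the priority-based services of Section 3.1 B, the following Python sketch checks the two conditions of priority-based ordering on the logs of Figure 4. The representation (lists of PDU names plus a priority table) and the function names are assumptions made for this example only.

```python
from itertools import combinations

def priority_based_ordered(rl, pri, sending_logs):
    """Check conditions (1) and (2) of priority-based ordering."""
    pos = {p: i for i, p in enumerate(rl)}
    # (1) priorities never increase along the log, so a higher-priority
    #     PDU always precedes a lower-priority one
    if any(pri[p] < pri[q] for p, q in zip(rl, rl[1:])):
        return False
    # (2) equal-priority PDUs of one sender keep their sending order
    for sl in sending_logs:
        for p, q in combinations(sl, 2):      # p sent before q
            if (p in pos and q in pos and pri[p] == pri[q]
                    and pos[p] > pos[q]):
                return False
    return True

def order_equivalent(rl_a, rl_b):
    """True iff the common PDUs appear in the same relative order."""
    pa = {p: i for i, p in enumerate(rl_a)}
    pb = {p: i for i, p in enumerate(rl_b)}
    common = [p for p in rl_a if p in pb]
    return all((pa[p] < pa[q]) == (pb[p] < pb[q])
               for p, q in combinations(common, 2))

# Priorities and logs of Figure 4.
PRI = {'a': 1, 'b': 2, 'c': 3, 'd': 1, 'p': 2, 'q': 1,
       'x': 1, 'y': 2, 'z': 1}
SLS = [list("abcd"), list("pq"), list("xyz")]
RLS = [list("cbypaxqdz"), list("cybpxqzad"), list("cpbyqxzad")]
```

Every log in `RLS` is priority-based ordered, which is what the PriO service requires, but `RLS[0]` and `RLS[1]` are not order-equivalent ($b$ and $y$ are swapped), so these logs do not satisfy the PriTO service.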
Table 1: Group communication services
Columns (services): LO, TO, CO, SLO, STO, SCO, PriO, PriTO, MC, 1C.
Rows (properties): i-preserved, i-equivalent, lo-preserved, o-equivalent, c-preserved, p-ordered.
An entry marked * denotes selectively information-preserved. i = information, lo = local-order, o = order, c = causality, p = priority.

4 Atomic Receipt Concept
In this section, we consider how PDUs are atomically received in a cluster $C = \langle E_1,\ldots,E_n \rangle$. There are three approaches to deciding on the atomic receipt, i.e. centralized, decentralized, and distributed ones. In the centralized approach, one master controller decides on it based on the two-phase commitment protocol [9]. In the decentralized one, the sender of each PDU is the controller of the PDU. In this paper, we discuss the distributed control [24] where each entity decides on the atomic receipt.

Every PDU from each $E_j$ carries the receipt confirmation of the PDUs which $E_j$ has already received. On receipt of $q$ from $E_j$, $E_i$ knows that $E_j$ has received every $p$ where $r_j[p] \rightarrow s_j[q]$. Here, let $p$ and $q$ be PDUs sent by $E_k$ and $E_j$, respectively. $q$ is said to pre-acknowledge $p$ for $E_j$ in $E_i$ ($p \Rightarrow_{ji} q$) iff $s_k[p] \rightarrow r_i[p]$ and $s_k[p] \rightarrow r_j[p] \rightarrow s_j[q] \rightarrow r_i[q]$. In Figure 6, $E_2$ sends $c$ after receiving $a$. On receipt of $c$, $E_3$ knows that $E_2$ has received $a$. Here, $s_1[a] \rightarrow r_2[a] \rightarrow s_2[c] \rightarrow r_3[c]$, i.e. $c$ pre-acknowledges $a$ for $E_2$ in $E_3$ ($a \Rightarrow_{23} c$).

There are three criteria levels [24] of $E_i$'s atomic receipt for every PDU $p$ in $C$ in the distributed way:
(1) Acceptance: $E_i$ receives $p$.
(2) Pre-acknowledgment: $E_i$ knows that every destination of $p$ has accepted $p$. That is, for every $E_j$, there exists $q$ such that $p \Rightarrow_{ji} q$.
(3) Acknowledgment: $E_i$ knows that $p$ has been pre-acknowledged by every destination of $p$. That is, for every $E_j$ and $E_h$, and every $q$ where $p \Rightarrow_{hj} q$, there exists $g$ such that $q \Rightarrow_{ji} g$.

The acknowledgment of $p$ by $E_i$ means that $E_i$ knows that every destination of $p$ has known that every destination had accepted $p$. Even if $p$ is pre-acknowledged in $E_i$, $E_i$ cannot decide if $p$ is accepted by all the destinations in $C$. If $p$ is acknowledged in $E_i$, $E_i$ knows that $p$ is pre-acknowledged by every destination, and every other $E_k$ knows that $p$ is at least accepted by every destination. That is, $E_i$ considers that $p$ is atomically received by every destination. In Figure 6, suppose that every $E_i$ sends $g_i$ after receiving $b$, $c$, $d$, and $e$. If $E_4$ accepts $g_i$ from every $E_i$, $a$ is acknowledged by $E_4$.
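The pre-acknowledgment relation can be evaluated on a recorded trace of events. The sketch below is an illustration under stated assumptions rather than part of the protocol: a trace is a time-ordered list of `(kind, entity, pdu)` events, each entity sends or receives a given PDU at most once, a sender is assumed to appear in the trace as a receiver of its own broadcast, and the function names are hypothetical.

```python
def pre_acknowledges(trace, p, q, j, i):
    """True iff q pre-acknowledges p for E_j in E_i, i.e.
    r_j[p] -> s_j[q] -> r_i[q] and E_i has also received p."""
    def when(kind, ent, pdu):
        event = (kind, ent, pdu)
        return trace.index(event) if event in trace else None

    r_jp, s_jq = when('r', j, p), when('s', j, q)
    r_ip, r_iq = when('r', i, p), when('r', i, q)
    if None in (r_jp, s_jq, r_ip, r_iq):
        return False
    # E_j sent q after receiving p, and q reached E_i afterwards
    return r_jp < s_jq < r_iq

def pre_acknowledged(trace, p, i, cluster):
    """True iff, for every E_j in the cluster, E_i holds some q
    that pre-acknowledges p for E_j."""
    pdus = {pdu for _, _, pdu in trace}
    return all(any(pre_acknowledges(trace, p, q, j, i) for q in pdus)
               for j in cluster)

# Part of the scenario of Figure 6: E2 sends c after receiving a.
TRACE = [('s', 1, 'a'), ('r', 2, 'a'), ('r', 3, 'a'),
         ('s', 2, 'c'), ('r', 3, 'c')]
```

On this trace `pre_acknowledges(TRACE, 'a', 'c', 2, 3)` holds, but `pre_acknowledged(TRACE, 'a', 3, [1, 2, 3])` does not, since $E_3$ has not yet received any PDU that $E_1$ or $E_3$ sent after receiving $a$.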
5 Group Communication Protocols
In this section, we present group communication protocols for a cluster $C = \langle E_1,\ldots,E_n \rangle$, which provide various kinds of system services by using the 1C or MC service.
Figure 6: Pre-acknowledgment and acknowledgment (time diagram of the sending and receipt events of the PDUs $a$, $b$, $c$, $d$, $e$, and $f$ ($= g_3$) among $E_1$, $E_2$, $E_3$, and $E_4$)
5.1 Variables
Each PDU $p$ exchanged among system entities consists of the following fields ($j = 1,\ldots,n$).
$p.CID$ = cluster identifier.
$p.SRC$ = entity $E_i$ which transmits $p$.
$p.DST$ = set of destination entities of $p$.
$p.PRI$ = priority of $p$.
$p.TSEQ$ = total sequence number of $p$.
$p.PSEQ_j$ = partial sequence number for $E_j$.
$p.ACK_j$ = total sequence number of a PDU which $E_i$ expects to receive next from $E_j$.
$p.BUF$ = number of buffers available in $E_i$.
$p.DATA$ = data to be broadcast.
$DST$ and $PSEQ$ are used in the selective protocols. $PRI$ is used in the priority-based ordering protocols.

Each entity $E_i$ has the following local variables ($j = 1,\ldots,n$):
$TSEQ$ = total sequence number of a PDU which $E_i$ expects to broadcast next.
$PSEQ_j$ = partial sequence number of a PDU which $E_i$ expects to send to $E_j$ next.
$TREQ_j$ = total sequence number of a PDU which $E_i$ expects to receive next from $E_j$.
$PREQ_j$ = partial sequence number of a PDU which $E_i$ expects to receive next from $E_j$.
$AL_{hj}$ = total sequence number of a PDU which $E_i$ knows that $E_j$ expects to receive next from $E_h$ ($h = 1,\ldots,n$).
$PAL_{hj}$ = total sequence number of a PDU which $E_i$ knows that $E_j$ expects to pre-acknowledge next from $E_h$ ($h = 1,\ldots,n$).
$BUF_j$ = available buffer size in $E_j$ which $E_i$ knows of.

Let $ISS_j$ and $IBF_j$ be an initial total sequence number and an initial available buffer size in $E_j$, respectively. $E_i$ obtains $ISS_j$ and $IBF_j$ of every $E_j$ in the cluster establishment procedure [24]. Initially, $TSEQ = PSEQ_j = ISS_i$, $TREQ_j = PREQ_j = AL_{jh} = ISS_j$, and $BUF_j = IBF_j$ ($j, h = 1,\ldots,n$) in $E_i$.
5.2 Common procedures
There are some procedures commonly used in the group communication protocols.

A. Transmission
$E_i$ broadcasts a PDU $p$ by one of the following actions. In the non-selective protocols, $E_i$ executes BC1. BC2 is executed after BC1 in the selective ones.
BC1. (1) $p.TSEQ := TSEQ$, (2) $TSEQ := TSEQ + 1$, (3) $p.ACK_k := TREQ_k$ ($k = 1,\ldots,n$), and (4) $p.BUF :=$ available buffer size.
BC2. (1) $p.PSEQ_j := PSEQ_j$ ($j = 1,\ldots,n$) and (2) if $E_j$ is a destination of $p$, $PSEQ_j := PSEQ_j + 1$ and $p.DST := p.DST \cup \{E_j\}$ ($j = 1,\ldots,n$).

B. Acceptance
On receipt of $p$ from $E_j$, $E_i$ accepts $p$ by the ACC1 action if $p$ satisfies AC1 in the non-selective protocols.
AC1. $p.TSEQ = TREQ_j$.
ACC1. (1) $TREQ_j := p.TSEQ + 1$, (2) $BUF_j := p.BUF$, and (3) $AL_{kj} := p.ACK_k$ ($k = 1,\ldots,n$).
Unless AC1 holds, $E_i$ finds loss of some PDU. In the selective protocols, even if $E_i$ fails to receive a PDU $g$ from $E_j$, the loss of $g$ is not a failure if $E_i \notin g.DST$. AC2 is used for such a case. If $p$ satisfies AC1 or AC2, $E_i$ executes ACC1 and ACC2.
AC2. $p.PSEQ_i = PREQ_j$ and $E_i \in p.DST$.
ACC2. If $E_i \in p.DST$, $PREQ_j := p.PSEQ_i + 1$. Otherwise, $E_i$ discards $p$.
Suppose that $E_j$ broadcasts $a$, $b$, and $c$, and $E_i$ accepts $a$ as shown in Figure 7. Here, $TREQ_j = 4$ and $PREQ_j = 3$ in $E_i$. $E_i$ receives $c$ where $c.TSEQ = 5$ and $c.PSEQ_i = 3$. Here, AC1 does not hold. However, since $c.PSEQ_i = PREQ_j$ and $E_i \in c.DST$, $E_i$ knows that $E_i \notin b.DST$ [Figure 7 (a)]. If $E_i \notin c.DST$, there must be some PDU $b$ where $b.PSEQ_i = 3$ and $E_i \in b.DST$ [Figure 7 (b)].

Figure 7: Acceptance condition ((a) $E_i \notin b.DST$, (b) $E_i \in b.DST$; each panel shows the $TSEQ$ and $PSEQ$ values of $a$, $b$, and $c$ sent from $E_j$ to $E_i$)

C. Failure detection
On receipt of $p$ from $E_j$, $E_i$ detects PDU loss by checking the sequence numbers.
FC1. $TREQ_j < p.TSEQ$.
FC2. $TREQ_k < p.ACK_k$ for some $k$ ($\neq j$).
If FC1 holds, $E_i$ has not received $g$ from $E_j$ such that $TREQ_j \leq g.TSEQ < p.TSEQ$. If FC2 holds, $E_i$ has not received $g$ from $E_k$ such that $TREQ_k \leq g.TSEQ < p.ACK_k$. The selective protocols use FC3 instead of FC1.
FC3. $PREQ_j < p.PSEQ_i$, or $PREQ_j = p.PSEQ_i$ and $E_i \notin p.DST$.

D. Pre-acknowledgment
If the PC condition holds for $p$ ($p.SRC = E_j$), the $p.ACK$s are recorded in $PAL$ by the PACK action.
PC. $p.TSEQ < \min \{AL_{jk} \mid k = 1,\ldots,n\}$.
PACK. $PAL_{kj} := p.ACK_k$ ($k = 1,\ldots,n$).

E. Acknowledgment
$p$ (from $E_j$) is acknowledged if AC holds.
AC. $p.TSEQ < \min \{PAL_{jk} \mid k = 1,\ldots,n\}$.

F. Reset
The RST (reset) action is invoked in order to resynchronize the entities.
RST. (1) $E_i$ broadcasts an RST PDU $r$ where $r.ACK_j := REQ_j$ ($j = 1,\ldots,n$). (2) On receipt of $r$, $REQ_j := r.ACK_j$ if $REQ_j > r.ACK_j$ ($j = 1,\ldots,n$) in $E_k$. If RST is received from every entity, $E_k$ broadcasts an RST PK $rp$ where $rp.ACK_j := REQ_j$. (3) On receipt of all the RST PKs, $E_k$ broadcasts RST AK. Here, every $E_i$ has the same $REQ$s.
5.3 TO protocol
The TO protocol [24, 26] provides the TO service by using the 1C service.
A. Data transmission
Each $E_i$ accepts, pre-acknowledges, and acknowledges each PDU by the following three-phase procedure. Each $RL_i$ consists of three sublogs $RRL_i$, $PRL_i$, and $ARL_i$, which are composed of accepted, pre-acknowledged, and acknowledged PDUs, respectively.
(1) Transmission and acceptance: (1-1) $E_j$ broadcasts a PDU by the BC1 action. (1-2) On receipt of $p$ from $E_j$, $E_i$ accepts $p$ if AC1 holds. $E_i$ executes ACC1 and appends $p$ to the tail of $RRL_i$.
(2) Pre-acknowledgment: If $p = top(RRL_i)$ satisfies PC, $E_i$ removes $p$ from $RRL_i$, appends $p$ to $PRL_i$, and executes PACK.
(3) Acknowledgment: If $p = top(PRL_i)$ satisfies AC, $E_i$ removes $p$ from $PRL_i$ and appends $p$ to $ARL_i$.
Every application entity $A_i$ receives PDUs in the same order by taking $top(ARL_i)$.
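The three-phase procedure can be traced with a small simulation. The following Python sketch is a simplified illustration, not the implementation evaluated in this paper. It assumes a loss-free network that delivers every broadcast to all entities (including the sender) in the same order, initial sequence numbers of 0, and it omits the $BUF$, $CID$, and $PRI$ fields; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PDU:
    src: int      # index of the sending entity
    tseq: int     # total sequence number
    ack: list     # ack[k]: TSEQ the sender expects next from E_k
    data: str

class Entity:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.tseq = 0                              # next TSEQ to broadcast
        self.treq = [0] * n                        # TREQ_j
        self.al = [[0] * n for _ in range(n)]      # al[h][j] = AL_hj
        self.pal = [[0] * n for _ in range(n)]     # pal[h][j] = PAL_hj
        self.rrl, self.prl, self.arl = [], [], []  # the three sublogs

    def broadcast(self, data):                     # BC1
        p = PDU(self.i, self.tseq, list(self.treq), data)
        self.tseq += 1
        return p

    def receive(self, p):
        j = p.src
        if p.tseq != self.treq[j]:                 # AC1 does not hold
            return False
        self.treq[j] = p.tseq + 1                  # ACC1
        for k in range(self.n):
            self.al[k][j] = p.ack[k]
        self.rrl.append(p)
        self._advance()
        return True

    def _advance(self):
        # PC: top(RRL) has been accepted by every entity
        while self.rrl and self.rrl[0].tseq < min(self.al[self.rrl[0].src]):
            p = self.rrl.pop(0)
            for k in range(self.n):                # PACK
                self.pal[k][p.src] = p.ack[k]
            self.prl.append(p)
        # AC: top(PRL) has been pre-acknowledged by every entity
        while self.prl and self.prl[0].tseq < min(self.pal[self.prl[0].src]):
            self.arl.append(self.prl.pop(0))

def run(n, count):
    """Let the entities broadcast `count` PDUs in round-robin order."""
    ents = [Entity(i, n) for i in range(n)]
    for t in range(count):
        sender = ents[t % n]
        p = sender.broadcast(f"{'abcdefgh'[sender.i]}{t // n}")
        for e in ents:
            e.receive(p)
    return ents
```

With three entities, `run(3, 6)` leaves every $ARL_i$ empty, while after `run(3, 7)` the first PDU `a0` is acknowledged at every entity, i.e. $2n + 1 = 7$ PDUs are exchanged before the first acknowledgment in this round-robin scenario.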
B. Failure detection and recovery
PDU loss can be detected by checking FC1 and FC2. The lost PDUs are retransmitted by the go-back-n retransmission. That is, every entity agrees on which PDUs are lost by using RST, and then all the PDUs following the lost PDUs are rebroadcast. If the selective retransmission were used, some additional mechanism would be required to order the retransmitted PDUs.
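The failure conditions FC1 and FC2 amount to comparing the expected sequence numbers with the fields of a received PDU. A minimal sketch follows, assuming entities are indexed from 0 and using a hypothetical `detect_loss` helper; it only reports the gaps and does not perform the RST agreement or the retransmission.

```python
from collections import namedtuple

# Only the fields needed for loss detection are modeled.
PDU = namedtuple("PDU", "src tseq ack")

def detect_loss(treq, p):
    """Return {sender: (first_missing, last_missing)} for the gaps
    revealed by PDU p, given the receiver's TREQ vector."""
    lost = {}
    if treq[p.src] < p.tseq:                    # FC1: gap from the sender
        lost[p.src] = (treq[p.src], p.tseq - 1)
    for k, ack in enumerate(p.ack):
        if k != p.src and treq[k] < ack:        # FC2: sender saw more from E_k
            lost[k] = (treq[k], ack - 1)
    return lost
```

For example, with `treq = [4, 2, 7]`, receiving `PDU(src=0, tseq=6, ack=[6, 3, 7])` reveals that PDUs 4 and 5 from entity 0 and PDU 2 from entity 1 are missing.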
C. Flow control
In group communication, every entity controls its sending of PDUs so that every PDU can be received without buffer overflow. Each $E_i$ notifies every entity of its available buffer size $BUF$. Let $minBUF$ be the minimum among $BUF_1,\ldots,BUF_n$. $E_i$ can send $minBUF/n$ PDUs continuously, i.e. $minBUF/n$ gives the window size.

It is clear that the TO protocol on the 1C service provides the CO service. On the MC service, the TO protocol provides not the TO service but the LO service.
5.4 LO protocol
The LO protocol [25,26] provides the LO service by using the MC service.
A. Data transmission
Each Ei has n receipt sublogs RLi1 ,...,RLin . PDUs from each Ej are stored in RLij (j = 1,...,n). RLij consists of three sublogs RRLij , PRLij , and ARLij . The LO protocol adopts the same three-phase procedure as the TO protocol.
B. Failure detection and recovery
PDU loss can be detected by checking FC1 and FC2. The lost PDUs are retransmitted by using the selective retransmission. If $E_i$ detects the loss of PDUs from $E_j$, $E_i$ requests $E_j$ to retransmit the lost PDUs. The retransmitted PDUs from $E_j$ are inserted into $RRL_{ij}$ in the ascending order of $TSEQ$. Even if the 1C service is used, the LO protocol does not provide the TO service since the selective retransmission is used.
5.5 SLO protocol
The SLO protocol [17, 18] provides the SLO service on the MC service. The data transmission procedure is the same as that of the LO protocol. AC1 and AC2 are used as the acceptance condition.
5.6 PriO and PriTO protocol
The PriO [19] and PriTO [19, 20] protocols provide the PriO and PriTO services by using the 1C service, respectively.
(1) Acceptance: On receipt of $p$ from $E_j$, if $p$ satisfies AC1, $E_i$ inserts $p$ between $q_1$ and $q_2$ in $PRL_i$ where $q_1.PRI \geq p.PRI > q_2.PRI$. $E_i$ creates a pseudo-PDU $p^*$ and appends $p^*$ to the tail of $RRL_i$. $p^*$ is the same as $p$ except that $p^*$ has no data. For example, $PRL_i = \langle b[4]\ a[3]\ d[3]\ c[1] ]$ and $RRL_i = \langle a^*\ b^*\ c^*\ d^* ]$ are obtained by inserting $d[3]$ into $PRL_i = \langle b[4]\ a[3]\ c[1] ]$ and $RRL_i = \langle a^*\ b^*\ c^* ]$. The sequences of real PDUs and pseudo-PDUs denote the priority-based order and the receipt order of the PDUs, respectively.
(2) Pre-acknowledgment: If $p^* = top(RRL_i)$ satisfies PC, $p^*$ is moved from $RRL_i$ to the tail of $PRL_i$.
(3) Acknowledgment: If $p = top(PRL_i)$ satisfies AC, $p$ is moved to $ARL_i$ and $p^*$ is deleted.

When a lost PDU $g$ is detected in the PriTO protocol, all the PDUs following $g$ in the sequence of pseudo-PDUs are removed from $RL_i$ and are retransmitted by the go-back-n retransmission and RST. The selective retransmission is adopted in the PriO protocol because the same-priority PDUs do not need to be totally ordered.

In the priority-based service, a problem is starvation, i.e. lower-priority PDUs may stay indefinitely in the receipt log. One solution is to partition the receipt log into runs [20]. A run is a priority-based ordered subsequence of PDUs. When starvation is detected, the current run is ended and a new run is started. The run partition is synchronized among the entities to provide the PriTO service.
5.7 CO protocol
The CO protocol [21] provides the CO service by using the MC service. The CO protocol uses the same procedure as the TO protocol except that the selective retransmission is adopted and pre-acknowledged PDUs are ordered in the causality-precedence relation $\prec$. If a PDU $p$ is pre-acknowledged in $E_i$, $p$ is inserted into $PRL_i$ so that the receipt log is causality-preserved. In the CO protocol, $p \prec q$ if the following CO rule holds. Here, $p.SRC = E_j$.
CO1. $p.SEQ < q.SEQ$ if $p.SRC = q.SRC$.
CO2. $p.SEQ < q.ACK_j$ if $p.SRC \neq q.SRC$.
The CO rule is simpler than that of ISIS. The CO protocol can not only order the PDUs in $\prec$ but also find the lost PDUs since the CO rule uses the sequence numbers.
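The CO rule needs only the sequence number and the ACK vector carried by each PDU. The following Python sketch illustrates the rule on the PDUs of Figure 2; the entity indices, the field values, and the simple insertion strategy (insert before the first PDU that the new one precedes) are assumptions of this example, not the paper's implementation.

```python
from collections import namedtuple

# ack[k] is the sequence number the sender expects next from entity k.
PDU = namedtuple("PDU", "name src seq ack")

def precedes(p, q):
    """CO rule: does p causality-precede q?"""
    if p.src == q.src:
        return p.seq < q.seq            # CO1
    return p.seq < q.ack[p.src]         # CO2

def insert_causally(prl, p):
    """Insert p into prl in front of the first PDU that p precedes."""
    for idx, q in enumerate(prl):
        if precedes(p, q):
            prl.insert(idx, p)
            return
    prl.append(p)

# Entity 0 sends g and then p; entity 1 sends q after receiving both,
# so q carries ack[0] = 2.
g = PDU("g", 0, 0, [0, 0, 0])
p = PDU("p", 0, 1, [1, 0, 0])
q = PDU("q", 1, 0, [2, 0, 0])

prl = []
for pdu in (q, g, p):                   # pre-acknowledged out of causal order
    insert_causally(prl, pdu)
```

After the three insertions, `prl` holds the PDUs in the order $g$, $p$, $q$, which is the causality-preserved log $RL_3$ of Figure 2.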
6 Evaluation
The group communication protocols are characterized by the following aspects [Table 2]:
System service: Each protocol provides some kind of system service defined in section 3.
Network service: 1C, MC, or reliable one.
Control scheme: centralized, decentralized, and distributed ones.
Destination: selective and non-selective.
Communication mode: synchronous and asynchronous ones. In the synchronous mode, no entity sends a PDU until the PDU sent before is atomically received. In the asynchronous mode, each entity may send PDUs without waiting for the atomic receipt of previous PDUs.
Retransmission: go-back-n and selective schemes.
Performance: The performance is measured in terms of the number of PDUs transmitted and the delay time of PDUs among application entities. The parameters are the number $n$ of entities in the group (the number $m$ ($\leq n$) of destinations in the selective broadcast protocols) and the propagation delay time $T$ among entities.

[13] presents a centralized protocol (referred to as KTHB) which provides the TO service by using the 1C network. The data transmission is composed of two phases: (1) the source entity sends a PDU $p$ to the master entity named a sequencer and (2) the sequencer broadcasts $p$ to the receivers. When a PDU loss is detected, the selective retransmission is used. The sequencer supports synchronous communication.

[8] presents a decentralized protocol (referred to as GS) based upon the tree-structured routing. Each node shows an entity and each path denotes a route of a PDU to the destinations. Here, $h$ denotes the height of the tree, and the PDU count of GS in Table 2 also depends on the number of entities in the path which are not the destinations. If an entity receives $p$ from the source or the parent, $p$ is relayed to its children until all the destinations receive $p$. The parent decides on the atomic receipt among the children. The lost PDUs are selectively retransmitted.

ISIS [3] supports the CBCAST and ABCAST protocols (referred to as BSS), which provide the CO and TO services, respectively. In these protocols, each entity has a virtual clock to causally order the received PDUs. In ABCAST, a decentralized procedure like the two-phase commitment is used. Since the LO service is used as the underlying service, there is no PDU loss.

Delta-4 [22] supports an atomic multicast protocol (AMp). AMp provides multiple qualities of service (QOS) including the CO and TO services. It adopts the decentralized control. AMp and ISIS discuss how to tolerate stop-failure of the entities.

We have implemented the protocols discussed in this paper as processes of SunOS 4.1 (SunOS is a trademark of Sun Microsystems, Inc.) in Sparc2 workstations interconnected by Ethernets. Each workstation has only one protocol entity. The program size is about 5K steps in the C language, and the size of the executable object code is about 50K bytes. Figure 8 illustrates the average delay time of PDUs for the number $n$ of system entities. AP-AP-delay shows the time from a DT request submission of an application entity until the receipt by all the destinations. SP-delay means how long it takes each PDU $p$ to be acknowledged after $p$ is accepted. Ethernet-delay shows how long it takes to transmit $p$ by using the Ethernet MAC service. According to Figure 8, the delay time is $O(n)$.

Figure 8: Delay time vs. the number of entities (x-axis: number of entities $n$, 3 to 8; y-axis: delay time [msec], 0 to 600; curves: AP-AP-delay, SP-delay, Ethernet-delay)
7 Concluding Remarks
In this paper, we have presented a formal model of the group communication service from the data transmission point of view, assuming no entity failure. The service is modeled as a set of logs which denote the sending and receipt sequences of PDUs in each entity. We have defined various kinds of group communication services based on this model. We have also shown group communication protocols which provide the atomic and well-defined ordered delivery of the PDUs.
Table 2: Group communication protocols
protocol  system service  network service  cntl.   dst.     mode    recov.   PDU              delay     notes
GS        TO              1-to-1/MC        decnt.  group    async.  select.  n + (non-dest.)  (h + 1)T  tree
BSS       TO/CO           1-to-1/OP        decnt.  group    sync.   -        3n               3T        2 phase
KTHB      TO              broadcast/1C     cnt.    group    sync.   select.  n + 2            2T        2 phase
AMp       TO/CO           broadcast/MC     decnt.  group    a./s.   ?        n + 2            3T        2 phase
LO        OP              broadcast/MC     dist.   group    async.  select.  2n + 1           3T        3 phase
TO        TO/CO           broadcast/1C     dist.   group    async.  go-back  2n + 1           3T        3 phase
CO        CO              broadcast/MC     dist.   group    async.  select.  2n + 1           3T        3 phase
SLO       SLO             broadcast/MC     dist.   select.  async.  select.  2m + 1           3T        3 phase
STO       STO             broadcast/1C     dist.   select.  async.  select.  2m + 1           3T        3 phase
PriO      PriO            broadcast/1C     dist.   group    async.  select.  2n + 1           3T        3 phase
PriTO     PriTO           broadcast/1C     dist.   group    async.  go-back  2n + 1           3T        3 phase
non-dest. = number of entities in the path which are not the destinations; h = height of the tree.

References
[1] Abeysundara, B. W. and Kamal, A. E., "High-Speed Local Area Networks and Their Performance: A Survey," ACM Computing Surveys, Vol.23, No.2, 1991, pp.221-264.
[2] Amir, Y., Dolev, D., Kramer, S., and Malki, D., "Transis: A Communication Sub-System for High Availability," Proc. of IEEE 22th Annual Int'l Symposium on Fault-Tolerant Computing, 1993, pp.76-84.
[3] Birman, K., Schiper, A., and Stephenson, P., "Lightweight Causal and Atomic Group Multicast," ACM Trans. Computer Systems, Vol.9, No.3, 1991, pp.272-314.
[4] Chang, J. M. and Maxemchuk, N. F., "Reliable Broadcast Protocols," ACM Trans. Computer Systems, Vol.2, No.3, 1984, pp.251-273.
[5] Defense Communications Agency, "DDN Protocol Handbook," Vol.1-3, NIC 50004-50005, 1985.
[6] Ellis, C. A., Gibbs, S. J., and Rein, G. L., "Groupware," Comm. ACM, Vol.34, No.1, 1991, pp.38-58.
[7] Garcia-Molina, H. and Spauster, A., "Message Ordering in a Multicast Environment," Proc. of IEEE ICDCS-9, 1989, pp.354-361.
[8] Garcia-Molina, H. and Spauster, A., "Ordered and Reliable Multicast Communication," ACM Trans. Computer Systems, Vol.9, No.3, 1991, pp.242-271.
[9] Gray, J., "Notes on Database Operating Systems," Operating Systems: An Advanced Course (Bayer, R. ed.), Springer-Verlag, 1978.
[10] IEEE, "Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specification," ANSI/IEEE Standard 802.3, IEEE, 1985.
[11] ISO, "OSI - Basic Reference Model," ISO 7498, 1984.
[12] Jain, R., "FDDI: Current Issues and Future Plans," IEEE Communication Magazine, Vol.31, No.9, Sep. 1993, pp.98-105.
[13] Kaashoek, M. F., Tanenbaum, A. S., Hummel, S. F., and Bal, H. E., "An Efficient Reliable Broadcast Protocol," ACM Operating Systems Review, Vol.23, No.4, 1989, pp.5-19.
[14] Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, Vol.21, No.7, 1978, pp.558-565.
[15] Luan, S. W. and Gligor, V. D., "A Fault-Tolerant Protocol for Atomic Broadcast," IEEE Trans. Parallel and Distributed Systems, Vol.1, No.3, 1990, pp.271-285.
[16] Mattern, F., "Virtual Time and Global States of Distributed Systems," Parallel and Distributed Algorithms (Cosnard, M. and Quinton, P. eds.), North-Holland, Amsterdam, The Netherlands, 1989, pp.215-226.
[17] Nakamura, A. and Takizawa, M., "Reliable Broadcast Protocol for Selectively Ordering PDUs," Proc. of IEEE ICDCS-11, 1991, pp.239-246.
[18] Nakamura, A. and Takizawa, M., "Design of Reliable Broadcast Communication Protocol for Selectively Partially Ordered PDUs," Proc. of IEEE COMPSAC'91, 1991, pp.673-679.
[19] Nakamura, A. and Takizawa, M., "Priority-Based Total and Semi-Total Ordering Broadcast Protocols," Proc. of IEEE ICDCS-12, 1992, pp.178-185.
[20] Nakamura, A. and Takizawa, M., "Starvation-Free Priority-Based Total Ordering Broadcast Protocol on High-Speed Single Channel Network," Proc. of 2nd Int'l Symp. on High Performance Distributed Computing (HPDC-2), 1993, pp.281-288.
[21] Nakamura, A. and Takizawa, M., "Causally Ordering Broadcast Protocol," to appear in Proc. of IEEE ICDCS-14, 1994.
[22] Powell, D., Chereque, M., and Drackley, D., "Fault-Tolerance in Delta-4," ACM Operating Systems Review, Vol.25, No.2, 1991, pp.122-125.
[23] Schneider, F. B., Gries, D., and Schlichting, R. D., "Fault-Tolerant Broadcasts," Science of Computer Programming, Vol.4, No.1, 1984, pp.1-15.
[24] Takizawa, M., "Cluster Control Protocol for Highly Reliable Broadcast Communication," Proc. of IFIP Conf. on Distributed Processing, 1987, pp.431-445.
[25] Takizawa, M. and Nakamura, A., "Partially Ordering Broadcast (PO) Protocol," Proc. of IEEE INFOCOM'90, 1990, pp.357-364.
[26] Takizawa, M. and Nakamura, A., "Reliable Broadcast Communication," Proc. of IPSJ Int'l Conf. on Information Technology (InfoJapan), 1990, pp.325-332.
[27] Tanenbaum, A. S., "Computer Networks (2nd ed.)," Englewood Cliffs, NJ: Prentice-Hall, 1989.
[28] Verissimo, P., Rodrigues, L., and Baptista, M., "AMp: A Highly Parallel Atomic Multicast Protocol," Proc. of the ACM SIGCOMM'89, 1989, pp.83-93.