discuss protocols which use Ethernet and are controlled in a central- ... referred to as be supported by E1,...,En (written as C = hE1,...,Eni), and support A1,...,An.
Selective Order-Preserving Broadcast Protocol for Distributed Systems Akihito Nakamura
and
Makoto Takizawa
Dept. of Computers and Systems Engineering Tokyo Denki University Ishizaka, Hatoyama, Hiki-gun, Saitama 350-03 e-mail
fnaka,
g
taki @takilab.k.dendai.ac.jp
October 21, 1997
Abstract This paper discusses how to provide
selective group communication for multiple
entities in a distributed system by using less-reliable broadcast communication services. In the group communication, a protocol data unit (PDU) sent by each entity has to be delivered atomically to all the destinations in the group. In distributed applications, each entity sends a PDU only to a subset rather than all the entities, and each entity receives only PDUs destined to it from every entity in the same or-
selective order -preserving In this paper, we discuss how to design a distributed pro-
der as they are sent. We name such a broadcast service a
broadcast
(SP) service.
tocol which provides the SP service for entities by using the less-reliable broadcast network in the presence of PDU loss. The SP service can be a useful facility for designing and implementing distributed systems like groupware.
1
1 Introduction In distributed applications like teleconferencing and cooperative work in groupware systems [?], group communication among multiple communication entities is required in addition to conventional one-to-one communication provided by OSI [?] and TCP/IP [?]. Reliable broadcast communication has been studied in [?]{[?], [?]{[?], [?]{[?], [?]{[?], and [?]{[?]. In [?], a reliable broadcast protocol which uses the one-to-one communication is presented. [?, ?, ?] discuss protocols which use Ethernet and are controlled in a centralized manner. [?] presents a broadcast protocol which provides total ordering of received messages based on majority-consensus decision. [?] characterizes message ordering properties in a reliable broadcast protocol using the conventional one-to-one network. [?, ?] present a cluster concept which is an extension of the conventional connection concept to multiple service access points (SAPs) and showed how to establish the cluster by using the Ethernet MAC [?] service. An important problem in designing protocols is which entity coordinates the cooperation of multiple entities. Most approaches [?, ?, ?, ?, ?] adopt the centralized control scheme, where one master entity decides the correct receipt among multiple entities. In the centralized control scheme, processes have to wait for the decision of the master entity. On the other hand, in the distributed control scheme, every entity decides the correct receipt by itself on the basis of the protocol data units (PDUs) received. We adopt the distributed approach, because it can take advantage of broadcast service supported by the underlying network like the Ethernet. [?, ?, ?] present a TO (total ordering broadcast) protocol on the Ethernet, which provides a service where every application process can receive the same PDUs in the same order without PDU loss. [?] presents an OP (order-preserving broadcast) protocol by which all the entities receive PDUs in the sending order by using a less-reliable broadcast service but may receive the PDUs not in the same order. In these protocols, every PDU is broadcast to all the entities in the cluster. In distributed applications, each entity sends each PDU p only to a subset of entities rather than all the entities in the cluster. Here, every entity receives the PDUs destined to it from each entity in the same order as it sends. We name such a service a selective order-preserving broadcast (SP) service. A simple mechanism for selective broadcast is studied in [?]. It uses a spanning tree for routing PDUs to their destinations and is based on the one-to-one communication service. In high-speed networks [?], entities may fail to receive PDUs due to the buer overrun since the processing speed of entities is slower than the transmission speed. In this paper, we discuss a distributed protocol which provides the SP service for the entities in the cluster by using less-reliable high-speed broadcast networks where PDU loss occurs as failure. The remainder of this paper is organized as follows. In section 2, we model lessreliable and reliable broadcast communication services. In section 3, we present a concept of correct receipt of PDUs in a cluster. In section 4, a cluster establishment procedure is shown. In section 5, we present a data transmission procedure of the SP protocol. Finally, we discuss the correctness and performance of the SP protocol in section 6.
2
application layer
111
A1
6
S1
Si
?
system layer
111
network layer
6
6
Sn
6 ?
? 111
Ei
Ni
application entity
An
?
E1
N1
111
Ai
Y
En
6 ?
Nn
6 ?
system SAP system entity network SAP
less-reliable broadcast network Figure 1: System model
2 Service Model We model the communication service for multiple entities. 2.1
System model
A distributed system is composed of application , system , and network layers [Figure ??]. A cluster C [?, ?] is a set of n ( 2) system service access point s (SAPs), i.e. fS1 ,...,S g. Each application entity A takes some communication service through S where each S is supported by a system entity E (i = 1,...,n). The system entities E1 ,...,E cooperate with each other to support some broadcast service for C by using the underlying network service. The cooperation of the system entities is coordinated by a system protocol. C is referred to as be supported by E1 ,...,E (written as C = hE1 ,...,E i), and support A1,...,A . We model a broadcast communication service as a set of logs. A log L is a sequence of PDUs < p1 ::: p ], where p1 and p are the top and the last, denoted by top (L) and last (L), respectively. In L, p precedes p (p ! p ) if h < k . Each E has a sending log SL and a receipt log RL , which are sequences of PDUs sent and received by E , respectively (i = 1,...,n). SL is a sublog of SL which includes PDUs whose destinations include E (j = 1,...,n). RL is a sublog of RL which includes PDUs from E (j = 1,...,n). The following relations exist among the logs: (1) RL is order -preserved i for every entity E , p !RL q if p !SL q . RL is information -preserved i RL includes all the PDUs in SL1 ,...,SL . RL is preserved i RL is order-preserved and information-preserved. (2) RL and RL are order -equivalent i for every pair of PDUs p and q included in both RL and RL , p !RL q i p !RL q. RL and RL are information -equivalent i RL and RL include the same PDUs. RL and RL are equivalent i RL and RL are order-equivalent and information-equivalent. (3) RL is selectively information -preserved i RL includes all the PDUs in SL1 ,...,SL . n
i
i
i
i
n
n
m
n
n
m
h
i
k
h
L
k
i
i
i
ij
j
i
ij
i
j
i
j
i
i
i
j
n
i
i
i
j
i
i
j
i
i
j
j
i
j
j
i
j
i
i
3
i
ni
E1
111
111
Ei
?
En
.. Ch 1 . .. Ch ? . Ch
?
j
R
Figure 2: MC service
RL is selectively preserved i RL is order-preserved and selectively informationpreserved. i
2.2
i
Broadcast communication service
We consider types of broadcast communication services for a cluster C = hE1 ,...,E i. First, we de ne the services provided by the underlying network layer. n
[De nition] A broadcast communication service S is a multi -channel (MC) service i every receipt log in S is order-preserved. S is a one -channel (1C) service i every receipt log in S is order-preserved and order-equivalent with each other. 2 In the 1C service, PDUs are delivered to entities in the same order, but some PDUs may be lost. Entities connected by a high-speed channel may fail to receive PDUs due to the buer overrun. The 1C service is a model of such system where entities are connected by a high-speed channel. By using the MC service, every system entity can receive PDUs in the same order as they are sent, but some entity may fail to receive some PDUs due to the buer overrun and over ow. An example of the MC service is one where workstations are connected by multiple high-speed channels Ch 1 ,...,Ch (R 2) [Figure ??]. Each channel supports the 1C service. Each entity E in a workstation uses one Ch to broadcast PDUs and receives PDUs from all the channels. Here, every entity receives PDUs from each entity in the sending order. However, some E may fail to receive PDUs due to the buer overrun and over ow. R
i
j
j
[Example 2.1] Figure ?? shows an example of the MC service. Suppose that C = hE1 , E2 , E3 i. E1 , E2 , and E3 broadcast PDUs as SL1 = < a b c d ], SL2 = < p q ], and SL3 = < x y z ], respectively. All the entities receive PDUs in the sending order, i.e. every receipt log RL is order-preserved (i = 1, 2, 3). For example, a ! b in every RL (i = 1, 2, 3). However, RL21 and RL32 are not information-preserved because RL21 and RL32 include neither c nor q . If the 1C service is used as the underlying service, each entity has the receipt log as shown in Figure ??. Every RL is order-preserved and is i
RLi
i
i
4
E1
E2
E3
RL1 : RL11 : RL12 : RL13 : RL2 : RL21 : RL22 : RL23 : RL3 : RL31 : RL32 : RL33 :
SL1 :
< a b c d
]
SL2 :
< p q
]
SL3 :
< x y z
< a x b c p y d z q < a b c d < p q
]
< x y z
< p q
]
< x y z
]
]
< a b c d
]
< x y z
]
]
< a x p y b c z d
< p
]
]
< p x a b y z d q < a b d
]
]
]
]
Figure 3: Example of the MC service E1 E2 E3
RL1 : RL2 : RL3 :
< a x b c p y d z q < a x b p y d z q < a x b c p y d z
SL1 : SL2 : SL3 :
]
] ]
< a b c d < p q
]
]
< x y z
]
Figure 4: Example of the 1C service order-equivalent with each other (i = 1, 2, 3). That is, every entity receives the PDUs in the same order, although they come from dierent entities. For example, a ! p in every RL (i = 1, 2, 3). 2 RLi
i
Next, we consider what kind of broadcast services is provided for the application entities by the system layer.
[De nition] A broadcast communication service S is an order -preserving (OP) service i every receipt log in S is preserved. S is a total ordering (TO) service i every receipt
log in S is preserved and order-equivalent with each other.
2
[Example 2.2] Figure ?? shows an example of the OP service for three application
entities A1 , A2 , and A3. In the OP service, every A receives PDUs from each A in the sending order without PDU loss (i, j = 1, 2, 3). For example, every A receives PDUs p and q from A2 and p ! q (i = 1, 2, 3). However, PDUs from dierent entities may be received in dierent order. On the other hand, in the TO service, every A receives PDUs in the same order as shown in Figure ??. 2 i
j
i
RLi
i
Next, we de ne a selective broadcast communication (SBC) service for C = hE1,...,E i. In the SBC service, each PDU is sent only to the destinations (not necessarily all the entities) in C although each PDU is sent to all the entities in the TO and OP services. n
5
A1 A2 A3
RL1 : RL2 : RL3 :
] < p x a b y c z d q ] < a x p y b q c z d ] < a x b c p y d z q
SL1 : SL2 : SL3 :
< a b c d < p q
]
]
< x y z
]
Figure 5: Example of the OP service A1 A2 A3
RL1 : RL2 : RL3 :
< a x b c p y d z q < a x b c p y d z q < a x b c p y d z q
] ] ]
SL1 : SL2 : SL3 :
< a b c d < p q
]
]
< x y z
]
Figure 6: Example of the TO service
[De nition] A broadcast communication service S is a selective broadcast communication
2
(SBC) service i every receipt log in S is selectively information-preserved.
In the SBC service, each PDU is delivered to only and all the destination entities. There are two kinds of SBC services according to the receipt ordering of PDUs.
[De nition] An SBC service
is a selective order -preserving (SP) service i every receipt log in S is selectively preserved. S is a selective total ordering (ST) service i every receipt log in S is selectively preserved and order-equivalent with each other. 2 S
DST be a set of destination entities of a PDU p, i.e. p:DST C . If p:DST = g ( C ), p is written as pf 1 fE 1 ,...,E g to explicitly denote the destination entities. In the SP service, if PDUs p and q are broadcast by E , p ! q , and E 2 p:DST \ q . In this q:DST , then E receives p and q in the same order as they are sent, i.e. p ! paper, we present a protocol which provides the SP service for C .
Let
p:
d ;:::;dm
dm
d
j
i
S Lj
i
RLij
[Example 2.3] As an example of the SP service, consider a cluster C which supports application entities A1 , A2, and A3 as shown in Figure ??. Each PDU has its destination
in C . For example, af2 3g is a PDU whose destinations are A2 and A3 . In the SP service, every entity receives only and all the PDUs which are destined to it. For example, A1 receives PDUs c and d from A1 , p and q from A2, and x and y from A3 . Furthermore, each entity receives the PDUs in the sending order. For example, c ! 1 d, p ! 1 q , and x ! 1 y . However, PDUs from dierent entities may be received in dierent order. For example, c ! 1 y in A1 and y ! 3 c in A3. Thus, every receipt log is selectively information-preserved and order-preserved. 2 ;
RL
RL
RL
RL
RL
3 Correct Receipt Concept in a Cluster In this section, we consider the criteria of correct receipt of PDUs in a cluster hE1 ,...,E i. n
6
C
=
A1 A2 A3
: RL2 : RL3 : RL1
< c x p y d q < a d x y p
]
< a y b c p z
]
: S L2 : S L3 : S L1
]
]
< af2;3g bf3g cf1;3g df1;2g < pf1;2;3g qf1g
]
< xf1;2g yf1;2;3g zf3g
]
Figure 7: Example of the SP service
time
R
E1 a
E2
-
b
c
U
E3
N
E4
R ^R*
d
-
f
(= g3 )
R
e
-
: receipt event
: sending event
Figure 8: Pre-acknowledgment and acknowledgment Each system entity E is modeled as a sequence of sending and receipt events. We do not consider local events because they are not observable from another entity. Every event occurs atomically. Let s [p] and r [p] denote sending and receipt events of a PDU p in E , respectively. Here, let E be a set of events in C . We de ne a partial ordering relation, i.e. happened -before relation ! on E [?] which denotes the ordering of event occurrences. i
i
i
i
[De nition] For every pair of events e1 and e2 in E , e1 ! e2 i
(1) e1 happens before e2 in E , (2) for some (not necessarily dierent) E and E , there exists some PDU p such that e1 = s [p] and e2 = r [p], or (3) for some event e3 in E , e1 ! e3 and e3 ! e2. 2 i
i
i
j
j
We assume that every PDU from each E carries the con rmation of receipt of PDUs which E has received already. That is, if E receives q from E (s [q ] ! r [q ]), E knows that every PDU p, where r [p] ! s [q], is received by E . A PDU p from E is referred to as pre -acknowledged for E by E (written as s [p] ) r [q ]) i s [p] ! r [p] ! s [q] ! r [q ]. In Figure ??, E1 sends a and then E2 receives a. Then, E2 sends c which includes the receipt con rmation of a. By receiving c, E3 knows that E2 has received a. Here, s1 [a] ! r2 [a] ! s2 [c] ! r3 [c], i.e. a is pre-acknowledged for E2 by E3 (s1 [a] )2 r3 [c]). There are three criteria levels of correct receipt of every PDU p from E in C : (1) Acceptance : E receives p. j
j
i
j
j
j
j
j
j
i
k
j
i
i
k
i
k
j
j
i
k
i
7
(2) Pre-acknowledgment : E knows that p has been accepted by every destination of p. That is, for every E 2 p:DST , there exists a PDU q such that s [p] ) r [q ] (written as s [p] ) r [q ]). (3) Acknowledgment : E knows that p has been pre-acknowledged by every destination E of p. That is, for every E and E in p:DST and a PDU q where s [p] ) r [q ], there exists a PDU g such that s [q] ) r [g]. i
j
k
k
j
i
i
i
j
j
h
k
h
h
j
i
The acknowledgment of p by E means that E knows that every destination of p has known that every destination of p had received p. Even if p is pre-acknowledged in E , E cannot decide that p is correctly received by all the destination entities in C . When p is acknowledged in E , E knows that p is pre-acknowledged by every destination. Also, another E knows that p is at least accepted by every destination. That is, E considers that p is correctly received by every destination. i
i
i
i
i
i
k
i
[Example 3.4] In Figure ??,
broadcasts a PDU a where a:DST = fE1, E2 , E3 , E4 g. Since s1 [a] )1 r3 [b], s1 [a] )2 r3 [c], s1 [a] )3 r3 [d], and s1 [a] )4 r3 [e], a is preacknowledged by E3 when e is accepted by E3 (s1[a] ) r3 [e]). When E4 receives f , E4 knows that b, c, d, and e are accepted by E3 . Suppose that every destination E 2 a:DST also receives b, c, d, and e and then sends a PDU g (i = 1, 2, 3, 4), i.e. g3 = f . If E4 receives PDU g from every E , a is acknowledged by E4 . 2 E1
i
i
i
i
4 Cluster Establishment In this section, we consider how to establish a cluster C = hE1 ,...,E i. When every entity agrees with the same cluster identi er CID , C is established . The CID has to be acknowledged by all the entities in C . After that, C is identi ed by CID . Each E has the following variables: n
i
CLOCK = local clock in E . i
i
LC = local clock of E when E knows E participates in the cluster establishment procedure (j = 1,...,n). j
j
i
j
IBF = number of available buers in E which E knows (j = 1,...,n). j
j
i
Initially, every entity E is in the Idle state, LC = NULL, IBF = 0 (j = 1,...,n, i 6= j ) and IBF = number of available buers in E . Let a(S ) denote the address of the system SAP S with which E is attached (i = 1,...,n). Each PDU p used in the cluster establishment procedure has the following elds: i
j
i
j
i
i
i
i
p:
SRC = entity E which transmits p.
p:
ADDR = address of the system SAP S , i.e. a(S ) (j = 1,...,n).
i
j
j
8
j
LC = local clock of E when E participates in the cluster establishment procedure (j = 1,...,n).
p:
p:
C
j
j
j
BUF = number of buers available in E . i
is established by the following cluster establishment procedure.
[Cluster establishment procedure] [Figure ??] (1) Idle state: (1-1) Active open : When E receives an active open request AOP ha(S1 ),...,a(S )i from A in the Idle state, LC := CLOCK and E broadcasts an OPEN PDU p where p:ADDR = a(S ) and p:LC = LC (j = 1,...,n). E moves to the state waiting -for -popened (WfPD). (1-2) Passive open : When E receives a passive open request POP from A , it moves to the state waiting -for -open (WfO) and waits for OPEN PDUs. i
n
i
i
j
i
j
i
j
j
i
i
i
(2) WfO state: When E receives an OPEN PDU p from E which includes a(S ), LC := p:LC , LC := CLOCK , and E broadcasts a POPENED PDU q, where q:LC = LC (j = 1,...,n). Then, E moves to the WfPD state. (3) WfPD state: When E receives an OPEN or POPENED PDU p, for each j , if LC = NULL, then LC := p:LC . If LC 6= NULL and p:LC 6= LC , E broadcasts an ABORT PDU and moves to the Idle state. When E receives an OPEN or POPENED PDU from every entity, E selects a tuple ha(S ), LC i as CID of C such that a(S ) is the minimum. Let IBF be a number of available buers in E . E broadcasts an OPENED PDU p where p :LC = LC (j = 1,...,n) and p:BUF = IBF . Then, E moves to the state waiting -for -opened (WfOD). (4) WfOD state: On receipt of an OPENED PDU p from E , if p:LC 6= LC for some h, E broadcasts an ABORT PDU and moves to the Idle state. Otherwise, IBF := p:BUF . When E receives an OPENED PDU from every entity, C is established and all the entities have the same CID for C . E moves to the state established (Estab). (5) Estab state: Each E sends and receives PDUs among the system entities in C according to the data transmission procedure presented in the following section. 2 i
j
j
i
i
j
i
j
i
j
i
i
j
j
j
j
j
j
i
i
i
h
h
i
i
i
j
i
h
j
i
j
i
h
h
j
i
i
i
We assume that every local clock has a long cycle, e.g. 32-bit clock. By this assumption, the selected CID , i.e. ha(S ), LC i, is always unique for C . Furthermore, multiple active entities are allowed to simultaneously participate in the cluster establishment. h
h
5 SP Protocol on the MC Service In this section, we discuss a data transmission procedure of the SP (selective orderpreserving broadcast) protocol for a cluster C = hE1,...,E i by using the MC service. n
9
Idle POP///
AOP/// /OPEN
?
WfPD
OPEN/
~
WfO
/POPENED
(OPENjPOPENED)//
?
/OPENED
WfOD m/ m// m/// /m m 1 j m2
: : : : :
-
OPENED//
m from an m from all receipt of m from an broadcasting of m. m1 or m2 .
Estab
receipt of
entity.
receipt of
the entities. application entity.
Figure 9: State transition diagram 5.1
Variables
There are two kinds of PDUs exchanged among entities, i.e. DT (data ) and RR (receive ready ) PDUs as shown in Figure ??. The DT PDUs are used to send data in C If E has no data to broadcast, E transmits an RR PDU every pre-de ned period. Each PDU p has the following elds: i
i
p:
CID = cluster identi er.
p:
SRC = entity E which transmits p.
p:
DST = set of destination entities of p.
p:
TSEQ = total sequence number of p.
p:
PSEQ = partial sequence number for E (j = 1,...,n).
p:
i
j
j
ACK = total sequence number of a PDU which E expects to receive next from E (j = 1,...,n). j
i
j
p:
BUF = number of buers available in E .
p:
DATA = data to be broadcast.
i
10
DT PDU : h CID ; SRC ; DST ; TSEQ ; [PSEQ 1 ...PSEQ ]; [ACK 1 ...ACK ]; BUF ; DATA i RR PDU : h CID ; SRC ; [PSEQ 1...PSEQ ]; [ACK 1 ...ACK ]; BUF ; PROMPT i n
n
n
n
Figure 10: PDU formats Every DT PDU p has a DST eld which is a bit-map showing the destinations in C . Here, E 2 p:DST means that E is a destination of p. p:DST := p:DST [ fE g means that the bit corresponding to E is set as the destination of p. When E receives p, if E 2 p:DST , E accepts p. Otherwise, E discards p. The RR PDU does not have the DST eld because p is sent to all the entities in C . Each p has a unique total sequence number p:TSEQ . For every pair of PDUs p and q broadcast by E , p:TSEQ < q:TSEQ if p ! q . In addition, p has a unique partial sequence number p:PSEQ for each E (j = 1,...,n). If p ! q , then p:PSEQ < q:PSEQ . Even if p:TSEQ < q:TSEQ , q:PSEQ may be the same as p:PSEQ if there is no PDU following p which is destined to E . p:ACK informs every entity in C that E has received every PDU q from E where q:TSEQ < p:ACK . In order to do the ow control, p carries p:BUF which is a current number of available buers in E . One PDU can be stored in one buer. On receipt of an RR PDU p, if p:PROMPT = ON, E has to send a DT or RR PDU soon. Each E has the following variables (j = 1,...,n): i
i
i
i
i
i
i
i
i
S Li
j
j
S Lij
j
j
j
j
j
i
j
j
j
i
i
i
TSEQ = total sequence number of a PDU which E expects to broadcast next.
PSEQ = partial sequence number of a PDU which E expects to send to E next.
i
i
j
j
TREQ = total sequence number of a PDU which E expects to receive next from E . i
j
j
PREQ = partial sequence number of a PDU which E expects to receive next from E . i
j
j
AL = total sequence number of a PDU which E knows E expects to receive next from E (h = 1,...,n). hj
i
j
h
PAL = total sequence number of a PDU which pre-acknowledge next from E (h = 1,...,n). hj
knows that
Ei
expects to
Ej
h
BUF = number of buers in E which E knows of. j
j
i
Let ISS and IBF be an initial total sequence number and an initial number of available buer size of E , respectively. E obtains every IBF of E in the cluster establishment procedure (j = 1,...,n). Initially, ISS = LC , TSEQ = PSEQ = ISS , TREQ = PREQ = AL = ISS , and BUF = IBF (j , h = 1,...,n) in E . Let minAL and minBUF denote the minimum value among AL 1 ,...,AL and BUF 1 ,...,BUF , respectively. i
i
i
i
j
j
jh
j
j
j
j
j
j
i
jn
j
n
11
i
j
j
j
5.2
Transmission and acceptance
Each application entity A submits a data transmission (DT ) request r with a destination set DST ( C ) at the system SAP S . When E receives a DT request from A at S , E checks the following ow condition. Here, W and H ( 2) are constants. W gives the window size. i
i
[Flow condition] minAL
i
TSEQ
i