Priority-Based Total and Semi-Total Ordering Broadcast Protocols
Akihito Nakamura and Makoto Takizawa
Dept. of Information and Systems Engineering, Tokyo Denki University
Ishizaka, Hatoyama, Hiki, Saitama 350-03, Japan
E-mail: {naka, taki}@takilab.k.dendai.ac.jp
Abstract
In new applications like groupware systems, broadcast communication of various kinds of data like transactions and files is required. In these applications, some protocol data units (PDUs) carrying control data, such as transactions, have to be delivered earlier than other PDUs. One approach to providing various kinds of communication, like control data transmission and file transfer, over a common communication channel is to give a priority to each PDU and to deliver PDUs with higher priority to the destination earlier than lower-priority ones. In this paper, we discuss broadcast protocols which provide priority-based receipt ordering of PDUs for entities in the cluster. We present distributed broadcast protocols which provide the priority-based receipt ordering of PDUs by using a single-channel system like Ethernet and radio systems.

1 Introduction
In distributed applications like groupware [4], broadcast communication among multiple entities is required in addition to the conventional one-to-one communication provided by OSI [8] and TCP/IP [3]. Reliable broadcast communication systems have been discussed in [1, 2, 5, 6, 9, 10, 14, 15, 16, 17, 18, 20, 21, 22, 23]. In [18], a reliable broadcast protocol on a one-to-one communication network is presented. [2, 5, 9, 10] discuss centralized protocols which use the Ethernet MAC [7] as the underlying network. [14] presents a broadcast protocol which provides total ordering of received protocol data units (PDUs) based on a majority-consensus decision. [6] characterizes message ordering properties in a broadcast protocol using the conventional one-to-one network. [20, 21] present a cluster concept, which is an extension of the conventional connection concept [8] to multiple service access points (SAPs), and discuss how to establish the cluster by using the Ethernet MAC service.

An important problem in designing protocols is which entity coordinates the cooperation of multiple entities. Most approaches [2, 5, 6, 18] adopt centralized control. In distributed control, on the other hand, every entity decides the correct receipt by itself. We adopt the distributed approach because it can take advantage of the underlying broadcast service. [20, 21] present a TO (total ordering) protocol on the Ethernet and radio networks, where every application process can receive the same PDUs in the same order without PDU loss. It is based on the three-phase correct receipt [20, 21]. A similar concept is proposed in the ISIS ABCAST protocol [1]. Although both ISIS and our protocol adopt the distributed control scheme, in ISIS one entity coordinates the correct receipt for each PDU, whereas in our protocol every entity coordinates the correct receipt for every PDU. Other ordering protocols are discussed in [16, 17, 22].

In distributed systems, various kinds of applications broadcast PDUs in the cluster. For example, transactions broadcast update and commitment commands to the database systems while retrieved bulk data are broadcast. In these applications, control commands have to be delivered to the destinations earlier than other data. There are two kinds of priority-based transmission schemes for broadcast communication: controlled access [11, 19] and contention-based protocols [11, 13]. In [13], a contention-based method for broadcast communication is discussed at the physical level. In this paper, we discuss contention-based priority concepts among multiple entities in the cluster at a logical level, since it takes time to make a reservation of the channel in the controlled access scheme. Every entity in the cluster receives all the PDUs broadcast by each entity in the same order under the priority scheme. Such a communication service is named a priority-based ordering (PriO) service.

With the advance of communication technologies, the major failure occurring in the lower layer is PDU loss. When PDU loss occurs, one conventional technique is to reject all the PDUs following the lost PDU (i.e. the go-back-n reject and retransmission method). Another technique is selective retransmission, i.e. only the lost PDUs are retransmitted.
In the first protocol, named a priority-based total ordering (PriTO) protocol, every entity in the cluster receives the same PDUs in the same order while higher-priority PDUs jump over lower-priority ones. It uses the go-back-n method. We also show a priority-based semi-total ordering (PriSO) protocol which uses selective retransmission. Although entities may receive PDUs with the same priority in different orders, fewer PDUs are rejected when some PDU is lost. The protocols use a one-channel system like the Ethernet.

In section 2, we discuss levels of correct receipt in broadcast communication. In section 3, we define a one-channel (1C) service. In section 4, we discuss the priority-based ordering of PDUs. In section 5, we briefly present the data transmission procedure of the TO protocol. In sections 6 and 7, based on the TO protocol, we discuss PriTO and PriSO protocols by using the 1C service. In section 8, we present protocols which provide less correctness but lower delay time. In section 9, we present the evaluation.

2 Levels of Correct Receipt
A communication system $M$ is composed of multiple entities. A cluster $C$ [20, 21] is a subset of $M$ ($C \subseteq M$), which is an extension of the conventional connection-oriented concept [8] between two entities, i.e. service access points (SAPs), to multiple ones. A cluster is established among multiple SAPs, and then PDUs are communicated among multiple entities through the SAPs in the cluster. There are three levels [20, 21] of correct receipt among multiple entities, i.e. accepted, pre-acknowledged, and acknowledged. Here, suppose that a cluster $C$ is composed of $n$ entities $E_1, \ldots, E_n$.
(1) When a PDU $p$ arrives at $E_j$, $p$ is accepted by $E_j$.
(2) When $E_j$ knows that every entity in $C$ has accepted $p$, $p$ is pre-acknowledged by $E_j$.
(3) When $E_j$ knows that every entity in $C$ has pre-acknowledged $p$, $p$ is acknowledged by $E_j$.

At (2), although $E_j$ knows that every entity in $C$ has accepted $p$, some entity $E_i$ still may not know that another entity has accepted $p$. For example, $E_i$ has not yet received the acknowledgment of $p$ from some entity $E_h$. (3) represents the highest level of correctness.

There are two orthogonal issues, i.e. correct receipt and delay time. The first one is how correctly PDUs are delivered to the destinations. There are three levels, i.e. accepted, pre-acknowledged, and acknowledged, as presented in this section. The other issue is how promptly PDUs with higher priority are delivered. Some systems may require more correctness and lower delay time, and other systems like voice systems may require lower delay but less correctness. In this paper, we discuss two cases, i.e. correctness-oriented and delay-oriented systems. In the former, only acknowledged PDUs are delivered in priority order. In the latter, PDUs with higher priority are delivered to the upper layer even if they are not acknowledged yet, i.e. only accepted or pre-acknowledged. In this paper, we assume that entities stop by failure, and the cluster is closed if some entity fails.
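The three levels of correct receipt can be illustrated with a minimal Python sketch. It classifies one PDU at one entity $E_j$ from what $E_j$ currently knows. The names `Level` and `receipt_level`, and the representation of knowledge as sets of entity identifiers, are assumptions made for illustration only; they are not part of the protocols discussed here.

```python
from enum import IntEnum


class Level(IntEnum):
    NOT_RECEIVED = 0
    ACCEPTED = 1
    PRE_ACKNOWLEDGED = 2
    ACKNOWLEDGED = 3


def receipt_level(cluster, arrived, known_accepted_by, known_preacked_by):
    """Level of a PDU p at an entity E_j.

    cluster           -- set of all entity ids in the cluster C
    arrived           -- True if p has arrived at E_j
    known_accepted_by -- entities that E_j knows have accepted p
    known_preacked_by -- entities that E_j knows have pre-acknowledged p
    """
    if not arrived:
        return Level.NOT_RECEIVED
    if not cluster <= known_accepted_by:
        return Level.ACCEPTED            # level (1)
    if not cluster <= known_preacked_by:
        return Level.PRE_ACKNOWLEDGED    # level (2)
    return Level.ACKNOWLEDGED            # level (3)
```

For a cluster `{1, 2, 3}`, a PDU that $E_j$ knows everyone has accepted but only $E_1$ and $E_2$ have pre-acknowledged is at the pre-acknowledged level. A correctness-oriented system would deliver it only once it reaches `Level.ACKNOWLEDGED`.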
3 One-Channel (1C) Service
Communication service is modeled by using logs of PDUs. A log $L$ is a sequence $\langle p_1 \ldots p_m ]$ of $m$ ($\ge 0$) PDUs, where $p_i$ is the $i$-th PDU ($i = 1, \ldots, m$), $p_1$ is the top, $top(L)$, and $p_m$ is the last, $last(L)$. $p_i$ precedes $p_j$ ($p_i \rightarrow_L p_j$) in $L$ if $i < j$. $L_i$ denotes the subsequence $\langle p_1 \ldots p_{i-1}\ p_i ]$ of $L$ ($1 \le i \le m$). Let $L_1 \,\|\, L_2$ denote the concatenation of logs $L_1$ and $L_2$.

Each entity $E_k$ has a sending log $SL_k$ and a receipt log $RL_k$. The following relations are defined among the logs [23]. $RL_k$ is order-preserved iff for every $E_j$, $E_k$ receives PDUs from $E_j$ in the sending order, i.e. if $p \rightarrow_{SL_j} q$, then $p \rightarrow_{RL_k} q$. $RL_k$ is content-preserved iff $RL_k$ includes all the PDUs in $SL_1, \ldots, SL_n$. $RL_j$ and $RL_k$ are content-equivalent iff both include the same PDUs. $RL_j$ and $RL_k$ are order-equivalent iff for every $p$ and $q$ in both $RL_j$ and $RL_k$, $p \rightarrow_{RL_k} q$ iff $p \rightarrow_{RL_j} q$. $RL_k$ is preserved iff it is content- and order-preserved. $RL_j$ and $RL_k$ are equivalent iff they are content- and order-equivalent.

We present one class of less reliable broadcast communication service named the one-channel (1C) service, which is an abstraction of the Ethernet MAC service [7] and radio networks.
[Definition] A one-channel (1C) service is one where every receipt log is order-preserved and all the receipt logs are order-equivalent with each other. □

In the 1C service, every entity can receive PDUs in the same order but may fail to receive some PDUs due to buffer overflow and overruns.
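The log relations of this section can be checked mechanically. The following Python sketch is one possible reading of the definitions, under the assumption that a PDU is encoded as a `(sender, sequence number)` tuple and that logs are lists without duplicates. All function names are hypothetical.

```python
def order_preserved(rl, sending_logs):
    """RL_k is order-preserved: PDUs of each E_j appear in sending order."""
    received = set(rl)
    for sl in sending_logs.values():
        sent = set(sl)
        in_receipt_order = [p for p in rl if p in sent]
        in_sending_order = [p for p in sl if p in received]
        if in_receipt_order != in_sending_order:
            return False
    return True


def content_preserved(rl, sending_logs):
    """RL_k includes all the PDUs in SL_1, ..., SL_n."""
    all_sent = {p for sl in sending_logs.values() for p in sl}
    return all_sent <= set(rl)


def order_equivalent(rl_j, rl_k):
    """PDUs found in both logs appear in the same relative order."""
    common = set(rl_j) & set(rl_k)
    return [p for p in rl_j if p in common] == [p for p in rl_k if p in common]


def is_1c(receipt_logs, sending_logs):
    """1C service: every log order-preserved, all pairs order-equivalent."""
    return (all(order_preserved(rl, sending_logs) for rl in receipt_logs)
            and all(order_equivalent(a, b)
                    for a in receipt_logs for b in receipt_logs))
```

With `sending_logs = {1: [(1, 1), (1, 2)], 2: [(2, 1)]}`, the receipt logs `[(1, 1), (2, 1), (1, 2)]` and `[(1, 1), (1, 2)]` satisfy `is_1c` although the second one has lost a PDU and is therefore not content-preserved. This is exactly the kind of loss that the 1C service permits.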
4 Priority-Based Ordering
By giving each PDU $p$ a unique sequence number $p.SEQ$, PDU loss can be detected. If $q$ is broadcast after $p$ by an entity, $p.SEQ < q.SEQ$. The ordering properties in broadcast communication are discussed in [6, 16, 17, 20, 21, 22, 23]. These are based on a cluster (or group) concept, which is a collection of entities. The total ordering (TO) service [20, 21] is one where every entity receives all the PDUs broadcast in a cluster $C$ in the same order without PDU loss. That is, for every entity $E_k$ in $C$, $RL_k$ is preserved and is equivalent to every other receipt log. Here, let $p^j$ denote that $p$ is broadcast by $E_j$.

Each PDU $p$ has a priority $p.PRI$. For two PDUs $p$ and $q$, $p$ is higher than $q$ if $p.PRI > q.PRI$. Here, let $p_{[r]}$ denote that $p.PRI = r$. In this paper, we assume that each priority is a number ($\ge 0$). The lowest priority 0 is used only by the system.
[Definition] A log $L = \langle p_1 \ldots p_m ]$ is priority-based ordered iff for every $p$ and $q$ in $L$, (1) if $p.PRI > q.PRI$, then $p$ precedes $q$ in $L$, and (2) if $p$ and $q$ come from the same entity, $p.PRI = q.PRI$, and $p.SEQ < q.SEQ$, then $p \rightarrow_L q$. □

[Definition] $RL_j$ and $RL_k$ are priority-equivalent iff they are both priority-based ordered and content-equivalent. □

If $p.PRI = q.PRI$, and $p$ and $q$ are broadcast by different entities, any order of $p$ and $q$ is allowed in a priority-based ordered log. In priority-equivalent logs $RL_j$ and $RL_k$, two PDUs with different priorities are received in the same order. However, any receipt order of PDUs with the same priority from different entities is allowed.
[Example 4.1] Let $a_{[1]}$, $b_{[2]}$, and $c_{[1]}$ be PDUs, i.e. $a.PRI = c.PRI = 1$ and $b.PRI = 2$. Two entities $E_1$ and $E_2$ receive the PDUs as $RL_1 = \langle b_{[2]}\ a_{[1]}\ c_{[1]} ]$ and $RL_2 = \langle b_{[2]}\ c_{[1]}\ a_{[1]} ]$. $RL_1$ and $RL_2$ are priority-equivalent because $b_{[2]}$ precedes $a_{[1]}$ and $c_{[1]}$ in both $RL_1$ and $RL_2$. □
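The two definitions above translate directly into a pairwise check. The Python sketch below is illustrative only: the `PDU` record and the function names are assumptions, and because Example 4.1 does not state which entities broadcast $a$, $b$, and $c$, the usage note assumes three different senders.

```python
from collections import namedtuple

# Hypothetical record: label, broadcasting entity, p.SEQ, p.PRI
PDU = namedtuple("PDU", "name sender seq pri")


def priority_ordered(log):
    """True iff the log satisfies conditions (1) and (2) of the definition."""
    for i, p in enumerate(log):
        for q in log[i + 1:]:                  # p precedes q in the log
            if q.pri > p.pri:
                return False                   # violates condition (1)
            if (q.sender == p.sender and q.pri == p.pri
                    and q.seq < p.seq):
                return False                   # violates condition (2)
    return True


def priority_equivalent(rl_j, rl_k):
    """Both logs priority-based ordered and content-equivalent."""
    return (priority_ordered(rl_j) and priority_ordered(rl_k)
            and set(rl_j) == set(rl_k))
```

With `a = PDU("a", 1, 1, 1)`, `b = PDU("b", 2, 1, 2)`, and `c = PDU("c", 3, 1, 1)`, the logs `[b, a, c]` and `[b, c, a]` of Example 4.1 are priority-equivalent, whereas `[a, b, c]` is not priority-based ordered because $b$ is higher than $a$ but follows it.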
For a log $L$ and a PDU $p$, a priority-based insert $L \mid p$ is defined to be a log obtained by inserting $p$ into $L$ as follows.

[Priority-Based Insert (PBI)] Let $L$ be a priority-based ordered log $\langle p_1 \ldots p_m ]$ and $p$ be a PDU. A priority-based insert (PBI) $L \mid p$ is defined to be a log $\langle p_1 \ldots p_{i-1}\ p\ p_i \ldots p_m ]$ which is priority-based ordered. □

Here, the receipt log $L$ into which $p$ is inserted, which stores the PDUs not yet acknowledged in $E_k$, is called the scope of $p$ on the PBI.
[Example 4.2] Suppose that an entity $E_j$ has a receipt log $RL_j = \langle a_{[3]}\ b_{[2]}\ c_{[2]}\ d_{[1]} ]$ and accepts a PDU $e_{[2]}$. Also, $c$ and $e$ come from the same entity and $e.SEQ < c.SEQ$. $e$ is inserted into $RL_j$, i.e. $RL_j \mid e_{[2]} = \langle a_{[3]}\ b_{[2]}\ e_{[2]}\ c_{[2]}\ d_{[1]} ]$. Since $b$, $c$, and $e$ have the same priority and $e.SEQ < c.SEQ$, $e$ is inserted between $b$ and $c$ by the PBI. Here, $\langle a_{[3]}\ b_{[2]}\ c_{[2]}\ d_{[1]} ]$ is the scope of $e$ on the PBI. □
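A minimal Python sketch of the PBI follows. The definition only requires that the result is priority-based ordered, so the insert position is not unique when PDUs of equal priority come from different entities. The sketch assumes one particular choice: $p$ is placed at the end of the run of equal-priority PDUs unless a PDU of the same sender with a larger $SEQ$ is met first. The `PDU` record, the function name `pbi`, and the sender assignment in the usage note are likewise assumptions.

```python
from collections import namedtuple

# Hypothetical record: label, broadcasting entity, p.SEQ, p.PRI
PDU = namedtuple("PDU", "name sender seq pri")


def pbi(log, p):
    """Return L | p for a priority-based ordered log L (a list of PDUs)."""
    i = 0
    # Skip every PDU whose priority is strictly higher than that of p.
    while i < len(log) and log[i].pri > p.pri:
        i += 1
    # Walk the run of equal priority; stop in front of a PDU that the
    # same entity broadcast after p, so that sending order is kept.
    while i < len(log) and log[i].pri == p.pri:
        if log[i].sender == p.sender and log[i].seq > p.seq:
            break
        i += 1
    return log[:i] + [p] + log[i:]
```

Taking Example 4.2 with $a$, $b$, and $d$ assumed to come from entities 1, 2, and 4, and $c$ and $e$ from entity 3 with `c.seq = 2` and `e.seq = 1`, `pbi([a, b, c, d], e)` yields the log $\langle a\ b\ e\ c\ d ]$ given in the example.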
[Definition] A priority-based ordering (PriO) service is a service where all the receipt logs are priority-equivalent with each other. A priority-based total ordering (PriTO) service is a PriO service where all the receipt logs are equivalent. A PriO service which is not PriTO is a priority-based semi-total ordering (PriSO) service. □

[Example 4.3] Suppose that $E_1$, $E_2$, and $E_3$ broadcast PDUs $a^1_{[3]}$ and $b^1_{[3]}$, $c^2_{[2]}$ and $d^2_{[2]}$, and $e^3_{[2]}$, respectively.
(1) The PriSO service may have $RL_1 = \langle a^1_{[3]}\ b^1_{[3]}\ c^2_{[2]}\ e^3_{[2]}\ d^2_{[2]} ]$, $RL_2 = \langle a^1_{[3]}\ b^1_{[3]}\ e^3_{[2]}\ c^2_{[2]}\ d^2_{[2]} ]$, and $RL_3 = \langle a^1_{[3]}\ b^1_{[3]}\ c^2_{[2]}\ d^2_{[2]}\ e^3_{[2]} ]$. $a$ and $b$ precede $c$, $d$, and $e$, since $a$ and $b$ are higher than $c$, $d$, and $e$. Since $c$, $d$, and $e$ have the same priority, any receipt sequence of $c$, $d$, and $e$ is allowed as long as $c$ precedes $d$, since $c$ and $d$ are both broadcast by $E_2$. Hence, $RL_1$, $RL_2$, and $RL_3$ are priority-equivalent.
(2) In the PriTO service, if one entity, say $E_1$, has $RL_1 = \langle a^1_{[3]}\ b^1_{[3]}\ c^2_{[2]}\ e^3_{[2]}\ d^2_{[2]} ]$, every other entity receives the same PDUs in the same priority-based order. That is, $RL_1$, $RL_2$, and $RL_3$ are equivalent. □
In this paper, we design PriSO and PriTO protocols by using the 1C service. Here, suppose that an entity fails to receive a PDU $p$. In one method, every entity rejects both $p$ and the PDUs following $p$, i.e. the go-back-n method. Since the 1C service is used as the underlying service, all the receipt logs obtained by the PBI are equivalent. This is a PriTO service. In the other method, only $p$ is broadcast again, i.e. selective retransmission. One problem is how to insert $p$ into the receipt log. By using the PBI, all the receipt logs can be made priority-equivalent. This is a PriSO service. It is easy to adopt selective retransmission to realize the PriSO service by using the 1C service.
[Example 4.4] Suppose that $E_1$, $E_2$, and $E_3$ broadcast PDUs $a_{[5]}$ and $c_{[3]}$, $b_{[4]}$ and $d_{[3]}$, and $e_{[3]}$ and $f_{[2]}$, respectively, as shown in Figure 1. $E_1$, $E_2$, and $E_3$ fail to receive $c$, $d$, and $e$, respectively.
(1) One method is to reject all the PDUs received after the lost PDU, i.e. the go-back-n method. $E_1$ rejects $d$, $e$, and $f$, $E_2$ rejects $c$, $e$, and $f$, and $E_3$ rejects $c$, $d$, and $f$. Here, all the entities have the same receipt log $\langle a_{[5]}\ b_{[4]} ]$. Then, the rejected PDUs are rebroadcast. Since the 1C service is used, every entity receives the PDUs in the same order, say $f$, $d$, $c$, and $e$. That is, $(((\langle a_{[5]}\ b_{[4]} ] \mid f_{[2]}) \mid d_{[3]}) \mid c_{[3]}) \mid e_{[3]} = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ c_{[3]}\ e_{[3]}\ f_{[2]} ]$. All the receipt logs are both priority-equivalent and order-equivalent. This is a PriTO service.
(2) In the other method, only the lost PDUs, i.e. $c$, $d$, and $e$, are selectively rebroadcast, and no accepted PDUs are rejected. Entities discard any duplicate PDUs they receive. $E_1$, $E_2$, and $E_3$ receive $c$, $d$, and $e$, respectively, and $RL_1 \mid c_{[3]} = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ e_{[3]}\ c_{[3]}\ f_{[2]} ]$, $RL_2 \mid d_{[3]} = \langle a_{[5]}\ b_{[4]}\ c_{[3]}\ e_{[3]}\ d_{[3]}\ f_{[2]} ]$, and $RL_3 \mid e_{[3]} = \langle a_{[5]}\ b_{[4]}\ c_{[3]}\ d_{[3]}\ e_{[3]}\ f_{[2]} ]$. $RL_1$, $RL_2$, and $RL_3$ are priority-equivalent but are not order-equivalent. This is a PriSO service. □

  $SL_1 = \langle a_{[5]}\ c_{[3]} ]$      $RL_1 = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ e_{[3]}\ f_{[2]} ]$
  $SL_2 = \langle b_{[4]}\ d_{[3]} ]$      $RL_2 = \langle a_{[5]}\ b_{[4]}\ c_{[3]}\ e_{[3]}\ f_{[2]} ]$
  $SL_3 = \langle e_{[3]}\ f_{[2]} ]$      $RL_3 = \langle a_{[5]}\ b_{[4]}\ c_{[3]}\ d_{[3]}\ f_{[2]} ]$

  (1)  $RL_1 = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ c_{[3]}\ e_{[3]}\ f_{[2]} ]$
       $RL_2 = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ c_{[3]}\ e_{[3]}\ f_{[2]} ]$
       $RL_3 = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ c_{[3]}\ e_{[3]}\ f_{[2]} ]$
  (2)  $RL_1 = \langle a_{[5]}\ b_{[4]}\ d_{[3]}\ e_{[3]}\ c_{[3]}\ f_{[2]} ]$
       $RL_2 = \langle a_{[5]}\ b_{[4]}\ c_{[3]}\ e_{[3]}\ d_{[3]}\ f_{[2]} ]$
       $RL_3 = \langle a_{[5]}\ b_{[4]}\ c_{[3]}\ d_{[3]}\ e_{[3]}\ f_{[2]} ]$

Figure 1: PriTO (1) and PriSO (2) Services
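The two recovery methods of Example 4.4 can be replayed with a short Python sketch. It is a simulation under stated assumptions, not the protocols themselves: the `PDU` record and helper names are hypothetical, the sequence numbers are chosen for illustration, and `pbi` breaks ties between equal-priority PDUs of different entities by appending to the end of the equal-priority run, which is one choice the PBI definition permits.

```python
from collections import namedtuple

PDU = namedtuple("PDU", "name sender seq pri")   # hypothetical record


def pbi(log, p):
    """Priority-based insert L | p with an assumed tie-breaking rule."""
    i = 0
    while i < len(log) and log[i].pri > p.pri:
        i += 1
    while i < len(log) and log[i].pri == p.pri:
        if log[i].sender == p.sender and log[i].seq > p.seq:
            break
        i += 1
    return log[:i] + [p] + log[i:]


def names(log):
    return "".join(p.name for p in log)


a, c = PDU("a", 1, 1, 5), PDU("c", 1, 2, 3)      # broadcast by E1
b, d = PDU("b", 2, 1, 4), PDU("d", 2, 2, 3)      # broadcast by E2
e, f = PDU("e", 3, 1, 3), PDU("f", 3, 2, 2)      # broadcast by E3

# (1) Go-back-n: every entity falls back to <a b] and then receives the
# rebroadcast PDUs in the same 1C order f, d, c, e.
go_back_n = [a, b]
for p in (f, d, c, e):
    go_back_n = pbi(go_back_n, p)

# (2) Selective retransmission: each entity inserts only its lost PDU.
rl1 = pbi([a, b, d, e, f], c)
rl2 = pbi([a, b, c, e, f], d)
rl3 = pbi([a, b, c, d, f], e)

print(names(go_back_n), names(rl1), names(rl2), names(rl3))
```

Under these assumptions the go-back-n log is `abdcef` at every entity, while selective retransmission leaves `abdecf`, `abcedf`, and `abcdef`. The three logs differ only in the order of the priority-3 PDUs, matching part (2) of Figure 1.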
5 Total Ordering (TO) Protocol
We briefly present a data transmission procedure which provides the total ordering (TO) service [20, 21] for a cluster $C$ by using the 1C service. Here, suppose that $C$ is composed of $n$ entities $E_1, \ldots, E_n$ ($n > 1$).
A. Variables
Each PDU $p^j$ includes the following control information. $p^j.ACK_k$ is the sequence number of the PDU which $E_j$ expects to receive next from $E_k$ ($k = 1, \ldots, n$). $p^j.BUF$ denotes the number of available buffers in $E_j$.

Each entity $E_k$ has variables $SEQ$, $REQ_j$, $AL_{jh}$, and $BUF_j$ ($h, j = 1, \ldots, n$). $SEQ$ is the sequence number of the PDU which $E_k$ expects to broadcast next. $REQ_j$ is the sequence number of the PDU which $E_k$ expects to receive next from $E_j$. $AL_{jh}$ denotes the sequence number of the PDU which $E_k$ knows $E_h$ expects to receive next from $E_j$. $BUF_j$ is the number of available buffers in $E_j$ as known by $E_k$. Let $minAL_j$ denote the minimum of $AL_{j1}, \ldots, AL_{jn}$. This means that $E_k$ knows that all the entities have already received every $q^j$ where $q^j.SEQ < minAL_j$. Let $minBUF$ denote the minimum of $BUF_1, \ldots, BUF_n$. Every entity in the cluster knows the initial values of these variables when the cluster is established [20, 21].

Each receipt log $RL_k$ can be divided into sublogs $ARL_k$, $PRL_k$, and $RRL_k$ in which acknowledged, pre-acknowledged, and accepted PDUs are stored, respectively.
B. Accept and transmission
$E_k$ accepts $p^j$ on receipt of $p^j$, and broadcasts $q$ by the following procedures. Here, $W$ and $P$ ($\ge 1$) are constants. $W$ gives the maximum window size. In the 1C service, PDUs may be lost. $E_k$ can detect PDU loss by checking the sequence numbers. The go-back-n retransmission scheme is used.

[Accept Procedure]
if ($p^j.SEQ = REQ_j$ and $AL_{hj} \le p^j.ACK_h \le REQ_h$ for $h = 1, \ldots, n$) then {
    $REQ_j := REQ_j + 1$; $BUF_j := p^j.BUF$;
    $AL_{hj} := p^j.ACK_h$ for $h = 1, \ldots, n$;
    enqueue($RRL_k$, $p^j$); } □
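The Accept Procedure can be sketched in Python as follows. This is an illustration under several assumptions: the class and field names are hypothetical, sequence numbers are assumed to start at 1, the initial buffer counts are set to 0 as a placeholder, and the comparison operators between $AL_{hj}$, $p^j.ACK_h$, and $REQ_h$ are taken to be $\le$. A PDU that fails the test is simply refused, which is how a gap in the sequence numbers leads to go-back-n rejection.

```python
from dataclasses import dataclass, field


@dataclass
class PDU:
    sender: int      # j, the broadcasting entity E_j
    seq: int         # p.SEQ
    ack: dict        # ack[h] = p.ACK_h
    buf: int         # p.BUF


@dataclass
class Entity:
    k: int
    n: int
    req: dict = field(init=False)    # req[j] = REQ_j
    al: dict = field(init=False)     # al[(h, j)] = AL_hj
    buf: dict = field(init=False)    # buf[j] = BUF_j
    rrl: list = field(init=False)    # RRL_k, the accepted PDUs

    def __post_init__(self):
        ids = range(1, self.n + 1)
        self.req = {j: 1 for j in ids}
        self.al = {(h, j): 1 for h in ids for j in ids}
        self.buf = {j: 0 for j in ids}
        self.rrl = []

    def accept(self, p):
        """Apply the Accept Procedure; return True iff p is accepted."""
        j = p.sender
        ids = range(1, self.n + 1)
        ok = p.seq == self.req[j] and all(
            self.al[(h, j)] <= p.ack[h] <= self.req[h] for h in ids)
        if not ok:
            return False             # out of sequence: p is not accepted
        self.req[j] += 1
        self.buf[j] = p.buf
        for h in ids:
            self.al[(h, j)] = p.ack[h]
        self.rrl.append(p)           # enqueue(RRL_k, p)
        return True
```

For example, an entity created as `Entity(k=1, n=2)` accepts `PDU(2, 1, {1: 1, 2: 1}, 8)`, refuses a following PDU with `seq=3` because the PDU with `seq=2` has not been received, and accepts the PDU with `seq=2` once it arrives.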
[Transmission Procedure]
if ($minAL_k \le SEQ < minAL_k + \min(W,\ minBUF / (P * n))$) then {
    $q^k.SEQ := SEQ$; $SEQ := SEQ + 1$;
    $q^k.ACK_j := REQ_j$ for $j = 1, \ldots, n$;
    $q^k.BUF :=$ current number of buffers available in $E_k$;
    enqueue($SL_k$, $q^k$); broadcast($q^k$); }
else $E_k$ waits until enough buffers are available; □

C. Pre-acknowledgment and acknowledgment

On acceptance of $q^j$, $p$ in $RRL_k$ is pre-acknowledged if it satisfies the following condition. Here, $q$ is said to pre-acknowledge $p$.

[Pre-acknowledgment (PACK) Procedure]
while ($p^j = top(RRL_k)$ and $p^j.SEQ < minAL_j$) {
    $p^j := dequeue(RRL_k)$; enqueue($PRL_k$, $p^j$);
    $p^j.PRT := q$;
    if ($j = k$) then { /* $p^j$ is broadcast by $E_k$ */
        $p := dequeue(SL_k)$; /* $p$ and $p^j$ are identical */
        discard($p$); } } □

[Acknowledgment (ACK) Procedure]
if ($p$ is pre-acknowledged) then {
    while ($top(PRL_k).PRT = p$) {
        $r := dequeue(PRL_k)$; enqueue($ARL_k$, $r$); /* passed to the upper layer */ } } □

6 Priority-Based Total Ordering (PriTO) Protocol

Based on the TO protocol, we present two data transmission procedures to provide the PriTO service on the 1C service.

6.1 Basic PriTO (BPriTO) Protocol

Each PDU $p$ is accepted, pre-acknowledged, and acknowledged by $E_k$ as follows.

[BPriTO Protocol]
(1) When $p$ is accepted, $p$ is enqueued into $RRL_k$.
(2) Then, $p$ is pre-acknowledged by the TO PACK procedure on acceptance of a PDU $q$. Here, $p$ is inserted into $PRL_k$ by the PBI, i.e. $PRL_k \mid p$, and $p.PRT = q$.
(3) If the PDU $q$ denoted by $p.PRT$ is pre-acknowledged, i.e. $q$ is in $PRL_k$, for $p = top(PRL_k)$, $p$ is dequeued from $PRL_k$ and enqueued into $ARL_k$, i.e. $p$ is passed to the upper layer. □

At (2), the scope of $p$ is $PRL_k$, which is a priority-based ordered log of the PDUs pre-acknowledged in $E_k$. Suppose that $p$ is pre-acknowledged when $q$ is accepted by $E_k$. As in the TO protocol, $p$ is acknowledged when $q$ is pre-acknowledged at (3). If some PDU loss is detected, all the rejected PDUs are rebroadcast, i.e. go-back-n retransmission.

[Example 6.1] Suppose that $E_k$ accepts PDUs $a_{[1]}$, $b_{[3]}$, and $c_{[4]}$ as $RRL_k = \langle a_{[1]}\ b_{[3]}\ c_{[4]} ]$, and $PRL_k = ARL_k = \langle\ ]$ [Figure 2]. Let $d_{[2]}$, $e_{[2]}$, $f_{[2]}$, $g_{[1]}$, $h_{[4]}$, $i_{[2]}$, and $j_{[3]}$ be PDUs.
(1) Suppose that $E_k$ accepts $d_{[2]}$ and pre-acknowledges $a$. $a.PRT = d$. $a$ is inserted into $PRL_k$.
(2) $E_k$ accepts $e_{[2]}$ and $b$ is pre-acknowledged. $b$ is inserted into $PRL_k$, i.e. $\langle a_{[1]} ] \mid b_{[3]} = \langle b_{[3]}\ a_{[1]} ]$.
(3) $E_k$ accepts $f_{[2]}$ and $c_{[4]}$ is pre-acknowledged. $PRL_k \mid c_{[4]} = \langle c_{[4]}\ b_{[3]}\ a_{[1]} ]$.
(4) $E_k$ accepts $g_{[1]}$ and pre-acknowledges $d$. Although $a$ is acknowledged since $d$ is pre-acknowledged, i.e. $a.PRT = d$, $a$ still stays in $PRL_k$ since the PDUs preceding $a$ are not acknowledged.
(5) $E_k$ accepts $h_{[4]}$ and pre-acknowledges $e$. Here, $b$ is acknowledged but still stays in $PRL_k$ since $c$ is not acknowledged yet.
(6) $E_k$ accepts $i_{[2]}$ and pre-acknowledges $f$. Here, $c$ is acknowledged since $c.PRT = f$. Since $c$ is the top of $PRL_k$, $c$ is passed to the upper layer and then so is $b$. Here, $PRL_k = \langle d_{[2]}\ e_{[2]}\ f_{[2]}\ a_{[1]} ]$. □

[Figure 2: logs $ARL_k$ and $PRL_k$ of $E_k$ at steps (1) to (7)]

[Theorem 6.1] The BPriTO protocol provides the PriTO service.

[Proof] As explained before, all the PDUs following