Starvation-Prevented Priority-Based Total Ordering Broadcast Protocol on High-Speed Single Channel Network

Akihito Nakamura and Makoto Takizawa

Dept. of Computers and Systems Engineering, Tokyo Denki University
Ishizaka, Hatoyama, Hiki-gun, Saitama 350-03, JAPAN
e-mail: {naka, taki}@takilab.k.dendai.ac.jp

Abstract

In distributed applications like teleconferencing, various kinds of data like transactions, files, and voice have to be delivered to multiple destinations. Current high-speed networks like FDDI provide highly reliable broadcast communication. However, since the processing speed of the communication entities is slower than the transmission speed of the network, the entities may fail to receive protocol data units (PDUs) transmitted through the network due to buffer overrun. Some data units, like control data, have to be delivered to the destinations earlier than other PDUs. One approach to providing various kinds of communication over a single-channel network is to give a priority to each PDU and to deliver higher-priority PDUs to the destinations earlier than lower-priority ones. In this paper, we discuss a distributed broadcast protocol which provides priority-based receipt ordering of PDUs for the application entities by using the high-speed single-channel network in the presence of PDU loss. There is a starvation problem, i.e. lower-priority PDUs can be left waiting indefinitely in the receipt queue since higher-priority PDUs jump over lower-priority ones. We present a method by which even lower-priority PDUs are delivered to the application entities within some pre-defined time by partitioning the receipt sequence of PDUs into runs, where each run is priority-based ordered.

1 Introduction

In distributed applications like teleconferencing in groupware [9], group communication among multiple communication entities is required in addition to conventional one-to-one communication [7, 12]. Current communication technologies like radio and optical fibers [2, 3] provide high-speed, highly reliable data transmission among multiple entities. However, entities may fail to receive PDUs due to buffer overruns [8] since the transmission speed of the network is faster than the processing speed of each entity. Hence, PDU loss is considered to be the only failure in the high-speed network. In this paper, we would like to discuss how to provide atomic delivery of PDUs among multiple entities and some receipt ordering of them by using high-speed broadcast networks.

Reliable broadcast communication systems have been discussed in [4, 5, 10, 11, 13, 17, 18, 19, 20, 22, 24, 25, 26, 27]. [22] presents a reliable broadcast protocol which uses one-to-one communication. [5, 10, 13] discuss centralized protocols which use Ethernet. [11] characterizes message ordering properties in a reliable broadcast protocol using the conventional one-to-one network. [24, 25, 19, 20, 21] present a cluster concept, which is an extension of the conventional connection concept [12] to multiple service access points (SAPs), and show how to establish the cluster by using the Ethernet MAC broadcast service.

One important problem in designing protocols is which entity coordinates the cooperation of multiple entities. Most approaches [5, 10, 11, 22] adopt centralized control, where one master entity decides the correct receipt among multiple entities. In the centralized control, entities have to wait for the decision of the master entity. On the other hand, there is no master controller in the distributed control. The ISIS ABCAST protocol [4] adopts the distributed control where, for each PDU p, there exists one entity which decides the correct atomic receipt of p. We discuss a cluster-oriented distributed broadcast protocol where every entity decides the correct atomic receipt of PDUs among all the entities in the cluster, because we can take advantage of the underlying broadcast service. [24, 25, 26] present the TO (totally ordering) and OP (order-preserving) protocols on the broadcast networks. [19, 20] discuss an SPO (selectively partially order-preserving) protocol, where each entity sends each PDU p only to the subset of the entities which are the destinations of p rather than to all the entities in the cluster, and each entity receives the PDUs destined to it from an entity in the same order as that entity sent them.

In distributed applications, various kinds of data like commands, texts, images, and voice are broadcast to multiple sites. For example, a control command has to be delivered to the destinations earlier than other data. One approach to delivering more time-sensitive PDUs to the destinations earlier than less time-sensitive ones is to give a priority to each PDU. There are two kinds of priority-based transmission schemes for broadcast communication, i.e. controlled access [14, 23] and contention-based [14, 16] ones. In this paper, we discuss contention-based priority schemes among multiple entities in a cluster. In broadcast communication, it is important to consider in what order PDUs with priorities are received by each entity in the cluster. [21] presents a priority-based total ordering (PriTO) protocol, where every entity receives all the PDUs not only in the same order but also in the priority-based order. One problem is starvation (indefinite blocking), i.e. a lower-priority PDU p can be left waiting indefinitely in the receipt queue since higher-priority PDUs jump over p. In this paper, we present a PriTO protocol by which even lower-priority PDUs are delivered to the application entities within some pre-defined time by partitioning the receipt sequence of PDUs into runs, each of which is priority-based ordered; even higher-priority PDUs in one run R cannot jump over PDUs in runs preceding R. Each entity receives the same sequence of runs without starvation.

In section 2, we present a model of the broadcast communication. In section 3, we discuss the priority-based ordering of PDUs. In section 4, we discuss a protocol which provides the PriTO service by using the high-speed one-channel service. In section 5, we discuss how to resolve the starvation. Finally, we discuss the performance of the PriTO protocol in section 6.

2 Basic Concepts

2.1 Cluster

A communication system is composed of three layers, i.e. application, system, and network layers. Entities of the system layer provide reliable broadcast communication service for entities of the application layer by using the high-speed broadcast communication service provided by the network layer. A cluster $C$ [24, 25] is a set of service access points (SAPs) $S_1, \ldots, S_n$ ($n \geq 2$). Each $S_i$ is supported by a system entity $E_i$, and each application entity $A_i$ takes reliable broadcast communication service through $S_i$ ($i = 1, \ldots, n$). Here, $C$ is said to be supported by $E_1, \ldots, E_n$ (written as $C = \langle E_1, \ldots, E_n \rangle$) and to be composed of $A_1, \ldots, A_n$. In this paper, we assume that $C$ has been established by multiple entities, e.g. by using a protocol presented in [24, 25].

2.2 Correct receipt among multiple entities

There are three levels of correct receipt among multiple entities in a cluster $C = \langle E_1, \ldots, E_n \rangle$, i.e. accepted, pre-acknowledged, and acknowledged [24, 25].

(1) When a PDU $p$ arrives at $E_j$, $p$ is accepted by $E_j$.
(2) When $E_j$ knows that every entity in $C$ has accepted $p$, $p$ is pre-acknowledged by $E_j$.
(3) When $E_j$ knows that every entity in $C$ has pre-acknowledged $p$, $p$ is acknowledged by $E_j$.

At (2), although $E_j$ knows that every entity in $C$ has accepted $p$, some $E_i$ may still not know that another entity has accepted $p$. For example, $E_i$ has not yet received the acknowledgment of $p$ from some $E_h$. (3) represents the highest level of correct receipt.

The service is modeled as a set of logs. A log $L$ is a sequence of PDUs $\langle p_1 \ldots p_m ]$, where $p_1$ and $p_m$ are the top and the last, denoted by $top(L)$ and $last(L)$, respectively. In $L$, $p_i$ precedes $p_j$ ($p_i \rightarrow_L p_j$) if $i < j$. Let $L_1$ be a log $\langle q_1 \ldots q_n ]$. $L \| L_1$ denotes the concatenation of $L$ and $L_1$, i.e. $\langle p_1 \ldots p_m\ q_1 \ldots q_n ]$. Let $L|_i^j$ ($i \leq j$) denote the subsequence $\langle p_i\ p_{i+1} \ldots p_{j-1}\ p_j ]$ of $L$.

Each $E_i$ has a sending log $SL_i$ and a receipt log $RL_i$, which are the sequences of PDUs sent and received by $E_i$, respectively. There are the following relations among the logs.

(1) $RL_i$ is order-preserved iff for every $E_j$, $p \rightarrow_{RL_i} q$ if $p \rightarrow_{SL_j} q$. $RL_i$ is information-preserved iff $RL_i$ includes all the PDUs in $SL_1, \ldots, SL_n$.
(2) $RL_i$ and $RL_j$ are information-equivalent iff $RL_i$ and $RL_j$ include the same PDUs. $RL_i$ and $RL_j$ are order-equivalent iff for every pair of PDUs $p$ and $q$ included in both $RL_i$ and $RL_j$, $p \rightarrow_{RL_i} q$ iff $p \rightarrow_{RL_j} q$.
(3) $RL_i$ is preserved iff $RL_i$ is both order- and information-preserved. $RL_i$ and $RL_j$ are equivalent iff $RL_i$ and $RL_j$ are both order- and information-equivalent.

[Definition] A one-channel (1C) service is one where every receipt log is order-preserved and order-equivalent. □

The 1C service is an abstraction of the services provided by high-speed networks [2]. Although every entity receives PDUs in the same order, each entity may fail to receive PDUs due to buffer overruns.

[Definition] An order-preserved (OP) service is one where every receipt log is preserved. A total ordering (TO) service is an OP service where every receipt log is order-equivalent with each other. □

In the TO service, every entity receives the same PDUs in the same order without any PDU loss.
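The log relations of this section can be checked mechanically. The following is a minimal Python sketch, not part of the protocol itself: logs are modeled as lists of unique PDU names, all function names are hypothetical, and order-preservation is assumed to be checked only for the PDUs that actually appear in the receipt log (a lost PDU does not violate it).

```python
from itertools import combinations

def precedes(log, p, q):
    """p ->_L q: p appears before q in log."""
    return log.index(p) < log.index(q)

def order_preserved(rl, sending_logs):
    """Every sender's order is kept among the PDUs that reached rl."""
    for sl in sending_logs:
        kept = [p for p in sl if p in rl]
        # combinations() yields (p, q) with p before q in the sending log
        if not all(precedes(rl, p, q) for p, q in combinations(kept, 2)):
            return False
    return True

def information_preserved(rl, sending_logs):
    """rl contains every PDU of every sending log."""
    return all(p in rl for sl in sending_logs for p in sl)

def information_equivalent(rl1, rl2):
    return set(rl1) == set(rl2)

def order_equivalent(rl1, rl2):
    """PDUs present in both logs appear in the same relative order."""
    common = [p for p in rl1 if p in rl2]
    return all(precedes(rl2, p, q) for p, q in combinations(common, 2))

def is_1c(receipt_logs, sending_logs):
    """1C: every receipt log is order-preserved and order-equivalent."""
    return (all(order_preserved(rl, sending_logs) for rl in receipt_logs)
            and all(order_equivalent(a, b)
                    for a, b in combinations(receipt_logs, 2)))

def is_to(receipt_logs, sending_logs):
    """TO: 1C plus no PDU loss in any receipt log."""
    return (is_1c(receipt_logs, sending_logs)
            and all(information_preserved(rl, sending_logs)
                    for rl in receipt_logs))

SL = [["a", "b"], ["c"]]   # hypothetical sending logs of two entities
RL1 = ["a", "c", "b"]      # this entity received everything
RL2 = ["a", "b"]           # this entity lost c through a buffer overrun
```

With these hypothetical logs, `is_1c([RL1, RL2], SL)` holds while `is_to([RL1, RL2], SL)` does not, which is the gap between the 1C and TO services that the protocol has to close.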

3 Priority-Based Cluster Service

Let $p$ and $q$ denote PDUs. Each $p$ has a unique sequence number $p.SEQ$ and a priority $p.PRI$. If $p$ is broadcast after $q$, $p.SEQ > q.SEQ$. If $p$ has higher priority than $q$, $p.PRI > q.PRI$. Let p[r] denote that $p$ has priority $r$, i.e. $p.PRI = r$. In this paper, $r > 0$, and priority 0 is used by the system. The notation $p^i$ is used to explicitly denote that $p$ is broadcast by $E_i$. $p.SRC$ denotes the source entity of $p$.

[Definition] A log $L$ is priority-based ordered iff for every $p$ and $q$ in $L$, (1) if $p.PRI > q.PRI$, then $p \rightarrow_L q$, and (2) if $p.PRI = q.PRI$, $p.SRC = q.SRC$, and $p.SEQ < q.SEQ$, then $p \rightarrow_L q$. □

[Definition] Two logs $L_1$ and $L_2$ are priority-based equivalent iff $L_1$ and $L_2$ are information-equivalent and priority-based ordered. □

[Example] Suppose that $E_1$ broadcasts PDUs a, b, and c, $E_2$ broadcasts p and q, and $E_3$ broadcasts x, y, and z. The receipt logs $RL_1$, $RL_2$, and $RL_3$ and the sending logs $SL_1$, $SL_2$, and $SL_3$ of $E_1$, $E_2$, and $E_3$ are shown in Figure 1. $RL_1$, $RL_2$, and $RL_3$ are priority-based equivalent because the receipt logs include the same PDUs, which are priority-based ordered. Note that x and z are received in the sending order because they are broadcast by $E_3$ and have the same priority. □

    SL1 = ⟨ a[1] b[2] c[3] ]      RL1 = ⟨ c[3] b[2] y[2] p[2] a[1] x[1] q[1] z[1] ]
    SL2 = ⟨ p[2] q[1] ]           RL2 = ⟨ c[3] y[2] b[2] p[2] x[1] q[1] z[1] a[1] ]
    SL3 = ⟨ x[1] y[2] z[1] ]      RL3 = ⟨ c[3] p[2] b[2] y[2] q[1] x[1] z[1] a[1] ]

Figure 1: Priority-based equivalent

[Definition] Let $R$ be a subsequence of a log $L$, i.e. $R = L|_i^j$. $R$ is a run in $L$ (written as $R \sqsubseteq L$) if $R$ is priority-based ordered. □

In Figure 2, $R_1$, $R_2$, and $R_3$ are runs of $L$. $R_4$ is not a run of $L$ because e.PRI (= 1) < f.PRI (= 2), i.e. $R_4$ is not priority-based ordered.

    L  = ⟨ a[3] b[2] c[2] d[1] e[1] f[2] g[1] h[1] ]
    R1 = $L|_1^3$ = ⟨ a[3] b[2] c[2] ]           $R_1 \sqsubseteq L$
    R2 = $L|_2^5$ = ⟨ b[2] c[2] d[1] e[1] ]      $R_2 \sqsubseteq L$
    R3 = $L|_6^8$ = ⟨ f[2] g[1] h[1] ]           $R_3 \sqsubseteq L$
    R4 = $L|_5^8$ = ⟨ e[1] f[2] g[1] h[1] ]      $R_4 \not\sqsubseteq L$

Figure 2: Runs

A log $L$ is run-partitioned into runs $R_1, \ldots, R_k$ if $L = (R_1 \| \ldots \| R_k)$. Figure 3 shows run-partitions of logs $L_1$ and $L_2$.

    L1 = ⟨ a[3] b[2] c[2] d[1] e[1] f[2] g[1] h[1] ]
         R11 = ⟨ a[3] b[2] c[2] d[1] e[1] ],  R12 = ⟨ f[2] g[1] h[1] ]
    L2 = ⟨ a[3] c[2] b[2] e[1] d[1] f[2] h[1] g[1] ]
         R21 = ⟨ a[3] c[2] b[2] e[1] d[1] ],  R22 = ⟨ f[2] h[1] g[1] ]

Figure 3: Run-partition

[Definition] Let $(R_{11} \| \ldots \| R_{1h})$ and $(R_{21} \| \ldots \| R_{2k})$ be run-partitions of logs $L_1$ and $L_2$, respectively. $L_1$ and $L_2$ are run-equivalent iff (1) $h = k$ (= $m$), and (2) $R_{1i}$ and $R_{2i}$ are priority-based equivalent for $i = 1, \ldots, m$. □

[Example] In Figure 3, the run-partitions $(R_{11} \| R_{12})$ of $L_1$ and $(R_{21} \| R_{22})$ of $L_2$ are run-equivalent because $R_{11}$ and $R_{21}$, and $R_{12}$ and $R_{22}$, are priority-based equivalent, respectively. □

There are two kinds of broadcast communication service based on the priority.

[Definition] The service of a cluster $C$ is a priority-based ordering (PriO) service iff every receipt log in $C$ is information-preserved and the receipt logs are run-equivalent with each other. The PriO service of $C$ is a priority-based total ordering (PriTO) service iff all receipt logs in $C$ are order-equivalent. □

[Example] Examples of PriO and PriTO services for a cluster $C = \langle E_1, E_2, E_3 \rangle$ are shown in (1) and (2) of Figure 4, respectively. Suppose that $E_1$ broadcasts PDUs a and b, $E_2$ broadcasts c and d, and $E_3$ broadcasts e and f. In (1) and (2), every entity receives all the PDUs broadcast in $C$. Hence, $RL_1$, $RL_2$, and $RL_3$ are information-preserved. $R_{11}$, $R_{21}$, and $R_{31}$ are priority-based equivalent, and so are $R_{12}$, $R_{22}$, and $R_{32}$. Hence, the receipt logs are run-equivalent in (1) and (2). In the PriO service (1), same-priority PDUs may be received in any order, e.g. c and e, and a and f, are received by the entities in different orders. On the other hand, all the entities receive the same PDUs in the same order in the PriTO service (2). □

[Figure 4: PriO and PriTO. Panel (1) shows the PriO service and panel (2) the PriTO service; each panel gives the sending logs SL1, SL2, SL3 and the receipt logs RL1, RL2, RL3 partitioned into two runs.]
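The definitions of this section translate directly into predicates over logs. The sketch below, in Python, encodes the logs of Figure 1 and checks priority-based ordering, priority-based equivalence, and run-equivalence. The `PDU` tuple, the helper names, and the way the Figure 1 logs are split into two runs are illustrative assumptions, not part of the original definitions.

```python
from collections import namedtuple
from itertools import combinations

PDU = namedtuple("PDU", "name pri src seq")

def is_priority_ordered(log):
    """Conditions (1) and (2) of the priority-based ordering definition."""
    for first, second in combinations(log, 2):   # first precedes second
        if second.pri > first.pri:
            return False                          # violates (1)
        if (second.pri == first.pri and second.src == first.src
                and second.seq < first.seq):
            return False                          # violates (2)
    return True

def priority_equivalent(l1, l2):
    """Same PDUs in both logs, and both logs priority-based ordered."""
    return (set(l1) == set(l2)
            and is_priority_ordered(l1) and is_priority_ordered(l2))

def run_equivalent(partition1, partition2):
    """Same number of runs, and the i-th runs are priority-based equivalent."""
    return (len(partition1) == len(partition2)
            and all(priority_equivalent(r1, r2)
                    for r1, r2 in zip(partition1, partition2)))

# Sending logs of Figure 1; each token is a PDU name followed by its priority.
P = {}
for src, sl in {"E1": "a1 b2 c3", "E2": "p2 q1", "E3": "x1 y2 z1"}.items():
    for seq, tok in enumerate(sl.split(), start=1):
        P[tok[0]] = PDU(tok[0], int(tok[1]), src, seq)

def log(names):
    return [P[n] for n in names]

RL1, RL2, RL3 = log("cbypaxqz"), log("cybpxqza"), log("cpbyqxza")
```

For instance, `priority_equivalent(RL1, RL2)` holds, whereas a log that places z before x, such as `log("cbypazqx")`, is not priority-based ordered because both PDUs come from $E_3$ with the same priority.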

4 Priority-Based Total Ordering (PriTO) Protocol

We present a PriTO protocol for a cluster $C = \langle E_1, \ldots, E_n \rangle$ which uses the 1C service.

4.1 Transmission and receipt

Each $E_i$ has the following variables to send and receive PDUs by using the 1C service ($j = 1, \ldots, n$).

• $SEQ$ = sequence number of the PDU which $E_i$ would transmit next.
• $REQ_j$ = sequence number of the PDU which $E_i$ expects to receive next from $E_j$.
• $AL_{jk}$ = sequence number of the PDU which $E_i$ knows that $E_k$ expects to receive next from $E_j$ ($k = 1, \ldots, n$).
• $minAL_j$ = minimum of $AL_{j1}, \ldots, AL_{jn}$.
• $PAL_{jk}$ = sequence number of the PDU which $E_i$ knows that $E_k$ expects to pre-acknowledge next from $E_j$.
• $minPAL_j$ = minimum of $PAL_{j1}, \ldots, PAL_{jn}$.
• $PEQ_j$ = sequence number of the PDU which $E_i$ expects to pre-acknowledge next from $E_j$.

Each PDU $p$ from $E_i$ has the following control information.

• $p.SRC$ = $E_i$, i.e. the entity which sends $p$.
• $p.SEQ$ = sequence number of $p$.
• $p.ACK_j$ = sequence number of the PDU which $E_i$ expects to receive next from $E_j$ ($j = 1, \ldots, n$).
• $p.PRI$ = priority of $p$.

Here, enqueue(L, p) denotes a procedure which puts p at the tail of L. broadcast(p) is a procedure which sends a service data unit (SDU) of p at the 1C SAP. delete(L, p) is a procedure which removes p from L. $E_i$ broadcasts $p$ according to the following transmission procedure.

[Transmission]
    p.SEQ := SEQ; SEQ := SEQ + 1;
    p.ACK_j := REQ_j (j = 1,...,n);
    enqueue(SL_i, p); broadcast(p); □

When $p$ is received, $p$ jumps over the lower-priority PDUs already in the receipt log. Here, an insertion operation $<$, named priority-based insert, is introduced as follows.

[Priority-based insert] Let $L$ be a priority-based ordered log $\langle p_1 \ldots p_m ]$, and $p$ be a PDU. A priority-based insert $L < p$ is defined to be the priority-based ordered log $\langle p_1 \ldots p_{i-1}\ p\ p_i \ldots p_m ]$ where $p_{i-1}.PRI \geq p.PRI > p_i.PRI$. □

For example, ⟨ a[4] b[3] c[1] ] < d[2] is ⟨ a[4] b[3] d[2] c[1] ], and ⟨ a[4] b[3] c[1] ] < e[3] is ⟨ a[4] b[3] e[3] c[1] ].

On receipt of $p$, $E_i$ accepts $p$ according to the following accept procedure. $E_i$ creates a pseudo-PDU $p^*$ which is the same as $p$ except that $p^*$ has no data. $p^*$ is given priority 0. $E_i$ has two receipt logs $RRL_i$ and $PRL_i$.

[Accept procedure]
    if (p^j.SEQ = REQ_j) {
        RRL_i < p^j*[0]; (PRL_i || RRL_i) < p^j;
        REQ_j := p^j.SEQ + 1;
        AL_hj := p^j.ACK_h (h = 1,...,n);
    } □

$p$ is priority-based inserted into the concatenation $PRL_i \| RRL_i$. Hence, the PDUs in the receipt log $PRL_i \| RRL_i$ are ordered on the basis of the priority. On the other hand, the pseudo-PDU $p^*$ of priority 0 is inserted at the tail of $RRL_i$. This means that the pseudo-PDUs are ordered in the receipt order of the PDUs. $AL$ and $minAL$ are changed each time a PDU is accepted. The PDUs in the receipt log are pre-acknowledged according to the following procedure, where a dummy means a pseudo-PDU.

[Pre-acknowledgment (PACK) procedure]
    while ( (p^j = top(RRL_i) is not a dummy) or
            ((p^j is a dummy) and (p^j.SEQ < minAL_j)) ) {
        p^j := dequeue(RRL_i); enqueue(PRL_i, p^j);
        if (p^j is a dummy) {
            PEQ_j := p^j.SEQ + 1;
            PAL_hj := p^j.ACK_h (h = 1,...,n);
        }
    } □

Each time a PDU is pre-acknowledged, $PAL$ and $minPAL$ are changed as presented in the PACK procedure. PDUs are forwarded to the application entity in the priority-based order through a log $ARL_i$ according to the following procedure.

[Acknowledgment (ACK) procedure]
    NotEnd := TRUE;
    while (NotEnd) {
        if ((p^j = top(PRL_i) is not a dummy) and (p^j.SEQ < minPAL_j)) {
            p^j := dequeue(PRL_i); enqueue(ARL_i, p^j);
            delete(PRL_i, p^j*);
        }
        else NotEnd := FALSE;
    } □
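The accept, PACK, and ACK procedures above can be exercised on the arrival sequence of Figure 5 with a small Python sketch. It is a simplification under stated assumptions rather than the protocol implementation: all PDUs are taken to come from one hypothetical source entity 1 with sequence numbers 1 to 9, the thresholds $minAL$ and $minPAL$ are passed in directly instead of being derived from the ACK fields, and the class and method names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PDU:
    name: str
    src: int
    seq: int
    pri: int
    pseudo: bool = False

def priority_insert(log, p):
    """Return log with p placed before the first PDU of strictly lower priority."""
    for i, q in enumerate(log):
        if q.pri < p.pri:
            return log[:i] + [p] + log[i:]
    return log + [p]

@dataclass
class Entity:
    req: dict                                   # REQ_j per source entity j
    prl: list = field(default_factory=list)
    rrl: list = field(default_factory=list)
    arl: list = field(default_factory=list)

    def accept(self, p):
        if p.seq != self.req[p.src]:
            return False                        # out of sequence: not accepted
        star = PDU(p.name + "*", p.src, p.seq, 0, pseudo=True)
        self.rrl = priority_insert(self.rrl, star)   # priority 0: goes to the tail
        merged = priority_insert(self.prl + self.rrl, p)
        cut = len(self.prl) + 1                 # p.pri > 0, so p lands inside PRL
        self.prl, self.rrl = merged[:cut], merged[cut:]
        self.req[p.src] = p.seq + 1
        return True

    def pack(self, min_al):
        """Move pre-acknowledged pseudo-PDUs from RRL to PRL."""
        while self.rrl and (not self.rrl[0].pseudo
                            or self.rrl[0].seq < min_al[self.rrl[0].src]):
            self.prl.append(self.rrl.pop(0))

    def ack(self, min_pal):
        """Move acknowledged PDUs from the top of PRL to ARL."""
        while (self.prl and not self.prl[0].pseudo
               and self.prl[0].seq < min_pal[self.prl[0].src]):
            p = self.prl.pop(0)
            self.arl.append(p)
            self.prl = [q for q in self.prl
                        if not (q.pseudo and (q.src, q.seq) == (p.src, p.seq))]

def names(log):
    return " ".join(q.name for q in log)

# Replay of Figure 5: each token is a PDU name followed by its priority.
e = Entity(req={1: 1})
for seq, tok in enumerate("a1 b3 c4 d2 e2 f2 g1 h4 i2".split(), start=1):
    e.accept(PDU(tok[0], 1, seq, int(tok[1])))
    e.pack({1: max(seq - 2, 1)})    # assumed: pre-acknowledgment lags 3 PDUs
    e.ack({1: max(seq - 5, 1)})     # assumed: acknowledgment lags 6 PDUs
```

Under these assumed lags, the three logs after the arrival of i[2] contain the PDUs shown in row (7) of Figure 5: only c[4] has reached `arl`, while the acknowledged b[3] still waits in `prl` behind the unacknowledged h[4].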

[Example] Figure 5 shows an example of data transmission of an entity $E_i$ in $C = \langle E_1, \ldots, E_n \rangle$ by the PriTO protocol. Here, letters from a to j denote PDUs.

(1) First, $E_i$ accepts a, b, and c, which are priority-based ordered in the receipt log, i.e. $PRL_i$. The pseudo-PDUs a*, b*, and c* of priority 0 are stored in $RRL_i$ in the receipt order. Then, d[2] arrives at $E_i$.
(2) d is inserted in $PRL_i$ and d* in $RRL_i$ by the priority-based insert as shown in (2). Here, suppose that a is pre-acknowledged. Hence, a* is moved to $PRL_i$ from $RRL_i$. Then, e[2] arrives.
(3) e and e*[0] are priority-based inserted in $PRL_i \| RRL_i$. Suppose that b is pre-acknowledged. Then, f[2] arrives.
(4) f and f*[0] are priority-based inserted in $PRL_i \| RRL_i$. Suppose that c is pre-acknowledged. Then, g[1] arrives.
(5) g and g*[0] are priority-based inserted in $PRL_i \| RRL_i$. Suppose that d is pre-acknowledged. Since a is pre-acknowledged when d is accepted, a is acknowledged. a still stays in $PRL_i$ because higher-priority PDUs are in $PRL_i$. Then, h[4] arrives.
(6) h and h*[0] are priority-based inserted in $PRL_i \| RRL_i$. Suppose that e is pre-acknowledged. b is acknowledged. Then, i[2] arrives.
(7) i and i*[0] are priority-based inserted in $PRL_i \| RRL_i$. Suppose that f is pre-acknowledged. c is pre-acknowledged when f is accepted. Hence, c is acknowledged. Since c is the top of $PRL_i$, c is moved from $PRL_i$ to $ARL_i$. Although b is acknowledged, b is not passed to the application because h is not acknowledged. Then, j[3] arrives. □

4.2 Failure

In the 1C service, some entity $E_i$ may fail to receive some PDU. $E_i$ detects the PDU loss by checking SEQ in the PDUs it receives. If $E_i$ detects that it has failed to receive a PDU from $E_j$, all the entities agree on which PDU they have failed to receive by broadcasting the information on REQ. In [24, 26, 27], $E_i$ rejects all the PDUs following the lost PDU in the receipt log, i.e. the go-back-n scheme [28]. The rejected PDUs are broadcast again.
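The loss detection described above amounts to comparing the SEQ of an arriving PDU with the expected REQ of its source. A minimal Python sketch follows; the function name, the returned labels, and the treatment of PDUs with an already-seen sequence number are assumptions made for illustration.

```python
def check_seq(req, src, seq):
    """Classify an arriving PDU by comparing its SEQ with REQ[src]."""
    expected = req[src]
    if seq == expected:
        req[src] = seq + 1          # in sequence: accept and advance REQ
        return "accept", []
    if seq > expected:
        # Gap detected: the PDUs expected..seq-1 were lost. Under go-back-n
        # the arriving PDU is rejected too, so REQ is left unchanged.
        return "loss", list(range(expected, seq))
    return "old", []                # sequence number already received

req = {2: 5}                        # hypothetical: PDU 5 is expected next from E2
first = check_seq(req, 2, 5)
second = check_seq(req, 2, 8)
```

Here `second` reports the missing sequence numbers 6 and 7, and `req[2]` stays at 6 so that the rebroadcast PDUs are accepted in order.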

5 Starvation-Prevented PriTO Protocol

In the PriTO protocol, PDUs are forwarded to the application entities in the priority-based order, i.e. higher-priority PDUs are delivered prior to lower-priority ones. One problem is that lower-priority PDUs can be left waiting indefinitely in the receipt log even if they are acknowledged. A PDU $p$ has to be left waiting in the receipt log until the higher-priority PDUs which have jumped over $p$ are acknowledged. In order to resolve the starvation problem, we adopt a method in which PDUs left waiting in the receipt log for a prefixed time are forced to be delivered to the application entity. This means that the receipt sequence of PDUs is partitioned into runs.

[Example] Suppose that an entity $E_i$ receives PDUs a[1], b[3], c[4], d[2], e[2], f[2], g[1], h[4], i[2] in this order. Suppose that a, b, and c are acknowledged on acceptance of g, h, and i, respectively. Here, $E_i$ has the receipt log shown in Figure 6 (1). a can be passed to the application entity after all the PDUs preceding a are acknowledged, provided that every PDU received subsequently has priority less than or equal to that of a. However, if PDUs of priority greater than 1 are received successively, a is not passed to the application entity although a is already acknowledged. This is the starvation problem. One way to resolve the problem is to force a to be passed to the application entity if a has not been passed within some prefixed time [Figure 6 (2)]. This means that one run ⟨ c[4] a[1] ] is created and passed to the application entity. In the cluster, every entity has to pass the PDUs to its application entity in the same order as $E_i$. In order to do so, some protocol for synchronizing the run-partitions in the cluster is required. □

Each entity $E_i$ in a cluster $C = \langle E_1, \ldots, E_n \rangle$ has a new variable $TOSEQ_h$ ($h = 1, \ldots, n$) which denotes the maximum sequence number of the timed-out PDUs from $E_h$. Initially, each $TOSEQ_h$ = NIL. When a PDU $p$ from $E_h$ is acknowledged, a timer starts for $p$.

[Run synchronization (RSYNC)]

(1) The timer for $p$ expires in $E_i$. $E_i$ stops the PACK and ACK procedures while it continues to accept PDUs. $TOSEQ_h := p.SEQ$. $E_i$ broadcasts a Run-Sync PDU $s$ where $s.TOSEQ_j = TOSEQ_j$ and $s.PEQ_j = PEQ_j$ ($j = 1, \ldots, n$). $s$ carries information on which PDUs are timed out and up to which PDUs from each entity are pre-acknowledged in $s.TOSEQ_j$ and $s.PEQ_j$, respectively.
(2) Suppose that $E_j$ receives the Run-Sync $s$ from $E_i$. $E_j$ stops the PACK and ACK procedures while PDUs are still accepted. Then, $TOSEQ_h := s.TOSEQ_h$ if $TOSEQ_h$ = NIL or $TOSEQ_h < s.TOSEQ_h$ ($h = 1, \ldots, n$). If $E_j$ finds that the timer for $q$ from $E_k$ has expired and $TOSEQ_k < q.SEQ$, then $TOSEQ_k := q.SEQ$ ($k = 1, \ldots, n$). $E_j$ broadcasts a Run-Sync-PACK PDU $sp$ where $sp.TOSEQ_h = TOSEQ_h$ and $sp.PEQ_h = PEQ_h$ ($h = 1, \ldots, n$).
(3) If each $E_k$ receives Run-Sync or Run-Sync-PACK PDUs from all the entities in $C$, $E_k$ broadcasts a Run-Sync-ACK $sa$ where $sa.TOSEQ_h = TOSEQ_h$ and $sa.PEQ_h = PEQ_h$ ($h = 1, \ldots, n$).
(4) Suppose that every $E_k$ receives the Run-Sync-ACKs from all the entities. Here, all the entities have the same $TOSEQ_j$ and $PEQ_j$ ($j = 1, \ldots, n$). First, the PACK procedure is executed and all the pre-acknowledged pseudo-PDUs are moved from $RRL_i$ to $PRL_i$. Next, while $p^k.SEQ < minPAL_k$ for $p^k = top(PRL_i)$, $p^k$ is acknowledged and $p^{k*}$ is removed. Then, the acknowledged PDUs which precede the PDU timed out last are forwarded to the application entities. Then, the PACK and ACK procedures are restarted. □

[Theorem] The PriTO protocol provides the PriTO service.

[Proof] Since the 1C service is used, every entity receives the PDUs in the same order. Suppose that some entity $E_i$ fails to receive a PDU $p$ ($p.SRC = E_j$). Since $E_i$ does not send any PDU $q$ such that $p.SEQ < q.ACK_j$, $p$ and the PDUs following $p$ are not pre-acknowledged. That is, only the PDUs preceding $p$ can be moved to $ARL_i$ in every $E_i$. After step (3) in the RSYNC procedure, every entity agrees on which PDUs are pre-acknowledged and which are timed out. Here, since each $E_i$ stops the PACK and ACK procedures, $AL$ is not changed even if $E_i$ accepts PDUs during the RSYNC procedure. Further, at (4), all the entities agree on the PDUs which are both acknowledged and timed out, and move them to the application in the priority-based order. Here, the same run is forwarded to every application. That is, every entity has the same receipt log, and the receipt logs are run-equivalent with each other. Hence, the PriTO protocol provides the PriTO service. □

         ARLi        PRLi                                                                          RRLi                          arrives
    (1)  ⟨ ]         ⟨ c[4] b[3] a[1] ]                                                            ⟨ a*[0] b*[0] c*[0] ]         d[2]
    (2)  ⟨ ]         ⟨ c[4] b[3] d[2] a[1] a*[0] ]                                                 ⟨ b*[0] c*[0] d*[0] ]         e[2]
    (3)  ⟨ ]         ⟨ c[4] b[3] d[2] e[2] a[1] a*[0] b*[0] ]                                      ⟨ c*[0] d*[0] e*[0] ]         f[2]
    (4)  ⟨ ]         ⟨ c[4] b[3] d[2] e[2] f[2] a[1] a*[0] b*[0] c*[0] ]                           ⟨ d*[0] e*[0] f*[0] ]         g[1]
    (5)  ⟨ ]         ⟨ c[4] b[3] d[2] e[2] f[2] a[1] g[1] a*[0] b*[0] c*[0] d*[0] ]                ⟨ e*[0] f*[0] g*[0] ]         h[4]
    (6)  ⟨ ]         ⟨ c[4] h[4] b[3] d[2] e[2] f[2] a[1] g[1] a*[0] b*[0] c*[0] d*[0] e*[0] ]     ⟨ f*[0] g*[0] h*[0] ]         i[2]
    (7)  ⟨ c[4] ]    ⟨ h[4] b[3] d[2] e[2] f[2] i[2] a[1] g[1] a*[0] b*[0] d*[0] e*[0] f*[0] ]     ⟨ g*[0] h*[0] i*[0] ]         j[3]

Figure 5: Data transmission in the PriTO protocol

         ARLi             PRLi                                                                     RRLi
    (1)  ⟨ c[4] ]         ⟨ h[4] b[3] d[2] e[2] f[2] i[2] a[1] g[1] a*[0] b*[0] d*[0] e*[0] f*[0] ]     ⟨ g*[0] h*[0] i*[0] ]
    (2)  ⟨ c[4] a[1] ]    ⟨ h[4] b[3] d[2] e[2] f[2] i[2] g[1] b*[0] d*[0] e*[0] f*[0] ]                ⟨ g*[0] h*[0] i*[0] ]

Figure 6: Run synchronization

[Example] Figure 7 shows an example of the run synchronization. There are three entities $E_1$, $E_2$, and $E_3$.

(1) First, $E_1$, $E_2$, and $E_3$ receive the PDUs as shown in Figure 7 (1). For example, $E_1$ accepts PDUs a, b, c, d, e, f, g, h, ... in this order, which is denoted by the sequence of the pseudo-PDUs, and a, b, c, d, e, f are already pre-acknowledged in $E_1$. a and d have been forwarded to the application entities in $E_2$ and $E_3$, but not yet in $E_1$. Suppose that a time-out (T.O.) occurs for a in $E_1$ and for b in $E_3$. $E_1$ and $E_3$ broadcast Run-Sync PDUs.
(2) $E_1$, $E_2$, and $E_3$ broadcast Run-Sync-PACK and Run-Sync-ACK PDUs. Every entity agrees that b and the PDUs preceding b are timed out by checking TOSEQ. By checking PEQ, every entity finds that the PDUs preceding h are pre-acknowledged and that all the PDUs preceding e are acknowledged. Suppose that a, d, and b are the PDUs included in the sequence of PDUs from the top to b.
(3) Then, a, d, and b are forwarded to the application entity in $E_1$. e and f are not forwarded because they are not acknowledged. c is not forwarded because it is not timed out although it is acknowledged. Since a and d are already passed, only b is passed to the application entities in $E_2$ and $E_3$. $E_1$, $E_2$, and $E_3$ have the same priority-equivalent runs, i.e. ⟨ a[5] d[4] b[1] ]. □
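The TOSEQ update rule in step (2) of the RSYNC procedure is a per-source maximum with NIL as the neutral element, which is why all entities end up with the same values regardless of the order in which the Run-Sync PDUs arrive. The Python sketch below illustrates this; NIL is modeled as `None`, and the function name and the example vectors are hypothetical.

```python
from functools import reduce

def merge_toseq(mine, received):
    """Per source entity h, keep the larger timed-out sequence number.

    None stands for NIL. The local value is replaced only if it is NIL or
    smaller than the received one, as in step (2) of the RSYNC procedure.
    """
    merged = {}
    for h, local in mine.items():
        remote = received.get(h)
        if local is None or (remote is not None and local < remote):
            merged[h] = remote
        else:
            merged[h] = local
    return merged

# Hypothetical TOSEQ vectors carried by the Run-Sync PDUs of three entities.
vectors = [
    {1: None, 2: 5, 3: 7},
    {1: 4, 2: 3, 3: None},
    {1: 2, 2: None, 3: None},
]
agreed = reduce(merge_toseq, vectors)
```

Folding the same vectors in a different order yields the same `agreed` vector, which corresponds to the agreement reached after step (3).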

6 Evaluation

The PriTO protocol is implemented on Sparc2 workstations interconnected by Ethernet. Each PriTO entity runs on one workstation. A cluster $C$ is supported by $n$ entities. In the evaluation, there are two levels of priority, i.e. 1 and 2. Each PDU is randomly assigned either priority 1 or 2 so that 10% of the PDUs have priority 2 and 90% of the PDUs have priority 1. Suppose that a PDU arrives at every entity every one time unit and every application entity takes a PDU every $H$ time units. One time unit is the time it takes each entity to process one PDU. Since the high-speed network is used, the application entities are slower than the network, i.e. $H \geq 1$. Suppose that every received PDU times out $T$ time units after the PDU is acknowledged.

Figure 8 and Figure 9 show, for $n = 3, \ldots, 10$, the average delay time for PDUs of priority 1 and 2, i.e. the average time from when each PDU arrives at the PriTO entity until when the application entity takes the PDU. Figure 8 shows the case for $T = 10$ and $H = 20$. Figure 9 shows the case for $T = 50$ and $H = 20$. Comparing Figure 9 with Figure 8 shows that lower-priority PDUs can be delivered to the application entities earlier if the time-out duration is decreased, while the delay of the higher-priority PDUs changes little. The figures show that the delay time for PDUs of priority 1 is $O(n)$.

7 Concluding Remarks

In the high-speed network, entities may fail to receive PDUs due to buffer overruns. One problem in using the priority concept is starvation, i.e. some lower-priority PDUs can be left waiting indefinitely in the receipt queue. In this paper, we have discussed a starvation-preventing broadcast protocol, named the PriTO protocol, which provides priority-based receipt ordering of PDUs by using the high-speed single-channel network. In the protocol, the receipt sequence of PDUs is partitioned into runs. The PDUs in each run are ordered according to the priority. If there exists some PDU $p$ which is already acknowledged but has been left waiting in the receipt log for a long time, $p$ is forced to be forwarded to the application entities. Here, a run including $p$ is created. Even higher-priority PDUs in a run $R$ cannot jump over PDUs in runs preceding $R$. By this scheme, every entity receives the same sequence of the same runs while starvation is prevented. With the protocol, applications where various kinds of data are broadcast in a group of entities can be easily realized.

References

[1] Abramson, N., "The ALOHA System - Another Alternative for Computer Communications," Proc. of the Fall Joint Computer Conference, Vol.37, 1970, pp.281-285.
[2] American National Standards Institute, "FDDI Token Ring Physical Layer Protocol (PHY)," ANSI X3.148, 1988.
[3] American National Standards Institute, "Twisted Pair Physical Medium Dependent (TP-PMD)," ANSI X3 T9.5, 1990.
[4] Birman, K., Schiper, A., and Stephenson, P., "Lightweight Causal and Atomic Group Multicast," ACM Trans. on Computer Systems, Vol.9, No.3, 1991, pp.272-314.
[5] Chang, J. M. and Maxemchuk, N. F., "Reliable Broadcast Protocols," ACM Trans. on Computer Systems, Vol.2, No.3, 1984, pp.251-273.
[6] Chanson, S., Neufeld, G., and Liang, L., "A Bibliography on Multicast and Group Communications," ACM SIGOPS Operating Systems Review, Vol.23, No.4, Oct. 1989.
[7] Defense Communications Agency, "DDN Protocol Handbook," Vol.1-3, NIC 50004-50005, 1985.
[8] Doeringer, W. A., Dykeman, D., Kaiserswerth, M., Meister, B. W., Rudin, H., and Williamson, R., "A Survey of Light-Weight Transport Protocols for High-Speed Networks," IEEE Trans. on Communications, Vol.38, No.11, 1990, pp.2025-2039.
[9] Ellis, C. A., Gibbs, S. J., and Rein, G. L., "Groupware," Comm. ACM, Vol.34, No.1, 1991, pp.38-58.
[10] Garcia-Molina, H. and Kogan, B., "An Implementation of Reliable Broadcast Using an Unreliable Multicast Facility," Proc. of the 7th IEEE Symp. on Reliable Distributed Systems, 1988, pp.428-437.
[11] Garcia-Molina, H. and Spauster, A., "Message Ordering in a Multicast Environment," Proc. of the 9th IEEE ICDCS, 1989, pp.354-361.
[12] International Standards Organization, "OSI - Connection Oriented Transport Protocol Specification," ISO 8073, 1986.
[13] Kaashoek, M. F. and Tanenbaum, A. S., "Group Communication in the Amoeba Distributed Operating System," Proc. of the 11th IEEE ICDCS, 1991, pp.222-230.
[14] Kurose, J. S., Schwartz, M., and Yemini, Y., "Multiple-Access Protocols and Time-Constrained Communication," ACM Computing Surveys, Vol.16, No.1, 1984, pp.43-70.
[15] Lamport, L., "Time, Clocks, and the Ordering of Events in Distributed Systems," Comm. ACM, Vol.21, No.7, 1978, pp.558-565.
[16] Liu, M. and Papantoni-Kazakos, P., "A Random Access Algorithm for Data Networks Carrying High Priority Traffic," Proc. of the 9th IEEE INFOCOM, 1990, pp.1087-1094.
[17] Luan, S. W. and Gligor, V. D., "A Fault-Tolerant Protocol for Atomic Broadcast," IEEE Trans. on Parallel and Distributed Systems, Vol.1, No.3, 1990, pp.271-285.
[18] Melliar-Smith, P. M., Moser, L. E., and Agrawala, V., "Broadcast Protocols for Distributed Systems," IEEE Trans. on Parallel and Distributed Systems, Vol.1, No.1, 1990, pp.17-25.
[19] Nakamura, A. and Takizawa, M., "Reliable Broadcast Protocol for Selectively Ordering PDUs," Proc. of the 11th IEEE ICDCS, 1991, pp.239-246.
[20] Nakamura, A. and Takizawa, M., "Design of Reliable Broadcast Communication Protocol for Selectively Partially Ordered PDUs," Proc. of the IEEE COMPSAC'91, 1991, pp.673-679.
[21] Nakamura, A. and Takizawa, M., "Priority-Based Total and Semi-Total Ordering Broadcast Protocols," Proc. of the 12th IEEE ICDCS, 1992, pp.178-185.
[22] Schneider, F. B., Gries, D., and Schlichting, R. D., "Fault-Tolerant Broadcasts," Science of Computer Programming, Vol.4, No.1, 1984, pp.1-15.
[23] Sharrock, S. M. and Du, D. H. C., "Efficient CSMA/CD-Based Protocols for Multiple Priority Classes," IEEE Trans. on Computers, Vol.38, No.7, 1989, pp.943-954.
[24] Takizawa, M., "Cluster Control Protocol for Highly Reliable Broadcast Communication," Proc. of the IFIP Conf. on Distributed Processing, 1987, pp.431-445.
[25] Takizawa, M., "Design of Highly Reliable Broadcast Communication Protocol," Proc. of IEEE COMPSAC'87, 1987, pp.731-740.
[26] Takizawa, M. and Nakamura, A., "Partially Ordering Broadcast (PO) Protocol," Proc. of the 9th IEEE INFOCOM, 1990, pp.357-364.
[27] Takizawa, M. and Nakamura, A., "Reliable Broadcast Communication," Proc. of IPSJ Int'l. Conf. on Information Technology (InfoJapan), 1990, pp.325-332.
[28] Tanenbaum, A. S., "Computer Networks (2nd ed.)," Englewood Cliffs, NJ: Prentice-Hall, 1989.

[Figure 7: example of the run synchronization, showing ARLi, PRLi, and RRLi of the receipt logs RL1, RL2, ... with the timed-out (T.O.) PDUs marked.]