We consider wait-free synchronization in multiprogrammed uniprocessor and multiprocessor systems in which ...... lines 3-7 can't be quantum-preempted = 6:.
Wait-Free Synchronization in Multiprogrammed Systems: Integrating Priority-Based and Quantum-Based Scheduling (Extended Abstract)
James H. Anderson Department of Computer Science University of North Carolina at Chapel Hill
Mark Moiry Department of Computer Science University of Pittsburgh
Abstract
We consider wait-free synchronization in multiprogrammed uniprocessor and multiprocessor systems in which \hybrid" schedulers are employed that use both priority information and a scheduling quantum in making scheduling decisions. In a hybrid-scheduled system, the scheduler on each processor always chooses for execution a process of maximal priority on its processor. However, there may be several processes of the same priority on the same processor. Processor time is allocated amongst such processes using a scheduling quantum. Hybrid-scheduled systems are not merely of theoretical interest. Indeed, a number of commercially-available operating systems schedule processes in this way. The main contribution of this paper is to show that, in any hybrid-scheduled system, any object with consensus number P in Herlihy's wait-free hierarchy is universal for any number of processes executing on P processors, provided the scheduling quantum is of a certain size. We give an asymptotically tight characterization of how large the scheduling quantum must be for this result to hold. It follows from our results that the ability to achieve wait-free synchronization in hybrid-scheduled systems is a function of three things: (i) the number of processors in the system (not the number of processes), (ii) the \power" of available synchronization primitives in terms of their consensus numbers, and (iii) the size of the scheduling quantum. Intuitively, synchronization primitives allow processes on dierent processors to coordinate, and quantum- and priority-based scheduling allow processes on the same processor to coordinate. All of the algorithms we present to establish these results are of polynomial space and time complexity. In contrast, corresponding algorithms previously proposed for priority-based systems (a subclass of the hybrid systems we consider) require exponential space and time. These results are of interest because they shed light on how the \inherent synchrony" of a system can be exploited to achieve wait-free synchronization. Our work also is of interest because it uni es prior work on waitfree synchronization in priority-scheduled and quantum-scheduled systems: any wait-free algorithm that is correct in a system with hybrid scheduling is also correct in a system that is either purely priority-based or purely quantum-based.
Work y Work
supported by NSF grants CCR 9510156 and CCR 9732916, and an Alfred P. Sloan Research Fellowship. supported in part by an NSF CAREER Award, grant number CCR 9702767.
1 Introduction This paper is concerned with wait-free synchronization in multiprogrammed systems. A multiprogrammed system consists of a set a processes and a set of processors. Each process is assigned to a distinct processor. Associated with each processor is a scheduler , which determines when each process on that processor is allowed to execute. In related previous work, Ramamurthy, Moir, and Anderson considered wait-free synchronization in multiprogrammed systems in which processes on the same processor are scheduled by priority [7]. In a priorityscheduled system, each scheduler always selects for execution the highest-priority process that is available for execution on its processor. Ramamurthy et al. showed that, for priority-based systems, any object with consensus number P in Herlihy's wait-free hierarchy [4] is universal for any number of processes executing on P processors, i.e., universality is a function of the number of processors in the system, not the number of processes . In subsequent work, Anderson, Jain, and Ott considered multiprogrammed systems in which quantum-based scheduling is used [1]. Under quantum-based scheduling, each processor is allocated to its assigned processes in discrete time units called quanta . When a processor is allocated to some process, that process is guaranteed to execute without preemption for Q time units, where Q is the length of the quantum, or until it terminates, whichever comes rst. Anderson et al. showed that, for quantum-scheduled systems, any object with consensus number P in Herlihy's wait-free hierarchy is universal for any number of processes executing on P processors, provided the scheduling quantum is of a certain size. In this paper, we present results that extend and unify the earlier work cited above. Speci cally, we consider multiprogrammed systems in which \hybrid" schedulers are employed that use both priority information and a scheduling quantum in making scheduling decisions. In a hybrid-scheduled system, the scheduler on each processor always chooses for execution a process of maximal priority on its processor. However, there may be several processes of the same priority on the same processor. Processor time is allocated amongst such processes using a scheduling quantum. Hybrid-scheduled systems are not merely of theoretical interest. Indeed, a number of commercially-available operating systems schedule processes in this way. Examples include the QNX real-time operating system [6], SGI's IRIX REACT/Pro [8] and Wind River's VxWorks [9] (all real-time operating systems). In addition, algorithms for hybrid-scheduled systems have the interesting property of being correct in systems that are either purely priority-based or purely quantum-based. We show in this paper that results established previously for quantum-based systems can be extended to also apply to hybrid-scheduled systems. In particular, we
C P P +1
Q cPTmax
n
Q c(2P + 1 ? n)Tmax Q (2P ? n)Tmin
.. .
.. . 2P ? 2 2P ? 1 2P .. .
universal if: Q c(P + 1)Tmax
not universal if:
.. .
.. .
.. .
Q c3Tmax Q c2Tmax Q c2Tmax .. .
Q PTmin Q (P ? 1)Tmin .. .
Q 2Tmin Q Tmin Q Tmin
.. . 1 Q0 | Table 1: Conditions under which an object with consensus number C is universal for any number of processes in a P processor system with hybrid scheduling.
show that, in any hybrid-scheduled system, any object with consensus number P in Herlihy's wait-free hierarchy is universal for any number of processes executing on P processors, provided the scheduling quantum is of a certain size. We give an asymptotically tight characterization of how large the scheduling quantum must be for this result to hold. All of the algorithms we present to establish these results are of polynomial space and time complexity. In contrast, the main multiprocessor algorithm given previously by Ramamurthy et al. for priority-based systems (a subclass of the hybrid systems we consider) requires exponential space and time. Our results are summarized in Table 1. This table gives conditions under which an object with consensus number C is universal in a P-processor hybrid-scheduled system. In this table, Tmax (Tmin ) is a constant denoting the maximum (minimum) time required to perform any atomic operation, Q is the length of the scheduling quantum, and c is a constant that follows from the algorithms we present. Obviously, if C < P, then universal algorithms are impossible [4]. If P C 2P, then the smallest value of Q that suces is a value proportional to (2P + 1 ? C)Tmax . If 2P C < 1, then the smallest value of Q that suces is a value proportional to Tmax . If C = 1, then Q (obviously) can be any value [4]. To see the dierence between a priority-based and a quantum-based system, consider Fig. 1. In this gure, process interleavings are shown that might arise on a processor in a quantum-based system (inset (a)) and in a priority-based system (inset (b)). The interleavings are depicted in a manner that abstracts away from the activities of the processes outside of object invocations | a less abstract look at the quantum-based interleaving in inset (a) is shown in Fig. 2. In a priority-based system, if a process p is preempted during an object invocation by another process q that invokes the same object, then p \knows" that q's invocation must be completed by the time p resumes execution. This is because q has higher priority and will not relinquish the processor until it 1
rithm's time complexity matches that of separate C&S algorithms presented previously for priority-based systems [7] and quantum-based [1] systems. process q In Sec. 4, we present our results for hybrid-scheduled process p multiprocessor systems. We begin by establishing the lower-bounds on quantum size given in the last column (a) invocation invocation of Table 1. We then establish the upper bounds given in ends begins the middle column by presenting a wait-free implementation of a consensus object for any number of processes process r running on P processors. This implementation employs objects with consensus number C, where C P. As C process q varies, the quantum required for this consensus impleprocess p mentation to work correctly is as given in Table 1. We end the paper with concluding remarks in Sec. 5. Proofs (b) for several of our results are included in an appendix. We Figure 1: Accesses to a common object by three processes also have included an appendix that summarizes some running on the same processor (a) in a quantum-based system results presented in [1] that are used in this paper. process r
and (b) in a priority-based system. Object invocations are shown between the brackets [ and ], with time running from left to right. In (b), process r (p) has highest (lowest) priority.
2 Preliminaries
completes. Thus, operations of higher-priority processes \automatically" appear to be atomic to lower priority processes executing on the same processor. This is the fundamental insight behind the results of Ramamurthy et al. [7]. In contrast, in a quantum-based system, if a process is ever preempted while accessing some object, then there are no guarantees that the process preempting it will complete any pending object invocation before relinquishing the processor. However, if a process can ever detect that it has \crossed" a quantum boundary, then it can be sure that the next few instructions it executes will be performed without preemption. Many of the quantum-based algorithms presented by Anderson et al. in [1] employ such a detection mechanism. In the algorithms presented in this paper, mechanisms are incorporated to allow processes to deal with both prioritybased and quantum-based preemptions. The remainder of this paper is organized as follows. In Sec. 2, we present de nitions and notation used in the remainder of the paper. Then, in Sec. 3, we present our results for hybrid-scheduled uniprocessor systems. We begin by presenting a wait-free, constant-time implementation of a consensus object that uses only reads and writes. This implementation is correct as long as the quantum is larger than a speci ed constant. This result shows that reads and writes are universal in a hybrid-scheduled uniprocessor. Although consensus objects can be used directly to implement other wait-free and lock-free objects [4], implementations of practical interest are usually based on stronger synchronization primitives such as compare-and-swap (C&S). We show that, using only reads and writes, C&S can be implemented in a hybrid-scheduled uniprocessor in time that is linear in the number of priority levels, as long as the quantum exceeds a certain constant value. This algo-
In this section, we describe the schedulers considered in this paper, and then formallyde ne our execution model. We consider pure priority- and quantum-based schedulers, and also hybrid schedulers. Hybrid schedulers generalize both priority- and quantum-based schedulers, so we present formal de nitions for hybrid systems only. In a priority-scheduled system, each process on a processor is assumed to have a distinct priority, and each scheduler always selects for execution the highestpriority process that is available for execution on its processor. For simplicity, we will assume throughout most of this paper that all process priorities are statically de ned. However, most of the algorithms we present can be easily modi ed to work correctly in systems that allow priorities to changed dynamically, as long as a process's priority cannot change during an object invocation. We consider the issue of supporting dynamic priorities in more detail in Sec. 5. Associated with any quantum-scheduled system is a scheduling quantum (or quantum for short), which is a nonnegative integer value. In an actual quantum-based system, the quantum would be given in time units. In this paper, we nd it convenient to more abstractly view a quantum as specifying a statement count. This allows us to avoid having to incorporate time explicitly into our model. In a pure quantum-scheduled system, when a processor is allocated to some process, that process is guaranteed to execute without preemption for at least Q atomic statements, where Q is the value of the quantum, or until its current object invocation terminates. In a hybrid-scheduled system, there may be several processes of the same priority on the same processor. A hybrid scheduler will always choose a process of maximal priority on its processor for execution. If there are several such processes, then processor time is allocated among them using a scheduling quantum. In de ning the behav2
process r process q process p Q
invocation begins
invocation ends
Figure 2: A closer look at the interleaving in Fig. 1(a). creating a run that is essentially asynchronous from the standpoint of the lower-priority processes. This implies that consensus numbers in such a system have the same interpretation as in an asynchronous system. From the explanation in the previous paragraph, it may seem that we enforce Axiom 2 only because systems that satisfy this axiom are more \interesting". However, we shall see that, with Axiom 2 in place, it is possible to achieve wait-free synchronization using primitives that are weaker than those given by Herlihy's hierarchy. Thus, if one is interested in supporting wait-free synchronization in a system, then our results suggest that designing that system in accordance with Axiom 2 might be a good idea. As a nal defense of these axioms, we note that any algorithm designed in accordance with Axioms 1 and 2 will work correctly in either a pure priorityscheduled system or a pure quantum-scheduled systems. In some sense, such algorithms are resilient to the speci c type of scheduler used in a multiprogrammed system. Throughout most of this paper, we do not constrain the manner in which the scheduler on each processor allocates quanta to processes of the same priority. Indeed, a scheduler on some processor may choose to never allocate a quantum to some ready process. Such behavior corresponds to a \halting failure" in an asynchronous system. In Sec. 5, we brie y consider schedulers that are constrained to allocate quanta \fairly". Given the above discussion, we can now more carefully de ne our model of program execution for hybridscheduled systems. Our programming notation should be self explanatory; as an example of this notation, see Fig. 3. In this and subsequent gures, each numbered statement is assumed to be atomic. When considering a given object implementation, we consider only statement executions that arise when processes perform operations on the given object, i.e., we abstract away from the other activities of these processes outside of object accesses. For objects that are \long lived", i.e., may be invoked repeatedly by a process, we abstractly view each process as alternating between thinking and ready phases. While thinking , a process p has no enabled atomic statements. 1 Note that a system that allows preemptions only at quantum A transition by p from thinking to ready is caused by the boundaries satis es Axiom 2 by de nition. scheduler on p's processor. When such a transition oc-
ior of a hybrid scheduler, there a variety of dierent ways in which one can specify the manner in which higherpriority processes may preempt lower-priority ones. In this paper, we mostly limit our attention to hybrid schedulers satisfying the following two axioms. Axiom 1: (Priority-Based Scheduling ) If a higherpriority process p becomes ready for execution while a lower-priority process q is executing on the same processor, then p immediately preempts q, i.e., q may be preempted before nishing a full quantum. 2 Axiom 2: (Quantum-Based Scheduling ) Preemptions by higher-priority processes do not aect the quantumbased scheduling at each priority level. Speci cally, each process p is guaranteed to execute at least Q statements between preemptions by processes of equal priority, even if p is preempted by higher-priority processes. 2 One might be tempted to modify Axiom 1 so that higher-priority processes may preempt lower-priority ones only at quantum boundaries. However, as we shall see, the lower bounds on quantum size that we establish do not depend on Axiom 1. Furthermore, any algorithm that works correctly assuming Axiom 1 is also correct if preemptions occur only at quantum boundaries. One might also be tempted to eliminate Axiom 2. Indeed, one of the overriding goals of this paper is to consider how various nuances of hybrid schedulers can be exploited to achieve wait-free synchronization, and a hybrid scheduler that violates Axiom 2 is perfectly reasonable. It turns out, however, that hybrid schedulers satisfying Axiom 1 but violating Axiom 2 can be quickly dispensed with, because in this case it can be easily shown that Herlihy's hierarchy remains intact.1 To see this informally, consider a collection of processes on one processor, with two priority levels. Suppose the lower-priority processes all access a common consensus object, and the higher-priority processes are engaged in some unrelated computation that does not involve that object. A scheduler violating Axiom 2 can allow higher-priority processes to preempt lower-priority ones at arbitrary points,
3
shared variable P : array[1::3] of valtype [ ? initially ? procedure decide (val: valtype) returns valtype private variable v; w: valtype
curs, an operation of the implemented object is selected nondeterministically and p's program counter is updated to point to the rst statement of that operation. p transits from ready to thinking when it completes an object invocation. We de ne a program's semantics by a set shistos0 tof1?! 1 , ?! ries. A history of a program is a sequence t 0 s where t0 is an initial state and ti ?!ti+1 denotes that state ti+1 is reached from state ti via the execution of statement si ; unless stated otherwise, a history is assumed to be a maximal such sequence. the s t Consider s ti+1 tj ?! s1 ti ?! s0 t1?! history t0 ?! j +1 , where si and sj are successive statement executions by some process p. We say that p is preempted before sj in this history i some other process on p's processor executes a statement between states ti+1 and tj . A history s1 is well-formed i it satis es the fols0 t1?! h = t0 ?! lowing condition: for any statement execution sj in h by any process p, no higher-priority process on p's processor has an enabled statement at state tj ; if p is preempted before sj , then no other process on p's processor with the same priority as p executes a statement after state tj +1 until either (i) p executes at least Q statements or (ii) p's object invocation that includes sj terminates. We henceforth assume all histories are well-formed. Notational Conventions: The number of processes and the number of processors in the system are denoted N and P, respectively. Processors are labeled from 1 to P. We assume that the priority levels on each processor range from 1 to V , with V being the highest priority. We let M denote the maximum number of processes on any processor. Q denotes the value of the quantum, and C will be used to refer to a given object's consensus number (see Sec. 1). Unless stated otherwise, p; q, and r are assumed to be universally quanti ed over process identi ers. We use valtype to denote an arbitrary type. 2
1: v := val ; 2: for i := 1 to 3 do 3: w := P [i]; 4: if w 6= ? then 5: v := w
i
i
else
6:
P [i] := v
od; 7: return P [3]
j
Figure 3: Consensus algorithm for hybrid uniprocessors. P[1]:=v1 p
O P[2]=
P[1]=v1 r1
L
P[2]:=v1 H
K
P[3]=
P[2]=v1 s1
E
B
C
D
G
N
F
s2
P[2]=v2 P[3]= M
r2
J
P[1]=v2 P[2]= q
Q P[1]=
P[3]:=v1
A
P[3]:=v2
I P[2]:=v2
P P[1]:=v2
Figure 4: Proof of Lemma 1. not publish it). We show here that it is also correct in hybrid-scheduled systems. The algorithm is correct provided Q is large enough to ensure that each process can be quantum-preempted at most once. By replacing the for loop by straight-line code, it can be seen that Q = 8 suces. The algorithm employs three shared variables, P[1], P[2], and P[3]. The idea behind the algorithm is to attempt to copy a value from P[1] to P[2], and then from P[2] to P[3]. Each process returns the value in P[3]. Correctness follows from the following lemma.
Lemma 1: In the consensus algorithm in Fig. 3, each process returns the same value . 2 Proof : Suppose towards a contradiction that two pro-
3 Uniprocessor Systems
cesses return dierent values v1 and v2. Then, some process s1 must write v1 to P[3] and some process s2 In this section, we present constant-time implementa- must write v2 to P[3]. Fig. 4 depicts what a history tions of a consensus object (Sec. 3.1) and a C&S object leading to such a situation must look like. Speci cally, (Sec. 3.2) for hybrid-scheduled systems. Each of these we have the following: implementations uses only reads and writes and is cor(A) s1 reads P [3] = ? and then rect if the quantum exceeds a certain constant value. (B) writes P [3] := v1 and (C) s2 reads P [3] = ? and then (D) writes P [3] := v2. This implies that 3.1 Consensus (E) s1 either reads P [2] = v1 or writes P [2] := v1 before (A) Uniprocessor consensus can be solved using the algo- and rithm in Fig. 3. This algorithm was originally proposed (F) s2 either reads P [2] = v2 or writes P [2] := v2 before (C). for quantum-scheduled systems (as noted in [1], the al- (G) We assume, without loss of generality, that E < F . gorithm is due to Moir and Ramamurthy, but they did (H) Some process r1 writes P [2] := v1 at or before (E). (N.B. 4
The dashed line indicates that it could be that r1 = s1 and (H)=(E).) (I) Also, some process r2 writes P [2] := v2 after (E) and before or at (F). (N.B. The dashed line indicates that it could be that r2 = s2 and (I)=(F).) (J) r2 previously reads P [2] = ? (must be before (H) because P [i] 6= ? is stable). (K) Similarly, r1 previously reads P [2] = ?. (L) r1 reads P [1] = v1 or writes P [1] := v1 before (K). (M) r2 reads P [1] = v2 or writes P [1] := v2 before (J). (N) In the diagram, we have assumed (L) < (M ). The rest of the proof is symmetric for the other case. (O) Some process p writes P [1] := v1 at or before (L). (N.B. The dashed line indicates that it could be that p = r1 and (O)=(L).) (P) Some process q writes P [1] := v2 after (L) and at or before (M). (N.B. The dashed line indicates that it could be that q = r2 and (P)=(M).) (Q) q reads P [1] = ? before (P) and before (O) because P [i] 6= ? is stable.
plete their operations. When implementing C&S, helping is not required. In particular, in our algorithm, only nontrivial C&S operations that change the implemented object's state are appended to the list. Thus, if a process p detects that another process q has successfully applied a C&S operation while p is trying to append a cell to the list, then p can simply fail its operation, because it can be linearized to occur either before or after q's. In Herlihy's algorithm, one of the main complications is that of memory management. Each process needs a nite pool of cells that is sucient to guarantee that a free cell is always available. Herlihy's mechanism for maintaining this pool of cells is quite complicated, and requires O(N 3) time, where N is the number of processes. In our algorithm, we instead perform memory management by using a constant-time tag selection mechanism proposed previously by us in [2]. In Herlihy's algorithm, each process has a head variable that points to a cell at or near the beginning of the list. A process determines the head of the list by scanning each of these variables. In our algorithm, we instead have one head variable for each of the V priority levels. These head \pointers" are not true pointers, but they contain enough information to locate a cell in the list. The O(V ) time required to scan these variables to determine the list's head is the dominant factor in the running time of our algorithm. Because we have one head variable per priority level, several processes may be capable of updating the same head variable, so we must be careful in dealing with modi cations to such variables by concurrently-executing processes. In a previous paper, Anderson et al. presented a constant-time C&S algorithm for quantum-scheduled uniprocessor systems that uses only reads and writes and that requires a quantum of constant size [1] (an overview of this algorithm is given in Appendix C). We use that algorithm here to update the head variables. Each such variable is only updated by processes at the same priority level, and these processes are quantum-scheduled with respect to one another. Each head variable may be read by processes at other priority levels, but fortunately, in the C&S algorithm of [1], a read is performed by simply reading one shared variable. Thus, reads by processes at other priority levels do not cause any problems. In Fig. 5, the quantum-based C&S operation is denoted \Q-C&S" (see lines 34, 36, 41, and 43 in Fig. 5). The algorithm also employs the decide routine of the consensus algorithm in Fig. 3 (see line 37). In Fig. 5, decide has been given an extra parameter to indicate the consensus object being accessed. We assume that such a consensus object can be read; for our consensus algorithm, a read can be implemented by the code sequence \if P[1] = ? then return(?) else return(decide(P[1])) ". With the above overview in mind, we now consider the C&S procedure in detail. Each process p begins by selecting a tag in constant time (lines 8-10) and init-
Observe that process p takes a step ((O)) between two steps of process q ((Q) and (P)). Therefore, priority(p) priority(q). If priority(p) > priority(q), then p completes execution before q resumes, which contradicts P[2] = ? at (J). Thus, priority(p) = priority(q). Similarly, priority(r1) = priority(r2). (Note that (J) < (H) < (I) and that if r1 completed execution before (I), P[3] 6= ? would hold at (C).) If priority(q) > priority(r2), then q would run to completion before (M), contradicting P[2] = ? at (J). Also, if priority(q) = priority(r2), then q would run to completion before (M), contradicting P[2] = ? at (J). In particular, because q has already been preempted once by p before (P), it cannot be preempted by r2 after (P). Therefore, priority(q) < priority(r2), which implies that priority(q) < priority(r1) (recall that priority(r1) = priority(r2)). But q executed a statement ((P)) between two statement executions of r1 ((L) and (H)). Contradiction. 2
Theorem 1: In a hybrid-scheduled uniprocessor system with Q 8, consensus can be implemented in constant time using only reads and writes . 2
3.2 Compare-and-Swap
Our uniprocessor C&S implementation is shown in Fig. 5. This algorithm is based upon Herlihy's \append-to-alist" universal algorithm [4]. In Herlihy's algorithm, an object is implemented by means of a linked list of cells, with one cell for each operation that has been applied to the object. The pointers that link the cells in the list are consensus objects that are updated by invoking a consensus algorithm. Because we are implementing C&S, the list algorithm can be simpli ed considerably. For example, in Herlihy's algorithm, processes that attempt to concurrently modify the list \help" one another so that all eventually com5
procedure Apply (old ; new : valtype ; hd : hdtype ) returns boolean
type
hdtype =record id : 1::N ; tag : 0::4N + 1; last : 1::N end; = stored in one word; (id ; tag ) identify list cell = = last is id of last process to update this Hd variable = ptrtype =record id : 1::N ; tag : 0::4N + 1 end; = next \pointer"; stored in one word = ctype =record val: valtype; nxt: ptrtype [ ? end
= apply C&S operation; rst, check for trivial cases = 26: if Cell [hd :id ; hd :tag ]:val 6= old then return(false) ; 27: if old = new then return(true) ; = update Seen variables to help lower-priority reads = 28: for i := 1 to pri ? 1 do 29: Seen [i] := old od;
shared variable Cell : array[1::N; 0::4N + 1] of ctype ; = list cells = Hd : array[1::V ] of hdtype ; = used to nd list head = A: array[1::2N; 1::V ] of 0::4N + 1; = used to select tag = Seen : array[1::N ] of valtype = used to help reads =
= nontrivial case | attempt to append to list = = Q is assumed large enough to ensure remaining lines = = are quantum-preempted at most once = 30: if priority (hd :id ) pri then 31: hd :last := p; 32: repeat = repeats at most once = 33: repeat tmp := Hd [pri ]; = repeats at most once = 34: until Q -C&S(&Hd [pri ]; tmp ; (tmp :id ; tmp :tag ;p)); 35: if Cell [hd :id ; hd :tag ]:nxt 6= ? then return(false) 36: until Q -C&S(&Hd [pri ]; (tmp :id ; tmp :tag ; p); hd ) ; 37: if decide (&Cell [hd :id ; hd :tag ]:nxt ; (p; tag )) = (p; tag ) then 38: hd ; lasttag := (p; tag ;p); tag ; 39: repeat = repeats at most once = 40: repeat tmp := Hd [pri ]; = repeats at most once = 41: until Q -C&S(&Hd [pri ]; tmp ; (tmp :id ; tmp :tag ;p)); 42: if Cell [hd :id ; hd :tag ]:nxt 6= ? then return(false) 43: until Q -C&S(&Hd [pri ]; (tmp :id ; tmp :tag ; p); hd ); 44: return(true)
= local to process p = j : 1::2N ; = used in tag section = lasttag : 0::4N + 1; = tag of last cell p added to list = pri ;i: 1::V ; = p's priority level, loop index = hd , tmp , next : hdtype ; = head variables = rhd : array[1::V ] of hdtype = used in Read () = = We assume list is initialized as if some process had = = previously performed a successful C&S in isolation. = procedure Feedback (q: 1::2N ; i: 1::V ; cmp : hdtype ; var hd : hdtype ) returns boolean = tag feedback to same- and higher-priority processes = = returns false if higher-priority preemption = 1: if i < pri then return(true) ; = no feedback here = 2: A[q; i] := hd :tag ; 3: tmp := Hd [i]; 4: if (cmp :id ; cmp :tag ) 6= (tmp :id ; tmp :tag ) then 5: if i > pri then return(false) = same-priority preemption (i = pri ) = = lines 3-7 can't be quantum-preempted = 6: A[q; i] := tmp :tag ; 7: hd := tmp ; return(true) procedure C&S(old ; new : valtype ) returns boolean = select tag in constant time as in reference [2] = 8: read A[j; pri]; 9: if j = 2N then j := 1 else j := j + 1 ; 10: select tag : tag 2= flast 2N tags readg [ flast 2N tags selectedg [ flasttag g; = initialize new cell = 11: Cell [p; tag ]:val := new ; 12: Cell [p; tag ]:nxt := consensus object initialized to ?; = nd list head by reading each level's Hd variable = = Q is assumed large enough to ensure each loop = = iteration is quantum-preempted at most once = 13: for i := 1 to V do 14: hd := Hd [i]; 15: if i pri _ (i > pri ^ priority (hd :id ) = i) then 16: if Feedback (p;i; hd ; hd ) = false then return(false) ; 17: if Cell [hd :id ; hd :tag ]:nxt = ? then 18: return(Apply (old ; new ; hd )) ; 19: if i pri then = these Hd 's can be o by one = 20: next := Cell [hd :id ; hd :tag ]:nxt ; 21: if priority (next :id ) = i then 22: Feedback (p + N;i; hd ; next ); 23: if Cell [next :id ; next :tag ]:nxt = ? then 24: return(Apply (old ; new ; next )) private variable
else
45: return(false)
procedure Read () returns valtype
= nd list head by reading each level's Hd variable = = Q is assumed large enough to ensure each loop = = iteration is quantum-preempted at most once = 46: for i := 1 to V with i = pri last do 47: rhd [i] := Hd [i]; 48: if i pri _ (i > pri ^ priority (rhd [i]:id ) = i) then 49: if Feedback (p;i; rhd [i]; rhd [i]) = false then 50: return(Seen[pri]) ; 51: if Cell [rhd [i]:id ; rhd [i]:tag ]:nxt = ? then 52: return(Cell[rhd [i]:id ; rhd [i]:tag ]:val ) ; 53: if i pri then = these Hd 's can be o by one = 54: next := Cell [rhd [i]:id ; rhd [i]:tag ]:nxt ; 55: if priority (next :id ) = i then 56: Feedback (p + N;i; rhd [i]; next ); 57: if Cell [next :id ; next :tag ]:nxt = ? then 58: return(Cell[next :id ; next :tag ]:val ) od;
= same- or higher-priority Hd [i] must have changed = 59: for i := i + 1 to V do 60: if Hd [i] 6= rhd [i] then 61: return(Seen [pri ]) od;
= must have been same-priority Hd [i] that changed = 62: return(Cell[next :id ; next :tag ]:val )
od; 25: return(false)
Figure 5: Uniprocessor C&S algorithm. Private variables are assumed to retain their values across procedure invocations.
For simplicity, the object being accessed is left implicit. To be precise, the object's address should be passed as a parameter to the C&S and Read procedures.
6
ializing a new list cell, which is determined using the selected tag (lines 11-12). After this initialization step, process p scans all of the Hd (head) variables to determine the current head of the list (lines 13-24). During the scan, tag feedback is performed by invoking the procedure Feedback . This ensures that a cell determined to be head will not be prematurely reused by the process that owns it. The correctness of the scan follows from an invariant, which states that, during p's scan, either some higher-priority Hd [i] points exactly to the head | and the cell pointed to is a cell of that higher-priority level i | or some same- or lower-priority Hd [i] points either to the head or to the cell one behind the head. In the latter case (it points to a cell one behind the head), the head is a cell of level i. By this invariant, p can fail to nd the head of the list only if preempted by other processes of equal or higher priority. If this happens, p can simply return false (lines 16 and 25), because it must be the case that some C&S operation has been successfully applied during p's operation. We assume that Q is large enough to ensure that at most one same-priority preemption can occur over certain code sequences of constant length (see lines 3-7, 14-24, 30-45, and 47-58). This makes such preemptions relatively easy to dispense with. After nding the head of the list, p attempts to perform its C&S by invoking the procedure Apply . If p's C&S is trivial, i.e., can be applied without modifyingthe object's state, then p determines an appropriate return value and returns (lines 26-27). Otherwise, p updates the Hd variable for its priority level (lines 30-36), and then attempts to apply its operation by performing a consensus operation on the nxt pointer of the head cell ( line 37), returning true or false accordingly (lines 44 and 45). If it successfully appends its cell, then p updates the Hd variable for its level again to point to the cell it just appended (lines 38-43). Two nested repeat-until loops are used in updating a Hd variable. Each such loop can repeat only if there is a quantum preemption, and by assumption, at most one such preemption may occur. In the outermost repeatuntil (lines 32-36, for example), a process p rst attempts to assign its own process id to the last eld of its Hd variable. This is done to detect a quantum preemption prior to the quantum-based C&S at line 36. This quantumbased C&S is what eectively updates the Hd variable. Prior to performing this update, p rst checks that its local hd variable still points to the head (lines 35). This check prevents \old" information from being assigned to a Hd variable. To perform a Read operation, a process p simply scans the Hd variables to locate the head of the list (lines 4658), and returns the value stored in that cell. Again, p can fail to locate the head only if there is a preemption. If preempted by a higher-priority process, p can simply return a value \seen" by such a process (line 61). If preempted by a same-priority process, then after the
preemption, the Hd variable for p's priority level must point to a suciently \recent" cell whose value can be safely returned (line 62). Due to space limitations, we defer the formal correctness proof for the algorithm in Fig. 5 to the full paper. Letting c be a constant denoting the longest code sequence for which we require that there by at most one quantum preemption, we have the following theorem.
Theorem 2: In a hybrid-scheduled uniprocessor system with Q c, C&S and Read can be implemented in O(V )
2 Recall from Sec. 2 that any algorithm for a hybridscheduled system also works correctly in a pure priorityor quantum-scheduled system. Interestingly, the time complexity of the C&S algorithm presented here matches that of separate C&S algorithms presented previously for priority-based systems [7] and quantum-based [1] systems, respectively. time using only reads and writes .
4 Multiprocessor Systems In this section, we establish asymptotically-tight bounds on Q for multiprocessor systems with hybrid schedulers. We begin by presenting lower bounds on Q in Sec. 4.1. Then, in Sec. 4.2, we give matching upper bounds.
4.1 Lower Bounds on Q
In this subsection, we establish the bounds on Q given in the last column of Table 1. Due to space limitations, we only sketch the main ideas of the lower-bound proof here. The complete proof can be found in Appendix A. It is fairly straightforward to establish the bounds given for C 2P. We therefore restrict our attention to establishing the bounds given for values of C such that P C < 2P. We also restrict attention to systems with schedulers that are purely quantum-based, i.e., we assume there is only one priority level per processor. Clearly, lower bounds for such a system will apply to a system with multiple priority levels as well. In the remainder of this subsection, we let A denote a consensus algorithm for an arbitrary number of (deterministic) processes executing on a P-processor system with a single priority level per processor such that all underlying objects are either read/write registers or C-consensus objects, where P C < 2P. For simplicity, we assume that any invocation of any of these underlying objects takes one atomic statement. We also assume that, in any history, if a C-consensus object is invoked more than C times, then all invocations after the C-th return the value ?.2 Our objective is to show that A cannot be 2 We could instead require that such invocations return random values, or that such invocations are in fact not allowed. The key requirement here is that all invocations after the C -th return no \useful" information.
7
correct in a system for which Q 2P ? C. Assume, to the contrary, that A is correct in such a system. We show that this assumption leads to a contradiction. Our proof is based on a valency argument [3, 4], and we assume that the reader is familiar with the general structure of such an argument. In particular, we assume that the reader knows what it means for a state of A to be bi-valent , uni-valent , or x-valent , where x is the input value of some process [3, 4]. We derive a contradiction by showing that there exists a history of A consisting of an in nite sequence of bi-valent states, contradicting the fact that A is wait-free. The proof is similar to other valency arguments presented elsewhere [3, 4], with the important exception that we must carefully keep track of when a given process may be preempted. We will view each processor's scheduler as an \adversary" that declares processes to be preemptable in a manner that hinders the overall ability of processes to choose a common decision value. More precisely, the schedulers on the various processors select processes for execution in the following way. Initially, only a set of Q processes is allowed to run, with one process on each of processors 1 through Q. We denote the processes assigned to processor i as pi1; pi2; : : :; piM [i] . The initial set of running processes is de ned to be fp11; p21; : : :; pQ1 g. Other processes are not allowed to execute algorithm A initially. This is tantamount to assuming that the schedulers on processors 1 through Q \unfairly" continue to select processes in fp11; p21; : : :; pQ1 g for execution, and the schedulers on other processors continually select other \external" processes to run that do not participate in algorithm A. Our schedulers continue to select processes in this initial set to execute until these processes determine a decision value. At that point, other processes may be selected. Because the set fp11; p21; : : :; pQ1 g consists of Q processes and Q is the size of the quantum, we can \stagger" the execution of these processes with respect to quantum boundaries so that at any reachable state, one such process is preemptable. Because only these processes may execute until a decision is reached, there exists a reachable bi-valent state t such that if any one of original set of processes takes a step from t, the resulting state is uni-valent. Consider the set of Qstates reachable from t by having each of fp11 ; p21; : : :; p1 g take a step. Among these states, there is an x-valent state ux and a y-valent state uy , where x 6= y (see Figure 6). It is easy to show that at state t, each of fp11 ; p21; : : :; pQ1 g is enabled to invoke a single C-consensus object O. Informally, this is because read/write registers do not have the required symmetry-breaking power, and operations on dierent objects commute. In reaching state t, we maintain the invariant that the processes in fp11; p21; : : :; pQ1 g are \staggered" with respect to quantum boundaries, and hence one of these processes, say p11, is preemptable.
t Q 1 p − p invoke 1 1 object O
1 execute p until it 2 invokes O
ux
uy
O
O
execute p Q+1 1 until it invokes O
O
O
Q+1 execute p 2 until it invokes O
O
O
O
O
execute p P 2 until it invokes O; invocation returns
Figure 6: Two histories ending in x-valent and y-valent states. Process pP2 chooses the same decision value in both histories, which is a contradiction. Let us suppose that p11 is preempted by another process, say p12, at both ux and uy , and that p12 executes in isolation from both of these states, as depicted in Fig. 6. Then, p12 must eventually invoke object O, because this is the only way for it to determine the decision that has been reached. As soon as p12 invokes O in both histories, we de ne processes pQ1 +1 ; : : :; pP1 to be eligible for execution. (Until this point, no process on processors Q + 1 through P have executed.) We extend each history as follows. First, we execute pQ1 +1 in isolation as shown in Fig. 6. Process pQ1 +1 must eventually invoke object O for the same reasons that p12 did. After it invokes O, we preempt it by process pQ2 +1 . This is possible because pQ1 +1 has not yet suered its rst quantum preemption. Recall that our execution model allows a process to experience its rst quantum preemption at any time (see Sec. 2). Informally, such a process may have been engaged in some other computation prior to participating in algorithm A, and therefore its execution may arbitrarily align with the next quantum boundary. Arguing as before, pQ2 +1 must eventually access object O. We continue in this manner, executing rst pQ1 +2 until after it invokes O, then pQ2 +2 , then p1Q+3 , and so on. Note that by the time we execute process pP2 in isolation, O is been invoked by Q + 2(P ? Q) = 2P ? Q processes in both of the histories being constructed. However, O has consensus number C 2P ? Q. Therefore, process pP2 obtains the value ? when it invokes O in both histories, 8
i.e., it cannot distinguish between the two histories. If we execute process pP2 until it terminates, it therefore returns the same decision value in both histories, which contradicts the fact that ux is x-valent and uy is y-valent. The argument given in the previous paragraph shows that the original set of Q processes cannot force a decision. Using this fact, we can construct a history consisting of an in nite sequence of bi-valent states. This gives us the desired contradiction, and hence the following theorem. Theorem 3: In a P -processor system in which the pro-
cess begins by skipping over any levels for which lowerpriority processes executing on the same processor have already published results (lines 5-13). The remaining levels (if any) are accessed in sequence until an overall decision can be reached (lines 14-36). Note that if a process is preempted by a higher-priority process, then by the time it resumes execution, an overall decision must have already been reached. Such a situation is checked for in line 16. If not preempted by a higher-priority process, a process returns as its decision value the output value of the highest-numbered level with a published value. The requirement that at most C processes can access a C-consensus object is enforced by de ning P +K \ports" per consensus level. Processors 1 through K have two ports per object, and processors K + 1 through P have one port. A process can access a C-consensus object only by rst claiming one of the ports allocated to its processor. A process claims a port by executing lines 17-26. On each processor, all ports across all consensus levels are numbered in sequence, starting at 1. A process p on processor i claims a port by rst incrementing a counter Port [i; v], where v is its priority, and by then invoking a local consensus object associated with that port. The Port counters ensure that at most one process of a given priority attempts to acquire a given port. The local consensus invocation ensures that, if several processes with dierent priorities attempt to acquire a port, then only one succeeds. It is easy to see that a process can fail to \win" such a local consensus invocation only if it is preempted by a higher-priority process. In this case, when the process resumes execution, it will quickly detect that a decision has been reached and terminate. If a process with priority v is not preempted by higher-priority processes, then each time it increments Port [i; v], it atomically reserves the next available port. We now describe the manner in which the Port variables are incremented. If processor i has one port per Cconsensus object, i.e., i > K, then Port [i; v] is updated by simply performing a F&I operation. If processor i has two ports per object, i.e., i K, then the situation is a bit more complicated. In particular, if a process p acquires the rst of processor i's ports at some level l and then simply increments Port [i; v], p is then positioned to acquire the second of processor i's ports at level l, which is pointless because a decision has already been reached at that level. To correct this, Port [i; v] is updated in such a case using a C&S operation in line 21. Because Port [i; v] is updated only by processes with the same priority, the F&I and C&S operations used in updating it can be implemented from reads and writes using the constant-time algorithms for purely quantum-scheduled sytems presented in [1] (these algorithms are described in Appendix C). The local consensus object for each port can also be implemented from reads and writes in constant time, using the algorithm given in Sec. 3.1. Having explained how ports are acquired, we now ex-
cesses on each processor are scheduled using either a quantum-based scheduler or a hybrid scheduler, consensus cannot be implemented in a wait-free manner for any number of processes using read/write registers and C consensus objects if C P and Q max(1; 2P ? C). 2
4.2 Upper Bounds on Q
In this subsection, we establish upper bounds on Q for hybrid-scheduled systems, as given in the second column of Table 1. Speci cally, we present a wait-free consensus algorithm for any number of processes executing on P processors, where the processes on each processor are scheduled using a hybrid scheduler. The algorithm uses C-consensus objects, where C P, and requires the quantum Q to be as speci ed in the table. For simplicity, we assume in the rest of this section that C 2P, because for larger values of C, the algorithm we give for C = 2P can be applied to obtain the results in Table 1. Our consensus algorithm is shown in Fig. 7. This algorithm has been obtained by modifying a similar algorithm presented previously in [1] for purely quantumscheduled systems. The modi cations allow multiple priority levels to be supported. In addition to C-consensus objects, a number of uniprocessor C&S, F&I, and consensus objects are used in the algorithm. We use \local -C&S", \local -F&I", and \local-consensus " in Fig. 7 to emphasize that these are uniprocessor objects. We begin our description of the algorithm by giving a very brief overview of it; details are provided in the following paragraphs. The algorithm works by having each process participate in a series of \consensus levels". There are L consensus levels, as illustrated in Fig. 8. L is a function of M, P, and C, as described below. Each consensus level consists of a C-consensus object, where C = P + K, 0 K P. Also associated with each consensus level l is a collection of shared variables Outval [i; l], where 1 i P. A process on processor i \publishes" the decision value of level l by recording it in Outval [i; l], and by then updating another shared variable, as described below, to indicate that a published value exists at this level (see lines 32-33 in Fig. 7). The processes on each processor attempt to progress through all levels in sequence, using published results from previous levels as inputs to subsequent levels. Each pro9
constant L = (K + 1)M (1 + P ? K ) + (P ? K )2 M + 1
= total number of consensus levels for C = P + K , 0 K P =
shared variable Lastpub : array[1::P; 1::V ] of 0::L initially 0;
= Lastpub [i; v] is the highest level on a processor i :: : = = :: : for which some process with priority v (or lower) has published a consensus value = Outval : array[1::P; 1::L] of valtype [ ? initially ?; = Outval [i;l] is the consensus value for level l on processor i :: : = = : : : we assume no input value (and hence consensus value) is ? = Port : array[1::P; 1::V ] of 1::2L + M initially 1 = Port [i;v] is the next available port for processes of priority v :: : = = : :: on processor i (each processor has L or 2L ports; the \+M " term is needed because of overshoots) = private variable = local to process p = input , output , lastval : valtype ; = input/output value for a level, output value of last level = level , prevlevel : 0::L + M ; = consensus level accessed by p in current (previous) iteration of the while loop = numports : 1::2; = number of ports per consensus object on processor pr(p) = port , newport , lowerport : 1::2L + M ; = port numbers = publevel , lowerpublevel : 0::L; = used to determine last level for which there is published value on processor pr(p) = v: 1::V
procedure decide (val : valtype ) returns valtype 1: lastval := Outval [pr(p);L]; = return if other processes have already decided = 2: if lastval 6= ? then return(lastval ) ; 3: if pr(p) K then numports := 2 else numports := 1 ; 4: input ; prevlevel ; level := val ; 0; 0; = lower-priority processes may have made some = = progress, so initialize Port & Lastpub accordingly = 5: for v := 1 to priority(p) ? 1 do 6: lowerport := Port [pr(p);v]; 7: port := Port [pr(p);priority(p)]; 8: if lowerport > port then 9: local -C&S(&Port [pr(p);priority(p)]; port ; lowerport ) ; 10: lowerpublevel := Lastpub [pr(p);v]; 11: publevel := Lastpub [pr(p);priority(p)]; 12: if lowerpublevel > publevel then 13: local -C&S(&Lastpub [pr(p);priority(p)]; publevel ; lowerpublevel ) od;
= decide (continued) | proceed through levels = 14: while level L do 15: lastval := Outval [pr(p);L]; = return, if HP processes preempt and decide = 16: if lastval 6= ? then return(lastval) ; = determine port and level = 17: port := Port [pr(p);priority(p)]; 18: level := ((port ? 1) div numports ) + 1;
= decide (continued) | if level didn't change, make correction = 19: if prevlevel = level then 20: newport := port + numports ; 21: if local -C&S(&Port [pr(p);priority(p)]; port ; newport + 1) then 22: port := newport 23: else port := local -F&I(&Port [pr(p);priority(p)]) 24: 25:
else ;
port :=local -F&I(&Port [pr(p);priority(p)])
26: level := ((port ? 1) div numports ) + 1; = determine input for next level = 27: publevel := Lastpub [pr(p);priority(p)]; 28: if publevel 6= 0 then input := Outval [pr(p); publevel ] ; 29: if level L then = local consensus ensures at most one process accesses port = 30: if local-consensus (pr(p); port ; p) = p then = invoke the C -consensus object = 31: output := C -consensus (level ; input ); 32: Outval [pr(p); level ] := output ; = publish result = 33: local -C&S(&Lastpub [pr(p);priority(p)]; publevel ; level ) ;
34: prevlevel := level od; 35: publevel := Lastpub [pr(p);priority(p)]; 36: return(Outval [pr(p); publevel ])
Figure 7: Multiprocessor consensus algorithm. plain in more detail how an overall decision value is reached. Each process attempts to participate in each consensus level, in order, skipping over levels for which a decision value has already been published. When a process accesses some level, the input value it uses is either the output of the highest-numbered consensus level for which there is a published value on its processor, or its own input value, if no previously-published value exists (see lines 4, 27, and 28). So that an input value for a level can be determined in constant time, a counter Lastpub [i; v] is maintained on each processor i, for each priority level v. This counter points to the highestnumbered level for which a process on processor i with priority v (or lower) has published a value. Due to pre-
emptions, Lastpub [i; v] may need to be incremented to skip over an arbitrary number of levels. It is therefore updated using a C&S operation instead of F&I (see line 33). Like Port [i; v], Lastpub [i; v] is updated only by processes with priority v, so the C&S operations that update Lastpub [i; v] can be implemented from reads and writes in constant time [1]. If a process on processor i is preempted by higher-priority processes, or if a decision is reached before it starts, then it may safely return the output value from level L (lines 2 and 15). Otherwise, it completes execution by returning the output value from level Lastpub [i; v], where v is its priority (lines 35-36). The main diculty associated with our algorithm is that of determining an input value to use when accessing 10
} }
cess failures that may occur in a history. For each lemma, we assume that the given history is xed. Let AF denote the number of levels for which there is an access failure in this history. Further, let AF same P−K K 1 (respectively, AF di ) denote the number of levels for xx which there is a same-priority (respectively, dierentxx xxxx priority) access failure in the history. It follows that 2 An 8−consensus object AF AF same + AF di , because a preemption could potentially result in both a same-priority and dierentProcessors 1..K: 2 ports Processors K+1..P: 1 port priority access failure. By de nition, lower-priority processes cannot preempt higher-priority ones. Thus, we can easily bound AF di . Lemma 2: If there are at most M processes on any processor, then AF di M. 2 In the next lemma, we concentrate on bounding Multiprocessor Consensus Value AF same . In the statement of this lemma, we consider a process to access a level i it successfully acquires a port for that level. A proof of this lemma can be found Figure 8: Consensus levels of the algorithm in Fig. 7. in Appendix B. 3: Suppose that C = P +K , where 0 K P , a given level. When a process p attempts to determine Lemma and that Q is large enough to ensure that each process an input value, there may be a number of previous lev- can be preempted at most once by processes of equal els that are inaccessible to p (because all ports on p's priority while accessing P ? K + 1 consensus levprocessor have been claimed) yet no decision value has els in succession (theseany levels have to be conbeen published. This can happen for a previous level secutive). If there are L levels,don't and M prol only if the process(es) on p's processor that acquired cesses on any processor, then AF at most KM + (P ? same that processor's port(s) at level l were preempted before K)(L + M(P ? K))=(1 + P ? K). Furthermore, publishing an output value (in particular, such a process L > (K +1)M(1+P ? K)+(P ? K)2 M, then a decidingif must have been preempted while executing within lines level exists . 2 21-33). We call such a situation an access failure , and say that the preempted process(es) at level l cause an It follows from Lemma 3 that with L as de ned in access failure at level l. We call an access failure a same- Fig. 7, the algorithm is correct. It is easy to see that priority (dierent-priority ) access failure if the processes each consensus level is accessed in constant time (recall involved (i.e., p and the process(es) preempted at level that the uniprocessor C&S, F&I, and consensus algorithms l) have the same priority (dierent priorities). used at each level take constant time). Thus, letting the Obviously, there is a correlation between the number constant c denote the worst-case number of statement of access failures that can occur on a processor and the executions per level, we have the following theorem. number of preemptions that can occur on that processor. The number of preemptions that can occur in turn Theorem 4: In a P -processor system in which the prodepends on the number of processes per processor and cesses on each processor are scheduled using a hybrid the size of the scheduling quantum Q. We show below scheduler, consensus can be implemented in a wait-free that with a suitable choice of Q, the number of levels for manner in polynomial space and time for any number which an access failure occurs on each processor is lim- of processes using read/write registers and C -consensus ited to a fraction of all the levels. Using a pigeon-hole objects if C P and Q max(2c; c(2P + 1 ? C)). 2 argument, it is possible to show that if the number of lev- If we were to incorporate time within our model, then els L is selected appropriately, then in any history, there we could easily incorporate the Tmax term given in Table exists some level for which no process on any processor 1 in the bound on Q given in Theorem 4. experiences an access failure. We call such a level a deciding level . A simple inductive argument shows that at all levels below a deciding level l, the output value of level l is used by every process on every processor when accessing any level below l, even if access failures We have shown that, in a P-processor system with hyoccur when accessing levels lower than l. Thus, the nal brid scheduling, any object with consensus number P in decision value returned by all processes is the same. Herlihy's wait-free hierarchy is universal for any numWe now state some lemmas about the number of ac- ber of processes executing on P processors, provided the L
5 Concluding Remarks
11
shared variable
Output : valtype [ ? initially ?
procedure decide (val : valtype ) returns valtype private variable
output : valtype
1: if local-consensus (pr(p); priority (p);p) 6= p then 2: while Output = ? do od; 3: return(Output ) ; 4: output :=global-PB-consensus (val ); 5: Output := output ; 6: return(output )
Figure 9: Multiprocessor consensus with fair scheduling. scheduling quantum is of a certain size. We have also given an asymptotically-tight characterization of how large the scheduling quantum must be for this result to hold. All of the algorithms we have presented have the interesting property of being correct not only in a hybridscheduled system, but also in any system that is either purely priority-based or purely quantum-based. Thus, our results both unify and extend earlier work on waitfree synchronization in multiprogrammed systems [1, 7]. We have heretofore assumed that all task priorities are statically de ned. However, the multiprocessor consensus algorithm given in Sec. 4 can be easily extended to support dynamic priorities. We can obtain the identi ers required by the algorithm by using a uniprocessor renaming algorithm that assigns the same name to same-priority processes. Our uniprocessor consensus algorithm is correct in a dynamic-priority system as stated, so we know that reads and writes are universal in a hybrid-scheduled uniprocessor with dynamic priorities. As a result, it is straightforward to show that the required renaming object can be implemented. Unfortunately, extending our uniprocessor C&S algorithm to support dynamic priorities isn't as simple. Unlike our multiprocessor consensus algorithm, our C&S algorithm requires a long-lived renaming algorithm [5] that allows names to be repeatedly acquired and released. Such a renaming algorithm must ensure that names are assigned to processes in a manner that is consistent with their relative priorities. Given the universality of reads and writes on a hybrid-scheduled uniprocessor, we know the required renaming object can be implemented. However, for our purposes, it must be implemented in O(V ) time, and it is currently unclear to us how to do this. Our de nition of a hybrid scheduler assumes that such a scheduler can \unfairly" allocate quanta to processes of the same priority, i.e., such a process may be forever ignored. If we assume that quanta are \fairly" allocated, then it is possible to obtain a smaller upper bound on the size of Q than that given by Theorem 4. In particular, P-consensus primitives can be used to implement consensus for any number of processes on P processors, using a quantum of constant size. To see this, consider
the multiprocessor consensus algorithm given in Fig. 9. In this algorithm, we rst \elect" one process per priority level on each processor by means of invoking a local uniprocessor consensus object for that level/processor. Each local consensus object can be implemented using the uniprocessor consensus algorithm described in Sec. 3.1. If a process loses the election for its priority level, then it waits until the other processes choose a decision value. Because the quantum-based scheduling on each processor is fair, each process waits for only a nite time. The election winners all invoke a multiprocessor prioritybased consensus algorithm to determine a decision value. This algorithm can be implemented using a quantum of constant size exactly as in Fig. 7. One may feel that the algorithm in Fig. 9 violates the spirit of wait-freedom. However, if we de ne wait-freedom to mean that each process completes an operation in a nite number of its own steps, then this algorithm is indeed wait-free. Acknowledgement: Srikanth Ramamurthy helped in developing the algorithm in Fig. 3. Eli Gafni initially suggested to us the possibility of designing wait-free algorithms that work correctly in both priority- and quantum-based systems.
References [1] J. Anderson, R. Jain, and D. Ott. Wait-free synchronization in quantum-based multiprogrammed systems. In Proceedings of the 12th International Symposium on Distributed Computing, pp. 34{48. Springer Verlag, Sept. 1998. [2] J. Anderson and M. Moir. Universal constructions for multi-object operations. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pp. 184{193. Aug. 1995.
[3] M. Fischer, N. Lynch, and M. Patterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374{382, April 1985. [4] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124{149, 1991. [5] M. Moir and J. Anderson. Wait-free algorithms for fast, long-lived renaming. Science of Computer Programming, 25(1):1{39, Oct. 1995. [6] QNX Software Systems, http://www.qnx.com/. [7] S. Ramamurthy, M. Moir, and J. Anderson. Realtime object sharing with minimal support. In Proceedings of the 15th ACM Symposium on Principles of Distributed Computing, pp. 233{242. May 1996.
[8] Silicon Graphics IRIX REACT and REACT Pro Real-Time Support, http://www.sgi.com/real-time/. [9] Wind River Systems, http://www.wrs.com/.
12
Appendix A: Detailed LowerBound Proof In this appendix, we give a more detailed version of the lower-bound proof sketched in Sec. 4.1. As stated before, we restrict attention to establishing the bounds given in Table 1 for values of C such that P C < 2P. We also restrict attention to systems with schedulers that are purely quantum-based, i.e., we assume there is only one priority level per processor. In the remainder of this appendix, we let A denote a consensus algorithm for an arbitrary number of (deterministic) processes executing on a P-processor system with a single priority level per processor such that all underlying objects are either read/write registers or C-consensus objects, where P C < 2P. As before, we assume that any invocation of any of these underlying objects takes one atomic statement. We also assume that, in any history, if a C-consensus object is invoked more than C times, then all invocations after the C-th return the value ?. Our objective is to show that A cannot be correct in a system for which Q 2P ? C. Assume, to the contrary, that A is correct in such a system. We show that this assumption leads to a contradiction. Our proof is based on a valency argument [3, 4], and we assume that the reader is familiar with the general structure of such an argument. In particular, we assume that the reader knows what it means for a state of A to be bi-valent , uni-valent , or x-valent , where x is the input value of some process [3, 4]. We will denote the processes assigned to processor i as pi1 ; pi2; : : :; piM [i] . We will derive a contradiction by showing that there exists a history of A consisting of an in nite sequence of bi-valent states, contradicting the fact that A is waitfree. The proof is similar to other valency arguments presented elsewhere [3, 4], with the important exception that we must carefully keep track of when a given process may be preempted. We will view each processor's scheduler as an \adversary" that declares processes to be preemptable in a manner that hinders the overall ability of processes to choose a common decision value. (Recall that each processor's scheduler is permitted to continually choose not to schedule some ready process.) Formally, we model the scheduler on each processor i by means of a collection of boolean auxiliary variables Eligible ij , Running ij , and Preemptable ij , where j ranges over f1; : : :; M[i]g. Informally, Eligible ij indicates \when" process pij begins executing algorithm A, Running ij indicates whether pij that is participating in algorithm A is the currently-running process on processor i, and Preemptable ij indicates whether process pij , if running, can be preempted. Formally, we require that Eligible ij is stable, i.e., once it holds, it continues to hold; pij has an enabled statement i Running ij holds;
Running ij ) Eligible ij ; (8j; j 0 : j = 6 j 0 :: :(Running ij ^ Running ij 0 )); and for any pair of consecutive states t and t0 in any i history, if Preemptable j holds at t0 , then Running ij holds at t, if Running ij holds at t but is false at t0, then Preemptable ij holds at t0 , and if Preemptable ij holds at t then Preemptable ij is false at t0 . For brevity, we say that pij is eligible (preemptable ) at a state if Eligible ij (Preemptable ij ) holds at that state. Let s2 s?! ?1 s s1 t2 ?! s0 t1?! tk ?! be any history of h = t0 ?! A. We say that process pij is m-ful lled at state tk i one of the following holds: (i) sk?1 is a statement execution of pij and this statement execution causes pij to terminate; (ii) Preemptable ij is false at each state tl , where l k; (iii) pij executes at least m atomic statements in h between states tl and tk , where tl is the most recent state at or before tk at which Preemptable ij holds. With these k
k
de nitions, we can now state our primary assumption regarding preemptions. Preemption Axiom: At each state t in any history, Preemptable ij holds only if pij is Q-ful lled at t. Furthermore, Running ij ^ :Preemptable ij is not stable in any history. 2 Informally, a process may suer its rst preemption at any time, and the number of statements it can execute in a single quantum is nite, but at least Q, unless it terminates. Other than the restrictions stated here, we do not constrain preemptions in any way. Our job now is to show that there exists a history of A consisting of an in nite sequence of bi-valent states such that in that history the processes on each processor are scheduled in accordance with the Preemption Axiom and the other restrictions stated above. The existence of such a history will show that our characterization of the minimum quantum required for wait-free synchronization as given in Table 1 is correct. Our proof focuses on bi-valent states that satisfy some additional properties regarding the preemptability of processes, which we now de ne. In the remainder of this appendix, we drop our convention that histories be s s ?1 s s 2 1 0 maximal. Consider a history t0?!t1 ?!t2?! ?!tk . We say that state t is reachable from state tks i there exs2 s?! ?1 s1 t2?! ?1 s0 t1 ?! tk ?! tl , where ists a history t0?! t = tl . In this case, we say tl is reachable from tk by applying statement sl?1 . The \special" bi-valent states that our proof focuses on are de ned next. s2 s1 t2 ?! s0 t1?! De nition: Consider the history t0 ?! s ?1 ?! tk . We say that state tk is Q-bi-valent i it is bi-valent and the following two conditions hold. (i) Eligible ij is true at tk for i Q ^ j = 1 and is false at tk for all other values of i and j, i.e., precisely Q pro-
13
k
k
k
l
cesses are eligible for execution at tk , one on each of Fig. 10. If s is applied at t0 , we reach an x-valent state processors 1 through Q. In particular, no processes ux, and if another statement s0 is applied and then s is are eligible for execution on other processors. applied, we reach a y-valent state uy for some y 6= x. Consider the set of states T that are reached from state (ii) There exists a one-to-one mapping qt : f0; : : :; Q ? t0 by executing exactly one statement from each process 1g ! fp11; p21; : : :; pQ1 g such that process qt (j) is j- in fqt (0); : : :; qt (Q ? 1)g = fp11; p21; : : :; pQ1 g. Because ful lled at state tk . Informally, there is at least one statement s is enabled at t0, and because the execution preemptable process at any state. 2 of statement s always (by assumption) results in a univalent state, all states in T are uni-valent. Moreover, at Most valency proofs begin by showing that that there least one state in T is reachable from each of ux and uy . exists an initial state that is bi-valent, which is straight- Therefore, T contains at least one x-valent and y-valent forward [3]. We need an initial state of A that is Q- state. Let u0x (u0y ) denote an x-valent (y-valent) state in bi-valent. Showing the existence of such a state is also T. These states are illustrated in inset (d) of Fig. 10. straightforward, because for any initial bi-valent state, we can \de ne" requirements (i) and (ii) in the de nition It is easy to show that at state t0, each of of a Q-bi-valent state to hold. Intuitively, this amounts qt (0); : : :; qt (Q ? 1) is enabled to invoke a single to properly positioning each process's invocation of al- C-consensus object O. Informally, this is because gorithm A at an appropriate point within a quantum. read/write registers do not have the required symmetryThus, we have the following lemma. breaking power, and operations on dierent objects commute. By construction, process qt (Q ? 1) is Q-ful lled Lemma 4: There exists an initial state of A that is at both states u0x and u0y . Therefore, we can de ne Q-bi-valent . 2 qt (Q ? 1) to be preemptable at both of these states. Let t0 be such an initial state. In the remainder of Let us suppose that qt (Q ? 1) is preempted by some this appendix, we limit attention to histories that begin process q at both u0x and u0y , and that q executes in isowith t0 . We also make the following assumption about lation from both of these states, as depicted in Fig. 10(d). the histories of A. Then, q must eventually invoke object O, because this is the only way for it to determine the decision that has Eligibility Axiom: In all bi-valent states that are been reached. As soon asQ+1q invokes O in both histoQ 1 2 reachable from t0, fp1 ; p1; : : :; p1 g = fqt0 (0); : : :; ries, we de ne processes p1 ; : : :; pP1 to be eligible. We 2 extend each qt0 (Q ? 1)g are the only eligible processes. history as follows (in accordance with the and Eligibility Axioms). First, we execute Thus, we are restricting attention to systems in which Preemption new processes become eligible for execution only after p1Q+1 in isolation as shown in Fig. 10(d). Process pQ1 +1 the original set of Q running processes reaches a deci- must eventually invoke object O for the same reasons q did. After it invokes O, we preempt it by process sion. Informally, this models a situation in which the that yet sufschedulers on these processors \maliciously" choose to p2Q+1 . This is possible because pQ1 +1 has notQ+1 only allow processes that do not invoke the consensus fered its rst preemption. Arguing as before, p2 must algorithm to run until after a decision has been reached. eventually access object O. We continue in this manThe next lemma will be used to inductively construct an ner, executing rst pQ1 +2 until after it invokes O, then in nite sequence of Q-bi-valent states. p2Q+2 , then pQ1 +3 , and so on. Note that by the time we s2 execute process pP2 in isolation, O is been invoked by s1 t2?! s0 t1 ?! Lemma 5: Consider the history t0?! Q+2(P ? Q) = 2P ? Q processes in both of the histories ?1 s?! tk , where tk is Q-bi-valent. Let j be any value being constructed. However, O has consensus number such that 1 j Q, and let s denote the statement C 2P ? Q. Therefore, process pP2 obtains the value ? of process qt (j) that is enabled at state tk . Then, there when it invokes O in both histories, i.e., it cannot distinexists a bi-valent state t that is reachable from tk by ap- guish between the two histories. If we execute process plying statement s. pP2 until it terminates, it therefore returns the same dein both histories, which contradicts the fact Proof : Suppose that no such state t exists, i.e., every cision value 0x is x-valent and u0y is y-valent. This completes that u state that is reachable from tk by applying s is uni-valent. 2 Let u be the state that results from executing statement the proof. s at state tk . Then, u is uni-valent. Suppose that it is x-valent. Because tk is bi-valent, there exists a z-valent Lemma 6: From any Q-bi-valent state t, it is possible state u0 that is reachable from tk for some z 6= x. Either to reach another Q-bi-valent state t0 such that qt(Q ? 1) statement s is applied prior to reaching u0 or not. These is preemptable at t0 . two possibilities are illustrated in insets (a) and (b) of Fig. 10. In either case, we can conclude that there exists a state t0 reachable from tk as illustrated in inset (c) of Proof : Let t be a Q-bi-valent state. Let s be the statek
k
k
k
k
k
k
k
k
k
k
14
ment of process qt(0) that is enabled at state t. By Lemma 5, there exists a bi-valent state u0 that is reachable from t by applying statement s. If we de ne each of qt(0); : : :; qt(Q ? 1) to be nonpreemptable at each state between states t and u0 inclusive, then u0 is a Q-bi-valent state and qt(0) is 1-ful lled at u0 and each qt (j), where j 1, is j-ful lled. In a similar manner, we can reach a Q-bi-valent state u1 from u0 such that qt(0) is 1-ful lled at u1, qt(1) is 2-ful lled, and each qt(j), where j 2, is j-ful lled. Continuing on, we can eventually reach a Qbi-valent state uQ?1 at which each qt(j) is (j+1)-ful lled. If we de ne process qt(Q ? 1) to be preemptable at state uQ?1, in which case it is 0-ful lled, then uQ?1 still meets the de nition of a Q-bi-valent state. 2 Using Lemmas 4 and 6, an in nite history can be constructed that does not violate the Preemption Axiom. The arguments given here can be easily extended to apply if C 2P. This gives us the following theorem, which was originally stated in Sec. 4.1. Theorem 3: In a P -processor system in which the processes on each processor are scheduled using either a quantum-based scheduler or a hybrid scheduler, consensus cannot be implemented in a wait-free manner for any number of processes using read/write registers and C consensus objects if C P and Q max(1; 2P ? C). 2
15
(z−valent) t
s
s
u (x−valent)
t
t’ s’
k
u’ (z−valent) s
s
s
s
(z−valent)
(x−valent) (y−valent)
t’ s’
k
u (x−valent)
(x−valent) (y−valent) (b)
t’
t’ Q p1 − p invoke 1 1 object O
s’
s
u
x (x−valent)
u’ (z−valent)
s
(a)
s
s
u’x uy
u’y
execute q until it invokes O
O
O
execute p Q+1 1 until it invokes O
O
O
O
O
O
O
(y−valent) (c)
execute p
Q+1 2
until it invokes O
P execute p 2 until it invokes O; invocation returns
(d)
Figure 10: Proof of Lemma 5. Statement s either is (inset (a)) or is not (inset (b)) applied in reaching state u0 . In either case, the states depicted in inset (c) exist for some y = 6 x. This follows by applying a simple inductive argument that considers the execution of s at every state where it is enabled along the history ending in u0 in insets (a) and (b). Recall that s always (by assumption) results in a uni-valent state. Inset (d) shows two histories ending in x-valent and y-valent states, respectively, such that process pP2 chooses the same decision value in both histories, which is a contradiction.
16
Appendix B: Proof of Lemma 3 In this appendix, we present the proof of Lemma3, which was used in establishing the correctness of the multiprocessor consensus algorithm in Fig. 7. We begin by stating and proving two lemmas that deal with the two extreme cases C = 2P and C = P. We remind the reader that these lemmas apply to a given, xed history. Also, in the statements of these lemmas, we consider a process to access a level i it successfully acquires a port for that level. Lemma B.1: Suppose that C = 2P and that Q is large enough to ensure that each process can be preempted at most once by processes of equal priority while accessing any two consensus levels in succession (the two levels don't have to be consecutive). If processes p and q on processor i cause a same-priority access failure at level l, then at least one of p and q does not cause a samepriority access failure at any level less than l.
cess failure occurs (does not occur) on processor j. The rst same-priority access failure caused by a process p may be at any level. However, by our choice of Q, for each subsequent same-priority access failure caused by p, p must have accessed at least P levels without being preempted by processes of equal priority. Therefore, nj (fj ? M)P. From this and the fact that fj +nj = L, we can deduce that fj (L + MP)=(1 + P). Therefore, AF same P(L + MP)(1 + P). 2 We now prove Lemma 3, which is restated below. Lemma 3: Suppose that C = P +K , where 0 K P , and that Q is large enough to ensure that each process
can be preempted at most once by processes of equal priority while accessing any P ? K + 1 consensus levels in succession (these levels don't have to be consecutive). If there are L levels, and at most M processes on any processor, then AF same KM + (P ? K)(L + M(P ? K))=(1 + P ? K). Furthermore, if L > (K +1)M(1+P ? K)+(P ? K)2 M, then a deciding Proof : Assume, to the contrary, that p and q cause a level exists . same-priority access failure at level l, and that p (re- Proof : By Corollary B.1 and Lemma B.2, there are at
spectively, q) causes a same-priority access failure at a lower level together with some other process (there are two ports per level). Suppose that p and q are both of priority v. Also, without loss of generality, assume p accesses level l before q does. Process p's preemption at level l is not its rst, since it causes an earlier access failure. By our choice of Q, this implies that p must have accessed level l ? 1 and must have done so without being preempted by tasks of equal priority. In fact, p must have accessed level l ? 1 without being preempted by higher-priority tasks either, because had such a preemption occurred, p would not have subsequently accessed level l. Hence, when p gets preempted at level l, Port [i; v] equals a port number of some level that is at least l. Therefore, either q does not access level l, or l is the rst level accessed by q after having been preempted. By our choice of Q, the latter implies that q will not be preempted at level l by tasks of equal priority. In either case, we have a contradiction of our assumption that p and q cause a same-priority access failure at level l. 2 Corollary B.1: If C = 2P and Q is as de ned in Lemma B.1, and if there are at most M processes on any processor, then there are at most M same-priority access failures on any processor. Thus , AF same MP. 2 Lemma B.2: Suppose that C = P and that Q is large
most KM same-priority access failures on processors 1 through K, and at most (P ? K)(L + M(P ? K))=(1 + P ? K) same-priority access failures on processors K +1 through P. Therefore, AF same KM + (P ? K)(L + M(P ? K))=(1 + P ? K). A deciding level exists if L > AF same + AF di AF . Hence, by Lemma 2 and the bound for AF same stated in the previous paragraph, a deciding level exists if L > M + KM + (P ? K)(L + M(P ? K))=(1 + P ? K). Simplifying this inequality yields L > (K +1)M(1+P ? K) + (P ? K)2 M. 2
enough to ensure that each process can be preempted at most once by processes of equal priority while accessing any P + 1 consensus levels in succession (these levels don't have to be consecutive). If there are L levels, then AF same P(L + P)=(1 + P). Proof: In the history under consideration, let fj (nj ) be the total number of levels for which a same-priority ac-
17