a universal construction ... in this paper appeared in Proceedings of the llth .... The question ...... B, and such that in a no memo~ fault ..... 0(n3) reliable consensus objects over a bounded domain. Simple extensions ..... of Computer Science.
Computing
YEHUDA Tel-Aviv
Shared Objects
AFEK Universi@
DAVID
Tel-Aviv,
Israel, and AT&T
Bell Laboratories,
Murray Hillj
New Jersq
S. GREENBERG
Sandia National
MICHAEL AT&T
with Faulty
Laboratories
MERRITT
Bell Laboratories,
Murray Hill, New Jersqy
AND GADI AT&T
TAUBENFELD Bell Laboratories,
Murray Hill, New Jersey
Abstract. This paper investigates the effects of the failure of shared objects on distributed systems. First the notion of a faulty shared object is introduced. Then upper and lower bounds on the space complexity of implementing reliable shared objects are provided, Shared object failures are modeled as instantaneous and arbitraty changes to the state of the object. Several constructions of nonfaulty wait-free shared objects from a set of shared objects, some of which may suffer any number of faults, are presented. Three of these constructions are: (1) A reliable atomic read/write register from 20~ + 8 atomic read/write registers ~ of which may be faulty, (2) a reliable test& set register for n processes from n + 10 primitive test & set registers, one of which may be faulty, and 3n + 13 reliable atomic registers, and (3) a reliable registers when f of these may be faulty. Using consensus object from 2f + 1 read-modify-write these constructions a universal construction of any linearizable shared object from a set of either
A preliminary version of the results presented in this paper appeared in Proceedings of the llth Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 47-58.
David Greenberg was supported DE-AC04-76DPO0789.
in part
by the U.S. Department
of Energy
under
contract
Authors’ addresses: Y. Afek, Tel-Aviv University, Tel-Aviv, Israel and AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974 D. S. Greenberg, Sandia National Laboratories; M. Merritt, AT&T Bell Laboratories, Room 213-409, 600 Mountain Avenue, Murray Hill, NJ 07974, e-mail: mischu@research,att.tom; G, Taubenfeld, AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and\or a fee. 01995 ACM 0004-5411/95/1100-1231 $03.50 Journal
of the Associationfor ComputinsMachine~, Vol. 42,1$0.6,November1995,pp.
1231-1274.
Y. AFEK ET AL.
1232 n-processor consensus objects or n-processor faulty, is presented. Categories General
and Subject Descriptors:
Terms: Algorithms,
Additional
Key Words
read-modifywrite
B.3.2. [Memory
fault-tolerance,
and Phrases: Atomic
registers,
Design Styles-shared memory.
Structures]:
shared memory,
some of which may be
synchronization.
operations.
1. Introduction The
desire
for increased
causing
many
real-world
tributed
systems.
performance,
reliability,
higher
applications
to be moved
to this work,
research
Previous
and decreased
cost is
mainframes
to dis-
from
on the reliability
of distributed
systems has concentrated on tolerating the failure of individual processors. It has been shown that, by using shared objects to provide communication and coordination between processors, it is possible to tolerate processor failures. Distributed systems can be built that achieve the expected high performance when no processors fail and that are able to continue (at a lower performance) when even all but one processor fails. These fault-tolerant systems depend critically on strong shared objects, yet little
research
objects.
This
has been directed paper
explicitly
to studying
defines
how to tolerate
the problem
shows how they can be made fault tolerant. Before considering ~azdty shared objects Intuitively,
as the name
implies,
a shared
of faulty
we must
object
allows
faults
in the shared
shared
define
objects
a shared
two or more
and
object.
processors
to share information (a formal definition is given in Section 2). The simplest shared object is a shared memory word (or register), into which one processor writes and any other may read.1 The necessity of having more complex shared objects was recognized by IBM and other computer manufacturers in the early 1970’s. The inability of concurrent systems that used only read/write memory cells to maintain definition
(and
a concurrent construction)
queue
or stack or to reach
of stronger
primitives
such
consensus
led to the
as test-and-set
and
semaphores [Dijkstra 1974; Peterson and Silberschatz 1985]. Most of this paper is concerned with these stronger types of objects (one section, Section 4, is devoted to read/write registers). Herlihy has defined a hierarchy of progressively stronger shared objects [Herlihy 1991]. Objects at each level are able to perform impossible for objects at the lower levels. The question asked
tasks that are in this paper is
whether it is possible to use shared objects that sometimes fail to meet their specifications. It is shown, here, that it is indeed possible to use a set of objects some of which are faulty (at each of three levels of the hierarchy: read/write, test-and-set, and read-modify-write) to produce compound objects that are fault tolerant. The highest level of the hierarchy (exemplified by our read-modi&-write) is universal, that is, allows the construction of any other shared object. A result of our constructions is that any shared memory object Y has a fault tolerant construction from any object type X in the highest level, that is, a construction from a set of objects of type X some of which may be faulty.
1See, for example, Burns and Peterson Awerbuch [1987].
[1987], Lamport
[1986], Singh et al. [1994], and Vit6nyi
and
Computing
with Faulty
It should
be clear
memory. However, less straightforward. locations
that
Shared that
Objects
a shared
1233
object
is very different
can be used for
storing
and retrieving
processors,
and is not used for synchronization.
of semantic
rules
that
for synchronization. adding tion
either
features
from
local
processor
the distinction between shared object and shared memory is A shared memory typically consists of a set of memory
restricts A shared
hardware
the behavior memory
or software
of the shared
data
by several
A shared of the object
object that
can be used to build
protocols
to enforce
different
includes
a set
thus can be used shared
objects
by
the timing/coordina-
object.
Since a shared object is defined by its behavior, it can be implemented in many ways, either through hardware or through the combination of software and hardware. Thus, the failure of a shared object can occur in many ways. Our intent is that the model of shared object failure should cover all possible failures in implementation. Hardware implementation can fail in several ways: the
data
contained
incorrectly Software infinite
order protocols loop,
assigned
in
the
events,
object
or
can
fail
a protocol
in
be can
several
mistakenly
to it, or a protocol
can
requests
receiving
corrupted, be lost
ways:
allowing
the
due
a buggy a process
data in an order
control
to
logic
switching
can
failures.
protocol
entering
an
to affect
memory
not
for which
no contingency
had been planned. One might attempt to apply common techniques for dealing with faulty local memory to faulty shared objects. These techniques include keeping many copies of each datum or, in general, using some type of redundant coding that allows errors to be detected and/or is much more complex to implement
corrected. Unfortunately, a shared object than a memory cell of a uni-processor;
that is, common primitives such as read-modify-write writes require a memory which is much more than (see, e.g., Smith memory
[1982]).
associated
with
The use of standard the
object
will
or even simple reads and a passive repository of data
redundancy
not,
access to the object through protocol failure. If redundancy is spread across objects, then
for the
techniques
example,
within
prevent
encoding
a loss
is subject
the of
to the
timing inconsistencies that are the bane of all distributed algorithms. Techniques to recti~ faults in shared memory objects will necessarily have to incorporate distributed coordination techniques. It is precisely how to supply redundancy while maintaining distributed coordination which is the subject of this paper, After a discussion of shared object failures and the specification of faulty memory primitives (in Section 2), the body of this paper presents a sequence of constructions that use faulty shared memory objects. Section 3 presents elementary relations among general fault-tolerant constructions, showing how different constructions can be composed. Section 4 begins with constructions of reliable read/write memories from faulty read/write memories. Following Lamport [1986], we present simple constructions of safe and regular registers and conclude with a construction of a reliable atomic register from faulty test & set objects from similar objects, atomic registers. In Section 5, reliable some of which might be faulty, are constructed. In Section 6, implementations of consensus are studied, using unreliable read-modifi-write primitives. The constructions in this section demonstrate that faults do not qualitatively decrease the power of these primitives, in that they retain
their
positions
in the memory
hierarchy
of Herlihy
[1991].
Moreover,
in
1234
Y. AFEK
ET AL.
combination with the earlier register results, our consensus be used to implement the universal construction of Herlihy
constructions can [1991]. Hence, for
example,
to
shared
faulty object.
read-modify-write The paper
1.1.
RELATED
memory
systems
objects
are
distributed restricted may
WORK.
reliable.
2 This
systems
to
either
affected,
previous
periods
in
has of
paper
general
without
research
specific
studies
total any
explored
time.
For
can
a discussion
Theoretical
typically
only
be
primitives
closes with
research
process in
number, restriction failures example,
on
failures
describes faults
the
first
in on
which such
assumes study
objects. the
implement
of
that the
3 Memory
number
of
data
timing
of
the
the are
any
problems.
fault-tolerance
and
shared or
be used of open
restricted
constrained
to memory
in
shared
the
shared
tolerance
of
failures
are
objects
that
faults. occur
Some during
faults
are
self-stabilizing systems defined by Dijkstra [1974]. Self-stabilizing systems are required to recover once the final memory fault has occurred, and the system is in an arbitrary state. That is, self-stabilizing protocols may start at any erroneous, globally inconsistent state and must always reach a correct global state satisfying particular safety requirements. Three previous papers investigated initialization failures, a special case of studied
in work
on
self-stabilization model, only the initial
state
[Fischer et al. to appear; 1993; Moran et al. 1992]. In initial state of shared objects may be corrupted, while
of the processors
is not corrupted
(e.g., the program
that the
counters).
In
this paper, the corruption may repeat at any time during the run. In Section 6 of this paper, we use implementations of consensus to explore properties of faulty shared memory. The consensus problem is fundamental in distributed
computing
and is at the core of many
algorithms
for fault-tolerant
distributed applications. Much is known about the consensus problem in other fault models.4 Sections 4 and 5 investigate the question of constructing reliable registers in an unreliable environment. This relates to the fundamental problem of implementing one type of shared object from another. Previous work on shared [1987],
object Herlihy
implementations [1991], Lamport
includes Bloom [1987], Burns and Peterson [1986], Li et al. [1989], Plotkin [1988], Singh et
al. [1994], and VitAnyi and Awerbuch [1987]. As noted above, there is a relationship between implementing a parallel processor shared memory and the construction of shared objects. Implementing shared objects of any type in a network-based machine is difficult. The task of including coordination protocols and fault tolerance could not begin without the basic mechanisms for sharing. One approach to providing sharing is to use local caches to hold pieces of global data required by each processor. Special operating system functions and hardware remove from the user the burden of coordinating the accesses to the shared data. One such system has been proposed, implemented and analyzed in Li and Hudak [1989]. The KSR, DASH, and Alewife machines are a few examples of machines that use such ‘See, for example, Abrahamson [1988], Afek et al. [1993], Bloom [1987], Burns and Peterson [1987], Chor et al. [1987], Herlihy [1991], Lamport [1986], Plotkin [1988], Rabin [1982], Singh et al. [1994], and Vit~nyi and Awerbuch [1987]. 3Concurrently and independently of our work, Jayanti et al. [1992] have addressed similar ~roblems (see Section 7 for more on this paper). See, for example, Abrahamson [1988], Aspnes and Herlihy [1990], Chor et al. [1987], Fischer [1983], Fischer et al. [1985b; 1985c], Loui and Abu-Amara [1987], and Saks et al. [1991].
Computing
with Faulty
Shared
Objects
1235
distributed shared memory [Bell 1992]. Another, more software-oriented, approach is implemented in the Linda system [Carriero and Gelernter 1989]. In Linda,
an abstract
operations
tuple
space is shared
are available
one of these
to insert
methods
techniques) expect that
will our
mentations
in order
2. Specifying
and
(instead delete
of implementing
of cache
tuples.
sharing
(or
lines
or pages),
Obviously,
and
the choice
any of many
other
of
clever
affect the types of errors expected from shared objects. We ideas on fault tolerance will be specialized to various impleto increase
Faulty
efficiency.
Shared Memoiy
Objects
We consider a collection of asynchronous processes that communicates via a collection of shared memory objects. The shared memory may consist of a variety of shared data objects. These shared objects are subject to faults. Each fault with
of a shared object is modeled as a state change that appears respect to the processes’ operations. The local (nonshared)
each process
is assumed
to be atomic memory for
to be reliable.
An object @’ shared by n processes can be specified via a state machine which the state transitions are labeled by the invocations and responses operations
performed
Definition 2.1. A {Q, S, 1, R, 8} where: ● ●
in of
by the processes. sequential
specification
Q 1S a ~finite or infinite) set of states. S G Q 1s a set of initial states. of I=(lrwl, ..., lnu~) is an n-tuple
●
of an object
sets, where
type
each
is a quintuple
Inui
is a set of
symbols
denoting the operation invocations by process i. Let Irm = IJ i lnui. of sets, where each Resi is a set of . R = (Resl, . . . . Res~) is an n-tuple ●
symbols denoting the operation responses for process i. Let Res = U ~Resi. Define the set of operations by process i on @ to be @pi = Inui x Resi, all the
two-character
Op = U iopi. This state concatenating
strings
Then
of
invocations
and
8 c Q x Op X Q is the transition
by some process every process
by
i with
on shared
no succeeding objects
i, and every
inui ● Invi,
that (s, inuiresi, s’) = 8. The sequential specification
response,
are required
allows
there
and
let
obtained by graph. The
prefixes of these strings. invocation, that is, an invi
resi, by i).
to be total: exists
operations
i,
relation.
machine denotes a set of (finite and infinite) strings, the edge labels along paths in the state transition
sequential runs of @ are the (finite and infinite) (Taking prefixes allows runs to end with a pending Operations
responses
for every state
s ● Q,
resi = Resi and s’ e Q such
to be specified
as atomic
state
transitions. However, in asynchronous concurrent systems operations have duration and can overlap in time. This is modeled by allowing the interleaving of invocations and responses by different processes, so that between the invocation and response by one process may be any number of invocations or responses by other processes. Thus, concurrent runs of the object are modeled as elements of ( Inu + Res)m (i.e., finite and infinite runs). Specific correctness conditions constrain these runs by relating them to those in the sequential specification. One such correctness condition is linearizability [Herlihy and Wing 1987].
Y. AFEK ET AL.
1236 We next define
the notion
of linearizable
runs of an object
@. Given
a string
of events a = (Inu + Res)m, define a partial order, t >0, let L = log,+ ~ CONST(X, (t, IXJ),X). Then:
CONST(X, PROOF. base
As
construction
strongly
wait-free
illustrated
in
of a reliable
objects
(f,@), the
left
object
of type
X) side
of type
s ((t of
+ l)f)~.
Figure X,
X, t of which
1, assume
that
CONST(X, may be ~-faulty.
using
C =
there
is a
(t, ~), X)
Computing
with Faulty
Shared
Objects
1239
Embedded
object
primllive
Base construction from
3 objects
Obje Ct
High-level
of X
from
of type X,
construction
3 embedded
of X
objects
each constructed
from
of type X,
3 primitive
objects of type X. ~G. 1.
Constrictions
The goal is to construct
a strongly
~ of the primitive base construction, X.
Call
in the proof of Theorem
wait-free
object
3.2.
X that
each of these
C primitives
each of the C embedded
objects
“embedded using
resilient to t faults of the embedded Consider next how many primitive
objects”,
and now
the same base construction.
as illustrated on the right side of Figure primitive objects of type X. Moreover, resilient to t faults of the true primitives,
even if
embedded objects return fail. The theorem follows
objects. faults are necessary
C strongly
some value, by
wait-free
applying objects
implement The result,
1, is a construction of X from C2 each of the embedded objects is and the single high-level object is to cause the high-level
objects must object to fail: at least t + 1 of the embedded these embedded objects will be correct unless at least t + objects fail. Hence, the high-level object is resilient to (t + faults. The strong wait-freedom of the base construction
construct
is reliable,
objects are w-faulty. Consider the following idea: take the of type which survives up to t faults among the C primitives
fail,
and
primitive
this
First,
idea
recursively. X,
of
1)2 – 1 primitive ensures that the
even if t + 1 or more
of type
each
1 of its primitive
each of which
objects
recursively
is resilient
to
[ ~/(t + l)J faulty objects, then use these C objects to construct a single object The result is an object that can tolerate the X, using the t fault construction. [~/( t + 1)] resilient objects. fault of t of the embedded By recursion, each of the C embedded objects, c, will behave in a fault-free objects used to manner, provided no more than 1~/( t + 1)j of the primitive in c fail, then c may construct c are faulty. If more than [ ~\( t + 1)]primitives no longer behave correctly, and may appear to be faulty, to the higher level
Y, AFEK
1240 construction. But by the strong at least return. In order to cause the highest
wait-freedom
of the construction,
level of the recursion
faulty
behavior,
behavior.
For
one of the
embedded objects to fail, it. Since the total number
there must be at least [ ~\( t + 1)1 memory faults in of faults is ~ and ~ – t[ ~\( t + 1) 1s ~/(f -t 1), none objects
faulty
to exhibit
objects
embedded
exhibit
calls to c will
~ + 1 of the C embedded
of the C – t other
must
ET AL.
is faulty.
Hence,
the final
tolerant of at least ~ memory faults. The total number of X objects used in this construction CONST(X,
Time
T
(t, @),X)
< CONST(X,
(t, ~), X) LIOg’”’fJ+l
s CONST(X,
(t, ~), X)logt+f+l
((t+
complexity
“ CONST(X,
grows
similarly:
bound on a single high-level
Fault-tolerant
constructions
in obvious
THEOREM
suppose
in the base construction
of X, that needed to Then in the
~ faults, each high-level operation ~ operations on primitive objects.
can be composed
For any f, m >0,
3.3.
(f,
CONST(X,
object
+ 1)1, ~), X)
with
fault-intolerant
requires ❑ construc-
ways:
CONST(X,
PROOF.
([~/(t
the number of primitive operations operation on the constructed object.
recursive construction to overcome at most TIOgr+1~+ 1 = ((t + l)~)log’+’
tions
is:
l)f)L.
is an upper
implement
is
(~, CZJ),X)
= CONST(X,
=
construction
The
k), Z)
< CONST(X,
(f,
m, Z)
s CONST(X,
rn, Y)
theorem
of type Z from
in a fault-tolerant
follows
objects
way from
A less obvious
and for all k = {1,...
by taking
oCONST(Y,
‘ CONST(Y,
of type Y, but constructing
allows
construction
each object
components
faulty
objects
(O, k), Z);
O, Z).
a fault-intolerant
potentially-faulty
composition
k), Y)
} U {w},
of type
of type X
of an
of type Y
X.
•l
to be used in a
strongly wait-free, but otherwise fault-intolerant construction of objects of type Y. If the X objects are in fact faulty, the result will be an object of type Y that may be m-faulty, but can then be used in fault-tolerant constructions. Note, that in the following theorem CONST( X, (O, ~), Y) is the usual space complexity of wait-free constructions of Y from X; THEOREM
CONST(X,
4. Read/Write
3.4.
For any f >1, (~,~),
Z)
< CONST(X,
(O, CO),Y)
. CONST(Y,
(f,
~), Z).
Registers
One approach to tolerating faulty read/write registers is to add a software layer between the faulty memory and the user that looks to the user like fault-free memory. In this section, it is shown that this is possible for safe, regular, and atomic read/write registers. That is, we construct safe, regular,
Computing
with Faulty
and atomic
Shared
read/write
Objects
registers
1241
from
a collection
of the corresponding
tives, f of which may be w-faulty. (Henceforth, we “read/write register,” and “atomic read/write register” Atomic registers can be specified using sequential scribed
in Section
2. Safe registers,
as defined
use
“atomic
primiregister,”
interchangeably.) specifications,
by Lamport
as de-
[1986] have behavior
that is sensitive to concurrency. Hence, we directly specify the legal runs of a safe register (over a data domain V): A safe register may be accessed by two processes. One performs read operations, which return values from V, and the other performs write operations, which take values as arguments. A sequence of invocations and responses of these operations is safe if and only if the invocations and responses by each process alternate appropriately, and each read operation that is not concurrent with a write operation returns the value written by the last write operation that precedes the read, value, if there is no such write. (Hence, read operations with
writes
may return
Regular
registers
or returns the initial that are concurrent
any value.)
are also sensitive
to concurrency,
but are more
constrained
than safe registers. They are defined as safe registers, except that read operations must return a value of a write that is either concurrent with the read, or k the last write (or initial value) that precedes the read [Lamport 1986]. A pair of read and write procedures is safe (or regular) if all concurrent executions give rise to safe (or regular) sequences of invocations and responses. The notion of a fault in a safe or regular register is introduced, in the spirit of that
of atomic
A sequence
objects,
as an atomic
of invocations
m is the smallest a safe register
number from
write
and responses such that
the
original
operation
(by some outside
for a safe register
it is possible sequence
to produce
by adding
THEOREM
2~ +
value returns
For
the
upper and
any
m
may
be
infinite.)
multi-reader\
2f + 1 similar registers, from 2f such registers,
singlef of which f of which
CO),safe) < 2f + 1.
1 registers,
(~ +
of
write
(f, 1),safe) > 2f.
—CONST(safe, PROOF.
if,
a legal sequence
4.1
—CONST(safe,(f,
the
agent).
m~azdty
m instantaneous
operations (strings of the form W(ui)ack). (Note that Faults of regular registers are defined analogously. Next we prove that, one reliable, strongly wait-free, writer safe register can be constructed from may be ~-faulty, but cannot be constructed may be l-faulty.
is called
1 with value.
the
bound,
a simple
the
reader
same
value),
Each
high-level
construction
reads then operation
them. it
If
works: the
returns requires
the
reader
that
writer
sees
value;
writes
a majority
otherwise
2f + 1 operations
it
on
primitive registers. For the lower bound, assume to the contraiy that there is a solution using 2f registers. Without loss of generality, let the initial value in the high-level object be O. Let the initial respectively.
values
of the primitive
registers
rl, ...,
rzf
Let a be a run in which process pl runs alone and performs 1 to the high-level register. Let the final values of the registers be U1, . . . . uz~, respectively.
be z-q, ..., uz~, a single write of rl,. ... r2f in a
Y. AFEK ET AL.
1242 Now
consider
a run
~
in
which
runs
pz
alone
and
performs
a single
high-level
read operation, after ~ faults have changed the initial values of Uzf, respectively. Hence, pz finds the registers holding r-2f touf+ l,..., rf+ l>...> values Ul, ..., uf, uf+l, ..., Uzf. Since pl has taken no steps in E, the read by pz must return O. But ~ is indistinguishable by pz from a run y that extends the run a in which pl wrote 1, by ~ faults that change the values of rl,. ... rf followed by a single high-level read operation back to Ul,. ... Uf, respectively, by p2. return
Hence, the read by pz in y (and the 1 written by p ~, a contradiction,
In addition
to bounding
the number
the above proof also shows that necessary in order to implement
so in the ❑
of primitive
at least high-level
indistinguishable
objects,
~)
must
the last argument
2f + 1 primitive read or write
operations operations,
in are thus
matching the upper bound in holds for constructions that are Since the upper bound holds the lower bound holds for one
the theorem. Note that the lower bound also not strongly wait-free. for an infinite number of faults per object, and fault per object, the first part of the corollary
below
is a consequence
3.1,
follows. which
The second implies
CONST(safe, COROLLARY
part
CONST(,safe,
what
of the first
1), safe)
1. previous
in
TEST
& SET.
a system new
subsection,
❑
to be wait-free.
a mechanism This
as each of the two processes test& set registers before
with that
mechanism which
The at
construction most
two
selects
at
is used selects
in the
last
processes. most
two
out
as a “doorway”
exactly
one
sec-
In
winner
this of
to out
n the of
two.9
‘The mechanism introduced in this subsection is very similar in its behavior to a 2-exclusion algorithm. (2-exclusion [Afek et al. 1994; Dolev et al. 1988; Fischer et al. 1979, 1985a] is a generalization of mutual exclusion in which up to two processes may be in the critical section at the same time, but no more.) The difference between 2-exclusion and the mechanism presented here is as follows. In a 2-exclusion algorithm, if two or more processes attempt to enter their critical sections, then eventually at least two processes, out of n that t~, enter the critical section. In contrast, in the “doorway” construct, if two or more processes try to enter then at most two and at least one enter the critical section. (In this application, the critical section is the seven object construction from the previous subsection.)
Computing
with Faulty
Shared
Objects
1247
close
(reliable
r I I I I I I I I I I I
----
atomic bit)
..-.
----
•1
----
1
----—
—--
----
1 I 1 I
Doorway
F1 At least one process and at most processes
win
I 1
lose
I 1 1 I I
two
F2
win.
1
----------
L -r -I I 1 I I I I I I I I Iv I 1 I I
------
-----
I
win ‘ose lose @----- ----- ----------
J 1 1 I
----
A
1
FIG. 3.
Schematic of single-use set for n processes.
test &
win
lose P
1“
u.
I
1 I
1
win
lose
win
8 1 I 1
I t
~o
(1)
(1)1
w
n
I win I I Two I L ------
i t 1 I
lose
processes
single-use
------
lose
I I I I
test-and-set.
------
------
--
J
If there are no faults, then two primitive test& set registers can be used to ensure that at most two processes pass the doorway, by having each p-test& set the two registers, and pass if successful in at least one. However, if one of the two registers is faulty, then at least one process and as many as n may enter (because the faulty one may let everybody require the processes to go through two
win). To overcome this problem, we such gateways. That is, we maintain
two pairs, FI and Fz, of primitive test & set registers (see function doorway in Figure 4). To pass the doorway, each process must successfully p-test&set one register in each of the two pairs, otherwise the process returns 1 for its test & set operation. (In the OR operations in the function doorway (Lines 1 and 2) only one successful clause needs to be executed. That is, if the first clause is true, then the second need not be executed—however, the construction is correct with either interpretation.) The doorway implementation is given in Figure 4.1° The doorway guarantees that at most two processes enter the two process construction; however, it could be that one process is blocked in the doorway, while a faulty primitive allows a later arriving process to pass it into the two process construction (left 10In the code,
the execution
of a return
command
terminates
(exits) the operation.
Y. AFEK ET AL.
1248 shared
F’l [1 ..2], F2 [1, ,2]: T&S close:
function
reliable
registers,
atomic
initially
O
on {O, 1}, initially
O
s-testtset
1:
if close = 1 then
2:
close
3:
if doorrray
return(1)
1
:=
%return
O (win)
%exit
if construct
fl %close
then
return
(singlemse-2_process-t
esttset)
the construct
else
or 1 (lose) is closed
and continue
return(1)
fi %from
Figure
2
end-function
function
dooruay
%return
1:
if (p-testkset(F1[l])
2:
thenif
(p-testtset(F2[l]) then
3:
= O) OR (p-test&
return
return
= O) OR
(true)
set(F1[2])
= O) at least one k F1 and one in F2: pass
%won
(false)
or false (lost)
%see note o below.
(p-tasttset(F2[2])
fi fi
true (passed)
= O)
%won
at neither
F1 nor at F2: do not pass
endlunction “In the OR operations only one su~~~~~ful clause n~eds to be executed. That is, if the first clause is true, then the second need not be ~X~Cuted—hoW~ver, the construction is correct with either interpretation.)
—.— .. FI.% 4. Single-use n-process test& set using eleven test& a reliab~~, atomic read/write bit (see also Figure 2).
set’s, one of which may be faulty,
and
as an exercise). This enables a scenario in which one process loses, and only after it exits, the eventual winner starts. This violates the linearizability requirement (operations must be serialized within the time interval in which they are active), and can be resolved add a reliable multi-writer/multi-reader
in several ways. The simplest solution is to atomic bit, close as is suggested in
Afek et al. [1992a] (which can be constructed from unreliable primitive objects as described in Section 4). Any process first reads the close bit, and if close = 1
it returns doorway. start
1 and This
exits;
ensures
otherwise that
it
once
sets
close
a process
to
1 and
continues
has lost, no other
into
process
the
can later
and win.
THEOREM 5.2.1. There is a strongly wait-free, n-process, single-use, test&set construction from 11 test &set registers, one of which may be ~-faul@, and a single reliable read/write bit (Figures 2, 3, and 4). PROOF. the
It is argued
doorway
exactly We
one serialize
and
enter
of these the
above the
that
at least
single-use,
will
be the
winner
with
one
and
two-process
at most test
two
& set.
processes
leave
By Theorem
5.1.1,
winner. it’s
high-level
request,
and
the
losers
with
their
read of close = O in line 1 of s-test & set must precede either a loser’s read of close = 1 in line 1 of s-test & set, or (if the loser also read close = O in line 1 of s-test& set), the loser’s write of close := 1 in line 2 returns.
The
winner’s
of s-test & set. Regardless, the winner is serialized before all the losers. As in the previous construction, this one is obviously strongly wait-free,
as
each of the n processes performs at most two operations on the primitive read/write register and 11 operations on the primitive test & set registers before terminating, and these are assumed to be wait-free. ❑ 5.3. the
MULTI-USE
previous
n PROCESS TEST&
construction,
resulting
in
SET.
This
a construction
section
extends
of a multi-use,
the
ideas
of
n process
Computing
with Faulty
test & set theorem.
object.
A
in Section
THEOREM
Shared
5.3.1.
series
Objects of
1249
lemmas
lead
to
the
proof
of
the
following
multi-use
test&set
5.3.4. There
is a strongly
wait--ee
n-process
construction from n + 10 test&set registers one of which 3n + 13 reliable atomic registers (Figures 5 and 6).
may be ~-faulty
and
5,3.1 Overview of the Construction. The single-use test & set provides a simple lock: whichever process wins the test & set (reads O while setting to 1) holds the lock. However, without the reset operation the lock cannot be a multiple-user test & set, released, The construction in Figure 5 implements that is, one that can be test& set describes the shared data structure. A simple
way to add a reset
test & set object
would
and
operation
be to simply
reset
reset,
and
the
to a high-level the
picture
in
Figure
implementation
constituent
primitive
6
of a registers.
Unfortunately, in the constructions given above, resetting the primitive registers would interfere with the high-level test & set operations of other processes and could The object tion, if times, two
cause incorrect
behavior
of the high-level
test & set object.
first observation towards an implementation of a multi-use test& set is that in the n process single-use construction of the previous subseca process tries to perform a high-level test& set an arbitrary number of it is guaranteed that it will lose in all its repeated attempts (unlike the
process
single-use
construction
where
a process
may
try
only
one
time).
Given this observation, a simple solution that uses an unbounded number of primitive test & set registers and read/write bits would work as follows: divide the collection of test& set registers into bundles of 11 test& set registers and one bit, and enumerate the bundles 1,2, . . . . Use one atomic read/write register as a pointer to the bundle that is currently in use. Once the high-level test
& set
that
have
is set,
started
it can
the
be
reset
high-level
by simply incrementing test & set operation
lose, since by the above observation winner was already declared.
they are blocked
the pointer. Processes before the reset would in the old copies,
This unbounded solution can be modified to use 1 in registers (one of which might be faulty) and O(n) reliable the following
observation:
All
the bundles,
except
n of them,
where
primitive test& atomic registers, could
a set by
be recycled,
since no process will ever access them again. This solution will require that each process declare which copy it is using and that the resetter find a copy not in use. A still more efficient solution carries the recycling idea one more step. Instead of duplicating and recycling the bundles, we reeycle the primitive test & set registers (see Figure 5). This is possible by requiring each process to declare which primitive test & set register it is using at each step that it is taking (Line 9 of function test& set). The resetter will now search for 11 primitiue registers that are not in use (Lines 7–10 of function reset) and will compose them into a new 1l-piece n-process single-use object. Thus, rather than giving a pointer, it will have now to provide 11 pointers (current[l . . 11] in the code) for the newly composed 1l-piece object. (The read\write bits are replicated and reused similarly.) In this way, a solution is obtained that uses a total of n + 10 primitive test& set registers (one of which might be faulty) and O(n) reliable atomic registers.
1250
Y. AFEK
Protocol
p:
for process
PTST:
shared
array
[l..(n
p-test&set stop:
atomic
tks
with
operations
on {O, 1}
reliable
my.nezt[l..n]: restart[l ..n]: current
+ 10)] of primitive
and p-reset
reliable
c~ose[l ..n]:
local
ET AL.
atomic
%written
on {O, 1}, initially
O
by winner,
Yowritten
%written
reliable atomic on {1,. . . ,rt + 10} reliable atomic on {O, 1}
[O..ll]:
siate[l..ll]:
reliable local
atomic
initially
O
by owner %initially
%initially
on {1, . . . . n + 10}
on {1,0,1},
initially
and read by everybody 1
1,2,...,11
1
inasi[l.. (ri+ 10)]: local on {0,1} function
test&set
1:
restart~]
2! 3:
mv-nezf[p] if stop =
4:
if restart~]
5:
if
7cstart
:= O
1
= 1 then
c/ose[my-nezt~]]
6:
c/ose[my_rzezt~]]
7:
for
8:
while
i = 1 to
(state)
~Y-~=tM
10:
if stop
11:
if restart~]
12:
return(1)
%and
fi
:=
%do
return(1)
state[next(state)]
stop
bit,
restart
bit,
simulation’s
close bit
and continue
L od
= L do
= 1 then
global
the construction
won
bit is used
%personal
7,indicates
:= currcnt[next(~t~te)l
= 1 then
(close)
%check
%close
state[tl
no one already
which
fi
:= 1
9:
that
fi
return(1) = 1 then
11 do
result
by ensuring %indicates
:= CUrmrLf[O] then return(1)
fl
return(1)
:= p-test*
s-t.st~set
which
tks
%check
that
step by step
register
is used next
no one already
won
fi iet(PTS~my.nezt
~]])
od 13:
return(resnlt(state))
end-function function
reset %prevent
1:
stop
2:
for
i = 1 to
n + 10 do
3:
for
i = I to
n do if i #
4:
j:=l
5:
while
6:
curreni[O] j:=l
7: 8,
for
;= 1
inuse~]
inme[i]
p
then
anyone %mark
:= O od irsuse[my-nezt[i]]
:=
%might
1 fi od
% finds
tolldo while ‘inuse~]
10:
which
be corrupted
%is register
= 1 do j := j + 1 od
winning
j used?
:= j
i =1
9:
from registers
currenf[i]
11 PTSRS
= 1 do j := j + 1 od
:= j
od 11:
c/ose[current[O]]
12:
for
i = 1 to
14:
for
i = 1 to
15:
stop := o
p-reset
13:
%re-open
:= O 11 do p-test&
set(PTST(current[i]
(PZ’STlcurreni[i]]) n do
restart
the construct
]); ?freset
od
[i] := 1 od
the tk
% make everyone
registers start
again
end_function function
result:
is incomplete. function
next:
the primitive
returns Otherwise examines register
L if the values it returns the
which
in state
indicate
that
the value
that
is returned
in staie
and
returns
values
the
the
simulation
index
should be accessed next in the simulation
FIG. 5.
Multi-use
test & set.
of s-testkset
by the simulation. (between
of s-test
1 and kset.
11) of
Computing -.. ------1
1251
with Faulty Shared Objects ----------% ,-,,1-r@”t . . . . . .. . 1
I
I
I
In o
I I
my_nexl
PTST
J
--!-”-------1
Clcse
. . . . . . . . . . . . . . . . . . . ;. . . . . ...{
I
t
I
I
\
.......................
:
-----------------z
Pl\
Ya---
slop n
H+
raw
1~
/l--d
H
FIG.6.
Shared data structure
forthemultiple-use
test &set
construction.
To ensure that, while the resetter is composing a new 1 l-piece copy, no process is switching between registers, the resetter will set a multiwrite/multi-reader stop bit telling all other processes to lose and exit (Line 1 of
reset).
Every
process
tests
the
stop
bit
before
each
primitive
operation
(Lines 3 and 10 of test &set). If the bit is set, the process aborts its high-level test & set operation and returns with 1. The stop bit by itself is not sufficient because a process performing a high-level test& set could be suspended during could
the reset period, later
and not observe
access a primitive
the stop bit being
test & set register
that
set. Such a process
it should
not.
Therefore,
another bit, called restart[ p], is associated with each process p to detect if a reset was taking place while it was not checking the stop bit: The restart bit works as follows; when starting a high-level test & set operation each process assigns O to its restart bit (Line 1 of test & set), and before finishing a high-level reset a resetter assigns 1 to the restart bits of all processes (Line 14 of reset). Now, before performing each primitive operation a process checks whether its restart bit is still O (Lines 4 and 11 of test & set). If not, it aborts its high-level test & set operation bit and the restart bits ensures
and returns with that the primitive
1. The combination of the stop objects chosen by the resetter
are not in use, and that they will not be accessed until they are themselves reset and the resetter assigns O to the stop bit. The high-level test & set operation is now performed on the chosen 12 primitives by simulating the s-test& set operation (from Figure 4). The simulation proceeds in iterations, such that in each iteration one primitive test& set
1252
Y. AFEK ET AL.
register is p-test & set (see Figure 5), A local state variable, state, records the history of the s-test& set simulation. This variable contains eleven fields, each corresponding to one of the eleven primitive test & set registers in s-test& set, as in Figure 4. The values of the fields are L , 0 or 1, and are initialized to L , indicating that no calls to the primitive registers have been made. Each call to a primitive test & set register is made to simulate a call in s-test& return value is assigned to the corresponding entry in state. Two
functions
values
in
returns returned
are defined
state
indicate
that
on
state. The
the
simulation
O (or 1) if the values O (or 1). The function
function
result
of s-test&
set, and the
returns
set
L
if the
is incomplete.
It
indicate that the simulation is complete and next examines the values in state and returns
the index (between 1 and 11) of the primitive register that should be called next in the simulation of s-test & set, given the results currently recorded. 5.3.2 Well-Formed Executions. The use of the primitive test & set registers in the construction is designed to work correctly only in constrained environments, in which a p-reset operation is invoked only if a p-test & set has returned
successfully
since
the
last
invocation
of p-reset.
The
constructions’
calls to these primitives depend, in turn, on their behaving properly. This complicates the proof, which must avoid circular reasoning. Accordingly, a system execution is well-formed for a multi-use test& set register (either a faulty primitive or high-level reliable construction) if the number of calls to the reset operation is either equal to or one less than the number of successful returns for calls to the test & set operation, in every prefix of the execution. That is, the sequence of successful returns from test & set and of calls to reset operations strictly alternate, beginning with a successful
return
from
a test & set
operation.
Hence,
in well-formed
execu-
tions, the reset and test & set operations can all be serialized so that successful test & set’s alternate with reset’s, and every unsuccessful test& set is ordered after
a successful
test&
set and before
violate well-formedness for can be serialized arbitrarily. regard for semantics Hence, regardless
the
next
a test & set register, (That is, anywhere
of the operations.) of whether an execution
reset.
In environments
that
calls may not terminate and within their interval, without
is well-formed,
we have defined
a
serialization and can consider calls to primitive test & set registers as single atomic events. Given an execution of the high-level reliable multi-use construction, we reason about this serialization, and argue that the execution is well-formed for each primitive. Then, we conclude that the atomic events have the appropriate semantics (are ordered as described in the preceding paragraph). The
fact
object is Given a will want primitive. of these
that
well-formedness
is maintained
for
each
primitive
test & set
proven by induction on an execution a of the high-level construction. that is well-formed for each primitive, we prefix, a’, of the execution to reason that an extension of the prefix is also well-formed for each To do so, we need to use the semantics of (well-formed) executions primitives. The semantics are in turn encoded in the serialization of
each primitive.
Hence,
we need to relate
serializations
of a and of a’. That
is,
we want to reason, not about the execution, a, but a serialization, ~, of the execution. In the induction step, we have a well-formed prefix a‘ of the execution. Any serialization ~‘ of a’ obeys the semantics of the test & set
Computing
with Faulty
primitives.
To use this
serialization assures
fact
to reason
~ so that
it has
Let
5.3.2,1.
r. Let
1253 about
/3 and
f3’ as a prefix.
of a‘
PROOF.
a
be an execution
a‘ be the longest prejik
a serialization
~
Objects
The
If
a
which
a, we choose technical
of a serialization
then
of a and
of a system
the
lemma
a =
a’
in a’, but serialized
and
we
/3’ be a serialization
By restricting consider
reliable
the
multi-use
until
after
class of serializations
calls
Iinearizability
early,
to
primitive
construction,
is a local
[Herlihy
set
done.
Otherwise,
let
we
subsequence of events in ~ of some operations that are ❑
/3’.)
objects,
as consisting
property
are
of a’. The serialization
as described
test & set
a test&
for r. Then there is
of a.
want is ~‘( ~ – ~ ‘). (That is, append to /3’ the that are not in ~‘. This may move the serialization pending
containing
of a that is well- foimed
is a prefix
is well-formed,
be a serialization
can
hence following
us this is possible:
LEMMA
register,
Shared
of single and
in Lemma
within
we
of
events.
(Because
atomic
Wing
5.3.2.1,
executions 1987],
it
the
suffices
to
consider each primitive separately in Lemma 5.3.2,1.) Moreover, in any wellformed prefix of the execution, these atomic events obey the test& set semantics. 5.3.3 Propetiies of the Constmction. Each call to the reset function begins with an assignment, “stop T= 1” (reset Line 1), and shortly before terminating, assigns 1’s to the restart bits (reset Line 14). Definition function
5.3.3.1.
within
For
a process
a, we say that
subinterval of a that begins with Line 1), and (stop := 1) (reset (reset We
Line 14). say that calls
to
reset
p,
execution
the reset
function
the initial ends with
are
sequential
a,
and
call
to
the
call is actiue for p within
reset the
assignment of the reset function, the assignment (restati[p] = 1) in
an execution
a
if each
call
executes its final assignment to stop (reset Line 15) before the next call begins. For now we state lemmas for executions which are assumed to satisfy this property. Note, in particular, that when calls to reset are sequential, stop = 1 is invariant during reset function calls, up to the final atomic assignment stop := O (reset LEMMA
Line
5.3.3.2.
15). Let
a be an execution
such that the two steps ( restart[p]
in which
:= O) (test&
calls to reset
set
Line
are sequential,
1),and (if
stop = 1)
(test &set Line 3 or 10), area subsequence of a from a single call to test & set by process p. If a call to the reset fimction is active forp at any point between the ( restart[p] := O) step and the (if stop = 1) step, then either stop = 1 in the if test (test & set Line 3 or 10), or restart[p] = 1 in the next step of p which is (if restart[p] = 1) (test&set Line 4 or 11). PROOF.
(restati[p] is still true
Since
a
call
to
the
reset
function
is
active
for
p
between
the
:= O) step and the (if stop = 1) step, at that point stop = 1. If this during the (if stop = 1) step, we are done. So assume that stop = O
during the (if stop = 1) step. It follows that some assignment to stop occurs after the point in which the reset function is active for p, and before the (if stop = 1) step. Since only calls to reset assign to stop, as their last step (reset Line 14), and reset’s are sequential, the reset which is active for p must assign stop := O between the ( restart[ p] := O) step of test& set and if (if stop = 1)
1254
Y. AFEK ET AL.
step. This in turn implies the assignment ( resturt[ p ] := 1) occurs between the ( restart[p] := O) step and the (if stop = 1) step. Only p assigns O to restart[p] = 1) (test & set Line 1); it follows that restart[p] = 1 during the (if restart[p] ❑
step.
LEMMA
Let
5.3.3.3.
a be an execution
in which
calls to reset
are sequential.
(1)
set(P7’ST[indl)), where 1 s irzd s n + 10, Let ( restati[p] ‘= O), (p-test& be a subsequence of a ji-om a single call to test &set, by process p (Function occurs after test & set Lines 1 and 12). No call to p-reset( PTST[ind])
(2)
Let ( restart[p] ‘= O), ( read(close[indl) = 1) be a subsequence of a from a single call to test& set, by process p (Function test C%set Lines 1 and 5). No assignment of O to close[ind] (Function reset Line 11) occurs ajler
(3)
Let ( restart[p] ‘= O), (read close[ind] = O), ( close[indl ‘= 1) be a subsequence of a from a single call to test& set, by process p (Function test & set Lines 1, 5, and 6). No assignment of O to close[ind] occurs ajler
(restart[p]
:= O) and before
( restart[p]
:= O) and before
( restart[p] PROOF.
:= O) and before
We
prove
test & set function restart[ p]
the
first
(p-test&
set(P7’ST[ind])).
(read(close[ind])
(close[ind] case—the
by p includes
= 1).
:= 1). others
the following
are
:= O
my_next [ p ] := ind ( = current [next(
similar.
sequence
state) ])
The
(test & set Line
1)
(test&
9)
set Line
(test & set Line
10)
restart[ p ] = O
(test & set Line
11)
p-test
(test & set Line
12)
(restart[p]
& set(PTST[ind]) lemma,
no call to reset
:= O) and (stop
is active
= O) and before follows.
LEMMA
5.3.3.4.
the
for p at any point
between
= O).
Consider a call to reset that is active after (stop = O) and (reset Line 13) before the end of p-test& set(PTST[ind]). reads my_next[ p ] = ind (reset Line 3) between (stop = set( PTST[ ind])). Hence, it will mark this register in use, p-reset on P7’ST[ind]. It follows that no reset that runs p-reset(PZ’ST[ind]) is lemma
to
stop = o
By the previous
(stop
call
of steps:
the end of p-test
& set(P7’ST[ind]),
that calls p-reset This call to reset O) and (p-test& and will not run active
for
The first
p after
part
of the
❑ Let
a be an execution
In a, no primitive test& set register, other call to register PTST[ind].
in which
PTST[ind],
calls to reset
is p-reset
are sequential.
concurrently
with any
PROOF. Since calls to reset are sequential, any two p-reset operations are sequential. The previous lemma implies that calls to p-test& set(PTST[ind]) cannot be concurrent with calls to p-reset(PTST[ind]). ❑ LEMMA
Then for PTST[ind].
5.3.3.5. Let a be an execution in which calls to reset are sequential. every primitive test & set register, PTST[ind], a is well- foimed for
Computing
with Faulty
PROOF, That
A simple
is, let
induction
p-test
Objects
induction
and the semantics other
within
than
a call
of a, where request
p-reset(
& set(PTST[ind])
for
ITST[
operation
of a, using
Lemma
a’ is well-formed
of the register
the
to
1255
on the prefixes
a’m be a prefix
is any event occurs
Shared
P7’ST[ ind], a reset
irui]),
(Line
of
(Line
12).
By
a ‘n is well-formed
if T
PTST[
ind].
This
13) which
By
5.3.3.4.
for IL’W[ind].
Lemma
request
has just 5.3.3.4,
run
all
a
other
operations on PTST[ind] in a’ terminated before that p-test& set(PTST[ind]) began. Hence, the p-test & set(P7’ST[ind]) returned successfully if the number of other successful p-test & set( PTST[ind]) operations in a‘ is equal to the number of calls to p-reset(PTST[ind]), and returned unsuccessfully if the number of other successful p-test& set(PTST[ irzd]) operations in a’ is one more than the number of calls to p-reset( PTST[ ind]). Hence, the new request does not violate Definition p in
5.3.3.6.
a, the
executes
well-formedness.
reset
o
For any execution preceding
that
of the call to test& set (that exists, we say the call to test&
(1)
5.3.3.7.
Let
a and any call to test& to test&
:= 1) (Line
the step ( re,start[p]
LEMMA
call
set is the
14) most recently
in which
to reset
before
is, Line 1 ( restati[ p] := O)). If set has no preceding reset.
a be an execution
set by process
call
calls to reset
no such
to test&
set, (or was in the initial
no
later
reset
assigns
ind
state of current, to current
until
set by the call
if there is no such after
step reset
are sequential.
Suppose p-test & set(PTST[indl) is called during a call to test& process pin a. Then ind was written to current by the reset preceding and
which
the first
the call
to
reset)
p-test&
set(PTST[ind]). (2)
Suppose close[indl is read during a call to test&set ind was written to current[O] by the reset preceding was in the initial state of current[O], reset assigns ind to current[O] until
by process p in a. Then the call to test&set, (or
if there is no such reset) and no later after the read and subsequent wrRe of
close[ ind]. PROOF. We prove the first test & set includes the following restart[p]
second is analogous. of steps:
:= O
rny-next[ stop =
case—the sequence
p ] := ind ( = current [next(
state)])
o
The
(test & set Line
1)
(test&
9)
set Line
(test &set
Line
11)
p-test
(test &set
Line
12)
& set( PTST[ind])
5.3.3.2, no reset is active for p at any point between the steps := O) and (stop = O). As in the previous lemma, any reset which is
active for p after stop = O either to current, or reads my_next[p] so will
not assign anything
LEMMA
Let P and read/write)
to
(test &set Line 10)
restart[ p ] = O
By Lemma ( restart[p]
call
sees my _next[ p ] = ind, and will not write iruf after the call to p-test& set(PTST[ind]), and
to current
until
after
this call.
❑
Let a be an execution in which calls to reset are sequential. Q be the corresponding sets of primitive registers (test&set and accessed in two different calls to test& set. Either the two calls to
5.3.3.8.
Y. AFEK ET AL.
1256
test & set have the same preceding reset, or every primitive register (test&set or read/write) in P n Q is either p-reset or assigned O between the jirst call and the second. PROOF.
x E P n Q
Suppose
different
preceding
and operations.
reset
that the two We assume
test & set register,
x = PTST[
analogous.
the two calls to p-test
Denote
occurring
before
operation in a:
of process
li) 2i)
@ in
ind],—the
a, where j. We
proof
the
following
ind])
two
stop := 1 my_next[i]
i, and
of operations
(reset
p-reset(PTST[indl)
4j) 5j)
restart[i] (p-test&
f= 1 set(PTS7’[ind]))
the call to test& set that includes operation that precedes the call
n. Suppose
it came
Line
1)
n, and steps 1,–4, to test & set that
contains @ (there must be such by the assumption). To establish the lemma, it suffices to show that line 3j, (p-reset(PTS7’[ after
11) 12)
(reset Line 3) (reset Line 13) (reset Line 14) (test & set Line 12)
= @
Here, li–5i are steps from are steps from the reset
comes
n
@ is an
(test &set Line (test & set Line
= m
+ ind
3j)
is
(test &set Line 1) (test & set Line 9) (test & set Line 10)
= O set(PTST[ind]))
lj)
primitives
sequences
o
2j)
have is a
by n and ~, with
of process
restart[i] := O rny_next[i] := ind
3i) stop = 4i) restart[i] .5i) (p-test&
to test& set the primitive
for the read/write
& set(PTS7’[
m is an operation
have
calls that
before
n (5i).
Since
line
ind])),
1~, ( restart[i
] := O),
precedes the assignment to restart[i] in line 4j, and the reads in lines 3i and 4i return O, Lemma 5.3.3.2 implies that line lj is ordered after line 3i. Since in turn line 3j is ordered before return ind, a contradiction. Definition
5.3.3.9.
Given
line ❑
5i, the read
an execution
of my_next[i]
in line
a of the construction,
any set of calls
to test& set which have the same preceding call to reset (or which such preceding call) are called a collection of calls to test & set. LEMMA
Let
5.3.3.10.
the following
a be an execution
2j should
of the construction.
all have no
Then
a satisfies
propetiies:
(1)
a is well$ormed for the reliable high-level multi-use test&set object. Moreover, each reset operation jinishes its jinal assignment to the stop bit before the next successyld test & set operation returns.
(2) (3)
calls to reset are sequential, the reads and writes of the close bits and calls to p-test & set made by each collection of calls to test& set are a prejik of a run of the w-test & set construction. PROOF. The
Consider lemma, and
Note
proof
third
first
that
is by induction a finite
and
on
execution
m is the
invariant
the
next are
step
first
invariant
the
length
CZ’T,
where
by the
maintained
implies of the a’
construction. in
CY’V.
the
second.
execution,
satisfies We
with the
must
a trivial
conditions show
that
basis. of the
the first
with Faulty
—Suppose
the next step is a call to reset.
the reliable required
Shared
1257
Computing
high-level
to preserve
Objects
multi-use
By induction,
test & set object.
this property.
The
first
a’
is well-formed
Thus,
invariant
for
the environment
is
follows,
and the third
The
invariant
is immediate. —Suppose
the next
immediate. implies
step is a step of a primitive
By induction,
that
calls to reset
a ‘T is well-formed
register.
are sequential
for the primitive,
and Lemma
the sequence of steps of the primitive in a’m obey Lemmas 5.3.3.7 and 5.3.3.8 imply the second invariant. —Suppose process
assignment
to
each the
reset.)
stop,
since
By induction, collection
call. Hence, from
test&
is
5.3.3.5
5.3.2.1 implies set semantics.
the next step is the successful return, T, of a call to test&set by p. We must show there is a reset call which runs through its final
test &set. ant,
first
in CY’T. Lemma
last
of calls
such
collection.
(That
successful of su-test
to test&
the two calls to test&
same
Let
the
the properties
return,
set can have
set, which is, they
at most
return
have
rm and r+ be the corresponding
~, of a call
to
& set and the last invariwith
different
calls to reset
one
successful
~ and ~, are not preceding preceding
calls
to
T and @
(That is, rn and r+ consist of the subsequences of events of a for the two calls.) Since calls to reset are sequential, @ precedes w, and well-formedness holds for calls to reset, it follows that the beginning of rfi follows @ Since m- is the successful return of a call to test& set, the third invariant implies this call simulated a successful call to su-test & set, and hence must read stop = O at some point. By Lemma 5.3.3.2, rm is not active for p during some interval
beginning
that
assignment
the final
5.3.4.
Proof
construction. complete.
with
of Theorem Without
(Recall
the first
step of the call to test&
to stop in rn must 5.3.1.
loss
of
Let
a
generality,
the construction
precede
be an arbitrary, assume
is strongly
n.
all
wait-free,
set. It follows
❑ finite
run
operations
of
of the a
are
if calls to primitives
strongly wait-free. By Lemmas 5.3.3.10 and 5.3.3.5, a is well-formed primitive, and so calls to primitives are strongly wait-free.)
for
are each
We must show that each test& set and reset operation can be appropriately serialized. We proceed as follows. Each successful test & set operation is serialized
at the point
obviously
within
ment
stop
to
test &set
of its assignment
its interval. (Line
operation
15), that
Each or
just
returns
to re.start(test
reset
is serialized
before next,
the
& set Line either
serialization
whichever
is earlier.
1), which
at its final of
the
Lemmas
is
assign-
successful 5.3.3.10
and 5.3.3.2 imply this is within the interval of the reset, indeed, no earlier than its first assignment to the array restart (Line 14 of the reset function). By Lemma 5.3.3.10, the sequence of serializations is of the appropriate form: “successful test & set, reset, successful test& set, reset, . . . .” It remains to show that given this partial serialization, the unsuccessful test & set operations can all be serialized, within their interval, after a successful test & set and before the next reset. We consider the unsuccessful calls to test&
set depending
upon
how
they
return. It suffices to show that these calls cannot begin and end between the serialization of a reset and before the serialization of the next successful test & set. Suppose an unsuccessful call returns because of losing the test on stop (Line 3 or 10 of test& set). We argue that the call can be serialized at the point of
1258
Y. AFEK ET AL.
this read of stop = 1. A call to reset is in progress when that call to reset is serialized at the point where it assigns
this read occurs. If stop := O (Line 15),
then the read of stop = 1 falls between the previous successful return to test & set and the serialization of the reset. Alternatively, suppose the call to reset was serialized just before the serialization of the next successful call to test & set. comes
Then
either
have begun,
these
before
two
or after
and the read
serialization both.
points
By Lemma
are
adjacent,
5.3.3.10,
of stop = 1 is a correct
and
the next
serialization
the
reset
read cannot
point.
Suppose an unsuccessful call returns because of losing the test on resku-t[p] (Line 4 or 11 of test& set). Then a call to reset assigns to restmt[p] (Line 14) after the test & set begins and before the read of restart. is serialized during the unsuccessful call, we can serialize
If the reset operation the latter just before
the reset. If it is serialized after the unsuccessful call, then since the reset does not begin until after the previous successful call to test & set, there is an interval during the unsuccessful call, between the previous successful call to test & set and the reset. Last, suppose the reset is serialized before the unsuccessful successful
call begins.
Then
it has been serialized
test & set was also serialized
before
early because
the unsuccessful
its assignment to restart). The return reset’s final assignment to stop (Line
of that successful test& 15), which in turn must
unsuccessful
call
the
Hence,
is a period
before
there
the next
to
test & set, after
and
next
the serialization
reset
must
the following call begins
begin
of the successful
(at
set follows come after even
test&
the the later.
set and
reset.
Finally, consider the unsuccessful calls to test& set which lost are unsuccessful in the simulated calls to su-test & set. By Lemma the semantics of the single-use test & set, this can happen only successful simulated call to su-test & set, by a call to test & set collection. Moreover, this simulated call to su-test & set, (by the
because they 5.3.3.10 and if there is a in the same serialization
of su-test
ends,
& set),
can be serialized call. That
begins
before
before
the serialization
successful
simulated
the unsuccessful point
call to su-test
simulated
call
of the simulated
& set is within
since
it
unsuccessful
a successful
call to
test & set which has been serialized even earlier, when it assigns to restart, before its simulation started. It remains to argue that no reset is serialized after the serialization point of the successful test & set and before the beginning of the unsuccessful call to test & set. Because the unsuccessful test& set progressed to run at least one step of the simulation, Lemma 5.3.3.2 implies no reset is active for this process during an interval of its operation. Hence, any reset unsuccessful call must be the call to reset which precedes
serialized before the unsuccessful
the call,
or is even earlier. But these calls to reset are also serialized before the successful call to test & set. Note that each high-level test & set operation requires at most a constant number of operations on the primitive shared objects, and each high-level reset operation requires at most O(n) operations on the primitive shared objects. ❑
6. RMW
Registers and Consensus
To complete our investigation of read-modify-write register. Herlihy
memory fault-tolerance we turn to the read-modify[1991] showed that reliable
Computing
with Faulty
TABLE I. Number
Shared
and type
# RMW regs. used
f ~-faulty
(f+
f ~-faulty
f+l 2f+l 2m+l
f ~-faulty m-faults
(RMW)
used to solve Briefly,
o
shared
consensus,
registers
enable
based on the value read, to write is accessed register, decision
by n processes,
decides The
requirements
value u such that
objects
is itself
a process
registers,
an input
value,
if it writes
output
of the
consensus
problem
and tolerate
read
Alternatively,
process
constructions
object can be
a register
and,
a consensus
are that
eventually
object
c { – 1,+ 1}.A
inputP
to its write-once
output
there
decides
exists
a
on v and
of fault-tolerant
use a variety
different
shared
registers
universal.
output
These
RMW
to atomically
a new value.
each non-faulty
any other
because
value of some process. several different constructions
are presented.
and atomic these
which
Afek et al. [1992b] 36.1 $6.1 $6.2
primitive:
follows
each with
on a value
that u is the input In this section,
memory
This
Algorithm
4 2 4+210gf 4
o o
RMW,
n-process
# bits in RMW reg.’s
4f2i-6f+2
using
RMW
# atomic bits used
1)2
is a universal
can be simulated
process
1259
I_JSEDTO REACH CONSENSUSIN THE PRESENCE OF FAULTS NUMBEROFREGISTERS
of faults
write
Objects
consensus
of read-modify-write
types of faults.
Table
I summarizes
constructions.
6.1.
UNBOUNDED
struction
FAILURES
(Theorem
registers
and
co-faulty,
We
6,1.1)
a quadratic then
and
the
support
ternary
these
~ +
1 RMW
registers
2f + 1 registers THEOREM
struction atomic
registers are
(RMW
atomic
2~ +
start
by presenting
from
bits,
fj. 1.5, how
into
~ +
such
that
to encode
1, O(log
~)-bit
~ of all
of the
wide,
RMW
lower
bounds,
(Theorem
always
necessary,
and
(Theorem
6.1.5)
any f >0,
bits, at most f of which The construction
atomic
registers. 6.1.4)
that
that
a total
wait-free
consensus
con-
in Figure
7. This figure
uses the notation
and end of atomic,
exclusive
access
to shared RMW register r. That is, it is assumed that a process can lock one register at a time, that a process does not fail between pairs of lock(r)
The
construction
uses
of
+ 1)(f + 1) = 4f 2 + 6f + 2
and 2(2f
the beginning
unlock statements, and that tion eventually executes it,
be bits
are ~-faul~. is shown
to mark
may
are necessa~.
there is a strongly
RMWregisters
RMW
them
two
read\write)
a con-
1 ternary
with
andior
For
We
object
bound
+ 1 terna~
and unlock
of
in Theorem
upper
6.1.1.
usingf
PROOF. lock(r)
RMW
a consensus
number
show
We
PER REGISTER.
of
any non-faulty
two
ideas,
The
process first
that
is that
reaches
only and
a lock instruc-
consensus
is trivially
achieved with one non-faulty, ternary RMW register. The register is initially set to O, the first process to do a RMW sets the register to its input value, and all processes agree on the value written. The second idea is to use validation bits to ensure that an invalid value cannot be chosen. Before proposing a value in a RMW register, a process establishes it as valid by writing to many validity bits. Before adopting a value from a RMW register a process confirms its validity by
1260
Y. AFEK ET AL,
Protocol
for
p,
process
shared
r: array[l
..}+
shared
v: array
[l..j+l][l..2j
input~
G {-1,
I] of ternary
1:
z :=
RMW
+1][–1,
decidep ,z: ranges over {-1,0,
local
+1}: %initialfy
registers
I] of atomic
+1},
initiafly
2:
fori=ltoj+ldo for
4:
lock(r(i))
%Set
j = I to 2j+I
5:
if r(i)
6:
tmp
7:
do
= O then
u(i, j, z) := 1 od
r(i)
= O
each bit = O
o
inputp
3:
each register
% initially
bi~
%Establ~h
validity
of cu~~t
value
%Vote
:= z fi
initial
value
of z
at this stage
for x at this stage
:= r(i)
unlock
8:
count
9:
for
%Adopt
:= O
j = 1 to 2f+l do if u(i, j,tmp) if counf > \+l then z := tmp fi
10:
= 1 then
~~~nt
:=
this stage’s co~nt
vote
if it is valid
+ 1 H od
ll:od 12: dec:dep
:= z
FIG. 7. Consensus in the presence of ~, ~-faulty registers. registers and 4~ 2 + 6~ + 2 atomic read/write bits.
reading
the validity
process
to choose
The
two ideas
bits. Faults an invalid
in a RMW
The construction
register
therefore
uses ~ + 1 RMW
cannot
convince
a
value.
are combined
as follows:
The
construction
proceeds
in ~ + 1
stages. Each stage is assigned a RMW register and two sets of 2f + 1 atomic bits. One set of bits is called the ualidity bits for value + 1 and the other set the validity bits for value – 1. In each stage the processes validate their current value using the appropriate validity bits, attempt to write their value in a RMW register
reserved
for this stage, and adopt
the value
written
in this register
if it
is valid. More
specifically,
2 f + 1 stage-i bits are denoted the
ith
RMW
the value
validity u(i, register
x is marked
bits for value *, x).) The (Lines
straightforward
generalization
LEMMA
on valid
values, RMW
6.1.2. (VALIDIm).
in stage
attempt
to write
4 to 7). A process
the register was O prior to the RMW. process checking its validity at stage validity bits for value x are set to bounded set of input value, and approp~iate
valid
of this
i by ensuring
x are set to 1. (In Figure a value writes
7, Line
that
all
3, these
at stage i is made its value
to
if and oniy
if
The value x is considered valid by a i if and only if at least f + 1 stage-i 1 (Lines 8 to 10). (Note there is a construction,
by adding additional values.) In tlae protocol
to solve
consensus
validity .- registers
of Figure
7, processes
for for
decide
any each
only
values.
PROOF. A stronger statement is proved: the value of the local variable x is always valid after Line 1 has been executed. For each process the variable x is modified only in Line 1 and possibly ~ + 1 times at Line 10. We inductively prove that no variable x can be changed to an invalid value. Suppose all changes numbered less than i were to valid values
(the assignment in Line 1 for some process serves as base shown that the change numbered i is to a valid value.
cases).
Next
it is
Computing
with Faulty
If change which
Shared
number
i is an execution
is by definition
change loop These
valid
9 read
at least
bits are initially
of Line
O (meaning
these
bits
(at Line
3) when
6.1.3.
LEMMA
1, then holds.
10 then
f + 1 validity
it must
correspond
Every
to their
process
the
the for of x.
and at least one is a
only be nonzero (valid) if But processes only write
values of x were valid and value of x and the new, adopted
(AGREEMENT).
value
hand,
be the case that
for all stages)
nonfaulty bit could at an earlier time.
they
it is to an input
If, on the other
bits set to 1 for the new value
not valid
field of a nonfaulty register. This it were written by some process induction, the earlier corresponds to a valid
of Line
and the induction
is due to an execution
at Line
1261
Objects
that
current
value
of
x. By
thus the nonfaulty value of x is valid. writes
a value
bit ❑
in decideP
writes the same value. PROOF. register process
In
any execution,
there
is at least
which is written is nonfaulty. to attempt a read-modify-write
call it Xg. Furthermore, g. Since the nonfaulty nonfaulty validity bits
one
stage
in which
the
RMW
Call the first such stage g. The first at g will succeed in writing its value,
it will have first established the validity of x~ for stage validity bits are set to 1 but never set to O, the f + 1 at stage g will always show Xg as valid for stage g. All
subsequent processes will thus be able to successfully confirm the validity of X8 for stage g, Since register g is nonfaulty all subsequent processes will also agree that its value is Xg. Thus, all processes will adopt the value X8 at stage g. Using stage g as a base case, we now show that if all processes value Xg at the end of stage i > g, then all processes agree on value end of stage i + 1. Since
all processes
agree
on value
agree on Xg at the
Xg at the end of stage i
they will all set validity bits for the value x~ at stage i + 1. None will set of the value of validity bits for the value –X8 at stage i + 1. Thus, regardless the i + 1st RMW Note
that
register
each process
performs
can switch
its value
concludes
the proof
Complementing
of Theorem Theorem
the number
of RMW
6.1.1.
operations
on the
operations. Hence, the conwith Lemmas 6.1.2 and 6.1.3,
❑
6.1.1, we have registers
❑
to –xg.
4 f 2 + 6f + 2 primitive
RMW registers, and f + 1 primitive is strongly wait-free. This fact, together
read/write struction
that
no process
a simple
used in the proof
lower
bound
of Theorem
that
shows
6.1.1, f + 1,
is optimal. THEOREM
tion
which
6.1.4. tolerates
and any number PROOF.
w-faulty
There does not exist a strongly f ~-faul~
of read/wn”te
wait-free
registers and uses fewer
consensus
than f + 1 RMW
construcregisters
registers.
Since a fault could erase the effect RMW register, this object can be trivially
of every operation in an implemented by an object
that always returns the initial value and uses no shared primitive objects. Thus, f ~-faulty RMW registers and any number of reliable read\write registers are no more powerful for solving consensus than the read/write registers alone. The theorem follows from the impossibility of consensus from read/write registers [Herlihy 1991; Loui and Abu-Amara 19871. ❑ If only
RMW
registers
are used, we have the following
THEOREM 6.1.5. If only primitive consensus, and if as many as f of
tight
bound:
~-faulty RMW registers are used to solve the registers may be faulty, then 2f -t 1
Y. AFEK ET AL.
1262 (logarithmic-size)
IWIW
registers are both necessary and suficient:
CONST(RMW, PROOF. corolla~
The
lower
modifying tion
bound,
of a stronger
lower
the next subsection. The upper bound,
CONST(H,
bound
(for
CONST(M,
the construction
encodes
(~, ~), consensus)
= 2f
(~, ~), consensus)
a weaker
fault
model),
(~, GO),consensus)
used in the proof
the ~ + 1 ternary
IIMW
+ 1.
and the
is
a
6.2.2 in
s 2~ + 1, is proved
of Theorem
registers
> z~,
Lemma
by
6,1.1. The modifica0(~
2, validity
bits into 2~ + 1 RMW registers of size 4 + (2[log f I)-bits. Each of the 2f + 1 RMW registers in the new construction
atomic
is divided
into
three fields: decide, uplus, and vminus. The f + 1 ternary RMW registers are simulated by the decide field of the first f + 1 new RMW registers. This is possible since RMW registers can be read-modify-written in one field without affecting the values in any other field. The encoding of the validity bits relies on the following two observations about
the construction.
was established
rzeuer be invalidated stage separately, which it is valid. repeatedly
First,
a value is valid
in stage k if and only if its validi~
in all stages i, i < k, and second,
record
in stage k. Thus,
instead
once valid in stage k a value will
of recording
the validation
for each
it is enough to record for each value the maximum stage in Of course, since f of the RMW registers could be faulty, we this
information
in all the 2f + 1 RMW
registers.
Each
of
the fields uplus and vminus can encode the values O through f + 1 and is initially O. The setting of bit u(i, j, 1) in the previous construction is now performed by setting the uplus field of register j to be equal to i if it is found to be less than i. The bit u(i, j, 1) is read as 1 if the value of the vplus field of register j is greater than or equal to i and as O otherwise. The bit u(i, j, – 1) is handled In RMW 6.2.
in the same way.
this
construction,
BOUNDED
each
FAILURES
per register, we are able required for consensus: THEOREM
process
now
performs
4f2
+ 7f + 3 primitive
❑
operations.
PER REGISTER.
to fully
characterize
For
a bounded
the number
number
of faults
of RMW
registers
6.2.1
—CONST(RiMW, —CONST(RJIW,
( f, 1),consensus) = 2f + 1. m, consensus) = 2m + 1.
PROOF. Lemma 6.2.2 (below) implies the lower bound: CONST(RMW, (~, 1), consensus) > 2~ while Lemma 6.2.3 implies the upper bound: CONST(RMW, m, consensus) s 2m + 1. Together with Theorem 3.1,
these imply
the theorem.
LEMMA 6.2.2. 2 f + 1 primitive [n/21
❑
There is no n-process consensus construction using fewer than objects, at most f of which may be l-faulty, and which survives
process failures.
PROOF. The proof is by contradiction. Assume to the contrary a solution using 2f primitive objects. Let the initial internal registers rl, ..., rz~, be Ul, ..., u2f, respectively.
that there is states of the
Computing The
with Faulty
Shared
process-failure
the processes, their
requires
that
in any run
in which
only
half
pl, ...,
on that value,
as they do not know
process.
processes
assumption
1263
Pr. /21 take StePS1 they must eventually decide. ~SO> if are identical, the validity condition requires that they must decide
inputs
of some
Objects
run
Thus,
alone
there
with
whether
the other
is a failure-free,
input
decision
finite
+ 1 and decide
on
value
run
+1.
a
Let
is an input
where
half
the final
the
internal
states of the primitives rl, ..., rzf, inabeul, ..., u ~~, respectively. NOW consider a run in which the other processes, p,. /21+ ~, ..., P. take stePs, run
alone
with
to Uf+l,..., this
run
decided
input
– 1, but find the primitives initially are m states This is consistent with a run in which pl,..., p,. ,2, Vf+l> . ..7 Vzf. no steps and ~ faults have changed the internal states of r + ~,..., r2~ Vzf, respectively. Hence, pf~ /Z1 + 1, ..., P. must all deci d e – 1. But
Ul, ..., Uf, have taken
is also +1,
consistent
and ~ faults
U*, ..., u~, respectwely. diction. ❑ Note
that
whether
Hence,
this lower
wait-freedom,
others
construction
depends
requires take
each
steps.
need only be correct
Moreover, when
the bound the
requirement LEMMA
object
that
P.
tion using 2m + 1 faulty failures is at most m:
RMW
CONST(RMW,
all decide
progress
make
in runs in which
at least
process
may
be strongly
(It
a contra-
assumption
1n/21
for
of
that
the
processes
are
1n/2]
does
than
regardless
we mean
on the behavior
is exceeded.
– 1 other
of the construcnot
depend
on
a
wait-free.)
there is a strongly registers,
wait
all-eadY
., rf back to +1)
progress
failures,
bound
the construction
must
to
have
states of rl,..
on a weaker
process
also does not depend fault
pl, . . . 7Ppl/21
process
(A
For any m >0,
6.2.3.
in which the internal
By [n/21
guaranteed to make progress, processes to take steps.) tion
a run
p,. ,21+ 1,...,
bound
which
the
with
have changed
provided
m, consensus)
wait-free
consensus
the total
number
construcof memoy
s 2rn + 1.
PROOF. The proof of this lemma is based on the construction in Figure 8, which solves strongly wait-free consensus using 2 m + 1 faulty RMW registers, provided this
the total
construction
number is
of memory
a consequence
failures
is at most
m. The correctness
of
series
lemmas
the
of
in
the
of next
subsection. 6.3.
PROOF
OF THE
UPPER
BOUND.
In Figure
8, the input
to process
p is
initially assigned to register must be assigned to register lock(r) and unlock to mark
inputP, local to p, and an appropriate output value decideP. As in Figure 7, Figure 8 uses the notation the beginning and end of atomic, exclusive access
to shared RMW register r. In the construction, there
are 2m
+ 1 shared
registers
r[i],
1
0. the
changes
construction. Sk to Sk+ ~.
to these That
is, let
parameters m be
a step
that
can
result
in a legal
run
from that
Y. AFEK ET AL,
1268 (1)
That is, m is r[i ].uote := u, where the value of m is a memoy fault. r[i].uote is –u in Sk. Note that ZS~+l = ZS~ and RF~fl = RF~ – 1. Here, there are four key subcases.
l~k+ll = 2, and A~+l = A~ + 4. IZ,I < 12,+,1.Then 12 ~+11 = 1~~1 + 2, and O < 1~~1 = l~~+ll. Then lZk+ll = Ilikl = 1, and 1~~+11 < llZkl.Then l~k+ll = lZkl – 2, and A~+l
(a)
Sk = O. Then
(b)
O
1. Moreover, by Lemma 6.3.6, a = al (r[i - 1].uote = O; r[i – 1] = u’)a,m, where (r[i – 1].uote = O; r[i - 1] = u’) is the successful read-modify-write to r[i – 1].vofe and there are no successful read-modify-write operations in az. Let Sj be the state at the beginning of CYz,just after the successful read-modi~-write to r[i – l],uote. By induction, Aj > 0. Also by Lemma 6.3.6, CYz contains the i – 1 reads in collect by p that precedes w. In addition, since none of the operations in are successful read-modify-writes, by the analysis above A j < “”” < A~. Next, consider the sequence of values Zj, . . . . ~~. observe that since the is legal, and because there is no successful read-modify-write operation in the value of Z never changes by only 1 in az. We examine cases depending the sign of the sum collected by p, and show that a fault described in case lb, or lC must occur in az, implying Ak >3, and so Ak+ ~ >1, The
collect
—Some
Z,
sums to a valid in Zj,.
value.
There
are two subcases.
. ., Zk has the same sign as the sum collected.
the az run az, on la,
Computing
is
with Faulty
Shared
1269
Objects
Since lZ~+ ~I = 1~~ I – 1, it must be that the sign of the value written different than the sign of ~~. Thus, the sign of 2X ( = sign
collected and
= sign of
2,+ ~ in this
u) is different
sequence
than
such that
the sign of
either
Z~. Hence,
~, = O and
by n-(~~) of sum
there
exist
IX,+ 1I = 2, or
Z,
IX,l =
l~.+ll = land ~,= –2,+1. Then either A,+l = A,+ 4 >50r A,+l = A,+2 > 3, respectively. (This must be due to a fault of type lb or lc, above.) Thus, A~ z A,+l —No
>3
and hence,
>1.
E in ~j, . . . . ~~ has the same sign as the sum collected.
By definition, the
A,~l
the sign of the sum collected
i – 1 registers
read
in the collect.
must be the sign of a majority
Thus,
a majority
of the
each have the same sign as the COIIect and as u at some point
of
i – 1 registers in the interval,
but not at the end of the interval, Then, a fault in the interval must change the value of at least one of these, from u to – u. That is, there exist 2, and 2,+ ~ in this sequence A ~+1 >3. The collect —Some
such
that
IZ,+ ~1= IZ,I +
sums to O. There
2X in ~j,.
2, and
A,+ ~ = A, + 4>5.
Hence,
are two subcases,
. . . Xk has value
O.
Since Xk # O, some fault must move the sum from O. That is, there exist ~, such that 2, = O and IX,+ 1I = 2. Hence, A,+ 1 = A, and 2,+ ~ in this sequence +4>5and A~+l>3. —No
X in ~j,...,
That first
is, half
that there
~~ has value
the registers exist
O.
are read
as positive,
2, and X,+ ~ in the sequence
sign: that IZ,I = IZ,+II = 1 and A~ >3. Hence, A~+l >1.
Z, =
–Z,+l,
and half ~j,..., Then,
A,+l
Suppose next that all of Zj, . . . . Zk have the same sign. u I < I~~ 1,their sign is different than u ‘s. The COIIect read uote = – u and half with vote = u. Since all the 2 have some fault changes a value from u to – u in cq. That Z,+l in the sequence Hence, A~,l >3. The collect By Lemma
such that
12,YII
as negative,
= A, + 2>3
successful
and
Since 1~~+ ~1= lZ~ + half the registers with sign different than u, is, there exist 2, and
= 1X,1 + 2, and A,+,
is nonzero and invalid. 6.3.1, all i – 1 of the earlier
Suppose
Zk that have different
= A, + $ z 4 z 5.
read-modify-writes
wrote
the single valid value, u, yet the sum of the collect had opposite sign. Hence, in a at least (i/21 registers had faults changing the value from u to – u. Recall that RF~+ ~ is m minus the number of faults in am; hence, RF~+ ~ < m – [i\21, and 2RF~~ ~ s 2m – 2[i/21 s 2m – i. Hence, we have
>lZ~+ll
+2m+l–i–(2m–
i)
n
Y. AFEK ET AL.
1270 LEMMA
6.3.8
PROOF. register,
Consider r[2m
All processes
(AGREEMENT).
Sk
in
the
+ 1], has been
system
written.
decide on the same value.
state
Sk, immediately
By Lemma
6.3.7,
after
2RF~
the
last
< 112k1. Hence-
forth, the number of remaining faults is insufficient to change the sign of Zk, or reduce it to O. Since all the reads in any final collect (upon which any decision
is based)
are
made
after
Sk, all
processes
decide
on
the
same
❑
value. Recall
that
without
a register
any process
3.1 imply
is k-faulty
writing
into
if it can change
it, at most
k times,
its value
Lemma
spontaneously,
6,2,3 and Theorem
the following:
COROLLARY
construction k-faulty:
6.3.9.
using
2m
For any 1 s k s m, there is a strongly wait-free consensus + 1 l?iWW registers where at most 1m/k] registers are
c0NsT(m4aklc0nsensus) ‘2m+1 6.4.
UNIVERSAL
universal
object,
implementation objects
for
Herlihy’s Recently, Herlihy’s
CONSTRUCTIONS. as an
of
any
object other
n processes
that
[1991] used
be
defined to
the
construct
notion
of a
a wait-free
object.
He
showed
that
consensus
are universal
for
systems
with
at most
and
other
n processes.
construction required atomic registers over an unbounded domain. Jayanti and Toueg [1992] showed how to bound the register size in construction.
THEOREM are universal,
6.4.1 (HERLIHY). provided
Consensus
For
objects
none of them are faulty:
CONST((consensus, PROOF.
Herlihy can
any number
atomic),
of processes
construction from n-processor consensus reliable atomic read/write registers of
together
with
atomic
registers
For all objects X, (O, ~), X)
n, Herlihy and atomic unbounded
< ~. [1991]
gives
an explicit
registers that uses 0(n3) size and 0(n3) reliable
consensus objects over a bounded domain. Simple extensions of known constructions can be used to implement multi-valued consensus from binary consensus and n read/write registers [Plotkin 1988]. (Each process p writes its input to a read/write register rP, initialized to some invalid value, then processes use binary consensus on log n consensus objects to agree on the index q of a register r~ that contains a valid value.) Although the construction does not tolerate faults, it must be strongly wait-free.
Herlihy’s
[1991]
construction
includes
a loop
with
an exit condition
dependent on values read from shared memory. If a fault occurs, the construction may loop forever, so this precise construction is not strongly wait-free. However, there is a bound on the number of low-level operations on shared data that are required to implement a single high-level operation. Hence, this construction can be made strongly wait-free by adding an exit condition when the bound on the number of low-level steps is exceeded. Other constructions such as Plotkin [1988] can be modified similarly. ❑
Computing
with Faulty
THEOREM
Shared
Objects
1271
6,4.2
—RMWregisters are universal objects, if a bounded For all objects X, and for all f < ~, CONST(RMW,
(f,
number
~), X)
of them are ~-faulty:
< w.
—Consensus objects together with safe registers are universal objects, if a bounded number of them are w-faul~: For all objects X, and for all f < ~, CONST((consensus, PROOF.
Note
that
first
safe),
it is trivial
( f, ~), X)
to use a RMW
wait-free implementation of an atomic register: = 1. From this observation, Corollary 4.4 CONST(RMW,
(f, ~), atomic)
s 20f
< CO.
+ 8 and
register
from
Theorem
CONST(RMW, ( f, ~), consensus) = 2 f + 1. Composing these the construction in Theorem 6.4.1, and appealing to Theorem first part of the theorem. To prove
the
second
part
of the theorem,
tions of atomic registers from safe registersll was done above with Herlihy’s construction:
as a strongly
CONST( RMW, (O, ~), atomic) and Theorem 3.4, we have
known
6.1.5
fault-intolerant
can be made CONST(safe,
we
have
two facts with 3.3, proves the construc-
strongly wait-free as (O, m), atomic) s m.
Composing these constructions and Theorem 6.4.1, we obtain CONST((consensus, safe), (O, M), R&fW) s w. The result follows by the first part of the theorem
and Theorem
❑
3.4.
The proof
of the second
THEOREM
6.4.3.
part
Suppose
of this theorem
X
is a universal
can be readily object,
generalized:
in the sense that for
all
objects Y, there is a strongly wait-jiee construction of Y jiom X. Then there is a fault-tolerant construction of Yfiom X: For all objects Y, and for all f < w, CONST(X,
(O, m), Y)
< ~ == CONST(X,
(f,
~), Y)
< ~.
In their recent paper on memory faults, Jayanti et al. [1992] give a direct fault-tolerant, strongly wait-free construction of consensus from consensus, which they use in an alternative proof of the second part of the previous theorem: consensus objects and registers can be used in fault-tolerant, strongly wait-free universal constructions. 7. Discussion, There
Open Problems,
remain
distributed multi-valued
many
unresolved
and More issues
about related
Related
Work
to shared
memory
failures
in
systems. Faulty versions of other shared data objects, such as test & set registers, m-registers, and compare & swap registers,
are of interest. Based on the constructions presented in this paper, fault tolerant construction of fetch-and-add and swap shared objects are presented in Afek et al, [1993b]. We have tight bounds on only a few problems; more efficient constructions and corresponding lower bounds would also be interesting. It would be particularly interesting to implement memory-fault tolerant data objects directly from similar, faulty objects, such as test & set from test & set, 11See, for example, Bloom [1987], Burns and Peterson [1987], Lamport [1986], Li et al. [1989], Peterson [1983], Peterson and Burns [1987], Singh et al. [1994], and Tromp [1989].
Y. AFEK
1272 without
using
without
the overhead
All
atomic
our solutions
registers,
or read-modify-write
of a universal
from
ET AL.
read-modify-write,
construction.
are deterministic.
It would
be interesting
to explore
the use
of randomization to tolerate memory failures. Also, there is much work to be done in exploring the effect of memory failures in other models, such as synchronous or semi-synchronous models. Earlier, we referred several times to a paper explores
other
interesting
issues related
by Jayanti
to memory
et al. [1992]
failures,
as well
that
as indepen-
dently proving some of the same results presented here. Our fault models allow the study of individual faults within an object, setting the stage for an investigation of transient fault behavior. The work of Jayanti et al. [1992] assumes an object is either permanently faulty or completely reliable, but studies a range of fault types, from extremely malicious to benign crash behavior. These definitions coincide in our notion of ~-faulty spond results —A
nonresponsive
faults,
objects,
corre-
which
to responsive, arbitrary failures in the work of Jayanti et al. Specific by Jayanti et al. that are identical or closely related to ours include:
version
of Theorem
tion, in any fault model, used in our proof,
3.2 that that
states general
suffice
conditions
on self-implementa-
to apply the same recursive
construction
—Composition results similar to Theorems 3.3 and 3.4. —Register constructions analogous to Theorems 4.1, 4.3, and Corollary 4.4. —A direct, fault-tolerant and strongly-wait free construction of consensus from
consensus,
second
part
which
of Theorem
is used to prove
The generalization ing construction that
of Theorem generalizes
Gracefully
degrading
constructions
to
the
are
required
exhibit
universality
6.4.2 and to Theorem
primitives fail, as the low-level ing constructions in different
results
analogous
to the
6.4.3.
3.2 depends on a notion of gracejidly degradour notion of strongly wait-free construction.
same
from type
potentially of
fault
faulty behavior,
low-level when
primitives too
many
primitives themselves suffer. Gracefully degradfault models can be composed in general ways,
just as we compose strongly wait-free constructions in the ~-fault model. We believe this generalization of strong wait-freedom is an important contribution. Jayanti et al. [1992] also investigate an interesting range of fault models distinct from those we study, focusing on computability and universality issues, They show that unresponsive faults (in which invocations to objects may not evoke replies) are essentially impossible to overcome. In addition to the arbitra~ fault model, they study two weaker responsive fault models, omission faults and crash faults. They show universal gracefully degrading constructions for omission faults, and argue that crash faults are too benign. That is, even if the low-level primitive objects fail by crashing, it is essentially impossible to create abstract components that gracefully degrade (or exhibit only crash faults when too many primitives crash). Following on this work, three of us have defined a benign fault model intermediate in power between the omission and crash fault models, and shown that it supports universal, gracefully degrading constructions [Afek et al. 1993]. The authors benefited substantially from the efforts of outstanding referees. They caught several errors and omissions, and the presentation is much improved, in clarity and style, from our original draft.
ACKNOWLEDGMENT.
Computing
with Faulty
Shared
Objects
1273
REFERENCES ABRAHAMSON,K.
1988. On achieving consensus using a shared memory, In Proceedings of the 7th ACM Symposium on Principles of Disti’buted Computing (Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 291-302. AFEK, Y., ATTIYA, H., DOLEV, D., GAFNI, E., MERRITT, M., AND SHAVIT, N. 1993. Atomic snapshots of shared memory. J. ACM 40, 4 (Sept.), 873–890. AFEK, Y., DOLEV, D., GAFNI, E., MERRITT, M., AND SHAVIT, N. 1994. A bounded first-in, first-enabled solution to the l-exclusion problem. ACM Trans. Program. Lang. Syst. 16,3 (May), 939-953. AFEK, Y., GAFNI, E., TROMP, J., AND VITLNYI, P. M. B. 1992a. Wait-free test-and-set. In Proceedings of the 6th International Workshop on Distributed Algorithms. Lecture Notes in Computer Science, vol. 647. Springer-Verlag, New York, pp. 85–94. AFEK, Y., GREENBERG, D., MERRITT, M., AND TAUBENFELD, G. 1992b. Computing with faulty shared memory. In Proceedings of the llth Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12), AFEK, Y., MERRIn,
M., AND TAUBENFELD, G.
1993a.
Benign
failure
models for shared mem-
Workshop on Distributed Algorithms. ory. In Proceedings of the 7th International Computer Science, vol. 725. Springer-Verlag, New York, pp. 69-83. AFEK, Y., WEISBERGER, E., AND WEISMAN, H.
1993b.
A completeness
Lecture
theorem
Notes in
for a class of
synchronization objects. In ~oceedings Ofthelzth ACM $wosium onfi@leS OfDiSm”bU@d Computing (Ithaca, N.Y., Aug. 15-18). ACM, New York, pp. 159-170. ASPNES, J., AND HERLIHY, M. 1990. Fast randomized consensus using shared memory. J. Algorithms, pages 281–294, September. BELL, G. 1992. Ultracomputers: A teraflop before its time. Commun. ACM 35, 8 (Aug.), 27-47. BLOOM, B. 1987. Constructing two-writer atomic registers. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 249-259. BURNS, J. E., AND PETERSON, G. L. 1987. Constructing multi-reader atomic values from non-atomic values. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 222-231. 1989. Linda in context. Commun. ACM 32, 4 (Apr.), CARRIERO, N., AND GELERNTER, D. 444-458. CHOR, B., ISRAELI, A., AND LI, M. 1987. On processor coordination using asynchronous hardware. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 86-97. DOLEV, D., GAFNI, E., AND SHAVIT, N. 1988. Towards a non-atomic era: l-exclusion as a test case. In Proceedings of the 20th Annual ACM Symposium on Theo~ of Computing (Chicago, Ill., May 2-4). ACM, New York, pp. 78-92. DIJKSTRA, E. W. 1965, Solution of a problem in concurrent programming control. Commun. ACM 8, 9 (Sept.), 569. ACM DIJKSTRA, E. W. 1974. Self-stabilizing systems in spite of distributed control. Commun. 17, 9 (Nov.), 643-644. FISCHER, M. 1983. The consensus problem in unreliable distributed systems (a brief survey). In Foundations of Computation Theoiy, M. Karpinsky, Ed. Lecture Notes in Computer Science, vol. 158. Springer-Verlag, New York, pp. 127-140. FISCHER, M., LYNCH, N., BURNS, J., AND BORODIN, A, 1979. Resource allocation with immunity to limited process failure. In Proceedings of the 20th IEEE Annual Symposium on Foundation of Computer Science. IEEE, New York, pp. 234-254. FISCHER, M., LYNCH, N., BURNS, J., AND BORODIN, A. 1985a. Distributed FIFO allocation of identical resources using small shared space. Tech. Rep. MIT/LCS\TM-290. Laboratory for Computer Science, Massachusetts Institute Technology, Cambridge, Mass. FISCHER, M., LYNCH, N.j AND MERRITr, M. 1985b. Easy impossibility proofs for distributed consensus problems. Dist. Comput. 1, 1,26–39. FISCHER, M., LYNCH, N., AND PATERSON, M. 1985c. Impossibilhy of distributed consensus with one faulty process, J. ACM 32, 2 (Apr.), 374–382. FISCHER, M. J., MORAN, S., RUDICH, S., AND TAUBENFELD, G. The wakeup problem. SL4M J. Comput. to appear. asynchronous consensus FISCHER, M. J., MORAN, S., AND TAUBENFELD, G. 1993. Space-efficient without shared memory initialization. Inf. Proc. Lett. 45 (Feb.), 101–105.
1274
Y. AFEK
ET AL.
M. 1991. Wait-free synchronization. ACM Trans. Prog. Lang. ,$yst. 13, 1 (Jan.), 124-149. HERLIHY, M. P., AND WING, J. M. 1987. Axioms for concurrent objects. In Proceedings of’ the 14th Annual ACM Symposium on Principles of Programming Languages (Munich, West Germany, Jan. 21 -23). ACM, New York, 13-26. JAYANTI, P., CHANDRA, T., AND TOUEG, S. 1992. Fault-tolerant wait-free shared objects. In of Computer Science. IEEE Proceedings of the 33rd IEEE Annual Symposium on Foundation Computer Society Press, New York. JAYANTI, P., AND TOUEG, S. 1992. Some results on the impossibility, universality, and decidabilWorkshop on Distributed Algorithms. ity of consensus. In Proceedings of the 6th International Lecture Notes in Computer Science, vol. 647. Springer-Verlag, New York, pp. 69–84. H. H. 1987. Memory requirements for agreement among LOUI, M. C., AND ABU-AMARA, unreliable asynchronous processes. Aduarrces in Computing Research. 4, 163– 183. LAMPORT, L. 1974. A new solution of Dijkstra’s concurrent programming problem. Comrrum. ACM 17, 8 (Aug.), 453-455. LAMPORT, L. 1986. On interprocess communication, parts I and II. Dist. Comput. 1, 77-101. LI, K., AND HUDAK, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov.), 321-359. concurrent wait-free LI, M., TROMP, J., AND VITANYI, P. M. B. 1989. How to construct variables. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Lecture Notes in Computer Science, vol. 372. Springer-Verlag, New York, pp. 488–505. 1987. Hierarchical correctness proofs for distributed algoLYNCH, N. A., AND TUTTLE, M. rithms. In Proceedings of the 6th Annual A CM Symposium on Pn”nciples of Distn”buted Computation (Vancouver, B. C., Canada, Aug. 10–12). ACM, New York, pp. 137–151. Expanded version available as Tech. Rep. MIT/LCS/TR-387. April 1987. Massachusetts Institute of Technology, Cambridge, Mass. MERRITT, M., AND ORDA, A. Efficient test& set algorithms for faulty shared memory. Unpublished note. MORAN, S., TAUBENFELD, G., AND YADIN, I. 1992. Concurrent counting. In Proceedings of the Ilth Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 59-70. PETERSON, G. L. 1983. Concurrent reading while writing. ACM Trans. Prog. Lang. Syst. 5, 1 HERLIHY,
(Jan.), 46-55. PETERSON, G. L., AND BURNS, J. E. 1987. Concurrent reading while writing II: The multi-writer case. In Proceedings of the 28th IEEE Annual Symposium on Foundation of Computer Science (Oct.), IEEE, New York, pp. 383-392. 1985. Operating System Concepts. (Third edition). PmERsoN, J. L., AND SILBERSCHATZ, A. Addison-Wesley, Reading, Pa, PLOTKIN, S. A. 1988. Chapter 4: Sticky Bits and Universality of Consensus, Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, Mass. RABIN, M. O. 1982. N-process mutual exclusion with bounded waiting by 4 log n shared variables. J. Comput. Syst. Sci. 25, 66–75. SAKS, M., SHAVIT, N., AND WOLL, H. 1991. Optimal time randomized consensus—making resilient algorithms fast in practice. In Proceedings of 2nd Annual ACM–SL4M Symposium on Discrete Algorithms (San Francisco, Calif., Sept. 27-28). ACM, New York, pp. 351-362. SINGH, A. K., ANDERSON, J. H., AND GOUDA, M. G. 1994. The elusive atomic register. J. ACM 41(2), 311. SMITH, A. J. 1982. Cache memories. Comput. Suru. 14, 473-540. TROMP, J. 1987. How to construct an atomic variable. In Proceedings of the 3rd International Workshop on Distributed Algorithms, J. C. Bermond and M. Raynal, eds. Lecture Notes in Computer Science, vol. 292, Springer-Verlag, New York, pp. 292–302. VITI.NYI, P. M. B., AND AWER~UC~, B. 1987. Shared register access by asynchronous hardware. In Proceedings of the 27th IEEE Annual Symposium on Foundation of Computer Science. IEEE, New York, 233-243. RECEIVED SEPTEMBER 1992; REVISED JULY 1995; ACCEPTED JULY 1995
Journal
of the
Association
for
Computing
Machinery,
Vol.
42, No.
6, November
1995