Computing with faulty shared objects

Computing

YEHUDA Tel-Aviv

Shared Objects

AFEK Universi@

DAVID

Tel-Aviv,

Israel, and AT&T

Bell Laboratories,

Murray Hillj

New Jersq

S. GREENBERG

Sandia National

MICHAEL AT&T

with Faulty

Laboratories

MERRITT

Bell Laboratories,

Murray Hill, New Jersqy

AND GADI AT&T

TAUBENFELD Bell Laboratories,

Murray Hill, New Jersey

Abstract. This paper investigates the effects of the failure of shared objects on distributed systems. First the notion of a faulty shared object is introduced. Then upper and lower bounds on the space complexity of implementing reliable shared objects are provided, Shared object failures are modeled as instantaneous and arbitraty changes to the state of the object. Several constructions of nonfaulty wait-free shared objects from a set of shared objects, some of which may suffer any number of faults, are presented. Three of these constructions are: (1) A reliable atomic read/write register from 20~ + 8 atomic read/write registers ~ of which may be faulty, (2) a reliable test& set register for n processes from n + 10 primitive test & set registers, one of which may be faulty, and 3n + 13 reliable atomic registers, and (3) a reliable registers when f of these may be faulty. Using consensus object from 2f + 1 read-modify-write these constructions a universal construction of any linearizable shared object from a set of either

A preliminary version of the results presented in this paper appeared in Proceedings of the llth Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 47-58.

David Greenberg was supported DE-AC04-76DPO0789.

in part

by the U.S. Department

of Energy

under

contract

Authors’ addresses: Y. Afek, Tel-Aviv University, Tel-Aviv, Israel and AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974 D. S. Greenberg, Sandia National Laboratories; M. Merritt, AT&T Bell Laboratories, Room 213-409, 600 Mountain Avenue, Murray Hill, NJ 07974, e-mail: mischu@research,att.tom; G, Taubenfeld, AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and\or a fee. 01995 ACM 0004-5411/95/1100-1231 $03.50 Journal

of the Associationfor ComputinsMachine~, Vol. 42,1$0.6,November1995,pp.

1231-1274.

Y. AFEK ET AL.

1232 n-processor consensus objects or n-processor faulty, is presented. Categories General

and Subject Descriptors:

Terms: Algorithms,

Additional

Key Words

read-modifywrite

B.3.2. [Memory

fault-tolerance,

and Phrases: Atomic

registers,

Design Styles-shared memory.

Structures]:

shared memory,

some of which may be

synchronization.

operations.

1. Introduction The

desire

for increased

causing

many

real-world

tributed

systems.

performance,

reliability,

higher

applications

to be moved

to this work,

research

Previous

and decreased

cost is

mainframes

to dis-

from

on the reliability

of distributed

systems has concentrated on tolerating the failure of individual processors. It has been shown that, by using shared objects to provide communication and coordination between processors, it is possible to tolerate processor failures. Distributed systems can be built that achieve the expected high performance when no processors fail and that are able to continue (at a lower performance) when even all but one processor fails. These fault-tolerant systems depend critically on strong shared objects, yet little

research

objects.

This

has been directed paper

explicitly

to studying

defines

how to tolerate

the problem

shows how they can be made fault tolerant. Before considering ~azdty shared objects Intuitively,

as the name

implies,

a shared

of faulty

we must

object

allows

faults

in the shared

shared

define

objects

a shared

two or more

and

object.

processors

to share information (a formal definition is given in Section 2). The simplest shared object is a shared memory word (or register), into which one processor writes and any other may read.1 The necessity of having more complex shared objects was recognized by IBM and other computer manufacturers in the early 1970’s. The inability of concurrent systems that used only read/write memory cells to maintain definition

(and

a concurrent construction)

queue

or stack or to reach

of stronger

primitives

such

consensus

led to the

as test-and-set

and

semaphores [Dijkstra 1974; Peterson and Silberschatz 1985]. Most of this paper is concerned with these stronger types of objects (one section, Section 4, is devoted to read/write registers). Herlihy has defined a hierarchy of progressively stronger shared objects [Herlihy 1991]. Objects at each level are able to perform impossible for objects at the lower levels. The question asked

tasks that are in this paper is

whether it is possible to use shared objects that sometimes fail to meet their specifications. It is shown, here, that it is indeed possible to use a set of objects some of which are faulty (at each of three levels of the hierarchy: read/write, test-and-set, and read-modify-write) to produce compound objects that are fault tolerant. The highest level of the hierarchy (exemplified by our read-modi&-write) is universal, that is, allows the construction of any other shared object. A result of our constructions is that any shared memory object Y has a fault tolerant construction from any object type X in the highest level, that is, a construction from a set of objects of type X some of which may be faulty.

1See, for example, Burns and Peterson Awerbuch [1987].

[1987], Lamport

[1986], Singh et al. [1994], and Vit6nyi

and

Computing

with Faulty

It should

be clear

memory. However, less straightforward. locations

that

Shared that

Objects

a shared

1233

object

is very different

can be used for

storing

and retrieving

processors,

and is not used for synchronization.

of semantic

rules

that

for synchronization. adding tion

either

features

from

local

processor

the distinction between shared object and shared memory is A shared memory typically consists of a set of memory

restricts A shared

hardware

the behavior memory

or software

of the shared

data

by several

A shared of the object

object that

can be used to build

protocols

to enforce

different

includes

a set

thus can be used shared

objects

by

the timing/coordina-

object.

Since a shared object is defined by its behavior, it can be implemented in many ways, either through hardware or through the combination of software and hardware. Thus, the failure of a shared object can occur in many ways. Our intent is that the model of shared object failure should cover all possible failures in implementation. Hardware implementation can fail in several ways: the

data

contained

incorrectly Software infinite

order protocols loop,

assigned

in

the

events,

object

or

can

fail

a protocol

in

be can

several

mistakenly

to it, or a protocol

can

requests

receiving

corrupted, be lost

ways:

allowing

the

due

a buggy a process

data in an order

control

to

logic

switching

can

failures.

protocol

entering

an

to affect

memory

not

for which

no contingency

had been planned. One might attempt to apply common techniques for dealing with faulty local memory to faulty shared objects. These techniques include keeping many copies of each datum or, in general, using some type of redundant coding that allows errors to be detected and/or is much more complex to implement

corrected. Unfortunately, a shared object than a memory cell of a uni-processor;

that is, common primitives such as read-modify-write writes require a memory which is much more than (see, e.g., Smith memory

[1982]).

associated

with

The use of standard the

object

will

or even simple reads and a passive repository of data

redundancy

not,

access to the object through protocol failure. If redundancy is spread across objects, then

for the

techniques

example,

within

prevent

encoding

a loss

is subject

the of

to the

timing inconsistencies that are the bane of all distributed algorithms. Techniques to recti~ faults in shared memory objects will necessarily have to incorporate distributed coordination techniques. It is precisely how to supply redundancy while maintaining distributed coordination which is the subject of this paper, After a discussion of shared object failures and the specification of faulty memory primitives (in Section 2), the body of this paper presents a sequence of constructions that use faulty shared memory objects. Section 3 presents elementary relations among general fault-tolerant constructions, showing how different constructions can be composed. Section 4 begins with constructions of reliable read/write memories from faulty read/write memories. Following Lamport [1986], we present simple constructions of safe and regular registers and conclude with a construction of a reliable atomic register from faulty test & set objects from similar objects, atomic registers. In Section 5, reliable some of which might be faulty, are constructed. In Section 6, implementations of consensus are studied, using unreliable read-modifi-write primitives. The constructions in this section demonstrate that faults do not qualitatively decrease the power of these primitives, in that they retain

their

positions

in the memory

hierarchy

of Herlihy

[1991].

Moreover,

in

1234

Y. AFEK

ET AL.

combination with the earlier register results, our consensus be used to implement the universal construction of Herlihy

constructions can [1991]. Hence, for

example,

to

shared

faulty object.

read-modify-write The paper

1.1.

RELATED

memory

systems

objects

are

distributed restricted may

WORK.

reliable.

2 This

systems

to

either

affected,

previous

periods

in

has of

paper

general

without

research

specific

studies

total any

explored

time.

For

can

a discussion

Theoretical

typically

only

be

primitives

closes with

research

process in

number, restriction failures example,

on

failures

describes faults

the

first

in on

which such

assumes study

objects. the

implement

of

that the

3 Memory

number

of

data

timing

of

the

the are

any

problems.

fault-tolerance

and

shared or

be used of open

restricted

constrained

to memory

in

shared

the

shared

tolerance

of

failures

are

objects

that

faults. occur

Some during

faults

are

self-stabilizing systems defined by Dijkstra [1974]. Self-stabilizing systems are required to recover once the final memory fault has occurred, and the system is in an arbitrary state. That is, self-stabilizing protocols may start at any erroneous, globally inconsistent state and must always reach a correct global state satisfying particular safety requirements. Three previous papers investigated initialization failures, a special case of studied

in work

on

self-stabilization model, only the initial

state

[Fischer et al. to appear; 1993; Moran et al. 1992]. In initial state of shared objects may be corrupted, while

of the processors

is not corrupted

(e.g., the program

that the

counters).

In

this paper, the corruption may repeat at any time during the run. In Section 6 of this paper, we use implementations of consensus to explore properties of faulty shared memory. The consensus problem is fundamental in distributed

computing

and is at the core of many

algorithms

for fault-tolerant

distributed applications. Much is known about the consensus problem in other fault models.4 Sections 4 and 5 investigate the question of constructing reliable registers in an unreliable environment. This relates to the fundamental problem of implementing one type of shared object from another. Previous work on shared [1987],

object Herlihy

implementations [1991], Lamport

includes Bloom [1987], Burns and Peterson [1986], Li et al. [1989], Plotkin [1988], Singh et

al. [1994], and VitAnyi and Awerbuch [1987]. As noted above, there is a relationship between implementing a parallel processor shared memory and the construction of shared objects. Implementing shared objects of any type in a network-based machine is difficult. The task of including coordination protocols and fault tolerance could not begin without the basic mechanisms for sharing. One approach to providing sharing is to use local caches to hold pieces of global data required by each processor. Special operating system functions and hardware remove from the user the burden of coordinating the accesses to the shared data. One such system has been proposed, implemented and analyzed in Li and Hudak [1989]. The KSR, DASH, and Alewife machines are a few examples of machines that use such ‘See, for example, Abrahamson [1988], Afek et al. [1993], Bloom [1987], Burns and Peterson [1987], Chor et al. [1987], Herlihy [1991], Lamport [1986], Plotkin [1988], Rabin [1982], Singh et al. [1994], and Vit~nyi and Awerbuch [1987]. 3Concurrently and independently of our work, Jayanti et al. [1992] have addressed similar ~roblems (see Section 7 for more on this paper). See, for example, Abrahamson [1988], Aspnes and Herlihy [1990], Chor et al. [1987], Fischer [1983], Fischer et al. [1985b; 1985c], Loui and Abu-Amara [1987], and Saks et al. [1991].

Computing

with Faulty

Shared

Objects

1235

distributed shared memory [Bell 1992]. Another, more software-oriented, approach is implemented in the Linda system [Carriero and Gelernter 1989]. In Linda,

an abstract

operations

tuple

space is shared

are available

one of these

to insert

methods

techniques) expect that

will our

mentations

in order

2. Specifying

and

(instead delete

of implementing

of cache

tuples.

sharing

(or

lines

or pages),

Obviously,

and

the choice

any of many

other

of

clever

affect the types of errors expected from shared objects. We ideas on fault tolerance will be specialized to various impleto increase

Faulty

efficiency.

Shared Memoiy

Objects

We consider a collection of asynchronous processes that communicates via a collection of shared memory objects. The shared memory may consist of a variety of shared data objects. These shared objects are subject to faults. Each fault with

of a shared object is modeled as a state change that appears respect to the processes’ operations. The local (nonshared)

each process

is assumed

to be atomic memory for

to be reliable.

An object @’ shared by n processes can be specified via a state machine which the state transitions are labeled by the invocations and responses operations

performed

Definition 2.1. A {Q, S, 1, R, 8} where: ● ●

in of

by the processes. sequential

specification

Q 1S a ~finite or infinite) set of states. S G Q 1s a set of initial states. of I=(lrwl, ..., lnu~) is an n-tuple

●

of an object

sets, where

type

each

is a quintuple

Inui

is a set of

symbols

denoting the operation invocations by process i. Let Irm = IJ i lnui. of sets, where each Resi is a set of . R = (Resl, . . . . Res~) is an n-tuple ●

symbols denoting the operation responses for process i. Let Res = U ~Resi. Define the set of operations by process i on @ to be @pi = Inui x Resi, all the

two-character

Op = U iopi. This state concatenating

strings

Then

of

invocations

and

8 c Q x Op X Q is the transition

by some process every process

by

i with

on shared

no succeeding objects

i, and every

inui ● Invi,

that (s, inuiresi, s’) = 8. The sequential specification

response,

are required

allows

there

and

let

obtained by graph. The

prefixes of these strings. invocation, that is, an invi

resi, by i).

to be total: exists

operations

i,

relation.

machine denotes a set of (finite and infinite) strings, the edge labels along paths in the state transition

sequential runs of @ are the (finite and infinite) (Taking prefixes allows runs to end with a pending Operations

responses

for every state

s ● Q,

resi = Resi and s’ e Q such

to be specified

as atomic

state

transitions. However, in asynchronous concurrent systems operations have duration and can overlap in time. This is modeled by allowing the interleaving of invocations and responses by different processes, so that between the invocation and response by one process may be any number of invocations or responses by other processes. Thus, concurrent runs of the object are modeled as elements of ( Inu + Res)m (i.e., finite and infinite runs). Specific correctness conditions constrain these runs by relating them to those in the sequential specification. One such correctness condition is linearizability [Herlihy and Wing 1987].

Y. AFEK ET AL.

1236 We next define

the notion

of linearizable

runs of an object

@. Given

a string

of events a = (Inu + Res)m, define a partial order, t >0, let L = log,+ ~ CONST(X, (t, IXJ),X). Then:

CONST(X, PROOF. base

As

construction

strongly

wait-free

illustrated

in

of a reliable

objects

(f,@), the

left

object

of type

X) side

of type

s ((t of

+ l)f)~.

Figure X,

X, t of which

1, assume

that

CONST(X, may be ~-faulty.

using

C =

there

is a

(t, ~), X)

Computing

with Faulty

Shared

Objects

1239

Embedded

object

primllive

Base construction from

3 objects

Obje Ct

High-level

of X

from

of type X,

construction

3 embedded

of X

objects

each constructed

from

of type X,

3 primitive

objects of type X. ~G. 1.

Constrictions

The goal is to construct

a strongly

~ of the primitive base construction, X.

Call

in the proof of Theorem

wait-free

object

3.2.

X that

each of these

C primitives

each of the C embedded

objects

“embedded using

resilient to t faults of the embedded Consider next how many primitive

objects”,

and now

the same base construction.

as illustrated on the right side of Figure primitive objects of type X. Moreover, resilient to t faults of the true primitives,

even if

embedded objects return fail. The theorem follows

objects. faults are necessary

C strongly

some value, by

wait-free

applying objects

implement The result,

1, is a construction of X from C2 each of the embedded objects is and the single high-level object is to cause the high-level

objects must object to fail: at least t + 1 of the embedded these embedded objects will be correct unless at least t + objects fail. Hence, the high-level object is resilient to (t + faults. The strong wait-freedom of the base construction

construct

is reliable,

objects are w-faulty. Consider the following idea: take the of type which survives up to t faults among the C primitives

fail,

and

primitive

this

First,

idea

recursively. X,

of

1)2 – 1 primitive ensures that the

even if t + 1 or more

of type

each

1 of its primitive

each of which

objects

recursively

is resilient

to

[ ~/(t + l)J faulty objects, then use these C objects to construct a single object The result is an object that can tolerate the X, using the t fault construction. [~/( t + 1)] resilient objects. fault of t of the embedded By recursion, each of the C embedded objects, c, will behave in a fault-free objects used to manner, provided no more than 1~/( t + 1)j of the primitive in c fail, then c may construct c are faulty. If more than [ ~\( t + 1)]primitives no longer behave correctly, and may appear to be faulty, to the higher level

Y, AFEK

1240 construction. But by the strong at least return. In order to cause the highest

wait-freedom

of the construction,

level of the recursion

faulty

behavior,

behavior.

For

one of the

embedded objects to fail, it. Since the total number

there must be at least [ ~\( t + 1)1 memory faults in of faults is ~ and ~ – t[ ~\( t + 1) 1s ~/(f -t 1), none objects

faulty

to exhibit

objects

embedded

exhibit

calls to c will

~ + 1 of the C embedded

of the C – t other

must

ET AL.

is faulty.

Hence,

the final

tolerant of at least ~ memory faults. The total number of X objects used in this construction CONST(X,

Time

T

(t, @),X)

< CONST(X,

(t, ~), X) LIOg’”’fJ+l

s CONST(X,

(t, ~), X)logt+f+l

((t+

complexity

“ CONST(X,

grows

similarly:

bound on a single high-level

Fault-tolerant

constructions

in obvious

THEOREM

suppose

in the base construction

of X, that needed to Then in the

~ faults, each high-level operation ~ operations on primitive objects.

can be composed

For any f, m >0,

3.3.

(f,

CONST(X,

object

+ 1)1, ~), X)

with

fault-intolerant

requires ❑ construc-

ways:

CONST(X,

PROOF.

([~/(t

the number of primitive operations operation on the constructed object.

recursive construction to overcome at most TIOgr+1~+ 1 = ((t + l)~)log’+’

tions

is:

l)f)L.

is an upper

implement

is

(~, CZJ),X)

= CONST(X,

=

construction

The

k), Z)

< CONST(X,

(f,

m, Z)

s CONST(X,

rn, Y)

theorem

of type Z from

in a fault-tolerant

follows

objects

way from

A less obvious

and for all k = {1,...

by taking

oCONST(Y,

‘ CONST(Y,

of type Y, but constructing

allows

construction

each object

components

faulty

objects

(O, k), Z);

O, Z).

a fault-intolerant

potentially-faulty

composition

k), Y)

} U {w},

of type

of type X

of an

of type Y

X.

•l

to be used in a

strongly wait-free, but otherwise fault-intolerant construction of objects of type Y. If the X objects are in fact faulty, the result will be an object of type Y that may be m-faulty, but can then be used in fault-tolerant constructions. Note, that in the following theorem CONST( X, (O, ~), Y) is the usual space complexity of wait-free constructions of Y from X; THEOREM

CONST(X,

4. Read/Write

3.4.

For any f >1, (~,~),

Z)

< CONST(X,

(O, CO),Y)

. CONST(Y,

(f,

~), Z).

Registers

One approach to tolerating faulty read/write registers is to add a software layer between the faulty memory and the user that looks to the user like fault-free memory. In this section, it is shown that this is possible for safe, regular, and atomic read/write registers. That is, we construct safe, regular,

Computing

with Faulty

and atomic

Shared

read/write

Objects

registers

1241

from

a collection

of the corresponding

tives, f of which may be w-faulty. (Henceforth, we “read/write register,” and “atomic read/write register” Atomic registers can be specified using sequential scribed

in Section

2. Safe registers,

as defined

use

“atomic

primiregister,”

interchangeably.) specifications,

by Lamport

as de-

[1986] have behavior

that is sensitive to concurrency. Hence, we directly specify the legal runs of a safe register (over a data domain V): A safe register may be accessed by two processes. One performs read operations, which return values from V, and the other performs write operations, which take values as arguments. A sequence of invocations and responses of these operations is safe if and only if the invocations and responses by each process alternate appropriately, and each read operation that is not concurrent with a write operation returns the value written by the last write operation that precedes the read, value, if there is no such write. (Hence, read operations with

writes

may return

Regular

registers

or returns the initial that are concurrent

any value.)

are also sensitive

to concurrency,

but are more

constrained

than safe registers. They are defined as safe registers, except that read operations must return a value of a write that is either concurrent with the read, or k the last write (or initial value) that precedes the read [Lamport 1986]. A pair of read and write procedures is safe (or regular) if all concurrent executions give rise to safe (or regular) sequences of invocations and responses. The notion of a fault in a safe or regular register is introduced, in the spirit of that

of atomic

A sequence

objects,

as an atomic

of invocations

m is the smallest a safe register

number from

write

and responses such that

the

original

operation

(by some outside

for a safe register

it is possible sequence

to produce

by adding

THEOREM

2~ +

value returns

For

the

upper and

any

m

may

be

infinite.)

multi-reader\

2f + 1 similar registers, from 2f such registers,

singlef of which f of which

CO),safe) < 2f + 1.

1 registers,

(~ +

of

write

(f, 1),safe) > 2f.

—CONST(safe, PROOF.

if,

a legal sequence

4.1

—CONST(safe,(f,

the

agent).

m~azdty

m instantaneous

operations (strings of the form W(ui)ack). (Note that Faults of regular registers are defined analogously. Next we prove that, one reliable, strongly wait-free, writer safe register can be constructed from may be ~-faulty, but cannot be constructed may be l-faulty.

is called

1 with value.

the

bound,

a simple

the

reader

same

value),

Each

high-level

construction

reads then operation

them. it

If

works: the

returns requires

the

reader

that

writer

sees

value;

writes

a majority

otherwise

2f + 1 operations

it

on

primitive registers. For the lower bound, assume to the contraiy that there is a solution using 2f registers. Without loss of generality, let the initial value in the high-level object be O. Let the initial respectively.

values

of the primitive

registers

rl, ...,

rzf

Let a be a run in which process pl runs alone and performs 1 to the high-level register. Let the final values of the registers be U1, . . . . uz~, respectively.

be z-q, ..., uz~, a single write of rl,. ... r2f in a

Y. AFEK ET AL.

1242 Now

consider

a run

~

in

which

runs

pz

alone

and

performs

a single

high-level

read operation, after ~ faults have changed the initial values of Uzf, respectively. Hence, pz finds the registers holding r-2f touf+ l,..., rf+ l>...> values Ul, ..., uf, uf+l, ..., Uzf. Since pl has taken no steps in E, the read by pz must return O. But ~ is indistinguishable by pz from a run y that extends the run a in which pl wrote 1, by ~ faults that change the values of rl,. ... rf followed by a single high-level read operation back to Ul,. ... Uf, respectively, by p2. return

Hence, the read by pz in y (and the 1 written by p ~, a contradiction,

In addition

to bounding

the number

the above proof also shows that necessary in order to implement

so in the ❑

of primitive

at least high-level

indistinguishable

objects,

~)

must

the last argument

2f + 1 primitive read or write

operations operations,

in are thus

matching the upper bound in holds for constructions that are Since the upper bound holds the lower bound holds for one

the theorem. Note that the lower bound also not strongly wait-free. for an infinite number of faults per object, and fault per object, the first part of the corollary

below

is a consequence

3.1,

follows. which

The second implies

CONST(safe, COROLLARY

part

CONST(,safe,

what

of the first

1), safe)

1. previous

in

TEST

& SET.

a system new

subsection,

❑

to be wait-free.

a mechanism This

as each of the two processes test& set registers before

with that

mechanism which

The at

construction most

two

selects

at

is used selects

in the

last

processes. most

two

out

as a “doorway”

exactly

one

sec-

In

winner

this of

to out

n the of

two.9

‘The mechanism introduced in this subsection is very similar in its behavior to a 2-exclusion algorithm. (2-exclusion [Afek et al. 1994; Dolev et al. 1988; Fischer et al. 1979, 1985a] is a generalization of mutual exclusion in which up to two processes may be in the critical section at the same time, but no more.) The difference between 2-exclusion and the mechanism presented here is as follows. In a 2-exclusion algorithm, if two or more processes attempt to enter their critical sections, then eventually at least two processes, out of n that t~, enter the critical section. In contrast, in the “doorway” construct, if two or more processes try to enter then at most two and at least one enter the critical section. (In this application, the critical section is the seven object construction from the previous subsection.)

Computing

with Faulty

Shared

Objects

1247

close

(reliable

r I I I I I I I I I I I

----

atomic bit)

..-.

----

•1

----

1

----—

—--

----

1 I 1 I

Doorway

F1 At least one process and at most processes

win

I 1

lose

I 1 1 I I

two

F2

win.

1

----------

L -r -I I 1 I I I I I I I I Iv I 1 I I

------

-----

I

win ‘ose lose @----- ----- ----------

J 1 1 I

----

A

1

FIG. 3.

Schematic of single-use set for n processes.

test &

win

lose P

1“

u.

I

1 I

1

win

lose

win

8 1 I 1

I t

~o

(1)

(1)1

w

n

I win I I Two I L ------

i t 1 I

lose

processes

single-use

------

lose

I I I I

test-and-set.

------

------

--

J

If there are no faults, then two primitive test& set registers can be used to ensure that at most two processes pass the doorway, by having each p-test& set the two registers, and pass if successful in at least one. However, if one of the two registers is faulty, then at least one process and as many as n may enter (because the faulty one may let everybody require the processes to go through two

win). To overcome this problem, we such gateways. That is, we maintain

two pairs, FI and Fz, of primitive test & set registers (see function doorway in Figure 4). To pass the doorway, each process must successfully p-test&set one register in each of the two pairs, otherwise the process returns 1 for its test & set operation. (In the OR operations in the function doorway (Lines 1 and 2) only one successful clause needs to be executed. That is, if the first clause is true, then the second need not be executed—however, the construction is correct with either interpretation.) The doorway implementation is given in Figure 4.1° The doorway guarantees that at most two processes enter the two process construction; however, it could be that one process is blocked in the doorway, while a faulty primitive allows a later arriving process to pass it into the two process construction (left 10In the code,

the execution

of a return

command

terminates

(exits) the operation.

Y. AFEK ET AL.

1248 shared

F’l [1 ..2], F2 [1, ,2]: T&S close:

function

reliable

registers,

atomic

initially

O

on {O, 1}, initially

O

s-testtset

1:

if close = 1 then

2:

close

3:

if doorrray

return(1)

1

:=

%return

O (win)

%exit

if construct

fl %close

then

return

(singlemse-2_process-t

esttset)

the construct

else

or 1 (lose) is closed

and continue

return(1)

fi %from

Figure

2

end-function

function

dooruay

%return

1:

if (p-testkset(F1[l])

2:

thenif

(p-testtset(F2[l]) then

3:

= O) OR (p-test&

return

return

= O) OR

(true)

set(F1[2])

= O) at least one k F1 and one in F2: pass

%won

(false)

or false (lost)

%see note o below.

(p-tasttset(F2[2])

fi fi

true (passed)

= O)

%won

at neither

F1 nor at F2: do not pass

endlunction “In the OR operations only one su~~~~~ful clause n~eds to be executed. That is, if the first clause is true, then the second need not be ~X~Cuted—hoW~ver, the construction is correct with either interpretation.)

—.— .. FI.% 4. Single-use n-process test& set using eleven test& a reliab~~, atomic read/write bit (see also Figure 2).

set’s, one of which may be faulty,

and

as an exercise). This enables a scenario in which one process loses, and only after it exits, the eventual winner starts. This violates the linearizability requirement (operations must be serialized within the time interval in which they are active), and can be resolved add a reliable multi-writer/multi-reader

in several ways. The simplest solution is to atomic bit, close as is suggested in

Afek et al. [1992a] (which can be constructed from unreliable primitive objects as described in Section 4). Any process first reads the close bit, and if close = 1

it returns doorway. start

1 and This

exits;

ensures

otherwise that

it

once

sets

close

a process

to

1 and

continues

has lost, no other

into

process

the

can later

and win.

THEOREM 5.2.1. There is a strongly wait-free, n-process, single-use, test&set construction from 11 test &set registers, one of which may be ~-faul@, and a single reliable read/write bit (Figures 2, 3, and 4). PROOF. the

It is argued

doorway

exactly We

one serialize

and

enter

of these the

above the

that

at least

single-use,

will

be the

winner

with

one

and

two-process

at most test

two

& set.

processes

leave

By Theorem

5.1.1,

winner. it’s

high-level

request,

and

the

losers

with

their

read of close = O in line 1 of s-test & set must precede either a loser’s read of close = 1 in line 1 of s-test & set, or (if the loser also read close = O in line 1 of s-test& set), the loser’s write of close := 1 in line 2 returns.

The

winner’s

of s-test & set. Regardless, the winner is serialized before all the losers. As in the previous construction, this one is obviously strongly wait-free,

as

each of the n processes performs at most two operations on the primitive read/write register and 11 operations on the primitive test & set registers before terminating, and these are assumed to be wait-free. ❑ 5.3. the

MULTI-USE

previous

n PROCESS TEST&

construction,

resulting

in

SET.

This

a construction

section

extends

of a multi-use,

the

ideas

of

n process

Computing

with Faulty

test & set theorem.

object.

A

in Section

THEOREM

Shared

5.3.1.

series

Objects of

1249

lemmas

lead

to

the

proof

of

the

following

multi-use

test&set

5.3.4. There

is a strongly

wait--ee

n-process

construction from n + 10 test&set registers one of which 3n + 13 reliable atomic registers (Figures 5 and 6).

may be ~-faulty

and

5,3.1 Overview of the Construction. The single-use test & set provides a simple lock: whichever process wins the test & set (reads O while setting to 1) holds the lock. However, without the reset operation the lock cannot be a multiple-user test & set, released, The construction in Figure 5 implements that is, one that can be test& set describes the shared data structure. A simple

way to add a reset

test & set object

would

and

operation

be to simply

reset

reset,

and

the

to a high-level the

picture

in

Figure

implementation

constituent

primitive

6

of a registers.

Unfortunately, in the constructions given above, resetting the primitive registers would interfere with the high-level test & set operations of other processes and could The object tion, if times, two

cause incorrect

behavior

of the high-level

test & set object.

first observation towards an implementation of a multi-use test& set is that in the n process single-use construction of the previous subseca process tries to perform a high-level test& set an arbitrary number of it is guaranteed that it will lose in all its repeated attempts (unlike the

process

single-use

construction

where

a process

may

try

only

one

time).

Given this observation, a simple solution that uses an unbounded number of primitive test & set registers and read/write bits would work as follows: divide the collection of test& set registers into bundles of 11 test& set registers and one bit, and enumerate the bundles 1,2, . . . . Use one atomic read/write register as a pointer to the bundle that is currently in use. Once the high-level test

& set

that

have

is set,

started

it can

the

be

reset

high-level

by simply incrementing test & set operation

lose, since by the above observation winner was already declared.

they are blocked

the pointer. Processes before the reset would in the old copies,

This unbounded solution can be modified to use 1 in registers (one of which might be faulty) and O(n) reliable the following

observation:

All

the bundles,

except

n of them,

where

primitive test& atomic registers, could

a set by

be recycled,

since no process will ever access them again. This solution will require that each process declare which copy it is using and that the resetter find a copy not in use. A still more efficient solution carries the recycling idea one more step. Instead of duplicating and recycling the bundles, we reeycle the primitive test & set registers (see Figure 5). This is possible by requiring each process to declare which primitive test & set register it is using at each step that it is taking (Line 9 of function test& set). The resetter will now search for 11 primitiue registers that are not in use (Lines 7–10 of function reset) and will compose them into a new 1l-piece n-process single-use object. Thus, rather than giving a pointer, it will have now to provide 11 pointers (current[l . . 11] in the code) for the newly composed 1l-piece object. (The read\write bits are replicated and reused similarly.) In this way, a solution is obtained that uses a total of n + 10 primitive test& set registers (one of which might be faulty) and O(n) reliable atomic registers.

1250

Y. AFEK

Protocol

p:

for process

PTST:

shared

array

[l..(n

p-test&set stop:

atomic

tks

with

operations

on {O, 1}

reliable

my.nezt[l..n]: restart[l ..n]: current

+ 10)] of primitive

and p-reset

reliable

c~ose[l ..n]:

local

ET AL.

atomic

%written

on {O, 1}, initially

O

by winner,

Yowritten

%written

reliable atomic on {1,. . . ,rt + 10} reliable atomic on {O, 1}

[O..ll]:

siate[l..ll]:

reliable local

atomic

initially

O

by owner %initially

%initially

on {1, . . . . n + 10}

on {1,0,1},

initially

and read by everybody 1

1,2,...,11

1

inasi[l.. (ri+ 10)]: local on {0,1} function

test&set

1:

restart~]

2! 3:

mv-nezf[p] if stop =

4:

if restart~]

5:

if

7cstart

:= O

1

= 1 then

c/ose[my-nezt~]]

6:

c/ose[my_rzezt~]]

7:

for

8:

while

i = 1 to

(state)

~Y-~=tM

10:

if stop

11:

if restart~]

12:

return(1)

%and

fi

:=

%do

return(1)

state[next(state)]

stop

bit,

restart

bit,

simulation’s

close bit

and continue

L od

= L do

= 1 then

global

the construction

won

bit is used

%personal

7,indicates

:= currcnt[next(~t~te)l

= 1 then

(close)

%check

%close

state[tl

no one already

which

fi

:= 1

9:

that

fi

return(1) = 1 then

11 do

result

by ensuring %indicates

:= CUrmrLf[O] then return(1)

fl

return(1)

:= p-test*

s-t.st~set

which

tks

%check

that

step by step

register

is used next

no one already

won

fi iet(PTS~my.nezt

~]])

od 13:

return(resnlt(state))

end-function function

reset %prevent

1:

stop

2:

for

i = 1 to

n + 10 do

3:

for

i = I to

n do if i #

4:

j:=l

5:

while

6:

curreni[O] j:=l

7: 8,

for

;= 1

inuse~]

inme[i]

p

then

anyone %mark

:= O od irsuse[my-nezt[i]]

:=

%might

1 fi od

% finds

tolldo while ‘inuse~]

10:

which

be corrupted

%is register

= 1 do j := j + 1 od

winning

j used?

:= j

i =1

9:

from registers

currenf[i]

11 PTSRS

= 1 do j := j + 1 od

:= j

od 11:

c/ose[current[O]]

12:

for

i = 1 to

14:

for

i = 1 to

15:

stop := o

p-reset

13:

%re-open

:= O 11 do p-test&

set(PTST(current[i]

(PZ’STlcurreni[i]]) n do

restart

the construct

]); ?freset

od

[i] := 1 od

the tk

% make everyone

registers start

again

end_function function

result:

is incomplete. function

next:

the primitive

returns Otherwise examines register

L if the values it returns the

which

in state

indicate

that

the value

that

is returned

in staie

and

returns

values

the

the

simulation

index

should be accessed next in the simulation

FIG. 5.

Multi-use

test & set.

of s-testkset

by the simulation. (between

of s-test

1 and kset.

11) of

Computing -.. ------1

1251

with Faulty Shared Objects ----------% ,-,,1-r@”t . . . . . .. . 1

I

I

I

In o

I I

my_nexl

PTST

J

--!-”-------1

Clcse

. . . . . . . . . . . . . . . . . . . ;. . . . . ...{

I

t

I

I

\

.......................

:

-----------------z

Pl\

Ya---

slop n

H+

raw

1~

/l--d

H

FIG.6.

Shared data structure

forthemultiple-use

test &set

construction.

To ensure that, while the resetter is composing a new 1 l-piece copy, no process is switching between registers, the resetter will set a multiwrite/multi-reader stop bit telling all other processes to lose and exit (Line 1 of

reset).

Every

process

tests

the

stop

bit

before

each

primitive

operation

(Lines 3 and 10 of test &set). If the bit is set, the process aborts its high-level test & set operation and returns with 1. The stop bit by itself is not sufficient because a process performing a high-level test& set could be suspended during could

the reset period, later

and not observe

access a primitive

the stop bit being

test & set register

that

set. Such a process

it should

not.

Therefore,

another bit, called restart[ p], is associated with each process p to detect if a reset was taking place while it was not checking the stop bit: The restart bit works as follows; when starting a high-level test & set operation each process assigns O to its restart bit (Line 1 of test & set), and before finishing a high-level reset a resetter assigns 1 to the restart bits of all processes (Line 14 of reset). Now, before performing each primitive operation a process checks whether its restart bit is still O (Lines 4 and 11 of test & set). If not, it aborts its high-level test & set operation bit and the restart bits ensures

and returns with that the primitive

1. The combination of the stop objects chosen by the resetter

are not in use, and that they will not be accessed until they are themselves reset and the resetter assigns O to the stop bit. The high-level test & set operation is now performed on the chosen 12 primitives by simulating the s-test& set operation (from Figure 4). The simulation proceeds in iterations, such that in each iteration one primitive test& set

1252

Y. AFEK ET AL.

register is p-test & set (see Figure 5), A local state variable, state, records the history of the s-test& set simulation. This variable contains eleven fields, each corresponding to one of the eleven primitive test & set registers in s-test& set, as in Figure 4. The values of the fields are L , 0 or 1, and are initialized to L , indicating that no calls to the primitive registers have been made. Each call to a primitive test & set register is made to simulate a call in s-test& return value is assigned to the corresponding entry in state. Two

functions

values

in

returns returned

are defined

state

indicate

that

on

state. The

the

simulation

O (or 1) if the values O (or 1). The function

function

result

of s-test&

set, and the

returns

set

L

if the

is incomplete.

It

indicate that the simulation is complete and next examines the values in state and returns

the index (between 1 and 11) of the primitive register that should be called next in the simulation of s-test & set, given the results currently recorded. 5.3.2 Well-Formed Executions. The use of the primitive test & set registers in the construction is designed to work correctly only in constrained environments, in which a p-reset operation is invoked only if a p-test & set has returned

successfully

since

the

last

invocation

of p-reset.

The

constructions’

calls to these primitives depend, in turn, on their behaving properly. This complicates the proof, which must avoid circular reasoning. Accordingly, a system execution is well-formed for a multi-use test& set register (either a faulty primitive or high-level reliable construction) if the number of calls to the reset operation is either equal to or one less than the number of successful returns for calls to the test & set operation, in every prefix of the execution. That is, the sequence of successful returns from test & set and of calls to reset operations strictly alternate, beginning with a successful

return

from

a test & set

operation.

Hence,

in well-formed

execu-

tions, the reset and test & set operations can all be serialized so that successful test & set’s alternate with reset’s, and every unsuccessful test& set is ordered after

a successful

test&

set and before

violate well-formedness for can be serialized arbitrarily. regard for semantics Hence, regardless

the

next

a test & set register, (That is, anywhere

of the operations.) of whether an execution

reset.

In environments

that

calls may not terminate and within their interval, without

is well-formed,

we have defined

a

serialization and can consider calls to primitive test & set registers as single atomic events. Given an execution of the high-level reliable multi-use construction, we reason about this serialization, and argue that the execution is well-formed for each primitive. Then, we conclude that the atomic events have the appropriate semantics (are ordered as described in the preceding paragraph). The

fact

object is Given a will want primitive. of these

that

well-formedness

is maintained

for

each

primitive

test & set

proven by induction on an execution a of the high-level construction. that is well-formed for each primitive, we prefix, a’, of the execution to reason that an extension of the prefix is also well-formed for each To do so, we need to use the semantics of (well-formed) executions primitives. The semantics are in turn encoded in the serialization of

each primitive.

Hence,

we need to relate

serializations

of a and of a’. That

is,

we want to reason, not about the execution, a, but a serialization, ~, of the execution. In the induction step, we have a well-formed prefix a‘ of the execution. Any serialization ~‘ of a’ obeys the semantics of the test & set

Computing

with Faulty

primitives.

To use this

serialization assures

fact

to reason

~ so that

it has

Let

5.3.2,1.

r. Let

1253 about

/3 and

f3’ as a prefix.

of a‘

PROOF.

a

be an execution

a‘ be the longest prejik

a serialization

~

Objects

The

If

a

which

a, we choose technical

of a serialization

then

of a and

of a system

the

lemma

a =

a’

in a’, but serialized

and

we

/3’ be a serialization

By restricting consider

reliable

the

multi-use

until

after

class of serializations

calls

Iinearizability

early,

to

primitive

construction,

is a local

[Herlihy

set

done.

Otherwise,

let

we

subsequence of events in ~ of some operations that are ❑

/3’.)

objects,

as consisting

property

are

of a’. The serialization

as described

test & set

a test&

for r. Then there is

of a.

want is ~‘( ~ – ~ ‘). (That is, append to /3’ the that are not in ~‘. This may move the serialization pending

containing

of a that is well- foimed

is a prefix

is well-formed,

be a serialization

can

hence following

us this is possible:

LEMMA

register,

Shared

of single and

in Lemma

within

we

of

events.

(Because

atomic

Wing

5.3.2.1,

executions 1987],

it

the

suffices

to

consider each primitive separately in Lemma 5.3.2,1.) Moreover, in any wellformed prefix of the execution, these atomic events obey the test& set semantics. 5.3.3 Propetiies of the Constmction. Each call to the reset function begins with an assignment, “stop T= 1” (reset Line 1), and shortly before terminating, assigns 1’s to the restart bits (reset Line 14). Definition function

5.3.3.1.

within

For

a process

a, we say that

subinterval of a that begins with Line 1), and (stop := 1) (reset (reset We

Line 14). say that calls

to

reset

p,

execution

the reset

function

the initial ends with

are

sequential

a,

and

call

to

the

call is actiue for p within

reset the

assignment of the reset function, the assignment (restati[p] = 1) in

an execution

a

if each

call

executes its final assignment to stop (reset Line 15) before the next call begins. For now we state lemmas for executions which are assumed to satisfy this property. Note, in particular, that when calls to reset are sequential, stop = 1 is invariant during reset function calls, up to the final atomic assignment stop := O (reset LEMMA

Line

5.3.3.2.

15). Let

a be an execution

such that the two steps ( restart[p]

in which

:= O) (test&

calls to reset

set

Line

are sequential,

1),and (if

stop = 1)

(test &set Line 3 or 10), area subsequence of a from a single call to test & set by process p. If a call to the reset fimction is active forp at any point between the ( restart[p] := O) step and the (if stop = 1) step, then either stop = 1 in the if test (test & set Line 3 or 10), or restart[p] = 1 in the next step of p which is (if restart[p] = 1) (test&set Line 4 or 11). PROOF.

(restati[p] is still true

Since

a

call

to

the

reset

function

is

active

for

p

between

the

:= O) step and the (if stop = 1) step, at that point stop = 1. If this during the (if stop = 1) step, we are done. So assume that stop = O

during the (if stop = 1) step. It follows that some assignment to stop occurs after the point in which the reset function is active for p, and before the (if stop = 1) step. Since only calls to reset assign to stop, as their last step (reset Line 14), and reset’s are sequential, the reset which is active for p must assign stop := O between the ( restart[ p] := O) step of test& set and if (if stop = 1)

1254

Y. AFEK ET AL.

step. This in turn implies the assignment ( resturt[ p ] := 1) occurs between the ( restart[p] := O) step and the (if stop = 1) step. Only p assigns O to restart[p] = 1) (test & set Line 1); it follows that restart[p] = 1 during the (if restart[p] ❑

step.

LEMMA

Let

5.3.3.3.

a be an execution

in which

calls to reset

are sequential.

(1)

set(P7’ST[indl)), where 1 s irzd s n + 10, Let ( restati[p] ‘= O), (p-test& be a subsequence of a ji-om a single call to test &set, by process p (Function occurs after test & set Lines 1 and 12). No call to p-reset( PTST[ind])

(2)

Let ( restart[p] ‘= O), ( read(close[indl) = 1) be a subsequence of a from a single call to test& set, by process p (Function test C%set Lines 1 and 5). No assignment of O to close[ind] (Function reset Line 11) occurs ajler

(3)

Let ( restart[p] ‘= O), (read close[ind] = O), ( close[indl ‘= 1) be a subsequence of a from a single call to test& set, by process p (Function test & set Lines 1, 5, and 6). No assignment of O to close[ind] occurs ajler

(restart[p]

:= O) and before

( restart[p]

:= O) and before

( restart[p] PROOF.

:= O) and before

We

prove

test & set function restart[ p]

the

first

(p-test&

set(P7’ST[ind])).

(read(close[ind])

(close[ind] case—the

by p includes

= 1).

:= 1). others

the following

are

:= O

my_next [ p ] := ind ( = current [next(

similar.

sequence

state) ])

The

(test & set Line

1)

(test&

9)

set Line

(test & set Line

10)

restart[ p ] = O

(test & set Line

11)

p-test

(test & set Line

12)

(restart[p]

& set(PTST[ind]) lemma,

no call to reset

:= O) and (stop

is active

= O) and before follows.

LEMMA

5.3.3.4.

the

for p at any point

between

= O).

Consider a call to reset that is active after (stop = O) and (reset Line 13) before the end of p-test& set(PTST[ind]). reads my_next[ p ] = ind (reset Line 3) between (stop = set( PTST[ ind])). Hence, it will mark this register in use, p-reset on P7’ST[ind]. It follows that no reset that runs p-reset(PZ’ST[ind]) is lemma

to

stop = o

By the previous

(stop

call

of steps:

the end of p-test

& set(P7’ST[ind]),

that calls p-reset This call to reset O) and (p-test& and will not run active

for

The first

p after

part

of the

❑ Let

a be an execution

In a, no primitive test& set register, other call to register PTST[ind].

in which

PTST[ind],

calls to reset

is p-reset

are sequential.

concurrently

with any

PROOF. Since calls to reset are sequential, any two p-reset operations are sequential. The previous lemma implies that calls to p-test& set(PTST[ind]) cannot be concurrent with calls to p-reset(PTST[ind]). ❑ LEMMA

Then for PTST[ind].

5.3.3.5. Let a be an execution in which calls to reset are sequential. every primitive test & set register, PTST[ind], a is well- foimed for

Computing

with Faulty

PROOF, That

A simple

is, let

induction

p-test

Objects

induction

and the semantics other

within

than

a call

of a, where request

p-reset(

& set(PTST[ind])

for

ITST[

operation

of a, using

Lemma

a’ is well-formed

of the register

the

to

1255

on the prefixes

a’m be a prefix

is any event occurs

Shared

P7’ST[ ind], a reset

irui]),

(Line

of

(Line

12).

By

a ‘n is well-formed

if T

PTST[

ind].

This

13) which

By

5.3.3.4.

for IL’W[ind].

Lemma

request

has just 5.3.3.4,

run

all

a

other

operations on PTST[ind] in a’ terminated before that p-test& set(PTST[ind]) began. Hence, the p-test & set(P7’ST[ind]) returned successfully if the number of other successful p-test & set( PTST[ind]) operations in a‘ is equal to the number of calls to p-reset(PTST[ind]), and returned unsuccessfully if the number of other successful p-test& set(PTST[ irzd]) operations in a’ is one more than the number of calls to p-reset( PTST[ ind]). Hence, the new request does not violate Definition p in

5.3.3.6.

a, the

executes

well-formedness.

reset

o

For any execution preceding

that

of the call to test& set (that exists, we say the call to test&

(1)

5.3.3.7.

Let

a and any call to test& to test&

:= 1) (Line

the step ( re,start[p]

LEMMA

call

set is the

14) most recently

in which

to reset

before

is, Line 1 ( restati[ p] := O)). If set has no preceding reset.

a be an execution

set by process

call

calls to reset

no such

to test&

set, (or was in the initial

no

later

reset

assigns

ind

state of current, to current

until

set by the call

if there is no such after

step reset

are sequential.

Suppose p-test & set(PTST[indl) is called during a call to test& process pin a. Then ind was written to current by the reset preceding and

which

the first

the call

to

reset)

p-test&

set(PTST[ind]). (2)

Suppose close[indl is read during a call to test&set ind was written to current[O] by the reset preceding was in the initial state of current[O], reset assigns ind to current[O] until

by process p in a. Then the call to test&set, (or

if there is no such reset) and no later after the read and subsequent wrRe of

close[ ind]. PROOF. We prove the first test & set includes the following restart[p]

second is analogous. of steps:

:= O

rny-next[ stop =

case—the sequence

p ] := ind ( = current [next(

state)])

o

The

(test & set Line

1)

(test&

9)

set Line

(test &set

Line

11)

p-test

(test &set

Line

12)

& set( PTST[ind])

5.3.3.2, no reset is active for p at any point between the steps := O) and (stop = O). As in the previous lemma, any reset which is

active for p after stop = O either to current, or reads my_next[p] so will

not assign anything

LEMMA

Let P and read/write)

to

(test &set Line 10)

restart[ p ] = O

By Lemma ( restart[p]

call

sees my _next[ p ] = ind, and will not write iruf after the call to p-test& set(PTST[ind]), and

to current

until

after

this call.

❑

Let a be an execution in which calls to reset are sequential. Q be the corresponding sets of primitive registers (test&set and accessed in two different calls to test& set. Either the two calls to

5.3.3.8.

Y. AFEK ET AL.

1256

test & set have the same preceding reset, or every primitive register (test&set or read/write) in P n Q is either p-reset or assigned O between the jirst call and the second. PROOF.

x E P n Q

Suppose

different

preceding

and operations.

reset

that the two We assume

test & set register,

x = PTST[

analogous.

the two calls to p-test

Denote

occurring

before

operation in a:

of process

li) 2i)

@ in

ind],—the

a, where j. We

proof

the

following

ind])

two

stop := 1 my_next[i]

i, and

of operations

(reset

p-reset(PTST[indl)

4j) 5j)

restart[i] (p-test&

f= 1 set(PTS7’[ind]))

the call to test& set that includes operation that precedes the call

n. Suppose

it came

Line

1)

n, and steps 1,–4, to test & set that

contains @ (there must be such by the assumption). To establish the lemma, it suffices to show that line 3j, (p-reset(PTS7’[ after

11) 12)

(reset Line 3) (reset Line 13) (reset Line 14) (test & set Line 12)

= @

Here, li–5i are steps from are steps from the reset

comes

n

@ is an

(test &set Line (test & set Line

= m

+ ind

3j)

is

(test &set Line 1) (test & set Line 9) (test & set Line 10)

= O set(PTST[ind]))

lj)

primitives

sequences

o

2j)

have is a

by n and ~, with

of process

restart[i] := O rny_next[i] := ind

3i) stop = 4i) restart[i] .5i) (p-test&

to test& set the primitive

for the read/write

& set(PTS7’[

m is an operation

have

calls that

before

n (5i).

Since

line

ind])),

1~, ( restart[i

] := O),

precedes the assignment to restart[i] in line 4j, and the reads in lines 3i and 4i return O, Lemma 5.3.3.2 implies that line lj is ordered after line 3i. Since in turn line 3j is ordered before return ind, a contradiction. Definition

5.3.3.9.

Given

line ❑

5i, the read

an execution

of my_next[i]

in line

a of the construction,

any set of calls

to test& set which have the same preceding call to reset (or which such preceding call) are called a collection of calls to test & set. LEMMA

Let

5.3.3.10.

the following

a be an execution

2j should

of the construction.

all have no

Then

a satisfies

propetiies:

(1)

a is well$ormed for the reliable high-level multi-use test&set object. Moreover, each reset operation jinishes its jinal assignment to the stop bit before the next successyld test & set operation returns.

(2) (3)

calls to reset are sequential, the reads and writes of the close bits and calls to p-test & set made by each collection of calls to test& set are a prejik of a run of the w-test & set construction. PROOF. The

Consider lemma, and

Note

proof

third

first

that

is by induction a finite

and

on

execution

m is the

invariant

the

next are

step

first

invariant

the

length

CZ’T,

where

by the

maintained

implies of the a’

construction. in

CY’V.

the

second.

execution,

satisfies We

with the

must

a trivial

conditions show

that

basis. of the

the first

with Faulty

—Suppose

the next step is a call to reset.

the reliable required

Shared

1257

Computing

high-level

to preserve

Objects

multi-use

By induction,

test & set object.

this property.

The

first

a’

is well-formed

Thus,

invariant

for

the environment

is

follows,

and the third

The

invariant

is immediate. —Suppose

the next

immediate. implies

step is a step of a primitive

By induction,

that

calls to reset

a ‘T is well-formed

register.

are sequential

for the primitive,

and Lemma

the sequence of steps of the primitive in a’m obey Lemmas 5.3.3.7 and 5.3.3.8 imply the second invariant. —Suppose process

assignment

to

each the

reset.)

stop,

since

By induction, collection

call. Hence, from

test&

is

5.3.3.5

5.3.2.1 implies set semantics.

the next step is the successful return, T, of a call to test&set by p. We must show there is a reset call which runs through its final

test &set. ant,

first

in CY’T. Lemma

last

of calls

such

collection.

(That

successful of su-test

to test&

the two calls to test&

same

Let

the

the properties

return,

set can have

set, which is, they

at most

return

have

rm and r+ be the corresponding

~, of a call

to

& set and the last invariwith

different

calls to reset

one

successful

~ and ~, are not preceding preceding

calls

to

T and @

(That is, rn and r+ consist of the subsequences of events of a for the two calls.) Since calls to reset are sequential, @ precedes w, and well-formedness holds for calls to reset, it follows that the beginning of rfi follows @ Since m- is the successful return of a call to test& set, the third invariant implies this call simulated a successful call to su-test & set, and hence must read stop = O at some point. By Lemma 5.3.3.2, rm is not active for p during some interval

beginning

that

assignment

the final

5.3.4.

Proof

construction. complete.

with

of Theorem Without

(Recall

the first

step of the call to test&

to stop in rn must 5.3.1.

loss

of

Let

a

generality,

the construction

precede

be an arbitrary, assume

is strongly

n.

all

wait-free,

set. It follows

❑ finite

run

operations

of

of the a

are

if calls to primitives

strongly wait-free. By Lemmas 5.3.3.10 and 5.3.3.5, a is well-formed primitive, and so calls to primitives are strongly wait-free.)

for

are each

We must show that each test& set and reset operation can be appropriately serialized. We proceed as follows. Each successful test & set operation is serialized

at the point

obviously

within

ment

stop

to

test &set

of its assignment

its interval. (Line

operation

15), that

Each or

just

returns

to re.start(test

reset

is serialized

before next,

the

& set Line either

serialization

whichever

is earlier.

1), which

at its final of

the

Lemmas

is

assign-

successful 5.3.3.10

and 5.3.3.2 imply this is within the interval of the reset, indeed, no earlier than its first assignment to the array restart (Line 14 of the reset function). By Lemma 5.3.3.10, the sequence of serializations is of the appropriate form: “successful test & set, reset, successful test& set, reset, . . . .” It remains to show that given this partial serialization, the unsuccessful test & set operations can all be serialized, within their interval, after a successful test & set and before the next reset. We consider the unsuccessful calls to test&

set depending

upon

how

they

return. It suffices to show that these calls cannot begin and end between the serialization of a reset and before the serialization of the next successful test & set. Suppose an unsuccessful call returns because of losing the test on stop (Line 3 or 10 of test& set). We argue that the call can be serialized at the point of

1258

Y. AFEK ET AL.

this read of stop = 1. A call to reset is in progress when that call to reset is serialized at the point where it assigns

this read occurs. If stop := O (Line 15),

then the read of stop = 1 falls between the previous successful return to test & set and the serialization of the reset. Alternatively, suppose the call to reset was serialized just before the serialization of the next successful call to test & set. comes

Then

either

have begun,

these

before

two

or after

and the read

serialization both.

points

By Lemma

are

adjacent,

5.3.3.10,

of stop = 1 is a correct

and

the next

serialization

the

reset

read cannot

point.

Suppose an unsuccessful call returns because of losing the test on resku-t[p] (Line 4 or 11 of test& set). Then a call to reset assigns to restmt[p] (Line 14) after the test & set begins and before the read of restart. is serialized during the unsuccessful call, we can serialize

If the reset operation the latter just before

the reset. If it is serialized after the unsuccessful call, then since the reset does not begin until after the previous successful call to test & set, there is an interval during the unsuccessful call, between the previous successful call to test & set and the reset. Last, suppose the reset is serialized before the unsuccessful successful

call begins.

Then

it has been serialized

test & set was also serialized

before

early because

the unsuccessful

its assignment to restart). The return reset’s final assignment to stop (Line

of that successful test& 15), which in turn must

unsuccessful

call

the

Hence,

is a period

before

there

the next

to

test & set, after

and

next

the serialization

reset

must

the following call begins

begin

of the successful

(at

set follows come after even

test&

the the later.

set and

reset.

Finally, consider the unsuccessful calls to test& set which lost are unsuccessful in the simulated calls to su-test & set. By Lemma the semantics of the single-use test & set, this can happen only successful simulated call to su-test & set, by a call to test & set collection. Moreover, this simulated call to su-test & set, (by the

because they 5.3.3.10 and if there is a in the same serialization

of su-test

ends,

& set),

can be serialized call. That

begins

before

before

the serialization

successful

simulated

the unsuccessful point

call to su-test

simulated

call

of the simulated

& set is within

since

it

unsuccessful

a successful

call to

test & set which has been serialized even earlier, when it assigns to restart, before its simulation started. It remains to argue that no reset is serialized after the serialization point of the successful test & set and before the beginning of the unsuccessful call to test & set. Because the unsuccessful test& set progressed to run at least one step of the simulation, Lemma 5.3.3.2 implies no reset is active for this process during an interval of its operation. Hence, any reset unsuccessful call must be the call to reset which precedes

serialized before the unsuccessful

the call,

or is even earlier. But these calls to reset are also serialized before the successful call to test & set. Note that each high-level test & set operation requires at most a constant number of operations on the primitive shared objects, and each high-level reset operation requires at most O(n) operations on the primitive shared objects. ❑

6. RMW

Registers and Consensus

To complete our investigation of read-modify-write register. Herlihy

memory fault-tolerance we turn to the read-modify[1991] showed that reliable

Computing

with Faulty

TABLE I. Number

Shared

and type

# RMW regs. used

f ~-faulty

(f+

f ~-faulty

f+l 2f+l 2m+l

f ~-faulty m-faults

(RMW)

used to solve Briefly,

o

shared

consensus,

registers

enable

based on the value read, to write is accessed register, decision

by n processes,

decides The

requirements

value u such that

objects

is itself

a process

registers,

an input

value,

if it writes

output

of the

consensus

problem

and tolerate

read

Alternatively,

process

constructions

object can be

a register

and,

a consensus

are that

eventually

object

c { – 1,+ 1}.A

inputP

to its write-once

output

there

decides

exists

a

on v and

of fault-tolerant

use a variety

different

shared

registers

universal.

output

These

RMW

to atomically

a new value.

each non-faulty

any other

because

value of some process. several different constructions

are presented.

and atomic these

which

Afek et al. [1992b] 36.1 $6.1 $6.2

primitive:

follows

each with

on a value

that u is the input In this section,

memory

This

Algorithm

4 2 4+210gf 4

o o

RMW,

n-process

# bits in RMW reg.’s

4f2i-6f+2

using

RMW

# atomic bits used

1)2

is a universal

can be simulated

process

1259

I_JSEDTO REACH CONSENSUSIN THE PRESENCE OF FAULTS NUMBEROFREGISTERS

of faults

write

Objects

consensus

of read-modify-write

types of faults.

Table

I summarizes

constructions.

6.1.

UNBOUNDED

struction

FAILURES

(Theorem

registers

and

co-faulty,

We

6,1.1)

a quadratic then

and

the

support

ternary

these

~ +

1 RMW

registers

2f + 1 registers THEOREM

struction atomic

registers are

(RMW

atomic

2~ +

start

by presenting

from

bits,

fj. 1.5, how

into

~ +

such

that

to encode

1, O(log

~)-bit

~ of all

of the

wide,

RMW

lower

bounds,

(Theorem

always

necessary,

and

(Theorem

6.1.5)

any f >0,

bits, at most f of which The construction

atomic

registers. 6.1.4)

that

that

a total

wait-free

consensus

con-

in Figure

7. This figure

uses the notation

and end of atomic,

exclusive

access

to shared RMW register r. That is, it is assumed that a process can lock one register at a time, that a process does not fail between pairs of lock(r)

The

construction

uses

of

+ 1)(f + 1) = 4f 2 + 6f + 2

and 2(2f

the beginning

unlock statements, and that tion eventually executes it,

be bits

are ~-faul~. is shown

to mark

may

are necessa~.

there is a strongly

RMWregisters

RMW

them

two

read\write)

a con-

1 ternary

with

andior

For

We

object

bound

+ 1 terna~

and unlock

of

in Theorem

upper

6.1.1.

usingf

PROOF. lock(r)

RMW

a consensus

number

show

We

PER REGISTER.

of

any non-faulty

two

ideas,

The

process first

that

is that

reaches

only and

a lock instruc-

consensus

is trivially

achieved with one non-faulty, ternary RMW register. The register is initially set to O, the first process to do a RMW sets the register to its input value, and all processes agree on the value written. The second idea is to use validation bits to ensure that an invalid value cannot be chosen. Before proposing a value in a RMW register, a process establishes it as valid by writing to many validity bits. Before adopting a value from a RMW register a process confirms its validity by

1260

Y. AFEK ET AL,

Protocol

for

p,

process

shared

r: array[l

..}+

shared

v: array

[l..j+l][l..2j

input~

G {-1,

I] of ternary

1:

z :=

RMW

+1][–1,

decidep ,z: ranges over {-1,0,

local

+1}: %initialfy

registers

I] of atomic

+1},

initiafly

2:

fori=ltoj+ldo for

4:

lock(r(i))

%Set

j = I to 2j+I

5:

if r(i)

6:

tmp

7:

do

= O then

u(i, j, z) := 1 od

r(i)

= O

each bit = O

o

inputp

3:

each register

% initially

bi~

%Establ~h

validity

of cu~~t

value

%Vote

:= z fi

initial

value

of z

at this stage

for x at this stage

:= r(i)

unlock

8:

count

9:

for

%Adopt

:= O

j = 1 to 2f+l do if u(i, j,tmp) if counf > \+l then z := tmp fi

10:

= 1 then

~~~nt

:=

this stage’s co~nt

vote

if it is valid

+ 1 H od

ll:od 12: dec:dep

:= z

FIG. 7. Consensus in the presence of ~, ~-faulty registers. registers and 4~ 2 + 6~ + 2 atomic read/write bits.

reading

the validity

process

to choose

The

two ideas

bits. Faults an invalid

in a RMW

The construction

register

therefore

uses ~ + 1 RMW

cannot

convince

a

value.

are combined

as follows:

The

construction

proceeds

in ~ + 1

stages. Each stage is assigned a RMW register and two sets of 2f + 1 atomic bits. One set of bits is called the ualidity bits for value + 1 and the other set the validity bits for value – 1. In each stage the processes validate their current value using the appropriate validity bits, attempt to write their value in a RMW register

reserved

for this stage, and adopt

the value

written

in this register

if it

is valid. More

specifically,

2 f + 1 stage-i bits are denoted the

ith

RMW

the value

validity u(i, register

x is marked

bits for value *, x).) The (Lines

straightforward

generalization

LEMMA

on valid

values, RMW

6.1.2. (VALIDIm).

in stage

attempt

to write

4 to 7). A process

the register was O prior to the RMW. process checking its validity at stage validity bits for value x are set to bounded set of input value, and approp~iate

valid

of this

i by ensuring

x are set to 1. (In Figure a value writes

7, Line

that

all

3, these

at stage i is made its value

to

if and oniy

if

The value x is considered valid by a i if and only if at least f + 1 stage-i 1 (Lines 8 to 10). (Note there is a construction,

by adding additional values.) In tlae protocol

to solve

consensus

validity .- registers

of Figure

7, processes

for for

decide

any each

only

values.

PROOF. A stronger statement is proved: the value of the local variable x is always valid after Line 1 has been executed. For each process the variable x is modified only in Line 1 and possibly ~ + 1 times at Line 10. We inductively prove that no variable x can be changed to an invalid value. Suppose all changes numbered less than i were to valid values

(the assignment in Line 1 for some process serves as base shown that the change numbered i is to a valid value.

cases).

Next

it is

Computing

with Faulty

If change which

Shared

number

i is an execution

is by definition

change loop These

valid

9 read

at least

bits are initially

of Line

O (meaning

these

bits

(at Line

3) when

6.1.3.

LEMMA

1, then holds.

10 then

f + 1 validity

it must

correspond

Every

to their

process

the

the for of x.

and at least one is a

only be nonzero (valid) if But processes only write

values of x were valid and value of x and the new, adopted

(AGREEMENT).

value

hand,

be the case that

for all stages)

nonfaulty bit could at an earlier time.

they

it is to an input

If, on the other

bits set to 1 for the new value

not valid

field of a nonfaulty register. This it were written by some process induction, the earlier corresponds to a valid

of Line

and the induction

is due to an execution

at Line

1261

Objects

that

current

value

of

x. By

thus the nonfaulty value of x is valid. writes

a value

bit ❑

in decideP

writes the same value. PROOF. register process

In

any execution,

there

is at least

which is written is nonfaulty. to attempt a read-modify-write

call it Xg. Furthermore, g. Since the nonfaulty nonfaulty validity bits

one

stage

in which

the

RMW

Call the first such stage g. The first at g will succeed in writing its value,

it will have first established the validity of x~ for stage validity bits are set to 1 but never set to O, the f + 1 at stage g will always show Xg as valid for stage g. All

subsequent processes will thus be able to successfully confirm the validity of X8 for stage g, Since register g is nonfaulty all subsequent processes will also agree that its value is Xg. Thus, all processes will adopt the value X8 at stage g. Using stage g as a base case, we now show that if all processes value Xg at the end of stage i > g, then all processes agree on value end of stage i + 1. Since

all processes

agree

on value

agree on Xg at the

Xg at the end of stage i

they will all set validity bits for the value x~ at stage i + 1. None will set of the value of validity bits for the value –X8 at stage i + 1. Thus, regardless the i + 1st RMW Note

that

register

each process

performs

can switch

its value

concludes

the proof

Complementing

of Theorem Theorem

the number

of RMW

6.1.1.

operations

on the

operations. Hence, the conwith Lemmas 6.1.2 and 6.1.3,

❑

6.1.1, we have registers

❑

to –xg.

4 f 2 + 6f + 2 primitive

RMW registers, and f + 1 primitive is strongly wait-free. This fact, together

read/write struction

that

no process

a simple

used in the proof

lower

bound

of Theorem

that

shows

6.1.1, f + 1,

is optimal. THEOREM

tion

which

6.1.4. tolerates

and any number PROOF.

w-faulty

There does not exist a strongly f ~-faul~

of read/wn”te

wait-free

registers and uses fewer

consensus

than f + 1 RMW

construcregisters

registers.

Since a fault could erase the effect RMW register, this object can be trivially

of every operation in an implemented by an object

that always returns the initial value and uses no shared primitive objects. Thus, f ~-faulty RMW registers and any number of reliable read\write registers are no more powerful for solving consensus than the read/write registers alone. The theorem follows from the impossibility of consensus from read/write registers [Herlihy 1991; Loui and Abu-Amara 19871. ❑ If only

RMW

registers

are used, we have the following

THEOREM 6.1.5. If only primitive consensus, and if as many as f of

tight

bound:

~-faulty RMW registers are used to solve the registers may be faulty, then 2f -t 1

Y. AFEK ET AL.

1262 (logarithmic-size)

IWIW

registers are both necessary and suficient:

CONST(RMW, PROOF. corolla~

The

lower

modifying tion

bound,

of a stronger

lower

the next subsection. The upper bound,

CONST(H,

bound

(for

CONST(M,

the construction

encodes

(~, ~), consensus)

= 2f

(~, ~), consensus)

a weaker

fault

model),

(~, GO),consensus)

used in the proof

the ~ + 1 ternary

IIMW

+ 1.

and the

is

a

6.2.2 in

s 2~ + 1, is proved

of Theorem

registers

> z~,

Lemma

by

6,1.1. The modifica0(~

2, validity

bits into 2~ + 1 RMW registers of size 4 + (2[log f I)-bits. Each of the 2f + 1 RMW registers in the new construction

atomic

is divided

into

three fields: decide, uplus, and vminus. The f + 1 ternary RMW registers are simulated by the decide field of the first f + 1 new RMW registers. This is possible since RMW registers can be read-modify-written in one field without affecting the values in any other field. The encoding of the validity bits relies on the following two observations about

the construction.

was established

rzeuer be invalidated stage separately, which it is valid. repeatedly

First,

a value is valid

in stage k if and only if its validi~

in all stages i, i < k, and second,

record

in stage k. Thus,

instead

once valid in stage k a value will

of recording

the validation

for each

it is enough to record for each value the maximum stage in Of course, since f of the RMW registers could be faulty, we this

information

in all the 2f + 1 RMW

registers.

Each

of

the fields uplus and vminus can encode the values O through f + 1 and is initially O. The setting of bit u(i, j, 1) in the previous construction is now performed by setting the uplus field of register j to be equal to i if it is found to be less than i. The bit u(i, j, 1) is read as 1 if the value of the vplus field of register j is greater than or equal to i and as O otherwise. The bit u(i, j, – 1) is handled In RMW 6.2.

in the same way.

this

construction,

BOUNDED

each

FAILURES

per register, we are able required for consensus: THEOREM

process

now

performs

4f2

+ 7f + 3 primitive

❑

operations.

PER REGISTER.

to fully

characterize

For

a bounded

the number

number

of faults

of RMW

registers

6.2.1

—CONST(RiMW, —CONST(RJIW,

( f, 1),consensus) = 2f + 1. m, consensus) = 2m + 1.

PROOF. Lemma 6.2.2 (below) implies the lower bound: CONST(RMW, (~, 1), consensus) > 2~ while Lemma 6.2.3 implies the upper bound: CONST(RMW, m, consensus) s 2m + 1. Together with Theorem 3.1,

these imply

the theorem.

LEMMA 6.2.2. 2 f + 1 primitive [n/21

❑

There is no n-process consensus construction using fewer than objects, at most f of which may be l-faulty, and which survives

process failures.

PROOF. The proof is by contradiction. Assume to the contrary a solution using 2f primitive objects. Let the initial internal registers rl, ..., rz~, be Ul, ..., u2f, respectively.

that there is states of the

Computing The

with Faulty

Shared

process-failure

the processes, their

requires

that

in any run

in which

only

half

pl, ...,

on that value,

as they do not know

process.

processes

assumption

1263

Pr. /21 take StePS1 they must eventually decide. ~SO> if are identical, the validity condition requires that they must decide

inputs

of some

Objects

run

Thus,

alone

there

with

whether

the other

is a failure-free,

input

decision

finite

+ 1 and decide

on

value

run

+1.

a

Let

is an input

where

half

the final

the

internal

states of the primitives rl, ..., rzf, inabeul, ..., u ~~, respectively. NOW consider a run in which the other processes, p,. /21+ ~, ..., P. take stePs, run

alone

with

to Uf+l,..., this

run

decided

input

– 1, but find the primitives initially are m states This is consistent with a run in which pl,..., p,. ,2, Vf+l> . ..7 Vzf. no steps and ~ faults have changed the internal states of r + ~,..., r2~ Vzf, respectively. Hence, pf~ /Z1 + 1, ..., P. must all deci d e – 1. But

Ul, ..., Uf, have taken

is also +1,

consistent

and ~ faults

U*, ..., u~, respectwely. diction. ❑ Note

that

whether

Hence,

this lower

wait-freedom,

others

construction

depends

requires take

each

steps.

need only be correct

Moreover, when

the bound the

requirement LEMMA

object

that

P.

tion using 2m + 1 faulty failures is at most m:

RMW

CONST(RMW,

all decide

progress

make

in runs in which

at least

process

may

be strongly

(It

a contra-

assumption

1n/21

for

of

that

the

processes

are

1n/2]

does

than

regardless

we mean

on the behavior

is exceeded.

– 1 other

of the construcnot

depend

on

a

wait-free.)

there is a strongly registers,

wait

all-eadY

., rf back to +1)

progress

failures,

bound

the construction

must

to

have

states of rl,..

on a weaker

process

also does not depend fault

pl, . . . 7Ppl/21

process

(A

For any m >0,

6.2.3.

in which the internal

By [n/21

guaranteed to make progress, processes to take steps.) tion

a run

p,. ,21+ 1,...,

bound

which

the

with

have changed

provided

m, consensus)

wait-free

consensus

the total

number

construcof memoy

s 2rn + 1.

PROOF. The proof of this lemma is based on the construction in Figure 8, which solves strongly wait-free consensus using 2 m + 1 faulty RMW registers, provided this

the total

construction

number is

of memory

a consequence

failures

is at most

m. The correctness

of

series

lemmas

the

of

in

the

of next

subsection. 6.3.

PROOF

OF THE

UPPER

BOUND.

In Figure

8, the input

to process

p is

initially assigned to register must be assigned to register lock(r) and unlock to mark

inputP, local to p, and an appropriate output value decideP. As in Figure 7, Figure 8 uses the notation the beginning and end of atomic, exclusive access

to shared RMW register r. In the construction, there

are 2m

+ 1 shared

registers

r[i],

1
0. the

changes

construction. Sk to Sk+ ~.

to these That

is, let

parameters m be

a step

that

can

result

in a legal

run

from that

Y. AFEK ET AL,

1268 (1)

That is, m is r[i ].uote := u, where the value of m is a memoy fault. r[i].uote is –u in Sk. Note that ZS~+l = ZS~ and RF~fl = RF~ – 1. Here, there are four key subcases.

l~k+ll = 2, and A~+l = A~ + 4. IZ,I < 12,+,1.Then 12 ~+11 = 1~~1 + 2, and O < 1~~1 = l~~+ll. Then lZk+ll = Ilikl = 1, and 1~~+11 < llZkl.Then l~k+ll = lZkl – 2, and A~+l

(a)

Sk = O. Then

(b)

O
1. Moreover, by Lemma 6.3.6, a = al (r[i - 1].uote = O; r[i – 1] = u’)a,m, where (r[i – 1].uote = O; r[i - 1] = u’) is the successful read-modify-write to r[i – 1].vofe and there are no successful read-modify-write operations in az. Let Sj be the state at the beginning of CYz,just after the successful read-modi~-write to r[i – l],uote. By induction, Aj > 0. Also by Lemma 6.3.6, CYz contains the i – 1 reads in collect by p that precedes w. In addition, since none of the operations in are successful read-modify-writes, by the analysis above A j < “”” < A~. Next, consider the sequence of values Zj, . . . . ~~. observe that since the is legal, and because there is no successful read-modify-write operation in the value of Z never changes by only 1 in az. We examine cases depending the sign of the sum collected by p, and show that a fault described in case lb, or lC must occur in az, implying Ak >3, and so Ak+ ~ >1, The

collect

—Some

Z,

sums to a valid in Zj,.

value.

There

are two subcases.

. ., Zk has the same sign as the sum collected.

the az run az, on la,

Computing

is

with Faulty

Shared

1269

Objects

Since lZ~+ ~I = 1~~ I – 1, it must be that the sign of the value written different than the sign of ~~. Thus, the sign of 2X ( = sign

collected and

= sign of

2,+ ~ in this

u) is different

sequence

than

such that

the sign of

either

Z~. Hence,

~, = O and

by n-(~~) of sum

there

exist

IX,+ 1I = 2, or

Z,

IX,l =

l~.+ll = land ~,= –2,+1. Then either A,+l = A,+ 4 >50r A,+l = A,+2 > 3, respectively. (This must be due to a fault of type lb or lc, above.) Thus, A~ z A,+l —No

>3

and hence,

>1.

E in ~j, . . . . ~~ has the same sign as the sum collected.

By definition, the

A,~l

the sign of the sum collected

i – 1 registers

read

in the collect.

must be the sign of a majority

Thus,

a majority

of the

each have the same sign as the COIIect and as u at some point

of

i – 1 registers in the interval,

but not at the end of the interval, Then, a fault in the interval must change the value of at least one of these, from u to – u. That is, there exist 2, and 2,+ ~ in this sequence A ~+1 >3. The collect —Some

such

that

IZ,+ ~1= IZ,I +

sums to O. There

2X in ~j,.

2, and

A,+ ~ = A, + 4>5.

Hence,

are two subcases,

. . . Xk has value

O.

Since Xk # O, some fault must move the sum from O. That is, there exist ~, such that 2, = O and IX,+ 1I = 2. Hence, A,+ 1 = A, and 2,+ ~ in this sequence +4>5and A~+l>3. —No

X in ~j,...,

That first

is, half

that there

~~ has value

the registers exist

O.

are read

as positive,

2, and X,+ ~ in the sequence

sign: that IZ,I = IZ,+II = 1 and A~ >3. Hence, A~+l >1.

Z, =

–Z,+l,

and half ~j,..., Then,

A,+l

Suppose next that all of Zj, . . . . Zk have the same sign. u I < I~~ 1,their sign is different than u ‘s. The COIIect read uote = – u and half with vote = u. Since all the 2 have some fault changes a value from u to – u in cq. That Z,+l in the sequence Hence, A~,l >3. The collect By Lemma

such that

12,YII

as negative,

= A, + 2>3

successful

and

Since 1~~+ ~1= lZ~ + half the registers with sign different than u, is, there exist 2, and

= 1X,1 + 2, and A,+,

is nonzero and invalid. 6.3.1, all i – 1 of the earlier

Suppose

Zk that have different

= A, + $ z 4 z 5.

read-modify-writes

wrote

the single valid value, u, yet the sum of the collect had opposite sign. Hence, in a at least (i/21 registers had faults changing the value from u to – u. Recall that RF~+ ~ is m minus the number of faults in am; hence, RF~+ ~ < m – [i\21, and 2RF~~ ~ s 2m – 2[i/21 s 2m – i. Hence, we have

>lZ~+ll

+2m+l–i–(2m–

i)

n

Y. AFEK ET AL.

1270 LEMMA

6.3.8

PROOF. register,

Consider r[2m

All processes

(AGREEMENT).

Sk

in

the

+ 1], has been

system

written.

decide on the same value.

state

Sk, immediately

By Lemma

6.3.7,

after

2RF~

the

last

< 112k1. Hence-

forth, the number of remaining faults is insufficient to change the sign of Zk, or reduce it to O. Since all the reads in any final collect (upon which any decision

is based)

are

made

after

Sk, all

processes

decide

on

the

same

❑

value. Recall

that

without

a register

any process

3.1 imply

is k-faulty

writing

into

if it can change

it, at most

k times,

its value

Lemma

spontaneously,

6,2,3 and Theorem

the following:

COROLLARY

construction k-faulty:

6.3.9.

using

2m

For any 1 s k s m, there is a strongly wait-free consensus + 1 l?iWW registers where at most 1m/k] registers are

c0NsT(m4aklc0nsensus) ‘2m+1 6.4.

UNIVERSAL

universal

object,

implementation objects

for

Herlihy’s Recently, Herlihy’s

CONSTRUCTIONS. as an

of

any

object other

n processes

that

[1991] used

be

defined to

the

construct

notion

of a

a wait-free

object.

He

showed

that

consensus

are universal

for

systems

with

at most

and

other

n processes.

construction required atomic registers over an unbounded domain. Jayanti and Toueg [1992] showed how to bound the register size in construction.

THEOREM are universal,

6.4.1 (HERLIHY). provided

Consensus

For

objects

none of them are faulty:

CONST((consensus, PROOF.

Herlihy can

any number

atomic),

of processes

construction from n-processor consensus reliable atomic read/write registers of

together

with

atomic

registers

For all objects X, (O, ~), X)

n, Herlihy and atomic unbounded

< ~. [1991]

gives

an explicit

registers that uses 0(n3) size and 0(n3) reliable

consensus objects over a bounded domain. Simple extensions of known constructions can be used to implement multi-valued consensus from binary consensus and n read/write registers [Plotkin 1988]. (Each process p writes its input to a read/write register rP, initialized to some invalid value, then processes use binary consensus on log n consensus objects to agree on the index q of a register r~ that contains a valid value.) Although the construction does not tolerate faults, it must be strongly wait-free.

Herlihy’s

[1991]

construction

includes

a loop

with

an exit condition

dependent on values read from shared memory. If a fault occurs, the construction may loop forever, so this precise construction is not strongly wait-free. However, there is a bound on the number of low-level operations on shared data that are required to implement a single high-level operation. Hence, this construction can be made strongly wait-free by adding an exit condition when the bound on the number of low-level steps is exceeded. Other constructions such as Plotkin [1988] can be modified similarly. ❑

Computing

with Faulty

THEOREM

Shared

Objects

1271

6,4.2

—RMWregisters are universal objects, if a bounded For all objects X, and for all f < ~, CONST(RMW,

(f,

number

~), X)

of them are ~-faulty:

< w.

—Consensus objects together with safe registers are universal objects, if a bounded number of them are w-faul~: For all objects X, and for all f < ~, CONST((consensus, PROOF.

Note

that

first

safe),

it is trivial

( f, ~), X)

to use a RMW

wait-free implementation of an atomic register: = 1. From this observation, Corollary 4.4 CONST(RMW,

(f, ~), atomic)

s 20f

< CO.

+ 8 and

register

from

Theorem

CONST(RMW, ( f, ~), consensus) = 2 f + 1. Composing these the construction in Theorem 6.4.1, and appealing to Theorem first part of the theorem. To prove

the

second

part

of the theorem,

tions of atomic registers from safe registersll was done above with Herlihy’s construction:

as a strongly

CONST( RMW, (O, ~), atomic) and Theorem 3.4, we have

known

6.1.5

fault-intolerant

can be made CONST(safe,

we

have

two facts with 3.3, proves the construc-

strongly wait-free as (O, m), atomic) s m.

Composing these constructions and Theorem 6.4.1, we obtain CONST((consensus, safe), (O, M), R&fW) s w. The result follows by the first part of the theorem

and Theorem

❑

3.4.

The proof

of the second

THEOREM

6.4.3.

part

Suppose

of this theorem

X

is a universal

can be readily object,

generalized:

in the sense that for

all

objects Y, there is a strongly wait-jiee construction of Y jiom X. Then there is a fault-tolerant construction of Yfiom X: For all objects Y, and for all f < w, CONST(X,

(O, m), Y)

< ~ == CONST(X,

(f,

~), Y)

< ~.

In their recent paper on memory faults, Jayanti et al. [1992] give a direct fault-tolerant, strongly wait-free construction of consensus from consensus, which they use in an alternative proof of the second part of the previous theorem: consensus objects and registers can be used in fault-tolerant, strongly wait-free universal constructions. 7. Discussion, There

Open Problems,

remain

distributed multi-valued

many

unresolved

and More issues

about related

Related

Work

to shared

memory

failures

in

systems. Faulty versions of other shared data objects, such as test & set registers, m-registers, and compare & swap registers,

are of interest. Based on the constructions presented in this paper, fault tolerant construction of fetch-and-add and swap shared objects are presented in Afek et al, [1993b]. We have tight bounds on only a few problems; more efficient constructions and corresponding lower bounds would also be interesting. It would be particularly interesting to implement memory-fault tolerant data objects directly from similar, faulty objects, such as test & set from test & set, 11See, for example, Bloom [1987], Burns and Peterson [1987], Lamport [1986], Li et al. [1989], Peterson [1983], Peterson and Burns [1987], Singh et al. [1994], and Tromp [1989].

Y. AFEK

1272 without

using

without

the overhead

All

atomic

our solutions

registers,

or read-modify-write

of a universal

from

ET AL.

read-modify-write,

construction.

are deterministic.

It would

be interesting

to explore

the use

of randomization to tolerate memory failures. Also, there is much work to be done in exploring the effect of memory failures in other models, such as synchronous or semi-synchronous models. Earlier, we referred several times to a paper explores

other

interesting

issues related

by Jayanti

to memory

et al. [1992]

failures,

as well

that

as indepen-

dently proving some of the same results presented here. Our fault models allow the study of individual faults within an object, setting the stage for an investigation of transient fault behavior. The work of Jayanti et al. [1992] assumes an object is either permanently faulty or completely reliable, but studies a range of fault types, from extremely malicious to benign crash behavior. These definitions coincide in our notion of ~-faulty spond results —A

nonresponsive

faults,

objects,

corre-

which

to responsive, arbitrary failures in the work of Jayanti et al. Specific by Jayanti et al. that are identical or closely related to ours include:

version

of Theorem

tion, in any fault model, used in our proof,

3.2 that that

states general

suffice

conditions

on self-implementa-

to apply the same recursive

construction

—Composition results similar to Theorems 3.3 and 3.4. —Register constructions analogous to Theorems 4.1, 4.3, and Corollary 4.4. —A direct, fault-tolerant and strongly-wait free construction of consensus from

consensus,

second

part

which

of Theorem

is used to prove

The generalization ing construction that

of Theorem generalizes

Gracefully

degrading

constructions

to

the

are

required

exhibit

universality

6.4.2 and to Theorem

primitives fail, as the low-level ing constructions in different

results

analogous

to the

6.4.3.

3.2 depends on a notion of gracejidly degradour notion of strongly wait-free construction.

same

from type

potentially of

fault

faulty behavior,

low-level when

primitives too

many

primitives themselves suffer. Gracefully degradfault models can be composed in general ways,

just as we compose strongly wait-free constructions in the ~-fault model. We believe this generalization of strong wait-freedom is an important contribution. Jayanti et al. [1992] also investigate an interesting range of fault models distinct from those we study, focusing on computability and universality issues, They show that unresponsive faults (in which invocations to objects may not evoke replies) are essentially impossible to overcome. In addition to the arbitra~ fault model, they study two weaker responsive fault models, omission faults and crash faults. They show universal gracefully degrading constructions for omission faults, and argue that crash faults are too benign. That is, even if the low-level primitive objects fail by crashing, it is essentially impossible to create abstract components that gracefully degrade (or exhibit only crash faults when too many primitives crash). Following on this work, three of us have defined a benign fault model intermediate in power between the omission and crash fault models, and shown that it supports universal, gracefully degrading constructions [Afek et al. 1993]. The authors benefited substantially from the efforts of outstanding referees. They caught several errors and omissions, and the presentation is much improved, in clarity and style, from our original draft.

ACKNOWLEDGMENT.

Computing

with Faulty

Shared

Objects

1273

REFERENCES ABRAHAMSON,K.

1988. On achieving consensus using a shared memory, In Proceedings of the 7th ACM Symposium on Principles of Disti’buted Computing (Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 291-302. AFEK, Y., ATTIYA, H., DOLEV, D., GAFNI, E., MERRITT, M., AND SHAVIT, N. 1993. Atomic snapshots of shared memory. J. ACM 40, 4 (Sept.), 873–890. AFEK, Y., DOLEV, D., GAFNI, E., MERRITT, M., AND SHAVIT, N. 1994. A bounded first-in, first-enabled solution to the l-exclusion problem. ACM Trans. Program. Lang. Syst. 16,3 (May), 939-953. AFEK, Y., GAFNI, E., TROMP, J., AND VITLNYI, P. M. B. 1992a. Wait-free test-and-set. In Proceedings of the 6th International Workshop on Distributed Algorithms. Lecture Notes in Computer Science, vol. 647. Springer-Verlag, New York, pp. 85–94. AFEK, Y., GREENBERG, D., MERRITT, M., AND TAUBENFELD, G. 1992b. Computing with faulty shared memory. In Proceedings of the llth Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12), AFEK, Y., MERRIn,

M., AND TAUBENFELD, G.

1993a.

Benign

failure

models for shared mem-

Workshop on Distributed Algorithms. ory. In Proceedings of the 7th International Computer Science, vol. 725. Springer-Verlag, New York, pp. 69-83. AFEK, Y., WEISBERGER, E., AND WEISMAN, H.

1993b.

A completeness

Lecture

theorem

Notes in

for a class of

synchronization objects. In ~oceedings Ofthelzth ACM $wosium onfi@leS OfDiSm”bU@d Computing (Ithaca, N.Y., Aug. 15-18). ACM, New York, pp. 159-170. ASPNES, J., AND HERLIHY, M. 1990. Fast randomized consensus using shared memory. J. Algorithms, pages 281–294, September. BELL, G. 1992. Ultracomputers: A teraflop before its time. Commun. ACM 35, 8 (Aug.), 27-47. BLOOM, B. 1987. Constructing two-writer atomic registers. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 249-259. BURNS, J. E., AND PETERSON, G. L. 1987. Constructing multi-reader atomic values from non-atomic values. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 222-231. 1989. Linda in context. Commun. ACM 32, 4 (Apr.), CARRIERO, N., AND GELERNTER, D. 444-458. CHOR, B., ISRAELI, A., AND LI, M. 1987. On processor coordination using asynchronous hardware. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 86-97. DOLEV, D., GAFNI, E., AND SHAVIT, N. 1988. Towards a non-atomic era: l-exclusion as a test case. In Proceedings of the 20th Annual ACM Symposium on Theo~ of Computing (Chicago, Ill., May 2-4). ACM, New York, pp. 78-92. DIJKSTRA, E. W. 1965, Solution of a problem in concurrent programming control. Commun. ACM 8, 9 (Sept.), 569. ACM DIJKSTRA, E. W. 1974. Self-stabilizing systems in spite of distributed control. Commun. 17, 9 (Nov.), 643-644. FISCHER, M. 1983. The consensus problem in unreliable distributed systems (a brief survey). In Foundations of Computation Theoiy, M. Karpinsky, Ed. Lecture Notes in Computer Science, vol. 158. Springer-Verlag, New York, pp. 127-140. FISCHER, M., LYNCH, N., BURNS, J., AND BORODIN, A, 1979. Resource allocation with immunity to limited process failure. In Proceedings of the 20th IEEE Annual Symposium on Foundation of Computer Science. IEEE, New York, pp. 234-254. FISCHER, M., LYNCH, N., BURNS, J., AND BORODIN, A. 1985a. Distributed FIFO allocation of identical resources using small shared space. Tech. Rep. MIT/LCS\TM-290. Laboratory for Computer Science, Massachusetts Institute Technology, Cambridge, Mass. FISCHER, M., LYNCH, N.j AND MERRITr, M. 1985b. Easy impossibility proofs for distributed consensus problems. Dist. Comput. 1, 1,26–39. FISCHER, M., LYNCH, N., AND PATERSON, M. 1985c. Impossibilhy of distributed consensus with one faulty process, J. ACM 32, 2 (Apr.), 374–382. FISCHER, M. J., MORAN, S., RUDICH, S., AND TAUBENFELD, G. The wakeup problem. SL4M J. Comput. to appear. asynchronous consensus FISCHER, M. J., MORAN, S., AND TAUBENFELD, G. 1993. Space-efficient without shared memory initialization. Inf. Proc. Lett. 45 (Feb.), 101–105.

1274

Y. AFEK

ET AL.

M. 1991. Wait-free synchronization. ACM Trans. Prog. Lang. ,$yst. 13, 1 (Jan.), 124-149. HERLIHY, M. P., AND WING, J. M. 1987. Axioms for concurrent objects. In Proceedings of’ the 14th Annual ACM Symposium on Principles of Programming Languages (Munich, West Germany, Jan. 21 -23). ACM, New York, 13-26. JAYANTI, P., CHANDRA, T., AND TOUEG, S. 1992. Fault-tolerant wait-free shared objects. In of Computer Science. IEEE Proceedings of the 33rd IEEE Annual Symposium on Foundation Computer Society Press, New York. JAYANTI, P., AND TOUEG, S. 1992. Some results on the impossibility, universality, and decidabilWorkshop on Distributed Algorithms. ity of consensus. In Proceedings of the 6th International Lecture Notes in Computer Science, vol. 647. Springer-Verlag, New York, pp. 69–84. H. H. 1987. Memory requirements for agreement among LOUI, M. C., AND ABU-AMARA, unreliable asynchronous processes. Aduarrces in Computing Research. 4, 163– 183. LAMPORT, L. 1974. A new solution of Dijkstra’s concurrent programming problem. Comrrum. ACM 17, 8 (Aug.), 453-455. LAMPORT, L. 1986. On interprocess communication, parts I and II. Dist. Comput. 1, 77-101. LI, K., AND HUDAK, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov.), 321-359. concurrent wait-free LI, M., TROMP, J., AND VITANYI, P. M. B. 1989. How to construct variables. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Lecture Notes in Computer Science, vol. 372. Springer-Verlag, New York, pp. 488–505. 1987. Hierarchical correctness proofs for distributed algoLYNCH, N. A., AND TUTTLE, M. rithms. In Proceedings of the 6th Annual A CM Symposium on Pn”nciples of Distn”buted Computation (Vancouver, B. C., Canada, Aug. 10–12). ACM, New York, pp. 137–151. Expanded version available as Tech. Rep. MIT/LCS/TR-387. April 1987. Massachusetts Institute of Technology, Cambridge, Mass. MERRITT, M., AND ORDA, A. Efficient test& set algorithms for faulty shared memory. Unpublished note. MORAN, S., TAUBENFELD, G., AND YADIN, I. 1992. Concurrent counting. In Proceedings of the Ilth Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B. C., Canada, Aug. 10-12). ACM, New York, pp. 59-70. PETERSON, G. L. 1983. Concurrent reading while writing. ACM Trans. Prog. Lang. Syst. 5, 1 HERLIHY,

(Jan.), 46-55. PETERSON, G. L., AND BURNS, J. E. 1987. Concurrent reading while writing II: The multi-writer case. In Proceedings of the 28th IEEE Annual Symposium on Foundation of Computer Science (Oct.), IEEE, New York, pp. 383-392. 1985. Operating System Concepts. (Third edition). PmERsoN, J. L., AND SILBERSCHATZ, A. Addison-Wesley, Reading, Pa, PLOTKIN, S. A. 1988. Chapter 4: Sticky Bits and Universality of Consensus, Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, Mass. RABIN, M. O. 1982. N-process mutual exclusion with bounded waiting by 4 log n shared variables. J. Comput. Syst. Sci. 25, 66–75. SAKS, M., SHAVIT, N., AND WOLL, H. 1991. Optimal time randomized consensus—making resilient algorithms fast in practice. In Proceedings of 2nd Annual ACM–SL4M Symposium on Discrete Algorithms (San Francisco, Calif., Sept. 27-28). ACM, New York, pp. 351-362. SINGH, A. K., ANDERSON, J. H., AND GOUDA, M. G. 1994. The elusive atomic register. J. ACM 41(2), 311. SMITH, A. J. 1982. Cache memories. Comput. Suru. 14, 473-540. TROMP, J. 1987. How to construct an atomic variable. In Proceedings of the 3rd International Workshop on Distributed Algorithms, J. C. Bermond and M. Raynal, eds. Lecture Notes in Computer Science, vol. 292, Springer-Verlag, New York, pp. 292–302. VITI.NYI, P. M. B., AND AWER~UC~, B. 1987. Shared register access by asynchronous hardware. In Proceedings of the 27th IEEE Annual Symposium on Foundation of Computer Science. IEEE, New York, 233-243. RECEIVED SEPTEMBER 1992; REVISED JULY 1995; ACCEPTED JULY 1995

Journal

of the

Association

for

Computing

Machinery,

Vol.

42, No.

6, November

1995

Computing with faulty shared objects

Computing with faulty shared objects

Suggest Documents

Using Shared Objects

Asymptotically tight bounds for computing with faulty ... - CiteSeerX

Hybrid Cluster Computing with Mobile Objects

Building User Interfaces with Shared Objects - Semantic Scholar

Building User Interfaces with Shared Objects - Semantic Scholar

Personal Interactive Computing Objects - CiteSeerX

Modeling Collaboration using Shared Objects - CiteSeerX

Shared Memory Computing on Clusters with Symmetric ... - CiteSeerX

Shared Memory Computing on Clusters with ... - Rochester CS

Hybrid System Identification with Faulty ... - Semantic Scholar

Shared Services Canada - Cloud Computing Best Practices

outsourcing involving shared computing services (including cloud)

Conservative simulation using distributed-shared ... - NUS Computing

Shared Services Canada - Cloud Computing Best Practices

Locating Objects in Mobile Computing - CiteSeerX

Personal Interactive Computing Objects - Semantic Scholar

Personal Interactive Computing Objects - Cambridge Computer Lab

Continuous Clustering of Moving Objects - NUS Computing

Faulty fetal packing

Faulty by design. - Reform

The Object Binary InterfaceâC++ Objects for Evolvable Shared Class ...

A new approach to collaborative frameworks using shared objects - Core

Public Sound Objects: A Shared Musical Space on the Web

Distributed Shared Objects as a Communication Paradigm - CiteSeerX

Computing with faulty shared objects