Providing High Availability Using Lazy Replication

RIVKA LADIN
Digital Equipment Corp.
and
BARBARA LISKOV, LIUBA SHRIRA, and SANJAY GHEMAWAT
MIT Laboratory for Computer Science

To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. For some applications a weaker causal operation order can preserve consistency while providing better performance. This paper describes a new way of implementing causal operations. Our technique also supports two other kinds of operations: operations that are totally ordered with respect to one another, and operations that are totally ordered with respect to all other operations. The method performs well in terms of response time, operation-processing capacity, amount of stored state, number and size of messages, and availability; it does better than replication methods based on reliable multicast techniques.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—distributed applications, distributed databases; C.4 [Computer Systems Organization]: Performance of Systems—reliability, availability, and serviceability; D.4.5 [Operating Systems]: Reliability—fault-tolerance; D.4.7 [Operating Systems]: Organization and Design—distributed systems; H.2.2 [Database Management]: Physical Design—recovery and restart; H.2.4 [Database Management]: Systems—concurrency, distributed systems

General Terms: Algorithms, Performance, Reliability

Additional Key Words and Phrases: Client/server architecture, fault tolerance, group communication, high availability, operation ordering, replication, scalability, semantics of application operations

A preliminary version of this paper appeared in the Proceedings of the Ninth ACM Symposium on Principles of Distributed Computing, August 1990. This research was supported in part by the National Science Foundation under grant CCR-8822158 and in part by the Advanced Research Projects Agency of the U.S. Department of Defense, monitored by the Office of Naval Research under contract N00014-89-J-1988.

Authors' addresses: R. Ladin, Digital Equipment Corp., One Kendall Square, Cambridge, MA 02139; B. Liskov, L. Shrira, and S. Ghemawat, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0734-2071/92/1100-0360 $01.50

ACM Transactions on Computer Systems, Vol. 10, No. 4, November 1992, Pages 360-391.


1. INTRODUCTION

Many computer-based services must be highly available: they should be accessible with high probability despite site crashes and network failures. To achieve availability, the server's state must be replicated. Consistency of the replicated state can be guaranteed by forcing service operations to occur in the same order at all sites. However, a weaker, causal ordering can preserve consistency in some applications [18], leading to better performance.

This paper describes a new technique that supports causal order. An operation is executed at just one replica; updating of other replicas happens by lazy exchange of "gossip" messages—hence the name "lazy replication." The replicated service continues to provide service in spite of node failures and network partitions. We have applied the method to a number of applications, including distributed garbage collection [17], deadlock detection [8], orphan detection [22], locating movable objects in a distributed system [14], and deletion of unused versions in a hybrid concurrency control scheme [34]. Another system that can benefit from causally ordered operations is the familiar electronic mail system. Normally, the delivery order of mail messages sent by different clients to different recipients, or even to the same recipient, is unimportant, as is the delivery order of messages sent by a single client to different recipients. However, suppose client c1 sends a message to client c2 and then a later message to c3 that refers to information in the earlier message. If, as a result of reading c1's message, c3 sends an inquiry message to c2, c3 would expect its message to be delivered to c2 after c1's message. Therefore, send_mail operations need to be causally ordered.

Applications that use causally ordered operations may occasionally require a stronger ordering. Our method allows this. Each operation has an ordering type; in addition to the causal operations, there are forced and immediate operations. Forced operations are performed in the same order (relative to one another) at all replicas. Their ordering relative to causal operations may differ at different replicas but is consistent with the causal order at all replicas. They would be useful in the mail system to guarantee that if two clients are attempting to add the same user-name on behalf of two different users simultaneously, only one would succeed. Immediate operations are performed at all replicas in the same order relative to all other operations. They have the effect of being performed immediately when the operation returns, and are ordered consistently with external events [11]. They would be useful to remove an individual from a classified mailing-list "at once," so that no messages addressed to that list would be delivered to that user after the remove operation returns.

It is easy to construct and use highly-available applications with our method. The user of a replicated application just invokes operations and can ignore replication and distribution, and also the ordering types of the operations. The application programmer supplies a nonreplicated implementation, ignoring complications due to distribution and replication, and defines the category of each operation (e.g., for a mail service the designer would indicate that send_mail and read_mail are causal, add_user and delete_user are forced, and delete_at_once is immediate). To determine the operation categories, the programmer can use techniques developed for determining permissible concurrency [13, 31, 32, 35].
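To make the categorization concrete, here is a minimal sketch, not taken from the paper, of how a designer might record the ordering type of each mail-service operation; the operation names follow the example above, and the table itself is an assumption of this illustration.

```python
from enum import Enum

class OrderingType(Enum):
    CAUSAL = "causal"        # ordered only by the causal dependencies carried in labels
    FORCED = "forced"        # totally ordered relative to other forced operations
    IMMEDIATE = "immediate"  # totally ordered relative to all other operations

# Hypothetical category table for the mail service described in the text.
MAIL_SERVICE_CATEGORIES = {
    "send_mail": OrderingType.CAUSAL,
    "read_mail": OrderingType.CAUSAL,
    "add_user": OrderingType.FORCED,
    "delete_user": OrderingType.FORCED,
    "delete_at_once": OrderingType.IMMEDIATE,
}

def ordering_of(op_name: str) -> OrderingType:
    """Return the declared ordering type of an operation (default: causal)."""
    return MAIL_SERVICE_CATEGORIES.get(op_name, OrderingType.CAUSAL)
```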

Our method does not delay update operations (such as send_mail), and typically provides the response to a query (such as read_mail) in one message round trip. It will perform well in terms of response time, amount of stored state, number of messages, and availability in the presence of node and communication failures provided most update operations are causal. Forced operations require more messages than causal operations; immediate operations require even more messages and can temporarily slow responses to queries. However, these operations will have little impact on overall performance provided they are used infrequently. This is the case in the mail system, where sending and reading mail is much more frequent than adding or removing a user.

Our system can be instantiated with only forced operations. In this case it provides the same order for all updates at all replicas, and will perform similarly to other replication techniques that guarantee this property (e.g., voting [10] or primary copy [28]). Our method generalizes these techniques because it allows queries to be performed on stale data while ensuring that the information observed respects causality.

The idea of allowing an application to make use of operations with differing ordering requirements appears in the work on ISIS [3] and Psync [27], and in fact we support the same three orders as ISIS. These systems provide a reliable multicast mechanism that allows processes within a process group consisting of both clients and servers to communicate; a possible application of the mechanism is a replicated service. Our technique is a replication method. It is less general than a reliable multicast mechanism, but is better suited to providing replicated services that are distinct from clients and as a result provides better performance: it requires many fewer messages than the process-group approach; the messages are smaller, and it can tolerate network partitions.

Our technique is based on the gossip approach first introduced by Fischer and Michael [9], and later enhanced by Wuu and Bernstein [36] and our own earlier work [15, 21]. We have extended the earlier work in two important ways: by supporting causal ordering for updates as well as queries and by incorporating forced and immediate operations to make a more generally applicable method. The implementation of causal operations is novel and so is the method for combining the three types of operations.

The rest of the paper is organized as follows. Section 2 describes the implementation technique. The following sections discuss system performance, scalability, and related work. We conclude with a discussion of what we have accomplished.

2. THE REPLICATED SERVICE

Lazy replication is intended for an environment in which individual computers, or nodes, are connected by a communication network. Both the nodes and the network may fail, but we assume the failures are not Byzantine. The nodes are fail-stop processors. The network can partition, and messages can be lost, delayed, duplicated, and delivered out of order. The configuration of the system can change; nodes can leave and join the network at any time. We assume nodes have loosely synchronized clocks. There are practical protocols, such as NTP [26], that synchronize clocks in geographically distributed networks at low cost.

A replicated application is implemented by a service consisting of replicas running at different nodes in a network. To hide replication from clients, the system also provides front end code that runs at client nodes. To call an operation, a client makes a local call to the front end, which sends a call message to one of the replicas. The replica executes the requested operation and sends a reply message back to the front end. Replicas communicate new information (e.g., about updates) among themselves lazily in the background by exchanging gossip messages.

There are two kinds of operations: update operations modify but do not observe the application state, while query operations observe the state but do not modify it. (Operations that both update and observe the state can be treated as an update followed by a query.) When requested to perform an update, the front end returns to the client immediately and communicates with the service in the background. To perform a query, the front end waits for the response from a replica, and then returns the result to the client.
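As a rough illustration of this division of labor (our own sketch, not code from the paper; the class and the transport callback are invented), a front end might return immediately from updates while waiting for a replica's reply only on queries.

```python
import queue
import threading
from typing import Any, Callable

class FrontEnd:
    """Minimal sketch of the client-side front end described above.

    `send_call` stands in for whatever message transport reaches a replica;
    it is an assumption of this sketch, not part of the paper's interface.
    """

    def __init__(self, send_call: Callable[[dict], Any]):
        self.send_call = send_call
        self.pending = queue.Queue()
        # A background thread pushes queued update calls to a replica lazily.
        threading.Thread(target=self._drain, daemon=True).start()

    def update(self, op: str, *args) -> None:
        # Updates return to the client immediately; delivery happens later.
        self.pending.put({"kind": "update", "op": op, "args": args})

    def query(self, op: str, *args) -> Any:
        # Queries wait for a replica's reply before returning a result.
        return self.send_call({"kind": "query", "op": op, "args": args})

    def _drain(self) -> None:
        while True:
            self.send_call(self.pending.get())
```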

We assume there is a fixed number of replicas residing at fixed locations and that front ends and replicas know how to find replicas; a technique for reconfiguring services (i.e., adding or removing replicas) is described in [14]. We also assume that replicas eventually recover from crashes; Section 3.2 discusses how this happens.

In this section we describe how the front end and service replicas together implement the three types of operations. Section 2.1 describes the implementation of causal operations; Section 2.2 extends the implementation to support the other two types of operations.

2.1 Causal Operations

In our method, the front end informs the replicas about the causal ordering of operations. Every reply message for an update operation contains a unique identifier, uid, that names that invocation. In addition, every (query and update) operation o takes a set of uids as an argument; such a set is called a label.¹ The label identifies the updates whose execution must precede the execution of o. In the mail service, for example, the front end can indicate that one send_mail must precede another by including the uid of the first send_mail in the label passed to the second. Finally, a query operation returns a value and also a label that identifies the updates reflected in the value; this label is a superset of the argument label, ensuring that every update that must have preceded the query is reflected in the result. Service replicas thus perform the following operations:

update (prev: label, op: op) returns (uid: uid)
query (prev: label, op: op) returns (newl: label, value: value)

where op describes the actual operation to be performed (i.e., gives its name and arguments).

¹A similar concept occurs in [3], but no practical way to implement it was described.

A specification of the service for causal operations is given in Figure 1. We view an execution of a service as a sequence of events, one event for each update and query operation performed by a front end on behalf of a client. At some point between when an operation is called and when it returns, an event for it is appended to the sequence. An event records the arguments and results of its operation. In a query event q, q.prev is the input label, q.op defines the query operation, q.value is the result value, and q.newl is the result label; for update u, u.prev is the input label, u.op defines the update operation, and u.uid is the uid assigned by the service. If e is an event in execution sequence E, P(e) denotes the set of events preceding e in E. Also, for a set S of events, S.label denotes the set of uids of update events in S.

Let q be a query. Then
1. q.newl ⊆ P(q).label.
2. q.prev ⊆ q.newl.
3. u.uid ∈ q.newl ⇒ for all updates v s.t. dep(u, v), v.uid ∈ q.newl.
4. q.value = q.op(Val(q.newl)).

Fig. 1. Specification of the causal operations.

The specification describes the behavior of queries, since this is all the clients (via the front ends) can observe. The first clause states that all updates identified by a query result label correspond to calls made by front ends. The second clause states that the result label identifies all required updates plus possibly some additional ones. The third clause states that the returned label is dependency complete: if some update u is identified by the label, then so is every update that u depends on. An update u depends on an update v if it is constrained to be after v:

dep(u, v) ≡ (v.uid ∈ u.prev)

The dependency relation dep is acyclic because front ends do not create uids. The fourth clause defines the relationship between the value and the label returned by a query: the result returned must be computed by applying the query q.op to a state Val arrived at by starting with the initial state and performing the updates identified by the label in an order consistent with the dependency relation. If dep(u, v), v.op is performed before u.op; operations not constrained by dep are performed in arbitrary order. Note that this clause guarantees that if the returned label is used later as input to another query, the result of that query will reflect the effects of all updates observed in the earlier query.

The front end maintains a label for the service. It guarantees causality as follows:

(1) Client calls to the service. The front end sends its label in every call message and merges the uid or label in each reply message with its label by performing a union of the two sets.

(2) Client-to-client communication. The front end intercepts all messages exchanged between its client and other clients. It adds its label to each message its client sends and merges the label in each message received by its client with its label.

The resulting order may be stronger than needed. For example, if client c1 communicates with c2 without exposing any information about its earlier calls on service operations, it is not necessary to order c2's later calls after c1's earlier ones. Our method allows a sophisticated client to use uids and labels directly to implement the causality that really occurs; we assume in the rest of the paper, however, that all client calls go through the front end.

During normal operation a front end will always contact the same "preferred" replica. However, if the response is slow, it might send the request to a different replica or send it to several replicas in parallel. In addition, messages may be duplicated by the network. In spite of these duplicates, update operations must be performed at most once at each replica. To allow the service to recognize multiple requests for the same update, the front end associates a unique call identifier, cid, with each update. That cid is included in every call message sent by the front end on behalf of that update.

2.1.1 Implementation Overview. For the method to be efficient, we need a compact representation for uids and labels and a fast way to generate uids independently. In addition, replicas must be able to determine when an update operation is ready to be executed. All these properties are provided by a single mechanism, the multipart timestamp.

A multipart timestamp t is a vector

t = (t1, ..., tn)

where n is the number of replicas in the service. Each part is a nonnegative integer counter, and the initial (zero) timestamp contains zero in each part. Timestamps are partially ordered in the obvious way: t ≤ s if and only if each part of t is less than or equal to the corresponding part of s. Two timestamps t and s are merged by taking their component-wise maximum. (Multipart timestamps were used in Locus [30] and also in [9], [12], [15], [21], and [36].)

Both uids and labels are represented by multipart timestamps. Every update operation is assigned a unique multipart timestamp as its uid.
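The sketch below (our own Python rendering; the names are invented and a fixed replica count n is assumed) shows the multipart-timestamp operations just defined, component-wise merge and the partial order, together with the way a front end folds reply uids and labels into the label it attaches to later calls.

```python
from typing import Tuple

Timestamp = Tuple[int, ...]  # one nonnegative counter per replica

def zero_ts(n: int) -> Timestamp:
    return tuple(0 for _ in range(n))

def merge(t: Timestamp, s: Timestamp) -> Timestamp:
    """Component-wise maximum, as in the paper's merge operation."""
    return tuple(max(a, b) for a, b in zip(t, s))

def leq(t: Timestamp, s: Timestamp) -> bool:
    """The partial order: t <= s iff every part of t is <= that part of s."""
    return all(a <= b for a, b in zip(t, s))

class FrontEndLabel:
    """Sketch of a front end's label, kept as a single multipart timestamp."""

    def __init__(self, n: int):
        self.label = zero_ts(n)

    def after_update_reply(self, uid: Timestamp) -> None:
        # Rule (1): merge the uid returned for an update into the label.
        self.label = merge(self.label, uid)

    def after_query_reply(self, newl: Timestamp) -> None:
        # Rule (1): merge the label returned by a query into the label.
        self.label = merge(self.label, newl)

    def on_message_received(self, other_label: Timestamp) -> None:
        # Rule (2): merge labels carried on client-to-client messages.
        self.label = merge(self.label, other_label)

# A label that has absorbed uid (1, 0, 0) identifies that update.
assert leq((1, 0, 0), merge(zero_ts(3), (1, 0, 0)))
```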



relation is whose timestamps are less than or equal to t. The dependency implemented as follows: if an update u is identified by u. preu then u depends that represent labels, on U. Furthermore, if t and t‘ are two timestamps t s t‘ implies that t identifies a subset of the updates identified by t‘. A replica receives call messages from front ends and also gossip messages from other replicas. When a replica receives a call message for an update it has not seen before, appending

it processes

information

about

that

update

by assigning

it a timestamp

it to its log. The log also contains

about updates that were processed at other replicas and propagated replica in gossip messages. A replica maintains a local timestamp, that

identifies

the set of records

in its log and thus

expresses

and

information to this rep–ts,

the extent

of its

knowledge about updates. It increments its part of the timestamp each time it processes an update call message; therefore, the value of the replica’s part of rep–ts

corresponds

to the

number

of updates

it has

processed.

It

incre-

ments other parts of rep_ts when it receives gossip from other replicas. The value of any other part i of rep–ts counts the number of updates processed at replica i that have propagated to this replica via gossip. A replica executes updates in dependency order and maintains its current state in wal. When an update is executed, its uid timestamp is merged into the

timestamp

ual–ts,

which

is used

to determine

whether

an update

or

query is ready to execute; an operation op is ready if its label op.preu s val–ts. The replica keeps track of the cids of updates that have been executed in the

set

inval,

updates. Since stamp (because

and

uses the

the same the front

information

to avoid

duplicate

update may be assigned more end sent the update to several

executions

stamps of duplicates for an update are merged into val_ts. In this honor the dependency relation no matter which of the duplicates knows

about

(and

therefore

Figure 2 summarizes [ ] denotes a sequence, types

as indicated,

indicated.) information

and

includes

its uid in future

the state at a replica. oneof means a tagged ( ) denotes

a record,

of

than one uid timereplicas), the timeway we can a front end

labels).

(In the figure, { } denotes a set, union with component tags and with

components

In addition to information about updates, about acks; acks are discussed below.

the

and

log also

types

as

contains

The description above ignored two important implementation issues, controlling the size of the log and the size of inval. An update record can be discarded from the log as soon as it is known everywhere and has been reflected in val. In fact, if an update is known to be known everywhere, it will be ready to be executed and therefore will be reflected in val, for the following reason. A replica knows some update received gossip messages containing message receiver

includes receives

record u is known everywhere if it has u from all other replicas. Each gossip

enough of the sender’s log to guarantee that when the record u from replica i, it has also received (either in this

gossip message or an earlier one) all records processed at i before u. Therefore, if a replica has heard about u from all other replicas, it will know about all updates that u depends on, since these must have been performed before u (because of the constraint on front ends not to create uids). Therefore, u will be ready to execute. ACM

Transactions

on

Computer

Systems,

Vol

10,

No.

4. November

1992

node: int          % replica's id
log: {log-record}  % replica's log
rep_ts: mpts       % replica's multipart timestamp
val: value         % replica's view of service state
val_ts: mpts       % timestamp associated with val
inval: {cid}       % updates that participated in computing val
ts_table: [mpts]   % ts_table(p) = latest multipart timestamp received from p

where log-record = < msg: op-type, node: int, ts: mpts >
      op-type = oneof [ update: < prev: mpts, op: op, cid: cid, time: time >,
                        ack: < cid: cid, time: time > ]
      mpts = [ int ]

Fig. 2. The state of a replica.

The table ts_table is used to determine whether a log record is known everywhere. Every gossip message contains the timestamp of its sender; ts_table(k) contains the largest timestamp this replica has received from replica k. Note that the current timestamp of replica k must be at least as large as the one stored for it in ts_table. If ts_table(k)[j] = t at replica i, then i knows that replica k has learned of the first t update records created by replica j. Every record r contains the identity of the replica that created it in field r.node. Replica i can remove update record r from its log when it knows that r has been received by every replica, i.e., when isknown(r) holds at i:

isknown(r) ≡ ∀ replicas j: ts_table(j)[r.node] ≥ r.ts[r.node]

The second implementation issue is controlling the size of inval. It is safe to discard a cid c from inval only if the replica will never attempt to apply c's update to val in the future. Such a guarantee requires an upper bound on when messages containing information about c's update can arrive at the replica. A front end will keep sending messages for an update until it receives a reply. Since reply messages can be lost, replicas have no way of knowing when the front end will stop sending these messages unless it informs them. The front end does this by sending an acknowledgment message ack containing the cid of the update to one or more of the replicas. In most applications, separate ack messages will not be needed; instead, acks can be piggybacked on future calls. Acks are added to the log when they arrive at a replica and are propagated to other replicas in gossip the same way updates are propagated. (The case of volatile clients that crash before their acknowledgments reach the replicas can be handled by adding client crash counts to update messages and propagating crash information to replicas; a higher crash count would serve as an ack for all updates from that client with lower crash counts.)

Even though a replica has received an ack, it might still receive a message for the ack's update, since the network can deliver messages out of order. We deal with late messages by having each ack and update message contain the time at which it was created.
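For reference, here is a compact sketch of the replica state of Figure 2 and of the isknown test above. The Python rendering is our own assumption; the field and function names simply mirror the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Timestamp = Tuple[int, ...]

@dataclass
class LogRecord:
    node: int                        # replica that created the record
    ts: Timestamp                    # timestamp assigned to the record
    kind: str                        # "update" or "ack"
    cid: str                         # call identifier from the front end
    time: float                      # front-end clock time when sent
    prev: Optional[Timestamp] = None # label of an update; None for an ack
    op: Optional[object] = None      # the update operation itself

@dataclass
class ReplicaState:
    node: int
    rep_ts: Timestamp                # extent of the replica's knowledge
    val: object                      # current view of the service state
    val_ts: Timestamp                # timestamp describing val
    log: List[LogRecord] = field(default_factory=list)
    inval: set = field(default_factory=set)             # cids applied to val
    ts_table: Dict[int, Timestamp] = field(default_factory=dict)

def isknown(r: LogRecord, ts_table: Dict[int, Timestamp]) -> bool:
    """True when every replica is known to have received record r."""
    return all(ts[r.node] >= r.ts[r.node] for ts in ts_table.values())
```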

Each time the front end sends an update or ack message, it includes in the message the current time of its node's clock; if multiple messages are sent for an update, they will contain different times. The time in an ack must be greater than or equal to the time in any message for the ack's update.

An update message m is discarded because it is "late" if

m.time + δ < the time of the replica's clock

where δ is greater than the anticipated network delay. Each ack a is kept at least until

a.time + δ < the time of the replica's clock

After this time any messages for the ack's update will be discarded because they are late.

A replica can discard a cid c from inval as soon as an ack for c's update is in the log and all records for c's update have been discarded from the log. The former condition guarantees a duplicate call message for c's update will not be accepted; the latter guarantees a duplicate will not be accepted from gossip (see Section 2.1.4 for a proof). A replica can discard ack a from the log once a is known everywhere, the cid of a's update has been discarded from inval, and all call messages for a's update are guaranteed to be late at the replica.

Note that we rely here on clocks being monotonic. Typically clocks are monotonic both while nodes are up and across crashes. If they need to be adjusted, this is done by slowing them down or speeding them up. Also, clocks are usually stable; if not, the clock value can be saved to stable storage [20] periodically and recovered after a crash.²

For the system to run efficiently, clocks of server and client nodes should be loosely synchronized, with a skew bounded by some ε. Synchronized clocks are not needed for correctness, but without them certain suboptimal situations can arise. For example, if a client node's clock is slow, its messages may be discarded even though it just sent them. The delay δ should be chosen to accommodate both the anticipated network delay and the clock skew. The value for this parameter can be as large as needed because the penalty of choosing a large delay only affects how long servers remember acks. Note that we do not require reliable delivery within δ; instead the front ends mask delivery failures by resending messages.

It may seem that our reliance on synchronized clocks affects the availability of the system. A problem could arise only if a node's clock fails, the node is unable to carry out the clock synchronization protocol because of communication problems, and yet the node is able to communicate with other nodes in the system. Such a situation is extremely unlikely.

2.1.2 Processing at Each Replica. This section describes the processing of each kind of message. Initially, (1) rep_ts and val_ts are zero timestamps, (2) ts_table contains all zero timestamps, (3) val has the initial value, and (4) the log and inval are empty. The log and inval are implemented as hash tables hashed on the cid.

²Writes to stable storage can be infrequent; after a crash, a node must wait until its clock is later than the time on stable storage + β, where β is a bound on how frequently writes to stable storage happen, before communicating with replicas.
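A minimal sketch of the lateness test described above; delta stands for the δ bound covering network delay plus clock skew, and reading the local clock with time.time() is an assumption of this illustration rather than part of the paper's design.

```python
import time

def is_late(message_time: float, delta: float, now: float = None) -> bool:
    """A message is 'late' if its creation time plus delta precedes the
    replica's current clock reading, so it can safely be discarded."""
    if now is None:
        now = time.time()
    return message_time + delta < now
```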

Providing High Availability Using Lazy Replication Processing from

an update

a front

clock) r.cid

end

message.

if it is late

or if it is a duplicate = u.cid

performs

is in

the

the following

(i.e.,

its cid c is in inual

If

the

message

is

or a record

not

the

update

record

r .= makeUpdateRecord(u,

and adds it to the local

r

its

:= merge( val_ts,

(5) Returns The

:= inval

comparable.

and For

another replica r. tsj > rep_tsj. u“ that

part

by one

associated

with

the

this

ith

part

execution

of

of the

i, ts)

performs

VO

on have

already

been

the op

r.ts)

{r.cid}

u

the update’s

rep_ts

replica

log.

ual := apply( wal, u. op ) inval

ith

ts, by replacing part of rep–ts.

(4) Executes u.op if all the updates that u depends incorporated into val. If u.prev < val–ts, then:

ual_ts

the

actions:

the timestamp for the update, argument u.preu with the ith

(3) Constructs update,

r such that

discarded,

(1) Advances its local timestamp rep-ts by incrementing while leaving all the other parts unchanged. (2) Computes the input

369

Replica i discards an update message u if u. time + 8< the time of the replica’s

(i.e.,

log).

.

timestamp

the

r. ts in a reply

timestamp

example,

u may

r.ts

assigned

depend

j, and which this replica In addition, this replica

u does not depend

message. to

u are

on update

u‘,

not

necessarily

which

happened

at

does not know about yet. In this case may know about some other update

cm, e.g., u“ happened

at replica

k, and therefore,

r. ts~< r6?&tsk. Processing compares

the

a query

message.

query’s

input

When label,

replica

q.prev,

i receives with

val–ts,

a query

message

q, it

which

identifies

all

updates reflected in ual. If q.prev < val_ts, it applies q. op to val and returns the result and val_ts. Otherwise, it waits since it needs more information. It can either wait for gossip messages from the other replicas or it might send a request to another replica to elicit the information. ValLts and q. prev can be compared part by part to determine which replicas have the missing information. Processing (1) Advances timestamp (2) Constructs

an ack message.

processes

an ack as follows:

its local timestamp rep-ts by incrementing the ith by one while leavi~g all the other parts unchanged. the adi

record

r ,= makei%ckl%ecord(

and adds it to the local (3) Sends

a reply

Note

ack records

that

A replica

message

r associated

a, i, rep_ts

this

execution

of the

of the ack:

)

log. to the front

do not enter ACM

with

part

Transactions

end.

inval. on

Computer

Systems,

Vol.

10,

No,

4, November

1992.

370

*
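Below is a sketch of the update-processing steps just listed. It builds on the ReplicaState, LogRecord, leq, merge, and is_late sketches given earlier; DELTA (the δ bound) and apply_op are placeholders that the service would supply. This is an illustration under those assumptions, not the paper's code.

```python
DELTA = 60.0  # assumed bound on network delay plus clock skew, in seconds

def apply_op(val, op):
    """Placeholder for applying an update operation to the service state."""
    raise NotImplementedError

def process_update(state, u):
    """Sketch of update processing at replica i = state.node (steps 1-5).

    `u` is assumed to be a dict with keys "time", "cid", "prev", and "op".
    Returns the update's uid timestamp, or None if the message is discarded.
    """
    i = state.node
    # Discard late messages and duplicates (cid already applied or logged).
    if is_late(u["time"], DELTA):
        return None
    if u["cid"] in state.inval or any(r.cid == u["cid"] for r in state.log):
        return None
    # (1) Advance the replica's own part of rep_ts.
    rep = list(state.rep_ts)
    rep[i] += 1
    state.rep_ts = tuple(rep)
    # (2) The update's timestamp is u.prev with part i replaced by rep_ts[i].
    ts = list(u["prev"])
    ts[i] = state.rep_ts[i]
    ts = tuple(ts)
    # (3) Log the update record.
    r = LogRecord(node=i, ts=ts, kind="update", cid=u["cid"],
                  time=u["time"], prev=u["prev"], op=u["op"])
    state.log.append(r)
    # (4) Execute now only if everything the update depends on is in val.
    if leq(u["prev"], state.val_ts):
        state.val = apply_op(state.val, u["op"])
        state.val_ts = merge(state.val_ts, r.ts)
        state.inval.add(r.cid)
    # (5) Reply with the assigned timestamp (the update's uid).
    return r.ts
```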

Processing a gossip message. A gossip message contains m.ts, the sender's timestamp, and m.new, the sender's log. The processing of the message consists of three activities: merging the log in the message with the local log, computing the local view of the service state based on the new information, and discarding records from the log and from inval. When replica i receives a gossip message m from replica j, it discards m if m.ts < j's timestamp in ts_table. Otherwise, it continues as follows:

(1) Adds the new information in the message to the replica's log:

    log := log ∪ {r ∈ m.new | ¬(r.ts ≤ rep_ts)}

(2) Merges the replica's timestamp with the timestamp in the message so that rep_ts reflects the information known at the replica:

    rep_ts := merge(rep_ts, m.ts)

(3) Inserts all the update records that are ready to be added to the value into the set comp:

    comp := {r ∈ log | type(r) = update ∧ r.prev ≤ rep_ts}

(4) Computes the new value:

    while comp is not empty do
        select r ∈ comp s.t. there is no r' ∈ comp s.t. r'.ts ≤ r.prev
        if r.cid ∉ inval then
            val := apply(val, r.op)
            inval := inval ∪ {r.cid}
        val_ts := merge(val_ts, r.ts)
        comp := comp − {r}

(5) Updates the ts_table:

    ts_table(j) := m.ts

(6) Discards update records from the log if they have been received by all replicas:

    log := log − {r ∈ log | type(r) = update ∧ isknown(r)}

(7) Discards records from inval if an ack for the update is in the log and there is no update record for that update in the log:

    inval := inval − {c ∈ inval | ∃ a ∈ log s.t. type(a) = ack ∧ a.cid = c ∧ there is no r' ∈ log s.t. type(r') = update ∧ r'.cid = c}

(8) Discards ack records from the log if they are known everywhere, sufficient local time has passed, and there is no update for that ack in inval:

    log := log − {a ∈ log | type(a) = ack ∧ isknown(a) ∧ a.time + δ < the time of the replica's clock}
d. ts for all duplicates

> r‘. tsh. Since this

6. When a record in the value.

r for

an update

r is deleted

from

the log at replica

PROOF.

When

u is deleted

from

i, ishown(

and therefore by Lemma 5, r. ts s rep–ts. This implies that Therefore r enters the set conzp; and either u is executed, in val;

and therefore

LEMMA

7.

any

replica

update-processing

code

for all

the log,

u is

r ) holds

at i

r. preu < rep_ts. or r’s cid is in



earlier.

val–ts

It is easy to see that

PROOF.

the

At

u was executed

is true



d.

< rep–ts.

the

claim

because

if

an

holds

initially.

update

is

It

is preserved

executed,

only

field

by i of

val_ts changes and val–tsL = rep_tsL. It is preserved by gossip processing from because for each record r in comp, r.prev < rep–ts. Since r. ts differs r.prev only in field j, where r was processed at j, and since r is known locally,

r. tsj s rep_tsJ

LEMMA

8.

u is reflected

For any update in the value. The

PROOF.

record

with

computation

by Lemma

proof

is inductive.

consider

D

u, if there exists a record

a zero timestamp. and

3.

The

Assume the

next

basis

step

the claim time

r for u s. t. r. ts < rep–ts,

is trivial holds

rep–ts

since

there

is no

up to some step in the

is modified.

Let

r

be an

update record for u s. t. r. ts < rep–ts after that step. We need to show that u is reflected in the value. First, consider a gossip-processing step. Since r. ts s rep–ts, we know u.prev s rep–ts. If r is in the log, it enters comp, and either

u is executed

already

reflected

now

or u’s

in the value.

cid is in the

set

inval;

If r is not in the log, then

3) that r was deleted (by Lemma reflected in the value. Therefore

and

therefore

r. ts < rep–ts

u is implies

from the log and by Lemma 6, u is already u is reflected in the we have shown that

value after a gossip-processing step. Now consider an update-processing step. before this step, the claim holds by the induction assumption. If . . t. < rep_fs after Otherwise, we have 1 (r. ts < rep–ts ) before this step and r. ts < rep–ts i’s the message was processed. In the processing of an update, replica timestamp rep–ts increases by one in part i with the other parts remaining unchanged. Therefore, the record being created in this step must be r and u .prev < rep-ts before this step occurred. Therefore, any u that furthermore u depends on has already been reflected in the value by the induction assumption, and u.prev s val–ts. Therefore z~is reflected in the value in this step. ACM

❑ TransactIons

on

Computer

Systems,

Vol.

10,

No.

4, November

1992

Prowdlng High Availabilky Using Lazy Replication

.

37!5

We are now ready to prove the fourth clause of the specification. a query returns ual and ual–ts. Lemmas 7 and 8 guarantee

Recall that that for any

update

in the value

record

ual. Therefore, the value.

r such that

r.ts

all updates

identified

We will

now

show

< val_ts,

that

r’s

update

by a query the updates

is reflected

output

label

are executed

are reflected only

in

once and in

the right order. To prove that updates are executed only once at any particular replica, we show that after an update u is reflected in the value, it will not be executed again. The cid c that entered the set inval when u was executed must be inval before u can be executed again. However, when c is deleted from deleted from inval no duplicate record for u is present in the log, and an ack for c’s update is present in the log. By Lemma 1, the presence of the ack guarantees that no future duplicate reenter the log. Furthermore, when for some update the

replica.

from the front end or the network will isknown(r ) holds c is deleted from inval,

r for c, so by Lemma

By Lemma

4, all duplicates

5, d. ts s rep_ts

and

processing code ensures that any future message will not reenter the replica’s log. We now show that updates are reflected with the dependency relation. val–ts and an update u such show that v is reflected the dependency relation

Consider that r‘s

d of r have

therefore

step

duplicate

d arriving

in the value

arrived

1 of the in

in an order

an update record update u depends

at

gossipa gossip

consistent

r such that r. ts < on v. We need to

in the value before u. From the implementation of we know there exists an update record s for v such

that u.prev > s. ts. Therefore, by Lemma 7 and 8 both u and v are reflected in val. Consider the first time u is reflected in the value. If this happens in the processing of an update message, at that step u.prev < val_ts; by Lemma 7, u.prev s rep–ts, and therefore by Lemma 8 v has already been reflected in val. If this happens while processing a gossip message, a record for u has entered Lemma

the set comp and so u.prev s rep_ts and therefore 3, s has entered the log at this replica and will

s. ts s rep–ts. By enter comp now

unless it has already been deleted. The code guarantees that when both s and a record for u are in comp either v is reflected before u because u. prev > s. ts, or v’s cid is in inval and so v was reflected earlier. If s was deleted from the log earlier, then by Lemma 6 v has already been reflected. Making queries

progress. indeed

eventually paired.

recover These

gate

information

and

Lemma

eventually

To ensure

return.

It

is easy

from

crashes

assumptions

also

between 3, replica

replicas. and

value

progress,

we

to see that and

network

guarantee Therefore, time

need updates

stamps

to

show

that

return

partitions

are

gossip

messages

that from

will

the

updates

provided

eventually will

gossip-processing

increase

and

and

replicas

queries

repropacode will

return.

Garbage collection. Update records will be removed from the log assuming replica crashes and network partitions are repaired eventually. Also, acks will be deleted from the log assuming crashes and partitions are repaired eventually and replica clocks advance, provided cids are deleted from inval. To show that cids are deleted from inval, we need the following lemma, ACM

Transactions

on

Computer

Systems,

Vol.

10,

No.

4, November

1992.

376

R. Ladln et al,

.

which guarantees from reappearing 9.

~EMMA

that an ack stays in inual: a’)

Isknown(

in the log long

= ach

A type(a)

*

enough

to prevent

d. ts < rep_ts

for

all

its cid

duplicates

d of a’s update. to Lemma



PROOF.

Similar

5.

Lemmas

8 and 9 and the ack deletion

code guarantee

that

an ack is deleted

only after its cid is deleted from inzml. By Lemma 1, no duplicates of the update message from the front end or network arriving after this point will be accepted assuming the time in the ack is greater than or equal to the time in any message for the update. Furthermore, step 1 of gossip processing ensures that duplicates of the update record arriving in gossip will not enter assuming the replica’s log. Therefore cids will be removed from inual and partitions are repaired eventually and front ends send acks enough times. The

Availability.

service

quested

updates

updates updates.

and no others. For example,

u.prev means

uses

the

so it is important However, consider

= v.prev and assume that r. ts > s. ts, where

query

that

the

input

label

label

identify

to identify just

labels in fact do sometimes two independent updates

the

u

crashes with big

the

re-

required

identify extra and u with

at replica i before that u is processed r and s are the update records created

u. This by i for

u and v, respectively. When r. ts is merged into a label L, L also identifies u as a required update. Nevertheless, a replica never needs to delay a query waiting for such an “extra” update to arrive because the gossip propagation at scheme ensures that whenever the “required” update record r arrives some

replica,

there.

Note

than 2.2

all

update

that

the

the timestamp Other

Operation

records

timestamp

with

timestamps

of an “extra”

of some “required”

update

smaller

than

update

record

record

identified

r‘s

will

is always

be less

by a label.

Types

The section shows how the implementation can be extended to support forced and immediate updates. These two stronger orders are provided only for updates; queries continue to be ordered as specified by their input labels. As mentioned, each update operation has a declared ordering type that is defined when the replicated application is created. Recall that forced updates are performed in the same order (relative to one another) at all replicas; immediate updates are performed at all replicas in the same order relative to all other updates. Like causal updates, forced updates are ordered via labels and uids, but their uids are totally ordered with respect to one another. Labels now identify both the causal and forced updates, and the input label for an update or query identifies the causal and forced updates that must precede it. Uids are irrelevant for immediate updates, however, because the replicas constrain their ACM

ordering Transactions

relative on

Computer

to other Systems,

operations. Vol.

10,

No

4, November

1992.

Providing High Availability Using Lazy Replication Let q be a query.

.

377

Then

1. q.newl c P(q).label.

2. q.prev u G(q).label c 3. u.uld e q.newl

~

q.newl,

for all updates v s.t. dep(u,

4. q.value = q.op ( VaI (q,newl) Fig. 3.

The specification model and

query

event

operation

in execution

to and including second

of a service

clause

the most of the

service

is given

by a front immediate

specification

now

2.2.1

requires

that

of Forced

of client.

that

Updates.

we

If e is an

all events

precedes

queries

up

e in E. The

reflect

the most

as well as what the input except that the dependency

v) = if immediate (u) then v E P(u) else if immediate (v) then u @ P(v) else if forced (u) & forced (v) then v.uid else v.uid E u.prev

Implementation

3. As before

one for each update

the set containing update

recent immediate update in the execution requires. The other clauses are unchanged tion for clause 4 is extended as follows. dep(u,

in Figure

of events,

end on behalf

E, G(e) denotes

recent

E q.ncwl.

of the service.

as a sequence

performed sequence

v.uld

).

Specification

of the complete

the execution

v),

label rela-

< u.uid

We must

provide

a total

order

for forced updates and a way to relate them to causal updates and queries. This is accomplished as follows. As before, we represent uids and labels by multipart timestamps, but the timestamps have one additional field. Conceptually this field corresponds to an additional replica R that runs all forced updates; the order of the forced updates is the order assigned to them by this replica, and is reflected in R’s part of the timestamp. Therefore it is trivial to determine the order of forced v.uid if u.uid~ < v.uidR. Of course, availability we

allow

if only

one replica

problem different

updates:

if that

could

if u and v are forced run

forced

replica

were

to act

as R over

replicas

updates,

inaccessible. time.

updates, there

would

To solve We

u.uid

this

do this


7. El-Abbadi, A., Skeen, D., and Cristian, F. An efficient, fault-tolerant protocol for replicated data management. In Proceedings of the Fourth ACM Symposium on Principles of Database Systems (1985). ACM, New York, pp. 215-229.
8. Farrell, A. K. A deadlock detection scheme for Argus. S.B. thesis, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, Mass.
9. Fischer, M. J., and Michael, A. Sacrificing serializability to attain high availability of data in an unreliable network. In Proceedings of the ACM Symposium on Principles of Database Systems (1982), pp. 70-75.
10. Gifford, D. K. Weighted voting for replicated data. In Proceedings of the Seventh Symposium on Operating Systems Principles (Dec. 1979). ACM SIGOPS, New York, pp. 150-162.
11. Gifford, D. K. Information storage in a decentralized computer system. Tech. Rep. CSL-81-8, Xerox Corp., 1983.
12. Heddaya, A., Hsu, M., and Weihl, W. Two phase gossip: Managing distributed event histories. Inf. Sci. 49, 1-2 (Oct./Nov. 1989).
13. Herlihy, M. A quorum-consensus replication method for abstract data types. ACM Trans. Comput. Syst. 4, 1 (Feb. 1986), 32-53.
14. Hwang, D. Constructing a highly-available location service for a distributed environment. Master's thesis, Tech. Rep. MIT/LCS/TR-410, MIT Laboratory for Computer Science, Cambridge, Mass., 1988.
15. Ladin, R., Liskov, B., and Shrira, L. A technique for constructing highly available services. Algorithmica 3 (1988), 393-420.
16. Ladin, R., Mazer, M. S., and Wolman, A. A general tool for replicating distributed services. In Proceedings of the First International Conference on Parallel and Distributed Information Systems (Dec. 1991). IEEE, New York.
17. Ladin, R. A method for constructing highly available services and a technique for distributed garbage collection. Ph.D. dissertation, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, Mass., May 1989.
18. Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.
19. Lampson, B. W. Designing a global name service. In Proceedings of the 5th ACM Symposium on Principles of Distributed Computing (Aug. 1986). ACM, New York, pp. 1-10.
20. Lampson, B. W., and Sturgis, H. E. Crash recovery in a distributed data storage system. Tech. Rep., Xerox Palo Alto Research Center, Palo Alto, Calif., 1979.
21. Liskov, B., and Ladin, R. Highly-available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th ACM Symposium on Principles of Distributed Computing (Aug. 1986). ACM, New York.
22. Liskov, B., Scheifler, R., Walker, E., and Weihl, W. Orphan detection (extended abstract). In Proceedings of the 17th International Symposium on Fault-Tolerant Computing (Pittsburgh, Pa., July 1987). IEEE, New York, pp. 2-7.
23. Liskov, B., Bloom, T., Gifford, D., Scheifler, R., and Weihl, W. Communication in the Mercury system. In Proceedings of the 21st Annual Hawaii Conference on System Sciences (Jan. 1988). IEEE, New York, pp. 178-187.
24. Liskov, B. Distributed programming in Argus. Commun. ACM 31, 3 (Mar. 1988), 300-312.
25. Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., Shrira, L., and Williams, M. Replication in the Harp file system. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (Oct. 1991). ACM, New York.
26. Mills, D. L. Network time protocol (version 1) specification and implementation. DARPA-Internet Rep. RFC 1059, July 1988.
27. Mishra, S., Peterson, L. L., and Schlichting, R. D. Implementing fault-tolerant replicated objects using Psync. In Proceedings of the Eighth Symposium on Reliable Distributed Systems (Oct. 1989).
28. Oki, B. M., and Liskov, B. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (Aug. 1988). ACM, New York.
29. Oki, B. M. Viewstamped replication for highly available distributed systems. Ph.D. dissertation, Tech. Rep. MIT/LCS/TR-423, MIT Laboratory for Computer Science, Cambridge, Mass., 1988.
30. Parker, D. S., Popek, G. J., Rudisin, G., Stoughton, A., Walker, B., Walton, E., Chow, J., Edwards, D., Kiser, S., and Kline, C. Detection of mutual inconsistency in distributed systems. IEEE Trans. Softw. Eng. SE-9, 3 (May 1983), 240-247.
31. Schmuck, F. B. The use of efficient broadcast protocols in asynchronous distributed systems. Tech. Rep. TR 88-926, Cornell Univ., Ithaca, N.Y., 1988.
32. Schwarz, P., and Spector, A. Synchronizing shared abstract types. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984).
33. Skeen, D. Nonblocking commit protocols. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York.
34. Weihl, W. E. Distributed version management for read-only actions. IEEE Trans. Softw. Eng. SE-13, 1 (Jan. 1987), 55-64.
35. Weihl, W., and Liskov, B. Implementation of resilient, atomic data types. ACM Trans. Program. Lang. Syst. 7, 2 (Apr. 1985), 244-269.
36. Wuu, G. T. J., and Bernstein, A. J. Efficient solutions to the replicated log and dictionary problems. In Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing (Aug. 1984). ACM, New York, pp. 233-242.

Received July 1990; revised January 1992; accepted July 1992
