Providing High Availability Using Lazy Replication

RIVKA LADIN
Digital Equipment Corp.
and
BARBARA LISKOV, LIUBA SHRIRA, and SANJAY GHEMAWAT
MIT Laboratory for Computer Science

To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. For some applications a weaker causal operation order can preserve consistency while providing better performance. This paper describes a new way of implementing causal operations. Our technique also supports two other kinds of operations: operations that are totally ordered with respect to one another, and operations that are totally ordered with respect to all other operations. The method performs well in terms of response time, operation-processing capacity, amount of stored state, and number and size of messages; it does better than replication methods based on reliable multicast techniques.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—distributed applications, distributed databases; C.4 [Computer Systems Organization]: Performance of Systems—reliability, availability, and serviceability; D.4.5 [Operating Systems]: Reliability—fault-tolerance; D.4.7 [Operating Systems]: Organization and Design—distributed systems; H.2.2 [Database Management]: Physical Design—recovery and restart; H.2.4 [Database Management]: Systems—concurrency, distributed systems

General Terms: Algorithms, Performance, Reliability

Additional Key Words and Phrases: Client/server architecture, fault tolerance, group communication, high availability, operation ordering, replication, scalability, semantics of application operations

A preliminary version of this paper appeared in the Proceedings of the Ninth ACM Symposium on Principles of Distributed Computing, August 1990. This research was supported in part by the National Science Foundation under grant CCR-8822158 and in part by the Advanced Research Projects Agency of the U.S. Department of Defense, monitored by the Office of Naval Research under contract N00014-89-J-1988.

Authors' addresses: R. Ladin, Digital Equipment Corp., One Kendall Square, Cambridge, MA 02139; B. Liskov, L. Shrira, and S. Ghemawat, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0734-2071/92/1100-0360 $01.50

ACM Transactions on Computer Systems, Vol. 10, No. 4, November 1992, Pages 360-391.
1. INTRODUCTION

Many computer-based services must be highly available: they should be accessible with high probability despite site crashes and network failures. To achieve availability, the server's state must be replicated. Consistency of the replicated state can be guaranteed by forcing service operations to occur in the same order at all sites. However, a weaker, causal ordering can preserve consistency in some applications [18], leading to better performance.

This paper describes a new technique that supports causal order, which we call lazy replication. An operation is executed at just one replica; updating of other replicas happens by lazy exchange of "gossip" messages—hence the name. The replicated service continues to provide service in spite of node failures and network partitions. We have applied the method to a number of applications, including distributed garbage collection [17], locating movable objects in a distributed system [14], deadlock detection [8], orphan detection [22], and deletion of unused versions in a hybrid concurrency control scheme [34].

Another system that can benefit from causally ordered operations is the familiar electronic mail system. Normally, the delivery order of mail messages sent by different clients to different recipients, or even to the same recipient, is unimportant, as is the delivery order of messages sent by a single client to different recipients. However, suppose client c1 sends a message to client c2 and then a later message to c3 that refers to information in the earlier message. If, as a result of reading c1's message, c3 sends an inquiry message to c2, c3 would expect its message to be delivered to c2 after c1's message. Therefore, read_mail and send_mail operations need to be causally ordered.

Applications that use causally ordered operations may occasionally require a stronger ordering. Our method allows this. Each operation has an ordering type; in addition to the causal operations, there are forced and immediate operations. Forced operations are performed in the same order (relative to one another) at all replicas. Their ordering relative to causal operations may differ at different replicas but is consistent with the causal order at all replicas. They would be useful in the mail system to guarantee that if two clients are attempting to add the same user-name on behalf of two different users simultaneously, only one would succeed. Immediate operations are performed at all replicas in the same order relative to all other operations. They have the effect of being performed immediately when the operation returns, and are ordered consistently with external events [11]. They would be useful to remove an individual from a classified mailing-list "at once," so that no messages addressed to that list would be delivered to that user after the remove operation returns.
It is easy to construct and use highly-available applications with our method. The user of a replicated application just invokes operations and can ignore replication and distribution, and also the ordering types of the operations. The application programmer supplies a nonreplicated implementation, ignoring complications due to distribution and replication, and defines the category of each operation (e.g., for a mail service the designer would indicate that send_mail and read_mail are causal, add_user and delete_user are forced, and delete_at_once is immediate). To determine the operation categories, the programmer can use techniques developed for determining permissible concurrency [13, 31, 32, 35].
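To make the division of labor concrete, the sketch below (not from the paper; the registry structure and function names are illustrative) shows one way a mail-service designer might record the ordering category declared for each operation alongside the nonreplicated implementation:

# Hypothetical sketch: declaring ordering categories for a mail service.
# The category names (causal, forced, immediate) follow the paper; the
# registry itself and the operation names are illustrative only.
CAUSAL, FORCED, IMMEDIATE = "causal", "forced", "immediate"

OPERATION_CATEGORIES = {
    "send_mail":      CAUSAL,     # updates ordered only by causality
    "read_mail":      CAUSAL,     # query; reflects causally preceding sends
    "add_user":       FORCED,     # totally ordered w.r.t. other forced ops
    "delete_user":    FORCED,
    "delete_at_once": IMMEDIATE,  # ordered w.r.t. all other operations
}

def category(op_name: str) -> str:
    """Return the ordering type the service should use for op_name."""
    return OPERATION_CATEGORIES[op_name]

A front end or replica could consult such a table to decide which of the three protocols described in Section 2 to run for a given call.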
Our method does not delay update operations (such as send_mail), and typically provides the response to a query (such as read_mail) in one message round trip. It will perform well in terms of response time, amount of stored state, number of messages, and availability in the presence of node and communication failures, provided most update operations are causal. Forced operations require more messages than causal operations; immediate operations require even more messages and can temporarily slow responses to queries.
However, these operations will have little impact on overall performance provided they are used infrequently. This is the case in the mail system, where sending and reading mail is much more frequent than adding or removing a user. Our system can be instantiated with only forced operations. In this case it provides the same order for all updates at all replicas, and will perform similarly to other replication techniques that guarantee this property (e.g., voting [10] or primary copy [28]). Our method generalizes these other techniques because it allows queries to be performed on stale data while ensuring that the information observed respects causality.

The idea of allowing an application to make use of operations with differing ordering requirements appears in the work on ISIS [3] and Psync [27], and in fact we support the same three orders as ISIS. These systems provide a reliable multicast mechanism that allows processes within a process group consisting of both clients and servers to communicate; a possible application of the mechanism is a replicated service. Our technique is a replication method. It is less general than a reliable multicast mechanism, but is better suited to providing replicated services that are distinct from clients, and as a result provides better performance: it requires many fewer messages than the process-group approach; the messages are smaller, and it can tolerate network partitions.

Our technique is based on the gossip approach first introduced by Fischer and Michael [9], and later enhanced by Wuu and Bernstein [36] and our own earlier work [15, 21]. We have extended the earlier work in two important ways: by supporting causal ordering for updates as well as queries and by incorporating forced and immediate operations to make a more generally applicable method. The implementation of causal operations is novel, and so is the method for combining the three types of operations.

The rest of the paper is organized as follows. Section 2 describes the implementation technique. The following sections discuss system performance, scalability, and related work. We conclude with a discussion of what we have accomplished.
2. THE REPLICATED SERVICE

Lazy replication is intended for an environment in which individual computers, or nodes, are connected by a communication network. Both the nodes and the network may fail, but we assume the failures are not Byzantine. The nodes are fail-stop processors. The network can partition, and messages can be lost, delayed, duplicated, and delivered out of order. The configuration of the system can change; nodes can leave and join the network at any time. We assume nodes have loosely synchronized clocks. There are practical protocols, such as NTP [26], that synchronize clocks with low cost in geographically distributed networks.

A replicated application is implemented by a service consisting of replicas running at different nodes in a network. To hide replication from clients, the system also provides front end code that runs at client nodes. To call an operation, a client makes a local call to the front end, which sends a call message to one of the replicas. The replica executes the requested operation and sends a reply message back to the front end. Replicas communicate new information (e.g., about updates) among themselves by lazily exchanging gossip messages.

There are two kinds of operations: update operations modify but do not observe the application state, while query operations observe the state but do not modify it. (Operations that both update and observe the state can be treated as an update followed by a query.) When requested to perform an update, the front end returns to the client immediately and communicates with the service in the background. To perform a query, the front end waits for the response from a replica, and then returns the result to the client.
how this happens. section we describe
implement tation port
the three
of causal the other
2.1 Causal
update) label.l
operations; two types
how the front
of operations. Section
end and service
Section
replicas
2.1 describes
2.2 extends
the
the replicas
about
together
the implemen-
implementation
to sup-
of operations.
Operations
In our method, operations. identifier,
types
the front
end informs
Every reply message uid, that names that
operation The label
o takes identifies
the causal
for an update operation invocation. In addition,
ordering
of
contains a unique every (query and
a set of uids as an argument; such a set is called a the updates whose execution must precede the
execution of o. In the mail service, for example, the front end can indicate that one send–mail must precede another by including the uid of the first
1 A similar
concept occurs in [3], but no practical
ACM
Transactions
on
way to implement
Computer
Systems,
it was described
Vol.
10,
No.
4, November
1992.
364
R, Lactin et al
.
send–mail in the label passed returns a value and also a label
to the second. that identifies
Finally, a query operation the updates reflected in the
value; this label is a superset of the argument label, update that must have preceded the query is reflected replicas thus perform the following operations: update (prev: label, op: op) returns query (prev: label, op: op) returns where
op describes
the actual
ensuring that every in the result. Service
(uid: uid) (newl: label, value: value)
operation
to be performed
(i.e.,
gives
its name
and arguments). A specification of the service for causal operations is given in Figure 1. We view an execution of a service as a sequence of events, one event for each update and query operation performed by a front end on behalf of a client. At some point between when an operation is called and when it returns, an event for it is appended to the sequence. An event records the arguments and results of its operation. In a query event q, q. preu is the input label, q. op defines the query operation, q.ualz~e is the result value, and q. newl is the result label; for update u, u. preu is the input label, u. op defines the update operation,
and
u. uid
is the
uid
assigned
by the
service.
If
e is an event
in
execution sequence E, P(e) denotes the set of events preceding e in E. Also, for a set S of events, S.label denotes the set of uids of update events in S. The specification describes the behavior of queries, since this is all the clients updates
(via the identified
front ends) can observe. by a query result label
The first correspond
clause states to calls made
ends. The second clause states that the result label updates plus possibly some additional ones. The third returned label is dependency complete: if some update fied
by the
depends
label,
then
on an update
dep(u, v) = (v.uid
so is every
update
u if it is constrained
that
identifies all required clause states that the operation u is identi-
u depends
to be after
that all by front
on. An
update
u
u:
e u.prev)
The dependency relation dep is acyclic because front ends do not create uids. The fourth clause defines the relationship between the value and the label returned by a query: the result returned must be computed by applying the query
q.op
to a state
Val
arrived
at by starting
with
the
initial
state
and
performing the updates identified by the label in an order consistent with the dependency relation. If dep(u, v), v.op is performed before u.op; operations not constrained by dep are performed in arbitrary order. Note that this clause guarantees that if the returned label is used later as input to another query, the result of that query will reflect the effects of all updates observed in the earlier query. The front end maintains a label for the service. It guarantees causality as follows: (1) Client calls to the seruice. The front end sends its label message and merges the uid or label in each reply message by performing (2) Client-to-client ACM
Transactions
a union
of the two sets.
communication. on
Computer
in every call with its label
Systems,
The Vol.
10,
No.
front
end
4, November
intercepts 1992
all
messages
Providing High Availability Using Lazy Replication
.
365
Let q be a query. Then
1. q.newl ⊆ P(q).label.
2. q.prev ⊆ q.newl.
3. u.uid ∈ q.newl ⟹ for all updates v s.t. dep(u, v), v.uid ∈ q.newl.
4. q.value = q.op(Val(q.newl)).

Fig. 1. Specification of the causal operations.
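The following sketch (illustrative names; labels modeled as plain uid sets rather than the compact representation introduced in Section 2.1.1) shows the label bookkeeping a front end performs to obtain the behavior specified above: it passes its current label as prev on every call and merges each returned uid or label back in by set union.

# Sketch of the front end's label bookkeeping (names are illustrative).
# A label is modeled here simply as a set of uids; the paper later refines
# this into a compact multipart-timestamp representation.
class LabelingFrontEnd:
    def __init__(self, replica):
        self.replica = replica
        self.label = set()                 # uids of updates we depend on

    def update(self, op):
        uid = self.replica.update(prev=set(self.label), op=op)
        self.label |= {uid}                # merge the returned uid
        return uid

    def query(self, op):
        newl, value = self.replica.query(prev=set(self.label), op=op)
        self.label |= newl                 # merge the returned label
        return value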
exchanged between its client message its client sends and by its client The
resulting
with order
and other clients. It adds its label to each merges the label in each message received
its label. may
be stronger
than
needed.
communicates with C2 without exposing any calls on service operations, it is not necessary C 1’s earlier
ones.
service.
Our
method
allows
For
example,
if client
information about to order C2’S later
a sophisticated
client
Cl
its earlier calls after
to use uids
and
labels directly to implement the causality that really occurs; we assume in the rest of the paper, however, that all client calls go through the front end. During normal operation a front end will always contact the same “preferred” replica. However, if the response is slow, it might send the request to a different replica or send it to several replicas in parallel. In addition, messages may be duplicated by the network. In spite of these duplicates, update
operations
must
the service
to recognize
associates
a unique
included
in every
2.1.1
be performed multiple
call
identifier,
message
Implementation
compact
representation
operation generate
is ready uids
mechanism,
or
once at each replica.
for the same update, cid,
sent by the front
with
each
For the method
for
and
labels
to be executed. All
multipart
a fast
In these
timestamp.
of that
to
properties
must
are provided
end cid
is
update. we need a
determine
replicas
A multipart
That
to be efficient,
way
addition,
To allow
the front
update.
end on behalf
Overview.
independently.
the
at most
requests
when be
an
able
to
by a single
t is a vector
timestamp
t=(tl, ....tn) where
n is the number
of replicas
in the service.
Each
integer counter, and the initial (zero) timestamp Timestamps are partially ordered in the obvious
Two
timestamps
mum.
(Multipart
t and
s are
timestamps
merged were
by taking used
part
contains way:
their
in Locus
is a nonnegative zero in each part.
component-wise
[30]
and
also in [9],
maxi[12],
[151, [211, and [36].) Both update
uids and labels are represented by multipart timestamps. Every operation is assigned a unique multipart timestamp as its uid. A label by merging timestamps; a label time stamp t identifies the updates is created
ACM
Transactions
on
Computer
Systems,
Vol.
10,
No.
4, November
1992.
366
R. Ladin et al,
●
relation is whose timestamps are less than or equal to t. The dependency implemented as follows: if an update u is identified by u. preu then u depends that represent labels, on U. Furthermore, if t and t‘ are two timestamps t s t‘ implies that t identifies a subset of the updates identified by t‘. A replica receives call messages from front ends and also gossip messages from other replicas. When a replica receives a call message for an update it has not seen before, appending
it processes
information
about
that
update
by assigning
it a timestamp
it to its log. The log also contains
about updates that were processed at other replicas and propagated replica in gossip messages. A replica maintains a local timestamp, that
identifies
the set of records
in its log and thus
expresses
and
information to this rep–ts,
the extent
of its
knowledge about updates. It increments its part of the timestamp each time it processes an update call message; therefore, the value of the replica’s part of rep–ts
corresponds
to the
number
of updates
it has
processed.
It
incre-
ments other parts of rep_ts when it receives gossip from other replicas. The value of any other part i of rep–ts counts the number of updates processed at replica i that have propagated to this replica via gossip. A replica executes updates in dependency order and maintains its current state in wal. When an update is executed, its uid timestamp is merged into the
timestamp
ual–ts,
which
is used
to determine
whether
an update
or
query is ready to execute; an operation op is ready if its label op.preu s val–ts. The replica keeps track of the cids of updates that have been executed in the
set
inval,
updates. Since stamp (because
and
uses the
the same the front
information
to avoid
duplicate
update may be assigned more end sent the update to several
executions
stamps of duplicates for an update are merged into val_ts. In this honor the dependency relation no matter which of the duplicates knows
about
(and
therefore
Figure 2 summarizes [ ] denotes a sequence, types
as indicated,
indicated.) information
and
includes
its uid in future
the state at a replica. oneof means a tagged ( ) denotes
a record,
of
than one uid timereplicas), the timeway we can a front end
labels).
(In the figure, { } denotes a set, union with component tags and with
components
In addition to information about updates, about acks; acks are discussed below.
the
and
log also
types
as
contains
The description above ignored two important implementation issues, controlling the size of the log and the size of inval. An update record can be discarded from the log as soon as it is known everywhere and has been reflected in val. In fact, if an update is known to be known everywhere, it will be ready to be executed and therefore will be reflected in val, for the following reason. A replica knows some update received gossip messages containing message receiver
includes receives
record u is known everywhere if it has u from all other replicas. Each gossip
enough of the sender’s log to guarantee that when the record u from replica i, it has also received (either in this
gossip message or an earlier one) all records processed at i before u. Therefore, if a replica has heard about u from all other replicas, it will know about all updates that u depends on, since these must have been performed before u (because of the constraint on front ends not to create uids). Therefore, u will be ready to execute. ACM
Transactions
on
Computer
Systems,
Vol
10,
No.
4. November
1992
Providing High Availability Using Lazy Replication node: mt log: {log-record} rep_ts: mpts val: value val_ts: mpts inval: (cid] ts_@bie: [mpts]
.
367
% replica’s id. % replica’s log 70 replica’s multlpart Limestamp YO replica’s wew of serwce state 7. timestamp associated with val % updates that panicipated in computmg val % ts_rable(p)= latest multlpart timestamp recclved from p
where log-record = < msg: op-type, mode int, ts: mpts > Op-type = oneof [ update < prev: mpts, op: op, cid: cid, time: time >, ack: < cid: cid, time time > ] mpts = [ mt ] Fig.
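A compact sketch of the state in Figure 2 and of the two multipart-timestamp operations used throughout Section 2.1 (component-wise merge and the partial order) is given below; field names follow the figure, while the class itself and its initialization are illustrative.

# Sketch of multipart timestamps and the replica state of Figure 2.
# Field names follow the figure; everything else is illustrative.
def merge(t, s):
    """Component-wise maximum of two multipart timestamps."""
    return [max(a, b) for a, b in zip(t, s)]

def leq(t, s):
    """t <= s iff every part of t is <= the corresponding part of s."""
    return all(a <= b for a, b in zip(t, s))

class ReplicaState:
    def __init__(self, node, n_replicas):
        self.node = node                       # this replica's id
        self.log = []                          # update and ack records
        self.rep_ts = [0] * n_replicas         # extent of replica's knowledge
        self.val = None                        # view of the service state
        self.val_ts = [0] * n_replicas         # timestamp associated with val
        self.inval = set()                     # cids of executed updates
        self.ts_table = [[0] * n_replicas      # latest timestamp received
                         for _ in range(n_replicas)]   # from each replica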
The
table
everywhere.
ts–table Every
is used gossip
ts_table( k ) contains replica ii. Note that large
The
of a replica.
to determine
whether
contains
time stamp timestamp
for it in ts_table.
a log
the
record
timestamp
this replica has of replica k must
If ts–table(k
is known
of its
sender;
received from be at least as
)j = t at replica
i, then
= V replicas
j, ts_table(j)r
~Od, > r.tsr
issue
is controlling
~Od,
at i. second
to discard update when
state
t update records i knows that replica k has learned of the first by replica j. Every record r contains the identity of the replica that it in field r-. node. Replica i can remove update record r from its log knows that r has been received by every replica, i.e., when
isknown(r)
holds
The
message
the largest the current
as the one stored
replica created created when it
2.
implementation
a cid c from
to ual messages
in the
inual future.
containing
only if the replica Such
a guarantee
information
about
the will
size of inual.
never
attempt
requires
an upper
c’s update
It is safe to apply
c’s
bound
on
can arrive
at the
replica. A front end will keep sending messages for an update until it receives a reply. Since reply messages can be lost, replicas have no way of knowing when the front end will stop sending these messages The front end does this by sending an acknowledgment ing the cid of the update separate on future
to one or more
unless it informs them. message ack contain-
of the replicas.
In most
applications,
ack messages will not be needed; instead, acks can be piggybacked calls. Acks are added to the log when they arrive at a replica and
are propagated
to other
replicas
in gossip
the
same
way
updates
are propa-
gated. (The case of volatile clients that crash before their acknowledgments reach the replicas can be handled by adding client crash counts to update messages and propagating crash information to replicas; a higher crash count would serve as an ack for all updates from that client with lower crash counts. ) Even though a replica has received an ack, it might still receive a message for the ack’s update since the network can deliver messages out of order. We deal with late messages by having each ack and update message contain the time at which it was created. Each time the front end sends an update or ack message, it includes in the message the current time of its node’s clock; if ACM
Transactions
on
Computer
Systems,
Vol.
10,
No.
4, November
1992.
368
.
R. Ladin et al.
multiple messages are sent for an update, they will contain The time in an ack must be greater than or equal to the time for the ack’s update.
An update
rn. time + 8< the time where 8 is greater least until a. time
After
this
they
time
of the replica’s
than
+ 8 < the time
message
any messages
because
it is “late”
if
ack a is kept
at
clock
the anticipated
of the replica’s
m is discarded
different times. in any message
network
delay.
Each
clock
for the
ack’s
update
will
be discarded
because
are late.
A replica
can discard
a cid c from
inval
as soon as an ack for c’s update
is
in the log and all records for c’s update have been discarded from the log. The former condition guarantees a duplicate call message for c’s update will not be accepted; the latter guarantees a duplicate will not be accepted from gossip (see Section 2.1.4 for a proof). A replica can discard ack a from the log once a is known everywhere; the cid of a’s update has been discarded from inual, and all call messages for a’s update are guaranteed to be late at the replica. Note that we rely here on clocks being monotonic. Typically clocks are monotonic both while nodes are up and across crashes. If they need to be adjusted this is done by slowing them down or speeding them up. Also, clocks are usually
stable;
periodically
and recovered
if not the
clock
after
value
can be saved
to stable
storage
[20]
a crash.z
For the system to run efficiently, loosely synchronized with a skew
clocks of server and client nodes should be clocks bounded by some ● . Synchronized
are not needed for correctness, but without them certain suboptimal situations can arise. For example, if a client node’s clock is slow, its messages may be discarded even though it just sent them. The delay 8 should be chosen to accommodate both the anticipated network delay and the clock skew. The value
for
this
parameter
can be as large
as needed
choosing a large delay only affects how long that we do not require reliable delivery within delivery failures by It may seem that ity of the system. A unable to carry out tion problems, and the system.
Such
because
the
servers remember 8; instead the front
penalty
of
acks. Note ends mask
resending messages. our reliance on synchronized clocks affects the availabilproblem could arise only if a node’s clock fails, the node is the clock synchronization protocol because of communicayet the node is able to communicate with other nodes in
a situation
is extremely
unlikely.
2.1.2 Processing at Each Replica. This section describes the processing of each kind of message. Initially, (1) rep_ts and val–ts are zero timestamps, (2) ts-table contains all zero timestamps, (3) val has the initial value, and (4) the log and inval are empty. The log and invcd are implemented as hash tables
hashed
on the cid.
2 Writes to stable storage can be infrequent; after a crash, a node must wait untd its clock IS later than the time on stable storage +@, where 13is a bound on how frequently writes to stable with replicas. storage happen, before communicating ACM
TransactIons
on
Computer
Systems,
Vol.
10, No.
4, November
1992
Providing High Availability Using Lazy Replication Processing from
an update
a front
clock) r.cid
end
message.
if it is late
or if it is a duplicate = u.cid
performs
is in
the
the following
(i.e.,
its cid c is in inual
If
the
message
is
or a record
not
the
update
record
r .= makeUpdateRecord(u,
and adds it to the local
r
its
:= merge( val_ts,
(5) Returns The
:= inval
comparable.
and For
another replica r. tsj > rep_tsj. u“ that
part
by one
associated
with
the
this
ith
part
execution
of
of the
i, ts)
performs
VO
on have
already
been
the op
r.ts)
{r.cid}
u
the update’s
rep_ts
replica
log.
ual := apply( wal, u. op ) inval
ith
ts, by replacing part of rep–ts.
(4) Executes u.op if all the updates that u depends incorporated into val. If u.prev < val–ts, then:
ual_ts
the
actions:
the timestamp for the update, argument u.preu with the ith
(3) Constructs update,
r such that
discarded,
(1) Advances its local timestamp rep-ts by incrementing while leaving all the other parts unchanged. (2) Computes the input
369
Replica i discards an update message u if u. time + 8< the time of the replica’s
(i.e.,
log).
.
timestamp
the
r. ts in a reply
timestamp
example,
u may
r.ts
assigned
depend
j, and which this replica In addition, this replica
u does not depend
message. to
u are
on update
u‘,
not
necessarily
which
happened
at
does not know about yet. In this case may know about some other update
cm, e.g., u“ happened
at replica
k, and therefore,
r. ts~< r6?&tsk. Processing compares
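Putting the five update-processing steps above together, a minimal sketch might look as follows; it assumes the ReplicaState, merge, and leq helpers from the earlier sketch, represents records as dictionaries, and omits the lateness and duplicate checks for brevity.

# Sketch of the update-processing steps described above (assumes the
# ReplicaState, merge and leq helpers from the earlier sketch; the record
# layout and apply() are illustrative).
def process_update(state, u, apply):
    i = state.node
    # (1) Advance the local timestamp in this replica's own part.
    state.rep_ts[i] += 1
    # (2)-(3) Compute the record's timestamp from u.prev and rep_ts[i],
    #         and append the record to the log.
    ts = list(u["prev"])
    ts[i] = state.rep_ts[i]
    record = {"type": "update", "prev": u["prev"], "op": u["op"],
              "cid": u["cid"], "time": u["time"], "node": i, "ts": ts}
    state.log.append(record)
    # (4) Execute the update if everything it depends on is already in val.
    if leq(u["prev"], state.val_ts):
        state.val = apply(state.val, u["op"])
        state.inval.add(u["cid"])
        state.val_ts = merge(state.val_ts, ts)
    # (5) Return the record's timestamp as the update's uid.
    return ts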
the
a query
message.
query’s
input
When label,
replica
q.prev,
i receives with
val–ts,
a query
message
q, it
which
identifies
all
updates reflected in ual. If q.prev < val_ts, it applies q. op to val and returns the result and val_ts. Otherwise, it waits since it needs more information. It can either wait for gossip messages from the other replicas or it might send a request to another replica to elicit the information. ValLts and q. prev can be compared part by part to determine which replicas have the missing information. Processing (1) Advances timestamp (2) Constructs
an ack message.
processes
an ack as follows:
its local timestamp rep-ts by incrementing the ith by one while leavi~g all the other parts unchanged. the adi
record
r ,= makei%ckl%ecord(
and adds it to the local (3) Sends
a reply
Note
ack records
that
A replica
message
r associated
a, i, rep_ts
this
execution
of the
of the ack:
)
log. to the front
do not enter ACM
with
part
Transactions
end.
inval. on
Computer
Systems,
Vol.
10,
No,
4, November
1992.
370
*
R Ladin et al, m. ts, the sender’s a gossip message. A gossip message contains and m. new, the sender’s log. The processing of the message
Processing timestamp, consists
of three
activities:
merging
the log in the message
with
the local
log,
computing the local view of the service state based on the new information, and discarding records from the log and from inual. When replica i receives a gossip message m from replica j, it discards m if m. ts < j’s (1) Adds
timestamp
in ts_table.
the new information
Otherwise,
it continues
in the message
log := log U {r ~ m.neu
I -(r.ts
as follows:
to the replica’s
log:
< rep-ts)}
(2) Merges the replica’s timestamp with the timestamp in the reflects the information known at the replica: that rep–ts rep_ts
=
merge(rep–ts,
(3)
Inserts all the update the set comp:
(4)
Computes
comp
while
do
s.t. 3 no r’ E comp
comp
Z= comp
– {r}
@ inual =
Discards replicas:
s.t. r’. ts < r.prev
then
apply( val, r.op) := inval
!= merge(
{r. cid}
u
val_ts,
r.ts)
from
the
ts-table:
ts–table(J)
log
=
update
m.ts
records
log if they
= log – {r G log I type(r ) = update
A
have
been
= inval
– {c ~ inval 3 no r’
(8)
Discards sufficient log
13 a E log G log
= log – {a G log I type(a) local
s.t. type(a)
s.t. type(r’)
ack records from the log time has passed and there
received
by all
isknown( r )}
(7) Discards records from inval if an ack for the update there is no update record for that update in the log: inval
into
< rep-ts}
comp
r.cid
to the value
of the object:
is not empty
inual
(6)
A r.preu
to be added
r from
val_ts
Updates
are ready
select
ual
(5)
that
= update
the new value
if
so
m. ts )
records
= {r E logotype
comp
message
is in the
= ack A acid
= update
A
r’.cid
log and
= c A = c}
if they are known everywhere and is no update for that ack in inual:
= ack A isknown(a)
A
a.
tzme + 8
d. ts for all duplicates
> r‘. tsh. Since this
6. When a record in the value.
r for
an update
r is deleted
from
the log at replica
PROOF.
When
u is deleted
from
i, ishown(
and therefore by Lemma 5, r. ts s rep–ts. This implies that Therefore r enters the set conzp; and either u is executed, in val;
and therefore
LEMMA
7.
any
replica
update-processing
code
for all
the log,
u is
r ) holds
at i
r. preu < rep_ts. or r’s cid is in
❑
earlier.
val–ts
It is easy to see that
PROOF.
the
At
u was executed
is true
❑
d.
< rep–ts.
the
claim
because
if
an
holds
initially.
update
is
It
is preserved
executed,
only
field
by i of
val_ts changes and val–tsL = rep_tsL. It is preserved by gossip processing from because for each record r in comp, r.prev < rep–ts. Since r. ts differs r.prev only in field j, where r was processed at j, and since r is known locally,
r. tsj s rep_tsJ
LEMMA
8.
u is reflected
For any update in the value. The
PROOF.
record
with
computation
by Lemma
proof
is inductive.
consider
D
u, if there exists a record
a zero timestamp. and
3.
The
Assume the
next
basis
step
the claim time
r for u s. t. r. ts < rep–ts,
is trivial holds
rep–ts
since
there
is no
up to some step in the
is modified.
Let
r
be an
update record for u s. t. r. ts < rep–ts after that step. We need to show that u is reflected in the value. First, consider a gossip-processing step. Since r. ts s rep–ts, we know u.prev s rep–ts. If r is in the log, it enters comp, and either
u is executed
already
reflected
now
or u’s
in the value.
cid is in the
set
inval;
If r is not in the log, then
3) that r was deleted (by Lemma reflected in the value. Therefore
and
therefore
r. ts < rep–ts
u is implies
from the log and by Lemma 6, u is already u is reflected in the we have shown that
value after a gossip-processing step. Now consider an update-processing step. before this step, the claim holds by the induction assumption. If . . t. < rep_fs after Otherwise, we have 1 (r. ts < rep–ts ) before this step and r. ts < rep–ts i’s the message was processed. In the processing of an update, replica timestamp rep–ts increases by one in part i with the other parts remaining unchanged. Therefore, the record being created in this step must be r and u .prev < rep-ts before this step occurred. Therefore, any u that furthermore u depends on has already been reflected in the value by the induction assumption, and u.prev s val–ts. Therefore z~is reflected in the value in this step. ACM
❑ TransactIons
on
Computer
Systems,
Vol.
10,
No.
4, November
1992
Prowdlng High Availabilky Using Lazy Replication
.
37!5
We are now ready to prove the fourth clause of the specification. a query returns ual and ual–ts. Lemmas 7 and 8 guarantee
Recall that that for any
update
in the value
record
ual. Therefore, the value.
r such that
r.ts
all updates
identified
We will
now
show
< val_ts,
that
r’s
update
by a query the updates
is reflected
output
label
are executed
are reflected only
in
once and in
the right order. To prove that updates are executed only once at any particular replica, we show that after an update u is reflected in the value, it will not be executed again. The cid c that entered the set inval when u was executed must be inval before u can be executed again. However, when c is deleted from deleted from inval no duplicate record for u is present in the log, and an ack for c’s update is present in the log. By Lemma 1, the presence of the ack guarantees that no future duplicate reenter the log. Furthermore, when for some update the
replica.
from the front end or the network will isknown(r ) holds c is deleted from inval,
r for c, so by Lemma
By Lemma
4, all duplicates
5, d. ts s rep_ts
and
processing code ensures that any future message will not reenter the replica’s log. We now show that updates are reflected with the dependency relation. val–ts and an update u such show that v is reflected the dependency relation
Consider that r‘s
d of r have
therefore
step
duplicate
d arriving
in the value
arrived
1 of the in
in an order
an update record update u depends
at
gossipa gossip
consistent
r such that r. ts < on v. We need to
in the value before u. From the implementation of we know there exists an update record s for v such
that u.prev > s. ts. Therefore, by Lemma 7 and 8 both u and v are reflected in val. Consider the first time u is reflected in the value. If this happens in the processing of an update message, at that step u.prev < val_ts; by Lemma 7, u.prev s rep–ts, and therefore by Lemma 8 v has already been reflected in val. If this happens while processing a gossip message, a record for u has entered Lemma
the set comp and so u.prev s rep_ts and therefore 3, s has entered the log at this replica and will
s. ts s rep–ts. By enter comp now
unless it has already been deleted. The code guarantees that when both s and a record for u are in comp either v is reflected before u because u. prev > s. ts, or v’s cid is in inval and so v was reflected earlier. If s was deleted from the log earlier, then by Lemma 6 v has already been reflected. Making queries
progress. indeed
eventually paired.
recover These
gate
information
and
Lemma
eventually
To ensure
return.
It
is easy
from
crashes
assumptions
also
between 3, replica
replicas. and
value
progress,
we
to see that and
network
guarantee Therefore, time
need updates
stamps
to
show
that
return
partitions
are
gossip
messages
that from
will
the
updates
provided
eventually will
gossip-processing
increase
and
and
replicas
queries
repropacode will
return.
Garbage collection. Update records will be removed from the log assuming replica crashes and network partitions are repaired eventually. Also, acks will be deleted from the log assuming crashes and partitions are repaired eventually and replica clocks advance, provided cids are deleted from inval. To show that cids are deleted from inval, we need the following lemma, ACM
Transactions
on
Computer
Systems,
Vol.
10,
No.
4, November
1992.
376
R. Ladln et al,
.
which guarantees from reappearing 9.
~EMMA
that an ack stays in inual: a’)
Isknown(
in the log long
= ach
A type(a)
*
enough
to prevent
d. ts < rep_ts
for
all
its cid
duplicates
d of a’s update. to Lemma
❑
PROOF.
Similar
5.
Lemmas
8 and 9 and the ack deletion
code guarantee
that
an ack is deleted
only after its cid is deleted from inzml. By Lemma 1, no duplicates of the update message from the front end or network arriving after this point will be accepted assuming the time in the ack is greater than or equal to the time in any message for the update. Furthermore, step 1 of gossip processing ensures that duplicates of the update record arriving in gossip will not enter assuming the replica’s log. Therefore cids will be removed from inual and partitions are repaired eventually and front ends send acks enough times. The
Availability.
service
quested
updates
updates updates.
and no others. For example,
u.prev means
uses
the
so it is important However, consider
= v.prev and assume that r. ts > s. ts, where
query
that
the
input
label
label
identify
to identify just
labels in fact do sometimes two independent updates
the
u
crashes with big
the
re-
required
identify extra and u with
at replica i before that u is processed r and s are the update records created
u. This by i for
u and v, respectively. When r. ts is merged into a label L, L also identifies u as a required update. Nevertheless, a replica never needs to delay a query waiting for such an “extra” update to arrive because the gossip propagation at scheme ensures that whenever the “required” update record r arrives some
replica,
there.
Note
than 2.2
all
update
that
the
the timestamp Other
Operation
records
timestamp
with
timestamps
of an “extra”
of some “required”
update
smaller
than
update
record
record
identified
r‘s
will
is always
be less
by a label.
Types
The section shows how the implementation can be extended to support forced and immediate updates. These two stronger orders are provided only for updates; queries continue to be ordered as specified by their input labels. As mentioned, each update operation has a declared ordering type that is defined when the replicated application is created. Recall that forced updates are performed in the same order (relative to one another) at all replicas; immediate updates are performed at all replicas in the same order relative to all other updates. Like causal updates, forced updates are ordered via labels and uids, but their uids are totally ordered with respect to one another. Labels now identify both the causal and forced updates, and the input label for an update or query identifies the causal and forced updates that must precede it. Uids are irrelevant for immediate updates, however, because the replicas constrain their ACM
ordering Transactions
relative on
Computer
to other Systems,
operations. Vol.
10,
No
4, November
1992.
Providing High Availability Using Lazy Replication Let q be a query.
.
377
Then
1. q.newl c P(q).label.
2. q.prev u G(q).label c 3. u.uld e q.newl
~
q.newl,
for all updates v s.t. dep(u,
4. q.value = q.op ( VaI (q,newl) Fig. 3.
The specification model and
query
event
operation
in execution
to and including second
of a service
clause
the most of the
service
is given
by a front immediate
specification
now
2.2.1
requires
that
of Forced
of client.
that
Updates.
we
If e is an
all events
precedes
queries
up
e in E. The
reflect
the most
as well as what the input except that the dependency
v) = if immediate (u) then v E P(u) else if immediate (v) then u @ P(v) else if forced (u) & forced (v) then v.uid else v.uid E u.prev
Implementation
3. As before
one for each update
the set containing update
recent immediate update in the execution requires. The other clauses are unchanged tion for clause 4 is extended as follows. dep(u,
in Figure
of events,
end on behalf
E, G(e) denotes
recent
E q.ncwl.
of the service.
as a sequence
performed sequence
v.uld
).
Specification
of the complete
the execution
v),
label rela-
< u.uid
We must
provide
a total
order
for forced updates and a way to relate them to causal updates and queries. This is accomplished as follows. As before, we represent uids and labels by multipart timestamps, but the timestamps have one additional field. Conceptually this field corresponds to an additional replica R that runs all forced updates; the order of the forced updates is the order assigned to them by this replica, and is reflected in R’s part of the timestamp. Therefore it is trivial to determine the order of forced v.uid if u.uid~ < v.uidR. Of course, availability we
allow
if only
one replica
problem different
updates:
if that
could
if u and v are forced run
forced
replica
were
to act
as R over
replicas
updates,
inaccessible. time.
updates, there
would
To solve We
u.uid
this
do this
pp. 240-251. 7. EL-ABBADI, data
A., SK~EN,
management.
Systems.
ACM,
8. FARRELL,
New
A. K.
Engineering
D., AND CRISTIAN, F.
In
Proceedings
York,
1985,
A deadlock
and
an
unreliable
Systems.
ACM,
New
10. GIFFORD, D. K. ium
on Operating
York,
pp.
12. HEDDAYA, 13. HERLIHY,
of
A quorum-consensus
Int.
1-2
W.
(Ott
Two
/Nov.
(Dec.
Computer
availablhty
of data
of Database
Proceedings Dec.
of the Seuentlz
1979).
ACM
Sympos-
SIGOPS,
New
21, 7
for
Ph.D.
Cambridge, clocks,
1978),
(July
Tech.
Rep. CSL-81-
Managing
distributed
event
for abstract
data
types.
ACM
Trans.
for
service
for
Computer
a distributed
Science,
environ-
Cambridge,
Mass.,
for constructing
highly-available
services.
A general
tool for rephcating on Parallel
and
distributed
Dmtnbuted
services In formatmn
highly
Mass.,
May
the
ordering
and
Research
Designing
Center,
a global
Computmg Palo
AND L~DIN,
garbage
R.
Alto, In
(Aug.
of t/Le 17th
Pa., July
1987).
Dept.
and
a technique
of Electrical
for
Engineering
dmand
1989. of events
name
service.
(Aug.
1986).
m a distributed
system.
Comm zm.
1986),
ACM,
New
recovery
Systems,
Proceedings New
York,
of the 5th pp.
in a distributed
distributed of the New
Symposum
on
1-10. data
storage
system.
5th
services ACM
and
Svmposzum
fault-tolerant
dis-
on PrzncLples
York.
E., AND WEIHL, York,
In ACM
1979.
Proceedings
International
IEEE,
on Computer
Crash
Calif.,
LIS~OV, B., SCHEIFLER, R., WALKER, Proceedings
services
MIT
Highly-available
collection.
Computing
available
dissertation,
558-565.
of DLstrLbuted
Transactions
gossip:
location Lab.
Conference
constructing
collection.
Time,
B.,
system.
1991).
Science,
Distributed
ACM
high
cm Pi-z nczples
computer
phase
A technique
20. LAMPSON, B. W., AND STURWS, H. E.
burgh,
In
Calif.,
method
MIT
of the FLrst International
19. LAMPSON, B, W.
In
to attain
393-420.
A method
18. LAMPORT, L.
22
of Electrical
1989).
a highly-avadable
B., AND SHRIRA, L.
garbage
tributed
Dept.
1988.
thesis.
3 (1988),
R.
21. LH~OV,
of Database
32-53.
MIT/LCS/TR-410,
Master’s
In Pmceedzngs
Xerox
July
SynLposz urn
in a decentralized
16. LADIN, R., MAZER, M. S., AND WOLMAN, A.
Principles
the
Grove,
replication
1986),
Constructing
Rep.
Atgorzth?nzca
J. 49,
1 (Feb.
4,
15. LADIN, R., LISIiOV,
ACM
for replicated
S. B. thesis,
Mass.,
data.
(Pacific
ANII WEIHL,
M.
1987.
trlbuted
Argus.
serializability
for replicated
storage
HSLT, M.,
D. J.
Systems
protocol Prmclples
1983.
Sci.:
Tech.
17. LADIN,
on
pp. 70-75.
Prmczples
Irq! Syst.
ment. Nov.
Mar
A.,
histories.
14. HWANG,
for
Cambridge,
Sacrificing
1982,
Systems
scheme MIT,
Proceedings
voting
Information
Corp.,
Comput.
York,
fault-tolerant
SymposLunL
150-162.
11. GIFFORD. D. K. 8, Xerox
A. In
Weighted
Fourth
pp. 215-229.
Science,
network.
An eftlclent
the
detection
Computer
9. FISCHER, M. J., AND MICHAEL, in
of
W.
SymposLum
Orphan
detection
on Fault-Tolerant
pp. 2-7
Vol. 10, No. 4, November
1992
(extended Computmg
abstract). (Pitts-
of
Providing High Availability Using Lazy Replication 23. LISIXN’,
B., BLOOM, T., GIFFORD, D., SCHEIFLER,
Mercury
system.
In
proceedings
of the 21st
R., AND WI?IHL,
Annual
Hawaii
W.
.
391
Communication
Conference
in the
on System
sciences
(Jan. 19S8). IEEE, New York 24. LISKOV, B. 25. LISKOV, tion
B., GHEMAWAT,
in the Harp
Systems 26. MILLS,
file
Internet
(Oct.
Network
Rep. RFC
27. MISHRA,
In
In Proceedings
1991). time
1059.
S., PETERSON,
Psync.
Commun.
ACM
31, 3 (Mar.
1988),
S., GRUBER, R., JOHNSON, P., SHRIRA, L., AND WILLMMS,
system.
Principles D. L.
using
pp. 178-187. programming in Argus.
Distributed
ACM,
New
protocol
July
of the Thirteenth
Symposium
Replica-
on Operating
York.
(version
1) specification
and
implementation.
DARPA-
1988.
L. L., AND SCHLICHTING,
Proceeding
ACM
300-312.
M.
of the Eighth
R. D.
Implementing
Symposium
on Reltable
fault-tolerant Distributed
objects
Systems
(Oct.
1989). 28. OKI, B. M., AND LISKOV, B. highly-available ples 29. OKI,
Viewstamped
distributed
of Distributed B. M.
systems.
CompzLting
Viewstamped
MIT/LCS/TR-423,
(Aug.
replication
MIT
Lab.
systems.
IEEE
Trans.
31. SCHMUCK, F. B. tems.
Tech.
Softw.
The
Eng.
for
32. SCHWARZ, P., AND SPECTOR, A. Syst.
2, 3 (Aug.
33. SKEEN,
D.
SIGMOD 34. WEIHL, Eng.
35. WEIHI.,
Non-blocking
W. E. W.,
Program.
Commit
Computing
Received
July
protocols shared
systems.
Mass.,
Tech.
Rep.
1988.
B., WALTON, E., CHOW, J.,
inconsistency
in
distributed
240–247.
Science,
Syst.
in asynchronous Cornell
abstract
Univ., types.
distributed
Ithaca, ACM
sys-
N. Y., 1988.
Trans.
Comput.
B.
1990;
revised
management
Implementation
7, 2 (Apr.
1984).
Proceedings
Systems
(April
for read-only
of
the
1984).
3rd
ACM
ACM,
New
actions.
SIGACTYork.
IEEE
Trans.
Softw.
types.
ACM
Trans.
55-64.
Proceedings
(Aug.
In
of Database
version
1987),
AND LISKOV,
Lang.
Protocols.
on Principles
Distributed
1 (Jan.
In
of mutual 1983),
broadcast
Synchronizing
36. WUU, G. T. J., AND BERNSTEIN, problems.
distributed
Cambridge,
Detection
of Computer
on Prmcz-
York.
available
Science,
3 (May
to support
S.ynLposium
1984).
Symposzum
SE-13,
C.
SE-9,
Dept.
New
highly
copy method
of the 7th ACM
G., STOU~HTON, A., WALW,R,
use of efficient
Rep. TR 88-926,
ACM,
for Computer
S., AND KLINE,
KISER,
A new primary
Proceedings
1988).
30. PARKER, D. S,, POPEK, G. J., RUDISIN, EDWAKDS, D.,
replication: In
of
A. J. the
ACM,
of resilient,
atomic
data
244-269. Efficient
Third
New
January
ACM
1985),
Annual
York,
1992;
Transactions
solutions
to the
Symposium
replicated on
Principles
log and
dictionary
of Distributed
pp. 233-242.
accepted
July
on Computer
1992
Systems,
Vol. 10, No
4, November
1992.