Disconnected Operation in the Coda File System

JAMES J. KISTLER and M. SATYANARAYANAN
Carnegie Mellon University

Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management - distributed file systems; D.4.5 [Operating Systems]: Reliability - fault tolerance; D.4.8 [Operating Systems]: Performance - measurements

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Disconnected operation, hoarding, optimistic replication, reintegration, second-class replication, server emulation
1. INTRODUCTION

Every serious user of a distributed system has faced situations where critical work has been impeded by a remote failure. His frustration is particularly acute when his workstation is powerful enough to be used standalone, but has been configured to be dependent on remote resources. An important instance of such dependence is the use of data from a distributed file system.

Placing data in a distributed file system simplifies collaboration between users, and allows them to delegate the administration of that data. The growing popularity of distributed file systems such as NFS [16] and AFS [19] attests to the compelling nature of these considerations.
Unfortunately, the users of these systems have to accept the fact that a remote failure at a critical juncture may seriously inconvenience them.

How can we improve this state of affairs? Ideally, we would like to enjoy the benefits of a shared data repository, but be able to continue critical work when that repository is inaccessible. We call the latter mode of operation disconnected operation, because it represents a temporary deviation from normal operation as a client of a shared repository.

In this paper we show that disconnected operation in a file system is indeed feasible, efficient and usable. The central idea behind our work is that caching of data, now widely used to improve performance, can also be exploited to enhance availability. We have implemented disconnected operation in the Coda File System at Carnegie Mellon University.

Our initial experience with Coda confirms the viability of disconnected operation. We have successfully operated disconnected for periods lasting one to two days. For a disconnection of this duration, the process of reconnecting and propagating changes typically takes about a minute. A local disk of 100MB has been adequate for us during these periods of disconnection. Trace-driven simulations indicate that a disk of about half that size should be adequate for disconnections lasting a typical workday.

2. DESIGN OVERVIEW

Coda is designed for an environment consisting of a large collection of untrusted Unix(1) clients and a much smaller number of trusted Unix file servers. The design is optimized for the access and sharing patterns typical of academic and research environments. It is specifically not intended for applications that exhibit highly concurrent, fine granularity data access.

(1) Unix is a trademark of AT&T Bell Telephone Labs.

Each Coda client has a local disk and can communicate with the servers over a high bandwidth network. At certain times, a client may be temporarily unable to communicate with some or all of the servers. This may be due to a server or network failure, or due to the detachment of a portable client from the network.

Clients view Coda as a single, location-transparent shared Unix file system. The Coda namespace is mapped to individual file servers at the granularity of subtrees called volumes. At each client, a cache manager (Venus) dynamically obtains and caches volume mappings.

Coda uses two distinct, but complementary, mechanisms to achieve high availability. The first mechanism, server replication, allows volumes to have read-write replicas at more than one server. The set of replication sites for a volume is its volume storage group (VSG). The subset of a VSG that is currently accessible is a client's accessible VSG (AVSG). The performance cost of server replication is kept low by caching on disks at clients and through the use of parallel access protocols. Venus uses a cache coherence protocol based on callbacks [9] to guarantee that an open file yields its latest copy in the AVSG. This guarantee is provided by servers notifying clients when their cached copies are no longer valid, each notification being referred to as a 'callback break'. Modifications in Coda are propagated in parallel to all AVSG sites, and eventually to missing VSG sites.

Disconnected operation, the second high availability mechanism used by Coda, takes effect when the AVSG becomes empty. While disconnected, Venus services file system requests by relying solely on the contents of its cache. Since cache misses cannot be serviced or masked, they appear as failures to application programs and users. When disconnection ends, Venus propagates modifications and reverts to server replication. Figure 1 depicts a typical scenario involving transitions between server replication and disconnected operation.
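To make the callback mechanism concrete, the sketch below shows one way a client-side cache manager might react to a callback break. It is an illustrative sketch only: the type and function names (fid_t, cache_entry, handle_callback_break) are our own assumptions and do not correspond to Coda's actual interfaces.

```c
/* Illustrative sketch only: how a client cache manager might handle a
 * callback break from a server.  All names here are hypothetical and do
 * not reflect Coda's real data structures or RPC interface. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned volume, vnode, uniquifier; } fid_t;

typedef struct cache_entry {
    fid_t  fid;
    bool   has_callback;        /* server promised to notify us of changes */
    bool   valid;               /* contents may be used without rechecking */
    struct cache_entry *next;
} cache_entry;

static cache_entry *cache_head;

/* Invoked when a server sends a callback break for 'fid'. */
void handle_callback_break(const fid_t *fid)
{
    for (cache_entry *e = cache_head; e != NULL; e = e->next) {
        if (e->fid.volume == fid->volume &&
            e->fid.vnode == fid->vnode &&
            e->fid.uniquifier == fid->uniquifier) {
            e->has_callback = false;   /* the callback promise is revoked   */
            e->valid = false;          /* revalidate or refetch on next open */
            return;
        }
    }
    /* No cached copy: nothing to do. */
}
```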
Earlier Coda papers [18, 19] have described server replication in depth. In contrast, this paper restricts its attention to disconnected operation. We discuss server replication only in those areas where its presence has significantly influenced our design for disconnected operation.

3. DESIGN RATIONALE

At a high level, two factors influenced our strategy for high availability. First, we wanted to use conventional, off-the-shelf hardware throughout our system. Second, we wished to preserve transparency by seamlessly integrating the high availability mechanisms of Coda into a normal Unix environment.

At a more detailed level, other considerations influenced our design. These include the need to scale gracefully, the advent of portable workstations, the very different resource, integrity, and security assumptions made about clients and servers, and the need to strike a balance between availability and consistency. We examine each of these issues in the following sections.
3.1 Scalability

Successful distributed systems tend to grow in size. Our experience with Coda's ancestor, AFS, had impressed upon us the need to prepare for growth a priori, rather than treating it as an afterthought [17]. We brought this experience to bear upon Coda in two ways. First, we adopted certain mechanisms that enhance scalability. Second, we drew upon a set of general principles to guide our design choices.

An example of a mechanism we adopted for scalability is callback-based cache coherence. Another such mechanism, whole-file caching, offers the added advantage of a much simpler failure model: a cache miss can only occur on an open, never on a read, write, seek, or close. This, in turn, substantially simplifies the implementation of disconnected operation. A partial-file caching scheme such as that of AFS-4 [22], Echo [8] or MFS [1] would have complicated our implementation and made disconnected operation less transparent.

A scalability principle that has had considerable influence on our design is the placing of functionality on clients rather than servers.
Only if integrity or security would have been compromised have we violated this principle. Another scalability principle we have adopted is the avoidance of system-wide rapid change. Consequently, we have rejected strategies that require election or agreement by large numbers of nodes. For example, we have avoided algorithms such as that used in Locus [23] that depend on nodes achieving consensus on the current partition state of the network.

3.2 Portable Workstations

Powerful, lightweight and compact laptop computers are commonplace today. It is instructive to observe how a person with data in a shared file system uses such a machine. Typically, he identifies files of interest and downloads them from the shared file system into the local name space for use while isolated. When he returns, he copies modified files back into the shared file system. Such a user is effectively performing manual caching, with write-back upon reconnection!

Early in the design of Coda we realized that disconnected operation could substantially simplify the use of portable clients. Users would not have to use a different name space while isolated, nor would they have to manually propagate changes upon reconnection. Thus portable machines are a champion application for disconnected operation.

The use of portable machines also gave us another insight. The fact that people are able to operate for extended periods in isolation indicates that they are quite good at predicting their future file access needs. This, in turn, suggests that it is reasonable to seek user assistance in augmenting the cache management policy for disconnected operation.

Functionally, involuntary disconnections caused by failures are no different from voluntary disconnections caused by unplugging portable computers. Hence Coda provides a single mechanism to cope with all disconnections. Of course, there may be qualitative differences: user expectations as well as the extent of user cooperation are likely to be different in the two cases.
3.3 First- vs. Second-Class Replication

If disconnected operation is feasible, why is server replication needed at all? The answer to this question depends critically on the very different assumptions made about clients and servers in Coda.

Clients are like appliances: they can be turned off at will and may be unattended for long periods of time. They have limited disk storage capacity, their software and hardware may be tampered with, and their owners may not be diligent about backing up the local disks. Servers are like public utilities: they have much greater disk capacity, they are physically secure, and they are carefully monitored and administered by professional staff.

It is therefore appropriate to distinguish between first-class replicas on servers, and second-class replicas (i.e., cache copies) on clients. First-class replicas are of higher quality: they are more persistent, widely known, secure, available, complete and accurate. Second-class replicas, in contrast, are inferior along all these dimensions. Only by periodic revalidation with respect to a first-class replica can a second-class replica be useful.

The function of a cache coherence protocol is to combine the performance and scalability advantages of a second-class replica with the quality of a first-class replica. When disconnected, the quality of the second-class replica may be degraded because the first-class replica upon which it is contingent is inaccessible. The longer the duration of disconnection, the greater the potential for degradation. Whereas server replication preserves the quality of data in the face of failures, disconnected operation forsakes quality for availability. Hence server replication is important because it reduces the frequency and duration of disconnected operation, which is properly viewed as a measure of last resort.

Server replication is expensive because it requires additional hardware. Disconnected operation, in contrast, costs little. Whether to use server replication or not is thus a trade-off between quality and cost. Coda does permit a volume to have a sole server replica. Therefore, an installation can rely exclusively on disconnected operation if it so chooses.
3.4 Optimistic vs. Pessimistic Replica Control

By definition, a network partition exists between a disconnected second-class replica and all its first-class associates. The choice between the two families of replica control strategies, pessimistic and optimistic [5], is therefore central to the design of disconnected operation. A pessimistic strategy avoids conflicting operations by disallowing all partitioned writes or by restricting reads and writes to a single partition. An optimistic strategy provides much higher availability by permitting reads and writes everywhere, and deals with the attendant danger of conflicts by detecting and resolving them after their occurrence.

A pessimistic approach towards disconnected operation would require a client to acquire shared or exclusive control of a cached object prior to disconnection, and to retain such control until reconnection. Possession of exclusive control by a disconnected client would preclude reading or writing at all other replicas. Possession of shared control would allow reading at other replicas, but writes would still be forbidden everywhere.

Acquiring control prior to voluntary disconnection is relatively simple. It is more difficult when disconnection is involuntary, because the system may have to arbitrate among multiple requesters. Unfortunately, the information needed to make a wise decision is not readily available. For example, the system cannot predict which requesters would actually use the object, when they would release control, or what the relative costs of denying them access would be.

Retaining control until reconnection is acceptable in the case of brief disconnections. But it is unacceptable in the case of extended disconnections. A disconnected client with shared control of an object would force the rest of the system to defer all updates until it reconnected. With exclusive control, it would even prevent other users from making a copy of the object. Coercing the client to reconnect may not be feasible, since its whereabouts may not be known. Thus, an entire user community could be at the mercy of a single errant client for an unbounded amount of time.

Placing a time bound on exclusive or shared control, as done in the case of leases [7], avoids this problem but introduces others. Once a lease expires, a disconnected client loses the ability to access a cached object, even if no one else in the system is interested in it. This, in turn, defeats the purpose of disconnected operation which is to provide high availability. Worse, updates already made while disconnected have to be discarded.

An optimistic approach has its own disadvantages. An update made at one disconnected client may conflict with an update at another disconnected or connected client. For optimistic replication to be viable, the system has to be more sophisticated. There needs to be machinery in the system for detecting conflicts, for automating resolution when possible, and for confining damage and preserving evidence for manual repair. Having to repair conflicts violates transparency, is an annoyance to users, and reduces the usability of the system.

We chose optimistic replication because we felt that its strengths and weaknesses better matched our design goals. The dominant influence on our choice was the low degree of write-sharing typical of Unix. This implied that an optimistic strategy was likely to lead to relatively few conflicts. An optimistic strategy was also consistent with our overall goal of providing the highest possible availability of data.

In principle, we could have chosen a pessimistic strategy for server replication even after choosing an optimistic strategy for disconnected operation. But that would have reduced transparency, because a user would have faced the anomaly of being able to update data when disconnected, but being unable to do so when connected to a subset of the servers. Further, many of the previous arguments in favor of an optimistic strategy also apply to server replication.

Using an optimistic strategy throughout presents a uniform model of the system from the user's perspective. At any time, he is able to read the latest data in his accessible universe, and his updates are immediately visible to everyone else in that universe. His accessible universe is usually the entire set of servers and clients. When failures occur, his accessible universe shrinks to the set of servers he can contact, and the set of clients that they, in turn, can contact. In the limit, when he is operating disconnected, his accessible universe consists of just his machine. Upon reconnection, his updates become visible throughout his now-enlarged accessible universe.
4. DETAILED DESIGN AND IMPLEMENTATION

In describing our implementation of disconnected operation, we focus on the client since this is where much of the complexity lies. Section 4.1 describes the physical structure of a client, Section 4.2 introduces the major states of Venus, and Sections 4.3 to 4.5 discuss these states in detail. A description of the server support needed for disconnected operation is contained in Section 4.5.
Fig. 2. Structure of a Coda client.
4.1 Client Structure

Because of the complexity of Venus, we made it a user-level process rather than part of the kernel. The latter approach may have yielded better performance, but would have been less portable and considerably more difficult to debug. Figure 2 illustrates the high-level structure of a Coda client.

Venus intercepts Unix file system calls via the widely used Sun Vnode interface [10]. Since this interface imposes a heavy performance overhead on user-level cache managers, we use a tiny in-kernel MiniCache to filter out many kernel-Venus interactions. The MiniCache contains no support for remote access, disconnected operation or server replication; these functions are handled entirely by Venus.

A system call on a Coda object is forwarded by the Vnode interface to the MiniCache. If possible, the call is serviced by the MiniCache and control is returned to the application. Otherwise, the MiniCache contacts Venus to service the call. This, in turn, may involve contacting Coda servers. Control returns from Venus via the MiniCache to the application program, updating MiniCache state as a side effect. MiniCache state changes may also be initiated by Venus on events such as callback breaks from Coda servers. Measurements from our implementation confirm that the MiniCache is critical for good performance [21].
4.2 Venus States

Logically, Venus operates in one of three states: hoarding, emulation, and reintegration. Figure 3 depicts these states and the transitions between them. Venus is normally in the hoarding state, relying on server replication but always on the alert for possible disconnection. Upon disconnection, it enters the emulation state and remains there for the duration of disconnection. Upon reconnection, Venus enters the reintegration state, resynchronizes its cache with its AVSG, and then reverts to the hoarding state. Since all volumes may not be replicated across the same set of servers, Venus can be in different states with respect to different volumes, depending on failure conditions in the system.

Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It transits to reintegration upon successful reconnection to an AVSG member, and thence to hoarding, where it resumes connected operation.
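The per-volume state machine just described can be summarized by the following sketch. The enum and event names are hypothetical, introduced only for exposition; Venus' real implementation is considerably more involved.

```c
/* Illustrative sketch of the per-volume Venus state machine. */
typedef enum { HOARDING, EMULATION, REINTEGRATION } venus_state;

typedef enum { DISCONNECTION, RECONNECTION, REINTEGRATION_DONE } venus_event;

venus_state next_state(venus_state s, venus_event e)
{
    switch (s) {
    case HOARDING:
        /* Connected operation; loss of the entire AVSG disconnects us. */
        return (e == DISCONNECTION) ? EMULATION : HOARDING;
    case EMULATION:
        /* Disconnected; contact with an AVSG member starts reintegration. */
        return (e == RECONNECTION) ? REINTEGRATION : EMULATION;
    case REINTEGRATION:
        /* Replay finished: resume connected operation; a new failure
         * during reintegration would drop us back to emulation. */
        if (e == REINTEGRATION_DONE) return HOARDING;
        if (e == DISCONNECTION)      return EMULATION;
        return REINTEGRATION;
    }
    return s;
}
```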
4.3 Hoarding

The hoarding state is so named because a key responsibility of Venus in this state is to hoard useful data in anticipation of disconnection. However, this is not its only responsibility. Rather, Venus must manage its cache in a manner that balances the needs of connected and disconnected operation. For instance, a user may have indicated that a certain set of files is critical but may currently be using other files. To provide good performance, Venus must cache the latter files. But to be prepared for disconnection, it must also cache the former set of files.

Many factors complicate the implementation of hoarding:

- File reference behavior, especially in the distant future, cannot be predicted with certainty.
- Disconnections and reconnections are often unpredictable.
- The true cost of a cache miss while disconnected is highly variable and hard to quantify.
- Activity at other clients must be accounted for, so that the latest version of an object is in the cache at disconnection.
- Since cache space is finite, the availability of less critical objects may have to be sacrificed in favor of more critical objects.

To address these concerns, we manage the cache using a prioritized algorithm, and periodically reevaluate which objects merit retention in the cache via a process known as hoard walking.
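As a rough illustration of the prioritized, equilibrium-driven cache management elaborated in Sections 4.3.1 and 4.3.2 below, the sketch that follows blends a hoard priority with an aging recency metric and states the equilibrium test. The weighting scheme and all names are our own assumptions for exposition, not the actual Venus policy, which additionally performs hierarchical (bottom-up) replacement.

```c
/* Illustrative sketch of prioritized cache management and the
 * equilibrium test used by hoard walking.  Weights and names are
 * invented for exposition. */
#include <stdbool.h>

#define HOARD_WEIGHT   0.75   /* hypothetical blend of hoard vs. recency */
#define RECENCY_WEIGHT 0.25

struct object {
    int    hoard_priority;    /* from the HDB; 0 if the object is not hoarded */
    double recency;           /* grows on reference, decays as the object
                                 ages out of the working set */
    bool   cached;
};

/* Current priority: a function of hoard priority and recent usage. */
static double current_priority(const struct object *o)
{
    return HOARD_WEIGHT * o->hoard_priority + RECENCY_WEIGHT * o->recency;
}

/* Equilibrium: no uncached object may outrank a cached one.  A hoard
 * walk fetches and evicts objects until this predicate holds. */
bool in_equilibrium(const struct object *objs, int n)
{
    double min_cached = 0.0, max_uncached = 0.0;
    bool have_cached = false, have_uncached = false;

    for (int i = 0; i < n; i++) {
        double p = current_priority(&objs[i]);
        if (objs[i].cached) {
            if (!have_cached || p < min_cached) min_cached = p;
            have_cached = true;
        } else {
            if (!have_uncached || p > max_uncached) max_uncached = p;
            have_uncached = true;
        }
    }
    return !have_cached || !have_uncached || max_uncached <= min_cached;
}
```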
(a)
# Personal files
a /coda/usr/jjk d+
a /coda/usr/jjk/papers 100:d+
a /coda/usr/jjk/papers/sosp 1000:d+

# System files
a /usr/bin 100:d+
a /usr/etc 100:d+
a /usr/include 100:d+
a /usr/lib 100:d+
a /usr/local/gnu d+
a /usr/local/rcs d+
a /usr/ucb d+

(b)
# X11 files (from X11 maintainer)
a /usr/X11/bin/X
a /usr/X11/bin/Xvga
a /usr/X11/bin/mwm
a /usr/X11/bin/startx
a /usr/X11/bin/xclock
a /usr/X11/bin/xinit
a /usr/X11/bin/xterm
a /usr/X11/include/X11/bitmaps c+
a /usr/X11/lib/app-defaults d+
a /usr/X11/lib/fonts/misc c+
a /usr/X11/lib/system.mwmrc

(c)
# Venus source files (shared among Coda developers)
a /coda/project/coda/src/venus 100:c+
a /coda/project/coda/include 100:c+
a /coda/project/coda/lib c+

Fig. 4. Sample hoard profiles. These are typical hoard profiles provided by a Coda user, an application maintainer, and a group of project developers. Each profile is interpreted separately by the HDB front-end program. The 'a' at the beginning of a line indicates an add-entry command. Other commands are delete an entry, clear all entries, and list entries. The modifiers following some pathnames specify nondefault priorities (the default is 10) and/or meta-expansion for the entry. Note that the pathnames beginning with '/usr' are actually symbolic links into '/coda'.

4.3.1 Prioritized Cache Management. Venus combines implicit and explicit sources of information in its priority-based cache management algorithm. The implicit information consists of recent reference history, as in traditional caching algorithms. Explicit information takes the form of a per-workstation hoard database (HDB), whose entries are pathnames identifying objects of interest to the user at the workstation.

A simple front-end program allows a user to update the HDB using command scripts called hoard profiles, such as those shown in Figure 4. Since hoard profiles are just files, it is simple for an application maintainer to provide a common profile for his users, or for users collaborating on a project to maintain a common profile. A user can customize his HDB by specifying different combinations of profiles or by executing front-end commands interactively. To facilitate construction of hoard profiles, Venus can record all file references observed between a pair of start and stop events indicated by a user.

To reduce the verbosity of hoard profiles and the effort needed to maintain them, Venus supports meta-expansion of HDB entries. As shown in Figure 4, if the letter 'c' (or 'd') follows a pathname, the command also applies to immediate children (or all descendants). A '+' following the 'c' or 'd' indicates that the command applies to all future as well as present children or descendants. A hoard entry may optionally indicate a hoard priority, with higher priorities indicating more critical objects.

The current priority of a cached object is a function of its hoard priority as well as a metric representing recent usage. The latter is updated continuously in response to new references, and serves to age the priority of objects no longer in the working set. Objects of the lowest priority are chosen as victims when cache space has to be reclaimed.

To resolve the pathname of a cached object while disconnected, it is imperative that all the ancestors of the object also be cached. Venus must therefore ensure that a cached directory is not purged before any of its descendants. This hierarchical cache management is not needed in traditional file caching schemes because cache misses during name translation can be serviced, albeit at a performance cost. Venus performs hierarchical cache management by assigning infinite priority to directories with cached children. This automatically forces replacement to occur bottom-up.

4.3.2 Hoard Walking. We say that a cache is in equilibrium, signifying that it meets user expectations about availability, when no uncached object has a higher priority than a cached object. Equilibrium may be disturbed as a result of normal activity. For example, suppose an object, A, is brought into the cache on demand, replacing an object, B. Further suppose that B is mentioned in the HDB, but A is not. Some time after activity on A ceases, its priority will decay below the hoard priority of B. The cache is no longer in equilibrium, since the cached object A has lower priority than the uncached object B.

Venus periodically restores equilibrium by performing an operation known as a hoard walk. A hoard walk occurs every 10 minutes in our current implementation, but one may be explicitly required by a user prior to voluntary disconnection. The walk occurs in two phases. First, the name bindings of HDB entries are reevaluated to reflect update activity by other Coda clients. For example, new children may have been created in a directory whose pathname is specified with the '+' option in the HDB. Second, the priorities of all entries in the cache and HDB are reevaluated, and objects are fetched or evicted as needed to restore equilibrium.

Hoard walks also address a problem arising from callback breaks. In traditional callback-based caching, data is refetched only on demand after a callback break. But in Coda, such a strategy may result in a critical object being unavailable should a disconnection occur before the next reference to it. Refetching immediately upon callback break avoids this problem, but ignores a key characteristic of Unix environments: once an object is modified, it is likely to be modified many more times by the same user within a short interval [15, 6]. An immediate refetch policy would increase client-server traffic considerably, thereby reducing scalability.

Our strategy is a compromise that balances availability, consistency and scalability. For files and symbolic links, Venus purges the object on callback break, and refetches it on demand or during the next hoard walk, whichever occurs earlier. If a disconnection were to occur before refetching, the object would be unavailable. For directories, Venus does not purge on callback break, but marks the cache entry suspicious. A stale cache entry is thus available should a disconnection occur before the next hoard walk or reference. The acceptability of stale directory data follows from its particular callback semantics. A callback break on a directory typically means that an entry has been added to or deleted from the directory. It is often the case that other directory entries and the objects they name are unchanged. Therefore, saving the stale copy and using it in the event of untimely disconnection causes consistency to suffer only a little, but increases availability considerably.

4.4 Emulation

In the emulation state, Venus performs many actions normally handled by servers. For example, Venus now assumes full responsibility for access and semantic checks. It is also responsible for generating temporary file identifiers (fids) for new objects, pending the assignment of permanent fids at reintegration. But although Venus is functioning as a pseudo-server, updates accepted by it have to be revalidated with respect to integrity and protection by real servers. This follows from the Coda policy of trusting only servers, not clients. To minimize unpleasant delayed surprises for a disconnected user, it behooves Venus to be as faithful as possible in its emulation.

Cache management during emulation is done with the same priority algorithm used during hoarding. Mutating operations directly update the cache entries of the objects involved. Cache entries of deleted objects are freed immediately, but those of other modified objects assume infinite priority so that they are not purged before reintegration. On a cache miss, the default behavior of Venus is to return an error code. A user may optionally request Venus to block his processes until cache misses can be serviced.

4.4.1 Logging. During emulation, Venus records sufficient information to replay update activity when it reintegrates. It maintains this information in a per-volume log of mutating operations called a replay log. Each log entry contains a copy of the corresponding system call arguments as well as the version state of all objects referenced by the call.

Venus uses a number of optimizations to reduce the length of the replay log, resulting in a log size that is typically a few percent of cache size. A small log conserves disk space, a critical resource during periods of disconnection. It also improves reintegration performance by reducing latency and server load.

One important optimization to reduce log length pertains to write operations on files. Since Coda uses whole-file caching, the close after an open of a file for modification installs a completely new copy of the file. Rather than logging the open, close, and intervening write operations individually, Venus logs a single store record during the handling of a close.

Another optimization consists of Venus discarding a previous store record for a file when a new one is appended to the log. This follows from the fact that a store record renders all previous versions of a file superfluous. The store record does not contain a copy of the file's contents, but merely points to the copy in the cache.
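The sketch below illustrates the store-record optimization just described: when a new store for a file is appended, any earlier store record for the same file is dropped. The log structure and names are illustrative assumptions, not Venus' actual log format.

```c
/* Illustrative sketch of one replay-log optimization: a new 'store'
 * record for a file supersedes any earlier 'store' for the same file.
 * The record layout here is invented for exposition. */
#include <stdlib.h>
#include <string.h>

typedef enum { OP_STORE, OP_MKDIR, OP_RMDIR, OP_RENAME /* ... */ } op_t;

typedef struct log_rec {
    op_t            op;
    char            path[256];      /* object the operation applies to */
    struct log_rec *next;
} log_rec;

/* Append a store record to the replay log, first discarding any earlier
 * store record for the same file (the new contents supersede the old). */
void log_store(log_rec **log, const char *path)
{
    /* Drop superseded store records. */
    for (log_rec **pp = log; *pp; ) {
        if ((*pp)->op == OP_STORE && strcmp((*pp)->path, path) == 0) {
            log_rec *dead = *pp;
            *pp = dead->next;
            free(dead);
        } else {
            pp = &(*pp)->next;
        }
    }
    /* Append the new record at the end of the log. */
    log_rec *n = calloc(1, sizeof *n);
    if (!n) return;
    n->op = OP_STORE;
    strncpy(n->path, path, sizeof n->path - 1);
    log_rec **tail = log;
    while (*tail) tail = &(*tail)->next;
    *tail = n;
}
```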
We are currently implementing two further optimizations to reduce the length of the replay log. The first generalizes the optimization described in the previous paragraph such that any operation which overwrites the effect of earlier operations may cancel the corresponding log records. An example would be the canceling of a store by a subsequent unlink or truncate. The second optimization exploits knowledge of inverse operations to cancel both the inverting and inverted log records. For example, a rmdir may cancel its own log record as well as that of the corresponding mkdir.

4.4.2 Persistence. A disconnected user must be able to restart his machine after a shutdown and continue where he left off. In case of a crash, the amount of data lost should be no greater than if the same failure occurred during connected operation. To provide these guarantees, Venus must keep its cache and related data structures in nonvolatile storage.

Meta-data, consisting of cached directory and symbolic link contents, status blocks for cached objects of all types, replay logs, and the HDB, is mapped into Venus' address space as recoverable virtual memory (RVM). Transactional access to this memory is supported by the RVM library [13] linked into Venus. The actual contents of cached files are not in RVM, but are stored as local Unix files.

The use of transactions to manipulate meta-data simplifies Venus' job enormously. To maintain its invariants Venus need only ensure that each transaction takes meta-data from one consistent state to another. It need not be concerned with crash recovery, since RVM handles this transparently. If we had chosen the obvious alternative of placing meta-data in local Unix files, we would have had to follow a strict discipline of carefully timed synchronous writes and an ad-hoc recovery algorithm.

RVM supports local, non-nested transactions with the basic properties of atomicity, permanence and serializability. To reduce commit latency, RVM allows a transaction to be labelled as no-flush, thereby avoiding a synchronous write to disk. To ensure persistence of no-flush transactions, the application must explicitly flush RVM's write-ahead log from time to time. When used in this manner, RVM provides bounded persistence, where the bound is the period between log flushes.

Venus exploits these capabilities of RVM to provide good performance at a constant level of persistence. When hoarding, Venus initiates log flushes infrequently, since a copy of the data is available on servers. Since servers are not accessible when emulating, Venus is more conservative and flushes the log more frequently. This lowers performance, but keeps the amount of data lost by a client crash within acceptable limits.
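The bounded-persistence idea can be sketched as follows. The functions begin_transaction, end_transaction_no_flush and flush_log are hypothetical stand-ins for a recoverable-virtual-memory package and are not RVM's real interface; the flush intervals are likewise invented.

```c
/* Sketch of bounded persistence with no-flush transaction commit.
 * All helpers below are hypothetical placeholders, not RVM's API. */
#include <time.h>

static void begin_transaction(void)        { /* start an atomic update     */ }
static void end_transaction_no_flush(void) { /* commit without a disk sync */ }
static void flush_log(void)                { /* force write-ahead log out  */ }

enum mode { CONNECTED_HOARDING, DISCONNECTED_EMULATING };

static time_t last_flush;

/* Longer flush interval while hoarding (servers hold a copy of the data),
 * shorter while emulating (a crash would otherwise lose more updates). */
static int flush_interval(enum mode m)
{
    return (m == CONNECTED_HOARDING) ? 600 : 30;   /* seconds, illustrative */
}

void mutate_metadata(enum mode m, void (*update)(void))
{
    begin_transaction();          /* take meta-data from one consistent   */
    update();                     /* state to another...                  */
    end_transaction_no_flush();   /* ...without a synchronous disk write  */

    /* Bounded persistence: data lost on a crash is limited to the window
     * since the last explicit flush of the write-ahead log. */
    time_t now = time(NULL);
    if (now - last_flush >= flush_interval(m)) {
        flush_log();
        last_flush = now;
    }
}
```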
4.4.3 Resource Exhaustion. It is possible for Venus to exhaust its nonvolatile storage during emulation. The two significant instances of this are the file cache becoming filled with modified files, and the RVM space allocated to replay logs becoming full.

Our current implementation is not very graceful in handling these situations. When the file cache is full, space can be freed by truncating or deleting modified files. When log space is full, no further mutations are allowed until reintegration has been performed. Of course, non-mutating operations are always allowed.

We plan to explore at least three alternatives to free up disk space while emulating. One possibility is to compress file cache and RVM contents. Compression trades off computation time for space, and recent work [2] has shown it to be a promising tool for cache management. A second possibility is to allow users to selectively back out updates made while disconnected. A third approach is to allow portions of the file cache and RVM to be written out to removable media such as floppy disks.

4.5 Reintegration

Reintegration is a transitory state through which Venus passes in changing roles from pseudo-server to cache manager. In this state, Venus propagates changes made during emulation, and updates its cache to reflect current server state. Reintegration is performed a volume at a time, with all update activity in the volume suspended until completion.

4.5.1 Replay Algorithm. The propagation of changes from client to AVSG is accomplished in two steps. In the first step, Venus obtains permanent fids for new objects and uses them to replace temporary fids in the replay log. This step is avoided in many cases, since Venus obtains a small supply of permanent fids in advance of need, while in the hoarding state. In the second step, the replay log is shipped in parallel to the AVSG, and executed independently at each member. Each server performs the replay within a single transaction, which is aborted if any error is detected.

The replay algorithm consists of four phases. In phase one the log is parsed, a transaction is begun, and all objects referenced in the log are locked. In phase two, each operation in the log is validated and then executed. The validation consists of conflict detection as well as integrity, protection, and disk space checks. Except in the case of store operations, execution during replay is identical to execution in connected mode. For a store, an empty shadow file is created and meta-data is updated to reference it, but the data transfer is deferred. Phase three consists exclusively of performing these data transfers, a process known as back-fetching. The final phase commits the transaction and releases all locks.

If reintegration succeeds, Venus frees the replay log and resets the priority of cached objects referenced by the log. If reintegration fails, Venus writes out the replay log to a local replay file in a superset of the Unix tar format. The log and all corresponding cache entries are then purged, so that subsequent references will cause refetch of the current contents at the AVSG. A tool is provided which allows the user to inspect the contents of a replay file, compare it to the state at the AVSG, and replay it selectively or in its entirety.
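The four-phase server-side replay can be outlined as below. All types and helper functions are hypothetical stand-ins so the outline compiles; they are not Coda's real server interfaces.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real server machinery. */
typedef struct { int nrecords; } replay_log;
typedef struct { int open; } txn;

static txn *begin_txn(void)                      { static txn t; t.open = 1; return &t; }
static int  lock_referenced_objects(txn *t, const replay_log *l) { (void)t; (void)l; return 0; }
static int  validate_and_execute(txn *t, const replay_log *l)    { (void)t; (void)l; return 0; }
static int  back_fetch_store_data(txn *t, const replay_log *l)   { (void)t; (void)l; return 0; }
static void commit_and_unlock(txn *t)            { t->open = 0; }
static void abort_and_unlock(txn *t)             { t->open = 0; }

/* Four-phase replay: parse/lock, validate+execute, back-fetch, commit.
 * Any error aborts the entire reintegration for this volume. */
int replay(const replay_log *log)
{
    txn *t = begin_txn();                         /* phase one   */
    if (lock_referenced_objects(t, log) != 0 ||
        validate_and_execute(t, log)    != 0 ||   /* phase two   */
        back_fetch_store_data(t, log)   != 0) {   /* phase three */
        abort_and_unlock(t);
        return -1;
    }
    commit_and_unlock(t);                         /* phase four  */
    return 0;
}

int main(void)
{
    replay_log log = { 0 };
    printf("replay: %s\n", replay(&log) == 0 ? "committed" : "aborted");
    return 0;
}
```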
Reintegration at finer granularity than a volume would reduce the latency perceived by clients, improve concurrency and load balancing at servers, and reduce user effort during manual replay. To this end, we are revising our implementation to reintegrate at the granularity of subsequences of dependent operations within a volume. Dependent subsequences can be identified using the precedence graph approach of Davidson [4]. In the revised implementation Venus will maintain precedence graphs during emulation, and pass them to servers along with the replay log.

4.5.2 Conflict Handling. Our use of optimistic replica control means that the disconnected operations of one client may conflict with activity at servers or other disconnected clients. The only class of conflicts we are concerned with are write/write conflicts. Read/write conflicts are not relevant to the Unix file system model, since it has no notion of atomicity beyond the boundary of a single system call.

The check for conflicts relies on the fact that each replica of an object is tagged with a storeid that uniquely identifies the last update to it. During phase two of replay, a server compares the storeid of every object mentioned in a log entry with the storeid of its own replica of the object. If the comparison indicates equality for all objects, the operation is performed and the mutated objects are tagged with a new storeid specified in the log entry.

If a storeid comparison fails, the action taken depends on the operation being validated. In the case of a store of a file, the entire reintegration is aborted. But for directories, a conflict is declared only if a newly created name collides with an existing name, if an object updated at the client or the server has been deleted by the other, or if directory attributes have been modified at the server and the client. This strategy of handling partitioned directory updates is consistent with our strategy in server replication [11], and was originally suggested by Locus [23].

Our original design for disconnected operation called for preservation of replay files at servers rather than clients. This approach would also allow damage to be confined by marking conflicting replicas inconsistent and forcing manual repair, as is currently done in the case of server replication. We are awaiting more usage experience to determine whether this is indeed the correct approach for disconnected operation.
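The storeid-based certification described above can be sketched as follows. The types and the directory-conflict helper are hypothetical; the finer directory checks (name collisions, deletions, attribute changes) are only indicated by a placeholder.

```c
/* Illustrative sketch of storeid-based conflict detection during replay. */
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t host; uint32_t uniquifier; } storeid_t;

typedef enum { OBJ_FILE, OBJ_DIRECTORY } obj_kind;

typedef struct {
    obj_kind  kind;
    storeid_t storeid;         /* identifies the last update to the replica */
} replica;

/* Hypothetical: detailed directory checks would go here. */
static bool directory_update_conflicts(const replica *server_copy)
{
    (void)server_copy;
    return false;
}

static bool same_storeid(storeid_t a, storeid_t b)
{
    return a.host == b.host && a.uniquifier == b.uniquifier;
}

/* Validate one log entry's object against the server's replica.  Returns
 * true if the operation may proceed; false aborts the reintegration. */
bool certify(const replica *server_copy, storeid_t logged_storeid)
{
    if (same_storeid(server_copy->storeid, logged_storeid))
        return true;                       /* no partitioned update: OK   */

    if (server_copy->kind == OBJ_FILE)
        return false;                      /* store/store conflict: abort */

    /* Directories tolerate independent updates unless names collide, an
     * updated object was deleted on the other side, or attributes were
     * changed on both sides. */
    return !directory_update_conflicts(server_copy);
}
```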
5. STATUS AND EVALUATION

Coda runs on IBM RTs, Decstation 3100s and 5000s, and 386-based laptops such as the Toshiba 5200. A small user community has been using Coda on a daily basis as its primary data repository since April 1990. All development work on Coda is done in Coda itself. As of July 1991 there were nearly 350MB of triply replicated data in Coda, with plans to expand to 2GB in the next few months.

A version of disconnected operation with minimal functionality was demonstrated in October 1990. A more complete version was functional in January 1991, and is now in regular use. We have successfully operated disconnected for periods lasting one to two days. Our experience with the system has been quite positive, and we are confident that the refinements under development will result in an even more usable system.

In the following sections we provide qualitative and quantitative answers to three important questions pertaining to disconnected operation. These are:

(1) How long does reintegration take?
(2) How large a local disk does one need?
(3) How likely are conflicts?

Table I. Time for Reintegration

                    Elapsed Time    Reintegration Time (seconds)           Size of Replay Log    Data Back-Fetched
                    (seconds)       Total    AllocFid   Replay   2nd Phase   Records   Bytes     (Bytes)
Andrew Benchmark    288 (3)         43 (2)   4 (2)      29 (1)   10 (1)      223       65,010    1,141,315
Venus Make          3,271 (28)      52 (4)   1 (0)      40 (1)   10 (3)      193       65,919    2,990,120

Note: This data was obtained with a Toshiba T5200/100 client (12MB memory, 100MB disk) reintegrating over an Ethernet with an IBM RT-APC server (12MB memory, 400MB disk). The values shown above are the means of three trials. Figures in parentheses are standard deviations.

5.1 Duration of Reintegration

In our experience, typical disconnected sessions of editing and program development lasting a few hours required about a minute for reintegration. To characterize reintegration speed more precisely, we measured the reintegration times after disconnected execution of two well-defined tasks. The first task is the Andrew benchmark [9], now widely used as a basis for comparing file system performance. The second task is the compiling and linking of the current version of Venus. Table I presents the reintegration times for these tasks.

The time for reintegration consists of three components: the time to allocate permanent fids, the time for the replay at the servers, and the time for the second phase of the update protocol used for server replication. The first component will be zero for many disconnections, due to the preallocation of fids during hoarding. We expect the time for the second component to fall, considerably in many cases, as we incorporate the last of the replay log optimizations described in Section 4.4.1. The third component can be avoided only if server replication is not used.

One can make some interesting secondary observations from Table I. First, the total time for reintegration is roughly the same for the two tasks even though the Andrew benchmark has a much smaller elapsed time. This is because the Andrew benchmark uses the file system more intensively. Second, reintegration for the Venus make takes longer, even though the number of entries in the replay log is smaller. This is because much more file data is back-fetched in the third phase of the replay. Finally, neither task involves any think time. As a result, their reintegration times are comparable to that of a much longer, but more typical, disconnected session in our environment.
Fig. 5. High-water mark of cache usage. This graph is based on a total of 10 traces from 5 active Coda workstations. The curve labelled "Avg" corresponds to the values obtained by averaging the high-water marks of all workstations. The curves labelled "Max" and "Min" plot the highest and lowest values of the high-water marks across all workstations. Note that the high-water mark does not include space needed for paging, the HDB or replay logs.
5.2 Cache Size

A local disk capacity of 100MB on our clients has proved adequate for our initial sessions of disconnected operation. To obtain a better understanding of the cache size requirements for disconnected operation, we analyzed file reference traces from workstations in our environment. The traces were obtained by instrumenting workstations to record information on every file system operation, regardless of whether the file was in Coda, AFS, or the local file system.

Our analysis is based on simulations driven by these traces. Writing and validating a simulator that precisely models the complex caching behavior of Venus would be quite difficult. To avoid this difficulty, we have modified Venus to act as its own simulator. When running as a simulator, Venus is driven by traces rather than requests from the kernel. Code to communicate with the servers, as well as code to perform physical I/O on the local file system, are stubbed out during simulation.

Figure 5 shows the high-water mark of cache usage as a function of time. The actual disk size needed for disconnected operation has to be larger, since both the explicit and implicit sources of hoarding information are imperfect. From our data it appears that a disk of 50-60MB should be adequate for operating disconnected for a typical workday.
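To make the trace-driven analysis concrete, here is a minimal sketch of how a high-water mark of cache occupancy could be computed from a reference trace. The trace format and logic are invented for illustration; the results in the paper come from replaying real traces through Venus itself.

```c
/* Minimal sketch: high-water mark of cache usage for a simplified trace,
 * assuming whole-file caching and no eviction. */
#include <stdio.h>
#include <string.h>

#define MAX_FILES 10000

struct ref { char path[256]; long size; };   /* one trace record */

long high_water_mark(const struct ref *trace, int n)
{
    static char seen[MAX_FILES][256];
    int nseen = 0;
    long usage = 0, peak = 0;

    for (int i = 0; i < n; i++) {
        int cached = 0;
        for (int j = 0; j < nseen; j++)
            if (strcmp(seen[j], trace[i].path) == 0) { cached = 1; break; }
        if (!cached && nseen < MAX_FILES) {
            /* First reference: the whole file is fetched into the cache. */
            strcpy(seen[nseen++], trace[i].path);
            usage += trace[i].size;
            if (usage > peak) peak = usage;
        }
    }
    return peak;   /* bytes of cache needed if nothing is ever evicted */
}

int main(void)
{
    struct ref trace[] = { {"/coda/usr/jjk/paper.tex", 40000},
                           {"/coda/project/coda/src/venus.c", 120000},
                           {"/coda/usr/jjk/paper.tex", 40000} };
    printf("high-water mark: %ld bytes\n", high_water_mark(trace, 3));
    return 0;
}
```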
Of course, user activity that is drastically different from what was recorded in our traces could produce significantly different results.

We plan to extend our work on trace-driven simulations in three ways. First, we will investigate cache size requirements for much longer periods of disconnection. Second, we will be sampling a broader range of user activity by obtaining traces from many more machines in our environment. Third, we will evaluate the effect of hoarding by simulating the traces together with hoard profiles that have been specified ex ante by users.

5.3 Likelihood of Conflicts

In our use of optimistic server replication in Coda for nearly a year, we have seen virtually no conflicts due to multiple users updating an object in different network partitions. While gratifying to us, this observation is subject to at least three criticisms. First, it is possible that our users are being cautious, knowing that they are dealing with an experimental system. Second, perhaps conflicts will become a problem only as the Coda user community grows larger. Third, perhaps extended voluntary disconnections will lead to many more conflicts.

To obtain data on the likelihood of conflicts at larger scale, we instrumented the AFS servers in our environment. These servers are used by over 400 computer science faculty, staff and graduate students for research, program development, and education. Their usage profile includes a significant amount of collaborative activity. Since Coda is descended from AFS and makes the same kind of usage assumptions, we can use this data to estimate how frequent conflicts would be if Coda were to replace AFS in our environment.

Every time a user modifies an AFS file or directory, we compare his identity with that of the user who made the previous mutation. We also note the time interval between mutations. For a file, only the close after an open for update is counted as a mutation; individual write operations are not counted. For directories, all operations that modify a directory are counted as mutations.

Table II presents our observations over a period of twelve months. The data is classified by volume type: user volumes containing private user data, project volumes used for collaborative work, and system volumes containing program binaries, libraries, header files and other similar data. On average, a project volume has about 2600 files and 280 directories, and a system volume has about 1600 files and 130 directories. User volumes tend to be smaller, averaging about 200 files and 18 directories, because users often place much of their data in their project volumes.

Table II shows that over 99% of all modifications were by the previous writer, and that the chances of two different users modifying the same object less than a day apart is at most 0.75%. We had expected to see the highest degree of write-sharing on project files or directories, and were surprised to see that it actually occurs on system files. We conjecture that a significant fraction of this sharing arises from modifications to system files by operators, who change shift periodically.
[Table II appears here.]
If system files are excluded, the absence of write-sharing is even more striking: more than 99.5% of all mutations are by the previous writer, and the chances of two different users modifying the same object within a week are less than 0.4%! This data is highly encouraging from the point of view of optimistic replication. It suggests that conflicts would not be a serious problem if AFS were replaced by Coda in our environment.

6. RELATED WORK

Coda is unique in that it exploits caching for both performance and high availability while preserving a high degree of transparency. We are aware of no other system, published or unpublished, that duplicates this key aspect of Coda.

By providing tools to link local and remote name spaces, the Cedar file system [20] provided rudimentary support for disconnected operation. But since this was not its primary goal, Cedar did not provide support for hoarding, transparent reintegration or conflict detection. Files were versioned and immutable, and a Cedar cache manager could substitute a cached version of a file on reference to an unqualified remote file whose server was inaccessible. However, the implementors of Cedar observe that this capability was not often exploited since remote files were normally referenced by specific version number.

Birrell and Schroeder pointed out the possibility of "stashing" data for availability in an early discussion of the Echo file system [14]. However, a more recent description of Echo [8] indicates that it uses stashing only for the highest levels of the naming hierarchy.

The FACE file system [3] uses stashing but does not integrate it with caching. The lack of integration has at least three negative consequences. First, it reduces transparency because users and applications deal with two different name spaces, with different consistency properties. Second, utilization of local disk space is likely to be much worse. Third, recent usage information from cache management is not available to manage the stash. The available literature on FACE does not report on how much the lack of integration detracted from the usability of the system.

An application-specific form of disconnected operation was implemented in the PCMAIL system at MIT [12]. PCMAIL allowed clients to disconnect, manipulate existing mail messages and generate new ones, and resynchronize with a central repository at reconnection. Besides relying heavily on the semantics of mail, PCMAIL was less transparent than Coda since it required manual resynchronization as well as preregistration of clients with servers.

The use of optimistic replication in distributed file systems was pioneered by Locus [23]. Since Locus used a peer-to-peer model rather than a client-server model, availability was achieved solely through server replication. There was no notion of caching, and hence of disconnected operation.

Coda has benefited in a general sense from the large body of work on transparency and performance in distributed file systems. In particular, Coda owes much to AFS [19], from which it inherits its model of trust and integrity, as well as its mechanisms and design philosophy for scalability.

7. FUTURE WORK

Disconnected operation in Coda is a facility under active development. In earlier sections of this paper we described work in progress in the areas of log optimization, granularity of reintegration, and evaluation of hoarding. Much additional work is also being done at lower levels of the system. In this section we consider two ways in which the scope of our work may be broadened.

An excellent opportunity exists in Coda for adding transactional support to Unix. Explicit transactions become more desirable as systems scale to hundreds or thousands of nodes, and the informal concurrency control of Unix becomes less effective. Many of the mechanisms supporting disconnected operation, such as operation logging, precedence graph maintenance, and conflict checking, would transfer directly to a transactional system using optimistic concurrency control. Although transactional file systems are not a new idea, no such system with the scalability, availability and performance properties of Coda has been proposed or built.

A different opportunity exists in the use of weakly connected operation, where connectivity is intermittent or of low bandwidth. Such conditions are found in networks that rely on voice-grade lines, or that use wireless technologies such as packet radio. The ability to mask failures, as provided by disconnected operation, is of value even with weak connectivity. But techniques which exploit and adapt to the communication opportunities at hand are also needed. Such techniques may include more aggressive write-back policies, compressed network transmission, partial file transfer and caching at intermediate levels.

8. CONCLUSION

Disconnected operation is a tantalizingly simple idea. All one has to do is to preload one's cache with critical data, continue normal operation until disconnection, log all changes made while disconnected and replay them upon reconnection.

Implementing disconnected operation is not so simple. It involves major modifications and careful attention to detail in many aspects of cache management. While hoarding, a surprisingly large volume and variety of interrelated state has to be maintained. When emulating, the persistence and integrity of client data structures become critical. During reintegration, there are dynamic choices to be made about the granularity of reintegration.

Only in hindsight do we realize the extent to which implementations of traditional caching schemes have been simplified by the guaranteed presence of a lifeline to a first-class replica. Purging and refetching on demand, a strategy often used to handle pathological situations in those implementations, is not viable when supporting disconnected operation. However, the obstacles to realizing disconnected operation are not insurmountable. Rather, the central message of this paper is that disconnected operation is indeed feasible, efficient and usable.

One way to view our work is to regard it as an extension of the idea of write-back caching. Whereas write-back caching has hitherto been used for performance, we have shown that it can be extended to mask temporary failures too. A broader view is that disconnected operation allows graceful transitions between states of autonomy and interdependence in a distributed system. Under favorable conditions, our approach provides all the benefits of remote data access; under unfavorable conditions, it provides continued access to critical data. We are certain that disconnected operation will become increasingly important as distributed systems grow in scale, diversity and vulnerability.
ACKNOWLEDGMENTS

We wish to thank Lily Mummert for her invaluable assistance in collecting and postprocessing the file reference traces used in Section 5.2, and Dimitris Varotsis, who helped instrument the AFS servers which yielded the measurements of Section 5.3. We also wish to express our appreciation to past and present contributors to the Coda project, especially Puneet Kumar, Hank Mashburn, Maria Okasaki, and David Steere.
REFERENCES

1. BURROWS, M. Efficient data sharing. PhD thesis, Univ. of Cambridge, Computer Laboratory, Dec. 1988.
2. CATE, V., AND GROSS, T. Combining the concepts of compression and caching for a two-level file system. In Proceedings of the 4th ACM Symposium on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr. 1991), pp. 200-211.
3. COVA, L. L. Resource management in federated computing environments. PhD thesis, Dept. of Computer Science, Princeton Univ., Oct. 1990.
4. DAVIDSON, S. B. Optimism and consistency in partitioned distributed database systems. ACM Trans. Database Syst. 9, 3 (Sept. 1984), 456-481.
5. DAVIDSON, S. B., GARCIA-MOLINA, H., AND SKEEN, D. Consistency in partitioned networks. ACM Comput. Surv. 17, 3 (Sept. 1985), 341-370.
6. FLOYD, R. A. Transparency in distributed file systems. Tech. Rep. TR 272, Dept. of Computer Science, Univ. of Rochester, 1989.
7. GRAY, C. G., AND CHERITON, D. R. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the 12th ACM Symposium on Operating System Principles (Litchfield Park, Ariz., Dec. 1989), pp. 202-210.
8. HISGEN, A., BIRRELL, A., MANN, T., SCHROEDER, M., AND SWART, G. Availability and consistency tradeoffs in the Echo distributed file system. In Proceedings of the Second Workshop on Workstation Operating Systems (Pacific Grove, Calif., Sept. 1989), pp. 49-54.
9. HOWARD, J. H., KAZAR, M. L., MENEES, S. G., NICHOLS, D. A., SATYANARAYANAN, M., SIDEBOTHAM, R. N., AND WEST, M. J. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 51-81.
10. KLEIMAN, S. R. Vnodes: An architecture for multiple file system types in Sun UNIX. In Summer Usenix Conference Proceedings (Atlanta, Ga., June 1986), pp. 238-247.
11. KUMAR, P., AND SATYANARAYANAN, M. Log-based directory resolution in the Coda file system. Tech. Rep. CMU-CS-91-164, School of Computer Science, Carnegie Mellon Univ., 1991.
12. LAMBERT, M. PCMAIL: A distributed mail system for personal computers. DARPA-Internet RFC 1056, 1988.
13. MASHBURN, H., AND SATYANARAYANAN, M. RVM: Recoverable virtual memory user manual. School of Computer Science, Carnegie Mellon Univ., 1991.
14. NEEDHAM, R. M., AND HERBERT, A. J. Report on the Third European SIGOPS Workshop: "Autonomy or Interdependence in Distributed Systems." SIGOPS Rev. 23, 2 (Apr. 1989), 3-19.
15. OUSTERHOUT, J., DA COSTA, H., HARRISON, D., KUNZE, J., KUPFER, M., AND THOMPSON, J. A trace-driven analysis of the 4.2BSD file system. In Proceedings of the 10th ACM Symposium on Operating System Principles (Orcas Island, Wash., Dec. 1985), pp. 15-24.
16. SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., AND LYON, B. Design and implementation of the Sun network filesystem. In Summer Usenix Conference Proceedings (Portland, Ore., June 1985), pp. 119-130.
17. SATYANARAYANAN, M. On the influence of scale in a distributed system. In Proceedings of the 10th International Conference on Software Engineering (Singapore, Apr. 1988), pp. 10-18.
18. SATYANARAYANAN, M., KISTLER, J. J., KUMAR, P., OKASAKI, M. E., SIEGEL, E. H., AND STEERE, D. C. Coda: A highly available file system for a distributed workstation environment. IEEE Trans. Comput. 39, 4 (Apr. 1990), 447-459.
19. SATYANARAYANAN, M. Scalable, secure, and highly available distributed file access. IEEE Comput. 23, 5 (May 1990), 9-21.
20. SCHROEDER, M. D., GIFFORD, D. K., AND NEEDHAM, R. M. A caching file system for a programmer's workstation. In Proceedings of the 10th ACM Symposium on Operating System Principles (Orcas Island, Wash., Dec. 1985), pp. 25-34.
21. STEERE, D. C., KISTLER, J. J., AND SATYANARAYANAN, M. Efficient user-level cache file management on the Sun Vnode interface. In Summer Usenix Conference Proceedings (Anaheim, Calif., June 1990), pp. 325-331.
22. Decorum File System. Transarc Corporation, Jan. 1990.
23. WALKER, B., POPEK, G., ENGLISH, R., KLINE, C., AND THIEL, G. The LOCUS distributed operating system. In Proceedings of the 9th ACM Symposium on Operating System Principles (Bretton Woods, N.H., Oct. 1983), pp. 49-70.
Received June 1991; revised August 1991; accepted September 1991