Disconnected operation in the Coda File System - CiteSeerX

3 downloads 0 Views 2MB Size Report
hoarding, optimistic replication, reintegration, second-class replication, .... protocol based on callbacks. [9] to guarantee that an open file yields its latest. 1Unix is ...
Kistler92 Disconnected System JAMES

J. KISTLER

Carnegie

Mellon

Operation

in the Coda File

and M. SATYANARAYANAN

University

Disconnected

operation

data during

temporary

is a mode of operation that enables a client to continue accessing critical failures of a shared data repository. An important, though not exclusive,

application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability. Categories and Subject Descriptors: D,4.3 [Operating Systems]: Systems]: Reliability— Management: —distributed file systems; D.4.5 [Operating D.4.8 [Operating Systems]: Performance —nzeasurements General

Terms: Design, Experimentation,

Additional Key Words and Phrases: reintegration, second-class replication,

Measurement,

Performance,

Disconnected operation, server emulation

File fault

Systems tolerance;

Reliability

hoarding,

optimistic

replication,

1. INTRODUCTION Every serious user of a distributed system has faced situations work has been impeded by a remote failure. His frustration acute when his workstation is powerful has been configured to be dependent inst ante

of such dependence

Placing users, growing

data

and

is the use of data

in a distributed

allows

popularity

them

enough to be used on remote resources.

file

to delegate

of distributed

file

from

standalone, but An important

a distributed

system

simplifies

the

administration

systems

where critical is particularly

collaboration

such as NFS

of that

file

syst em. between

data.

[161 and AFS

The [191

This work was supported by the Defense Advanced Research Projects Agency (Avionics Lab, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, 45433-6543 under Contract F33615-90-C-1465, ARPA Order 7597), National Science Foundation (PYI Award and Grant ECD 8907068), IBM Corporation (Faculty Development Award, Graduate Fellowship, and Research Initiation Grant), Digital Equipment Corporation (External Research Project Grant), and Bellcore (Information Networking Research Grant). Authors’ address: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. @ 1992 ACM 0734-2071/92/0200-0003 $01.50 ACM Transactions

on Computer Systems, Vol. 10, No. 1, February 1992, Pages 3-25.

J J. Kistler and M, Satyanarayanan

4.

attests to the compelling users of these systems critical

juncture

How

may

nature of these considerations. Unfortunately, have to accept the fact that a remote failure

seriously

can we improve

the benefits when

of a shared

that

repository

inconvenience

this

state

data

repository,

them.

of affairs?

Ideally,

but

is inaccessible.

We

the at a

we would

like

be able to continue call

the

latter

to enjoy

critical

mode

work

of operation

disconnected operation, because it represents a temporary deviation from normal operation as a client of a shared repository. In this paper we show that disconnected operation in a file system is indeed feasible, efficient and usable. The central idea behind our work is that caching

of

exploited

now

data,

to enhance

widely

used

availability.

to

improve

have

We

performance,

implemented

can

tion in the Coda File System at Carnegie Mellon University. Our initial experience with Coda confirms the viability operation.

We have

to two days.

successfully

operated

For a disconnection

disconnected

of this

duration,

and propagating changes typically takes 100MB has been adequate for us during

2. DESIGN Coda

is

servers.

Unix and

for

an

1 clients

The design

academic

half

lasting

one

of reconnecting

A local disk of of disconnection. that

size should

be

OVERVIEW

designed

untrusted

the process

of about workday.

be

opera-

of disconnected

for periods

about a minute. these periods

Trace-driven simulations indicate that a disk adequate for disconnections lasting a typical

also

disconnected

environment and

a much

is optimized

research

consisting smaller

of a large

number

for the access and sharing

environments.

It

is

collection

of trusted patterns

specifically

not

of

Unix

file

typical

of

intended

for

applications that exhibit highly Each Coda client has a local

concurrent, fine granularity disk and can communicate

data access. with the servers

over a high bandwidth network. ily unable to communicate with

At certain times, a client may be temporarsome or all of the servers. This may be due to

a server or network failure, or due to the detachment from the network. Clients view Coda as a single, location-transparent

shared

tem.

servers

The Coda namespace

larity

of subtrees

called

is mapped volumes.

At

to individual each

of a portable

file

client,

a cache

The first mechanism, replicas at more than

server one

allows

replication,

server.

The

set

sys-

( Venus)

to achieve volumes

of replication

high

to have sites

for

is

protocol

based

1Unix

its

volume

on callbacks

is a trademark

ACM Transactions

storage

of AT&T

group

( VSG).

[9] to guarantee

Bell Telephone

that

subset

a

of a VSG that is accessible VSG ( A VSG). The performance currently accessible is a client’s cost of server replication is kept low by caching on disks at clients and through the use of parallel access protocols. Venus uses a cache coherence volume

The

file

at the granu-

manager

dynamically obtains and caches volume mappings. Coda uses two distinct, but complementary, mechanisms availability. read-write

Unix

client

an open file

Labs

on Computer Systems, Vol. 10, No 1, February

1992,

yields

its latest

Disconnected

Operation m the Coda File System

.

Icopy in the AVSG. This guarantee is provided by servers notifying when their cached copies are no longer valid, each notification being to as a ‘callback

break’.

all AVSG sites, Disconnected

Modifications

and eventually operation, the

mechanism While

Venus cache.

services file system requests Since cache misses cannot application

Venus

propagates

depicts

a typical

and disconnected Earlier discuss cantly

this

3. DESIGN At

a high

used

by

disconnected,

by relying solely on the contents of its be serviced or masked, they appear and

and

users.

reverts

involving

When

to server

transitions

[18, 19] have restricts

replication

disconnection replication.

between

server

ends, Figure

1

replication

its

only

our design

described

server

attention

in those

to

areas

for disconnected

replication

disconnected

where

in depth.

In

operation.

its presence

We

has signifi-

operation.

RATIONALE level,

we

empty.

to

operation.

influenced

First,

scenario

paper

server

becomes

programs

modifications

Coda papers

contrast,

the

clients referred

in parallel

AVSG

takes

to

when

are propagated

to missing VSG sites. second high availability

Coda,

as failures

effect

in Coda

5

two

wanted

our system.

factors

to

Second,

use

influenced

our

conventional,

we wished

strategy

off-the-shelf

to preserve

for

high

availability.

hardware

throughout

by seamlessly

transparency

inte-

grating the high availability mechanisms of Coda into a normal environment. At a more detailed level, other considerations influenced our design. the advent of portable workstations, include the need to scale gracefully,

These the

resource, integrity, and very different clients and servers, and the need to strike

about and

We examine

consistency.

3.1

each of these

security

a balance

issues

assumptions made availability between

in the following

Unix

sections.

Scalability

Successful Coda’s

distributed

systems

tend

ancestor,

AFS,

had impressed

rather

than

treating

a priori,

it

to

grow

upon

in

size.

Our

experience

us the need to prepare

as an afterthought

[17].

with

for growth

We brought

this

experience to bear upon Coda in two ways. First, we adopted certain mechanisms that enhance scalability. Second, we drew upon a set of general principles to guide our design choices. An example of a mechanism we adopted for scalability is callback-based whole-file caching, offers the cache coherence. Another such mechanism added occur

advantage of a much simpler failure on an open, never on a read, write,

substantially partial-file [1] would

model: a cache seek, or close.

miss This,

can only in turn,

simplifies the implementation of disconnected operation. A caching scheme such as that of AFS-4 [22], Echo [8] or MFS have complicated our implementation and made disconnected

operation less transparent. A scalability principle that has had considerable of functionality on clients rather than the placing

influence on our design servers. Only if integrity

ACM Transactions on Computer Systems, Vol

is or

10, No 1, February 1992

6.

J J. Kistler and M Satyanarayanan

‘m

_.___—l

ACM TransactIons on Computer Systems, Vol. 10, No 1, February

1992

Disconnected security Another rapid

would

been

compromised

scalability principle change. Consequently,

or agreement algorithms consensus

Powerful,

by

large

have

we have adopted we have rejected

numbers

such as that on the current

Portable

3.2

have

Operation in the Coda File System

of nodes.

this

principle.

is the avoidance ofsystem-wide strategies that require election For

example,

used in Locus [23] that depend partition state of the network.

lightweight

and compact

to observe

how

laptop a person

we

have

on nodes

avoided achieving

computers

are commonplace

with

in a shared

uses such a machine. Typically, he identifies them from the shared file system into the When

he returns,

he copies

substantially simplify use a different name propagate

today.

file

system

files

back

manual that

into

the

caching,

shared

with

disconnected

file

write-back

operation

could

the use of portable clients. Users would not have to space while isolated, nor would they have to man-

changes

champion application The use of portable

data

files of interest and downloads local name space for use while

modified

system. Such a user is effectively performing upon reconnection! Early in the design of Coda we realized

ually

violated

7

Workstations

It is instructive

isolated.

we

o

upon

reconnection.

Thus

portable

for disconnected operation. machines also gave us another

machines

insight.

The

are

fact

a

that

people are able to operate for extended periods in isolation indicates that they are quite good at predicting their future file access needs. This, in turn, suggests

that

it

is reasonable

cache management policy involuntary Functionally,

to

seek

user

for disconnected disconnections

uoluntary disconnections from Hence Coda provides a single

assistance

in

augmenting

operation. caused by failures

caused by unplugging mechanism to cope with

are no different

portable computers. all disconnections. Of

course, there may be qualitative differences: user expectations as well extent of user cooperation are likely to be different in the two cases. First- vs. Second-Class

3.3

If

disconnected

Clients unattended

as the

Replication

operation

all? The answer to this assumptions made about

the

why

is server

replication

question depends clients and servers

is feasible,

critically in Coda.

on the

needed very

at

different

are like appliances: they can be turned off at will and may be for long periods of time. They have limited disk storage capacity,

their software and hardware may be tampered with, and their owners may not be diligent about backing up the local disks. Servers are like public utilities: they have much greater disk capacity, they are physically secure, and they It

are carefully

is therefore

servers,

and

monitored

appropriate

second-class

and administered

to distinguish replicas

(i.e.,

cache

by professional

between copies)

first-class

on clients.

staff. replicas

on

First-class

replicas are of higher quality: they are more persistent, widely known, secure, available, complete and accurate. Second-class replicas, in contrast, are inferior along all these dimensions. Only by periodic revalidation with respect to a first-class replica can a second-class replica be useful. ACM Transactions

on Computer Systems, Vol. 10, No. 1, February 1992.

J. J. Kistler and M. Satyanarayanan

8.

The function and

of a cache

scalability

first-class

coherence

advantages

replica.

When

protocol

is to combine

of a second-class disconnected,

replica

the quality

with

in

for degradation. the

face

ability.

Hence

quency

Whereas

of failures,

and

server

duration

server

replication

in

operation,

because

contrast,

quality

because

operation,

which

little.

it

of data for

reduces

fre-

viewed

additional Whether

avail-

the

is properly

it requires

costs

of a replica

the quality

forsakes

is important

quality

which it is contingent is the greater the poten-

preserves

operation

of disconnected

a measure of last resort. Server replication is expensive Disconnected

replication

disconnected

the

of the second-class

may be degraded because the first-class replica upon inaccessible. The longer the duration of disconnection, tial

the performance

hardware.

to

use

server

replication or not is thus a trade-off between quality and cost. Coda permit a volume to have a sole server replica. Therefore, an installation rely exclusively on disconnected operation if it so chooses.

3.4

Optimistic

vs. Pessimistic

By definition,

a network

replica

and

all

replica

control

to the design ing

strategies,

danger

occurrence. A pessimistic

all reads

control

by

towards

a disconnected

choice

strategy and

provides

conflict-

deals

them

operation

would

of a cached reconnection. preclude

reads

much

and

of

central

avoids

resolving

disconnected

would

families

or by restricting

everywhere,

detecting

client

strategy

writes

and writes

second-class

two

[51, is therefore

A pessimistic

partitioned

exclusive control such control until

by a disconnected

between

optimistic

An optimistic

of conflicts

client to acquire shared or disconnection, and to retain exclusive

The

operation.

partition.

approach

between

and

pessimistic

by permitting

attendant

exists

by disallowing

does can

Control

associates.

of disconnected

to a single

availability

partition

its first-class

operations

and writes

Replica

as

higher

with

after

the their

require

a

object prior Possession

to of

reading

or writing

at all other replicas. Possession of shared control would allow reading at other replicas, but writes would still be forbidden everywhere. Acquiring control prior to voluntary disconnection is relatively simple. It is more

difficult

when

have to arbitrate

disconnection

among

needed

to make

system

cannot

a wise predict

is involuntary,

multiple

requesters.

decision

is not

which

requesters

because

Unfortunately,

readily would

available. actually

the

system

may

the information For use the

example, object,

the when

they would release control, or what the relative costs of denying them access would be. Retaining control until reconnection is acceptable in the case of brief disconnections. But it is unacceptable in the case of extended disconnections. A disconnected client with shared control of an object would force the rest of the system to defer all updates until it reconnected. With exclusive control, it would even prevent other users from making a copy of the object. Coercing the client

to reconnect

ACM Transactions

may

not be feasible,

since

its whereabouts

on Computer Systems, Vol. 10, No 1, February

1992

may

not be

Disconnected

Operation in the Coda File System

.

9

known. Thus, an entire user community could be at the mercy of a single errant client for an unbounded amount of time. Placing a time bound on exclusive or shared control, as done in the case of leases [7], avoids this problem but introduces others. Once a lease expires, a disconnected client loses the ability to access a cached object, even if no one else in the

system

disconnected already

is interested

operation

made

while

An optimistic disconnected

which

in it.

is to provide

disconnected

approach client

This,

have

in turn, high

conflict

the

availability.

purpose

Worse,

of

updates

to be discarded.

has its own disadvantages.

may

defeats

with

An update

an update

at another

made

at one

disconnected

or

connected client. For optimistic replication to be viable, the system has to be more sophisticated. There needs to be machinery in the system for detecting conflicts, age and manually

for automating resolution when possible, and for confining dampreserving evidence for manual repair. Having to repair conflicts violates transparency, is an annoyance to users, and reduces the

usability

of the system.

We chose optimistic replication because we felt that its strengths and weaknesses better matched our design goals. The dominant influence on our choice an

was the low degree

optimistic

strategy

of write-sharing

was

likely

to

typical lead

to

of Unix.

This

relatively

few

implied

that

conflicts.

An

optimistic strategy was also consistent with our overall goal of providing the highest possible availability of data. In principle, we could have chosen a pessimistic strategy for server replication even after choosing an optimistic strategy for disconnected But that would have reduced transparency, because a user would the anomaly of being able to update data when disconnected, unable

to do so when

the previous replication. Using system

connected

arguments

an optimistic from

strategy

the user’s

data in his accessible everyone else in that set of shrinks

updates

universe

universe.

contact. universe

become

4. DETAILED

throughout

strategy

presents

At any time,

Further, also apply

a uniform

visible

model

and his updates are immediately His accessible universe is usually

of

of the

visible to the entire

When failures occur, his accessible universe he can contact, and the set of clients that they, in

throughout

his now-enlarged

disconnected, reconnection,

accessible

his his

universe.

AND IMPLEMENTATION

In describing our implementation client since this is where much the physical structure of Venus, and Sections

many

to server

he is able to read the latest

In the limit, when he is operating consists of just his machine. Upon

DESIGN

tion of the server Section 4.5.

of the servers.

of an optimistic

perspective.

servers and clients. to the set of servers

turn, can accessible

to a subset

in favor

operation. have faced but being

of disconnected of the complexity

of a client, 4.3 to 4.5

support

needed

operation, we focus on the lies. Section 4.1 describes

Section 4.2 introduces the major states discuss these states in detail. A descripfor disconnected

ACM Transactions

operation

is contained

in

on Computer Systems, Vol. 10, No. 1, February 1992.

10

.

J. J Kistler and M Satyanarayanan

IT



Venus

I

1

Fig.2.

4.1

Client

Because than

Structure

ofa Coda client.

Structure of the

part but

debug.

Figure

Venus

complexity

of Venus,

of the kernel.

mance,

would

The latter

have

been

2 illustrates

intercepts

Unix

we made approach

less portable

the high-level file

system

it a user-level

may

have

access,

are handled A system

disconnected

MiniCache. If possible, the returned to the application. service returns

operation

entirely by Venus. call on a Coda object

calls

via

or server

is forwarded

from

our

for good performance

Venus

Logically, reintegration.

Venus always

rather

better

perfor-

more

difficult

to

widely

used

Sun

Vnode

performance

overhead MiniCache to filter contains no support

replication; by the

these

Vnode

on out for

functions

interface

to the

call is serviced by the MiniCache and control Otherwise, the MiniCache contacts Venus

is to

the call. This, in turn, may involve contacting Coda servers. Control from Venus via the MiniCache to the application program, updating

Measurements

4.2

process

of a Coda client.

the

MiniCache state as a side effect. MiniCache initiated by Venus on events such as callback critical

yielded

and considerably structure

interface [10]. Since this interface imposes a heavy user-level cache managers, we use a tiny in-kernel many kernel-Venus interactions. The MiniCache remote

to Coda servers

implementation

state changes breaks from

confirm

that

the

may Coda

also be servers.

MiniCache

is

[211,

States Venus operates in one of three Figure 3 depicts these states

is normally on the alert

emulation,

between

and them.

in the hoarding state, relying on server replication but for possible disconnection. Upon disconnection, it enters

the emulation state and remains Upon reconnection, Venus enters ACM Transactions

states: hoarding, and the transitions

on Computer Systems, Vol

there for the the reintegration 10, No 1, February

duration state, 1992

of disconnection. desynchronizes its

Disconnected

Operation in the Coda File System

.

11

r)

Hoarding

Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It transmits to reintegration upon successful reconnection to an AVSG member, and thence to hoarding, where it resumes connected operation.

cache

with

its

AVSG,

and

then

all volumes may not be replicated be in different states with respect conditions in the system.

4.3

reverts

to

across the to different

the

hoarding

state.

same set of servers, volumes, depending

Since

Venus can on failure

Hoarding

The hoarding state

state

is to hoard

not its only

is so named

useful

data

responsibility.

because

Rather,

reference

predicted

with

behavior,

manage

is

its cache in a manner

and disconnected operation. a certain set of files is critical

the implementation especially

and reconnection

–The true cost of a cache hard to quantify. an object

this

must

For inbut may

in

of hoarding: the

distant

future,

cannot

be

certainty.

–Disconnections

—Activity

in this

However,

files. To provide good performance, Venus must to be prepared for disconnection, it must also cache

the former set of files. Many factors complicate —File

of Venus

of disconnection.

Venus

that balances the needs of connected stance, a user may have indicated that currently be using other cache the latter files. But

a key responsibility

in anticipation

at other

clients

miss must

are often while

unpredictable.

disconnected

be accounted

is highly

for, so that

variable

the latest

and

version

of

is in the cache at disconnection.

—Since cache space is finite, the availability of less critical to be sacrificed in favor of more critical objects. ACM Transactions

objects

may

have

on Computer Systems, Vol. 10, No. 1, February 1992.

12

J J. Kistler and M Satyanarayanan

. # a a a

Personal files /coda /usr/jjk d+ /coda /usr/jjk/papers /coda /usr/jjk/pape

# # a a a a a a a a a a a

100:d+ rs/sosp 1000:d+

# a a a a a

System files /usr/bin 100:d+ /usr/etc 100:d+ /usr/include 100:d+ /usr/lib 100:d+ /usr/local/gnu d+ a /usr/local/rcs d+ a /usr/ucb d+

XII files (from Xll maintainer) /usr/Xll/bin/X /usr/Xll/bin/Xvga /usr/Xll/bin/mwm /usr/Xl l/ bin/startx /usr/Xll /bin/x clock /usr/Xll/bin/xinlt /usr/Xll/bin/xtenn /usr/Xll/i nclude/Xll /bitmaps c+ /usr/Xll/ lib/app-defaults d+ /usr/Xll /lib/font s/mist c+ /usr/Xl l/lib/system .mwmrc

(a) # # a a a

Venus source files (shared among Coda developers) /coda /project /coda /src/venus 100:c+ /coda/project/coda/include 100:c+ /coda/project/coda/lib c+

(c) Fig.

4.

Sample

hoard

profiles.

These are typical

hoard

profiles

provided

hy a Coda user, an

application maintainer, and a group of project developers. Each profile IS interpreted separately by the HDB front-end program The ‘a’ at the beginning of a line indicates an add-entry command. Other commands are delete an entry, clear all entries, and list entries. The modifiers following some pathnames specify nondefault priorities (the default is 10) and/or metaexpansion for the entry. ‘/coda’.

To

Note that

address

the pathnames

these

concerns,

and periodically cache via a process known algorithm,

4.3.1

plicit

Prioritized

sources

Cache

manage

Management.

of information

of interest front-end

we

with

reevaluate which walking. as hoard

rithm. The implicit information traditional caching algorithms. hoard database per-workstation fying objects A simple

beginning

in

its

‘/usr’

the

(HDB),

cache

objects

Venus priority-based

consists Explicit

are actually

symbolic

using

merit

a

links

into

prioritized

retention

in the

combines implicit and excache management algo-

of recent reference history, as in information takes the form of a whose entries are pathnames identi-

to the user at the workstation. program allows a user to update

the

HDB

using

hoard profiles, such as those shown in Figure command scripts called Since hoard profiles are just files, it is simple for an application maintainer

4. to

provide a common profile for his users. or for users collaborating on a project to maintain a common profile. A user can customize his HDB by specifying different combinations of profiles or by executing front-end commands interactively. To facilitate construction of hoard profiles, Venus can record all file references observed between a pair of start and stop events indicated by a user. To reduce the verbosity of hoard profiles and the effort needed to maintain them, Venus if the letter ACM Transactions

meta-expansion of HDB supports ‘c’ (or ‘d’) follows a pathname, on Computer Systems, Vol

entries. As shown in Figure the command also applies

10, No, 1, February

1992

4, to

Disconnected immediate

children

(or all

Operation in the Coda File System

descendants).

A ‘ +‘

following

the

.

13

‘c’ or ‘d’ indi-

cates that the command applies to all future as well descendants. A hoard entry may optionally indicate

as present children or priority, with a hoard

higher priorities indicating more critical The current priority of a cached object

of its hoard

well

as a metric

representing

ously in response no longer in the victims when To resolve imperative

usage.

The

latter

priority

is updated

all

ensure

the

ancestors

that

a cached

of the

object

directory

while

of objects chosen as

disconnected,

also be cached.

is not

as

continu-

to new references, and serves to age the priority working set. Objects of the lowest priority are

cache space has to be reclaimed. the pathname of a cached object

that

therefore

recent

objects. is a function

purged

it

is

Venus

must

any

of its

before

hierarchical cache management is not needed in tradidescendants. This tional file caching schemes because cache misses during name translation can be serviced, albeit at a performance cost. Venus performs hierarchical

cache management by assigning infinite priority to directories with children. This automatically forces replacement to occur bottom-up. 4.3.2

Hoard

Walking.

say that

We

a cache

is in

cached

signifying

equilibrium,

that it meets user expectations about availability, when no uncached object has a higher priority than a cached object. Equilibrium maybe disturbed as a result the

of normal cache

activity.

mentioned its priority equilibrium,

as a hoard

walk.

implementation, bindings

replacing

an

suppose object,

B.

an object,

A, is brought

Further

suppose

restores but

one

disconnection. of HDB

equilibrium

A hoard

walk may

The

entries

occurs be

by performing every

explicitly

walk

occurs

are reevaluated

Coda clients. For example, new tory whose pathname is specified

required in

an operation

10 minutes by

in

is

First,

to reflect

update

activity

been created in the HDB.

a key characteristic

data is refetched only a strategy may result

current

a user

phases.

children may have with the ‘ +‘ option

callback-based caching, break. But in Coda, such

being unavailable should it. Refetching immediately ignores

B

known

our

two

priorities of all entries in the cache and HDB are reevaluated, fetched or evicted as needed to restore equilibrium. Hoard walks also address a problem arising from callback traditional callback

into

that

in the HDB, but A is not. Some time after activity on A ceases, will decay below the hoard priority of B. The cache is no longer in than the uncached since the cached object A has lower priority

object B. Venus periodically

voluntary

For example,

on demand,

prior the

to

name

by other in a direcSecond, the and

objects

breaks.

on demand in a critical

In

after a object

a disconnection occur before the next reference to upon callback break avoids this problem, but of Unix

it is likely to be modified many interval [15, 6]. An immediate

environments:

once an object

is modified,

more times by the same user within a short refetch policy would increase client-server

traffic considerably, thereby reducing scalability. Our strategy is a compromise that balances scalability. For files and symbolic links, Venus ACM Transactions

availability, consistency and purges the object on callback

on Computer Systems, Vol. 10, No. 1, February 1992

14

.

break,

and refetches

occurs

earlier.

J. J. Klstler and M. Satyanarayanan it on demand

If a disconnection

or during were

the next

to occur

hoard

before

walk,

whichever

refetching,

the

object

would be unavailable. For directories, Venus does not purge on callback break, but marks the cache entry suspicious. A stale cache entry is thus available should a disconnection occur before the next hoard walk or reference.

The

callback entry

acceptability

semantics.

y of stale

A callback

has been added

to or deleted

other directory entries fore, saving the stale nection

causes

directory break

and copy

consistency

the and

data

follows

on a directory

from

from

typically

the directory.

its

particular

means

It is often

that

objects they name are unchanged. using it in the event of untimely

to suffer

only

a little,

but

increases

actions

normally

an

the case that Therediscon-

availability

considerably. 4.4

Emulation

In the

emulation

servers.

For

semantic

state,

example,

checks.

Venus

performs

Venus

now

many

assumes

It is also responsible

full

responsibility

for generating

handled

by

for access and

temporary

file

identi-

(fids) for new objects, pending the assignment of permanent fids at updates reintegration. But although Venus is functioning as a pseudo-server, accepted by it have to be revalidated with respect to integrity and protection by real servers. This follows from the Coda policy of trusting only servers, not clients. To minimize unpleasant delayed surprises for a disconnected user, it

fiers

behooves

Venus

Cache algorithm

management used during

cache freed ity

entries

to be as faithful

of the

immediately,

so that

default request 4.4.1

they

objects

but are

not

version

in its emulation. is

involved.

Cache

of other

modified

those

purged

before

done with operations entries

the same priority directly update the

of deleted

objects

assume

reintegration.

objects

infinite

On a cache

are

prior-

miss,

the

behavior of Venus is to return an error code. A user may optionally Venus to block his processes until cache misses can be serviced. Logging.

During

emulation,

to replay update activity when in a per-volume log of mutating contains

as possible

during emulation hoarding. Mutating

a copy state

of the

corresponding

of all objects

Venus

records

sufficient

information

it reinteWates. It maintains this information log. Each log entry operations called a replay

referenced

system

call

arguments

as well

as the

by the call.

Venus uses a number of optimizations to reduce log, resulting in a log size that is typically a few

the length of the replay percent of cache size. A

small log conserves disk space, a critical resource during periods of disconnection. It also improves reintegration performance by reducing latency and server load. One important optimization to reduce log length pertains to write operations on files. Since Coda uses whole-file caching, the close after an open of a file for modification installs a completely new copy of the file. Rather than logging the open, close, and intervening write operations individually, Venus logs a single store record during the handling of a close, Another optimization for a file when a new

consists of Venus discarding a previous store record one is appended to the log. This follows from the fact

ACM TransactIons on Computer Systems, Vol. 10, No. 1, February

1992,

Disconnected that

a store

record

renders

all

Operation in the Coda File System

previous

does not contain

versions

of a file

a copy of the file’s

.

superfluous.

contents,

but

15

The

merely

store

points

to the

copy in the cache. We

are

length

currently

of the replay

the previous earlier would

implementing

two

log.

generalizes

paragraph

The

such that

operations may be the canceling

second optimization the inverting and own log record 4.4.2

chine

further

as well

cancel the of a store

as that

to reduce

optimization

which

the

described

in

the effect

overwrites

of

corresponding log records. An example by a subsequent unlink or truncate. The operations to cancel both of inverse For example, a rmdir may cancel its

of the corresponding

A disconnected

user

and continue

where

a shutdown

optimizations the

any operation

exploits knowledge inverted log records.

Persistence.

after

first

mkdir.

must

be able

he left

to restart

his

ma-

off. In case of a crash,

the

amount of data lost should be no greater than if the same failure occurred during connected operation. To provide these guarantees, Venus must keep its cache and related data structures in nonvolatile storage. Mets-data,

consisting

of cached

directory

and symbolic

link

contents,

status

blocks for cached objects of all types, replay logs, and the HDB, is mapped virtual memory (RVM). Transacinto Venus’ address space as recoverable tional

access to this

Venus.

memory

The actual

contents

local Unix files. The use of transactions

is supported of cached to

by the RVM

files

library

are not in RVM,

manipulate

meta-data

[13] linked but

into

are stored

simplifies

Venus’

as job

enormously. To maintain its invariants Venus need only ensure that each transaction takes meta-data from one consistent state to another. It need not be concerned with crash recovery, since RVM handles this transparently. If we had files,

chosen

we

synchronous RVM trol

over

the

would

writes

supports the

obvious

have

An

alternative to

follow

and an ad-hoc

local,

basic

serializability.

had

non-nested

recovery

RVM’S write-ahead bounded provides flushes. Venus

exploits

log from persistence,

the

allows

commit

latency

a synchronous the application

time to time. where the

capabilities

and

of atomicity,

reduce

thereby avoiding commit as no-flush, persistence of no-flush transactions,

of RVM

in

local

accessible

when

emulating,

When bound

con-

permanence

and

by

the

labelling

write to disk. To ensure must explicitly flush

used in this manner, RVM is the period between log

to provide

Venus

is more

the log more frequently. This lowers performance, data lost by a client crash within acceptable limits. 4.4.3 volatile

Resource

storage

Exhaustion.

during

emulation.

It

is possible The

timed

independent

good performance

constant level of persistence. When hoarding, Venus initiates log infrequently, since a copy of the data is available on servers. Since are not

Unix

of carefully

algorithm.

properties can

meta-data

discipline

transactions

transactional application

of placing a strict

two

for

conservative but

Venus

significant

keeps

at a flushes servers

and flushes the

to exhaust instances

amount

its of this

of

nonare

ACM ‘l!ransactions on Computer Systems, Vol. 10, No. 1, February 1992

16

.

J. J. Kistler and M. Satyanarayanan

the file allocated

cache becoming filled to replay logs becoming

with full.

modified

files,

Our current implementation is not very graceful tions. When the file cache is full, space can be freed modified

files.

reintegrate always

When

ion

log space is full,

has been

no further

performed.

and

the

in handling by truncating

mutations

Of course,

RVM

non-rout

these situaor deleting

are allowed

at ing

space

until

operations

are

allowed.

We plan emulating.

to explore at least three alternatives One possibility is to compress file

to free up disk space while cache and RVM contents.

Compression trades off computation time for space, and recent work [21 has shown it to be a promising tool for cache management. A second possibility is to allow users to selectively back out updates made while disconnected. A third

approach

is to allow

out to removable 4.5

media

portions

of the

such as floppy

file

cache

and RVM

to be written

disks.

Reintegration

Reintegration

is a transitory

state

through

which

Venus

passes

in changing

roles from pseudo-server to cache manager. In this state, Venus propagates changes made during emulation, and updates its cache to reflect current server state. Reintegration is performed a volume at a time, with all update activity

in the volume

4.5.1

Replay

suspended

in two

for

and

This

new

objects

step

steps.

in many

independently

executed. protection,

to replace cases,

of changes

step, Venus temporary

since

Venus

from

obtains fids

client

in the

obtains

to AVSG

permanent

a small

fids

replay

log.

supply

of

fids in advance of need, while in the hoarding state. In the second replay log is shipped in parallel to the AVSG, and executed at each

member.

single transaction, which The replay algorithm parsed, locked.

In the first

uses them

is avoided

permanent step, the

completion.

The propagation

Algorithm.

is accomplished

until

Each

server

performs

the

is aborted if any error is detected. consists of four phases. In phase

replay one

within the

log

a is

a transaction is begun, and all objects referenced in the log are In phase two, each operation in the log is validated and then The validation consists of conflict detection as well as integrity, and disk space checks. Except in the case of store operations,

execution during replay is identical to execution in connected mode. For a file is created and meta-data is updated to reference store, an empty shadow it, but the data transfer is deferred. Phase three consists exclusively of performing

these

data

transfers,

a process

known

as bac?z-fetching.

The final

phase commits the transaction and releases all locks. If reinte~ation succeeds, Venus frees the replay log and resets the priority of cached objects referenced by the log. If reintegration fails, Venus writes file in a superset of the Unix tar format. out the replay log to a local replay The log and all corresponding cache entries are then purged, so that subsequent references will cause refetch of the current contents at the AVSG. A tool is provided which allows the user to inspect the contents of a replay file, compare entirety.

it

to the

ACM Transactions

state

at

the

AVSG,

and

replay

on Computer Systems, Vol. 10, No. 1, February

it 1992,

selectively

or in

its

Disconnected

Operation in the Coda File System

Reintegration at finer granularity than perceived by clients, improve concurrency reduce user effort during manual replay. implementation

to reintegrate

dent

within

operations

using

the

precedence

Venus

will

pass them

to servers

along

The tagged

with

with

operations

clients. conflicts.

for conflicts a storeid

of subsequences

subsequences

of Davidson

precedence the replay

The

relies

that

only

Read/

system model, since it of a single system call.

check

granularity

graphs

of depen-

can be identified

[4]. In the

revised

during

emulation,

imple

-

and

log.

Our use of optimistic replica control means that of one client may conflict with activity at servers

Handling.

or other disconnected with are write / write Unix file boundary

maintain

17

a volume would reduce the latency and load balancing at servers, and To this end, we are revising our

Dependent

approach

graph

mentation

ConfZict 4.5.2 the disconnected

at the

a volume.

.

class of conflicts we are concerned conflicts are not relevant to the

write

has

no

on the

fact

uniquely

notion that

identifies

of atomicity each replica

the

last

beyond

the

of an object

update

to it.

is

During

phase two of replay, a server compares the storeid of every object mentioned in a log entry with the storeid of its own replica of the object. If the comparison the mutated

indicates equality for all objects, the operation objects are tagged with a new storeid specified

is performed and in the log entry.

If a storeid comparison fails, the action taken depends on the operation being validated. In the case of a store of a file, the entire reintegration is aborted. But for directories, a conflict is declared only if a newly created name collides with an existing name, if an object updated at the client or the server has been deleted by the other, or if directory attributes have been modified

at the

directory

updates

server

and the

is consistent

client. with

This our

strategy

strategy

and was originally suggested by Locus [23]. Our original design for disconnected operation replay files at servers rather than clients. This damage forcing

to

be

manual

confined repair,

by

marking

conflicting

as is currently

of resolving in server

Today, laptops

[11],

called for preservation of approach would also allow replicas

inconsistent

done in the case of server

We are awaiting more usage experience to determine the correct approach for disconnected operation. 5. STATUS

partitioned

replication

whether

and

replication. this

is indeed

AND EVALUATION

Coda runs on IBM RTs, Decstation such as the Toshiba 5200. A small

3100s and 5000s, and 386-based user community has been using

Coda on a daily basis as its primary data repository since April 1990. All development work on Coda is done in Coda itself. As of July 1991 there were nearly 350MB of triply replicated data in Coda, with plans to expand to 2GB in the next A version

few months. of disconnected

operation

with

strated in October 1990. A more complete 1991, and is now in regular use. We have for periods

lasting

one to two days. ACM Transactions

Our

minimal

functionality

version was functional successfully operated

experience

with

on Computer Systems, Vol

was demonin January disconnected

the system

has been

10, No 1, February 1992.

18

.

J. J. Kistler and M. Satyanarayanan Table I.

Elapssd

Time

(seconds)

Andrew

Make

Reintegration Totaf

Time

AllocFid

Size of Replay

(xconds)

Replay

con

Records

Log

Data

Back-Fetched (Bytes)

Bytes

288 (3)

43 (2)

4 (2)

29 (1)

10 (1)

223

65,010

1,141,315

3,271 (28)

52 (4)

1 (o)

40 (1)

10 (3)

193

65,919

2,990,120

Benchmark

Venus

Time for Reintegration

Note: This data was obtained with a Toshiba T5200/100 client (12MB memory, 100MB disk) reintegrating over an Ethernet with an IBM RT-APC server (12MB memory, 400MB disk). The values shown above are the means of three trials, Figures m parentheses are standard deviations.

quite

positive,

and we are confident

that

the refinements

under

will result in an even more usable system. In the following sections we provide qualitative and quantitative to three important questions pertaining to disconnected operation. (1) How

long

(2) How

large

(3) How

likely

5.1

Duration

does reintegration a local

disk

development answers These are:

take?

does one need?

are conflicts? of Reintegration

In our experience, development lasting To characterize

typical disconnected a few hours required

reintegration

speed more

sessions of editing and program about a minute for reintegration. precisely,

we measured

the reinte-

gration times after disconnected execution of two well-defined tasks. The first task is the Andrew benchmark [91, now widely used as a basis for comparing file system performance. The second task is the compiling and linking of the current version of Venus. Table I presents the reintegration times for these tasks. The time for reintegration consists of three components: the time to allocate permanent fids, the time for the replay at the servers, and the time for the second phase of the update protocol used for server replication. The first component

will

be zero for many

disconnections,

due to the

preallocation

of

fids during hoarding. We expect the time for the second component to fall, considerably in many cases, as we incorporate the last of the replay log optimizations described in Section 4.4.1. The third component can be avoided only if server replication is not used. One can make some interesting secondary observations from Table 1. First, the total time for reinte~ation is roughly the same for the two tasks even though the Andrew benchmark has a much smaller elapsed time. This is because the Andrew benchmark uses the file system more intensively. Second, reintegration for the Venus make takes longer, even though the number of entries in the replay log is smaller. This is because much more file data is ACM Transactions

on Computer Systems, Vol. 10, No. 1, February

1992

Disconnected

Operation in the Coda File System

o

19

7J50$

— – – -----

540 g -cl ~ m 30 s ~

Max Avg Min

s 20 & .Q) z 10 -

—--———

/“--

.........

. . . . . . . . . . ------0

4

2

‘“

12 10 Time (hours)

8

6

,. . . ..-. -.

Fig. 5. High-water mark of Cache usage. This graph is based on a total of 10 traces from 5 active Coda workstations. The curve labelled “Avg” corresponds to the values obtained by “Max” and “ Min” plot averaging the high-water marks of all workstations. The curves labeled the highest and lowest values of the high-water marks across all workstations. Note that the high-water mark does not include space needed for paging, the HDB or replay logs.

back-fetched in the third phase of the replay. Finally, neither any think time. As a result, their reintegration times are

task involves comparable to

that

session

after

a much

longer,

but

more

typical,

disconnected

in

our

environment.

5.2

Cache

Size

A local disk capacity of 100MB on our clients has proved initial sessions of disconnected operation. To obtain a better the cache size requirements for disconnected operation, reference menting

traces from workstations

adequate for understanding we analyzed

our environment. The traces were obtained to record information on every file system

by instruoperation,

regardless of whether the file was in Coda, AFS, or the local file system. Our analysis is based on simulations driven by these traces. Writing validating Venus Venus driven

a simulator

that

precisely

models

the complex

caching

our of file

and

behavior

of

would be quite difficult. To avoid this difficulty, we have modified to act as its own simulator. When running as a simulator, Venus is by traces rather than requests from the kernel. Code to communicate

with the servers, as well as code to perform system are stubbed out during simulation. Figure

5 shows

The actual

disk

both the explicit From our data operating

the

high-water

size needed

mark

physical

of cache

for disconnected

1/0

usage

operation

on the

as a function

for a typical ACM Transactions

workday.

Of course,

user

file

of time.

has to be larger,

and implicit sources of hoarding information it appears that a disk of 50-60MB should

disconnected

local

since

are imperfect. be adequate for activity

that

is

on Computer Systems, Vol. 10, No. 1, February 1992.

20

J. J. Kkstler and M. Satyanarayanan



drastically significantly We plan First,

different from what different results. to extend our work

we will

investigate

disconnection.

Second,

cache we will

was

recorded

in

on trace-driven

our

could

simulations

size requirements be sampling

traces

in three

for much

a broader

longer

range

traces from many more machines in our environment. the effect of hoarding by simulating traces together

profiles

have

5.3

Likelihood

been specified

virtually

different

no

network

Second,

server

conflicts

replication due to While

partitions.

perhaps grows

will lead to many To obtain data mented 400

the AFS

computer

program cant

conflicts

will

larger.

Third,

in Coda for nearly

multiple gratifying

servers science

amount

faculty,

the same kind

mate

frequent

of usage

conflicts

a user

would

modifies

an

at

assumptions,

file

or

we instru-

for

research,

includes

a signifi-

is descended

were

user

are used by over

we can use this

be if Coda

users are system. Coda

scale,

profile

Coda

in is

disconnections

students

usage

Since

AFS

larger

graduate

Their

as the

servers

we have

an object observation

voluntary

These

and

activity.

only

extended

of conflicts

staff

a year,

updating us, this

a problem

perhaps

and education.

and makes

users to

is possible that our with an experimental

in our environment.

of collaborative

environment. Every time

become

more conflicts. on the likelihood

development,

how

Third, we with hoard

by users.

ex ante

subject to at least three criticisms. First, it being cautious, knowing that they are dealing community

of

activity

of Conflicts

In our use of optimistic seen

ways.

periods

of user

by obtaining will evaluate that

produce

to replace

directory,

we

from

AFS

data

to esti-

AFS

in our

compare

his

identity with that of the user who made the previous mutation. We also note the time interval between mutations. For a file, only the close after an open for update is counted as a mutation; individual write operations are not counted.

For

directories,

as mutations. Table II presents is classified

by

all

operations

our observations

volume

type:

user

that

modify

over a period volumes

a directory

of twelve

containing

are counted

months. private

The data user

data,

volumes used for collaborative work, and system volumes containing program binaries, libraries, header files and other similar data. On average,

project

a project volume has about volume has about 1600 files

2600 files and 280 and 130 directories.

directories, and a system User volumes tend to be

smaller, averaging about 200 files and 18 directories, place much of their data in their project volumes. Table II shows that over 99% of all modifications

because were

users

by the

often

previous

writer, and that the chances of two different users modifying the same object less than a day apart is at most O.75%. We had expected to see the highest degree of write-sharing on project files or directories, and were surprised to see that it actually occurs on system files. We conjecture that a significant fraction of this sharing arises from modifications to system files by operators, who change shift periodically. If system files are excluded, the absence of ACM Transactions

on Computer Systems, Vol. 10, No. 1, February

1992

Disconnected

Operation in the Coda File System

.

2I

m “E! o

v

w

ACM Transactions

on Computer

Systems,

Vol.

10, No.

1, February

1992

22

J. J. Kistler and M. Satyanarayanan

.

write-sharing the previous

is even more striking: more than 99.570 of all mutations are by writer, and the chances of two different users modifying the

same object within ing from the point would not environment.

be

6. RELATED

a week are less than 0.4~o! This of view of optimistic replication.

a serious

problem

if

were

replaced

by

Coda

in

our

WORK

Coda is unique in that it exploits availability while preserving a high no other

AFS

data is highly encouragIt suggests that conflicts

system,

published

caching for both performance and high degree of transparency. We are aware of

or unpublished,

that

duplicates

this

key aspect

of

Coda. By providing system

[20]

since this hoarding,

tools

to link

provided

local

and

rudimentary

was not transparent

its

remote

support

name

for

spaces,

disconnected

the

Cedar

operation.

file But

primary goal, Cedar did not provide support reintegz-ation or conflict detection. Files were

for ver-

sioned and immutable, and a Cedar cache manager could substitute a cached version of a file on reference to an unqualified remote file whose server was inaccessible. However, the implementors of Cedar observe that this capability was not often exploited since remote files were normally referenced by specific Birrell

version and

availability more recent

number. Schroeder

pointed

out

the

possibility

of “stashing”

data

for

in an early discussion of the Echo file system [141. However, a description of Echo [8] indicates that it uses stashing only for the

highest levels of the naming hierarchy. The FACE file system [3] uses stashing but does not integrate it with caching. The lack of integration has at least three negative consequences. First, it reduces transparency because users and applications deal with two different name spaces, with different consistency properties. Second, utilization of local disk space is likely to be much worse. Third, recent information from cache management is not available to manage the The

available

literature

integration detracted An application-specific the PCMAIL manipulate nize with

on FACE from

system at MIT [12]. existing mail messages

a central

does not

report

on how

the usability of the system. form of disconnected operation

repository

much

the

usage stash. lack

of

was implemented

in

PCMAIL allowed clients to disconnect, and generate new ones, and desynchro-

at reconnection.

Besides

relying

heavily

on the

semantics of mail, PCMAIL was less transparent than Coda since it required manual resynchronization as well as preregistration of clients with servers. The use of optimistic replication in distributed file systems was pioneered by Locus [23]. Since Locus used a peer-to-peer model rather than a clientserver model, availability was achieved solely through server replication. There was no notion of caching, and hence of disconnected operation. Coda has transparency

benefited in a general sense and performance in distributed

from file

the large body of work on systems. In particular, Coda

owes much to AFS [191, from which it inherits its model of trust integrity, as well as its mechanisms and design philosophy for scalability. ACM

TransactIons

on

Computer

Systems,

Vol.

10,

No.

1, February

1992.

7. FUTURE WORK

Disconnected operation in Coda is a facility under active development. In earlier sections of this paper we described work in progress in the areas of log optimization, granularity of reintegration, and evaluation of hoarding. Much additional work is also being done at lower levels of the system.

In this section we consider two ways in which the scope of our work may be broadened. An excellent opportunity exists in Coda for adding transactional support to Unix. Explicit transactions become more desirable as systems scale to hundreds or thousands of nodes, and the informal concurrency control of Unix becomes less effective. Many of the mechanisms supporting disconnected operation, such as operation logging, precedence graph maintenance, and conflict checking, would transfer directly to a transactional system using optimistic concurrency control. Although transactional file systems are not a new idea, no such system with the scalability, availability and performance properties of Coda has been proposed or built.
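A minimal sketch of this idea follows; it is an illustration rather than a description of existing Coda code, and the data structures and the server_current_version() stub are hypothetical. Each logged operation carries the versions of the objects it touched, and an operation certifies at reintegration time only if those versions are still current at the server.

```c
/*
 * Hypothetical sketch of optimistic certification at reintegration
 * time.  Each logged operation records the version of every object it
 * read or wrote; an operation certifies only if those versions are
 * still current at the server.  All names here are illustrative.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_OBJECTS 4

typedef unsigned long object_id;
typedef unsigned long version_t;

struct logged_op {
    object_id objects[MAX_OBJECTS];   /* objects touched by this op      */
    version_t versions[MAX_OBJECTS];  /* versions observed when logged   */
    int       nobjects;
};

/* Stand-in for a server query; a real system would issue an RPC. */
static version_t server_current_version(object_id oid)
{
    return (version_t)(oid % 7);      /* arbitrary placeholder value */
}

/* An operation certifies if no object has changed since it was logged. */
static bool op_certifies(const struct logged_op *op)
{
    for (int i = 0; i < op->nobjects; i++)
        if (server_current_version(op->objects[i]) != op->versions[i])
            return false;
    return true;
}

/* Replay the log as one transaction: stop at the first conflict. */
int reintegrate(const struct logged_op *log, int nops)
{
    for (int i = 0; i < nops; i++) {
        if (!op_certifies(&log[i])) {
            fprintf(stderr, "conflict at log record %d\n", i);
            return -1;                /* caller resolves or discards */
        }
        /* ...apply the operation at the server here... */
    }
    return 0;
}
```

Certification of this kind requires no lock state at the servers across disconnections, which is precisely what makes optimistic concurrency control attractive in this setting.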

A different opportunity exists in extending Coda to support weakly connected operation, where connectivity is intermittent or of low bandwidth. Such conditions are found in networks that rely on voice-grade lines, or that use wireless technologies such as packet radio. The ability to mask failures, as provided by disconnected operation, is of value even with weak connectivity. But techniques which exploit and adapt to the communication opportunities at hand are also needed. Such techniques may include more aggressive write-back policies, compressed network transmission, partial file transfer, and caching at intermediate levels.
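As a hedged sketch of what a more aggressive, bandwidth-aware write-back policy might look like (the policy and its constants are illustrative assumptions, not part of Coda), a client could shrink the age at which dirty objects are shipped back as the estimated link bandwidth grows:

```c
/*
 * Hypothetical write-back policy for weak connectivity.  Dirty objects
 * age in the cache; the aging threshold shrinks as estimated bandwidth
 * grows, so a fast link is used aggressively while a slow link sees
 * only updates that have survived local overwriting for a while.
 */
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

struct dirty_entry {
    time_t dirtied_at;   /* when the object was last modified locally */
    size_t bytes;        /* size of the update that would be shipped  */
};

/*
 * Decide whether to ship this update now.  bandwidth_bps is an estimate
 * maintained elsewhere, e.g. from recent RPC round-trip measurements;
 * zero means the client is effectively disconnected.
 */
bool should_write_back(const struct dirty_entry *e, time_t now,
                       double bandwidth_bps)
{
    if (bandwidth_bps <= 0.0)
        return false;                 /* disconnected: keep logging */

    /* Age threshold: about 6 s at 1 MB/s, capped at 10 min on slow links. */
    double threshold = 6.0 * (1.0e6 / bandwidth_bps);
    if (threshold > 600.0)
        threshold = 600.0;

    return difftime(now, e->dirtied_at) >= threshold;
}
```

Compressed transmission and partial file transfer would then apply to whatever such a policy chooses to ship.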

8. CONCLUSION

Disconnected operation is a tantalizingly simple idea. All one has to do is to preload one's cache with critical data, continue normal operation until disconnection, log all changes made while disconnected, and replay them upon reconnection.
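The cycle can be stated schematically as follows; this is a minimal sketch in which the three states correspond to hoarding, emulation and reintegration, and the helper functions are illustrative placeholders rather than Coda's actual interfaces.

```c
/*
 * Schematic sketch of the cycle described above.  The three states
 * correspond to hoarding, emulation and reintegration; every function
 * called here is an illustrative placeholder.
 */
#include <stdbool.h>

enum client_state { HOARDING, EMULATING, REINTEGRATING };

/* Placeholder hooks standing in for the real cache manager's work. */
static bool servers_reachable(void)       { return true; }
static void refresh_hoarded_objects(void) { /* walk cache, refetch stale data */ }
static void log_local_mutations(void)     { /* append updates to the operation log */ }
static bool replay_log_at_servers(void)   { return true; /* false on conflict */ }

/* One step of the client state machine. */
void cache_manager_step(enum client_state *state)
{
    switch (*state) {
    case HOARDING:                    /* connected: keep the cache warm */
        refresh_hoarded_objects();
        if (!servers_reachable())
            *state = EMULATING;
        break;

    case EMULATING:                   /* disconnected: serve from cache, log changes */
        log_local_mutations();
        if (servers_reachable())
            *state = REINTEGRATING;
        break;

    case REINTEGRATING:               /* reconnected: replay the log */
        if (replay_log_at_servers())
            *state = HOARDING;        /* a conflict would keep us here for resolution */
        break;
    }
}
```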

Implementing disconnected operation is not so simple. It involves major modifications and careful attention to detail in many aspects of cache management. While hoarding, a surprisingly large volume and variety of interrelated state has to be maintained. When emulating, the persistence and integrity of client data structures become critical. During reintegration, there are dynamic choices to be made about the granularity of reintegration.

Only in hindsight do we realize the extent to which implementations of traditional caching schemes have been simplified by the guaranteed presence of a lifeline to a first-class replica. Purging and refetching on demand, a strategy often used to handle pathological situations in those implementations, is not viable when supporting disconnected operation. However, the obstacles to realizing disconnected operation are not insurmountable. Rather, the central message of this paper is that disconnected operation is indeed feasible, efficient and usable.

One way to view our work is to regard it as an extension of the idea of write-back caching. Whereas write-back caching has hitherto been used for


performance, we have shown that it can be extended to mask temporary failures too. A broader view is that disconnected operation allows graceful transitions between states of autonomy and interdependence in a distributed system. Under favorable conditions, our approach provides all the benefits of remote data access; under unfavorable conditions, it provides continued access to critical data. We are certain that disconnected operation will become increasingly important as distributed systems grow in scale, diversity and vulnerability.

ACKNOWLEDGMENTS

We wish to thank Lily Mummert for her invaluable assistance in collecting and postprocessing the file reference traces used in Section 5.2, and Dimitris Varotsis, who helped instrument the AFS servers which yielded the measurements of Section 5.3. We also wish to express our appreciation to past and present contributors to the Coda project, especially Puneet Kumar, Hank Mashburn, Maria Okasaki, and David Steere.

REFERENCES

1. BURROWS, M. Efficient data sharing. PhD thesis, Univ. of Cambridge, Computer Laboratory, Dec. 1988.
2. CATE, V., AND GROSS, T. Combining the concepts of compression and caching for a two-level file system. In Proceedings of the 4th ACM Symposium on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr. 1991), pp. 200-211.
3. COVA, L. L. Resource management in federated computing environments. PhD thesis, Dept. of Computer Science, Princeton Univ., Oct. 1990.
4. DAVIDSON, S. B. Optimism and consistency in partitioned distributed database systems. ACM Trans. Database Syst. 9, 3 (Sept. 1984), 456-481.
5. DAVIDSON, S. B., GARCIA-MOLINA, H., AND SKEEN, D. Consistency in partitioned networks. ACM Comput. Surv. 17, 3 (Sept. 1985), 341-370.
6. FLOYD, R. A. Transparency in distributed file systems. Tech. Rep. TR 272, Dept. of Computer Science, Univ. of Rochester, 1989.
7. GRAY, C. G., AND CHERITON, D. R. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the 12th ACM Symposium on Operating System Principles (Litchfield Park, Ariz., Dec. 1989), pp. 202-210.
8. HISGEN, A., BIRRELL, A., MANN, T., SCHROEDER, M., AND SWART, G. Availability and consistency tradeoffs in the Echo distributed file system. In Proceedings of the Second Workshop on Workstation Operating Systems (Pacific Grove, Calif., Sept. 1989), pp. 49-54.
9. HOWARD, J. H., KAZAR, M. L., MENEES, S. G., NICHOLS, D. A., SATYANARAYANAN, M., SIDEBOTHAM, R. N., AND WEST, M. J. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 51-81.
10. KLEIMAN, S. R. Vnodes: An architecture for multiple file system types in Sun UNIX. In Summer Usenix Conference Proceedings (Atlanta, Ga., June 1986), pp. 238-247.
11. KUMAR, P., AND SATYANARAYANAN, M. Log-based directory resolution in the Coda file system. Tech. Rep. CMU-CS-91-164, School of Computer Science, Carnegie Mellon Univ., 1991.
12. LAMBERT, M. PCMAIL: A distributed mail system for personal computers. DARPA-Internet RFC 1056, 1988.
13. MASHBURN, H., AND SATYANARAYANAN, M. RVM: Recoverable virtual memory user manual. School of Computer Science, Carnegie Mellon Univ., 1991.
14. NEEDHAM, R. M., AND HERBERT, A. J. Report on the Third European SIGOPS Workshop: "Autonomy or Interdependence in Distributed Systems." SIGOPS Rev. 23, 2 (Apr. 1989), 3-19.
15. OUSTERHOUT, J., DA COSTA, H., HARRISON, D., KUNZE, J., KUPFER, M., AND THOMPSON, J. A trace-driven analysis of the 4.2BSD file system. In Proceedings of the 10th ACM Symposium on Operating System Principles (Orcas Island, Wash., Dec. 1985), pp. 15-24.
16. SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., AND LYON, B. Design and implementation of the Sun network filesystem. In Summer Usenix Conference Proceedings (Portland, Ore., June 1985), pp. 119-130.
17. SATYANARAYANAN, M. On the influence of scale in a distributed system. In Proceedings of the 10th International Conference on Software Engineering (Singapore, Apr. 1988), pp. 10-18.
18. SATYANARAYANAN, M., KISTLER, J. J., KUMAR, P., OKASAKI, M. E., SIEGEL, E. H., AND STEERE, D. C. Coda: A highly available file system for a distributed workstation environment. IEEE Trans. Comput. 39, 4 (Apr. 1990), 447-459.
19. SATYANARAYANAN, M. Scalable, secure, and highly available distributed file access. IEEE Comput. 23, 5 (May 1990), 9-21.
20. SCHROEDER, M. D., GIFFORD, D. K., AND NEEDHAM, R. M. A caching file system for a programmer's workstation. In Proceedings of the 10th ACM Symposium on Operating System Principles (Orcas Island, Wash., Dec. 1985), pp. 25-34.
21. STEERE, D. C., KISTLER, J. J., AND SATYANARAYANAN, M. Efficient user-level file cache management on the Sun Vnode interface. In Summer Usenix Conference Proceedings (Anaheim, Calif., June 1990), pp. 325-331.
22. Decorum File System. Transarc Corporation, Jan. 1990.
23. WALKER, B., POPEK, G., ENGLISH, R., KLINE, C., AND THIEL, G. The LOCUS distributed operating system. In Proceedings of the 9th ACM Symposium on Operating System Principles (Bretton Woods, N.H., Oct. 1983), pp. 49-70.

Received June 1991; revised August 1991; accepted September 1991
