Lightweight Recoverable Virtual Memory

M. SATYANARAYANAN, HENRY H. MASHBURN, PUNEET KUMAR, DAVID C. STEERE, and JAMES J. KISTLER
Carnegie Mellon University

Recoverable virtual memory refers to regions of a virtual address space on which transactional guarantees are offered. This article describes RVM, an efficient, portable, and easily used implementation of recoverable virtual memory for Unix environments. A unique characteristic of RVM is that it allows independent control over the transactional properties of atomicity, permanence, and serializability. This leads to considerable flexibility in the use of transactions, potentially enlarging the range of applications that can benefit from them. Rather than providing functionality such as nesting and distribution itself, RVM simplifies the layering of such functionality on top of it. The article shows that RVM performs well over its intended range of usage even though it does not benefit from specialized operating system support. It also demonstrates the importance of intra- and inter-transaction optimization.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—virtual memory; D.4.5 [Operating Systems]: Reliability—fault tolerance; D.4.8 [Operating Systems]: Performance—measurements; H.2.2 [Database Management]: Physical Design—recovery and restart; H.2.4 [Database Management]: Systems—transaction processing

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Camelot, Coda, logging, paging, persistence, RVM, scalability, throughput, truncation, Unix
This work was sponsored by the Avionics Laboratory, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, under contract F33615-90-C-1465, ARPA order no. 7597. James J. Kistler is now affiliated with the DEC Systems Research Center, Palo Alto, CA.

1. INTRODUCTION

How simple can a transactional facility be, while remaining a potent tool for fault tolerance? Our answer, as elaborated here, is a user-level library with minimal programming constraints, implemented in about 10K lines of mainline code, and no more intrusive than a typical runtime library for input-output. This transactional facility, called RVM, is implemented without specialized operating system support and has been in use for over two years on a wide range of hardware from laptops to servers.
RVM is intended for Unix applications with persistent data structures that must be updated in a fault-tolerant manner. The total size of those data structures should be a small fraction of disk capacity, and their working-set size must easily fit within main memory. This combination of circumstances is most likely to be found in situations involving the metadata of storage repositories. Thus RVM can benefit a wide range of applications, from distributed file systems and databases to object-oriented repositories, CAD tools, and CASE tools. RVM can also provide runtime support for persistent programming languages. Since RVM allows independent control over the basic transactional properties of atomicity, permanence, and serializability, applications have considerable flexibility in how they use transactions.

It may often be tempting, and sometimes unavoidable, to use a mechanism that is richer in functionality or better integrated with the operating system. But our experience has been that such sophistication comes at the cost of portability, ease of use, and ease of maintenance. The design of RVM thus represents a balance between system-level concerns, such as functionality and performance, and software engineering concerns, such as usability and maintenance. Alternatively, one can view RVM as an exercise in minimalism: our design challenge lay not in conjuring up features to add, but in determining what could be omitted without crippling RVM.

Our experience with Camelot [Eppinger et al. 1991], a predecessor of RVM, and our understanding of the fault-tolerance requirements of Coda [Kistler and Satyanarayanan 1992; Satyanarayanan et al. 1990] and Venari [Nettles and Wing 1992; Wing et al. 1992] were the dominant influences on our design. We begin this article by describing our experience with Camelot. Wherever appropriate, we point out ways in which that usage experience influenced the design of RVM. The description of RVM itself is in three parts: rationale, architecture, and implementation. We conclude with an evaluation of RVM, a discussion of its use as a building block, and a summary of related work.

2. LESSONS FROM CAMELOT

2.1 Overview
Camelot is a transactional facility built to validate the thesis that general-purpose transactional support would simplify and encourage the construction of reliable distributed systems [Spector 1991]. It supports local and distributed nested transactions, and provides considerable flexibility in the choice of logging, synchronization, and transaction commitment strategies.
Camelot relies heavily on the external page management and interprocess communication facilities of the Mach operating system [Baron et al. 1987], which is binary compatible with the 4.3BSD Unix operating system [Leffler et al. 1989]. Figure 1 shows the overall structure of a Camelot node. Each module is implemented as a Mach task, and communication between modules is via Mach's interprocess communication facility (IPC).
Fig. 1. Structure of a Camelot node, showing the recoverable processes (application code), the Camelot system components, and the Mach kernel.

2.2 Usage

Our interest in Camelot arose in the context of the two-phase optimistic
replication protocol used by the Coda File System. Although the protocol does not require a distributed commit, it does require each server to ensure the atomicity and permanence of local updates to metadata in the first phase. The simplest strategy for us would have been to implement an ad hoc fault tolerance mechanism for metadata using some form of shadowing. But we were curious to see what Camelot could do for us. The aspect of Camelot that we found most useful is its support for recoverable
virtual memory [Eppinger 1989]. This unique feature of Camelot enables regions of a process' virtual address space to be endowed with the transactional properties of atomicity, isolation, and permanence. Since we did not find a need for features such as nested or distributed transactions, we realized that our use of Camelot would be something of an overkill. Yet we persisted, because it would give us first-hand experience in the use of transactions and because it would contribute toward the validation of the Camelot thesis.
We placed data structures pertaining to Coda metadata in recoverable virtual memory¹ on servers. Coda metadata included file system directories as well as persistent data for replica control and internal housekeeping. The contents of each Coda file was kept in a local Unix file on a server's disk. Recovery on a server consisted of Camelot restoring recoverable memory to the last committed state, followed by a Coda salvager which ensured mutual consistency between metadata and data.

2.3 Experience
The most valuable lesson we learned by using Camelot was that recoverable virtual memory was indeed a convenient and practically useful programming abstraction for systems like Coda. Crash recovery was simplified because data structures were restored in situ by Camelot. Directory operations were simple because they were merely manipulations of in-memory data structures. Overall, the encapsulation of messy crash recovery details into Camelot considerably simplified Coda server code, and the range of error states the Coda salvager had to handle was small.

Unfortunately, these benefits came at a high price. The problems we encountered manifested themselves as programming constraints, poor scalability, and difficulty of maintenance. In spite of considerable effort, we were not able to circumvent these problems. Since they were direct consequences of the design of Camelot, we elaborate on them in the following paragraphs.

A key design goal of Coda was to preserve the scalability of AFS. But a set of carefully controlled experiments (described in an earlier paper [Satyanarayanan et al. 1990]) showed that Coda was less scalable than AFS. These
experiments also showed that the primary contributor to loss of scalability was increased server CPU utilization and that Camelot was responsible for over a third of this increase. Examination of Coda servers in operation showed considerable paging and context-switching overheads due to the fact that
each
Camelot
operation
involved
interactions
between
many
of the
component processes shown in Figure 1. There was no obvious way to reduce this overhead, since it was inherent in the implementation structure of Camelot. A second obstacle to using Camelot was the set of programming constraints it
imposed.
These
constraints
came
in
a variety
of guises.
For
example,
Camelot required all processes using it to be descendants of the Disk Manager task shown in Figure 1. This meant that starting Coda servers required a rather convoluted procedure that made our system administration scripts complicated and fragile. It also made debugging more difficult because starting a Coda server under a debugger was complex. Another example of a programming constraint was that Camelot required us to use Mach kernel threads, even though Coda was capable of using user-level threads. Since kernel thread context switches were much more expensive, we ended up paying a hefty performance cost with little to show for it.
¹For brevity, we often omit "virtual" from "recoverable virtual memory" in the rest of this article.
A third limitation of Camelot was that its code size, complexity, and tight dependence on rarely used combinations of Mach features made maintenance and porting difficult. Since Coda was usually the sternest test case for recoverable memory, we were the first to expose new bugs in Camelot. But it was often hard to decide whether a particular problem lay in Camelot or Mach. As the cumulative toll of these problems mounted, we looked for ways to preserve the virtues of Camelot while avoiding its drawbacks. Since recoverable virtual memory was the only aspect of Camelot we relied on, we sought to distill the essence of this functionality into a realization that was cheap, easy to use, and had few strings attached. That quest led to RVM.

3. DESIGN RATIONALE

The central principle we adopted in designing RVM
was to value
simplicity
over generality. In building a tool that did one thing well, we were heeding Lampson's sound advice on interface design [Lampson 1983]. We were also being faithful to the long Unix tradition of keeping building blocks simple. The change in focus from generality to simplicity allowed us to take radically different positions from Camelot in the areas of functionality, operating system dependence, and structure.
3.1 Functionality

Our first simplification was to eliminate support for nesting and distribution. A cost-benefit analysis showed that each could be better provided as an independent layer on top of RVM.² While a layered implementation may be less efficient than a monolithic one, it has the attractive property of keeping each layer simple. Upper layers can count on the clean failure semantics of RVM, while the latter is only responsible for local, nonnested transactions.

A second area where we have simplified RVM is concurrency control. Rather than having RVM insist on a specific technique, we decided to factor out concurrency control. This allows applications to use a policy of their choice and to perform synchronization at a granularity appropriate to the abstractions they are supporting. If serializability is required, a layer above RVM has to enforce it. That layer is also responsible for coping with deadlocks, starvation, and other unpleasant concurrency control problems.

Internally, RVM is implemented to be multithreaded and to function
correctly in the presence of true parallelism. But it does not depend on kernel thread support and can be used with no changes on user-level thread implementations. We have, in fact, used RVM with three different threading mechanisms:
Mach kernel threads [Cooper and Draves 1988], coroutine C threads, and coroutine LWP [Satyanarayanan 1991].

Our final simplification was to factor out resiliency to media failure. Standard techniques such as mirroring can be used to achieve such resiliency. Our expectation is that this functionality will most likely be implemented in the device driver of a mirrored disk.

²An implementation sketch is provided in Section 8.

Fig. 2. Layering of functionality in RVM. Application code sits above optional layers for nesting, distribution, and serializability; RVM provides atomicity and permanence across process failure; the operating system provides permanence across media failure.

RVM thus adopts a layered approach to transactional support, as shown in Figure 2. This approach is simple and enhances flexibility: an application
does not have to buy into those aspects of the transactional concept that are irrelevant to it.

3.2 Operating System Dependence
To make RVM portable, we decided to rely only on a small, widely supported, Unix subset of the Mach system call interface. A consequence of this decision was that we could not count on tight coupling between RVM and the VM pager subsystem. The Camelot Disk Manager module runs as an external pager [Young 1989] and takes full responsibility for managing the backing store for recoverable regions of a process. The use of advisory VM calls (pin and unpin) in the Mach interface lets Camelot ensure that dirty recoverable regions of a process' address space are not paged out until the corresponding log records are forced to disk. This close alliance with Mach's VM subsystem allows Camelot to avoid double paging and to support recoverable regions whose size
approaches backing store or addressing limits. Efficient handling of large recoverable regions is critical to Camelot's goals.

Our goals in building RVM were more modest. We were not trying to replace traditional forms of persistent storage, such as file systems and databases. Rather, we saw RVM as a building block for metadata in those systems and in higher-level compositions of them.
Consequently,
we could
assume that the recoverable memory requirements on a machine would only be a small fraction of its total disk storage. This in turn meant that it was acceptable to waste some disk space by duplicating the backing store for recoverable called
regions.
its external
Hence
data
RVM’S
segment,
swap space. Crash recovery segment. Since a VM pageout uncommitted
dirty
backing
by the VM
good performance our strategy
use of external yet. The most
because
a process’
being paged Insulating
a recoverable
region,
of the region’s
subsystem
also requires
is to view
source usage tradeoff. By being generous have been able to keep RVM simple and optional feature
for
independent
VM
relies only on the state of the external data does not modify the external data segment, an
page can be reclaimed
of correctness. Of course, be rare. One way to characterize
store
is completely
that
without
it as a complexity
with memory portable. Our
-versus-re-
and disk space, we design supports the
pagers, but we have not implemented support apparent impact on Coda has been slower
recoverable
memory
in on demand. RVM from the
VM
must
be read
subsystem
also
loss
such pageouts
in en masse hinders
for this startup
rather
the
than
sharing
of
recoverable virtual memory across address spaces. But this is not a serious limitation. After all, the primary reason to use a separate address space is to increase robustness by avoiding memory corruption. Sharing recoverable memory across address spaces defeats this purpose. In fact, it is worse than sharing (volatile) virtual memory because damage may be persistent! Hence, our view is that processes each other enough to run 3.3
willing to share recoverable memory as threads in a single address space.
already
trust
Structure
The ability to communicate ness to be enhanced without decomposition,
shown
earlier
efficiently sacrificing
across address spaces allows robustgood performance. Camelot’s modular
in Figure
1, is predicated
on fast
IPC. Although
it has been shown that IPC can be fast [Bershad et al. 1990], its performance in commercial Unix implementations lags far behind that of the best experimental implementations. Even on Mach 2.5, the measurements reported by Stout et al. [1991] indicated that IPC is about 600 times more expensive than local procedure call.3 To make matters worse, Ousterhout [ 1990] reports that the context-switching performance of operating systems is not improving linearly
with
3430 microseconds DECStation
raw
hardware
versus
performance.
0.7 microseconds
for a null
call on a typical
contemporary
machine,
the
5000/200. ACM
TransactIons
on Computer
Systems,
Vol
12, No
1, February
1994
40
M. Satyanarayanan
. Given
our
desire
et al. to make
its
design critically dependent on fast IPC. Instead, we have structured RVM a library that is linked in with an application. No external communication
to make
RVM
portable,
we were
not willing
as of
any kind is involved in the servicing of RVM calls. An implication course, that we have to trust applications not to damage RVM tures and vice versa. A less obvious implication write-ahead
is
log on a dedicated
that
disk.
applications
Such sharing
cannot
of this is, of data struc-
share
is common
a single
in transactional
systems because disk head movement is a strong determinant of performance and because the use of a separate disk per application is economically infeasible at present. In Camelot, for example, the Disk Manager serves as the multiplexing agent for the log. The inability to share one log is not a significant limitation for Coda, because we run only one file server process on a machine. But it may be legitimate concern for other applications that wish to use RViM. horizon. First, erable
Fortunately,
independent interest
in
there
are two
potential
of transaction-processing log-structured
alleviating
factors
considerations,
implementations
of the
there Unix
on the
is consid-
file
system
[Rosenblum and Ousterhout 19921. If one were to place the RVM log for each application in a separate file on such a system, one would benefit from minimal disk head movement. No log multiplexer would be needed, because that role would be played by the file system. Second, there is a trend toward using disks of small form factor, partly motivated by interest been predicated that
in disk array the large disk
technology [Patterson capacity in the future
et al. 1988]. It has will be achieved by
using many small disks. If this turns out to be true, there will be considerably less economic incentive to avoiding a dedicated disk per process. In summary, placed
each
in a Unix
process
using
file or on a raw disk
RVM
has a separate
partition.
When
log. The
log can be
the log is on a file,
RVM
uses the fsync system call to synchronously flush modifications onto disk. RVMS permanence guarantees rely on the correct implementation of this system call. For best performance, the log should either be in a raw partition on a dedicated disk or in a file on a log-structured Unix file system. 4. ARCHITECTURE The
design
of RVM
follows
logically
from
the
rationale
presented
the description below, we first present the major program-visible and then describe the operations supported on them. 4.1
Segments
earlier.
In
abstractions
and Regions
Recoverable memory is managed in segments, which are loosely analogous to Multics segments. RVM has been designed to accommodate segments up to 2^64 bytes long, although current hardware and file system limitations restrict segment length to 2^32 bytes. The number of segments on a machine is only limited by its storage resources. The backing store for a segment may be a file or a raw disk partition. Since the distinction is invisible to programs, we use the term "external data segment" to refer to either.
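Because the distinction between the two kinds of backing store is invisible to programs, all an application has to supply is a description of the external data segment and where a region of it should be placed. The structure below is only a hypothetical illustration of such a description; the field names are ours and are not RVM's actual declarations.

```c
/* Hypothetical description of one mapped region of a segment.
 * Field names are illustrative; the real RVM declarations may differ. */
struct region_desc {
    const char *data_dev;   /* external data segment: Unix file or raw partition   */
    long        offset;     /* byte offset of the region within the segment        */
    long        length;     /* region length: page aligned, multiple of page size  */
    char       *vmaddr;     /* virtual address at which the region is to be mapped */
};

/* Either kind of backing store looks the same to the application: */
struct region_desc on_file      = { "/coda/SEG0", 0, 1 << 20, (char *)0x50000000 };
struct region_desc on_partition = { "/dev/rsd1c", 0, 1 << 20, (char *)0x50100000 };
```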
Fig. 3. Mapping of regions of segments into a process' Unix virtual memory. Each shaded area represents a mapped region of a segment. (Segments span addresses 0 to 2^64-1; Unix virtual memory spans 0 to 2^32-1.)

As shown in Figure 3, applications explicitly map regions of segments into their virtual memory. RVM guarantees that the newly mapped data represents the committed image of the region in its external data segment. A region typically corresponds to a related collection of objects and may be as large as the entire segment. In the current implementation, the copying of data from external data segment to virtual memory occurs when a region is mapped. The limitation of this method is increased startup latency, as mentioned in Section 3.2. In the future, we plan to cope with this by providing an optional Mach external pager to copy data on demand.

Restrictions on segment mapping are minimal. The most important restriction is that no region of a segment may be mapped more than once by the same process. Also, the regions specified in mappings cannot overlap in virtual memory. These restrictions eliminate the need for RVM to cope with aliasing. Mapping must be done in multiples of page size, and regions must be page aligned. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. RVM retains no information about a segment's mappings after its regions are unmapped.

A segment loader package, built on top of RVM, allows the creation and maintenance of a load map for recoverable storage and takes care of mapping a segment into the same base address each time. This simplifies the use of absolute pointers in segments. A recoverable memory allocator, also layered on RVM, supports heap management of storage within a segment.
4.2 RVM Primitives
The operations provided by RVM for initialization, termination, and segment mapping are shown in Figure 4a. The log to be used by a process is specified at RVM initialization via the options_ desc argument. The map operation is called once for each region to be mapped. The external data segment and the range of virtual memory addresses for the mapping are identified in the first argument.
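The fragment below sketches how a process might use these operations together with the transactional operations of Figure 4b. It is only a sketch: the identifiers (rvm_initialize, rvm_map, and so on), the option and region descriptor fields, and the mode constants are modeled on the operations listed in Figure 4, and the exact declarations in the real RVM library may differ.

```c
#include <string.h>
#include "rvm.h"   /* hypothetical header exporting the Figure 4 operations */

/* A persistent counter kept somewhere inside the mapped region. */
struct counters { long requests; };

int example(void)
{
    rvm_options_t opts;          /* describes the write-ahead log to use     */
    rvm_region_t  region;        /* external data segment + target addresses */
    rvm_tid_t     tid;           /* transaction identifier                   */
    struct counters *c;

    /* Tell RVM which log to use, then initialize the library. */
    memset(&opts, 0, sizeof(opts));
    opts.log_dev = "/coda/LOG";              /* file or raw partition (assumed field) */
    rvm_initialize(RVM_VERSION, &opts);

    /* Map one region of an external data segment at a fixed address. */
    memset(&region, 0, sizeof(region));
    region.data_dev = "/coda/DATA";          /* external data segment (assumed field) */
    region.offset   = 0;
    region.length   = 4 * 1024 * 1024;       /* page aligned, multiple of page size   */
    region.vmaddr   = (char *)0x50000000;    /* same base address on every run        */
    rvm_map(&region, &opts);
    c = (struct counters *)region.vmaddr;

    /* One transaction: declare the range to be modified, then modify it. */
    rvm_begin_transaction(&tid, restore);    /* "restore" => old values kept for abort */
    rvm_set_range(&tid, c, sizeof(*c));      /* must precede the modification          */
    c->requests++;
    rvm_end_transaction(&tid, no_flush);     /* lazy commit; bounded persistence       */

    rvm_flush();                             /* force the no-flush commit to the log   */
    rvm_terminate();
    return 0;
}
```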
The
unmap
operation
can be invoked
at any time
that
a region
is
quiescent.
Once unmapped, a region can be remapped to some other part of the process’ address space. After a region has been mapped, memory addresses within it may be used in the transactional operations shown in Figure 4b. The begin _transaction ACM
Transactions
on Computer
Systems,
Vol. 12, No. 1, February
1994.
42
M. Satyanarayanan
.
e
et al,
begin_transaction
(o) Imtml[zatmn
(tid,
restore_rmode);
base_addr,
nbytes)
::2.:= set_range(tid,
& Mapping
Operzzt]on~
Trmsactlonal
(b)
Operations
query(options_desc, set_options
E2zl (c)
Log Control
(d)
Fig.4
operation
returns
associated
RVM
know
that
RVM
to record
tion
indicate
the that
A
is
transaction.
changes
made
never
of end _transaction.
by
commit
latency
no-flush
transactions
since
time
a log
force
When
used
it can undo
require
ted
no-flush
truncate,
is avoided.
and
reflected to transparently
until
have
been
all committed
ensure flush
manner,
external in the
data segments. background by
The
final
such
set The
as the
number
The set _ options old
for
ACM
TransactIons
of primitives, query
triggering
operation and
operation log
on Computer
identity
provides
Systems,
of its bounded
Note
that
controlling
until
The
second
Figure
we have
shown
in
allows
an application
of uncommitted and
param-
write-ahead
flushes. for
of
the
all commitoperation,
log have been
Log truncation is usually performed RVM. But since this is a potentially
sets a variety
truncation
via
willing-
persistence
blocks
to disk.
on a
has reduced
in the write-ahead
long-running and resource-intensive operation, nism for applications to control its timing. functions.
its mode
RVM’S
RVM
flush,
forced
changes
a no-re-
aborted
transaction
To
in
permanence
can indicate
explicitly
log. The fh-st operation,
transactions
blocks
lets allows
intervention.
via the commit_
or “lazy”
in this
further
changes
Such
guarantees
persistence, where the bound is the period between log atomicity is guaranteed independent of permanence. Figure 4C shows the two operations provided by RVM use of the write-ahead
;
lets an applica-
no RVM
an application
must
all This
a transaction.
guarantee
the application
to time.
so that
commit
a no-fZush
in
operation
to be modified.
end _transacticm
a successful
Such
is used
set_range
does not have to copy data
regions
But
that
abort
since RVM
permanence
mode)
Operat]um
to begin _transaction
explicitly
committed
a weaker
area
flag
on mapped
By default,
Mmcllaneous
The
is about
of the
mode
in a transaction.
ness to accept
log from
value
efficient,
operations
transaction
abort_
eter
it will is more
Read
tid,
transaction.
area of a region
current
;
log_len,
primitives.
identifier,
that
The restore_
transaction
set-range.
with
a certain
case of an abort. store
RVM
a transaction
operations
region_desc);
(options_desc)
create_log(options,
Opermons
;
of tuning the
sizes
Vol. 12, No, 1, February
4d,
provided
performs
a variety
to obtain transactions
knobs
such
of internal 1994
a mechaof
information in
a region.
as the threshbuffers.
Using
Lightweight
Recoverable Virtual Memory
create_ log, an application can dynamically use it in an initialize operation.
create
a write-ahead
.
43
log and then
5. IMPLEMENTATION Since RVM draws upon well-known systems, we restrict our discussion implementation: log management
techniques for building here to two important and
[Mashburn and Satyanarayanan 19921 comprehensive treatment of transactional found 5.1
in Gray
and Reuter
optimization.
transactional aspects of its
The
RViM
manual
offers many further details, implementation techniques
and a can be
[1993].
Log Management
5.1.1 Log Format. RVM is able to use a no-undo/redo value-logging strategy [Bernstein et al. 1987] because it never reflects uncommitted changes to an external data segment. The implementation assumes that adequate buffer space is available in virtual memory for the old-value records of uncommitted transactions. Consequently, only the new-value records of committed
transactions
have
to be written
to the log. The format
record is shown in Figure 5. The bounds and contents of old-value set-range operations records are replaced the
corresponding
in only
records
are known
of a typical to RVM
log
from
the
issued during a transaction. Upon commit, old-value by new-value records that reflect the current contents of ranges
one new-value
of memory.
record
Note
even if that
that
range
in a transaction. The final step of transaction the new-value records to the log and writing
each modified
range
has been updated commitment out a commit
results
many
consists record.
times
of forcing
No-restore and no-flush transactions are more efficient. The former result in both time and space spacings since the contents of old-value records do not have to be copied or buffered. The latter result in considerably lower commit latency, forced 5.1.2
since
new-value
and
commit
records
can
be spooled
rather
than
to the log. Crash
Recovery
and
Log
Truncation.
Crash
recovery
RVM first reading the log from tail to head, then constructing tree of the latest committed changes for each data segment the log. The trees are then traversed, applying modifications
consists
of
an in-memory encountered in in them to the
corresponding external data segment. Finally, the head and tail location information in the log status block is updated to reflect an empty log. The idempotency of recovery is achieved by delaying this step until all other recovery actions are complete. Truncation is the process of reclaiming space allocated to log entries by applying the changes contained in them to the recoverable Periodic truncation is necessary because log space is finite whenever current log size exceeds a preset fraction of its log truncation has proved to be the hardest experience, implement correctly. To minimize implementation effort, we ACM Transactions
on Computer
Systems,
data segment. and is triggered total size. In our part of RVM to initially
chose to
Vol. 12, No. 1, February
1994.
44
.
M, Satyanarayanan
et al. Reverse
Displacements {
hfi Trans
Range
Hdr
----i)ata
Hdr 1
~;g;
-------
---------
Data
~~g~
-----------
!1
I
5.
This
~k
-.-----*
I Forward
Fig.
-----Data
log record
format
h
Displacements
of a typical
log record
read
either
way.
Tail Displacements
I
Disk
Status
Label
Block
. . . . . .Truncation Epoch
. . . . . . . . .Current Epoch
........
New
.......
Record
.......
Space
i
I Head
Fig.6.
reuse
crash
epoch
truncation,
an initial rest
1,
f
I
Displacements
This
recovery the
part
of the
of the log. Figure
fi~reshows
code
for
crash
recovery
it
more
than
is stable
truncation
truncation.
In
fornew
this
procedure
approach,
the layout
ofa
on
truncation
reliance and
results
and robust, normal
forward
epoch
increases
system
operation.
This
occurs
an epoch is
to
in the
truncation
a logically forward
correct processing
performance.
a mechanism
mechanism
to as
is applied
processing
degrades
we are implementing
referred
above
log while
log traffic, in bursty
log records
described
6 depicts
necessary, during
freed
concurrent
substantially
RVM
truncation
log while
is in progress. Although exclusive strategy,
epoch
Now
that
for incremental
periodically
renders
oldest log entries obsolete by writing out relevant pages directly from the recoverable data segment. To preserve the no-undo/redo property
the
VM to of the
log, pages that have been modified by uncommitted transactions cannot be written out to the recoverable data segment. RVM maintains internal locks to ensure
that
situations, high
incremental such
concurrency,
truncation
as the may
result
long that log space becomes to epoch truncation. Figure 7 shows the two The
first
maintains
does not
presence
data
structure
the
modification
in incremental critical. data
is
Under
truncation
structures
a page
status
property.
being
circumstances,
used
vector
of that
this
transactions
those
for
region’s
loosely analogous to a VM page table: the entry and an uncommitted reference count. A page committed set _ ranges
violate
of long-running
in
incremental
each
mapped
pages.
The
Certain
or sustained blocked RVM
for so reverts
truncation. region page
that
vector
is
for a page contains a dirty bit is marked dirty when it has
changes. The uncommitted reference count is incremented as are executed and decremented when the changes are committed
or aborted. On commit, the affected pages are marked dirty. The second data structure is a FIFO queue of page modification descriptors that specifies the ACM
TransactIons
on Computer
Systems,
Vol
12, No, 1, February
1994
Lightweight
Recoverable Virtual Memory
.
45
Page Vector Uncommitted Rt?f Cnt Dirty Reserved PP 12
P 3
log head Fig. 7.
order head. that
page.
incremental writing
queue in the
offset
the desired
truncation
no
of selecting
specified
by the
next
references:
it could
the first
by it, deleting
amount
page
in which
specified
drop to zero.
out in order to move the log of the first record referencing
duplicate
descriptor
consists
count
descriptor
the descriptor, descriptor.
a page
appear.
in the queue, and moving
This
is
A step in the
step is repeated
of log space has been reclaimed.
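In outline, the page vector and the queue of page modification descriptors drive a reclaim loop like the following sketch. The names and types are ours, chosen only to restate the algorithm, and the two helper routines are assumed rather than taken from RVM.

```c
/* Sketch of one incremental-truncation step (illustrative names only). */
struct page_entry { int uncommitted_refs; int dirty; };
struct mod_desc   { long page_index; long first_log_offset; struct mod_desc *next; };

extern struct page_entry page_vector[];   /* one entry per mapped page            */
extern struct mod_desc  *mod_queue;       /* FIFO of modification descriptors     */
extern void write_page_to_data_segment(long page_index);   /* assumed helper      */
extern long log_end_offset(void);                           /* assumed helper      */

long truncation_step(long log_head)
{
    struct mod_desc *d = mod_queue;

    /* Pages referenced by uncommitted transactions must not be written to
     * the external data segment, or the no-undo/redo property is violated. */
    if (d == NULL || page_vector[d->page_index].uncommitted_refs > 0)
        return log_head;                              /* blocked for now */

    if (page_vector[d->page_index].dirty) {
        write_page_to_data_segment(d->page_index);
        page_vector[d->page_index].dirty = 0;
    }
    mod_queue = d->next;                              /* drop the descriptor */

    /* The log head advances to the first record referenced by the next
     * descriptor; the records before it are now reclaimable. */
    return mod_queue ? mod_queue->first_log_offset : log_end_offset();
}
```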
Optimization
Early
experience
tially
reducing
with the
intratransaction Intratransaction cal, overlapping, transaction. sive
incremental
contains earliest
truncation to the
shows
pages should be written specifies the log offset
out the pages
log head
5.2
The only
log tail
Log Records
figure
in which dirty Each descriptor
mentioned
until
This
P 4
indicated of data
two
distinct
written
opportunities
to the
for substan-
log. We refer
to these
as
and intertransaction optimizations respectively. optimizations arise when set-range calls specifying identior adjacent memory addresses are issued within a single
Such
situations
programming
typically
occur
in applications.
insidious
bug,
are often
written
when one procedures
RVM
volume
while
issuing
to err
a duplicate
on the
part of an application elsewhere to perform
because
Forgetting side
call
is harmless.
Hence
This
Transactions
on Computer
Systems,
call
is an
applications
is particularly
a transaction within that
and then transaction.
those procedures may perform set-range calls for the areas memory it modifies, even if the caller or some other procedure ACM
and defen-
a set-range
of caution.
begins actions
of modularity
to issue
common invokes Each of
of recoverable is supposed to
Vol. 12, No. 1, February
1994.
46
.
M. Satyanarayanan
have
done
calls
to be ignored
so already.
often
Optimization
optimizations
Temporal
locality
translates
code in RVM
and overlapping
Intertransaction actions.
et al.
into
occur
of reference
locality
causes
and adjacent only in
in the input
of modifications
duplicate
log records context requests
set-range
to be coalesced. of no-flush to an
to recoverable
trans-
application
memory.
For
example, the command "cp d1/* d2" on a Coda client will cause as many no-flush transactions updating the data structure in RVM for d2 as there are children of d1. Only the last of these updates needs to be forced to the log on a future flush. The check for intertransaction optimization is performed at commit time. If the modifications being committed subsume those from an earlier unflushed transaction, the older log records are discarded.
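A minimal sketch of that commit-time check follows. The record layout and list handling are invented for illustration and are much simpler than RVM's actual log structures; reclamation of the discarded records' storage is omitted.

```c
/* Sketch: discard unflushed no-flush records subsumed by a newer commit. */
struct unflushed_rec {
    char                 *base;    /* start of modified range  */
    unsigned long         len;     /* length of modified range */
    struct unflushed_rec *next;
};

static int subsumes(const struct unflushed_rec *newer,
                    const struct unflushed_rec *older)
{
    return newer->base <= older->base &&
           newer->base + newer->len >= older->base + older->len;
}

/* Called when a no-flush transaction commits the range 'fresh'. */
void intertransaction_optimize(struct unflushed_rec **queue,
                               const struct unflushed_rec *fresh)
{
    struct unflushed_rec **p = queue;
    while (*p) {
        if (subsumes(fresh, *p))
            *p = (*p)->next;   /* older record will never be needed on a flush */
        else
            p = &(*p)->next;
    }
}
```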
6. STATUS RVM IBM
AND
EXPERIENCE
has been in daily RTs, DEC
MIPS
use for over two years workstations,
Intel 386/486-based laptops machines ranges from 12MB
Sun Spare
on hardware
platforms
workstations,
and workstations. Memory to 64 MB, while disk capacity
such as
and a variety
of
capacity on these ranges from 60MB
to 2.5GB. Our personal experience with RVM has only been on Mach 2.5 and 3.0. But RVM has been ported to SunOS and SGI IRIX at MIT, and we are confident that ports to other Unix platforms will be straightforward. Most applications using RVM have been written in C or C + +, but a few have been written in Standard ML. A version of the system that uses incremental truncation
is being
Our original role described aged
us to expand
updates [Kumar RVM.
debugged.
intent was just to replace Camelot by RVM in Section 2.2. But positive experience with its use. For
made to partitioned and Satyanarayanan Clients
also
use RVM
example,
transparent
on servers, in the RVM has encour-
resolution
of directory
server replicas is done using a log-based strategy 1993]. The logs for resolution are maintained in now,
particularly
for
supporting
disconnected
operation [Kistler and Satyanarayanan 1992]. The persistence of changes made while disconnected is achieved by storing replay logs in RVM, and user advice for long-term cache management is stored in a hoard database in RVM, An unexpected use of RVM has been in debugging Coda servers and clients [Satyanarayanan et al. 1992]. As Coda matured, we ran into hard-to-reproduce bugs involving corrupted persistent data structures. We realized that the information in RVM’S log offered excellent clues to the source of these corruptions. All we had to do was to save a copy of the log before truncation and build a postmortem tool to search and display the history of modifications recorded by the log. The most common source of programming problems in using RVM has been in forgetting to do a set set-range call prior to modifying an area of recoverable memory. The result is disastrous, because RVM does not create a new-value record for this area upon transaction commit. Hence the restored state after a crash or shutdown will not reflect modifications by the transaction to that area of memory. The current solution, as described in Section 5.2, ACM
TransactIons
on Computer
Systems,
Vol. 12, No, 1, February
1994.
Lightweight is to program discussed
defensively.
in Section
A better
Recoverable Virtual Memory
solution
would
.
be language
based,
47 as
8.
7. EVALUATION A fair assessment of RVM must consider two distinct issues. From a software engineering perspective, we need to ask whether RVM’S code size and complexity are commensurate with its functionality. From a systems perspective, we need to know
whether
RVM’S
focus on simplicity
has resulted
in unaccept-
able loss of performance. To
address
Camelot. ties, test Camelot
the
first
issue,
we
compared
the
source
code
of RVM
and
RVM’S mainline code is approximately 10K lines of C, while utiliprograms, and other auxiliary code contribute a further 10K lines. has a mainline
code size of about
60K lines
of C and auxiliary
code of
about 10K lines. These numbers do not include code in Mach for features like IPC and the external pager that are critical to Camelot. Thus the total size of code that has to be understood, debugged, and tuned is considerably smaller for RVM. This translates into a corresponding reduction
of effort
in maintenance
and porting.
support for nesting and distribution, choice of logging strategies—a fair To evaluate
the
performance
as well
as measurements
specific
questions
—How
serious
—What —How 7.1
effective
impact
operating
Coda
we used
servers
controlled
and
clients
up in return
in areas
such
is as
experimentation in actual
use. The
of integration
between
RVM
and VM?
on scalability? and intertransaction
system
3.2, the separation could
hurt
optimizations?
of RVM
performance.
designed a benchmark inspired by the 1991] and used it in a series of carefully 7.1.1
given
Integration
in Section
The Benchmark.
of a hypothetical
is being
to us were:
are intra-
Lack of RVM-VM
As discussed an
is the lack
is RVM’S
of RVM
from
of interest
What
as well as flexibility trade by our reckoning.
bank
The with
from To
the VM component
quantify
this
effect,
TPC family of benchmarks controlled experiments.
TPC
family
one
or
of benchmarks
more
branches,
is stated multiple
of we
[Serlin
in terms tellers
per
branch, and many customer accounts per branch. A transaction updates a randomly chosen account, updates branch and teller balances, and appends a history record to an audit trail. In our variant of this benchmark, we represent all the data structures accessed by a transaction in recoverable memory. The number parameter of our benchmark. The accounts and the audit sented data
as arrays structures
of 128-byte occupies
and
64-byte
close to half
records
the total
of accounts is a trail are repre-
respectively.
recoverable
Each
memory.
of these The sizes
of the data structures for teller and branch balances are insignificant. Access to the audit trail is always sequential, with wraparound. The pattern of accesses to the account array is a second parameter of our benchmark. The best case for paging performance occurs when accesses are ACM Transactions
on Computer
Systems,
Vol. 12, No
1, February
1994.
48
.
M, Satyanarayanan
sequential. across all
et al.
The worst case occurs when accesses are uniformly distributed accounts. To represent the average case, the benchmark uses an
access pattern that exhibits considerable temporal locality. In this access pattern, referred to as localized, 70% of the transactions update accounts on 570 of the pages; 25$%0of the transactions update accounts on a different 159Z0 of the pages; and the remaining 59Z0 of the transactions update accounts on the remaining distributed. 7.1.2
809?0 of the
Results.
Our
pages.
primary
Within
each
goal in these
the throughput of RVM over its intended situations where paging rates are low, ondary paging
goal was to observe performance becomes more significant. We
importance To meet from
of RVM-VM these goals,
32K entries
integration. we conducted
to about
450K
set,
experiments
are
uniformly
was to understand
domain of use. This corresponds to as discussed in Section 3.2. A secdegradation relative to Camelot expected this to shed light on
experiments
entries.
accesses
This
for account
roughly
arrays
corresponds
as the
ranging
to ratios
of
10!%oto 175% of total recoverable-memory size to total physical-memory size. At each account array size, we performed the experiment for sequential, random, and localized account access patterns. Table I and Figare 8 present our results. Hardware and other relevant experimental conditions are described in Table I. For sequential account access, Figure 8a shows that RVM and Camelot offer virtually identical throughput. size of recoverable memory increases. on the disks theoretical
This throughput hardly changes as the The average time to perform a log force
used in our experiments maximum
throughput
is about of 57.4
17.4 milliseconds.
transactions
per
This
second,
yields which
within 15% of the observed best-case throughputs for RVM and Camelot. When account access is random, Fig-m-e 8a shows that RVM’S throughput
a is is
initially close to its value for sequential access. As recoverable-memory size increases, the effects of paging become more significant, and throughput drops. But the drop does not become serious until recoverable-memory size exceeds about 70’%0 of physical-memory size. The random-access case is precisely where one would expect Camelot’s integration with Mach to be most valuable. Indeed, the convexities of Camelot’s degradation is more graceful ratio of recoverable to physical-memory Camelot’s. For localized
account
access, Figure
the curves in Figure 8a show that than RVMS. But even at the highest size, RVM’S throughput is better than 8b shows
that
RVM’S
throughput
drops
almost linearly with increasing recoverable-memory size. But the drop is relatively slow, and performance remains acceptable even when recoverablememory size approaches physical-memory size. Camelot’s throughput also drops linearly and is consistently worse than RVM’S throughput. These measurements confirm that RVMS simplicity is not an impediment to good performance for its intended application domain. A conservative interpretation of the data in Table I indicates that applications with good locality can use up to 4090 of physical memory for active recoverable data, ACM
Transactions
on Computer
Systems,
Vol
12, No. 1, February
1994.
Lightweight Table No. of Accounts
I.
This
Table
llmem Pmem
Presents FWM
Sequential
12.5~o 48.6 (OO) 32768 25 .0% 48.5 (O2) 65536 48.6 (OO) 37.5% 98304 131072 50.0% 48.2 (0,0) 163840 62.5% 48.1 (00) 196608 75.0% 47.7 (o o) 229376 87.5% 47.2 (0,1) 262144 100.0% 46.9 (OO) 46.3 (O 6) 294912 112.5% 46.9 [07) 327680 125.0% 48.6 (O O) 360448 137.5% 46.9 (O2) 393216 150.0% 46.5 (O4) 425984 162.5% 464 (O 4) 458752 175.0%
Recoverable Virtual Memory
Transactional
Throughput
49
.
Significantly
(Trans/See) Random Localized
Camelot (Trans/See) Sequential Random Localized
47.9 (oo) 46.4 (O I)
47,5 (o o)
48.1 (0.0)
41.6 (04)
44.5 (o 2)
46,6 (0.0)
48.2 (0.0)
34,2 (0.3)
43.1 (O6)
45.5 (0.0)
46.2 (OO)
48.9 (O 1)
30.1 (o 2)
41.2(02)
44.7 (0,2)
45.1 (00)
48.1 (OO)
29.2 (0.0)
41.3 (o 1)
43,9 (o o)
44.2 (0.1)
48,1 (0.0)
27.1 (0.2)
40.3 (o 2)
43.2 (O O)
43.4 (o o)
48.1 (04)
25.8 (12)
39.5 (O8)
42.5 (0.0)
43,8 (O 1)
48.2 (0.2)
23.9 (O 1)
37.9 (o 2)
41.6 (0.0)
41.1 (0.0)
48.0 (OO)
21.7 (00)
35.9 (02)
40.8 (O5)
39.0 (O6)
48.0 (OO)
20.8 (02)
35.2 (O 1)
39.7 (o o)
39.0 (0.5)
48.1 (0.1)
19.1 (00)
33.7 (0.0)
338
40.0 (o o)
48.3 (OO)
18.6 (00)
33.3 (o 1)
33.3 (14)
39.4 (o 4)
48.9 (OO)
18.7 (O 1)
32.4 (O2)
30.9 (o 3)
38.7 (02)
48.0 (OO)
18.2 (OO)
32.3 (O2)
27,4 (02)
35.4 (1 o)
47.7 (o o)
17.9(01)
31.6(00)
(09)
while keeping throughput degradation to less than 10%. Applications with poor locality have to restrict active recoverable data to less than 25% for similar
performance.
strained operating spite
Inactive
recoverable
data
can
be much
larger,
con-
only by startup latency and virtual-memory limits imposed by the system, The comparison with Camelot is especially revealing. In
of the fact that
RVM
is not integrated
with
VM,
it is able to outperform
Camelot over a broad range of workloads. Although we were gratified by these results, we were puzzled by Camelot’s behavior. For low ratios of recoverable to physical memory we had expected both Camelot’s and RVM’S throughputs to be independent of the degree of locality in the access pattern. The data shows that this is indeed the case for RVM. But in Camelot’s case, throughput is highly sensitive to locality even at the
lowest
recoverable-to-physical-memory
Camelot’s
throughput
sequential
case to 44.5 in the localized
Closer
examination
attributable Disk
to much
Manager.
in
ratio
transactions
of the raw higher
We conjecture
data
levels
per
of
12.55Z0. At
drops
from
that 48.1
ratio in
case, and to 41.6 in the random indicates
of paging
that
second
this
that
activity
increased
the case.
the drop in throughput sustained paging
is
by the Camelot
activity
is induced
by a less efficient checkpointing implementation for segment sizes of the order of main memory size. During truncation, the Disk Manager writes out all dirty pages referenced by entries in the affected portion of the log. When truncation is frequent and account access is random, many opportunities to amortize the cost of writing out a dirty page across multiple transactions are lost. Less frequent truncation or sequential account access result in fewer such lost opportunities. 7,2 Scalability As discussed in Section 2.3, Camelot’s heavy toll servers was a key influence on the design of RVM. ACM
Transactions
on Computer
on the scalability of Coda It is therefore appropriate
Systems,
Vol
12, No
1, February
1994
50
.
Fig. 8. Transactional throughput of RVM and Camelot under the account access patterns of Table I.