Lightweight Recoverable Virtual Memory

M. SATYANARAYANAN, HENRY H. MASHBURN, PUNEET KUMAR, DAVID C. STEERE, and JAMES J. KISTLER
Carnegie Mellon University

Recoverable virtual memory refers to regions of a virtual address space on which transactional guarantees are offered. This article describes RVM, an efficient, portable, and easily used implementation of recoverable virtual memory for Unix environments. A unique characteristic of RVM is that it allows independent control over the transactional properties of atomicity, permanence, and serializability. This leads to considerable flexibility in the use of transactions, potentially enlarging the range of applications that can benefit from them. Rather than providing functionality such as nesting and distribution itself, RVM simplifies the layering of such functionality on top of it. The article shows that RVM performs well over its intended range of usage even though it does not benefit from specialized operating system support. It also demonstrates the importance of intra- and inter-transaction optimization.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—virtual memory; D.4.5 [Operating Systems]: Reliability—fault tolerance; D.4.8 [Operating Systems]: Performance—measurements; H.2.2 [Database Management]: Physical Design—recovery and restart; H.2.4 [Database Management]: Systems—transaction processing

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Camelot, Coda, logging, paging, persistence, RVM, scalability, throughput, truncation, Unix
This work was sponsored by the Avionics Laboratory, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, under contract F33615-90-C-1465, ARPA order no. 7597. James J. Kistler is now affiliated with the DEC Systems Research Center, Palo Alto, CA.

1. INTRODUCTION

How simple can a transactional facility be, while remaining a potent tool for fault tolerance? Our answer, as elaborated here, is a user-level library with minimal programming constraints, implemented in about 10K lines of mainline code, and no more intrusive than a typical runtime library for input-output. This transactional facility, called RVM, is implemented without specialized operating system support and has been in use for over two years on a wide range of hardware from laptops to servers.
RVM is intended for Unix applications with persistent data structures that must be updated in a fault-tolerant manner. The total size of those data structures should be a small fraction of disk capacity, and their working-set size must easily fit within main memory. This combination of circumstances is most likely to be found in situations involving the metadata of storage repositories. Thus RVM can benefit a wide range of applications, from distributed file systems and databases to object-oriented repositories, CAD tools, and CASE tools. RVM can also provide runtime support for persistent programming languages. Since RVM allows independent control over the basic transactional properties of atomicity, permanence, and serializability, applications have considerable flexibility in how they use transactions.

It may often be tempting, and sometimes unavoidable, to use a mechanism that is richer in functionality or better integrated with the operating system. But our experience has been that such sophistication comes at the cost of portability, ease of use, and ease of maintenance. The design of RVM thus represents a balance between system-level concerns, such as functionality and performance, and software engineering concerns, such as usability and maintenance. Alternatively, one can view RVM as an exercise in minimalism: our design challenge lay not in conjuring up features to add, but in determining what could be omitted without crippling RVM.

Our experience with Camelot [Eppinger et al. 1991], a predecessor of RVM, and our understanding of the fault-tolerance requirements of Coda [Kistler and Satyanarayanan 1992; Satyanarayanan et al. 1990] and Venari [Nettles and Wing 1992; Wing et al. 1992] were the dominant influences on our design. We begin this article by describing our experience with Camelot. Wherever appropriate, we point out ways in which that usage experience influenced the design of RVM. The description of RVM itself is in three parts: rationale, architecture, and implementation. We conclude with an evaluation of RVM, a discussion of its use as a building block, and a summary of related work.

2. LESSONS FROM CAMELOT

2.1 Overview
Camelot is a transactional facility built to validate the thesis that general-purpose transactional support would simplify and encourage the construction of reliable distributed systems [Spector 1991]. It supports local and distributed nested transactions, and provides considerable flexibility in the choice of logging, synchronization, and transaction commitment strategies.
Camelot relies heavily on the external page management and interprocess communication facilities of the Mach operating system [Baron et al. 1987], which is binary compatible with the 4.3BSD Unix operating system [Leffler et al. 1989]. Figure 1 shows the overall structure of a Camelot node. Each module is implemented as a Mach task, and communication between modules is via Mach's interprocess communication facility (IPC).
Fig. 1. Structure of a Camelot node, showing the recoverable processes (application code), the Camelot system components, and the Mach kernel.

2.2 Usage

Our interest in Camelot arose in the context of the two-phase optimistic
replication protocol used by the Coda File System. Although the protocol does not require a distributed commit, it does require each server to ensure the atomicity and permanence of local updates to metadata in the first phase. The simplest strategy for us would have been to implement an ad hoc fault tolerance mechanism for metadata using some form of shadowing. But we were curious to see what Camelot could do for us. The aspect of Camelot that we found most useful is its support for recoverable
virtual memory [Eppinger 1989]. This unique feature of Camelot enables regions of a process' virtual address space to be endowed with the transactional properties of atomicity, isolation, and permanence. Since we did not find a need for features such as nested or distributed transactions, we realized that our use of Camelot would be something of an overkill. Yet we persisted, because it would give us first-hand experience in the use of transactions and because it would contribute toward the validation of the Camelot thesis.
We placed data structures pertaining to Coda metadata in recoverable virtual memory¹ on servers. Coda metadata included file system directories as well as persistent data for replica control and internal housekeeping. The contents of each Coda file was kept in a local Unix file on a server's disk. Recovery on a server consisted of Camelot restoring recoverable memory to the last committed state, followed by a Coda salvager which ensured mutual consistency between metadata and data.

2.3 Experience
The most valuable lesson we learned by using Camelot was that recoverable virtual memory was indeed a convenient and practically useful programming abstraction for systems like Coda. Crash recovery was simplified because data structures were restored in situ by Camelot. Directory operations were simple because they were merely manipulations of in-memory data structures. Overall, the encapsulation of messy crash recovery details into Camelot considerably simplified Coda server code, and the range of error states the Coda salvager had to handle was small.

Unfortunately, these benefits came at a high price. The problems we encountered manifested themselves as programming constraints, poor scalability, and difficulty of maintenance. In spite of considerable effort, we were not able to circumvent these problems. Since they were direct consequences of the design of Camelot, we elaborate on them in the following paragraphs.

A key design goal of Coda was to preserve the scalability of AFS. But a set of carefully controlled experiments (described in an earlier paper [Satyanarayanan et al. 1990]) showed that Coda was less scalable than AFS. These
experiments also showed that the primary contributor to loss of scalability was increased server CPU utilization and that Camelot was responsible for over a third of this increase. Examination of Coda servers in operation showed considerable paging and context-switching overheads due to the fact that
each
Camelot
operation
involved
interactions
between
many
of the
component processes shown in Figure 1. There was no obvious way to reduce this overhead, since it was inherent in the implementation structure of Camelot. A second obstacle to using Camelot was the set of programming constraints it
imposed.
These
constraints
came
in
a variety
of guises.
For
example,
Camelot required all processes using it to be descendants of the Disk Manager task shown in Figure 1. This meant that starting Coda servers required a rather convoluted procedure that made our system administration scripts complicated and fragile. It also made debugging more difficult because starting a Coda server under a debugger was complex. Another example of a programming constraint was that Camelot required us to use Mach kernel threads, even though Coda was capable of using user-level threads. Since kernel thread context switches were much more expensive, we ended up paying a hefty performance cost with little to show for it.
¹For brevity, we often omit "virtual" from "recoverable virtual memory" in the rest of this article.
A third limitation of Camelot was that its code size, complexity, and tight dependence on rarely used combinations of Mach features made maintenance and porting difficult. Since Coda was usually the sternest test case for recoverable memory, we were the first to expose new bugs in Camelot. But it was often hard to decide whether a particular problem lay in Camelot or Mach. As the cumulative toll of these problems mounted, we looked for ways to preserve the virtues of Camelot while avoiding its drawbacks. Since recoverable virtual memory was the only aspect of Camelot we relied on, we sought to distill the essence of this functionality into a realization that was cheap, easy to use, and had few strings attached. That quest led to RVM.

3. DESIGN RATIONALE

The central principle we adopted in designing RVM
was to value
simplicity
over generality. In building a tool that did one thing well, we were heeding Lampson's sound advice on interface design [Lampson 1983]. We were also being faithful to the long Unix tradition of keeping building blocks simple. The change in focus from generality to simplicity allowed us to take radically different positions from Camelot in the areas of functionality, operating system dependence, and structure.
3.1 Functionality

Our first simplification was to eliminate support for nesting and distribution. A cost-benefit analysis showed that each could be better provided as an independent layer on top of RVM.² While a layered implementation may be less efficient than a monolithic one, it has the attractive property of keeping each layer simple. Upper layers can count on the clean failure semantics of RVM, while the latter is only responsible for local, nonnested transactions.

A second area where we have simplified RVM is concurrency control. Rather than having RVM insist on a specific technique, we decided to factor out concurrency control. This allows applications to use a policy of their choice and to perform synchronization at a granularity appropriate to the abstractions they are supporting. If serializability is required, a layer above RVM has to enforce it. That layer is also responsible for coping with deadlocks, starvation, and other unpleasant concurrency control problems.

Internally, RVM is implemented to be multithreaded and to function
correctly in the presence of true parallelism. But it does not depend on kernel thread support and can be used with no changes on user-level thread implementations. We have, in fact, used RVM with three different threading mechanisms:
Mach kernel threads [Cooper and Draves 1988], coroutine C threads, and coroutine LWP [Satyanarayanan 1991].

Our final simplification was to factor out resiliency to media failure. Standard techniques such as mirroring can be used to achieve such resiliency. Our expectation is that this functionality will most likely be implemented in the device driver of a mirrored disk.

²An implementation sketch is provided in Section 8.

Fig. 2. Layering of functionality in RVM. Application code sits above optional layers for nesting, distribution, and serializability; RVM provides atomicity and permanence across process failure; the operating system provides permanence across media failure.

RVM thus adopts a layered approach to transactional support, as shown in Figure 2. This approach is simple and enhances flexibility: an application
does not have to buy into those aspects of the transactional concept that are irrelevant to it.

3.2 Operating System Dependence
To make RVM portable, we decided to rely only on a small, widely supported, Unix subset of the Mach system call interface. A consequence of this decision was that we could not count on tight coupling between RVM and the VM pager subsystem. The Camelot Disk Manager module runs as an external pager [Young 1989] and takes full responsibility for managing the backing store for recoverable regions of a process. The use of advisory VM calls (pin and unpin) in the Mach interface lets Camelot ensure that dirty recoverable regions of a process' address space are not paged out until the corresponding log records are forced to disk. This close alliance with Mach's VM subsystem allows Camelot to avoid double paging and to support recoverable regions whose size
approaches backing store or addressing limits. Efficient handling of large recoverable regions is critical to Camelot's goals.

Our goals in building RVM were more modest. We were not trying to replace traditional forms of persistent storage, such as file systems and databases. Rather, we saw RVM as a building block for metadata in those systems and in higher-level compositions of them.
Consequently,
we could
assume that the recoverable memory requirements on a machine would only be a small fraction of its total disk storage. This in turn meant that it was acceptable to waste some disk space by duplicating the backing store for recoverable called
regions.
its external
Hence
data
RVM’S
segment,
swap space. Crash recovery segment. Since a VM pageout uncommitted
dirty
backing
by the VM
good performance our strategy
use of external yet. The most
because
a process’
being paged Insulating
a recoverable
region,
of the region’s
subsystem
also requires
is to view
source usage tradeoff. By being generous have been able to keep RVM simple and optional feature
for
independent
VM
relies only on the state of the external data does not modify the external data segment, an
page can be reclaimed
of correctness. Of course, be rare. One way to characterize
store
is completely
that
without
it as a complexity
with memory portable. Our
-versus-re-
and disk space, we design supports the
pagers, but we have not implemented support apparent impact on Coda has been slower
recoverable
memory
in on demand. RVM from the
VM
must
be read
subsystem
also
loss
such pageouts
in en masse hinders
for this startup
rather
the
than
sharing
of
recoverable virtual memory across address spaces. But this is not a serious limitation. After all, the primary reason to use a separate address space is to increase robustness by avoiding memory corruption. Sharing recoverable memory across address spaces defeats this purpose. In fact, it is worse than sharing (volatile) virtual memory because damage may be persistent! Hence, our view is that processes each other enough to run 3.3
willing to share recoverable memory as threads in a single address space.
already
trust
Structure
The ability to communicate ness to be enhanced without decomposition,
shown
earlier
efficiently sacrificing
across address spaces allows robustgood performance. Camelot’s modular
in Figure
1, is predicated
on fast
IPC. Although
it has been shown that IPC can be fast [Bershad et al. 1990], its performance in commercial Unix implementations lags far behind that of the best experimental implementations. Even on Mach 2.5, the measurements reported by Stout et al. [1991] indicated that IPC is about 600 times more expensive than local procedure call.3 To make matters worse, Ousterhout [ 1990] reports that the context-switching performance of operating systems is not improving linearly
with
3430 microseconds DECStation
raw
hardware
versus
performance.
0.7 microseconds
for a null
call on a typical
contemporary
machine,
the
5000/200. ACM
TransactIons
on Computer
Systems,
Vol
12, No
1, February
1994
40
M. Satyanarayanan
. Given
our
desire
et al. to make
its
design critically dependent on fast IPC. Instead, we have structured RVM a library that is linked in with an application. No external communication
to make
RVM
portable,
we were
not willing
as of
any kind is involved in the servicing of RVM calls. An implication course, that we have to trust applications not to damage RVM tures and vice versa. A less obvious implication write-ahead
is
log on a dedicated
that
disk.
applications
Such sharing
cannot
of this is, of data struc-
share
is common
a single
in transactional
systems because disk head movement is a strong determinant of performance and because the use of a separate disk per application is economically infeasible at present. In Camelot, for example, the Disk Manager serves as the multiplexing agent for the log. The inability to share one log is not a significant limitation for Coda, because we run only one file server process on a machine. But it may be legitimate concern for other applications that wish to use RViM. horizon. First, erable
Fortunately,
independent interest
in
there
are two
potential
of transaction-processing log-structured
alleviating
factors
considerations,
implementations
of the
there Unix
on the
is consid-
file
system
[Rosenblum and Ousterhout 19921. If one were to place the RVM log for each application in a separate file on such a system, one would benefit from minimal disk head movement. No log multiplexer would be needed, because that role would be played by the file system. Second, there is a trend toward using disks of small form factor, partly motivated by interest been predicated that
in disk array the large disk
technology [Patterson capacity in the future
et al. 1988]. It has will be achieved by
using many small disks. If this turns out to be true, there will be considerably less economic incentive to avoiding a dedicated disk per process. In summary, placed
each
in a Unix
process
using
file or on a raw disk
RVM
has a separate
partition.
When
log. The
log can be
the log is on a file,
RVM
uses the fsync system call to synchronously flush modifications onto disk. RVMS permanence guarantees rely on the correct implementation of this system call. For best performance, the log should either be in a raw partition on a dedicated disk or in a file on a log-structured Unix file system. 4. ARCHITECTURE The
design
of RVM
follows
logically
from
the
rationale
presented
the description below, we first present the major program-visible and then describe the operations supported on them. 4.1
Segments
earlier.
In
abstractions
and Regions
Recoverable memory is managed in segments, which are loosely analogous to Multics segments. RVM has been designed to accommodate segments up to 2^64 bytes long, although current hardware and file system limitations restrict segment length to 2^32 bytes. The number of segments on a machine is only limited by its storage resources. The backing store for a segment may be a file or a raw disk partition. Since the distinction is invisible to programs, we use the term "external data segment" to refer to either.
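Because the distinction between the two kinds of backing store is invisible to programs, all an application has to supply is a description of the external data segment and where a region of it should be placed. The structure below is only a hypothetical illustration of such a description; the field names are ours and are not RVM's actual declarations.

```c
/* Hypothetical description of one mapped region of a segment.
 * Field names are illustrative; the real RVM declarations may differ. */
struct region_desc {
    const char *data_dev;   /* external data segment: Unix file or raw partition   */
    long        offset;     /* byte offset of the region within the segment        */
    long        length;     /* region length: page aligned, multiple of page size  */
    char       *vmaddr;     /* virtual address at which the region is to be mapped */
};

/* Either kind of backing store looks the same to the application: */
struct region_desc on_file      = { "/coda/SEG0", 0, 1 << 20, (char *)0x50000000 };
struct region_desc on_partition = { "/dev/rsd1c", 0, 1 << 20, (char *)0x50100000 };
```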
Fig. 3. Mapping of regions of segments into a process' Unix virtual memory. Each shaded area represents a mapped region of a segment. (Segments span addresses 0 to 2^64-1; Unix virtual memory spans 0 to 2^32-1.)

As shown in Figure 3, applications explicitly map regions of segments into their virtual memory. RVM guarantees that the newly mapped data represents the committed image of the region in its external data segment. A region typically corresponds to a related collection of objects and may be as large as the entire segment. In the current implementation, the copying of data from external data segment to virtual memory occurs when a region is mapped. The limitation of this method is increased startup latency, as mentioned in Section 3.2. In the future, we plan to cope with this by providing an optional Mach external pager to copy data on demand.

Restrictions on segment mapping are minimal. The most important restriction is that no region of a segment may be mapped more than once by the same process. Also, the regions specified in mappings cannot overlap in virtual memory. These restrictions eliminate the need for RVM to cope with aliasing. Mapping must be done in multiples of page size, and regions must be page aligned. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. RVM retains no information about a segment's mappings after its regions are unmapped.

A segment loader package, built on top of RVM, allows the creation and maintenance of a load map for recoverable storage and takes care of mapping a segment into the same base address each time. This simplifies the use of absolute pointers in segments. A recoverable memory allocator, also layered on RVM, supports heap management of storage within a segment.
4.2 RVM Primitives
The operations provided by RVM for initialization, termination, and segment mapping are shown in Figure 4a. The log to be used by a process is specified at RVM initialization via the options_ desc argument. The map operation is called once for each region to be mapped. The external data segment and the range of virtual memory addresses for the mapping are identified in the first argument.
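The fragment below sketches how a process might use these operations together with the transactional operations of Figure 4b. It is only a sketch: the identifiers (rvm_initialize, rvm_map, and so on), the option and region descriptor fields, and the mode constants are modeled on the operations listed in Figure 4, and the exact declarations in the real RVM library may differ.

```c
#include <string.h>
#include "rvm.h"   /* hypothetical header exporting the Figure 4 operations */

/* A persistent counter kept somewhere inside the mapped region. */
struct counters { long requests; };

int example(void)
{
    rvm_options_t opts;          /* describes the write-ahead log to use     */
    rvm_region_t  region;        /* external data segment + target addresses */
    rvm_tid_t     tid;           /* transaction identifier                   */
    struct counters *c;

    /* Tell RVM which log to use, then initialize the library. */
    memset(&opts, 0, sizeof(opts));
    opts.log_dev = "/coda/LOG";              /* file or raw partition (assumed field) */
    rvm_initialize(RVM_VERSION, &opts);

    /* Map one region of an external data segment at a fixed address. */
    memset(&region, 0, sizeof(region));
    region.data_dev = "/coda/DATA";          /* external data segment (assumed field) */
    region.offset   = 0;
    region.length   = 4 * 1024 * 1024;       /* page aligned, multiple of page size   */
    region.vmaddr   = (char *)0x50000000;    /* same base address on every run        */
    rvm_map(&region, &opts);
    c = (struct counters *)region.vmaddr;

    /* One transaction: declare the range to be modified, then modify it. */
    rvm_begin_transaction(&tid, restore);    /* "restore" => old values kept for abort */
    rvm_set_range(&tid, c, sizeof(*c));      /* must precede the modification          */
    c->requests++;
    rvm_end_transaction(&tid, no_flush);     /* lazy commit; bounded persistence       */

    rvm_flush();                             /* force the no-flush commit to the log   */
    rvm_terminate();
    return 0;
}
```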
The
unmap
operation
can be invoked
at any time
that
a region
is
quiescent.
Once unmapped, a region can be remapped to some other part of the process’ address space. After a region has been mapped, memory addresses within it may be used in the transactional operations shown in Figure 4b. The begin _transaction ACM
Transactions
on Computer
Systems,
Vol. 12, No. 1, February
1994.
42
M. Satyanarayanan
.
e
et al,
begin_transaction
(o) Imtml[zatmn
(tid,
restore_rmode);
base_addr,
nbytes)
::2.:= set_range(tid,
& Mapping
Operzzt]on~
Trmsactlonal
(b)
Operations
query(options_desc, set_options
E2zl (c)
Log Control
(d)
Fig.4
operation
returns
associated
RVM
know
that
RVM
to record
tion
indicate
the that
A
is
transaction.
changes
made
never
of end _transaction.
by
commit
latency
no-flush
transactions
since
time
a log
force
When
used
it can undo
require
ted
no-flush
truncate,
is avoided.
and
reflected to transparently
until
have
been
all committed
ensure flush
manner,
external in the
data segments. background by
The
final
such
set The
as the
number
The set _ options old
for
ACM
TransactIons
of primitives, query
triggering
operation and
operation log
on Computer
identity
provides
Systems,
of its bounded
Note
that
controlling
until
The
second
Figure
we have
shown
in
allows
an application
of uncommitted and
param-
write-ahead
flushes. for
of
the
all commitoperation,
log have been
Log truncation is usually performed RVM. But since this is a potentially
sets a variety
truncation
via
willing-
persistence
blocks
to disk.
on a
has reduced
in the write-ahead
long-running and resource-intensive operation, nism for applications to control its timing. functions.
its mode
RVM’S
RVM
flush,
forced
changes
a no-re-
aborted
transaction
To
in
permanence
can indicate
explicitly
log. The fh-st operation,
transactions
blocks
lets allows
intervention.
via the commit_
or “lazy”
in this
further
changes
Such
guarantees
persistence, where the bound is the period between log atomicity is guaranteed independent of permanence. Figure 4C shows the two operations provided by RVM use of the write-ahead
;
lets an applica-
no RVM
an application
must
all This
a transaction.
guarantee
the application
to time.
so that
commit
a no-fZush
in
operation
to be modified.
end _transacticm
a successful
Such
is used
set_range
does not have to copy data
regions
But
that
abort
since RVM
permanence
mode)
Operat]um
to begin _transaction
explicitly
committed
a weaker
area
flag
on mapped
By default,
Mmcllaneous
The
is about
of the
mode
in a transaction.
ness to accept
log from
value
efficient,
operations
transaction
abort_
eter
it will is more
Read
tid,
transaction.
area of a region
current
;
log_len,
primitives.
identifier,
that
The restore_
transaction
set-range.
with
a certain
case of an abort. store
RVM
a transaction
operations
region_desc);
(options_desc)
create_log(options,
Opermons
;
of tuning the
sizes
Vol. 12, No, 1, February
4d,
provided
performs
a variety
to obtain transactions
knobs
such
of internal 1994
a mechaof
information in
a region.
as the threshbuffers.
Using
Lightweight
Recoverable Virtual Memory
create_ log, an application can dynamically use it in an initialize operation.
create
a write-ahead
.
43
log and then
5. IMPLEMENTATION Since RVM draws upon well-known systems, we restrict our discussion implementation: log management
techniques for building here to two important and
[Mashburn and Satyanarayanan 19921 comprehensive treatment of transactional found 5.1
in Gray
and Reuter
optimization.
transactional aspects of its
The
RViM
manual
offers many further details, implementation techniques
and a can be
[1993].
Log Management
5.1.1 Log Format. RVM is able to use a no-undo/redo value-logging strategy [Bernstein et al. 1987] because it never reflects uncommitted changes to an external data segment. The implementation assumes that adequate buffer space is available in virtual memory for the old-value records of uncommitted transactions. Consequently, only the new-value records of committed
transactions
have
to be written
to the log. The format
record is shown in Figure 5. The bounds and contents of old-value set-range operations records are replaced the
corresponding
in only
records
are known
of a typical to RVM
log
from
the
issued during a transaction. Upon commit, old-value by new-value records that reflect the current contents of ranges
one new-value
of memory.
record
Note
even if that
that
range
in a transaction. The final step of transaction the new-value records to the log and writing
each modified
range
has been updated commitment out a commit
results
many
consists record.
times
of forcing
No-restore and no-flush transactions are more efficient. The former result in both time and space spacings since the contents of old-value records do not have to be copied or buffered. The latter result in considerably lower commit latency, forced 5.1.2
since
new-value
and
commit
records
can
be spooled
rather
than
to the log. Crash
Recovery
and
Log
Truncation.
Crash
recovery
RVM first reading the log from tail to head, then constructing tree of the latest committed changes for each data segment the log. The trees are then traversed, applying modifications
consists
of
an in-memory encountered in in them to the
corresponding external data segment. Finally, the head and tail location information in the log status block is updated to reflect an empty log. The idempotency of recovery is achieved by delaying this step until all other recovery actions are complete. Truncation is the process of reclaiming space allocated to log entries by applying the changes contained in them to the recoverable Periodic truncation is necessary because log space is finite whenever current log size exceeds a preset fraction of its log truncation has proved to be the hardest experience, implement correctly. To minimize implementation effort, we ACM Transactions
on Computer
Systems,
data segment. and is triggered total size. In our part of RVM to initially
chose to
Vol. 12, No. 1, February
1994.
44
.
M, Satyanarayanan
et al. Reverse
Displacements {
hfi Trans
Range
Hdr
----i)ata
Hdr 1
~;g;
-------
---------
Data
~~g~
-----------
!1
I
5.
This
~k
-.-----*
I Forward
Fig.
-----Data
log record
format
h
Displacements
of a typical
log record
read
either
way.
Tail Displacements
I
Disk
Status
Label
Block
. . . . . .Truncation Epoch
. . . . . . . . .Current Epoch
........
New
.......
Record
.......
Space
i
I Head
Fig.6.
reuse
crash
epoch
truncation,
an initial rest
1,
f
I
Displacements
This
recovery the
part
of the
of the log. Figure
fi~reshows
code
for
crash
recovery
it
more
than
is stable
truncation
truncation.
In
fornew
this
procedure
approach,
the layout
ofa
on
truncation
reliance and
results
and robust, normal
forward
epoch
increases
system
operation.
This
occurs
an epoch is
to
in the
truncation
a logically forward
correct processing
performance.
a mechanism
mechanism
to as
is applied
processing
degrades
we are implementing
referred
above
log while
log traffic, in bursty
log records
described
6 depicts
necessary, during
freed
concurrent
substantially
RVM
truncation
log while
is in progress. Although exclusive strategy,
epoch
Now
that
for incremental
periodically
renders
oldest log entries obsolete by writing out relevant pages directly from the recoverable data segment. To preserve the no-undo/redo property
the
VM to of the
log, pages that have been modified by uncommitted transactions cannot be written out to the recoverable data segment. RVM maintains internal locks to ensure
that
situations, high
incremental such
concurrency,
truncation
as the may
result
long that log space becomes to epoch truncation. Figure 7 shows the two The
first
maintains
does not
presence
data
structure
the
modification
in incremental critical. data
is
Under
truncation
structures
a page
status
property.
being
circumstances,
used
vector
of that
this
transactions
those
for
region’s
loosely analogous to a VM page table: the entry and an uncommitted reference count. A page committed set _ ranges
violate
of long-running
in
incremental
each
mapped
pages.
The
Certain
or sustained blocked RVM
for so reverts
truncation. region page
that
vector
is
for a page contains a dirty bit is marked dirty when it has
changes. The uncommitted reference count is incremented as are executed and decremented when the changes are committed
or aborted. On commit, the affected pages are marked dirty. The second data structure is a FIFO queue of page modification descriptors that specifies the ACM
TransactIons
on Computer
Systems,
Vol
12, No, 1, February
1994
Lightweight
Recoverable Virtual Memory
.
45
Page Vector Uncommitted Rt?f Cnt Dirty Reserved PP 12
P 3
log head Fig. 7.
order head. that
page.
incremental writing
queue in the
offset
the desired
truncation
no
of selecting
specified
by the
next
references:
it could
the first
by it, deleting
amount
page
in which
specified
drop to zero.
out in order to move the log of the first record referencing
duplicate
descriptor
consists
count
descriptor
the descriptor, descriptor.
a page
appear.
in the queue, and moving
This
is
A step in the
step is repeated
of log space has been reclaimed.
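In outline, the page vector and the queue of page modification descriptors drive a reclaim loop like the following sketch. The names and types are ours, chosen only to restate the algorithm, and the two helper routines are assumed rather than taken from RVM.

```c
/* Sketch of one incremental-truncation step (illustrative names only). */
struct page_entry { int uncommitted_refs; int dirty; };
struct mod_desc   { long page_index; long first_log_offset; struct mod_desc *next; };

extern struct page_entry page_vector[];   /* one entry per mapped page            */
extern struct mod_desc  *mod_queue;       /* FIFO of modification descriptors     */
extern void write_page_to_data_segment(long page_index);   /* assumed helper      */
extern long log_end_offset(void);                           /* assumed helper      */

long truncation_step(long log_head)
{
    struct mod_desc *d = mod_queue;

    /* Pages referenced by uncommitted transactions must not be written to
     * the external data segment, or the no-undo/redo property is violated. */
    if (d == NULL || page_vector[d->page_index].uncommitted_refs > 0)
        return log_head;                              /* blocked for now */

    if (page_vector[d->page_index].dirty) {
        write_page_to_data_segment(d->page_index);
        page_vector[d->page_index].dirty = 0;
    }
    mod_queue = d->next;                              /* drop the descriptor */

    /* The log head advances to the first record referenced by the next
     * descriptor; the records before it are now reclaimable. */
    return mod_queue ? mod_queue->first_log_offset : log_end_offset();
}
```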
Optimization
Early
experience
tially
reducing
with the
intratransaction Intratransaction cal, overlapping, transaction. sive
incremental
contains earliest
truncation to the
shows
pages should be written specifies the log offset
out the pages
log head
5.2
The only
log tail
Log Records
figure
in which dirty Each descriptor
mentioned
until
This
P 4
indicated of data
two
distinct
written
opportunities
to the
for substan-
log. We refer
to these
as
and intertransaction optimizations respectively. optimizations arise when set-range calls specifying identior adjacent memory addresses are issued within a single
Such
situations
programming
typically
occur
in applications.
insidious
bug,
are often
written
when one procedures
RVM
volume
while
issuing
to err
a duplicate
on the
part of an application elsewhere to perform
because
Forgetting side
call
is harmless.
Hence
This
Transactions
on Computer
Systems,
call
is an
applications
is particularly
a transaction within that
and then transaction.
those procedures may perform set-range calls for the areas memory it modifies, even if the caller or some other procedure ACM
and defen-
a set-range
of caution.
begins actions
of modularity
to issue
common invokes Each of
of recoverable is supposed to
Vol. 12, No. 1, February
1994.
46
.
M. Satyanarayanan
have
done
calls
to be ignored
so already.
often
Optimization
optimizations
Temporal
locality
translates
code in RVM
and overlapping
Intertransaction actions.
et al.
into
occur
of reference
locality
causes
and adjacent only in
in the input
of modifications
duplicate
log records context requests
set-range
to be coalesced. of no-flush to an
to recoverable
trans-
application
memory.
For
example, the command "cp d1/* d2" on a Coda client will cause as many no-flush transactions updating the data structure in RVM for d2 as there are children of d1. Only the last of these updates needs to be forced to the log on a future flush. The check for intertransaction optimization is performed at commit time. If the modifications being committed subsume those from an earlier unflushed transaction, the older log records are discarded.
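A minimal sketch of that commit-time check follows. The record layout and list handling are invented for illustration and are much simpler than RVM's actual log structures; reclamation of the discarded records' storage is omitted.

```c
/* Sketch: discard unflushed no-flush records subsumed by a newer commit. */
struct unflushed_rec {
    char                 *base;    /* start of modified range  */
    unsigned long         len;     /* length of modified range */
    struct unflushed_rec *next;
};

static int subsumes(const struct unflushed_rec *newer,
                    const struct unflushed_rec *older)
{
    return newer->base <= older->base &&
           newer->base + newer->len >= older->base + older->len;
}

/* Called when a no-flush transaction commits the range 'fresh'. */
void intertransaction_optimize(struct unflushed_rec **queue,
                               const struct unflushed_rec *fresh)
{
    struct unflushed_rec **p = queue;
    while (*p) {
        if (subsumes(fresh, *p))
            *p = (*p)->next;   /* older record will never be needed on a flush */
        else
            p = &(*p)->next;
    }
}
```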
6. STATUS RVM IBM
AND
EXPERIENCE
has been in daily RTs, DEC
MIPS
use for over two years workstations,
Intel 386/486-based laptops machines ranges from 12MB
Sun Spare
on hardware
platforms
workstations,
and workstations. Memory to 64 MB, while disk capacity
such as
and a variety
of
capacity on these ranges from 60MB
to 2.5GB. Our personal experience with RVM has only been on Mach 2.5 and 3.0. But RVM has been ported to SunOS and SGI IRIX at MIT, and we are confident that ports to other Unix platforms will be straightforward. Most applications using RVM have been written in C or C + +, but a few have been written in Standard ML. A version of the system that uses incremental truncation
is being
Our original role described aged
us to expand
updates [Kumar RVM.
debugged.
intent was just to replace Camelot by RVM in Section 2.2. But positive experience with its use. For
made to partitioned and Satyanarayanan Clients
also
use RVM
example,
transparent
on servers, in the RVM has encour-
resolution
of directory
server replicas is done using a log-based strategy 1993]. The logs for resolution are maintained in now,
particularly
for
supporting
disconnected
operation [Kistler and Satyanarayanan 1992]. The persistence of changes made while disconnected is achieved by storing replay logs in RVM, and user advice for long-term cache management is stored in a hoard database in RVM, An unexpected use of RVM has been in debugging Coda servers and clients [Satyanarayanan et al. 1992]. As Coda matured, we ran into hard-to-reproduce bugs involving corrupted persistent data structures. We realized that the information in RVM’S log offered excellent clues to the source of these corruptions. All we had to do was to save a copy of the log before truncation and build a postmortem tool to search and display the history of modifications recorded by the log. The most common source of programming problems in using RVM has been in forgetting to do a set set-range call prior to modifying an area of recoverable memory. The result is disastrous, because RVM does not create a new-value record for this area upon transaction commit. Hence the restored state after a crash or shutdown will not reflect modifications by the transaction to that area of memory. The current solution, as described in Section 5.2, ACM
TransactIons
on Computer
Systems,
Vol. 12, No, 1, February
1994.
Lightweight is to program discussed
defensively.
in Section
A better
Recoverable Virtual Memory
solution
would
.
be language
based,
47 as
8.
7. EVALUATION A fair assessment of RVM must consider two distinct issues. From a software engineering perspective, we need to ask whether RVM’S code size and complexity are commensurate with its functionality. From a systems perspective, we need to know
whether
RVM’S
focus on simplicity
has resulted
in unaccept-
able loss of performance. To
address
Camelot. ties, test Camelot
the
first
issue,
we
compared
the
source
code
of RVM
and
RVM’S mainline code is approximately 10K lines of C, while utiliprograms, and other auxiliary code contribute a further 10K lines. has a mainline
code size of about
60K lines
of C and auxiliary
code of
about 10K lines. These numbers do not include code in Mach for features like IPC and the external pager that are critical to Camelot. Thus the total size of code that has to be understood, debugged, and tuned is considerably smaller for RVM. This translates into a corresponding reduction
of effort
in maintenance
and porting.
support for nesting and distribution, choice of logging strategies—a fair To evaluate
the
performance
as well
as measurements
specific
questions
—How
serious
—What —How 7.1
effective
impact
operating
Coda
we used
servers
controlled
and
clients
up in return
in areas
such
is as
experimentation in actual
use. The
of integration
between
RVM
and VM?
on scalability? and intertransaction
system
3.2, the separation could
hurt
optimizations?
of RVM
performance.
designed a benchmark inspired by the 1991] and used it in a series of carefully 7.1.1
given
Integration
in Section
The Benchmark.
of a hypothetical
is being
to us were:
are intra-
Lack of RVM-VM
As discussed an
is the lack
is RVM’S
of RVM
from
of interest
What
as well as flexibility trade by our reckoning.
bank
The with
from To
the VM component
quantify
this
effect,
TPC family of benchmarks controlled experiments.
TPC
family
one
or
of benchmarks
more
branches,
is stated multiple
of we
[Serlin
in terms tellers
per
branch, and many customer accounts per branch. A transaction updates a randomly chosen account, updates branch and teller balances, and appends a history record to an audit trail. In our variant of this benchmark, we represent all the data structures accessed by a transaction in recoverable memory. The number parameter of our benchmark. The accounts and the audit sented data
as arrays structures
of 128-byte occupies
and
64-byte
close to half
records
the total
of accounts is a trail are repre-
respectively.
recoverable
Each
memory.
of these The sizes
of the data structures for teller and branch balances are insignificant. Access to the audit trail is always sequential, with wraparound. The pattern of accesses to the account array is a second parameter of our benchmark. The best case for paging performance occurs when accesses are ACM Transactions
on Computer
Systems,
Vol. 12, No
1, February
1994.
48
.
M, Satyanarayanan
sequential. across all
et al.
The worst case occurs when accesses are uniformly distributed accounts. To represent the average case, the benchmark uses an
access pattern that exhibits considerable temporal locality. In this access pattern, referred to as localized, 70% of the transactions update accounts on 570 of the pages; 25$%0of the transactions update accounts on a different 159Z0 of the pages; and the remaining 59Z0 of the transactions update accounts on the remaining distributed. 7.1.2
809?0 of the
Results.
Our
pages.
primary
Within
each
goal in these
the throughput of RVM over its intended situations where paging rates are low, ondary paging
goal was to observe performance becomes more significant. We
importance To meet from
of RVM-VM these goals,
32K entries
integration. we conducted
to about
450K
set,
experiments
are
uniformly
was to understand
domain of use. This corresponds to as discussed in Section 3.2. A secdegradation relative to Camelot expected this to shed light on
experiments
entries.
accesses
This
for account
roughly
arrays
corresponds
as the
ranging
to ratios
of
10!%oto 175% of total recoverable-memory size to total physical-memory size. At each account array size, we performed the experiment for sequential, random, and localized account access patterns. Table I and Figare 8 present our results. Hardware and other relevant experimental conditions are described in Table I. For sequential account access, Figure 8a shows that RVM and Camelot offer virtually identical throughput. size of recoverable memory increases. on the disks theoretical
This throughput hardly changes as the The average time to perform a log force
used in our experiments maximum
throughput
is about of 57.4
17.4 milliseconds.
transactions
per
This
second,
yields which
within 15% of the observed best-case throughputs for RVM and Camelot. When account access is random, Fig-m-e 8a shows that RVM’S throughput
a is is
initially close to its value for sequential access. As recoverable-memory size increases, the effects of paging become more significant, and throughput drops. But the drop does not become serious until recoverable-memory size exceeds about 70’%0 of physical-memory size. The random-access case is precisely where one would expect Camelot’s integration with Mach to be most valuable. Indeed, the convexities of Camelot’s degradation is more graceful ratio of recoverable to physical-memory Camelot’s. For localized
account
access, Figure
the curves in Figure 8a show that than RVMS. But even at the highest size, RVM’S throughput is better than 8b shows
that
RVM’S
throughput
drops
almost linearly with increasing recoverable-memory size. But the drop is relatively slow, and performance remains acceptable even when recoverablememory size approaches physical-memory size. Camelot’s throughput also drops linearly and is consistently worse than RVM’S throughput. These measurements confirm that RVMS simplicity is not an impediment to good performance for its intended application domain. A conservative interpretation of the data in Table I indicates that applications with good locality can use up to 4090 of physical memory for active recoverable data, ACM
Transactions
on Computer
Systems,
Vol
12, No. 1, February
1994.
Lightweight Table No. of Accounts
I.
This
Table
llmem Pmem
Presents FWM
Sequential
12.5~o 48.6 (OO) 32768 25 .0% 48.5 (O2) 65536 48.6 (OO) 37.5% 98304 131072 50.0% 48.2 (0,0) 163840 62.5% 48.1 (00) 196608 75.0% 47.7 (o o) 229376 87.5% 47.2 (0,1) 262144 100.0% 46.9 (OO) 46.3 (O 6) 294912 112.5% 46.9 [07) 327680 125.0% 48.6 (O O) 360448 137.5% 46.9 (O2) 393216 150.0% 46.5 (O4) 425984 162.5% 464 (O 4) 458752 175.0%
Recoverable Virtual Memory
Transactional
Throughput
49
.
Significantly
(Trans/See) Random Localized
Camelot (Trans/See) Sequential Random Localized
47.9 (oo) 46.4 (O I)
47,5 (o o)
48.1 (0.0)
41.6 (04)
44.5 (o 2)
46,6 (0.0)
48.2 (0.0)
34,2 (0.3)
43.1 (O6)
45.5 (0.0)
46.2 (OO)
48.9 (O 1)
30.1 (o 2)
41.2(02)
44.7 (0,2)
45.1 (00)
48.1 (OO)
29.2 (0.0)
41.3 (o 1)
43,9 (o o)
44.2 (0.1)
48,1 (0.0)
27.1 (0.2)
40.3 (o 2)
43.2 (O O)
43.4 (o o)
48.1 (04)
25.8 (12)
39.5 (O8)
42.5 (0.0)
43,8 (O 1)
48.2 (0.2)
23.9 (O 1)
37.9 (o 2)
41.6 (0.0)
41.1 (0.0)
48.0 (OO)
21.7 (00)
35.9 (02)
40.8 (O5)
39.0 (O6)
48.0 (OO)
20.8 (02)
35.2 (O 1)
39.7 (o o)
39.0 (0.5)
48.1 (0.1)
19.1 (00)
33.7 (0.0)
338
40.0 (o o)
48.3 (OO)
18.6 (00)
33.3 (o 1)
33.3 (14)
39.4 (o 4)
48.9 (OO)
18.7 (O 1)
32.4 (O2)
30.9 (o 3)
38.7 (02)
48.0 (OO)
18.2 (OO)
32.3 (O2)
27,4 (02)
35.4 (1 o)
47.7 (o o)
17.9(01)
31.6(00)
(09)
while keeping throughput degradation to less than 10%. Applications with poor locality have to restrict active recoverable data to less than 25% for similar
performance.
strained operating spite
Inactive
recoverable
data
can
be much
larger,
con-
only by startup latency and virtual-memory limits imposed by the system, The comparison with Camelot is especially revealing. In
of the fact that
RVM
is not integrated
with
VM,
it is able to outperform
Camelot over a broad range of workloads. Although we were gratified by these results, we were puzzled by Camelot’s behavior. For low ratios of recoverable to physical memory we had expected both Camelot’s and RVM’S throughputs to be independent of the degree of locality in the access pattern. The data shows that this is indeed the case for RVM. But in Camelot’s case, throughput is highly sensitive to locality even at the
lowest
recoverable-to-physical-memory
Camelot’s
throughput
sequential
case to 44.5 in the localized
Closer
examination
attributable Disk
to much
Manager.
in
ratio
transactions
of the raw higher
We conjecture
data
levels
per
of
12.55Z0. At
drops
from
that 48.1
ratio in
case, and to 41.6 in the random indicates
of paging
that
second
this
that
activity
increased
the case.
the drop in throughput sustained paging
is
by the Camelot
activity
is induced
by a less efficient checkpointing implementation for segment sizes of the order of main memory size. During truncation, the Disk Manager writes out all dirty pages referenced by entries in the affected portion of the log. When truncation is frequent and account access is random, many opportunities to amortize the cost of writing out a dirty page across multiple transactions are lost. Less frequent truncation or sequential account access result in fewer such lost opportunities. 7,2 Scalability As discussed in Section 2.3, Camelot’s heavy toll servers was a key influence on the design of RVM. ACM
Transactions
on Computer
on the scalability of Coda It is therefore appropriate
Systems,
Vol
12, No
1, February
1994
50
.
Fig. 8. Transactional throughput of RVM and Camelot under the account access patterns of Table I.