Design Tradeoffs for Software-Managed TLBs

RICHARD UHLIG, DAVID NAGLE, TIM STANLEY, TREVOR MUDGE, STUART SECHREST, and RICHARD BROWN
University of Michigan

An increasing number of computer architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties that are highly dependent on the operating system's structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of monolithic and microkernel operating systems. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—associative memories; cache memories; virtual memory; B.3.3 [Memory Structures]: Performance Analysis and Design Aids—simulation; C.4 [Computer Systems Organization]: Performance of Systems—measurement techniques; D.4.2 [Operating Systems]: Storage Management—virtual memory; D.4.8 [Operating Systems]: Performance—measurements

General Terms: Design, Experimentation, Performance

Additional Key Words and Phrases: Hardware monitoring, simulation, translation lookaside buffer (TLB), trap-driven simulation
1. INTRODUCTION

Many computers provide support for virtual memory through translation lookaside buffers (TLBs). An increasing number of architectures, including the ZS-1 [Smith et al. 1988], the HP-PA [Hewlett-Packard 1990], the AMD 29050 [Advanced Micro Devices 1991], and the MIPS RISC [Kane and Heinrich 1992], have shifted TLB management responsibility into the operating system. These software-managed TLBs simplify hardware design and provide greater flexibility in page table structure, but typically have slower refill times than hardware-managed TLBs [DeMoney et al. 1986].

At the same time, operating systems such as Mach 3.0 [Accetta et al. 1986] are moving functionality into user processes and making greater use of virtual memory for mapping data structures held within the kernel. These and related trends can increase TLB miss rates and, hence, decrease overall system performance.

This article is a case study, exploring the interaction between operating system construction and TLB performance. In particular, we examine the performance of application and operating system code running on a MIPS R2000-based workstation, and the differences in performance seen when running the same applications under different operating systems. Through hardware monitoring we measure TLB miss rates and miss-handling costs, and through simulation we explore alternative TLB configurations. The results illustrate the design tradeoffs for a software-managed TLB in conjunction with particular operating system implementations, showing how performance differences reflect operating system construction. As a case study, this work seeks to illuminate the broader problems of interaction between operating system software and architectural features.

This work is based on measurement and simulation of running systems. To examine issues that cannot be adequately modeled with simulation, we have developed a system analysis tool called Monster, which enables us to monitor running machines. We have also developed a novel TLB simulator called Tapeworm, which is compiled directly into the operating system so that it can intercept all TLB misses caused by both user process and OS kernel memory references. The information that Tapeworm extracts from the running system is used to obtain TLB miss counts and to simulate different memory system configurations.

The remainder of this article is organized as follows: Section 2 examines previous TLB and OS research related to this work. Section 3 describes our analysis tools, Monster and Tapeworm. The MIPS R2000 TLB structure and its performance under Ultrix, OSF/1, and Mach 3.0 are explored in Section 4. Hardware- and software-based improvements to TLB performance are presented in Section 5. Section 6 summarizes our conclusions.

This work was supported by the Defense Advanced Research Projects Agency under DARPA/ARO contract number DAAL03-90-C-0028, by National Science Foundation grant CDA-9121887, by a National Science Foundation CISE Research Instrumentation grant, and by a National Science Foundation Graduate Fellowship. A shorter version of this article appeared in the Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, May 1993.
Authors' address: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109-2122.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© ACM 0734-2071/94/0800-0175 $03.50
ACM Transactions on Computer Systems, Vol. 12, No. 3, August 1994, Pages 175-205.

2. RELATED WORK

By caching page table entries, TLBs greatly speed up virtual-to-physical address translations. However, memory references that require mappings not in the TLB result in misses that must be serviced either by hardware or by software. Clark and Emer [1985] examined the cost of hardware TLB management by monitoring a VAX-11/780. For their workloads, 5 to 8% of a user program's run-time was spent handling TLB misses.

More recent papers have investigated the TLB's impact on user program performance. Using traces generated from the SPEC benchmarks, Chen et al. [1992] showed that, for a reasonable range of page sizes, the amount of the address space that could be mapped was more important in determining the TLB miss rate than was the specific choice of page size.
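Chen et al.'s observation can be restated in terms of TLB "reach": what matters is the total address space the TLB can map at once, which is simply the entry count times the page size. The following trivial sketch is our illustration (the function name and the example configurations are assumptions, not taken from the cited study):

```c
#include <stdint.h>

/* TLB reach: the total address space mappable at one time.
 * Different (entries, page size) pairs with the same product
 * map the same amount of address space. */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

For example, a 64-entry TLB with 4 KByte pages and a 16-entry TLB with 16 KByte pages both reach 256 KBytes of address space.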
Talluri et al. [1992] showed that while older architectures (such as the VAX-11/780) mapped large regions of physical memory, newer architectures, including the MIPS RISC, do not. They also showed that increasing the page size from 4 to 32 Kbytes could significantly decrease the TLB's contribution to CPI.

Operating system references also have a strong impact on TLB miss rates. Clark and Emer [1985] showed that although only 18% of all memory references in the system they examined were made by the operating system, these references were responsible for 70% of the TLB misses. Several recent papers [Anderson et al. 1991; Ousterhout 1989; Welch 1991] have noted that changes in the structure of operating systems are altering the utilization of the TLB. For example, Anderson et al. [1991] compared an old-style monolithic implementation of an operating system (Mach 2.5) with a newer microkernel implementation of that system (Mach 3.0) running on a MIPS R3000-based machine, and found a 600% increase in the TLB misses requiring the most expensive type of service.

This article differs from previous work in its focus on the interaction between software-managed TLBs and operating system structure, and on the impact of changing operating system technology on TLB design. We trace the specific sources of performance difficulties and the options for eliminating them. The design tradeoffs for software-managed TLBs can be complex. Our measurements show that the cost of handling a TLB miss on a DECstation 3100 running Mach 3.0 can vary from 20 to more than 400 cycles. While the service times within a single category of miss show relatively low variance, the code paths that handle different miss types differ greatly in length, so the overall refill penalty is highly dependent on the particular mix of miss types, and that mix in turn depends on the construction of the operating system. We, therefore, focus in our analysis and discussion on the varying refill penalties of the different types of misses.

3. ANALYSIS TOOLS AND EXPERIMENTAL ENVIRONMENT

To monitor and analyze operating system and TLB behavior, we have developed a hardware-monitoring system called Monster and a TLB simulator called Tapeworm. The remainder of this section describes these tools and our experimental environment.

3.1 System Monitoring with Monster

The Monster monitoring system (Figure 1) enables comprehensive analyses of the interaction between operating systems and architectures. Monster is comprised of a monitored DECstation 3100, an attached logic analyzer, and a controlling workstation.

[Fig. 1. The Monster monitoring system. Monster is a hardware-monitoring system consisting of a DECstation 3100, running Ultrix, OSF/1, and Mach 3.0, and a Tektronix DAS 9200 Logic Analyzer. The DECstation 3100 motherboard has been modified to give the logic analyzer access to the CPU pins, which lie between the processor and the cache, allowing it to monitor virtually all system activity. To enable the logic analyzer to trigger on certain system events, such as the servicing of a TLB miss, each of the three operating systems has been instrumented with special marker NOP instructions that indicate the entry and exit points of various operating system routines.]

The logic analyzer component of Monster contains a programmable hardware state machine and a 128K-entry trace buffer. The state machine includes pattern recognition hardware that can sense the processor's address, data, and control signals on every clock cycle. This state machine can be programmed to trigger on predefined patterns appearing on the CPU bus and to then store these signals and a timestamp (with 1 ns resolution) into the trace buffer. Monster's capabilities are described more completely in Nagle et al. [1992].
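The postprocessing step that turns captured marker timestamps into per-handler timing histograms can be sketched as follows. This is only our illustration of the idea, not the authors' tool: the record layout, the even/odd marker encoding, and the one-cycle bucket width are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* One logic-analyzer trace record: a marker id and a timestamp.
 * Assumed encoding: one marker id for handler entry, another for exit. */
struct record { uint32_t marker; uint64_t t_ns; };

#define NBUCKETS  64
#define BUCKET_NS 60   /* one 60 ns machine cycle per bucket (16.67 MHz) */

/* Walk an entry/exit record stream and histogram the handler latencies. */
static void histogram(const struct record *trace, size_t n,
                      uint32_t entry_marker, uint32_t exit_marker,
                      unsigned hist[NBUCKETS])
{
    uint64_t t_entry = 0;
    int in_handler = 0;

    for (size_t i = 0; i < n; i++) {
        if (trace[i].marker == entry_marker) {
            t_entry = trace[i].t_ns;        /* handler entered */
            in_handler = 1;
        } else if (trace[i].marker == exit_marker && in_handler) {
            uint64_t dt = trace[i].t_ns - t_entry;
            size_t b = dt / BUCKET_NS;       /* latency in cycles */
            if (b >= NBUCKETS)
                b = NBUCKETS - 1;            /* clamp outliers */
            hist[b]++;
            in_handler = 0;
        }
    }
}
```

Each histogram bucket then counts single executions of a code path, which is the property that makes this more accurate than averaging repeated runs against a coarse system clock.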
In this article, we used Monster to measure TLB miss-handling costs by instrumenting each OS kernel with marker instructions denoting the entry and exit points of various code segments. The instrumented kernels were monitored with the logic analyzer, whose state machine detected the markers and stored them, with a timestamp at 1 ns resolution, into the trace buffer. Once filled, the trace buffer was postprocessed to obtain a histogram of the time spent in the different invocations of the TLB miss handlers. This technique allowed us to time single executions of code paths with far greater accuracy than can be obtained using the coarser-resolution system clock. It also avoids the problems inherent in the common method of improving system clock resolution by taking averages over repeated invocations [Clapp et al. 1986]. Note that this measurement technique is not limited to processors with off-chip caches: externally observable markers can be implemented on a processor with on-chip caches through noncached memory accesses.

3.2 TLB Simulation with Tapeworm
Many previous TLB studies have used trace-driven simulation to explore TLB design tradeoffs [Alexander et al. 1985; Clark and Emer 1985; Talluri et al. 1992]. However, there are a number of difficulties with trace-driven TLB simulation.

First, it is difficult to obtain accurate traces. Code annotation tools like pixie [Smith 1991] or AE [Larus 1990] generate user-level address traces for a single task, but more complex hardware monitors [Agarwal et al. 1988; Chen 1993; Clark and Emer 1985] are required to obtain realistic multiprocess address traces that account for both user and operating system references.

Second, trace-driven simulation can consume considerable processing and storage resources. Some researchers have overcome the storage problem by consuming traces on-the-fly [Agarwal et al. 1988; Chen et al. 1992]. This technique requires that processing on the traced system be suspended for periods of time at regular intervals while the traces are extracted, thus perturbing the workload itself.

Third, trace-driven simulation assumes that address traces are invariant to changes in the structural parameters or management policies of the simulated TLB. While this assumption may be true where misses are serviced by hardware (as in trace-driven cache simulation for such machines), it is not true for software-managed TLBs, where a TLB miss (or the absence thereof) directly changes the stream of instruction and data addresses flowing through the processor. Because the code that services a TLB miss can itself induce a TLB miss, the interaction between miss handling and the address trace can be quite complex.

We have overcome these problems by compiling our TLB simulator, Tapeworm, directly into the OSF/1 and Mach 3.0 operating system kernels. Tapeworm relies on the fact that all TLB misses in an R2000-based DECstation 3100 are handled by software. We modified the operating systems' TLB miss handlers to call the Tapeworm code via procedural "hooks" after every miss. This mechanism passes the relevant information about all user and kernel TLB misses directly to the Tapeworm simulator.

Tapeworm uses this information to maintain its own simulated TLB, which can be either larger or smaller than the actual TLB. Tapeworm controls the actual TLB's placement and replacement policies by supplying them to the operating system miss handlers, and it maintains a strict inclusion property: the entries held in the actual TLB are always a subset of those held in the simulated TLB. For example, to simulate a TLB with 128 slots, larger than the 64 slots available in the actual TLB, Tapeworm maintains an array of 128 virtual-to-physical mappings (Figure 2). After each actual TLB miss, Tapeworm checks this array to determine whether the reference also would have missed in the simulated TLB. Tapeworm can simulate TLBs with fewer entries than the actual TLB by providing a placement function that never utilizes certain slots in the actual TLB. Tapeworm uses this same technique to restrict the associativity of the actual TLB.1 By combining these policy functions with adherence to the inclusion property, Tapeworm can simulate the performance of a wide range of different-sized TLBs with different degrees of associativity and a variety of placement and replacement policies.
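The hook-and-inclusion mechanism for a larger-than-real simulated TLB can be sketched as follows. The data structure, the FIFO victim choice, and the function names are our assumptions for illustration; the real Tapeworm supports many sizes and policies.

```c
#include <stdint.h>

#define SIM_SLOTS 128            /* simulated TLB, larger than the real 64 */
#define INVALID   0xFFFFFFFFu

static uint32_t sim_tlb[SIM_SLOTS]; /* simulated slots: virtual page numbers */
static unsigned next_victim;        /* trivial FIFO replacement (illustrative) */
static unsigned long sim_misses;    /* misses the simulated TLB would take */

static void sim_reset(void)
{
    for (int i = 0; i < SIM_SLOTS; i++)
        sim_tlb[i] = INVALID;
    next_victim = 0;
    sim_misses = 0;
}

/* Hook called from the OS TLB miss handler after every *actual* miss.
 * Because the real TLB's contents are kept a strict subset of the
 * simulated TLB's contents (inclusion), checking only on actual misses
 * is sufficient: any simulated miss must also be an actual miss. */
static void tapeworm_hook(uint32_t vpn)
{
    for (int i = 0; i < SIM_SLOTS; i++)
        if (sim_tlb[i] == vpn)
            return;                     /* hit in the larger simulated TLB */
    sim_misses++;                       /* missed in both TLBs */
    sim_tlb[next_victim] = vpn;         /* insert the new mapping */
    next_victim = (next_victim + 1) % SIM_SLOTS;
}
```

With this arrangement the simulator never sees, and never needs, an address trace: it is driven entirely by the misses the running system actually takes.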
1The actual R2000 TLB is fully associative, but varying degrees of associativity can be emulated by using certain bits of a mapping's virtual page number to restrict the set of slot(s) into which the mapping may be placed.

[Fig. 2. Tapeworm. A trap on a TLB miss in the actual TLB (64 slots) invokes the miss handler in unmapped kernel code, which consults the page tables in mapped space and calls the Tapeworm simulator with its simulated TLB (e.g., 128 slots). The Tapeworm TLB simulator is built into the operating system and is invoked whenever there is a real TLB miss. The simulator uses the real TLB misses to simulate its own TLB configuration(s). Because the simulator resides in the operating system, it avoids many of the problems associated with the trace-driven simulators cited above.]

The Tapeworm design avoids the problems with trace-driven simulation cited above. Because Tapeworm is driven by procedure calls within the OS kernel, it does not require address traces at all; the various difficulties with extracting, storing, and processing large traces are avoided completely. Because Tapeworm is invoked by the machine's actual TLB miss-handling code, it considers the impact of all TLB misses, whether they are caused by user-level tasks or by the kernel itself. The Tapeworm code and data structures are placed in unmapped memory and therefore do not distort simulation results by causing additional TLB misses. Finally, because Tapeworm changes the structural parameters and management policies of the actual TLB, the behavior of the system itself changes automatically. Because the simulation is driven by the dynamic behavior of the running system rather than by static traces, the distortion inherent in fixed traces is avoided.

3.3 Experimental Environment

All experiments were performed on a DECstation 31002 running three different base operating systems (Table I): Ultrix, OSF/1, and Mach 3.0. Each of these systems includes a standard Unix file system (UFS) [McKusick et al. 1984]. Two additional versions of Mach 3.0 include the Andrew file system (AFS) cache manager [Satyanarayanan 1990]. One version places the AFS cache manager in the Mach Unix Server (AFSin), while the other migrates the AFS cache manager into a separate server task (AFSout). To obtain measurements, all of the operating systems were instrumented with counters and markers. For TLB simulation, Tapeworm was embedded in the OSF/1 and Mach 3.0 kernels.

2The DECstation 3100 contains an R2000 microprocessor (16.67 MHz) and 16 Megabytes of memory.
Table I. Benchmarks and Operating Systems.

Benchmark     Description
compress      Uncompresses and compresses a 7.7 Megabyte video clip.
IOzone        A sequential file I/O benchmark that writes and then reads a 10 Megabyte file. Written by Bill Norcott.
jpeg_play     The xloadimage program written by Jim Frost. Displays four JPEG images.
mab           John Ousterhout's Modified Andrew Benchmark [18].
mpeg_play     mpeg_play V2.0 from the Berkeley Plateau Research Group. Displays 610 frames from a compressed video file [19].
ousterhout    John Ousterhout's benchmark suite from [18].
video_play    A modified version of mpeg_play that displays 610 frames from an uncompressed video file.

Operating System   Description
Ultrix             Version 3.1 from Digital Equipment Corporation.
OSF/1              OSF/1 1.0 is the Open Software Foundation's version of Mach 2.5.
Mach 3.0           Carnegie Mellon University's version mk77 of the kernel and uk38 of the UNIX server.
Mach3+AFSin        Same as Mach 3.0, but with the AFS cache manager (CM) running in the UNIX server.
Mach3+AFSout       Same as Mach 3.0, but with the AFS cache manager running as a separate task outside of the UNIX server. Not all of the CM functionality has been moved into this server task.
Throughout this article we use the benchmarks listed in Table I. All benchmarks were compiled with the Ultrix C compiler, version 2.1 (level-2 optimization), and their inputs were tuned so that each benchmark takes approximately the same amount of time to run (100-200 seconds under Mach 3.0). The same benchmark binaries were used on all of the operating systems. Each measurement cited in this article is the average of three trials.

4. OS IMPACT ON SOFTWARE-MANAGED TLBS

Operating system references have a strong influence on TLB performance. Yet, few studies have examined these effects, with most confined to a single operating system [Clark and Emer 1985]. However, differences between operating systems can be substantial. To illustrate this point, we ran our benchmark suite on each of the operating systems listed in Table I. The results (Table II) show that although the same application binaries were run on each system, there is significant variance in the number of TLB misses and in the total TLB service time. Some of these increases are due to differences in functionality (i.e., UFS vs. AFS). Other increases are due to the structure of the operating systems. For example, the monolithic Ultrix kernel spends only 11.82 seconds handling TLB misses, while the microkernel-based Mach 3.0 spends 80.01 seconds.
Table II. Total TLB Misses across the Benchmarks.

Operating System   Total Run    Total Number     Total TLB Service   Ratio to Ultrix
                   Time (sec)   of TLB Misses    Time (sec)*         TLB Service Time
Ultrix 3.1            583        9,177,401          11.82               1.00
OSF/1                 892       11,691,398          51.85               4.39
Mach 3.0              975       24,349,121          80.01               6.77
Mach3+AFSin         1,371       33,933,413         106.56               9.02
Mach3+AFSout        1,517       36,649,834         134.71              11.40

*Time based on the measured cost of each type of TLB miss (Table III).

Although the same application binaries were run on each operating system, the total number of TLB misses increases four-fold (from 9,177,401 to 36,649,834), while the total time spent servicing TLB misses increases 11.4 times. Notice that service time grows faster than the miss count. This is due to the fact that there are different types of TLB misses, each with its own associated cost, and that the different miss types recurred with different frequencies under each operating system. To understand this difference, it is important to understand the page table structure, its relationship to TLB miss handling, and the types and costs of the different TLB misses.

4.1 Page Tables and Translation Hardware

OSF/1 and Mach 3.0 both use the machine-independent pmap code [Rashid et al. 1988] to manage their page tables. Each task has its own level-1 (L1) linear page table. Because the user page tables are relatively large and can require several megabytes of storage, they are themselves stored in virtual address space and are mapped by level-2 (L2) page tables, which also map other kernel data. Because kernel data is sparse, the L2 page tables are also mapped. This gives rise to a 3-level page table hierarchy with four different page table entry (PTE) types (Figure 3).

The R2000 processor contains a 64-slot, fully associative TLB, which caches recently used PTEs. When the processor translates a virtual address to a physical address, the relevant PTE must be held by the TLB. If the PTE is absent, the hardware invokes a trap to a software TLB miss-handling routine that finds and inserts the missing PTE into the TLB. The R2000 supports two different types of TLB miss vectors. The first, called the user TLB (uTLB) vector, is used to trap on missing translations for level-1 user (L1U) pages. This vector is justified by the fact that TLB misses on L1U PTEs are typically the most frequent [DeMoney et al. 1986]. All other TLB miss types (such as those caused by references to kernel pages, invalid pages, or read-only pages) and all other interrupts and exceptions trap to a second vector, called the generic-exception vector.
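Because the L1 page tables are themselves stored in mapped space, a single user reference can trigger nested misses: fetching the missing L1U PTE may itself miss on an L2 PTE, and fetching that may miss again, bottoming out at the unmapped level-3 table. The toy model below is our illustration of that cascade, not the actual handler code; the flags stand in for TLB lookups.

```c
/* Miss types incurred while resolving one user reference. */
enum miss { L1U_MISS, L2_MISS, L3_MISS };

/* Records, in order, the misses a single user load would trigger on a
 * forward-mapped 3-level page table.  Returns the number of misses. */
static int misses_for_user_ref(int tlb_has_l1u, int tlb_has_l2,
                               int tlb_has_l3, enum miss out[3])
{
    int n = 0;
    if (tlb_has_l1u)
        return 0;                /* translation present: no trap at all */
    out[n++] = L1U_MISS;         /* uTLB trap: must fetch the L1U PTE ... */
    if (!tlb_has_l2) {           /* ... but L1U PTEs live in mapped space */
        out[n++] = L2_MISS;      /* nested trap: must fetch the L2 PTE ... */
        if (!tlb_has_l3)         /* ... and L2 PTEs are mapped as well */
            out[n++] = L3_MISS;  /* the L3 table is unmapped: walk ends */
    }
    return n;
}
```

In the worst case one user reference costs three misses; the unmapped L3 table is what guarantees the cascade terminates.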
[Fig. 3. Page table structure, showing the virtual and physical address spaces. Each L1U PTE maps one 4K page of user text or data. Each L1K PTE maps one 4K page of kernel text or data. Each L2 PTE maps one 1,024-entry user page table page. Each L3 PTE maps one page of either L2 PTEs or L1K PTEs.]

The first level of the page table hierarchy holds the level-1 user (L1U) PTEs, which map user text and data pages, and the level-1 kernel (L1K) PTEs, which map kernel text and data pages. Each task has its own set of L1 page tables for its L1U PTEs, while the kernel's L1K PTEs reside in a single kernel page table. Both are stored in virtual (mapped) space. The L1U PTEs are in turn mapped by level-2 (L2) PTEs stored in the L2 page tables; likewise, the L2 PTEs and the L1K PTEs are both mapped by level-3 (L3) PTEs residing in the L3 page table. The L3 page table serves as the anchor of this 3-level hierarchy because it is stored in unmapped physical memory, so references to it do not go through the TLB. The L3 page table is fixed in size: a single 4 KByte page. Each PTE requires 4 bytes of storage, so one page table page holds 1,024 PTEs and can directly map 4 Megabytes of virtual space; through its L2 PTEs, the L3 page table can map either 4 Megabytes of kernel data or, indirectly, 4 GBytes of user space.

For the purposes of this case study, we define six types of TLB misses (Table III). In addition to L1U misses, which are serviced at the uTLB vector, we define five other miss types (L1K, L2, L3, modify, and invalid), which are serviced at the generic-exception vector. Table III shows the different miss types and, from our measurements, the time required to handle each. The wide differential in costs is primarily due to the page table structure and the way that the OS uses the two miss vectors. An L1U miss can be serviced in as few as 16 cycles because it is handled by a highly tuned handler inserted at the uTLB vector, and because the needed L1U PTE can be retrieved directly from the L1 page table.

3Other page table structures are possible. Huck and Hays [1993] have examined the impact of forward-mapped, inverted, and hashed page table structures on TLB miss-handling times; different structures could lead to different conclusions.
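The coverage figures above follow directly from the page table geometry (4 KByte pages, 4-byte PTEs). A quick arithmetic check, written as a sketch:

```c
#include <stdint.h>

/* Page table geometry quoted in the text: 4 KByte pages, 4-byte PTEs. */
enum { PAGE_BYTES = 4096, PTE_BYTES = 4 };

/* PTEs held by one page table page: 4096 / 4 = 1,024. */
static uint64_t ptes_per_page(void)
{
    return PAGE_BYTES / PTE_BYTES;
}

/* Address space mapped by one page of L1 PTEs: each PTE maps one
 * 4 KByte page, so 1,024 PTEs map 4 Megabytes. */
static uint64_t l1_page_coverage(void)
{
    return ptes_per_page() * PAGE_BYTES;
}

/* Address space mapped through one page of L2 PTEs: each L2 PTE maps
 * a full page of L1U PTEs, so the single L3-anchored page of L2 PTEs
 * indirectly covers 4 GBytes of user space. */
static uint64_t l2_page_coverage(void)
{
    return ptes_per_page() * l1_page_coverage();
}
```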
Table III. Costs for Different TLB Miss Types.

TLB Miss Type   Ultrix   OSF/1   Mach 3.0
L1U               16       20       20
L1K              333      355      294
L2               494      511      407
L3               n/a      354      286
Modify           375      436      499
Invalid          336      277      267

This table shows the number of machine cycles (at 60 ns/cycle) required to service each type of TLB miss. To determine these costs, Monster was used to collect a 128K-entry histogram of timings for each type of miss. We separate TLB misses into the six categories described below. Note that Ultrix does not have L3 misses because it implements a 2-level page table.

L1U      TLB miss on a level-1 user PTE.
L1K      TLB miss on a level-1 kernel PTE.
L2       TLB miss on a level-2 PTE. This can only occur after a miss on a level-1 user PTE.
L3       TLB miss on a level-3 PTE. Can occur after either a level-1 kernel miss or a level-2 miss.
Modify   A page protection violation.
Invalid  An access to a page marked as invalid (page fault).
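The service times in Table II are, in effect, a dot product of the per-type miss counts with the per-type cycle costs in Table III, converted at 60 ns per cycle. The sketch below makes that explicit; the counts used in the usage note are hypothetical, not measured values from the paper.

```c
#include <stdint.h>

#define NS_PER_CYCLE 60   /* 16.67 MHz R2000 clock */

/* Total TLB service time in seconds, given per-type miss counts and
 * per-type costs in machine cycles (as in Table III). */
static double tlb_service_seconds(const uint64_t count[],
                                  const unsigned cycles[], int ntypes)
{
    uint64_t total_cycles = 0;
    for (int i = 0; i < ntypes; i++)
        total_cycles += count[i] * cycles[i];  /* dot product */
    return (double)total_cycles * NS_PER_CYCLE / 1e9;
}
```

For example, a hypothetical mix of 1,000,000 misses at 20 cycles each plus 10,000 misses at 400 cycles each comes to 24,000,000 cycles, or 1.44 seconds; this is why a system whose miss mix shifts toward the expensive types can see service time grow much faster than the raw miss count.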
The other miss types require from 300 to over 500 cycles because they are serviced by the generic handler residing at the generic-exception vector.

The R2000 TLB hardware supports partitioning of the TLB into two sets of slots. The lower partition is intended to hold frequently referenced PTEs with high retrieval costs, while the upper partition is intended for PTEs that can be refetched quickly (e.g., L1U) or that are referenced infrequently (e.g., L3). The TLB hardware also supports random replacement of PTEs in the upper partition through a hardware index register that returns random numbers in the range 8 through 63. This, effectively, fixes the partition point so that the lower partition consists of slots 0 through 7, while the upper partition consists of slots 8 through 63.

4.2 OS Influence on TLB Performance

In the versions of the operating systems studied, there are three basic factors which account for the variation in the number of TLB misses and their associated costs (Table IV and Figure 4). They are: (1) the use of mapped memory by the kernel (both for page tables and other kernel data structures); (2) the placement of functionality within the kernel, within a user-level server process (service migration), or divided among several server processes (OS decomposition); and (3) the range of functionality provided by the system (additional OS services). The rest of this section uses our data to examine the relationship between these OS characteristics and TLB performance.

4.2.1 Mapping Kernel Data Structures. Unmapped portions of the kernel's address space do not rely on the TLB. Mapping kernel data structures, therefore, adds a new category of TLB misses: L1K misses.
Table IV. Characteristics of the Operating Systems Studied.

OS             Mapped Kernel      Service      Service          Additional
               Data Structures    Migration    Decomposition    OS Services
Ultrix         Few                None         None             X Server
OSF/1          Many               None         None             X Server
Mach 3.0       Some               Some         Some             X Server
Mach3+AFSin    Some               Some         Some             X Server & AFS CM
Mach3+AFSout   Some               Some         Many             X Server & AFS CM

For the operating systems we examined, mapping kernel data structures can have a substantial impact on performance, because each L1K miss requires several hundred cycles to service.

Ultrix places most of its data structures in unmapped memory, an amount fixed at boot time, and can draw upon a small, fixed amount of mapped virtual space if it exhausts this unmapped memory. As a result, few L1K misses occur under Ultrix. In contrast, OSF/1 and Mach 3.0 place a much larger portion of their kernel data structures in mapped virtual space, forcing them to rely heavily on the TLB.4 Both OSF/1 and Mach 3.0 mix the L1K PTEs with the L1U PTEs in the TLB's 56 upper slots. The resulting contention produces an increase in the number of L1K misses. Further, handling an L1K miss can itself require an additional L3 miss.5 In our measurements (Table V), OSF/1 and Mach 3.0 both incur more than 1.5 million L1K misses. OSF/1 must spend 62% of its TLB handling time servicing these misses, while Mach 3.0 spends 37% of its TLB handling time servicing L1K misses.

4.2.2 Service Migration. In a traditional operating system kernel such as Ultrix or OSF/1 (Figure 4), all OS services reside within the kernel, with only the kernel's data structures mapped into the virtual space. Many of these services, however, can be moved into separate server tasks, increasing the modularity and extensibility of the operating system [Anderson et al. 1991]. For this reason, numerous microkernel-based operating systems have been developed in recent years (e.g., Chorus [Dean and Armand 1991], Mach 3.0 [Accetta et al. 1986], and V [Cheriton 1984]).

By migrating these services into separate user-level tasks, operating systems like Mach 3.0 fundamentally change the behavior of the system, for two reasons. First, moving OS services into user space requires, on the MIPS RISC architecture, that both their program text and their data structures be mapped.
4Like Ultrix, Mach 3.0 reserves a portion of unmapped space for dynamic allocation of data structures. However, it appears that Mach 3.0 uses this unmapped space quickly and must begin to allocate mapped memory. Once Mach 3.0 has allocated mapped space, it does not distinguish between mapped and unmapped memory, despite their differing costs.
5L1K PTEs are stored in the mapped L2 page tables (Figure 3).

[Figure 4 (pages 186-187).]
188
Richard
.
map
large
Uhlig
blocks,
et al.
such
as the entire
kernel,
these
entries
may
be exhausted
[Digital 1992; Hewlett-Packard 1990; Kane 1990; 1993]. In either case, migrated services
and Heinrich 1992; Motorola will have to share more of the
TLB
the user tasks’
with
user
Comparing
tasks,
the
2.2-fold
increase
moving
OS services
moving
OS data
In
user
handled same
from
9.8
into
million
structures
by the
fast
uTLB
to
4.2.3
Operating
in kernel
server
and can provide Unfortunately, performance monolithic
This
is
footprints.
3.0, we see a directly
change
by
LIU
for Mach
are mapped
due
to
comes
from
user
space.
space to mapped
which
are
3.0). In contrast,
the
PTEs,
by LIK
PTEs,
which
are
handler. Moving
Decomposition.
achieve
the
full
OS functionality potential
into
a
of a microkernel-
Operating system functionality can be further deserver tasks. The resulting system is more flexible
a higher degree of fault tolerance. experience with fully decomposed
systems
has shown
severe
problems. Anderson et al. [1991] compared the performance of a Mach 2.5 and a microkernel Mach 3.0 operating system with a
substantial portion of the file system functionality ~unning cache manager task. Their results demonstrate a significant between
TLB
Mach
second
mapped
(20 cycles
space
does not
based operating system. composed into individual
and
million.
kernel
are
handler
System
Unix
21.5
OSF/1
space. The
mapped
structures
by the general-exception
monolithic
in
user
from
data
with
misses
mapped
the
structures
conflicting
of LIU
space,
data
serviced
possibly
number
the two
systems
with
Mach
2.5 running
as a separate performance
36’% faster
than
AFS gap
Mach
3.0,
despite the fact that only a single additional server task is used. Later versions of Mach 3.0 have overcome this performance gap by integrating the AFS cache manager into the Unix server. We compared our benchmarks running on the Mach3 + AFSin system against only
the same benchmarks running on the Mach3+AFSout system. The only structural difference between the two systems is the location of the AFS cache manager. The results (Table V) show a substantial increase in the number of both L2 and L3 misses. Many of the L3 misses are due to missing mappings needed to service L2 misses. The L2 PTEs compete for the 8 lower slots of the R2000's TLB. Yet, the number of L2 slots required is proportional to the number of tasks providing an OS service concurrently. As a result, adding just a single, tightly coupled task overloads the TLB's ability to map L2 page tables. Thrashing results. This increase in L2 misses will grow ever more costly as systems continue to decompose services into separate tasks.

4.2.4 Additional OS Functionality. In addition to OS decomposition and migration, many systems provide supplemental services (e.g., X, AFS, NFS, QuickTime). Each of these services, when interacting with an application, can change the operating system behavior and how it interacts with the TLB hardware. For example, adding a distributed file service (in the form of an AFS cache manager) to the Mach 3.0 Unix server adds 10.39 seconds to the L1U TLB miss-handling time (Table VI). This is due solely to the increased
functionality residing in the Unix server. However, L1K and L2 misses also increase, adding 14.3 seconds. These misses are due to the additional memory management the Mach 3.0 kernel must provide for the AFS cache manager. Increased operating system functionality and further decomposition of services will thus have an important impact on TLB miss handling.
5. IMPROVING TLB PERFORMANCE

This section examines both hardware- and software-based techniques for improving TLB performance. However, before we suggest changes, it is helpful to review the motivations behind the design of the R2000 TLB, described in Section 4.1. The MIPS R2000 TLB design is based on two principal assumptions [DeMoney et al. 1986]. First, L1U misses are assumed to be the most frequent (>95%) of all TLB miss types. Second, all OS text and most of the OS data structures (with the exception of user page tables) are assumed to be unmapped. These assumptions are reflected in two design choices: the fast uTLB vector, which handles only L1U misses, and the partitioning of the TLB's 64 slots into two disjoint sets, with the 8 lower slots intended to hold L2 PTEs (2 PTEs for kernel data structures and one for each of a task's segments) and the 56 upper slots holding all other mappings.

These assumptions make sense for a traditional Unix operating system such as Ultrix, where L1U misses constitute 98.3% of all misses and the remaining miss types impose only a small penalty. However, the assumptions break down for the OSF/1 and Mach 3.0-based systems. In these systems, non-L1U misses account for the majority of the time spent handling TLB misses, and they are serviced by the much slower general-exception vector. This substantially increases the cost of software-TLB management (Table VI).

The rest of this section proposes and explores four ways in which software-managed TLBs can be improved. First, the cost of certain types of TLB misses can be reduced by modifying the TLB miss vector scheme. Second, the number of L2 misses can be reduced by increasing the number of lower slots.6 Third, most types of TLB misses can be reduced if more total TLB slots are added to the architecture. Finally, we examine the tradeoffs between TLB size and associativity.

5.1 Reducing TLB Miss Costs
The data in Table V show a significant increase in LIK misses for OSF/1 and Mach 3.0 when compared against Ultrix. This increase is due to both systems’ reliance on dynamic allocation of kernel mapped memory. The R2000’s TLB performance suffers because LIK misses must be handled by the costly generic-exception vector, which requires 294 cycles (Mach 3.0).
6The newer MIPS R4000 processor [Kane and Heinrich 1992] implements both of these changes.
Table VII. Benefits of Reduced TLB Miss Costs

Type of     Previous       New            Counts        Previous         New              Time
PTE Miss    Cost (cycles)  Cost (cycles)                Total Cost (sec) Total Cost (sec) Saved (sec)
L1U         20             20             30,123,212    36.15            36.15            0.00
L1K         294            20              2,493,283    43.98             2.99            40.99
L2          407            40                330,803     8.08             0.79             7.29
L3          286            286               690,441    11.85            11.85            0.00
Modify      499            499               127,245     3.81             3.81            0.00
Invalid     267            267               168,429     2.70             2.70            0.00
Total                                     33,933,413   106.56            58.29            48.28

This table shows the benefits of vectoring L1K misses through the uTLB handler and of reducing the cost of L2 misses with partial hardware support.

These miss costs can be reduced through additional hardware or through more careful coding of the miss vectors. Based on our timing analysis, we estimate that vectoring L1K misses through the fast uTLB handler would reduce their cost from 294 cycles (Mach 3.0) to approximately 20 cycles, and that hardware support for L2 misses could decrease their cost from 407 cycles to under 40 cycles. Some systems implement the first steps of a TLB miss handler in hardware, invoking a software handler only for the remaining steps of the more costly misses [Huck and Hays 1993]. Such a hybrid hardware/software approach retains the benefits of software miss handling (for example, a flexible page table structure) relative to systems with full table-walking hardware, but the average cost of servicing a miss is reduced. Alternatively, dedicating a separate interrupt vector to L2 misses would allow a software handler to test for an L2 PTE miss before invocation of
the code that saves register state and allocates a kernel stack. Table VII shows the benefits of adding additional vectors. It uses the same data
for Mach3+AFSin as shown in Table V, but recomputed with the new cost estimates resulting from these two modifications. The result is that the total TLB miss service time drops from 106.56 seconds to 58.29 seconds. More importantly, the L1K miss service time drops 93% and the L2 miss service time drops 90%, so that L1K and L2 misses no longer contribute substantially to overall TLB service time. This minor design modification enables the TLB to support much more effectively a microkernel-style operating system with multiple servers in separate address spaces.
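The totals in Table VII can be checked directly from the miss counts and per-miss cycle costs. The sketch below assumes a 16.67 MHz R2000 clock, a rate not stated in this excerpt, chosen because it reproduces the published seconds to within rounding:

```python
# Recompute the Table VII totals from miss counts and per-miss cycle costs.
# Assumption: a 16.67 MHz R2000 clock; the clock rate is not stated in
# this excerpt.

CLOCK_HZ = 16.67e6

# miss type: (count, previous cycles/miss, new cycles/miss), from Table VII
MISSES = {
    "L1U":     (30_123_212,  20,  20),
    "L1K":     ( 2_493_283, 294,  20),  # re-vectored through the uTLB handler
    "L2":      (   330_803, 407,  40),  # hybrid hardware/software handling
    "L3":      (   690_441, 286, 286),
    "Modify":  (   127_245, 499, 499),
    "Invalid": (   168_429, 267, 267),
}

def total_seconds(which):
    """Total miss-handling time in seconds; which = 1 (previous) or 2 (new)."""
    return sum(row[0] * row[which] for row in MISSES.values()) / CLOCK_HZ

previous, new = total_seconds(1), total_seconds(2)
print(f"{previous:.2f}s -> {new:.2f}s, saved {previous - new:.2f}s")
```

The small residuals against the published 106.56 and 58.29 seconds come from rounding in the assumed clock rate.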
Multiple TLB miss vectors provide additional benefits. In the generic trap handler, dozens of load and store instructions are used to save and restore a task's context. Many of these loads and stores cause cache misses, requiring the processor to stall. As processor speeds continue to outstrip memory access times, the CPI in this save/restore region will grow, increasing the number of wasted cycles and making non-uTLB misses much more expensive. TLB-specific miss handlers will not suffer the same performance problems because they contain only the single memory-resident data reference needed to load the missed PTE from the page tables.

5.2 Partitioning the TLB

The MIPS R2000 TLB design mandates a partition of the TLB entries and fixes its position between the 8 lower slots and the 56 upper slots. On other architectures, it is possible to lock entries in the TLB, which can have much
the same effect as a partition, but with more flexibility in choosing the number of slots to be locked [Milenkovic 1990; Motorola 1990]. The MIPS R2000 partitioning is appropriate for an operating system like Ultrix [DeMoney
et al. 1986]. However, as OS designs migrate functionality into user-space tasks and decompose it into separate user-level tasks, having only 8 lower slots becomes insufficient. This is because the tasks in a decomposed system compete for the lower slots, displacing each other's L2 mappings from the TLB. In the following sections, we examine the influences of operating system structure, workload, PTE placement policy, and PTE replacement policy on the optimal partition point. We then propose an adaptive mechanism for dynamically adjusting the partition point.

5.2.1 Influences on the Optimal Partition Point. To better understand the impact of the position of the partition point, we measured how L2 miss rates vary depending on the number of lower TLB slots available. Tapeworm was used to vary the number of lower TLB slots from 4 to 16, while the total number of TLB slots was kept fixed at 64. OSF/1 and all three versions of Mach 3.0 ran the mab benchmark over the range of configurations, and the total number of L2 misses was recorded (Figure 5).

For each operating system, two distinct regions can be identified. The left region exhibits a steep decline, which levels off near zero seconds. This shows
a significant performance improvement for every extra lower TLB slot made available to the system, up to a certain point. For example, simply moving from 4 to 5 lower slots decreases OSF/1 L2 miss-handling time by almost 50%. After 6 lower slots, the improvement slows because the TLB can hold most of the L2 PTEs required by OSF/1.7 In contrast, the Mach 3.0 system continues to show significant improvement up to 8 lower slots. The additional 3 slots needed to bring Mach 3.0's performance in line with OSF/1 are due to the migration of OS services from the kernel to the Unix server in user space.

7Two L2 PTEs for kernel data structures and one each for a task's text, data, and stack segments.
Fig. 5. L2 misses vs. number of lower TLB slots. The total cost of servicing L2 misses for the mab benchmark under different operating systems. As the number of lower slots increases, the total time spent servicing L2 misses becomes negligible.
In Mach 3.0, whenever a task makes a system call to the Unix server in user space, the kernel, the task, and the Unix server must all share the TLB's lower slots. Because the Unix server is mapped, its three segments (text segment, data segment, stack segment) require three additional L2 PTEs. This increases the lower-slot requirement, for the system as a whole, to 8. Mach3+AFSin's behavior is similar because the AFS cache manager functionality resides in the Unix server. However, when the AFS cache manager runs as a separate user-level task (Mach3+AFSout), the TLB must hold three additional L2 PTEs (11 total). Figure 5 shows how Mach3+AFSout continues to improve until the TLB can hold all 11 L2 PTEs simultaneously.

Unfortunately, increasing the size of the lower partition has the side-effect of decreasing the size of the upper partition, at the expense of increasing the number of L1U, L1K, and L3 misses. Coupling the decreasing number of L2 misses with the increasing L1U, L1K, and L3 misses yields an optimal partition point, as shown in Figure 6. This partition point, however, is only optimal for a particular workload and operating system.

Different operating systems have different optimal partition points. The left graph in Figure 7 shows an optimal partition point of 8 for Mach 3.0, 10 for Mach3+AFSin, and 12 for Mach3+AFSout, when running the Ousterhout benchmark. Applications and their degree of interaction with OS services also influence the optimal partition point. For example, the right graph in Figure 7 shows the results for three different
applications running under Mach 3.0. compress has an optimal point at 8 slots. However, video_ play requires 14, and mpeg _ play 18 lower slots to achieve optimal TLB performance. The need for lower
slots is due to video_play and mpeg_play's increased interaction with services like the BSD server and the X server. It also underscores the importance of understanding how applications interact with both the decomposition of the system and the various OS services.
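The lower-slot accounting used throughout this subsection (8 slots when a task and the mapped Unix server are active, 11 when a separate AFS cache manager joins them) follows footnote 7 directly; a minimal sketch:

```python
# Lower-slot requirement per footnote 7: two L2 PTEs for kernel data
# structures, plus one each for every mapped task's text, data, and
# stack segments.

def lower_slots_needed(mapped_tasks):
    return 2 + 3 * mapped_tasks

# application + Unix server (Mach 3.0), then + separate AFS cache
# manager (Mach3+AFSout)
print(lower_slots_needed(2), lower_slots_needed(3))
```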
Fig. 6. Total cost of TLB misses vs. number of lower slots. The total cost of servicing L1U, L1K, L2, and L3 misses is plotted against the number of lower slots for the video_play benchmark running under Mach 3.0. The number of lower TLB slots varies from 4 to 32, while the total number of TLB entries remains constant at 64.
Fig. 7. Optimal partition points for various operating systems and benchmarks. The left graph shows the average of 3 runs of the ousterhout benchmark under 3 different operating systems. The right graph shows the average of 3 runs of 3 different benchmarks run under Mach 3.0. As more lower slots are allocated, fewer upper slots are available for the L1U, L1K, and L3 PTEs.
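The tradeoff behind the optimal partition point in Figure 6 can be sketched with a toy cost model; the curve shapes and constants below are illustrative, not the measured data:

```python
# Toy model of the Figure 6 tradeoff. L2 miss time falls to zero once the
# L2 working set fits in the lower partition, while upper-partition miss
# time grows as upper slots are taken away. All constants are illustrative.

def total_cost(lower, total=64, l2_working_set=11, upper_pressure=50.0):
    l2 = max(0, l2_working_set - lower)        # unfit L2 PTEs keep missing
    upper = upper_pressure / (total - lower)   # fewer upper slots, more misses
    return l2 + upper

# Sweep the partition point, as Tapeworm does, and take the minimum:
best = min(range(4, 33), key=total_cost)
print(best)
```

For these illustrative parameters the optimum lands exactly where the L2 working set first fits, mirroring the knees in Figure 5.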
The TLB partitioning of our operating systems places L2 PTEs in the lower slots and PTEs of all other types in the upper slots. However, it is unclear why the costly-to-retrieve L1K and L3 PTEs are not also placed in the lower slots with the L2 PTEs. Further, it is unclear whether mixing costly-to-retrieve PTEs with L1U PTEs is acceptable, or whether partitioning is required at all. To determine the effects of other PTE placement choices, we modified the miss handlers of the baseline systems to implement four different placement policies, each intended to separate mappings with low retrieval costs from those with higher retrieval costs. The results are shown in Table VIII. Policy A is identical to that
implemented by the baseline systems. The importance of some sort of partitioning is shown by Policy D, where all PTEs are mixed together and which demonstrates very poor performance. At first glance, Policy A appears to be the most desirable. However, note that the partition point was fixed at 8, so policies B and C were not permitted to grow the lower partition beyond 8 slots to accommodate the additional PTE types allocated to it. To see if the performance of policies B and C would improve with a larger lower partition, we varied the location of the partition point. Figure 8 shows the results of this experiment performed with policies A, B, and C; only the total curves are shown. Note that each policy has different optimal partition points, but the performance at these optimal points is
roughly the same for each system. From this we can conclude that the most important PTE placement policy is the separation of L1U and L2 PTEs (to avoid the very poor performance of placement policy D). However, between the other placement policies A, B, and C, the differences are negligible, provided that the partition point is tuned to the optimal point.

Careful management of the TLB partition is important because a key characteristic of microkernel system structuring is that logically independent OS services should reside in separate user-level tasks that communicate through message passing. To illustrate more clearly the degree to which performance can degrade to an unacceptable level on the Mach 3.0 kernel, we constructed an emulator of the interaction between servers in a multiserver OS. In this workload, a collection of user-level tasks mimics the behavior of communicating OS servers by passing a token between each other. The number of servers and the number of pages that each server touches before passing along the token can be varied. Figure 9 shows the results: as servers are added, the optimal partition point moves farther to the right. A system that leaves the partition point fixed at 8 will quickly encounter a performance bottleneck due to the addition of servers. However, if the TLB partition point is adjusted to account for the number of interacting servers in the system, a much greater number of servers can be accommodated. Nevertheless, note that as more servers are added, the optimal point still tends to shift upward, limiting the number of tightly coupled servers that can coexist in the system. This bottleneck is best dealt with through additional hardware support in the form of larger TLBs or miss vectors dedicated to level-2 PTE misses.
Table VIII. Alternate PTE Placement Policies

Policy  PTE Placement                                                             Cost (sec)
A       Level 2 PTEs in lower partition. All other PTEs in upper partition.         1.91
B       Level 2 and 3 PTEs in lower partition. Level 1 user and kernel PTEs
        in upper partition.                                                         3.92
C       All PTEs in lower partition, except for level 1 user PTEs, which are
        placed in upper partition.                                                  2.46
D       No partitioning at all.                                                    11.92

This table shows the effect of alternate PTE placement policies on TLB management cost. Cost is the time (in seconds) spent handling TLB misses for the Mach 3.0 system running ousterhout. The partition point is fixed at 8 slots for the lower partition and 56 slots for the upper partition.
Fig. 8. TLB partitioning with different PTE placement policies. This is the same experiment as that of Table VIII, except that costs are shown for different partition points. Only the total TLB management costs are shown. The total cost for Policy D (no partition) is off the scale of the plot at 11.92 seconds.
The baseline systems use a FIFO replacement policy for the lower partition
and a random replacement policy for the upper partition when selecting a PTE to evict from the TLB after a miss. To explore the effects of the replacement policy in these two partitions, we modified the miss handlers to try other combinations of FIFO and random replacement in the upper and lower partitions. The results of one such experiment are shown in Figure 10. For these workloads, differences between the replacement policies are negligible over the full range of TLB partition points, indicating that the choice between these two replacement policies is not very important. Our trap-driven simulation method makes it difficult to simulate not-recently-used and other pseudo-LRU policies, so currently we cannot extend this statement to other replacement policies.
Fig. 9. TLB partitioning for multiserver operating systems. This graph shows total TLB management costs under Mach 3.0 running a workload that emulates a multiserver system by passing a token among different numbers of user-level tasks.
Fig. 10. Replacement policies. This graph shows the performance of different replacement policies (random or FIFO) for the Mach 3.0 system implementing PTE placement policy A on ousterhout.
Table IX. Different TLB Partitioning Schemes

Workload     Fixed Partitioning (sec)   Static Partitioning (sec)   Dynamic Partitioning (sec)
ousterhout   3.92                       1.27 (18)                   1.11
video_play   1.61                       1.47 (18)                   1.43
IOzone       1.30                       0.43 (32)                   0.43

This table compares the total TLB management costs when fixing the partition point at 8, when setting it to the static optimal point (shown in parentheses), and when using dynamic partitioning. The PTE placement policy is B.
5.2.2 Dynamic Partitioning. The best place to set the TLB partition point varies depending on the workload, the PTE placement and replacement policies, and the structure of the operating system. Given knowledge of these factors, operating system designers and administrators might determine the optimal partition point ahead of time and then fix it statically. An operating system can have knowledge of its own structure and policies, but it can do little to predict the nature of the future workloads that it must service. Moreover, although parameters of this sort can be tuned at a given installation, they are often left untouched or are set incorrectly.

To address these problems, we have designed and implemented a simple, adaptive algorithm that self-tunes the TLB partition point to the workload. The algorithm is invoked after a given number of TLB misses, at which time it decides to move the partition point either up, down, or not at all. It is based on a hill-climbing approach, where the objective function is the TLB miss-handling cost, computed from the counts and costs of the TLB miss types. At each invocation, the algorithm tests whether the most recent change to the partition point resulted in a significant increase or decrease in miss-handling costs compared with the two most recent settings; if so, the partition point is adjusted appropriately. This algorithm tends to home in on the optimal partition point and tracks it as it changes with time.

We tested this algorithm on each of the benchmarks in our suite and compared the resultant miss-handling costs against runs that fix the partition at 8 and at the static optimal point for the given benchmark. The results are shown in Table IX. Note that dynamic partitioning at times yields results that are slightly better than the static optimal. To explain this effect, we performed another experiment that records the movement of the TLB partition point during the run of gcc, a component of the mab benchmark. The results (see Figure 11) show that the optimal partition point changes with
time as the benchmark moves through its different working sets.

Fig. 11. Dynamic TLB partitioning while running gcc. This graph shows the movement of the partition point with time on a system that implements dynamic partitioning.

Because the dynamic partitioning algorithm tracks the optimal point with time, it has the potential to give slightly better results than the static optimal, which remains fixed at some "good average point" for the duration of a run.
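The hill-climbing adjustment described above can be sketched as follows; the function name, step size, significance threshold, and test cost curve are all illustrative, since the paper's implementation details are not given in this excerpt:

```python
# Sketch of a hill-climbing partition adjuster: each invocation compares
# the miss-handling cost at the current setting against the previous one,
# keeps moving in a direction that lowered cost, and reverses otherwise.
# Step size, threshold, and the quadratic test cost are illustrative.

def adapt(history, partition, cost, step=1, threshold=0.05):
    """Append (partition, cost) and return the next partition point."""
    history.append((partition, cost))
    if len(history) < 2:
        return partition + step                   # probe to get a gradient
    (p_prev, c_prev), (p_now, c_now) = history[-2:]
    if abs(c_now - c_prev) < threshold * c_prev:  # no significant change
        return p_now
    moved_up = p_now > p_prev
    improved = c_now < c_prev
    return p_now + step if moved_up == improved else p_now - step

# Drive it with a convex cost curve whose minimum is at 11 lower slots:
hist, p = [], 8
for _ in range(30):
    p = adapt(hist, p, (p - 11) ** 2 + 1.0)
# p now oscillates within one step of the optimum
```

As with any hill climber, the adjuster settles into a small oscillation around the optimum rather than stopping exactly on it, which matches the tracking behavior shown in Figure 11.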
The invocation period of the dynamic partitioning algorithm can be set so that its overhead is minimal. It should be noted, however, that there is an additional, more substantial cost for maintaining the L1 miss counts required to compute the objective function.8 This cost could be reduced with additional hardware support, in the form of a register that counts L1 misses.

8Maintaining a memory-resident counter in the level-1 miss handler requires a load-increment-store sequence. On the R2000, this can require anywhere from 4 to 10 cycles, depending on whether the memory reference hits in the data cache. This is a 20% to 50% increase over the 20-cycle average currently required by the level-1 miss handler.

5.3 Increasing TLB Size

The upper TLB slots hold the mappings responsible for the most significant components of TLB miss cost (L1U, L1K, and L3 PTEs), whereas the lower slots hold only L2 PTEs. To better understand the requirements for upper slots, we used Tapeworm to simulate TLB configurations ranging from 32 to 512 upper slots. Each of these TLB configurations was fully associative and had 16 lower slots to minimize L2 misses. Figure 12 shows the resulting TLB service time for all seven benchmarks run under OSF/1.
Fig. 12. TLB service time vs. number of upper TLB slots. The total cost of TLB miss servicing for all seven benchmarks run under OSF/1. The number of upper slots varied from 8 to 512, while the number of lower slots was fixed at 16 for all configurations. "Other" is the sum of the modify, invalid, and L2 miss costs.
For smaller TLBs, the most significant component of miss time is the servicing of L1K misses, due to the mapped data structures in the OSF/1 kernel. However, as outlined in Section 5.1, modifying the hardware to allow the lower-cost uTLB trap mechanism to service L1K misses reduces their cost to an estimated 20 cycles each. Figure 13 shows TLB service time recomputed for the OSF/1, Mach 3.0, and Mach3+AFSout systems with L1K misses handled at this reduced cost; with this change, total service time is dominated by L1U misses.

Increasing TLB size from 32 to 128 upper slots reduces TLB miss-handling time by over 50%. After 128 slots, further increases in size provide little benefit, and the invalid and modify misses (listed as "other" in the figures) gain prominence in the total service time. Because the invalid and modify miss counts are constant with respect to TLB size, any further increase in TLB size will have a negligible effect on overall miss-handling time. This suggests that a 128- or 256-entry TLB may be sufficient to support both monolithic operating systems like Ultrix and OSF/1 and microkernel systems like Mach 3.0. Of course, even larger TLBs may be needed to support large applications, such as CAD systems; the reader is referred to Chen et al. [1992] for a detailed discussion of TLB support for large programs. However, this article limits its discussion to TLB support for operating systems running a modest workload.
5.4 TLB Associativity

Large, fully associative TLBs (128+ entries) are difficult to build and can consume nearly twice the chip area of a direct-mapped structure of the same capacity [Mulder et al. 1991]. To achieve high TLB performance, computer architects could implement larger TLBs with lesser degrees of associativity. The following section explores the effectiveness of TLBs with varying degrees of associativity.
Fig. 13. Modified TLB service time vs. number of upper slots. The total cost of TLB miss servicing (for all seven benchmarks) assuming that L1K misses are handled in 20 cycles and L2 misses in 40 cycles; that is, assuming these misses can be handled by the uTLB miss handler. The top graphs are for OSF/1 and Mach 3.0, and the bottom graph is for Mach3+AFSout. Note that the scale varies for each graph.

Many current-generation processors implement fully associative TLBs with 32 to more than 100 entries (Table X). However, technol-
ogy limitations may force designers to begin building larger TLBs that are not fully associative. To explore the performance impact of limiting TLB associativity, we used Tapeworm to simulate TLBs with varying degrees of associativity. The top two graphs in Figure 14 show the total TLB miss-handling time for the mpeg_play benchmark under Mach3+AFSout and the video_play benchmark under Mach 3.0. Throughout the range of TLB sizes, increasing
associativity reduces the total TLB handling time. These figures illustrate the general “rule of thumb” that doubling the size of a caching structure will yield about the same performance as doubling the degree of associativity [Patterson
and Hennessy 1990].
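The compress pathology reported below is easy to reproduce with a minimal set-associative TLB model (LRU within each set; parameters illustrative, not any real processor's): a strided page trace that lands in a single set thrashes a large 2-way TLB, while a much smaller 4-way TLB holds the whole working set:

```python
from collections import OrderedDict

def misses(pages, entries, ways):
    """Count TLB misses for a page trace on a set-associative TLB with LRU."""
    sets = entries // ways
    tlb = [OrderedDict() for _ in range(sets)]
    count = 0
    for page in pages:
        s = tlb[page % sets]             # set index: low bits of page number
        if page in s:
            s.move_to_end(page)          # hit: refresh LRU position
        else:
            count += 1
            if len(s) == ways:
                s.popitem(last=False)    # evict the least recently used
            s[page] = True
    return count

# Four pages whose numbers all collide in one set of a 256-set TLB:
trace = [i * 256 for i in range(4)] * 100

big_2way = misses(trace, entries=512, ways=2)    # one set thrashes
small_4way = misses(trace, entries=32, ways=4)   # working set fits in a set
print(big_2way, small_4way)
```

With these parameters the 512-entry, 2-way TLB misses on every reference while the 32-entry, 4-way TLB misses only on the first pass, mirroring the compress result described in the text.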
Table X. Number of TLB Slots for Current Processors

Processor            Instruction Slots   Data Slots   Associativity
DEC Alpha 21064      8+4                 32           full
HP 9000 Series 700   96+4 unified        —            full
IBM RS/6000          32                  128          2-way
Intel 486            32 unified          —            4-way
MIPS R2000           64 unified          —            full
MIPS R4000           48 unified          —            full
TI Viking            64 unified          —            full

Note that page sizes are variable in many processors and can range from 4K to 16 Meg. The MIPS R4000 actually has 48 double slots; two PTEs that map consecutive pages in the virtual address space can reside in one double slot [Digital 1992].
Fig. 14. Total TLB service time for TLBs of different sizes and associativities (2-way, 4-way, 8-way, and full). The top graphs show mpeg_play under Mach3+AFSout and video_play under Mach 3.0; the bottom graph shows compress under OSF/1.
Some benchmarks, however, perform badly with a small degree of set associativity. The bottom graph in Figure 14 shows the total TLB miss-handling time for the compress benchmark under OSF/1, which displays pathological behavior: even a 512-entry, 2-way set-associative TLB is outperformed by a much smaller 32-entry, 4-way set-associative TLB. Taken together, these three graphs show that reducing associativity to enable the construction of larger TLBs is an effective technique for reducing TLB misses, provided the degree of associativity remains sufficient to avoid such pathologies.
6. SUMMARY

This article seeks to demonstrate to architects and operating system designers the importance of understanding the interactions between TLB hardware and operating system software, and the importance of tools for the measurement and simulation of these interactions. A system's TLB behavior depends on numerous architectural and operating system features, and performance problems attributable to the TLB can be magnified when an operating system is ported to a new platform.

The construction and utilization of the TLB can significantly influence overall system performance, because of the large TLB miss service times that can exist. Software management of the TLB depends not only on the hardware design of the TLB itself, but also on the kernel's use of virtual memory to map its own data structures, including the page tables themselves, and on the division of service functionality between the kernel and separate user tasks. In the case study presented here, these interactions resulted in significant variations in TLB miss service times. Currently popular microkernel approaches rely on multiple server tasks, but here fell prey to performance problems with even a modest degree of service decomposition.

In our study, we have presented measurements of actual problems on a current machine, together with simulations of numerous variations, and have related the results to the differences between operating systems. We have outlined four architectural solutions to the problems experienced by microkernel-based systems: changes in the vectoring of TLB misses, flexible partitioning of the TLB, providing larger TLBs, and limiting associativity to enable the construction of larger TLBs. The first two can be implemented at little cost, as is done in the R4000. These studies are most applicable to the MIPS RISC family, on which they were performed. The general lessons regarding architectural trends and their potential for interaction with the operating systems and machines to which they are ported, and the utility of tools for assessing these interactions, are, we feel, more broadly applicable.
REFERENCES

ACCETTA, M., BARON, R., GOLUB, D., RASHID, R., TEVANIAN, A., AND YOUNG, M. 1986. Mach: A new kernel foundation for UNIX development. In Proceedings of the Summer 1986 USENIX Conference. USENIX Assoc., Berkeley, Calif.
AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. 1988. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. 6, 4, 393–431.
ALEXANDER, C. A., KESHLEAR, W. M., AND BRIGGS, F. 1985. Translation buffer performance in a UNIX environment. Comput. Arch. News 13, 5, 2–14.
ADVANCED MICRO DEVICES. 1991. Am29050 Microprocessor User's Manual. Advanced Micro Devices, Inc., Sunnyvale, Calif.
ANDERSON, T. E., LEVY, H. M., BERSHAD, B. N., AND LAZOWSKA, E. D. 1991. The interaction of architecture and operating system design. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York.
CHEN, J. B. 1993. Software methods for system address tracing. In Proceedings of the 4th Workshop on Workstation Operating Systems. IEEE Computer Society Press, Los Alamitos, Calif., 178–185.
CHEN, J. B., BORG, A., AND JOUPPI, N. P. 1992. A simulation based study of TLB performance. In The 19th Annual International Symposium on Computer Architecture. IEEE, New York.
CHERITON, D. R. 1984. The V Kernel: A software base for distributed systems. IEEE Softw. 1, 2, 19–42.
CLAPP, R. M., DUCHESNEAU, L., VOLZ, R. A., MUDGE, T. N., AND SCHULTZE, T. 1986. Toward real-time performance benchmarks for Ada. Commun. ACM 29, 8 (Aug.), 760–778.
CLARK, D. W. AND EMER, J. S. 1985. Performance of the VAX-11/780 translation buffer: Simulation and measurement. ACM Trans. Comput. Syst. 3, 1, 31–62.
DEAN, R. W. AND ARMAND, F. 1992. Data movement in kernelized systems. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures. USENIX Assoc., Berkeley, Calif.
DEMONEY, M., MOORE, J., AND MASHEY, J. 1986. Operating system support on a RISC. In COMPCON. IEEE, New York.
DIGITAL. 1992. Alpha Architecture Handbook. Digital Equipment Corp., Bedford, Mass.
HEWLETT-PACKARD. 1990. PA-RISC 1.1 Architecture and Instruction Set Reference Manual. Hewlett-Packard, Inc.
HUCK, J. AND HAYES, J. 1993. Architectural support for translation table management in large address space machines. In The 20th Annual International Symposium on Computer Architecture. IEEE, New York.
KANE, G. AND HEINRICH, J. 1992. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, N.J.
LARUS, J. R. 1990. Abstract execution: A technique for efficiently tracing programs. Univ. of Wisconsin-Madison.
MCKUSICK, M. K., JOY, W. N., LEFFLER, S. J., AND FABRY, R. S. 1984. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3, 181–197.
MILENKOVIC, M. 1990. Microprocessor memory management units. IEEE Micro 10, 2, 75–85.
MOTOROLA. 1990. MC88200 Cache/Memory Management Unit User's Manual. 2nd ed. Prentice-Hall, Englewood Cliffs, N.J.
MOTOROLA. 1993. PowerPC 601 RISC Microprocessor User's Manual. Motorola, Inc., Phoenix, Ariz.
MULDER, J. M., QUACH, N. T., AND FLYNN, M. J. 1991. An area model for on-chip memories and its application. IEEE J. Solid-State Circ. 26, 2, 98–106.
NAGLE, D., UHLIG, R., AND MUDGE, T. 1992. Monster: A tool for analyzing the interaction between operating systems and computer architectures. Tech. Rep. CSE-TR-147-92, Univ. of Michigan, Ann Arbor, Mich.
OUSTERHOUT, J. 1989. Why aren't operating systems getting faster as fast as hardware? Tech. Note TN-11, DEC Western Research Laboratory, Palo Alto, Calif.
PATEL, K., SMITH, B. C., AND ROWE, L. A. 1992. Performance of a software MPEG video decoder. Univ. of California, Berkeley, Calif.
PATTERSON, D. A. AND HENNESSY, J. L. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, Calif.
RASHID, R., TEVANIAN, A., YOUNG, M., GOLUB, D., BARON, R., BLACK, D., BOLOSKY, W., AND CHEW, J. 1988. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. IEEE Trans. Comput. 37, 8, 896–908.
SATYANARAYANAN, M. 1990. Scalable, secure, and highly available distributed file access. IEEE Computer 23, 5, 9–21.
SMITH, J. E., DERMER, G. E., AND GOLDSMITH, M. A. 1988. Computer system employing virtual memory. U.S. Patent No. 4,774,659. Assignee: Astronautics Corporation of America.
SMITH, M. D. 1991. Tracing with Pixie. Tech. Rep. CSL-TR-91-497, Computer Systems Laboratory, Stanford Univ., Palo Alto, Calif.
TALLURI, M., KONG, S., HILL, M. D., AND PATTERSON, D. A. 1992. Tradeoffs in supporting two page sizes. In The 19th Annual International Symposium on Computer Architecture. IEEE, New York.
WELCH, B. 1991. The file system belongs in the kernel. In USENIX Mach Symposium Proceedings. USENIX Assoc., Berkeley, Calif.
WILKES, J. AND SEARS, B. 1992. A comparison of protection lookaside buffers and the PA-RISC protection architecture. HP Laboratories.

Received October 1993; revised May 1994; accepted July 1994