Memory System Performance of Programs with Intensive Heap Allocation

AMER DIWAN
University of Massachusetts, Amherst
DAVID TARDITI
Carnegie Mellon University
and
ELIOT MOSS
University of Massachusetts, Amherst

Heap allocation with copying garbage collection is a general storage management technique for programming languages. It is believed to have poor memory system performance. To investigate this, we conducted an in-depth study of the memory system performance of heap-allocation-intensive ML programs on memory systems typical of workstations. We found that many machines support heap allocation poorly. However, with the appropriate memory system organization, heap allocation can have good memory system performance. The crucial property for achieving good performance is the ability to allocate and initialize a new object into the cache without a penalty. This can be achieved by having subblock placement with a subblock size of one word, with a write-allocate policy, along with fast page-mode writes or a write buffer. For caches with subblock placement, the memory system overhead was under 9% for a 64K or larger data cache; without subblock placement the overhead was often higher than 50%.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—associative memories; cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids—simulation; C.4 [Computer Systems Organization]: Performance of Systems; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.3.2 [Programming Languages]: Language Classifications—applicative languages; ML; D.3.3 [Programming Languages]: Language Constructs and Features—dynamic storage management

General Terms: Design, Experimentation, Languages, Measurement, Performance

Additional Key Words and Phrases: Automatic storage reclamation, caches, garbage collection, heap allocation, Standard ML

A preliminary version of this article was presented at the 21st Annual ACM Symposium on Principles of Programming Languages.
This research is sponsored by the Defense Advanced Research Projects Agency, DoD, through ARPA Order 8313, and monitored by ESD/AVS under contract F19628-91-C-0168. D. Tarditi is also supported by an AT&T Ph.D. Scholarship. The views and conclusions expressed in this article are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the United States Government, or AT&T.
Authors' addresses: A. Diwan and E. Moss, Department of Computer Science, University of Massachusetts, Amherst, MA 01003-4610; email: {diwan; moss}@cs.umass.edu; D. Tarditi, Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891; email: [email protected].
© 1995 ACM 0734-2071/95/0800-0244 $03.50
ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995, Pages 244-273.
1. INTRODUCTION
Heap allocation with copying garbage collection is widely believed to have poor memory system performance [Koopman et al. 1992; Peng and Sohi 1989; Wilson et al. 1990; 1992; Zorn 1991]. To investigate this, we conducted an extensive study of the memory system performance of heap-allocation-intensive programs. The programs, compiled with the SML/NJ compiler [Appel 1992], do tremendous amounts of heap allocation, allocating one word every 4 to 10 instructions. The programs used a generational copying garbage collector to manage their heaps.

To our surprise, we found that the memory system performance of these programs was comparable to that of C and Fortran programs [Chen and Bershad 1993] for some memory system organizations typical of current workstations, such as that of the DECStation 5000/200: with data caches of 32K or larger that have a write-allocate write-miss policy and subblock placement, the programs ran only 3 to 13% slower than they would have with an infinitely fast memory. For other configurations, the slowdown due to data cache misses was often higher than 50%. The memory system features most important for achieving good performance with heap allocation are subblock placement with a subblock size of one word, a write-allocate write-miss policy, and page-mode writes or a write buffer. Heap allocation performs poorly on many machines that do not have one or more of these features, unless the cache is large (256K or larger).

Our work differs from previous work in two ways. First, previous work used the overall miss ratio as the performance metric, which is a misleading indicator of performance. The overall miss ratio neglects the fact that read and write misses may have different costs. Also, the overall miss ratio does not reflect the rates of reads and writes, which may affect performance substantially. We use memory system contribution to cycles per instruction (CPI) as our performance metric, which accurately reflects the effect of the memory system on program running time. Second, previous work concentrated solely on caches and did not model the entire memory system. Memory system features such as write buffers and page-mode writes interact with the costs of hits and misses in the cache and should be simulated to give a correct picture of memory system behavior. We simulate the entire memory system.

We did the study by instrumenting programs to produce traces of all memory references. We fed the references into a memory system simulator which calculated a performance penalty due to the memory system. We fixed the architecture to be the MIPS R3000 [Kane and Heinrich 1992] and varied cache configurations to cover the design space typical of workstations such as DECStations, SPARCStations, and the HP 9000 series 700. We studied eight substantial programs. We varied the following memory system parameters: cache size (8K to 512K), cache block size (16 or 32 bytes), write-miss policy (write-allocate or write-no-allocate), subblock placement (with and without), associativity (one- and two-way), TLB sizes (1 to 64 entries), write-buffer depth (1 to 6 deep), and page-mode writes (with and without). We simulated split instruction and data caches, i.e., no unified caches. We report results only for write-through caches, but the results extend easily to write-back caches.

Section 2 gives background information. Section 3 describes related work. Section 4 describes the simulation methods, the metrics, and the benchmarks used. Section 5 presents the simulation and memory system performance results, analyzes them, validates them, and gives an analytical model that extends them to programs with different allocation behavior. Section 6 suggests areas for future work. Section 7 concludes.
2. BACKGROUND
The following sections describe memory systems, garbage collection in SML/NJ, SML, and the SML/NJ compiler.

2.1 Memory Systems
This section describes cache organization for a single level of caching. A cache is divided into blocks, which are grouped into sets. A memory block may reside in the cache in exactly one set, but may reside in any block within the set. A cache with sets of size n is said to be n-way associative. If n = 1, the cache is called direct-mapped. Some caches have valid bits, to indicate what sections of a block hold valid data. A subblock is the smallest part of a cache with which a valid bit is associated. In this article, subblock placement implies a subblock of one word, i.e., valid bits are associated with each word. Moreover, on a read miss, the whole block is brought into the cache, not just the subblock that missed. Przybylski [1990] notes that this is a good choice.

A memory access to a location which is resident in the cache is called a hit. Otherwise, the memory access is a miss. A miss is a compulsory miss if it is due to a memory block being accessed for the first time. A miss is a capacity miss if it results from the cache not being large enough to hold all the blocks used by a program. It is a conflict miss if it results from two blocks mapping to the same set [Hill 1988]. A read miss is handled by copying the missing block from the main memory to the cache.
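The organization just described is easy to make concrete. The following sketch (our illustration, not code from the article) models an n-way set-associative cache with LRU replacement, mapping memory block b to set b mod S, and classifies each access as a hit or a miss:

```python
# Minimal n-way set-associative cache model (a sketch, not the article's
# simulator). Addresses are in units of memory blocks; block b maps to
# set b % num_sets, and each set is kept in LRU order.

class Cache:
    def __init__(self, num_sets, assoc):
        self.num_sets = num_sets                   # S sets
        self.assoc = assoc                         # n blocks per set (n-way)
        self.sets = [[] for _ in range(num_sets)]  # LRU-ordered block lists

    def access(self, block):
        """Return 'hit' or 'miss' for a reference to the given memory block."""
        s = self.sets[block % self.num_sets]
        if block in s:
            s.remove(block)
            s.append(block)        # most recently used goes to the end
            return "hit"
        if len(s) == self.assoc:
            s.pop(0)               # evict the least recently used block
        s.append(block)
        return "miss"
```

With a 4-set direct-mapped cache (n = 1), the reference string 0, 4, 0 misses three times, because blocks 0 and 4 conflict in the same set; a 2-way cache of the same capacity (2 sets of 2 blocks) keeps both blocks resident, so the final reference hits.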
A write hit is always written to the cache. There are several policies for handling a write miss, which differ in their performance penalties. For each of the policies, the actions taken on a write miss are:

(1) Write-no-allocate:
—Do not allocate a block in the cache.
—Send the write to main memory, without putting the write in the cache.

(2) Write-allocate, no subblock placement:
—Allocate a block in the cache.
—Fetch the corresponding memory block from main memory.
—Write the word to the cache (and to memory if write-through).

(3) Write-allocate, subblock placement:
If the tag matches but the valid bit is off:
—Write the word to the cache (and to memory if write-through).
If the tag does not match:
—Allocate a block in the cache.
—Write the word to the cache (and to memory if write-through).
—Invalidate the remaining words in the block.

Write-allocate/subblock placement will have a lower write-miss penalty than write-allocate/no subblock placement since it avoids fetching a memory block from main memory. In addition, it will have a lower penalty than write-no-allocate if the written word is read before being evicted from the cache. See Jouppi [1993] for more information on write-miss policies.
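The cost tradeoff between the three policies can be sketched with illustrative penalties (hypothetical numbers chosen for the example: fetching a block from main memory costs 15 cycles, and writes sent straight to memory are assumed to be absorbed by a write buffer at no cost):

```python
# Per-write-miss cost under the three policies (a sketch with illustrative
# penalties, not the article's simulator).
FETCH = 15  # assumed cycles to fetch a block from main memory

def write_miss_cost(policy):
    """Cycles stalled on a single write miss."""
    if policy == "write-no-allocate":
        return 0       # word goes to memory; no block is allocated or fetched
    if policy == "write-allocate":
        return FETCH   # the corresponding block must be fetched first
    if policy == "write-allocate-subblock":
        return 0       # allocate, write the subblock, invalidate the rest
    raise ValueError(policy)

def later_read_cost(policy):
    """Cycles for a later read of the written word, before it is evicted."""
    return FETCH if policy == "write-no-allocate" else 0
```

This mirrors the comparison above: write-allocate/subblock never pays the fetch, and unlike write-no-allocate it also hits when the written word is read back before being evicted.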
A write buffer may be used to reduce the cost of writes to main memory. A write buffer is a queue containing writes that are to be sent to main memory. When the CPU does a write, the write is placed in the write buffer, and the CPU continues without waiting for the write to finish. The write buffer retires entries to main memory using free memory cycles. The write buffer cannot always prevent stalls on writes to main memory. First, if the CPU writes to a full write buffer, the CPU must wait for an entry to become available in the write buffer. Second, if the CPU reads a location queued in the write buffer, the CPU may need to wait until the write buffer is empty. Third, if the CPU issues a read to main memory while a write to main memory is in progress, the CPU must wait for the write to finish.
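A toy model of such a buffer (our sketch, under the simplifying assumption that exactly one queued write retires in any cycle in which the CPU is not writing) shows why bursty writes stall while spaced writes are free:

```python
from collections import deque

# Sketch of a write buffer of fixed depth: one queued write retires per free
# memory cycle. The CPU stalls only when it writes into a full buffer
# (the first stall condition in the text); the other two conditions are
# omitted for simplicity.

def stall_cycles(ops, depth):
    """Count stall cycles for a per-cycle op string: 'w' = write, '.' = other."""
    buf = deque()
    stalls = 0
    for op in ops:
        if op == 'w':
            if len(buf) == depth:  # full: wait one cycle for an entry to retire
                buf.popleft()
                stalls += 1
            buf.append('w')
        elif buf:                  # free memory cycle: retire one entry
            buf.popleft()
    return stalls
```

Four back-to-back writes overflow a two-entry buffer and stall twice, while the same four writes separated by two non-write cycles never stall; sequential allocation produces exactly the bursty pattern that stresses a shallow buffer.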
Main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to main memory when there is high spatial locality, i.e., when a write is to the same DRAM page as the previous write, with no intervening memory accesses to another DRAM page [Patterson and Hennessy 1990]. For example, on a DECStation 5000/200, a non-page-mode write takes five cycles, while a page-mode write takes one cycle. Page-mode writes are especially effective at handling writes with high spatial locality, such as those seen when doing sequential allocation.

2.2 Memory System Performance
This section describes two metrics for measuring the performance of memory systems. One popular metric is the cache miss ratio, which is the number of memory accesses which miss divided by the total number of memory accesses. Since different kinds of memory accesses usually have different miss costs, it is useful to have miss ratios for each kind of access.

Cache miss ratios alone do not measure the impact of the memory system on overall system performance. A metric which better measures this is the contribution of the memory system to cycles per useful instruction (CPI); all instructions besides nops (software-controlled pipeline stalls) are considered useful. CPI is calculated for a program as the number of CPU cycles to complete the program/total number of useful instructions executed. It measures how efficiently the CPU is being utilized. The contribution of the memory system to CPI is calculated as the number of CPU cycles spent waiting for the memory system/total number of useful instructions executed. As an example, on a DECStation 5000/200, the lowest CPI possible is 1, completing one instruction per cycle. If the CPI for a program is 1.5, and the memory contribution to CPI is 0.3, 20% (0.3/1.5) of the CPU cycles are spent waiting for the memory system.
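The definitions above amount to two ratios, computed here for the DECStation example in the text (a worked sketch; the instruction and cycle counts are made up to match the example):

```python
# CPI and the memory system's contribution to it, as defined in the text.

def cpi(total_cycles, useful_instructions):
    """Cycles per useful instruction."""
    return total_cycles / useful_instructions

def memory_contribution(memory_stall_cycles, useful_instructions):
    """CPU cycles spent waiting on the memory system, per useful instruction."""
    return memory_stall_cycles / useful_instructions

# Hypothetical counts matching the example: CPI of 1.5 with a memory
# contribution of 0.3 means 0.3 / 1.5 = 20% of all cycles are memory stalls.
USEFUL = 1_000_000
TOTAL = 1_500_000
STALLS = 300_000
```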
(The rest of the cycles may be due to other causes such as nops and multicycle instructions like integer division.) CPI is machine dependent since it is calculated using actual penalties.

    cmp   alloc+12, top    ; Check for heap overflow
    branch-if-gt call-gc
    store tag, (alloc)     ; Store tag
    store ra, 4(alloc)     ; Store value
    store rd, 8(alloc)     ; Store pointer to next cell
    move  alloc+4, result  ; Save pointer to cell
    add   alloc, 12        ; Increment allocation pointer

Fig. 1. Pseudoassembly code for allocating a list cell.

2.3 Copying Garbage Collection
A copying garbage collector [Cheney 1970; Fenichel and Yochelson 1969] reclaims an area of memory by copying all the live (nongarbage) data in the area to another area of memory. All data in the garbage-collected area becomes garbage, and the area of memory can be reused. Since memory is reclaimed in large contiguous areas, objects can be allocated sequentially from such areas in several instructions. Figure 1 gives an example of pseudoassembly code for allocating a list cell. ra contains the value to be stored in the list cell; rd contains the pointer to the next list cell; alloc is the address of the next free word in the allocation area; and top contains the end of the allocation area.

2.4 Garbage Collection in SML/NJ
The SML/NJ compiler uses a simple generational copying garbage collector [Appel 1989]. Memory is divided into an old generation and an allocation area. New objects are created in the allocation area; garbage collection copies the live objects in the allocation area to the old generation, freeing up the allocation area. Generational garbage collection relies on the fact that most allocated objects die young; thus most objects (about 99% [Appel 1992, p. 206]) are not copied from the allocation area. This makes the garbage collector efficient, since it works mostly on an area of memory where it is very effective at reclaiming space.

The most-important property of a copying collector with respect to memory system behavior is that allocation sequentially initializes memory which has not been touched in a long time and is thus unlikely to be in the cache. This is especially true if the allocation area is large relative to the size of the cache, since allocation will knock everything out of the cache. This means that caches which cannot hold the allocation area will incur a large number of write misses. For example, consider the code in Figure 1 for allocating a list cell. Assume that a cache write miss costs 16 CPU cycles and that the block size is four words. On average, every fourth word allocated causes a write miss. Thus, the average memory system cost of allocating a word on the heap is four cycles.
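The arithmetic in this example generalizes: sequential allocation into memory that is not cache resident misses once per block, so the memory cost per allocated word is the miss penalty divided by the block size in words. A sketch of the calculation (our illustration, not code from the article):

```python
# Memory system cost of sequential heap allocation into a cold cache:
# one write miss per cache block of `block_words` words.

def cost_per_word(miss_penalty, block_words):
    """Average memory cycles per allocated word."""
    return miss_penalty / block_words

def alloc_cost(instructions, words, miss_penalty, block_words):
    """Cycles to allocate an object: instruction cycles plus memory cycles."""
    return instructions + words * cost_per_word(miss_penalty, block_words)
```

With the text's numbers (16-cycle write miss, 4-word blocks) the cost is 4 cycles per word, so the 3-word list cell of Figure 1 costs 7 instruction cycles plus 12 memory cycles.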
The average cost for allocating a list cell is therefore seven cycles (at one cycle per instruction) plus 12 cycles of memory system overhead. Thus, while allocation is cheap in terms of instruction counts, it may be expensive in terms of machine cycle counts.

2.5 Standard ML
Standard ML (SML) [Milner et al. 1990] is a call-by-value, lexically scoped programming language with higher-order functions. SML encourages a nonimperative programming style. Variables cannot be altered once they are bound, and by default data structures cannot be altered once they are created. The only kinds of assignable data structures are ref cells and arrays, which must be explicitly declared. The implications of this nonimperative programming style for compilation are clear: SML programs tend to do more allocation and copying than programs written in imperative languages.

2.6 SML/NJ Compiler
The SML/NJ compiler [Appel 1992] is a publicly available compiler for SML. We used version 0.91. The compiler concentrates on making allocation cheap and function calls fast. Allocation is done inline, except for the allocation of arrays. Optimizations done by the compiler include inlining, passing function arguments in registers, uncurrying, constant folding, code hoisting, register targeting, and instruction scheduling.

The most-controversial design decision in the compiler was to allocate procedure activation records on the heap instead of the stack [Appel 1987; Appel and Jim 1989]. In principle, the presence of higher-order functions means that procedure activation records must be allocated on the heap. With a suitable analysis, a stack can be used to store most activation records [Kranz et al. 1986]. However, using only a heap simplifies the compiler, the run-time system [Appel 1990], and the implementation of first-class continuations [Hieb et al. 1990]. The decision to use only a heap was controversial because it increases the amount of heap allocation greatly, which is believed to cause poor memory system performance.

3. RELATED WORK
There have been many studies of the cache behavior of systems using heap
allocation and some form of copying garbage collection. Peng and Sohi [1989] examined the data cache behavior of small Lisp programs. They used trace-driven simulation and proposed an ALLOCATE instruction for improving cache behavior, which allocates a block in the cache without fetching it from memory. Wilson et al. [1990; 1992] argued that cache performance of programs with generational garbage collection will improve substantially when the youngest generation fits in the cache. Koopman et al. [1992] studied the effect of cache organization on combinator graph reduction, an implementation technique for lazy functional programming languages. They observed the importance of a write-allocate policy with subblock placement for improving heap allocation. Zorn [1991] studied the effect of cache behavior on the performance of a Common Lisp system when stop-and-copy and mark-and-sweep garbage collection algorithms were used. He concluded that when programs are run with mark-and-sweep they have substantially better cache locality than when run with stop-and-copy.

Our work differs from previous work in two ways. First, previous work used the overall miss ratio as the performance metric, which is a misleading indicator of performance. The overall miss ratio neglects the fact that read and write misses may have different costs. Also, the overall miss ratio does not reflect the rates of reads and writes, which may substantially affect performance. We use memory system contribution to CPI as our performance metric, which accurately reflects the effect of the memory system on program running time. Second, previous work did not model the entire memory system: it concentrated solely on caches. Memory system features such as write buffers and page-mode writes interact with the costs of hits and misses in the cache and should be simulated to give a correct picture of memory system behavior. We simulate the entire memory system.

Appel [1992] estimated CPI for the SML/NJ system on a single machine using elapsed time and instruction counts. His CPI differs substantially from ours. However, Appel has confirmed our measurements by personal communication and later work [Appel 1993]. The reason for the difference is that instructions and misses were undercounted in his measurements.

Jouppi [1993] studied the effect of cache write policies on the performance of C and Fortran programs. Our class of programs is different from his, but his conclusions support ours: that write-allocate with subblock placement is a desirable architectural feature. He found that the write-miss ratio for the programs that he studied was comparable to the read-miss ratio, and subblock placement with a write-allocate policy eliminated many of the write misses.
4. METHODOLOGY
We used trace-driven simulations to evaluate the memory system performance of programs. For simulations to be useful, there must be an accurate simulation model and a good selection of benchmarks. Simulations that make simplifying assumptions about important aspects of the system being modeled can yield misleading results. Toy or unrepresentative benchmarks can be equally misleading. In this work, much effort has been devoted to addressing these issues.

Section 4.1 describes our trace generation and simulation tools. Section 4.2 states our assumptions and argues that they are reasonable. Section 4.3 describes and characterizes the benchmark programs used in this study. Section 4.4 describes the metrics used to present memory system performance.

4.1 Tools
We extended QPT (Quick Program Profiler and Tracer) [Ball and Larus 1992; Larus 1990; Larus and Ball 1992] to produce memory traces for SML/NJ programs. QPT rewrites an executable program to produce a compressed trace; it also produces a program-specific regeneration program (which is written in C) that expands the compressed trace into a full trace. Because QPT operates on the executable program, it can trace both the SML code and the garbage collector.

We extended Tycho [Hill and Smith 1989] for the memory system simulations. Our extensions to Tycho include a write-buffer simulator.

We obtained allocation statistics by using an allocation profiler built into SML/NJ. The profiler instruments the SML intermediate code to increment appropriate elements of a count array on every allocation. We extended this profiler to count the number of assignments.

4.2 Simplifications and Assumptions
We tried to minimize the simplifications and assumptions which might reduce the validity of our simulations. This section describes the important assumptions we made.
(1) Simulating write-allocate/subblock placement with write-allocate/no subblock placement. Tycho does not simulate subblock placement, so we approximate it by simulating write-allocate/no subblock placement and ignoring the memory reads that occur on a write miss. This can cause a small inaccuracy in the CPI numbers. The following example illustrates this. Suppose that the cache block size is 2 words, that the subblock size is 1 word, that a program writes the first word in a memory block, and that the write misses. In a subblock cache, the word will be written to the cache, and the second word in the cache block will be invalidated. However, the simplified model will mark both words as valid after the write. Subsequently, if the program reads the second word, the read will incorrectly hit. Thus the CPI reported for caches with subblock placement can be less than the actual CPI. These incorrect hits, however, are rare, since SML programs tend to do few assignments (see Section 4.3) and since most writes are to sequential locations.

(2) Ignoring the effects of context switches and system calls.

(3) Driving the simulations by virtual addresses. Some machines, such as the SPARCStation II, have physically indexed caches, and will have different conflict misses than those reported here.

(4) Placing code in the text segment instead of the heap. This improves performance over the unmodified SML/NJ system. It reduces garbage collection costs, since code is never copied, and avoids instruction-cache flushes after garbage collections.

(5) Using default compilation settings for SML/NJ. Default compilation settings enable extensive optimization (Section 2.6). Evaluating the impact of these optimizations on cache behavior is beyond the scope of this article.

(6) Using default garbage collection settings. The preferred ratio of heap size to live data was set to 5 [Appel 1989]. The softmax, which is the desired upper limit on the heap size, was set to 20MB; the benchmark programs

Table I. Benchmark Programs

CW:           The Concurrency Workbench [Cleaveland et al. 1993] is a tool for analyzing networks of finite-state processes expressed in Milner's Calculus of Communicating Systems. The input is a sample session from Section 7.5 of Cleaveland et al. [1993].
Knuth-Bendix: An implementation of the Knuth-Bendix completion algorithm, implemented by Gerard Huet, processing some axioms of geometry.
Lexgen:       A lexical-analyzer generator, implemented by James S. Mattson and David R. Tarditi [Appel et al. 1989], processing the lexical description of Standard ML.
Life:         The game of Life, written by Chris Reade [Reade 1989], running 50 generations of a glider gun. It is implemented using lists.
PIA:          The Perspective Inversion Algorithm [Waugh et al. 1990], which decides the location of an object in a perspective video image.
Simple:       A spherical fluid-dynamics program [Crowley et al. 1978], translated into ID [Ekanadham and Arvind] as a "realistic" Fortran benchmark, and then translated into Standard ML by Lal George.
VLIW:         A Very-Long-Instruction-Word instruction scheduler written by John Danskin.
YACC:         A LALR(1) parser generator, implemented by David R. Tarditi [Tarditi and Appel 1990], processing the grammar of Standard ML.
never reached this limit. The initial heap size was 1MB. We did not investigate the interaction of the heap sizing strategy and the cache size. Understanding these tradeoffs is beyond the scope of this article.

(7) Using the MIPS as a prototypical RISC machine. All the traces are for the DECStation 5000/200, which uses a MIPS R3000 CPU.

(8) Assuming all instructions take one cycle with a perfect memory system. This affects only write-buffer costs, since multicycle instructions give the write buffer more time to retire writes. Section 5.4 shows that the inaccuracy introduced by this assumption is negligible, since write-buffer costs are small.

(9) Assuming CPU cycle time does not vary with memory organization. This may not be true, since the CPU cycle time depends on the cache access time, which may differ across cache organizations. For example, a 128K cache may take longer to access than an 8K cache.

4.3 Benchmarks
Table I describes the benchmark programs.¹ Knuth-Bendix, Lexgen, Life, Simple, VLIW, and YACC are identical to the benchmarks measured by Appel [1992]. The description of these benchmarks is copied from Appel [1992]. Table II gives the following for each benchmark: lines of SML code excluding comments and empty lines, maximum heap size, compiled-code size, and user-mode CPU time on a DECStation 5000/200. The code size includes 207KB for standard libraries, but does not include the garbage collector and other run-time support code, which is about 60KB. The run times are the minimum of five runs.

¹Available from the authors.
Table II. Sizes of Benchmark Programs

Program        Lines   Heap Size (KB)   Code Size (KB)   Non-gc Time (sec)   Gc Time (sec)
CW              5728        1107              894              22.74              3.09
Knuth-Bendix     491        2768              251              13.47              1.48
Lexgen          1224        2162              305              15.07              1.06
Life             111        1026              221              16.97              0.19
PIA             1454        1025              291               6.07              0.34
Simple           999       11571              314              25.58              4.23
VLIW            3207        1088              486              23.70              1.91
YACC            5751        1632              580               4.60              1.98
Table III. Characteristics of Benchmark Programs

Program        Inst. Fetches   Reads (%)   Writes (%)   Partial Writes (%)   Assignments (%)   Nops (%)
CW              523,245,987      17.61        11.61            0.41                0.01          13.24
Knuth-Bendix    312,086,438      19.66        22.31            0.77                0.00           9.04
Lexgen          328,422,283      16.08        10.44            0.38                0.20          11.14
Life            413,536,662      12.18         9.26            0.00                0.00           5.92
PIA             122,215,151      25.27        16.50            0.21                0.00          12.33
Simple          604,611,016      23.86        14.06            0.00                0.00          15.45
VLIW            399,812,033      17.89        15.99            0.05                0.10           8.39
YACC            133,043,324      18.49        14.66            0.32                0.21           7.58
Table III characterizes the benchmark programs according to the number and kinds of memory references they do. All numbers are reported as a percentage of instructions. The Reads, Writes, and Partial Writes columns list the full-word reads, full-word writes, and partial-word writes done by the program and the garbage collector; the Assignments column lists the noninitializing writes done by the program only. The Nops column lists the nops executed by the program and the garbage collector. All the benchmarks have long traces; most related works use traces that are an order of magnitude smaller. Also, the benchmark programs do few assignments; the majority of the writes are initializing writes.

Table IV gives the allocation statistics for each benchmark program. All sizes are reported in words. The Allocation column lists the total allocation done by the benchmark. The remaining columns break down the allocation by kind: closures for known functions, closures for escaping functions, closures for callee-save continuations,² records, and others (including spill records, arrays, strings, vectors, ref cells, store list records, and floating-point numbers).

²Closures for callee-save continuations can be trivially allocated on a stack in the absence of first-class continuations.
Table IV. Allocation Characteristics of Benchmark Programs

                Allocation      Known         Escaping      Callee-Saved    Records        Other
Program          (words)       %    Size      %    Size      %    Size      %    Size     %    Size
CW             56,467,440     4.0   4.12     3.3  15.39    67.2   6.20    19.5   3.01    6.0   4.00
Knuth-Bendix   67,733,930    37.6   6.60     0.1  15.22    49.5   4.90    12.7   3.00    0.1  15.05
Lexgen         33,046,349     3.4   6.20     5.4  12.96    72.7   6.40    15.1   3.00    3.7   6.97
Life           37,840,681     0.2   3.45     0.0  15.00    77.8   5.52    22.2   3.00    0.0  10.29
PIA            18,841,256     0.4   5.56    28.0  11.99    25.0   4.69    12.7   3.41   33.9   3.22
Simple         80,761,644     4.0   5.70     1.1  15.33    68.1   6.43     8.3   3.00   18.5   3.41
VLIW           59,497,132     9.9   5.22     6.0  26.62    61.8   7.67    20.3   3.01    2.1   2.60
YACC           17,015,250     2.3   4.83    15.3  15.35    54.8   7.44    23.7   3.04    4.0  10.22
For each allocation kind, the % column gives the total words allocated for objects of that kind as a percentage of total allocation, and the Size column gives the average size of an object of that kind in words, including the one-word tag.

Table V. Timings of Memory Operations

Task                                    Time (Cycles)
Non-page-mode write                           5
Page-mode write                               1
Partial-word write                           11
Read 16 bytes from memory                    15
Read 32 bytes from memory                    19
Refresh period                              195
Refresh time                                  5
Write hit or miss (subblocks)                 0
Write hit (16 bytes, no subblocks)            0
Write hit (32 bytes, no subblocks)            0
Write miss (16 bytes, no subblocks)          15
Write miss (32 bytes, no subblocks)          19
TLB miss                                     28

4.4 Metrics
We state cache performance numbers in cycles per useful instruction (CPI). All instructions besides nops are considered useful. Table V lists the timings used in the simulations. These numbers are derived from the penalties for the DECStation 5000/200, but are similar to those in other machines of the same class. In addition to the times in Table V, all reads and writes may also incur write-buffer penalties. In an actual implementation, there may be a one-cycle penalty for write misses in caches with subblock placement. This is because a tag needs to be written to the cache after the miss is detected. This does not change our results, adds at most 0.02–0.05 to the CPI of caches with subblock placement. ACM
TransactIons
on
Computer
Systems,
Vol.
13,
No.
3, August
1995.
since
it
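As a quick illustration of this metric, here is a minimal sketch; the counts below are invented for illustration and are not data from the paper:

```python
# Sketch: cycles per useful instruction (CPI), where "useful" excludes nops.

def cpi_per_useful(total_cycles, total_instructions, nop_instructions):
    useful = total_instructions - nop_instructions
    return total_cycles / useful

# 1,350,000 cycles over 1,000,000 instructions, 100,000 of them nops:
print(round(cpi_per_useful(1_350_000, 1_000_000, 100_000), 2))  # 1.5
```

Note that padding a program with nops raises this CPI even if raw cycles per instruction are unchanged, which is why nops are excluded from the denominator.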
Performance of Programs with Intensive Heap Allocation

We used a DRAM page size of 4K in the simulations; a series of writes to the same DRAM page can use fast page-mode writes instead of full writes. A virtual memory page size of 4K was used as well. The TLB miss penalty includes the cycles needed to flush the pipeline, and the slowdown due to TLB misses is reported as the TLB miss contribution to the CPI, to allow us to present the data for all the simulated memory system configurations in one chart.

5. RESULTS AND ANALYSIS

In Section 5.1 we present a qualitative analysis of the memory system behavior of programs compiled with SML/NJ. In Section 5.2 we list the cache and TLB configurations simulated and explain why they were selected. In Sections 5.3, 5.4, and 5.5 we present data for the cache performance, write-buffer performance, and TLB performance of the benchmarks. In Section 5.6 we validate the simulations with measurements of an actual machine. In Section 5.7 we present an analytical model that extends these results to programs with different allocation behavior.

5.1 Qualitative Analysis

Recall from Section 2 that SML/NJ uses a copying garbage collector. The most-important property of a copying collector with respect to memory system behavior is that allocation is done in an area of memory that has not been touched since the last garbage collection. This means that for caches that are not large enough to contain the allocation area, there will be many write misses. The slowdown that these misses translate to depends on the memory system organization.
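The effect can be sketched with a toy cache model. The parameters below are hypothetical (a 64K direct-mapped cache with 16-byte blocks and 4-byte words), and the allocation area is assumed larger than the cache, so no block is resident when it is first written:

```python
# Toy model: a direct-mapped cache swept by sequential initializing writes.

CACHE_BYTES = 64 * 1024   # 64K direct-mapped data cache (hypothetical)
BLOCK_BYTES = 16          # 16-byte blocks
WORD_BYTES = 4

def write_misses_from_allocation(alloc_words):
    n_lines = CACHE_BYTES // BLOCK_BYTES
    tags = {}             # cache line index -> resident block number
    misses = 0
    for w in range(alloc_words):
        block = (w * WORD_BYTES) // BLOCK_BYTES
        line = block % n_lines
        if tags.get(line) != block:
            misses += 1   # the first (initializing) write to a block misses
            tags[line] = block
    return misses

# 1,000,000 allocated words touch 250,000 distinct 16-byte blocks:
# one write miss for every block-size worth (4 words) of allocation.
print(write_misses_from_allocation(1_000_000))  # 250000
```

With a block size of b words, this is one write miss per b words allocated, which is the basis of the analytic model in Section 5.7.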
Recall from Section 4.3 that SML/NJ programs have the following important properties. First, they do few assignments: the majority of the writes are initializing writes. Second, programs do heap allocation at a furious rate: 0.1 to 0.22 words per instruction. Third, writes come in bunches, because they correspond to the initialization of a newly allocated area. The burstiness of writes, combined with the property of copying collectors mentioned above, suggests that an aggressive write policy is necessary. In particular, memory system organizations where the CPU has to wait for a write to be written through (or back) to memory will perform poorly. Even memory systems where the CPU does not wait for writes if they are issued far apart (e.g., two cycles apart in the HP 9000 series 700) may perform poorly because of the bunching of writes. This means that the memory system needs two features. First, a write buffer or fast page-mode writes are essential to avoid waiting for writes to memory. Second, on a write miss, the memory system must avoid reading a cache block from memory if it is going to be written before being read. Of course, this requirement holds only for caches with a write-allocate policy. Subblock placement, a block size of one word [Koopman et al. 1992], and the ALLOCATE instruction [Peng and Sohi 1989] can achieve this. Since the effects on cache performance of these features are similar, we discuss only subblock placement. For large caches, when the allocation area fits in the cache and
Amer Diwan et al.

Table VI. Memory System Organizations Studied

Cache Size   Assoc   Block Size     Write Buffer   Page Mode   Write Policy   Write-Miss Policy   Subblocks
8K-512K      1, 2    16, 32 bytes   1-6 deep       yes         through        allocate            yes
8K-512K      1, 2    16, 32 bytes   6 deep         yes         through        allocate            no
8K-512K      1, 2    16, 32 bytes   6 deep         yes         through        no allocate         no

Table VII. Memory System Organization of Some Popular Machines

Machine              Cache Size   Assoc   Block Size   Write Buffer   Subblocks   Write Policy   Write-Miss Policy
DS3100               64K          1       4 bytes      4 deep         --          through        allocate
DS5000/200           64K          1       16 bytes     6 deep         yes         through        allocate
HP 9000 series 700   64K-2M       1       32 bytes     none           no          back           allocate
SPARCStation II      64K          1       32 bytes     4 deep         no          through        no allocate

SPARCStations have a unified instruction and data cache. The HP 9000 series 700 machines have 128K instruction and 256K data caches for models 720 and 730, and 256K instruction and 2M data caches for model 750. The DS5000/200 actually has a full tag on every four-byte "subblock"; this is stronger than subblock placement.
thus there are few write misses, the benefit of subblock placement will be reduced.

5.2 Cache and TLB Configurations Simulated

The design space for memory systems is enormous, and the dependencies among the many variables involved are complex. Therefore we could study only a subset of the memory system design space. In this study, we restrict ourselves to features found in currently popular RISC workstations
[Cypress 1990; DEC 1990a; 1990b; Slater 1991]. Table VI summarizes the cache organizations simulated. Table VII lists the memory system organizations of some popular machines. We simulated only separate instruction and data caches (i.e., no unified caches). While many current machines have separate caches (e.g., DECStations, HP 700 series), there are some exceptions (notably SPARCStations). We report data only for write-through caches, but the CPI for write-back caches
can be inferred from the data for write-through caches. While write-through and write-back caches have identical misses, their contribution to the CPI may differ for two reasons. First, a write hit or miss in a write-back cache may take one cycle more than in a write-through cache. A write-back cache must probe the tag before writing to the cache [Jouppi 1993], unlike a write-through cache. It is easy to adjust the data for write-through caches for this to obtain the data for write-back caches. If the program has w writes and n useful instructions, then the CPI for a write-back cache can be obtained by adding w/n to the CPI of the write-through cache with the same size and configuration.
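The adjustment just described amounts to a one-line computation; the numbers in this sketch are illustrative, not taken from the paper's data:

```python
# Sketch of the write-back adjustment: CPI_wb = CPI_wt + w/n, charging one
# extra cycle per write for the tag probe.

def write_back_cpi(write_through_cpi, writes, useful_instructions):
    return write_through_cpi + writes / useful_instructions

# 0.18 writes per useful instruction on top of a write-through CPI of 1.50:
print(round(write_back_cpi(1.50, 18, 100), 2))  # 1.68
```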
For VLIW, w/n is 0.18. Second, write-through and write-back caches may have different write-buffer penalties, since the frequencies of writes to main memory are different for the two write policies. Writes to main memory are less frequent for write-back caches, so we expect the write-buffer penalty for write-back caches to be smaller than for write-through caches. Since the write-buffer penalty is small even for write-through caches, the difference is likely to be negligible.

We simulated unified TLBs with 1 to 64 entries; some machines (such as the HP 9000 series) have unified TLBs, and we did not collect data for separate instruction and data TLBs. The TLBs are fully associative and use LRU replacement. From Section 5.5 it is clear that even small TLBs perform well.

Two of the memory system parameters are the write-miss policy (write-allocate versus write-no-allocate) and subblock placement (with versus without). Of the four combinations of these, we did not collect data for the write-no-allocate/subblock combination, since with a write-no-allocate policy subblock placement offers no improvement in cache performance. Thus, the data that follows omits the write-no-allocate/subblock configuration.
5.3 Memory System Performance

We present memory system performance data for a range of cache sizes (8K to 512K) in two kinds of graphs: summary graphs and breakdown graphs. Each curve in a summary graph summarizes the memory system performance of one benchmark program for one memory system configuration; each configuration corresponds to a different combination of write-miss policy (write-allocate or write-no-allocate), subblock placement (with or without), and associativity (1 or 2). There are two summary graphs for each program: one for a block size of 16 bytes and another for a block size of 32 bytes. Each breakdown graph breaks the memory system overhead of one benchmark program down into its components for one memory system organization. The write-buffer depth is fixed at six entries.

In this section we present the summary graph and breakdown graphs for VLIW (Figure 2); the summary graphs for the remaining programs are given in the Appendix. Figures 3, 4, and 5 are the breakdown graphs for VLIW for a block size of 16 bytes; the breakdown graphs for the other configurations are similar and are omitted for conciseness. The breakdown graphs for the other benchmarks are similar (and predictable from the summary graphs) and are thus omitted for the same reason. The full set of graphs is available from the authors.

In the summary graphs, the nops curve is the base CPI: the total number of instructions executed divided by the number of useful (not nop) instructions executed; this is the CPI for a perfect memory system. In the breakdown graphs, nop is the CPI contribution of nops; read miss is the CPI contribution of read misses; write miss is the CPI contribution of write misses (if any); inst fetch miss is the CPI contribution of instruction fetch misses; write buffer is the CPI contribution of the write buffer; and partial word is the CPI contribution of partial-word writes.
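The breakdown components are additive contributions to the CPI. A minimal sketch, with invented component values rather than values read from the figures:

```python
# Sketch: total CPI = perfect-memory CPI + the breakdown contributions.

base_cpi = 1.00  # perfect memory system, excluding nops
components = {
    "nop": 0.10,
    "read miss": 0.20,
    "write miss": 0.45,
    "inst fetch miss": 0.15,
    "write buffer": 0.02,
    "partial word": 0.01,
}
total_cpi = base_cpi + sum(components.values())
print(round(total_cpi, 2))  # 1.93
```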
[Figure 2. VLIW summary for the six memory system configurations (write-allocate/write-no-allocate, with/without subblocks, assoc = 1 or 2) plus the nops base CPI, against split I and D cache sizes: (a) block size = 16 bytes; (b) block size = 32 bytes.]

[Figure 3. VLIW breakdown, write-no-allocate, no subblock, block size = 16: (a) assoc = 1; (b) assoc = 2.]

[Figure 4. VLIW breakdown, write-allocate, subblock, block size = 16: (a) assoc = 1; (b) assoc = 2.]

[Figure 5. VLIW breakdown, write-allocate, no subblock, block size = 16: (a) assoc = 1; (b) assoc = 2.]
The 64K point on the write-allot, subblock, assoc = 1 curves corresponds closely to the DECStation 5000/200 memory system. In the following subsections we describe the impact of write-miss policy and subblock placement, associativity, block size, cache size, and write-buffer and partial-word writes on the memory system performance of the programs.

5.3.1 Write-Miss Policy and Subblock Placement. From the summary graphs, it is clear that write-allocate with subblock placement is the best memory system organization; it substantially outperforms all other organizations. For 64K direct-mapped caches with a block size of 16 bytes, subblock placement reduces the CPI by 0.35 to 0.88, with an arithmetic mean of 0.55.

Surprisingly, the memory system performance of SML/NJ programs on the DECStation 5000/200 memory system organization (a 64K direct-mapped cache with subblock placement) is good despite the very high write-miss ratios: the memory system overhead is acceptable, ranging from 3 to 13% (arithmetic mean 9%) for 64K caches and from 1 to 13% (arithmetic mean 9%) for 32K caches. This is similar to the memory system performance of C and Fortran programs on the same machine [Chen and Bershad 1993]. For a 64K data cache, the write-miss and read-miss ratios for VLIW are 0.23 and 0.02 respectively.

Recall that in Section 5.1 we argued that the benefit of subblock placement would be substantial, but that the benefit would decrease for caches large enough to hold the allocation area. The summary graphs indicate that the reduction in benefit is not substantial even for 128K cache sizes; however, the benefit of subblock placement decreases sharply for larger caches for six of the benchmark programs. This suggests that the allocation area size of six of the benchmark programs is between 256K and 512K.

The performance of write-allocate/no subblock is almost identical to that of write-no-allocate/no subblock: the difference between them is so small that in most summary graphs the two curves overlap (Knuth-Bendix is an exception). This suggests that an address is being read soon after being written; even in an 8K cache, an address is read after being written before it is evicted from the cache. (If it were evicted from the cache before being read, then write-allocate/no subblock would have inferior performance.) The only difference between these two schemes is when a cache block is read from memory: in one case, it is brought in on a write miss; in the other, it is brought in on a read miss. Because SML/NJ programs allocate sequentially and do few assignments, a newly allocated object remains in the cache until the program allocates another C bytes, where C is the size of the cache. Since the programs allocate 0.4-0.9 bytes per instruction, our results suggest that the read of a block occurs within 9K-20K instructions of its being written.
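The 9K-20K figure follows directly from the argument above, as this sketch shows:

```python
# Sketch: a block written at allocation stays resident until roughly another
# cache-size worth of allocation, so the read must come within
# cache_bytes / alloc_rate instructions of the write (8K cache assumed here).

def survival_window(cache_bytes, alloc_bytes_per_instr):
    return cache_bytes / alloc_bytes_per_instr

print(round(survival_window(8 * 1024, 0.9)))  # 9102  (~9K instructions)
print(round(survival_window(8 * 1024, 0.4)))  # 20480 (~20K instructions)
```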
The benefit of subblock placement is not limited to strongly functional languages such as Standard ML. Jouppi [1993] reports that subblock placement improves the write performance of an 8K data cache with a 16-byte line for C programs, in which writes comprise 31% of the memory references. Reinhold [1993] finds that the memory performance of Scheme programs is good with an 8K data cache and subblock placement.

5.3.2 Associativity. Changing from one-way to two-way associativity improves the memory system performance of all the organizations. For the organizations with subblock placement, increasing associativity lowers the CPI by 0.01 to 0.09, with an arithmetic mean of 0.06. The improvement is much greater for the organizations without subblock placement: 0.14 to 0.35, with an arithmetic mean of 0.22. Thus, increasing associativity offers little benefit to a memory system with subblock placement; note that subblock placement improves performance more than even two-way associativity and 32-byte blocks combined. Moreover, increasing associativity may increase the CPU cycle time, so the improvements may not be realized in practice [Hill 1988].

Increasing associativity also improves instruction cache performance. From the breakdown graphs (Figures 3, 4, and 5(a)) we see that for direct-mapped caches smaller than 64K the instruction fetch miss penalty is substantial; the penalty is due to conflict misses, which two-way set associativity largely eliminates for caches of 16K or larger. The high instruction cache penalty of small caches is not surprising: the code consists of small functions with frequent calls, and thus has a large instruction working set.

5.3.3 Block Size. From Figure 2 we see that increasing the block size from 16 to 32 bytes improves the performance of the organizations with subblock placement a lot more than the performance of the other organizations; for write-allocate/subblock, doubling the block size can halve the write-miss penalty, by as much as 0.22 in CPI. This is because allocation writes memory sequentially, so the writes have strong spatial locality. The read-miss penalty improves much less, which suggests that reads have less spatial locality than writes [Koopman et al. 1992]. For the organizations without subblock placement, increasing the block size has little or no impact: with 32-byte blocks, the performance of write-allocate/no subblock remains comparable to that of write-no-allocate/no subblock.

5.3.4 Cache Size. Three distinct regions of performance can be identified for cache sizes. The first region corresponds to the range of cache
sizes when the allocation area does not fit in the cache (i.e., allocation happens in an area of memory which is not cache resident). For most of the benchmarks,
this region corresponds to cache sizes of less than 256K (for Simple and Knuth-Bendix this region extends beyond 512K). In this region, increasing the cache size uniformly improves performance for all configurations. However, the performance improvement from doubling the cache size is small. From the breakdown graphs we see that in the first region the cache size has little effect on the data cache miss contribution to CPI. Most of the improvement in CPI comes from the improved performance of the instruction cache. As with associativity, larger cache sizes have interactions with the cycle time of the CPU: larger caches can take longer to access. Thus, small improvements due to increasing the cache size may not be achieved in practice.

The second region ranges from when the allocation area begins to fit in the cache until the allocation area fits in the cache. For most of the benchmarks (once again excepting Simple and Knuth-Bendix), this region corresponds to cache sizes in the range 256K to 512K.(3) In this region, increasing the cache size sharply improves the data cache performance of configurations without subblock placement. Increasing the cache size in this region has little benefit to offer for instruction cache performance, because the instruction cache miss penalty is already low at this point.

The third region corresponds to cache sizes when the allocation area fits in the cache. For five of the benchmarks, this region starts at 512K (for Lexgen, Knuth-Bendix, and Simple it starts at larger sizes). In this range, increasing the cache size has little or no impact on memory system performance, because everything remains cache resident and thus there are no capacity misses to eliminate.

5.3.5 Write-Buffer and Partial-Word Write Overheads. From the breakdown graphs we see that the write-buffer and partial-word write contributions to the CPI are negligible. A six-deep write buffer coupled with page-mode writes is sufficient to absorb the bursty writes. As expected, memory system features which reduce the number of misses (such as higher associativity and larger cache sizes) also reduce the write-buffer overhead.

5.4 Write-Buffer Depth

In Section 5.3.5 we showed that a six-deep write buffer coupled with page-mode writes was able to absorb the bursty writes in SML/NJ programs. In this section we explore the impact of write-buffer depth on the write-buffer contribution to CPI. Since the speed at which the write buffer can retire writes depends on whether or not the memory system has page-mode writes, we conducted two sets of experiments: one with and the other without page-mode writes. We varied the write-buffer depth from 1 to 6. We conducted this study for two of the larger benchmarks: CW and VLIW. We fixed

(3) For Lexgen this region extends a little beyond 512K.
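The role of write-buffer depth can be illustrated with a toy discrete simulation. All timings and the burst shape below are hypothetical, chosen only to mimic bursty initializing writes: with fast page-mode retirement even a one-deep buffer keeps up, while without page mode only a deeper buffer can absorb a burst.

```python
# Toy simulation: the CPU issues one word-write per cycle during a burst,
# then is write-free for `gap` cycles. The memory retires the write at the
# head of the buffer in `cost` cycles (cost ~ 1 with page-mode writes,
# several cycles without).

def stall_cycles(bursts, burst_len, gap, depth, cost):
    schedule = ([True] * burst_len + [False] * gap) * bursts
    queue = service = stalls = 0
    i = 0
    while i < len(schedule) or queue or service:
        if service == 0 and queue:       # memory starts retiring a write
            queue -= 1
            service = cost
        if service:
            service -= 1
        if i < len(schedule):            # CPU side
            if schedule[i]:
                if queue < depth:
                    queue += 1           # write accepted by the buffer
                    i += 1
                else:
                    stalls += 1          # buffer full: CPU stalls
            else:
                i += 1
    return stalls

# Page-mode-like retirement: a one-deep buffer never fills.
print(stall_cycles(10, 6, 30, depth=1, cost=1))  # 0
# Slow retirement: a six-deep buffer absorbs each burst and drains in the
# gap, while a one-deep buffer stalls repeatedly.
print(stall_cycles(10, 6, 30, depth=6, cost=5))  # 0
print(stall_cycles(10, 6, 30, depth=1, cost=5))  # 160
```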
[Figure 6. Write-buffer contribution to CPI for write-buffer depths of 1, 2, 4, and 6, with page-mode writes: (a) assoc = 1; (b) assoc = 2.]

[Figure 7. Write-buffer contribution to CPI for VLIW, without page-mode writes: (a) assoc = 1; (b) assoc = 2.]

Due to sequential allocation, a write is likely to be to the same DRAM page as the previous write, and it will therefore complete in one cycle due to page-mode writes. All subsequent writes to initialize a newly allocated object are also likely to be to the same DRAM page, so they also complete one after another in one cycle each, and the next write is likely to find an empty write buffer. In the best case, with no read misses and refreshes, a write-buffer-full delay will happen only once per N words of allocation, where N is the size of the DRAM page. Thus, the write-buffer depth has little effect on the performance of SML/NJ programs if the memory system has page-mode writes. To confirm this explanation, we measured the probability of two consecutive writes being to the same DRAM page; this probability, averaged over the benchmarks, was 96%.

The small impact of write-buffer depth on performance does not imply that a write buffer is useless if the memory system has page-mode writes. Instead, it says that a deep write buffer offers little performance improvement in a memory system with page-mode writes if the programs have strong spatial locality in the writes and if the majority of the reads and instruction fetches hit in the cache. Strong spatial locality means that the probability that two consecutive writes are to the same DRAM page is high.

Write-buffer depth is important, however, if the memory system does not have page-mode writes (Figure 7). In this case, a six-deep write buffer
performs much better than a one-deep write buffer. Note that Figures 6 and 7 have different scales.

[Figure 8. TLB miss contribution to CPI, as a function of the number of TLB entries (2 to 64).]

5.5 TLB Performance

Figure 8 gives the TLB miss contribution to the CPI for each benchmark program. We see that the CPI contribution of TLB misses falls below 0.01 for all our programs for a 64-entry unified TLB; for half the benchmarks, it is below 0.01 even for a 32-entry TLB.

5.6 Validation
To validate our simulations, we ran each of the benchmarks five times on a DECStation 5000/200 (running Mach 2.6) and measured the elapsed user time for each run. The programs were run on a lightly loaded machine but not in single-user mode. The simulations with write-allocate/subblock placement, 64K direct-mapped caches, 16-byte blocks, and a 64-entry TLB correspond closely to the DECStation 5000/200, with the following three important differences. First, the simulations ignored the effects of context switches and system calls. Second, the simulations assumed a virtual address = physical address mapping, which can have many fewer conflict misses than the random mapping used in Mach 2.6 [Kessler and Hill 1992]. Third, the simulations assume that all instructions take exactly one cycle (plus memory system overhead). In order to minimize the memory system effects of the virtual-to-physical mapping and context switches, we took the minimum CPI of the five runs for each program and compared it to the CPI obtained via simulations. We present our findings in Table VIII; Measured (sec) is the user time of the
Table VIII. Measured versus Simulated CPI

Program        Measured (sec)   Measured CPI   Simulated CPI   Multicycle CPI   Discrepancy (%)
CW             25.83            1.42           1.39            0.00              2.48
Knuth-Bendix   14.95            1.27           1.21            0.00              5.22
Lexgen         16.13            1.40           1.31            0.00              6.29
Life           17.16            1.23           1.21            0.00              1.19
PIA             6.41            1.43           1.18            *                17.62
Simple         29.81            1.33           1.21            *                 9.03
VLIW           25.61            1.76           1.39            0.20              9.66
YACC            6.58            1.39           1.36            0.00              2.20

* Cannot be determined accurately.

Measured (sec) is the user time of the program in seconds; Measured CPI is the CPI computed from the measured time; Simulated CPI is the CPI obtained from the simulations, without simulating multicycle instructions; Multicycle CPI is the CPI overhead of multicycle instructions (FP unit and/or CPU pipelines), when it could be accurately determined; Discrepancy is the discrepancy between the measured CPI and the simulated CPI plus the multicycle CPI, as a percentage of measured CPI.

Table VIII shows that, with the exception of PIA, the discrepancy is less than 10% and that the actual runs validate the simulations. The discrepancy
PIA is due
instructions their
to multicycle
executed.
result
instructions
Since
is used, their
which
multicycle
cost can usually
We were
able to determine
the overhead
for VLIW
since
of most
ately
afterward.
the
results
comprise
instructions
be determined of multicycle
multicycle
In the case of PIA,
4.8%
do not cause
total
a stall
until
only by simulations. instructions
instructions
the distance
of the
accurately
are used
between
multicycle
immediinstruc-
tions and instruction
their use varies considerably. However, even if each multicycle stalls the CPU for half its maximum latency, the discrepancy falls
well below PIA.
10%. Thus,
5.7
Extending
multicycle
can explain
the discrepancy
for
the Results
Section
5.3 demonstrated
system
cost if new objects
section,
we present
due
heap
to
instructions
that
heap
cannot
an analytic
allocation
allocation
model
when
can have
be allocated
this
directly
which is
the
predicts case.
a significant into
the memory
The
memory
the cache.
model
In this
system
formalizes
cost the
intuition presented in Section 5.1 and predicts the memorys ystem cost due to or heap allocation rates heap allocation when block sizes, miss penalties, change. We use the model to speculate about the memory system cost of heap allocation for caches without subblock placement if SML/NJ were to use a simple stack. 5.7.1
An
collection time and allocation initialized, ACM
Analytic
Model.
Recall
that
heap
allocation
with
copying
garbage
allocates memory which typically has not been touched in a long is unlikely to be in the cache. This is especially true when the area does not fit in the cache. When newly allocated memory is write misses occur. The rate of write misses depends on the
Transactions
on Computer
Systems,
Vol. 13, No. 3, August
1995.
Table IX. Percentage Difference between the Analytical Model and the Simulations

Cache Size (Kilobytes)   Write-No-Allot/No Subblock (%)   Write-Allot/No Subblock (%)
8K                          7.12                              2.4
16K                         6.84                              2.2
32K                         7.02                              2.2
64K                        10.8                               5.7
128K                       31.8                              23.5
256K                      128.8                             111.4
512K                     1847.7                            1746.2

allocation rate (a words/instruction) and the block size (b words). Given the rate of write misses, we can calculate the memory system cost, C, due to heap allocation. Let the read-miss and write-miss penalties be rp and wp cycles, respectively. Under the assumption that the allocation area does not fit in the cache, i.e., that initializing writes miss, we have

    C(write-allocate) = wp * a/b.

Under the additional assumption that programs touch data soon after it is allocated, we have

    C(write-no-allocate) = rp * a/b.

Since the benchmarks do few assignments, C should account for the difference in CPIs when the write-miss policy is varied. Hence, for each no-subblock organization, we compare the predicted cost against the actual difference

    CPI(no subblock) - CPI(write-allocate/subblock).

Table IX shows the average (arithmetic mean) difference between the predicted cost, C, and this actual difference in CPIs, as a percentage of the actual difference. The values used to calculate the predicted cost were: b = 4, rp = 15, and wp = 15.

The model is accurate for cache sizes of 128K or less, when the allocation area does not fit in the cache. As expected, the model is inaccurate when the allocation area fits in the cache: the percentage difference heads toward infinity as the benefit of subblock placement becomes negligible. Thus, this model can be used to predict the memory system cost of heap allocation for small cache sizes.

5.7.2 SML/NJ with a Stack. We can use this model to speculate about the memory system cost of heap allocation in SML/NJ when a stack is used. In the absence of first-class continuations, which the benchmarks do not use, callee-save continuations can be stack-allocated easily. The callee-save continuations correspond to procedure activation records. The first two columns of Table X give the rate of heap allocation with and without heap allocation of callee-save continuations.
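The analytic model is trivially expressed as code. The worked value below uses the parameters stated above (b = 4 words, wp = 15 cycles) together with the highest benchmark allocation rate:

```python
# The analytic model: C = penalty * a / b cycles per useful instruction.

def heap_alloc_cost(miss_penalty, alloc_rate_words, block_words):
    return miss_penalty * alloc_rate_words / block_words

# b = 4 words (16-byte blocks), wp = 15 cycles, a = 0.23 words/instruction
# (the highest benchmark rate) gives a cost of about 0.86 CPI:
print(round(heap_alloc_cost(15, 0.23, 4), 2))  # 0.86
```

Since rp = wp = 15 here, the same function gives the write-no-allocate cost as well.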
Table X. Expected Memory System Cost of Heap Allocation for Caches without Subblock Placement (assuming procedure activation records are stack-allocated in SML/NJ)

               Allocation Rate Including      Allocation Rate Excluding      C
               Callee-Save Conts.             Callee-Save Conts.
Program        (Words/Useful Instruction)     (Words/Useful Instruction)     (Cycles/Instruction)
CW             0.12                           0.04                           0.15
Knuth-Bendix   0.23                           0.12                           0.44
Lexgen         0.11                           0.03                           0.12
Life           0.11                           0.02                           0.09
PIA            0.17                           0.13                           0.47
Simple         0.14                           0.05                           0.17
VLIW           0.16                           0.06                           0.23
YACC           0.14                           0.07                           0.24
Median         0.14                           0.05                           0.20
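Column 3 of Table X can be reproduced from the model, C = wp * a/b with wp = 15 and b = 4; small differences from the printed values reflect rounding of the displayed allocation rates:

```python
# Reproducing column 3 of Table X: C = wp * a / b, with a the allocation
# rate excluding callee-save continuations (rates copied from Table X).

WP, B = 15, 4
rate_excluding_conts = {"CW": 0.04, "Knuth-Bendix": 0.12,
                        "VLIW": 0.06, "Median": 0.05}
for program, a in rate_excluding_conts.items():
    # e.g. CW: 15 * 0.04 / 4 = 0.15 cycles/instruction, matching the table
    print(program, WP * a / B)
```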
Assuming that only continuations are stack-allocated, column 3 of Table X presents an estimate of the memory system cost of heap allocation for caches that do not have subblock placement and are too small to hold the allocation area. The block size is 16 bytes, the read-miss penalty 15 cycles, and the write-miss penalty 15 cycles. Since the read-miss and write-miss penalties are the same, C is the same for the write-allocate and write-no-allocate organizations. This is an upper bound on the expected memory system cost of heap allocation with a stack, because it may be possible to stack-allocate additional objects [Kranz et al. 1986]. We see that even with a simple stack, the memory system costs due to heap allocation will probably be significant for caches without subblock placement for SML/NJ programs.

6. FUTURE WORK

We suggest three directions in which this study can be extended:

—measuring the impact of other architectural features not explored in this work,
—measuring the impact of different compilation techniques, and
—measuring other aspects of programs.

Regarding architectural features, there is a need to explore the memory system performance of heap allocation on newer machines. As CPUs get faster relative to main memory, memory system performance becomes even more crucial to good performance. To address the increasing discrepancy between CPU speeds and main-memory speeds, newer machines, such as Alpha workstations [DEC 1992], often have features such as secondary caches, stream buffers, and register scoreboarding. The interaction of these features with heap allocation needs to be explored.

Regarding different compilation techniques, the impact of stack allocation
measuring. A stack reduces memory system organizations)
Transactions
on Computer
Systems,
Vol
heap allocation (which performs badly in favor of stack allocation (which can
13, No
3, August
1995.
Performance have good cache locality of memory,
namely
of Programs with Intensive Heap Allocation
since it focuses
most
the top of the stack).
of heap-allocated
objects
of the references
For SML\NJ
can be allocated
on the
stack
ations,
stack
and threads,
doing stack Regarding promising
allocation
can slow down
A careful
allocation. measuring for future
study
other
aspects
(Table
IV).
Therefore
first-class
to evaluate
of
part
the majority
of SML\NJ programs or with small cache
exceptions,
is needed
269
to a small
programs,
stack allocation can substantially improve performance on memory organizations without subblock placement sizes. However,
.
programs,
continu-
the pros and cons of several
areas
seem
work:
(1) Measuring the impact of different garbage collection algorithms on cache performance.
(2) Measuring the impact of changing various garbage collector parameters (such as allocation area size) on cache performance.
(3) Measuring the cost of various operations related to garbage collection: tagging, store checks, and garbage collection checks. A preliminary study of this is reported in Tarditi and Diwan [1993].
(4) Measuring the impact of optimizations on cache performance.
7. CONCLUSIONS

We have studied the memory system performance of heap allocation with copying garbage collection, a general automatic storage management technique for programming languages. Heap allocation is useful for implementing programming language features where objects may have indefinite extent, such as list processing, higher-order functions, and first-class continuations. However, heap allocation is widely believed to lead to poor memory system performance [Peng and Sohi 1989; Wilson et al. 1990; 1992; Zorn 1991]. This belief is based on the high (write) miss ratios that occur when new objects are allocated and initialized.

We studied the memory system performance of mostly functional SML programs compiled with the SML/NJ compiler. These programs use heap-only allocation: all allocation, including that of activation records, is done on the heap. We simulated the memory systems typical of current workstations, including a wide variety of memory organizations. To our surprise, we found that heap allocation performed well on some memory systems. In particular, on an actual machine (the DECStation 5000/200), the memory system performance of heap allocation was good. However, heap allocation performed poorly on most memory system organizations. The memory system property crucial for achieving good performance was the ability to allocate and initialize a new object into the cache without a penalty. This can be achieved by having subblock placement or a cache large enough to hold the allocation area, along with fast page-mode writes or a sufficiently deep write buffer. We found that for caches with subblock placement, the arithmetic mean of the data cache penalty was under 9% for 64K or larger data caches; for caches without subblock placement, the mean of the data cache penalty was often higher than 50%. We also found that a cache size of 512K allowed the allocation area for six of the benchmark programs to fit in the cache, which substantially improved the performance of cache organizations without subblock placement.

Table XI. CPI with a Perfect Memory System

Program        CPI
CW             1.15
Knuth-Bendix   1.06
Lexgen         1.14
Life           1.18
PIA            1.09
Simple         1.08
VLIW           1.10
YACC           1.13
The implications of these results are clear. First, given the right memory system, programs which make heavy use of heap allocation can achieve good memory system performance; a stack is not needed to achieve it. Heap allocation can therefore be used in place of stack allocation for procedure activation records, even though it is a more-general storage management technique. Second, computer architects can better support languages with heavy use of dynamic storage allocation by using caches with subblock placement, with a subblock size of one word.
APPENDIX. SUMMARY TABLES

Table XI gives the CPI of the benchmark programs for a perfect memory system. Table XII gives the CPI for each of the benchmark programs for the different memory organizations. The following abbreviations are used in the tables:

wa/wn: write allocate/write no allocate
s/ns: subblock placement/no subblock placement
4/8: block size = 4 words/block size = 8 words
ACKNOWLEDGMENTS

We thank Peter Lee for his encouragement and advice during this work. We thank E. Biagioni, B. Chen, O. Danvy, A. Forin, U. Hoelzle, K. McKinley, E. Nahum, D. Stefanović, and D. Stodolsky for comments on drafts of this article, B. Milnes and T. Dewey for their help, J. Larus for creating QPT, M. Hill for creating Tycho, and A. W. Appel, D. MacQueen, and many others for creating SML/NJ.
Table XII. Cycles per Useful Instruction

                    Associativity = 1                        Associativity = 2
Config     8K   16K   32K   64K  128K  256K  512K    8K   16K   32K   64K  128K  256K  512K

CW
wn,ns,4  2.41  2.07  1.88  1.73  1.43  1.30  1.18  2.22  1.96  1.78  1.62  1.42  1.30  1.17
wa,ns,4  2.44  2.09  1.90  1.74  1.44  1.31  1.18  2.24  1.98  1.79  1.63  1.43  1.31  1.17
wa,s,4   1.96  1.62  1.44  1.39  1.24  1.20  1.17  1.77  1.50  1.33  1.25  1.21  1.18  1.16
wn,ns,8  2.18  1.89  1.72  1.60  1.37  1.27  1.18  1.98  1.76  1.62  1.50  1.35  1.26  1.17
wa,ns,8  2.19  1.89  1.72  1.60  1.37  1.27  1.18  1.98  1.76  1.61  1.49  1.35  1.26  1.17
wa,s,8   1.88  1.59  1.43  1.38  1.23  1.20  1.17  1.68  1.46  1.32  1.25  1.21  1.18  1.16

Leroy
wn,ns,4  2.32  2.18  1.96  1.87  1.79  1.65  1.37  2.17  2.03  1.89  1.82  1.76  1.65  1.40
wa,ns,4  2.53  2.40  2.18  2.08  1.98  1.83  1.50  2.38  2.25  2.09  2.00  1.95  1.83  1.57
wa,s,4   1.66  1.52  1.30  1.20  1.15  1.12  1.09  1.51  1.37  1.21  1.12  1.10  1.09  1.08
wn,ns,8  2.05  1.93  1.75  1.67  1.60  1.50  1.30  1.92  1.81  1.68  1.61  1.57  1.50  1.33
wa,ns,8  2.12  2.00  1.82  1.73  1.66  1.56  1.35  1.99  1.87  1.74  1.67  1.63  1.55  1.38
wa,s,8   1.56  1.44  1.26  1.18  1.13  1.11  1.09  1.43  1.32  1.19  1.11  1.09  1.08  1.07

Lexgen
wn,ns,4  3.59  1.94  1.84  1.72  1.65  1.50  1.31  2.04  1.85  1.70  1.64  1.60  1.50  1.34
wa,ns,4  3.61  1.96  1.85  1.74  1.67  1.51  1.31  2.06  1.87  1.71  1.66  1.61  1.51  1.35
wa,s,4   3.17  1.53  1.42  1.31  1.27  1.21  1.19  1.62  1.43  1.28  1.22  1.20  1.19  1.18
wn,ns,8  2.96  1.78  1.67  1.57  1.51  1.39  1.27  1.86  1.71  1.56  1.49  1.45  1.38  1.28
wa,ns,8  2.97  1.79  1.68  1.57  1.51  1.40  1.27  1.87  1.71  1.57  1.50  1.46  1.39  1.28
wa,s,8   2.70  1.51  1.41  1.30  1.26  1.21  1.19  1.60  1.44  1.30  1.22  1.20  1.18  1.18

Life
wn,ns,4  1.79  1.70  1.65  1.62  1.61  1.42  1.20  1.77  1.64  1.61  1.60  1.60  1.48  1.19
wa,ns,4  1.80  1.70  1.65  1.62  1.60  1.42  1.20  1.74  1.64  1.60  1.60  1.59  1.48  1.19
wa,s,4   1.39  1.29  1.24  1.21  1.20  1.19  1.19  1.33  1.23  1.20  1.19  1.19  1.19  1.18
wn,ns,8  1.65  1.55  1.49  1.47  1.46  1.34  1.19  1.57  1.54  1.46  1.45  1.44  1.37  1.19
wa,ns,8  1.65  1.55  1.50  1.47  1.45  1.34  1.19  1.57  1.51  1.45  1.45  1.44  1.37  1.19
wa,s,8   1.39  1.29  1.24  1.21  1.20  1.19  1.19  1.31  1.25  1.20  1.19  1.19  1.18  1.18

Pia
wn,ns,4  2.23  1.93  1.80  1.75  1.62  1.34  1.12  2.23  1.83  1.73  1.72  1.63  1.36  1.12
wa,ns,4  2.27  1.96  1.82  1.77  1.65  1.36  1.12  2.26  1.85  1.76  1.74  1.66  1.39  1.12
wa,s,4   1.66  1.36  1.22  1.18  1.15  1.13  1.10  1.66  1.25  1.15  1.14  1.12  1.11  1.10
wn,ns,8  1.92  1.70  1.59  1.55  1.46  1.26  1.11  1.87  1.60  1.53  1.51  1.45  1.28  1.11
wa,ns,8  1.95  1.72  1.60  1.56  1.47  1.27  1.11  1.89  1.61  1.54  1.52  1.47  1.30  1.11
wa,s,8   1.55  1.32  1.21  1.17  1.14  1.12  1.10  1.49  1.22  1.14  1.13  1.11  1.10  1.10

Simple
wn,ns,4  2.32  2.02  1.79  1.73  1.70  1.68  1.62  1.98  1.74  1.70  1.70  1.68  1.66  1.63
wa,ns,4  2.35  2.05  1.81  1.75  1.72  1.70  1.64  2.01  1.76  1.72  1.72  1.69  1.68  1.65
wa,s,4   1.80  1.50  1.26  1.21  1.19  1.18  1.15  1.46  1.21  1.18  1.17  1.16  1.15  1.14
wn,ns,8  2.03  1.79  1.57  1.51  1.49  1.47  1.43  1.72  1.52  1.49  1.48  1.47  1.46  1.44
wa,ns,8  2.05  1.80  1.58  1.53  1.50  1.48  1.44  1.74  1.54  1.50  1.49  1.48  1.47  1.44
wa,s,8   1.70  1.45  1.23  1.18  1.16  1.15  1.13  1.39  1.19  1.15  1.15  1.14  1.13  1.13

VLIW
wn,ns,4  3.26  2.73  2.30  1.98  1.79  1.35  1.15  3.06  2.64  2.19  1.88  1.69  1.37  1.14
wa,ns,4  3.29  2.75  2.32  2.00  1.82  1.36  1.15  3.08  2.66  2.21  1.91  1.72  1.40  1.14
wa,s,4   2.67  2.13  1.70  1.39  1.31  1.17  1.13  2.47  2.04  1.59  1.29  1.16  1.14  1.12
wn,ns,8  2.62  2.25  1.93  1.70  1.57  1.28  1.14  2.48  2.16  1.85  1.63  1.50  1.30  1.13
wa,ns,8  2.63  2.26  1.94  1.70  1.58  1.29  1.14  2.48  2.16  1.85  1.64  1.51  1.31  1.13
wa,s,8   2.23  1.86  1.55  1.31  1.25  1.16  1.12  2.09  1.76  1.46  1.25  1.16  1.13  1.12

Yacc
wn,ns,4  2.38  2.16  1.99  1.90  1.83  1.64  1.33  2.13  1.99  1.89  1.84  1.79  1.65  1.33
wa,ns,4  2.42  2.20  2.02  1.92  1.86  1.67  1.34  2.16  2.02  1.92  1.86  1.82  1.68  1.35
wa,s,4   1.85  1.63  1.45  1.35  1.32  1.27  1.21  1.59  1.45  1.35  1.29  1.26  1.24  1.20
wn,ns,8  2.11  1.90  1.75  1.67  1.62  1.49  1.28  1.87  1.75  1.67  1.62  1.58  1.49  1.28
wa,ns,8  2.13  1.92  1.77  1.68  1.63  1.50  1.29  1.89  1.76  1.68  1.63  1.59  1.50  1.29
wa,s,8   1.76  1.55  1.40  1.32  1.29  1.25  1.20  1.52  1.40  1.32  1.27  1.24  1.22  1.19
REFERENCES

APPEL, A. W. 1987. Garbage collection can be faster than stack allocation. Inf. Process. Lett. 25, 4, 275–279.
APPEL, A. W. 1989. Simple generational garbage collection and fast allocation. Softw. Pract. Exper. 19, 2 (Feb.), 171–184.
APPEL, A. W. 1990. A runtime system. Lisp Symb. Comput. 3, 4 (Nov.), 343–380.
APPEL, A. W. 1992. Compiling with Continuations. Cambridge University Press, Cambridge, Mass.
APPEL, A. W. AND JIM, T. 1989. Continuation-passing, closure-passing style. In Proceedings of the 16th Annual ACM Symposium on Principles of Programming Languages (Austin, Tex.). ACM, New York, 293–302.
APPEL, A. W., MATTSON, J. S., AND TARDITI, D. 1989. A lexical analyzer generator for Standard ML. User manual distributed with Standard ML of New Jersey.
BALL, T. AND LARUS, J. R. 1992. Optimally profiling and tracing programs. In the 19th Symposium on Principles of Programming Languages. ACM, New York.
CHEN, J. B. AND BERSHAD, B. N. 1993. The impact of operating system structure on memory system performance. In the 14th Symposium on Operating Systems Principles. ACM, New York.
CHENEY, C. 1970. A nonrecursive list compacting algorithm. Commun. ACM 13, 11 (Nov.), 677–678.
CLEAVELAND, R., PARROW, J., AND STEFFEN, B. 1993. The Concurrency Workbench: A semantics-based tool for the verification of concurrent systems. ACM Trans. Program. Lang. Syst. 15, 1 (Jan.), 36–72.
CROWLEY, W. P., HENDRICKSON, C. P., AND RUDY, T. E. 1978. The SIMPLE code. Tech. Rep. UCID 17715, Lawrence Livermore Laboratory, Livermore, Calif. Feb.
CYPRESS. 1990. SPARC RISC User's Guide, 2nd ed. Cypress Semiconductor, San Jose, Calif.
DEC. 1990a. DECStation 3100 Desktop Workstation Functional Specification, 1.3 ed. Digital Equipment Corp., Palo Alto, Calif.
DEC. 1990b. DS5000/200 KN02 System Module Functional Specification. Digital Equipment Corp., Palo Alto, Calif.
DEC. 1992. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1st ed. Digital Equipment Corp., Maynard, Mass.
DYBVIG, R. K., HIEB, R., AND BRUGGEMAN, C. 1990. Representing control in the presence of first-class continuations. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation (White Plains, N.Y.). ACM, New York, 66–77.
EKANADHAM, K. AND ARVIND. 1987. SIMPLE: An exercise in future scientific programming. Tech. Rep. Computation Structures Group Memo 273, MIT, Cambridge, Mass. July. Also available as IBM/T. J. Watson Research Center Research Rep. 12686.
FENICHEL, R. AND YOCHELSON, C. 1969. A LISP garbage-collector for virtual-memory computer systems. Commun. ACM 12, 11 (Nov.), 611–612.
HILL, M. D. 1988. A case for direct-mapped caches. Computer 21, 12 (Dec.), 25–40.
HILL, M. AND SMITH, A. 1989. Evaluating associativity in CPU caches. IEEE Trans. Comput. 38, 12 (Dec.), 1612–1630.
JOUPPI, N. P. 1993. Cache write policies and performance. In Proceedings of the 20th Annual International Symposium on Computer Architecture (San Diego, Calif.). ACM, New York, 191–201.
KANE, G. AND HEINRICH, J. 1992. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, N.J.
KESSLER, R. E. AND HILL, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10, 4 (Nov.), 338–359.
KOOPMAN, P. J., JR., LEE, P., AND SIEWIOREK, D. P. 1992. Cache behavior of combinator graph reduction. ACM Trans. Program. Lang. Syst. 14, 2 (Apr.), 265–277.
KRANZ, D., KELSEY, R., REES, J., HUDAK, P., PHILBIN, J., AND ADAMS, N. 1986. ORBIT: An optimizing compiler for Scheme. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction (Palo Alto, Calif.). ACM, New York, 219–233.
LARUS, J. R. 1990. Abstract execution: A technique for efficiently tracing programs. Softw. Pract. Exper. 20, 12 (Dec.), 1241–1258.
LARUS, J. R. AND BALL, T. 1992. Rewriting executable files to measure program behavior. Tech. Rep. 1083, Computer Sciences Dept., Univ. of Wisconsin-Madison, Madison, Wisc. Mar.
MILNER, R., TOFTE, M., AND HARPER, R. 1990. The Definition of Standard ML. MIT Press, Cambridge, Mass.
PATTERSON, D. A. AND HENNESSY, J. L. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, Calif.
PENG, C.-J. AND SOHI, G. S. 1989. Cache memory design considerations to support languages with dynamic heap allocation. Tech. Rep. 860, Computer Sciences Dept., Univ. of Wisconsin-Madison, Madison, Wisc. July.
PRZYBYLSKI, S. 1990. Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan Kaufmann, San Mateo, Calif.
READE, C. 1989. Elements of Functional Programming. Addison-Wesley, Reading, Mass.
REINHOLD, M. B. 1993. Cache performance of garbage-collected programming languages. Ph.D. thesis, Laboratory for Computer Science, MIT, Cambridge, Mass.
SLATER, M. 1991. PA workstations set price/performance records. Microprocess. Rep. 5, 6 (Apr.).
TARDITI, D. AND APPEL, A. W. 1990. ML-YACC, version 2.0. Distributed with Standard ML of New Jersey.
TARDITI, D. AND DIWAN, A. 1993. The full cost of a generational copying garbage collection implementation. In the OOPSLA '93 Workshop on Memory Management and Garbage Collection.
WAUGH, K. G., MCANDREW, P., AND MICHAELSON, G. 1990. Parallel implementations from function prototypes: A case study. Tech. Rep. Computer Science 90/4, Heriot-Watt Univ., Edinburgh, U.K.
WILSON, P. R., LAM, M. S., AND MOHER, T. G. 1990. Caching considerations for generational garbage collection: A case for large and set-associative caches. Tech. Rep. EECS-90-5, Univ. of Illinois at Chicago, Chicago, Ill. Dec.
WILSON, P. R., LAM, M. S., AND MOHER, T. G. 1992. Caching considerations for generational garbage collection. In the ACM Conference on Lisp and Functional Programming (San Francisco, Calif.). ACM, New York, 32–42.
ZORN, B. 1991. The effect of garbage collection on cache performance. Tech. Rep. CU-CS-528-91, Univ. of Colorado at Boulder, Boulder, Colo. May.

Received December 1993; revised January 1995; accepted May 1995

ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995.