Memory System Performance of Programs with Intensive Heap Allocation

AMER DIWAN, DAVID TARDITI, and J. ELIOT B. MOSS
University of Massachusetts and Carnegie Mellon University

Heap allocation with copying garbage collection is a general storage management technique for programming languages. It is believed to have poor memory system performance. To investigate this, we conducted an in-depth study of the memory system performance of heap-allocation-intensive Standard ML programs on memory systems typical of workstations. We found that many machines have poor performance for these programs, but that good performance can be achieved on memory systems with a write-allocate write-miss policy and subblock placement with a subblock size of one word, along with fast page-mode writes or a write buffer. With these features and a 64K or larger cache, the memory system penalty of heap allocation was under 9% for most of the programs; for other organizations the penalty was often higher than 50%.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—associative memories; cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids—simulation; C.4 [Computer Systems Organization]: Performance of Systems; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.3.2 [Programming Languages]: Language Classification; D.3.3 [Programming Languages]: Language Constructs and Features—dynamic storage management

General Terms: Experimentation, Languages, Measurement, Performance

Additional Key Words and Phrases: Automatic storage reclamation, cache memories, garbage collection, heap allocation, page-mode writes, subblock placement, write-allocate, write-back, write buffer, write-miss policy, write-through

A paper presenting some of the results in this article appeared in the 21st Annual ACM Symposium on Principles of Programming Languages. This research was sponsored by the Advanced Research Projects Agency, DoD, ARPA Order 8313, monitored by ESD/AVS under contract F19628-91-C-0168. D. Tarditi was also supported by an AT&T Ph.D. Scholarship. The views and conclusions expressed in this article are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency, the United States Government, or AT&T.

Authors' addresses: A. Diwan and J. E. B. Moss, Department of Computer Science, University of Massachusetts, Amherst, MA 01003-4610; email: {diwan; moss}@cs.umass.edu; D. Tarditi, Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 1995 ACM 0734-2071/95/0800-0244 $03.50

ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995, Pages 244-273.

1. INTRODUCTION

Heap allocation with copying garbage collection is widely believed to have poor memory system performance [Koopman et al. 1992; Peng and Sohi 1989; Wilson et al. 1990; 1992; Zorn 1991]. To investigate this, we conducted an extensive study of the memory system performance of heap-allocation-intensive programs on memory organizations typical of workstations. The programs, compiled with the SML/NJ compiler [Appel 1992], do tremendous amounts of heap allocation, allocating one word every 4 to 10 instructions. The programs used a generational copying garbage collector to manage their heaps.

To our surprise, we found that the programs performed well on memory organizations corresponding to actual machines, such as the DECStation 5000/200, on which their memory system performance was comparable to that of C and Fortran programs [Chen and Bershad 1993]: for caches of 32K or larger, the programs ran only 3 to 13% slower than they would have with an infinitely fast memory. For other configurations, the slowdown due to data cache misses was often higher than 50%. The memory system features important for achieving good performance with heap allocation are subblock placement with a subblock size of one word combined with a write-allocate write-miss policy, page-mode writes, and write buffers. Memory systems whose caches are smaller (256K or less) and that do not have one or more of these features performed poorly for the benchmarks mentioned above.

Our work differs from previous work [Koopman et al. 1992; Peng and Sohi 1989; Wilson et al. 1990; Zorn 1991] in two ways. First, the overall miss ratio, the performance metric used in most previous work, is a misleading indicator of the performance of these programs. The overall miss ratio neglects the fact that read and write misses may have different costs. Also, the overall miss ratio does not reflect the rates of reads and writes, which may affect performance substantially. We use the memory system contribution to cycles per instruction (CPI) as our performance metric, which accurately reflects the effect of the memory system on program running time. Second, previous work concentrated solely on caches and did not model the entire memory system. Memory system features such as write buffers and page-mode writes interact with the costs of hits and misses in the cache and should be simulated to give a correct picture of memory system behavior. We simulate the entire memory system.

We did the study by instrumenting programs to produce traces of all memory references. We fed the references into a memory system simulator which calculated a performance penalty due to the memory system. We fixed the architecture to be the MIPS R3000 [Kane and Heinrich 1992] and varied the cache configurations to cover the design space typical of workstations such as DECStations, SPARCStations, and the HP 9000 series 700. We studied eight substantial programs.

We varied the following memory system parameters: cache size (8K to 512K), cache block size (16 or 32 bytes), write-miss policy (write-allocate or write-no-allocate), subblock placement (with and without), associativity (one- or two-way), write-buffer depth (1 to 6 deep), TLB sizes (1 to 64 entries), and page-mode writes (with and without). We simulated split instruction and data caches, i.e., no unified caches. We report data only for write-through caches, but the results extend easily to write-back caches.

Section 2 gives background information. Section 3 describes related work. Section 4 describes the simulation methods, the metrics, and the benchmarks used. Section 5 presents the memory system performance results, analyzes them, validates them, and gives an analytical model that extends them to programs with different allocation behavior. Section 6 suggests areas for future work. Section 7 concludes.
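The distinction drawn above between the overall miss ratio and the memory system contribution to CPI can be made concrete with a small calculation. The sketch below is illustrative only: the workloads, counts, and the zero-cost write miss (as with subblock placement) are hypothetical numbers chosen for the example, not measurements from this study.

```python
# Two workloads with the SAME overall miss ratio (5% of 1000 accesses)
# can impose very different memory-system overheads once read and write
# misses are priced separately.

def memory_cpi_contribution(reads, writes, read_miss_ratio, write_miss_ratio,
                            read_miss_cost, write_miss_cost, instructions):
    """Cycles spent waiting on the memory system per useful instruction."""
    stall_cycles = (reads * read_miss_ratio * read_miss_cost +
                    writes * write_miss_ratio * write_miss_cost)
    return stall_cycles / instructions

# Workload A: all misses are reads (15-cycle penalty).
a = memory_cpi_contribution(reads=800, writes=200,
                            read_miss_ratio=0.0625, write_miss_ratio=0.0,
                            read_miss_cost=15, write_miss_cost=0,
                            instructions=4000)

# Workload B: all misses are writes, and write misses cost 0 cycles
# (as they can with subblock placement).
b = memory_cpi_contribution(reads=800, writes=200,
                            read_miss_ratio=0.0, write_miss_ratio=0.25,
                            read_miss_cost=15, write_miss_cost=0,
                            instructions=4000)
```

Both workloads miss on 50 of 1000 accesses, yet workload A contributes 0.1875 cycles per instruction while workload B contributes nothing, which is exactly the information the overall miss ratio hides.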

2. BACKGROUND

The following sections describe memory systems, garbage collection in SML/NJ, SML, and the SML/NJ compiler.

2.1 Memory Systems

This section describes cache organization for a single level of caching. A cache is divided into blocks, which are grouped into sets. A memory block may reside in exactly one set, but may reside in any block within the set. A cache with sets of size n is said to be n-way associative. If n = 1, the cache is called direct-mapped. Some caches have valid bits, to indicate what sections of a block hold valid data. A subblock is the smallest part of a cache with which a valid bit is associated. In this article, subblock placement implies a subblock of one word, i.e., valid bits are associated with each word. Moreover, on a read miss, the whole block is brought into the cache, not just the subblock that missed. Przybylski [1990] notes that this is a good choice.

A memory access to a location which is resident in the cache is called a hit. Otherwise, the memory access is a miss. A miss is a compulsory miss if it results from the memory block being accessed for the first time. A miss is a capacity miss if it results from the cache not being large enough to hold all the blocks used by a program. It is a conflict miss if it results from two blocks mapping to the same set [Hill 1988].

A read miss is handled by copying the missing block from the main memory to the cache. A write hit is always written to the cache. There are several policies for handling a write miss, which differ in their performance penalties. For each of the policies, the actions taken on a write miss are:

(1) Write-no-allocate:
—Do not allocate a block in the cache.
—Send the write to main memory, without putting the write in the cache.

(2) Write-allocate, no subblock placement:
—Allocate a block in the cache.
—Fetch the corresponding memory block from main memory.
—Write the word to the cache (and to memory if write-through).

(3) Write-allocate, subblock placement:
If the tag matches but the valid bit is off:
—Write the word to the cache (and to memory if write-through).
If the tag does not match:
—Allocate a block in the cache.
—Write the word to the cache (and to memory if write-through).
—Invalidate the remaining words in the block.

Write-allocate/subblock placement will have a lower write-miss penalty

than write-allocate/no subblock placement since it avoids fetching a memory block from main memory. In addition, it will have a lower penalty than write-no-allocate if the written word is read before being evicted from the cache. See Jouppi [1993] for more information on write-miss policies. A write

buffer may be used to reduce the cost of writes to main memory. A write buffer is a queue containing writes that are to be sent to main memory. When the CPU does a write, the write is placed in the write buffer, and the CPU continues without waiting for the write to finish. The write buffer retires entries to main memory using free memory cycles. The write

buffer cannot always prevent stalls on writes to main memory. First, if the CPU writes to a full write buffer, the CPU must wait for an entry to become available in the write buffer. Second, if the CPU reads a location queued in the write buffer, the CPU may need to wait until the write buffer is empty. Third, if the CPU issues a read to main memory while a write is in progress, the CPU must wait for the write to finish.
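The stall conditions just described can be sketched in a few lines of code. This is a deliberately simplified model, not the DECStation's actual buffer: the depth, the one-entry-per-call retirement, and draining the whole queue on a conflicting read are assumptions made for illustration.

```python
from collections import deque

class WriteBuffer:
    """Toy write buffer: counts CPU stall cycles from two of the
    stall conditions (full buffer on write, read of a queued write)."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = deque()
        self.stall_cycles = 0

    def retire_one(self):
        # One entry drains to main memory on a free memory cycle.
        if self.queue:
            self.queue.popleft()

    def write(self, addr):
        if len(self.queue) == self.depth:   # condition 1: buffer is full
            self.stall_cycles += 1          # wait for an entry to free up
            self.retire_one()
        self.queue.append(addr)

    def read(self, addr):
        if addr in self.queue:              # condition 2: read of a queued write
            self.stall_cycles += len(self.queue)  # wait for the buffer to empty
            self.queue.clear()
```

For example, with a depth-2 buffer, a third consecutive write stalls for one cycle, and a subsequent read of a still-queued address stalls until both queued writes drain.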

Main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to the same DRAM page when there are no intervening memory accesses to another DRAM page [Patterson and Hennessy 1990]. For example, on a DECStation 5000/200, a non-page-mode write takes five cycles, while a page-mode write takes one cycle. Page-mode writes are especially effective at handling writes with high spatial locality, such as those seen when doing sequential allocation.

2.2 Memory System Performance

This section describes two metrics for measuring the performance of memory systems. One popular metric is the cache miss ratio, which is the number of memory accesses which miss divided by the total number of memory accesses. Since different kinds of memory accesses usually have different miss costs, it is useful to have miss ratios for each kind of access.

Cache miss ratios alone do not measure the impact of the memory system on overall system performance. A metric which better measures this is the contribution of the memory system to cycles per useful instruction (CPI); all instructions besides nops (software-controlled pipeline stalls) are considered useful. CPI is calculated for a program as the number of CPU cycles to complete the program/total number of useful instructions executed. It measures how efficiently the CPU is being utilized. The contribution of the memory system to CPI is calculated as the number of CPU cycles spent

waiting for the memory system/total number of useful instructions executed. As an example, on a DECStation 5000/200, the lowest CPI possible is 1, completing one instruction per cycle. If the CPI for a program is 1.5, and the memory contribution to CPI is 0.3, then 20% (0.3/1.5) of the CPU cycles are spent waiting for the memory system (the rest may be due to other causes such as nops, multicycle instructions like integer division, etc.). CPI is machine dependent since it is calculated using actual penalties.

    cmp   alloc+12, top     ; Check for heap overflow
    branch-if-gt call-gc
    store tag, (alloc)      ; Store tag
    store ra, 4(alloc)      ; Store value
    store rd, 8(alloc)      ; Store pointer to next cell
    move  alloc+4, result   ; Save pointer to cell
    add   alloc, 12         ; Increment allocation pointer

Fig. 1. Pseudoassembly code for allocating a list cell.

2.3 Copying Garbage Collection

A copying garbage collector [Cheney 1970; Fenichel and Yochelson 1969] reclaims an area of memory by copying all the live (nongarbage) data in the area to another area of memory. All the data left in the garbage-collected area becomes garbage, and the area can be reused. Since memory is reclaimed in large contiguous areas, objects can be allocated sequentially from such areas in several instructions. Figure 1 gives an example of pseudoassembly code for allocating a list cell. ra contains the value to be stored in the list cell; rd contains the pointer to the next list cell; alloc is the address of the next free word

in the allocation area; and top contains the end of the allocation area.

2.4 Garbage Collection in SML/NJ

The SML/NJ compiler uses a simple generational copying garbage collector [Appel 1989]. Memory is divided into an old generation and an allocation area. New objects are created in the allocation area; garbage collection copies the live objects in the allocation area to the old generation, freeing up the allocation area. Generational garbage collection relies on the fact that most allocated objects die young; thus most objects (about 99% [Appel 1992, p. 206]) are not copied from the allocation area. This makes the garbage collector efficient, since it works mostly on an area of memory where it is very effective at reclaiming space.

The most-important property of a copying collector with respect to memory

to memory

system behavior is that allocation sequentially initializes memory which has not been touched in a long time and is thus unlikely to be in the cache. This is especially true if the allocation area is large relative to the size of the cache since allocation will knock everything out of the cache. This means that caches which cannot hold the allocation area will incur a large number of write misses. For example consider the code in Figure 1 for allocating a list cell. Assume that a cache write miss costs 16 CPU cycles and that the block size is four words. On average, every fourth word allocated causes a write miss. Thus, the average memory system cost of allocating a word on the heap is four cycles. ACM

Transactions

on Computer

Systems,

Vol

13, No. 3, August

1995.

The average cost for allocating a list cell is seven cycles (at one cycle per instruction) plus 12 cycles of memory system overhead. Thus, while allocation is cheap in terms of instruction counts, it may be expensive in terms of machine cycle counts.

2.5 Standard ML

Standard ML (SML) [Milner et al. 1990] is a call-by-value, lexically scoped programming language with higher-order functions. SML encourages a nonimperative programming style. Variables cannot be altered once they are bound, and by default data structures cannot be altered once they are created. The only

kinds of assignable data structures are ref cells and arrays, which must be explicitly declared. The implications of this nonimperative programming style for compilation are clear: SML programs tend to do more allocation and copying than programs written in imperative languages.

2.6 SML/NJ Compiler

The SML/NJ compiler [Appel 1992] is a publicly available compiler for SML. We used version 0.91. The compiler concentrates on making allocation cheap and function calls fast. Allocation is done inline, except for the allocation of arrays. Optimizations done by the compiler include inlining, passing function arguments in registers, uncurrying, constant folding, code hoisting, register targeting, and instruction scheduling.

The most-controversial design decision in the compiler was to allocate procedure activation records on the heap instead of the stack [Appel 1987; Appel and Jim 1989]. In principle, the presence of higher-order functions means that procedure activation records must be allocated on the heap. With a suitable compiler analysis, a stack can be used to store most activation records [Kranz et al. 1986]. However, using only a heap simplifies the compiler, the run-time system [Appel 1990], and the implementation of first-class continuations [Hieb et al. 1990]. The decision to use only a heap was controversial because it greatly increases the amount of heap allocation, which is believed to cause poor memory system performance.

3. RELATED WORK

There have been many studies of the cache behavior of systems using heap

allocation and some form of copying garbage collection. Peng and Sohi [1989] examined the data cache behavior of small Lisp programs. They used tracedriven simulation and proposed an ALLOCATE instruction for improving cache behavior, which allocates a block in the cache without fetching it from memory. Wilson et al. [1990; 1992] argued that cache performance of programs with generational garbage collection will improve substantially when the youngest generation fits in the cache. Koopman et al. [1992] studied the effect

of cache organization on combinator graph reduction, an implementation technique for lazy functional programming languages. They observed the importance of a write-allocate policy with subblock placement for improving heap allocation. Zorn [1991] studied the effect of cache behavior on the performance of a Common Lisp system when stop-and-copy and mark-and-sweep garbage collection algorithms were used. He concluded that programs run with mark-and-sweep have substantially better cache locality than when run with stop-and-copy.

Our work differs from previous work in two ways. First, the overall miss ratio, which is the performance metric used in previous work, is a misleading indicator of performance. The overall miss ratio neglects the fact that read and write misses may have different costs. Also, the overall miss ratio does not reflect the rates of reads and writes, which may substantially affect

performance. We use memory system contribution to CPI as our performance metric, which accurately reflects the effect of the memory system on program running time. Second, previous work did not model the entire memory system:

it concentrated solely on caches and did not model the entire memory system. Memory system features such as write buffers and page-mode writes interact with the costs of hits and misses in the cache and should be simulated to give a correct picture of memory system behavior. We simulate the entire memory system.

Appel [1992] estimated CPI for the SML/NJ system on a single machine using elapsed time and instruction counts. His CPI differs substantially from ours. However, Appel has confirmed our measurements by personal communication and later work [Appel 1993]. The reason for the difference is that instructions were undercounted in his measurements.

for the programs

that

write-allocate

ours: that architectural he studied

with

a write-allocate policy feature. He found that

was comparable

subblock

placement

with subblock the write-miss

to the read-miss

eliminated

many

ratio, of the

and write

misses. 4. METHODOLOGY We used trace-driven simulations to evaluate the memory system performance of programs. For simulations to be useful, there must be an accurate simulation model and a good selection of benchmarks. Simulations that make simplifying assumptions about important aspects of the system being modeled can yield misleading results. equally misleading. In this work,

Toy or unrepresentative benchmarks can be much effort has been devoted to addressing

these issues. Section 4.1 describes our trace generation and simulation tools. Section 4.2 4.3 states our assumptions and argues that they are reasonable. Section describes and characterizes the benchmark programs used in this study. Section 4.4 describes the metrics used to present memory system performance. 4.1 Tools We extended Larus 1990; ACM

TransactIons

QPT (Quick Program Profiler and Tracer) [Ball and Larus 1992; Larus and Ball 1992] to produce memory traces for SML/NJ on

Computer

Systems,

Vol

13.

No.

3. August

1995

Performance programs.

QPT

rewrites

of Programs with Intensive Heap Allocation

an executable

information;

QPT

that

the compressed

expands

also

produces

the executable program, (which is written in C).

program

trace

into

it can trace

We obtained ate elements count 4.2

allocation

The profiler

This

(1) Simulating Subblock

1989] for the memory system a write-buffer simulator.

by using

an allocation

intermediate

on every

profiler

We extended

simulabuilt

code to increment

allocation.

on

collector

into

appropri-

this

profiler

to

assumptions

section

which

describes

might

reduce

the important

the

assumptions

validity

of our

we made.

it by simulating write-allocate/no sub block and ignoring the memory that occur on a write miss. This can cause a small

inaccuracy

in the CPI numbers. that

the cache

block

and

the

second

The following

example

size is 2 words,

word, that a program writes the write misses. In subblock cache,

QPT operates

code and the garbage

Write-Allocate /Subblock Placement with Write-Allocate/No Placement. Tycho does not simulate subblock placement so we

approximate reads from

Suppose

Because

trace

program

and Assumptions

to minimize

simulations.

array

compressed

regeneration

trace.

251

of assignments.

Simplifications

We tried

statistics instruments

of a count

the number

a full

the SML

We extended Tycho [Hill and Smith tions. Our extensions to Tycho include SML\NJ.

to produce

a program-specific

.

that

illustrates

the subblock

this. size is 1

the first word in a memory block, and that placement, the word will be written to the

word

in

the

cache

block

will

be

invalidated.

However, the simplified model will mark both words as valid after the write. Subsequently, if the program reads the second word, the read will incorrectly hit. Thus the CPI reported for caches with subblock placement can be less than the actual CPI. These incorrect hits, however, are rare, since SML programs do few assignments (see Section 4.3) and since most writes are to sequential locations.

(2) Ignoring the effects of context switches and system calls.

(3) The simulations are driven by virtual addresses. Some machines such as the SPARCStation II have physically indexed caches, and will have different conflict misses than those reported here.

(4) Placing code in the text segment instead of the heap. This improves performance over the unmodified SML/NJ system. It reduces garbage collection costs, since code is never copied, and avoids instruction-cache flushes after garbage collections.

(5) Used default compilation settings for SML/NJ. Default compilation settings enable extensive optimization (Section 2.6). Evaluating the impact of these optimizations on cache behavior is beyond the scope of this article.

(6) Used default garbage collection settings. The preferred ratio of heap size to live data was set to 5 [Appel 1989]. The softmax, which is the desired upper limit on the heap size, was set to 20MB; the benchmark programs

Table I. Benchmark Programs

CW: The Concurrency Workbench [Cleaveland et al. 1993] is a tool for analyzing networks of finite-state processes expressed in Milner's Calculus of Communicating Systems. The input is a sample session from Section 7.5 of Cleaveland et al. [1993].

Knuth-Bendix: An implementation of the Knuth-Bendix completion algorithm, implemented by Gerard Huet, processing some axioms of geometry.

Lexgen: A lexical-analyzer generator [Appel et al. 1989], implemented by James S. Mattson and David R. Tarditi, processing the lexical description of Standard ML.

Life: The game of Life, written by Chris Reade [Reade 1989], running 50 generations of a glider gun. It is implemented using lists.

PIA: The Perspective Inversion Algorithm [Waugh et al. 1990] decides the location of an object in a perspective video image.

Simple: A spherical fluid-dynamics program [Crowley et al. 1978], translated into ID [Ekanadham and Arvind], and then translated into Standard ML by Lal George. It is used as a "realistic" Fortran benchmark.

VLIW: A Very-Long-Instruction-Word instruction scheduler written by John Danskin.

YACC: A LALR(1) parser generator, implemented by David R. Tarditi [Tarditi and Appel 1990], processing the grammar of Standard ML.

never reached this limit. We did not investigate the interaction of the initial heap size and the sizing strategy with memory system performance; understanding these tradeoffs is beyond the scope of this article.

(7) MIPS as a prototypical RISC machine. All the traces are for the DECStation 5000/200, which uses a MIPS R3000 CPU.

(8) All instructions take one cycle with a perfect memory system. This affects only write-buffer costs, since multicycle instructions give the write buffer more time to retire writes. The inaccuracy introduced by this assumption is negligible, since Section 5.4 shows that write-buffer costs are small.

(9) Assuming CPU cycle time does not vary with memory organization. This may not be true, since the CPU cycle time depends on the cache access time, which may differ across cache organizations. For example, a 128K cache may take longer to access than an 8K cache.

Benchmarks

Table I describes the benchmark programs. ~ Knuth-Bendix, Lexgen, Life, Simple, VLIW, and YACC are identical to the benchmarks measured by Appel [1992]. The description of these benchmarks is copied from Appel [1992]. Table II gives the following for each benchmark: lines of SML code excluding comments and empty lines, maximum heap size, compiled-code size, and user-mode CPU time on a DECStation 5000/200. The code size includes 207KB for standard libraries, but does not include the garbage

lAvailable ACM

from

TransactIons

the on

authors, Computer

Systems,

Vol.

13,

No

3, August

1995

Table II. Sizes of Benchmark Programs

Program        Lines   Heap Size (KB)   Code Size (KB)   Non-gc Time (sec)   Gc Time (sec)
CW              5728    1107             894              22.74               3.09
Knuth-Bendix     491    2768             251              13.47               1.48
Lexgen          1224    2162             305              15.07               1.06
Life             111    1026             221              16.97               0.19
PIA             1454    1025             291               6.07               0.34
Simple           999   11571             314              25.58               4.23
VLIW            3207    1088             486              23.70               1.91
YACC            5751    1632             580               4.60               1.98

Table III. Characteristics of Benchmark Programs

Program        Inst Fetches   Reads (%)   Writes (%)   Partial Writes (%)   Assignments (%)   Nops (%)
CW             523,245,987    17.61       11.61        0.41                 0.01              13.24
Knuth-Bendix   312,086,438    19.66       22.31        0.77                 0.00               9.04
Lexgen         328,422,283    16.08       10.44        0.38                 0.20              11.14
Life           413,536,662    12.18        9.26        0.21                 0.00               5.92
PIA            122,215,151    25.27       16.50        0.00                 0.00              12.33
Simple         604,611,016    23.86       14.06        0.00                 0.00              15.45
VLIW           399,812,033    17.89       15.99        0.05                 0.10               8.39
YACC           133,043,324    18.49       14.66        0.32                 0.00               7.58

collector and other run-time support, which is about 60KB. The run times are the minimum of five runs.

Table III characterizes the benchmark programs according to the kinds of memory references they do. All numbers are reported as a percentage of the number of instructions. The Reads, Writes, and Partial Writes columns list the reads, full-word writes, and partial-word writes done by the program and the garbage collector; the Assignments column lists the noninitializing writes done by the program only. The Nops column lists the nops executed by the program and the garbage collector. All the benchmarks have long traces; most related works use traces that are an order of magnitude smaller. Also, the benchmark programs do few assignments; the majority of the writes are initializing writes.

and the garbage collector; the assignments column lists the noninitializing writes done by the program only. The Nops column lists the nops executed by the program and the garbage collector. All the benchmarks have long traces; most related works use traces that are an order of magnitude smaller. Also, the benchmark programs do few assignments; the majority of the writes are initializing writes. Table IV gives the allocation and total allocation

allocation

the allocation by kind: functions, closures for cludes

statistics

for

each

benchmark

program.

All

sizes are reported in words. The Allocation column lists the done by the benchmark. The remaining columns break down

spill

records,

z Closures

for

callee-save

first-class

continuations.

closures for escaping functions, closures callee-save continuations,2 records, and

arrays,

continuations

ACM

strings,

vectors,

can be trivially

Transactions

ref cells,

allocated

on Computer

store

list

on a stack

Systems,

for known others (inrecords,

in the

and

absence

Vol. 13, No. 3, August

of

1995.

254 · Amer Diwan et al.

Table IV. Allocation Characteristics of Benchmark Programs

                             Known         Escaping      Callee-Save   Records       Other
Program        Allocation    %     Size    %     Size    %     Size    %     Size    %     Size
CW             56,467,440    4.0   4.12    3.3   15.39   67.2  6.20    19.5  3.01    6.0   4.00
Knuth-Bendix   67,733,930    37.6  6.60    0.1   15.22   49.5  4.90    12.7  3.00    0.1   15.05
Lexgen         33,046,349    3.4   6.20    5.4   12.96   72.7  6.40    15.1  3.00    3.7   6.97
Life           37,840,681    0.2   3.45    0.0   15.00   77.8  5.52    22.2  3.00    0.0   10.29
PIA            18,841,256    0.4   5.56    28.0  11.99   25.0  4.69    12.7  3.41    33.9  3.22
Simple         80,761,644    4.0   5.70    1.1   15.33   68.1  6.43    8.3   3.00    18.5
VLIW           59,497,132    9.9   5.22    6.0   26.62   61.8  7.67    20.3  3.01    2.1   2.60
YACC           17,015,250    2.3   4.83    15.3  15.35   54.8  7.44    23.7  3.04    4.0   10.22

For each allocation kind, the % column gives the words allocated for objects of that kind as a percentage of total allocation, and the Size column gives the average size of an object of that kind in words, including the one-word tag.

Table V. Timings of Memory Operations

Task                                     Timings (Cycles)
Read 16 bytes from memory                15
Read 32 bytes from memory                19
Page-mode write                          1
Non-page-mode write                      11
Partial-word write                       5
Write hit or miss (subblocks)            0
Write hit (16 bytes, no subblocks)       0
Write hit (32 bytes, no subblocks)       0
Write miss (16 bytes, no subblocks)      15
Write miss (32 bytes, no subblocks)      19
Page-mode flush                          4
TLB miss                                 28
Refresh period                           195
Refresh time                             5

4.4 Metrics

We state cache performance numbers in cycles per useful instruction (CPI). All instructions besides nops are considered useful. Table V lists the timings used in the simulations. These numbers are derived from the penalties for the DECStation 5000/200, but are similar to those in other machines of the same class. In addition to the times in Table V, all reads and writes may also incur write-buffer penalties. In an actual implementation, there may be a one-cycle penalty for write misses in caches with subblock placement, because a tag needs to be written to the cache after the miss is detected. This does not change our results, since it adds at most 0.02-0.05 to the CPI of caches with subblock placement.

We used a DRAM page size of 4K in the memory system simulations; page-mode flush is the number of cycles needed to flush a series of page-mode writes. A virtual memory page size of 4K was used in the TLB simulations. The TLB data is reported as the TLB miss contribution to the CPI; this metric is used instead of just the miss ratio to allow us to present the data for all the memory system configurations in one chart.

Performance of Programs with Intensive Heap Allocation · 255
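The CPI bookkeeping described above can be made concrete with a short sketch. The numbers below are illustrative only (they are not measurements from this paper); the stall categories mirror the breakdown used later in the results.

```python
# CPI accounting in the style described above: cycles are charged per
# *useful* (non-nop) instruction. All values here are hypothetical.

def cpi(total_cycles, total_instructions, nops):
    """Cycles per useful instruction: nops are executed but not counted."""
    useful = total_instructions - nops
    return total_cycles / useful

instructions = 1_000_000
nops = 100_000
base_cycles = instructions            # one cycle per instruction executed
stalls = {"read miss": 120_000, "inst fetch miss": 60_000,
          "write buffer": 5_000, "partial word": 1_000}

total_cycles = base_cycles + sum(stalls.values())
print(round(cpi(total_cycles, instructions, nops), 3))      # overall CPI

# Per-component CPI contributions, as plotted in the breakdown graphs:
for cause, cycles in stalls.items():
    print(cause, round(cycles / (instructions - nops), 3))
```

Note that because nops are excluded from the denominator, a program with many nops has a base CPI above 1 even with a perfect memory system, which is exactly what the "nops" curve in the summary graphs shows.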

5. RESULTS AND ANALYSIS

In Section 5.1 we present a qualitative analysis of the memory system behavior of programs compiled with SML/NJ. In Section 5.2 we list the cache and TLB configurations simulated and explain why they were selected. In Sections 5.3, 5.4, and 5.5 we present the cache performance, write-buffer behavior, and TLB performance for the benchmarks. In Section 5.6 we validate the simulations. In Section 5.7 we present an analytical model that extends these results to programs with different allocation behavior.

5.1 Qualitative Analysis

Recall from Section 2 that SML/NJ uses a copying garbage collector. The most-important property of a copying collector with respect to memory system behavior is that allocation initializes memory that has not been touched since the last garbage collection. This means that for caches that are not large enough to contain the allocation area, there will be many write misses. The slowdown that these misses translate to depends on the memory system organization.

Recall from Section 4.3 that SML/NJ programs have the following important properties. First, they do few assignments: the majority of the writes are initializing writes. Second, programs do heap allocation at a furious rate: 0.1 to 0.22 words per instruction. Third, writes come in bunches, since they correspond to initialization of a newly allocated area.

The burstiness of the writes combined with the property of copying collectors mentioned above suggests that an aggressive write policy is necessary. In particular, writes should not stall the CPU. Memory system organizations where the CPU has to wait for a write to be written through (or back) to memory will perform poorly. Even memory systems where the CPU does not need to wait for writes if they are issued far enough apart (e.g., two cycles apart in the HP 9000 series 700) may perform poorly due to the bunching of writes.

This means that the memory system needs two features. First, a write buffer or fast page-mode writes are essential to avoid waiting for writes to memory. Second, on a write miss, the memory system must avoid reading a cache block from memory if it is going to be written before being read. Of course, this requirement only holds for caches with a write-allocate policy. Subblock placement with a subblock size of one word [Koopman et al. 1992], or the ALLOCATE instruction [Peng and Sohi 1989], can achieve this. Since the effects on cache performance of these features are similar, we discuss only subblock placement. For large caches, when the allocation area fits in the cache and


Table VI. Memory System Organizations Studied

Cache Size   Assoc   Block Sizes    Write Buffer   Page Mode   Write Policy   Miss Policy   Subblocks
8K-512K      1, 2    16, 32 bytes   1-6 deep       yes         through        allocate      yes
8K-512K      1, 2    16, 32 bytes   6 deep         no          through        allocate      no
8K-512K      1, 2    16, 32 bytes   6 deep         no          through        no allocate   no

Table VII. Memory System Organization of Some Popular Machines

Machine           Write Policy   Miss Policy   Write Buffer   Subblocks   Assoc   Block Size   Cache Size
DS3100            through        allocate      4 deep                     1       4 bytes      64K
DS5000/200        through        allocate      6 deep         yes         1       16 bytes     64K
HP 9000           back           allocate      none           no          1       32 bytes     64K-2M
SPARCStation II   through        no allocate   4 deep         no          1       32 bytes     64K

SPARCStations have unified instruction and data caches (128K and 256K for model II). For the HP 9000 series 700, the instruction caches are much smaller than the data caches: the data cache is 256K for models 720 and 730, and 2M for model 750. The DS5000/200 actually has a block size of four bytes with a fetch size of 16 bytes; this is stronger than subblock placement since it has a full tag on every "subblock."

thus there are few write misses, the benefit of subblock placement will be reduced.

5.2 Cache and TLB Configurations Simulated

The design space for memory systems is enormous. There are many variables involved, and the dependencies among them are complex. Therefore we could study only a subset of the memory system design space. In this study, we restrict ourselves to features found in currently popular RISC workstations

[Cypress 1990; DEC 1990a; 1990b; Slater 1991]. Table VI summarizes the cache organizations simulated. Table VII lists the memory system organizations of some popular machines. We simulated only separate instruction and data caches (i.e., no unified caches). While many current machines have separate caches (e.g., DECStations, HP 700 series), there are some exceptions (notably SPARCStations). We report data only for write-through caches, but the CPI for write-back caches

can be inferred from the data for write-through caches. While write-through and write-back caches have identical misses, their contribution to the CPI may differ for two reasons. First, a write hit or miss in a write-back cache may take one cycle more than in a write-through cache: a write-back cache must probe the tag before writing to the cache [Jouppi 1993], unlike a write-through cache. It is easy to adjust the data for write-through caches to obtain the data for write-back caches: if the program has w writes and n useful instructions, then the CPI for a write-back cache can be obtained by adding w/n to the CPI of the write-through cache with the same size and configuration. For VLIW, w/n is 0.18. Second, write-back and write-through caches have different write-buffer penalties, since they do writes to main memory at different points and with different frequencies. The write-buffer penalty for write-back caches is likely to be smaller than for write-through caches because writes to main memory are less frequent. We expect the difference to be negligible, since the write-buffer penalty is small even for write-through caches.

We simulated separate instruction and data TLBs as well as unified TLBs, with 1 to 64 entries. The TLBs are fully associative and use an LRU replacement policy. Some machines (such as the HP 9000 series) have unified TLBs; from Section 5.5 it is clear that even small unified TLBs perform well.

Two of the most-important memory system parameters we studied are the write-miss policy (write-allocate versus write-no-allocate) and subblock placement (subblocks versus no subblocks). Of the combinations of these, we did not collect data for the write-no-allocate/subblock combination, since subblock placement offers no improvement over the plain write-no-allocate configuration for cache performance.
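The write-back adjustment is a one-line computation. The sketch below applies it; the base CPI passed in is illustrative, while w/n = 0.18 is the VLIW figure quoted above.

```python
# Estimating write-back CPI from write-through CPI, per the text:
# a write-back cache pays one extra cycle per write to probe the tag,
# so CPI_wb ~= CPI_wt + w/n, with w writes and n useful instructions.

def write_back_cpi(write_through_cpi, writes, useful_instructions):
    return write_through_cpi + writes / useful_instructions

# w/n = 18/100 = 0.18, the ratio given for VLIW; the 1.39 CPI is
# only an example input.
print(round(write_back_cpi(1.39, writes=18, useful_instructions=100), 2))  # 1.57
```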

5.3 Memory System Performance

We present memory system performance data in two kinds of graphs: summary graphs and breakdown graphs. Each summary graph summarizes the memory system performance of one benchmark program for a range of cache sizes (8K to 512K), write-miss policies (write-allocate or write-no-allocate), subblock placement (with or without), and associativity (1 or 2). Each curve in a summary graph corresponds to a different memory system configuration. There are two summary graphs for each program, one for a block size of 16 bytes and another for a block size of 32 bytes. Each breakdown graph breaks down the memory system overhead of one benchmark and memory system organization into its components. The write-buffer depth is fixed at six entries.

In this section we present only the summary graphs for VLIW (Figure 2); Figures 3, 4, and 5 are the breakdown graphs for VLIW for a block size of 16 bytes. The summary graphs for the other programs are given in the Appendix. The breakdown graphs for 32-byte blocks for VLIW are similar and are omitted for conciseness; the breakdown graphs for the other benchmarks are similar (and predictable from the summary graphs) and are thus omitted for the same reason. The full set of graphs is available from the authors.

In the summary graphs, the nops curve is the base CPI: the total number of instructions executed divided by the number of useful (not nop) instructions executed; this corresponds to the CPI for a perfect memory system. For the breakdown graphs, nop is the CPI contribution of nops; read miss is the CPI contribution of read misses; write miss is the CPI contribution of write misses (if any); inst fetch miss is the CPI contribution of instruction fetch misses; write buffer is the CPI contribution of the write buffer; partial word is the CPI contribution of partial-word writes.


Fig. 2. VLIW summary. (a) Block size = 16 bytes; (b) block size = 32 bytes. Curves: write-no-allot/no-subblk, write-allot/subblk, and write-allot/no-subblk, each at assoc = 1 and assoc = 2, plus nops; x-axis: split I and D cache sizes.

Fig. 3. VLIW breakdown, write-no-allocate, no subblock, block size = 16. (a) Assoc = 1; (b) assoc = 2. Components: nop, read miss, inst fetch miss, write buffer, partial word; x-axis: split I and D cache sizes.

ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995.

Fig. 4. VLIW breakdown, write-allocate, subblock, block size = 16. (a) Assoc = 1; (b) assoc = 2. Components: nop, read miss, inst fetch miss, write buffer, partial word; x-axis: split I and D cache sizes.

Fig. 5. VLIW breakdown, write-allocate, no subblock, block size = 16. (a) Assoc = 1; (b) assoc = 2. Components: nop, read miss, inst fetch miss, write buffer, partial word; x-axis: split I and D cache sizes.


The 64K point on the write-allot, subblk, assoc = 1 curves corresponds closely to the DECStation 5000/200 memory system. In the following subsections we describe the impact of write-miss policy and subblock placement, associativity, block size, cache size, and the write-buffer and partial-word write overheads on the memory system performance of the programs.

5.3.1 Write-Miss Policy and Subblock Placement. From the summary graphs, it is clear that the best memory system organization is write-allocate with subblock placement; it substantially outperforms all the other configurations. For a memory system with 64K direct-mapped caches and a block size of 16 bytes (the organization of the DECStation 5000/200), subblock placement reduces the CPI by 0.35 to 0.88, with an arithmetic mean of 0.55.

Surprisingly, the memory system performance of SML/NJ programs on the DECStation 5000/200 memory system organization is good despite the very high write-miss rates: the memory system overhead ranges from 1 to 13% (arithmetic mean 9%) for the 64K caches. This is acceptable; it is similar to the memory system overhead of C and Fortran programs on current machines [Chen and Bershad 1993]. It is worth emphasizing that SML/NJ programs have very high write-miss ratios but low read-miss ratios; the write-miss and read-miss ratios for a 64K cache for VLIW are 0.23 and 0.02 respectively.

Recall that in Section 5.1 we argued that the benefit of subblock placement would decrease for caches large enough to contain the allocation area. The summary graphs indicate that the reduction in benefit is not substantial even for 128K cache sizes; however, the benefit decreases sharply for larger caches for six of the benchmark programs. This suggests that the allocation area size of six of the benchmark programs is between 256K and 512K.

The performance of write-allocate/no subblock is almost identical to that of write-no-allocate/no subblock: the difference between them is so small in most graphs that the two curves overlap (Knuth-Bendix is an exception). This suggests that an address is being read soon after being written; even in an 8K cache, an address is read after being written before it is evicted from the cache (if it was evicted from the cache before being read, then write-allocate/no subblock would have inferior performance). The only difference between these two schemes is when a cache block is read from memory: in one case, it is brought in on a write miss; in the other, it is brought in on a read miss. Because SML/NJ programs allocate sequentially and do few assignments, a newly allocated object remains in the cache until the program has allocated another C bytes, where C is the size of the cache. Since our programs allocate 0.4-0.9 bytes per instruction, our results suggest that the read of a block occurs within 9K-20K instructions of its being written.
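The 9K-20K figure follows from dividing the cache size by the allocation rate; a quick check under the 8K cache assumption used above:

```python
# How soon a newly written block must be read, per the argument above:
# a block written at the allocation frontier stays in the cache until
# the program allocates another C bytes past it.

def instructions_until_eviction(cache_bytes, alloc_bytes_per_instr):
    return cache_bytes / alloc_bytes_per_instr

C = 8 * 1024                 # the 8K cache from the text
for rate in (0.4, 0.9):      # bytes allocated per instruction (measured range)
    print(rate, round(instructions_until_eviction(C, rate)))
```

The two rates bound the window at roughly 20K and 9K instructions respectively, matching the range quoted in the text.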

The benefit of subblock placement is not limited to strongly functional languages such as Standard ML. Jouppi [1993] reports that a 16-byte line with subblock placement combined with an 8K data cache eliminates 31% of the memory references for C programs. Reinhold [1993] finds that the memory performance of Scheme programs is good with subblock placement.

5.3.2 Associativity. Changing associativity from one-way to two-way improves the performance of all the organizations. From Figure 2 we see that increasing associativity reduces the CPI by 0.01 to 0.09, with an arithmetic mean of 0.06. The maximum improvement is obtained for direct-mapped caches with subblock placement. However, increasing associativity may increase the CPU cycle time, and thus the improvements may not be realized in practice [Hill 1988].

From Figures 3, 4, and 5(a) we see that the instruction cache penalty of small caches is substantial for the benchmark programs. This is not surprising: the code consists of frequent calls to small functions, and thus conflict misses can dominate when a direct-mapped instruction cache is not large enough to hold the working set. When two-way set-associative instruction caches are examined, much of this penalty is eliminated; the penalty that remains for caches of 16K or smaller is due to capacity misses.

5.3.3 Block Size. Changing the block size from 16 to 32 bytes has little or no impact on the performance of caches with subblock placement. For write-allocate caches without subblock placement the improvement is much larger: from Figure 2 we see that going from 16-byte to 32-byte blocks reduces the CPI by 0.14 to 0.35, with an arithmetic mean of 0.22. For these organizations a larger block size reduces the write-miss penalty, since the writes have strong spatial locality: allocation is sequential, so doubling the block size can halve the number of write misses [Koopman et al. 1992]. The data reads have less spatial locality than the writes. Increasing the block size also improves the instruction cache performance, since the instructions have strong spatial locality: a larger block size simply lowers the instruction miss rate. Note that subblock placement improves performance more than even two-way associativity and 32-byte blocks combined.

5.3.4 Changing Cache Size. Three distinct regions of performance can be identified for cache sizes. The first region corresponds to the range of cache

ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995.


sizes when the allocation area does not fit in the cache (i.e., allocation happens in an area of memory which is not cache resident). For most of the benchmarks, this region corresponds to cache sizes of less than 256K (for Simple and Knuth-Bendix this region extends beyond 512K). In this region, increasing the cache size uniformly improves performance for all configurations. However, the performance improvement from doubling the cache size is small. From the breakdown graphs we see that in the first region the cache size has little effect on the data cache miss contribution to CPI; most of the improvement in CPI comes from the improved performance of the instruction cache. As with associativity, larger cache sizes have interactions with the cycle time of the CPU: a larger cache can take longer to access. Thus, the small improvements due to increasing the cache size may not be achieved in practice.

The second region ranges from when the allocation area begins to fit in the cache until the allocation area fits in the cache. For most of the benchmarks (once again excepting Simple and Knuth-Bendix), this region corresponds to cache sizes in the range 256K to 512K.3 In this region, increasing the cache size sharply improves the data cache performance of organizations without subblock placement; it has little effect on organizations with subblock placement, which continue to offer good performance, and the instruction cache miss penalty is already low at this point.

The third region corresponds to cache sizes when the allocation area fits in the cache (for five of the benchmarks this region corresponds to caches larger than 512K; for Lexgen, Knuth-Bendix, and Simple it starts at even larger sizes). In this region, increasing the cache size has little or no impact on memory system performance, because everything remains cache resident and thus there are no capacity misses to eliminate.

5.3.5 Write-Buffer and Partial-Word Write Overheads. From the breakdown graphs we see that the write-buffer and partial-word write contributions to the CPI are negligible. A six-deep write buffer coupled with page-mode writes is sufficient to absorb the bursty writes. As expected, memory system features which reduce the number of misses (such as higher associativity and larger cache sizes) also reduce the write-buffer overhead.

5.4 Write-Buffer Depth

In Section 5.3.5 we showed that a six-deep write buffer coupled with page-mode writes was able to absorb the bursty writes in SML/NJ programs. In this section we explore the impact of write-buffer depth on the write-buffer contribution to CPI. Since the speed at which the write buffer can retire writes depends on whether or not the memory system has page-mode writes, we conducted two sets of experiments: one with and the other without page-mode writes. We varied the write-buffer depth from 1 to 6. We conducted this study for two of the larger benchmarks: CW and VLIW. We fixed

3 For Lexgen this region extends a little beyond 512K.

ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995.

Fig. 6. Write-buffer CPI contribution, with page-mode writes. Curves: wb depth = 1, 2, 4, 6; (a) assoc = 1, (b) assoc = 2; x-axis: split I and D cache sizes.

Fig. 7. Write-buffer CPI contribution for VLIW, without page-mode writes. Curves: wb depth = 1, 2, 4, 6; (a) assoc = 1, (b) assoc = 2; x-axis: split I and D cache sizes.

Due to sequential allocation, when a write to initialize a newly allocated object is retired, the next write is likely to be on the same DRAM page as the previous write; it will therefore complete in one cycle. All subsequent writes to initialize the object are also likely to be on the same DRAM page, since objects are allocated one after another; they will find an empty write buffer, since they also complete in one cycle due to page-mode writes. In the best case, with no read misses and refreshes, a write-buffer-full delay will happen only once per N words of allocation, where N is the size of the DRAM page.

Thus, the write-buffer depth has little effect on the performance of SML/NJ programs if the memory system has page-mode writes. To confirm this explanation, we measured the probability of two consecutive writes being on the same DRAM page. This probability, averaged over the benchmarks, was 96%.

The small impact of write-buffer depth on performance does not imply that a write buffer is useless if the memory system has page-mode writes. Instead, it says that a deep write buffer offers little performance improvement in a memory system with page-mode writes if the programs have strong spatial locality in the writes and if the majority of the reads and instruction fetches hit in the cache. Strong spatial locality means that the probability that two consecutive writes are to the same DRAM page is high.
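The best case described above, one write-buffer-full delay per DRAM page of allocation, amounts to the following; the 4K page (1024 four-byte words) matches the DRAM page size used in the simulations.

```python
# Best-case write-buffer behavior with page-mode writes, per the text:
# sequential allocation keeps consecutive initializing writes on the same
# DRAM page, so each retires in one cycle and the buffer stays empty.

def full_delays_per_word(dram_page_words):
    """Write-buffer-full delays per allocated word, assuming no read
    misses or refreshes: one delay per DRAM page of allocation."""
    return 1 / dram_page_words

page_words = 4 * 1024 // 4   # 4K DRAM page = 1024 four-byte words
print(full_delays_per_word(page_words))
```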

Write-buffer depth is important, however, if the memory system does not have page-mode writes (Figure 7). In this case, a six-deep write buffer performs much better than a one-deep write buffer. Note that Figures 6 and 7 have different scales.

Performance of Programs with Intensive Heap Allocation · 265

Fig. 8. TLB contribution to CPI. x-axis: number of TLB entries (2 to 64).

5.5 TLB Performance

TLB Performance

Figure

8 gives

the

TLB

miss

contribution

each

benchmark

program. We see that the CPI contribution of TL13 misses falls below 0.01 for all our programs for a 64-entry unified TLB; for half the benchmarks, it is below 5.6

0.01 even for a 32-entry

TLB.

Validation

To validate DECStation

our simulations, 5000/200

we ran

(running

each of the benchmarks

Mach

2.6)

and

measured

five the

times

elapsed

on a user

for each run. The programs were run on a lightly loaded machine but placenot in single-user mode. The simulations with write-allocate/subblock direct-mapped caches, 16-byte blocks, and 64-entry TLB correment, 64K spond closely to the DE Citation 5000/200 with the following three importime

tant and

differences. First, the simulations ignored the effects of context switches address = system calls. Second, the simulations assumed a virtual

physical

the

simulations

mapping which can have many mapping used in Mach 2.6 [Kessler

address

random

assume

that

all instructions

take

fewer conflict misses than and Hill 1992]. Third, the

exactly

one cycle (plus

memory

system overhead). In order to minimize the memory system effects of the virtual-to-physical CPI of the five runs for mapping and context switches, we took the minimum each program and compared it to the CPI obtained via simulations. We Measured (see) is the user time of the present our findings in Table VIII; ACM

Transactions

on Computer

Systems,

Vol. 13, No. 3, August

1995.

Table VIII. Measured versus Simulated CPI

Program        Measured (sec)   Measured CPI   Simulated CPI   Multicycle CPI   Discrepancy (%)
CW             25.83            1.42           1.39            0.00             2.48
Knuth-Bendix   14.95            1.27           1.21            0.00             5.22
Lexgen         16.13            1.40           1.31            0.00             6.29
Life           17.16            1.23           1.21            0.00             1.19
PIA             6.41            1.43           1.18            *                17.62
Simple         29.81            1.33           1.21            *                9.03
VLIW           25.61            1.76           1.39            0.20             9.66
YACC            6.58            1.39           1.36            0.00             2.20

* Cannot be accurately determined.

program in seconds; Measured CPI is the CPI obtained from the run; Simulated CPI is the CPI obtained from the simulations; Multicycle CPI is the CPI overhead of multicycle instructions (from FP unit and/or CPU pipelines), computed without simulating them, when it could be accurately computed; Discrepancy is the discrepancy between the simulated CPI plus the multicycle CPI and the measured CPI, as a percentage of measured CPI.

Table VIII shows that, with the exception of PIA, the discrepancy is less than 10% and that the actual runs validate the simulations. The discrepancy in PIA is due to multicycle instructions. Since multicycle instructions do not cause a stall until their result is used, their cost can usually be determined accurately only by simulations. We were able to determine the overhead of multicycle instructions for VLIW since the results of most multicycle instructions are used immediately afterward. In the case of PIA, multicycle instructions comprise 4.8% of the total instructions executed, and the distance between multicycle instructions and their use varies considerably. However, even if each multicycle instruction stalls the CPU for half its maximum latency, the discrepancy falls well below 10%. Thus, multicycle instructions can explain the discrepancy for PIA.

5.7 Extending the Results

Section 5.3 demonstrated that heap allocation can have a significant memory system cost if new objects cannot be allocated directly into the cache. In this section, we present an analytic model which predicts the memory system cost due to heap allocation when this is the case. The model formalizes the intuition presented in Section 5.1 and predicts the memory system cost due to heap allocation when block sizes, miss penalties, or heap allocation rates change. We use the model to speculate about the memory system cost of heap allocation for caches without subblock placement if SML/NJ were to use a simple stack.

intuition presented in Section 5.1 and predicts the memorys ystem cost due to or heap allocation rates heap allocation when block sizes, miss penalties, change. We use the model to speculate about the memory system cost of heap allocation for caches without subblock placement if SML/NJ were to use a simple stack. 5.7.1

An

collection time and allocation initialized, ACM

Analytic

Model.

Recall

that

heap

allocation

with

copying

garbage

allocates memory which typically has not been touched in a long is unlikely to be in the cache. This is especially true when the area does not fit in the cache. When newly allocated memory is write misses occur. The rate of write misses depends on the

Transactions

on Computer

Systems,

Vol. 13, No. 3, August

1995.

Performance Table Cache

IX.

Percentage

Size

of Programs with Intensive Heap Allocation

Difference

between

Write-No-Allot/No

(Kilobytes)

Analytical

Model

Subblock

and Simulations

Write-Allot/No

Subblock (%)

(%)

8K

7.12

2.4

16K

6.84

2.2

7.02

2.2

32K 64K

10.8

5.7

128K

31.8

23.5

256K

128.8

111.4

512K

1847’.7

1746.2

allocation rate (a words/instruction) and the block size (b words). Given the rate of write misses, we can calculate the memory system cost, C, due to heap allocation. The read and write miss penalties are rp and wp, respectively. Under the assumption that the allocation area does not fit in the cache, i.e., initializing writes miss, we have

    c_write-no-alloc = wp * a/b.

Under the additional assumption that programs touch data soon after it is allocated, we have

    c_write-alloc = rp * a/b.

Since the benchmarks do few assignments, c_write should account for the difference in CPIs when the write-miss policy is varied. Hence, C is c_write for the corresponding write-miss policy. Table IX shows the difference between the predicted cost, C, and the actual (arithmetic mean) cost, measured as the difference in CPIs between each no-subblock organization and the write-allocate/subblock organization (e.g., CPI_{write-no-alloc/no-subblock} - CPI_{write-alloc/subblock}), as a percentage of the actual cost. The values used to calculate Table IX were: b = 4, rp = 15, and wp = 15.
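To make the arithmetic concrete, the model is a one-liner; the following sketch is our illustration (not part of the original study) and evaluates C for the parameters used in Table IX:

```python
# Sketch of the Section 5.7.1 model: predicted memory system cost of heap
# allocation, C = penalty * a / b (cycles/instruction), where a is the
# allocation rate in words/instruction and b is the block size in words.

def model_cost(a, penalty=15.0, b=4):
    """C for one write-miss policy; pass penalty=wp or penalty=rp."""
    return penalty * a / b

# With the Table IX parameters (b = 4, wp = rp = 15), a program allocating
# 0.16 words/instruction is predicted to pay 0.6 cycles/instruction.
print(model_cost(0.16))
```

Because wp = rp in the simulations, the same function serves for both write-miss policies.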

The model is accurate for cache sizes of 128K or less, when the allocation area does not fit in the cache. As expected, the model is inaccurate when the allocation area fits in the cache: the percentage difference heads toward infinity as the benefit of subblock placement becomes negligible. Thus, this model can be used to predict the memory system cost of heap allocation for small cache sizes.

5.7.2 SML/NJ with a Stack. We can speculate about the memory system cost of heap allocation in SML/NJ when a stack is used by applying this model. In the absence of first-class continuations, which the benchmarks do not use, callee-save continuations can be stack-allocated easily. The callee-save continuations correspond to procedure activation records. The first two columns of Table X give the rate of heap allocation with and without heap allocation of callee-save continuations.

268 · Amer Diwan et al.

Table X. Expected Memory System Cost of Heap Allocation for Caches without Subblock Placement (assuming procedure activation records are stack-allocated in SML/NJ)

Program        Allocation Rate Including      Allocation Rate Excluding      C
               Callee-Save Conts.             Callee-Save Conts.             (Cycles/Instruction)
               (Words/Useful Instruction)     (Words/Useful Instruction)
CW             0.12                           0.04                           0.15
Knuth-Bendix   0.23                           0.12                           0.44
Lexgen         0.11                           0.03                           0.12
Life           0.11                           0.02                           0.09
PIA            0.17                           0.13                           0.47
Simple         0.14                           0.05                           0.17
VLIW           0.16                           0.06                           0.23
YACC           0.14                           0.07                           0.24
Median         0.14                           0.05                           0.20
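If, as the surrounding text indicates, C is the model cost wp * a/b applied to the rate excluding callee-save continuations (wp = 15 cycles, b = 4 words), the C column can be checked numerically. This sketch is our own and uses the rounded published rates:

```python
# Our check of Table X's C column against C = wp * a / b, using the
# published (rounded) allocation rates excluding callee-save continuations.
WP, B = 15, 4  # write-miss penalty (cycles), block size (words)
# (program, rate excluding callee-save conts., C from Table X)
rows = [("CW", 0.04, 0.15), ("Knuth-Bendix", 0.12, 0.44),
        ("Lexgen", 0.03, 0.12), ("Life", 0.02, 0.09),
        ("PIA", 0.13, 0.47), ("Simple", 0.05, 0.17),
        ("VLIW", 0.06, 0.23), ("YACC", 0.07, 0.24)]
for name, a, c_table in rows:
    c_model = WP * a / B
    # the published rates are rounded to two digits, so agreement is
    # only to about 0.03 cycles/instruction
    assert abs(c_model - c_table) < 0.03, name
    print(f"{name:12s} model {c_model:.2f}  table {c_table:.2f}")
```

The small residuals (e.g., 0.45 versus 0.44 for Knuth-Bendix) come from the rounding of the published allocation rates, not from the model itself.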

Assuming only continuations are stack-allocated, column 3 of Table X presents an estimate of the memory system cost of heap allocation for caches that do not have subblock placement and are too small to hold the allocation area. The block size is 16 bytes, the read-miss penalty is 15 cycles, and the write-miss penalty is 15 cycles. Since the read and write miss penalties are the same, C is the same for the write-allocate and write-no-allocate organizations. This is probably an upper bound on the expected memory system costs due to heap allocation with a stack, because it may be possible to stack-allocate additional objects [Kranz et al. 1986]. We see that even with a simple stack, the memory system cost of heap allocation for SML/NJ will be significant for no-subblock caches.

6. FUTURE WORK

We suggest three directions in which this study can be extended:

—measuring the impact of other architectural features not explored in this work,
—measuring the impact of different compilation techniques, and
—measuring other aspects of programs.

Regarding architectural features, there is a need to explore the memory system performance of heap allocation on newer machines. As CPUs get faster relative to main memory, memory system performance becomes even more crucial to good performance. To address the increasing discrepancy between CPU speeds and main-memory speeds, newer machines, such as Alpha workstations [DEC 1992], often have features such as secondary caches, stream buffers, and register scoreboarding. The interaction of these features with heap allocation needs to be explored.

Regarding different compilation techniques, the impact of stack allocation is worth measuring. A stack reduces heap allocation (which performs badly on most memory system organizations) in favor of stack allocation (which can

have good cache locality, since it focuses most of the references to a small part of memory, namely the top of the stack). For SML/NJ programs, the majority of heap-allocated objects can be allocated on the stack (Table IV). Therefore stack allocation can substantially improve the performance of SML/NJ programs on memory organizations without subblock placement or with small cache sizes. However, for programs with first-class continuations, exceptions, and threads, doing stack allocation can slow down programs. A careful study is needed to evaluate the pros and cons of stack allocation.

Regarding measuring other aspects of programs, several areas seem promising for future work:

(1) Measuring the impact of different garbage collection algorithms on cache performance.

(2) Measuring the impact of changing various garbage collector parameters (such as allocation area size) on cache performance.

(3) Measuring the cost of various operations related to garbage collection: tagging, store checks, and garbage collection checks. A preliminary study of this is reported in Tarditi and Diwan [1993].

(4) Measuring the impact of optimizations on cache performance.

7. CONCLUSIONS

We have studied the memory system performance of heap allocation with copying garbage collection, a general automatic storage management technique for programming languages. Heap allocation is useful for implementing programming language features where objects may have indefinite extent, such as list processing, higher-order functions, and first-class continuations. However, heap allocation is widely believed to lead to poor memory system performance [Peng and Sohi 1989; Wilson et al. 1990; 1992; Zorn 1991]. This belief is based on the high (write) miss ratios that occur when new objects are allocated and initialized.

We studied the memory system performance of mostly functional SML programs compiled with the SML/NJ compiler. These programs allocate at intensive rates. They use heap-only allocation: all allocation, including activation records, is done on the heap. We simulated a wide variety of memory systems typical of current workstations. To our surprise, we found that heap allocation performed well on some memory systems. In particular, on an actual machine (the DECStation 5000/200), the memory system performance of heap allocation was good. However, heap allocation performed poorly on most memory system organizations. The memory system property crucial for achieving good performance was the ability to allocate and initialize a new object into the cache without a penalty. This can be achieved by having subblock placement or a cache large enough

to hold the allocation area, along with fast page-mode writes or a sufficiently deep write buffer. We found that for caches with subblock placement, the arithmetic mean of the data cache penalty was under 9% for 64K or larger caches; for caches without

Table XI. CPI with a Perfect Memory System

Program        CPI
CW             1.15
Knuth-Bendix   1.06
Lexgen         1.14
Life           1.18
PIA            1.09
Simple         1.08
VLIW           1.10
YACC           1.13

subblock placement, the mean of the data cache penalty was often higher than 50%. We also found that a cache size of 512K allowed the allocation area for six of the benchmark programs to fit in the cache, which substantially improved the performance of cache organizations without subblock placement.

placement. The

implications

achieve heap

of these

good memory

allocation

performance,

penalty

in place

guages with

Heap

of stack

technique.

which

small

make

primary

are clear.

performance.

of procedure

system

management

results

system

activation allocation

allocation, Second,

heavy caches

records can

needed

memory

can also have without

good memory a performance

can better allocation

placement

to

system,

it is a more-general

storage

subblock

is not

right

architects

of dynamic

by using

a stack the

be used

even though computer

use

First,

Given

with

storage

support

lan-

on machines a subblock

size

of one word.

APPENDIX. SUMMARY TABLES

Table XI gives the CPI of the benchmark programs for a perfect memory system. Table XII gives the CPI for each of the benchmark programs for the different memory organizations. The following abbreviations are used in the table:

wa/wn: write allocate/write no allocate
s/ns: subblock placement/no subblock placement
4/8: Block size = 4 words/Block size = 8 words
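As an example of reading these tables, the overall CPI penalty relative to a perfect memory system is CPI/CPI_perfect - 1. The sketch below is ours; note that this overall figure includes instruction-fetch and write-buffer penalties, so it is larger than the data-cache-only penalties quoted in Section 7.

```python
# Our sketch: overall memory system penalty for the CW benchmark at 64K,
# associativity 1, computed from Tables XI and XII as CPI/perfect - 1.
perfect = 1.15                       # Table XI, CW
cpi_64k = {"wn,ns,4": 1.73, "wa,ns,4": 1.74, "wa,s,4": 1.39}  # Table XII
for cfg, c in cpi_64k.items():
    print(cfg, f"{c / perfect - 1:.0%}")
# prints 50% and 51% for the no-subblock organizations and 21% for the
# subblock organization
```

Even this coarse reading shows the pattern of the conclusions: the subblock organization pays a far smaller penalty than either no-subblock organization at the same cache size.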

ACKNOWLEDGMENTS

We thank Peter Lee for his encouragement and advice during this work. We thank E. Biagioni, B. Chen, O. Danvy, A. Forin, U. Hoelzle, K. McKinley, E. Nahum, D. Stefanović, and D. Stodolsky for comments on drafts of this article, B. Milnes and T. Dewey for their help, J. Larus for creating QPT, M. Hill for creating Tycho, and A. W. Appel, D. MacQueen, and many others for creating SML/NJ.

Table XII. Cycles per Useful Instruction

                        Associativity = 1                                Associativity = 2
Config       8K    16K    32K    64K   128K   256K   512K     8K    16K    32K    64K   128K   256K   512K

CW
wn,ns,4    2.41   2.07   1.88   1.73   1.43   1.30   1.18   2.22   1.96   1.78   1.62   1.42   1.30   1.17
wa,ns,4    2.44   2.09   1.90   1.74   1.44   1.31   1.18   2.24   1.98   1.79   1.63   1.43   1.31   1.17
wa,s,4     1.96   1.62   1.44   1.39   1.24   1.20   1.17   1.77   1.50   1.33   1.25   1.21   1.18   1.16
wn,ns,8    2.18   1.89   1.72   1.60   1.37   1.27   1.18   1.98   1.76   1.62   1.50   1.35   1.26   1.17
wa,ns,8    2.19   1.89   1.72   1.60   1.37   1.27   1.18   1.98   1.76   1.61   1.49   1.35   1.26   1.17
wa,s,8     1.88   1.59   1.43   1.38   1.23   1.20   1.17   1.68   1.46   1.32   1.25   1.21   1.18   1.16

Leroy
wn,ns,4    2.32   2.18   1.96   1.87   1.79   1.65   1.37   2.17   2.03   1.89   1.82   1.76   1.65   1.40
wa,ns,4    2.53   2.40   2.18   2.08   1.98   1.83   1.50   2.38   2.25   2.09   2.00   1.95   1.83   1.57
wa,s,4     1.66   1.52   1.30   1.20   1.15   1.12   1.09   1.51   1.37   1.21   1.12   1.10   1.09   1.08
wn,ns,8    2.05   1.93   1.75   1.67   1.60   1.50   1.30   1.92   1.81   1.68   1.61   1.57   1.50   1.33
wa,ns,8    2.12   2.00   1.82   1.73   1.66   1.56   1.35   1.99   1.87   1.74   1.67   1.63   1.55   1.38
wa,s,8     1.56   1.44   1.26   1.18   1.13   1.11   1.09   1.43   1.32   1.19   1.11   1.09   1.08   1.07

Lexgen
wn,ns,4    3.59   1.94   1.84   1.72   1.65   1.50   1.31   2.04   1.85   1.70   1.64   1.60   1.50   1.34
wa,ns,4    3.61   1.96   1.85   1.74   1.67   1.51   1.31   2.06   1.87   1.71   1.66   1.61   1.51   1.35
wa,s,4     3.17   1.53   1.42   1.31   1.27   1.21   1.19   1.62   1.43   1.28   1.22   1.20   1.19   1.18
wn,ns,8    2.96   1.78   1.67   1.57   1.51   1.39   1.27   1.86   1.71   1.56   1.49   1.45   1.38   1.28
wa,ns,8    2.97   1.79   1.68   1.57   1.51   1.40   1.27   1.87   1.71   1.57   1.50   1.46   1.39   1.28
wa,s,8     2.70   1.51   1.41   1.30   1.26   1.21   1.19   1.60   1.44   1.30   1.22   1.20   1.18   1.18

Life
wn,ns,4    1.79   1.70   1.65   1.62   1.61   1.42   1.20   1.77   1.64   1.61   1.60   1.60   1.48   1.19
wa,ns,4    1.80   1.70   1.65   1.62   1.60   1.42   1.20   1.74   1.64   1.60   1.60   1.59   1.48   1.19
wa,s,4     1.39   1.29   1.24   1.21   1.20   1.19   1.19   1.33   1.23   1.20   1.19   1.19   1.19   1.18
wn,ns,8    1.65   1.55   1.49   1.47   1.46   1.34   1.19   1.57   1.54   1.46   1.45   1.44   1.37   1.19
wa,ns,8    1.65   1.55   1.50   1.47   1.45   1.34   1.19   1.57   1.51   1.45   1.45   1.44   1.37   1.19
wa,s,8     1.39   1.29   1.24   1.21   1.20   1.19   1.19   1.31   1.25   1.20   1.19   1.19   1.18   1.18

PIA
wn,ns,4    2.23   1.93   1.80   1.75   1.62   1.34   1.12   2.23   1.83   1.73   1.72   1.63   1.36   1.12
wa,ns,4    2.27   1.96   1.82   1.77   1.65   1.36   1.12   2.26   1.85   1.76   1.74   1.66   1.39   1.12
wa,s,4     1.66   1.36   1.22   1.18   1.15   1.13   1.10   1.66   1.25   1.15   1.14   1.12   1.11   1.10
wn,ns,8    1.92   1.70   1.59   1.55   1.46   1.26   1.11   1.87   1.60   1.53   1.51   1.45   1.28   1.11
wa,ns,8    1.95   1.72   1.60   1.56   1.47   1.27   1.11   1.89   1.61   1.54   1.52   1.47   1.30   1.11
wa,s,8     1.55   1.32   1.21   1.17   1.14   1.12   1.10   1.49   1.22   1.14   1.13   1.11   1.10   1.10

Simple
wn,ns,4    2.32   2.02   1.79   1.73   1.70   1.68   1.62   1.98   1.74   1.70   1.70   1.68   1.66   1.63
wa,ns,4    2.35   2.05   1.81   1.75   1.72   1.70   1.64   2.01   1.76   1.72   1.72   1.69   1.68   1.65
wa,s,4     1.80   1.50   1.26   1.21   1.19   1.18   1.15   1.46   1.21   1.18   1.17   1.16   1.15   1.14
wn,ns,8    2.03   1.79   1.57   1.51   1.49   1.47   1.43   1.72   1.52   1.49   1.48   1.47   1.46   1.44
wa,ns,8    2.05   1.80   1.58   1.53   1.50   1.48   1.44   1.74   1.54   1.50   1.49   1.48   1.47   1.44
wa,s,8     1.70   1.45   1.23   1.18   1.16   1.15   1.13   1.39   1.19   1.15   1.15   1.14   1.13   1.13

VLIW
wn,ns,4    3.26   2.73   2.30   1.98   1.79   1.35   1.15   3.06   2.64   2.19   1.88   1.69   1.37   1.14
wa,ns,4    3.29   2.75   2.32   2.00   1.82   1.36   1.15   3.08   2.66   2.21   1.91   1.72   1.40   1.14
wa,s,4     2.67   2.13   1.70   1.39   1.31   1.17   1.13   2.47   2.04   1.59   1.29   1.16   1.14   1.12
wn,ns,8    2.62   2.25   1.93   1.70   1.57   1.28   1.14   2.48   2.16   1.85   1.63   1.50   1.30   1.13
wa,ns,8    2.63   2.26   1.94   1.70   1.58   1.29   1.14   2.48   2.16   1.85   1.64   1.51   1.31   1.13
wa,s,8     2.23   1.86   1.55   1.31   1.25   1.16   1.12   2.09   1.76   1.46   1.25   1.16   1.13   1.12

YACC
wn,ns,4    2.38   2.16   1.99   1.90   1.83   1.64   1.33   2.13   1.99   1.89   1.84   1.79   1.65   1.33
wa,ns,4    2.42   2.20   2.02   1.92   1.86   1.67   1.34   2.16   2.02   1.92   1.86   1.82   1.68   1.35
wa,s,4     1.85   1.63   1.45   1.35   1.32   1.27   1.21   1.59   1.45   1.35   1.29   1.26   1.24   1.20
wn,ns,8    2.11   1.90   1.75   1.67   1.62   1.49   1.28   1.87   1.75   1.67   1.62   1.58   1.49   1.28
wa,ns,8    2.13   1.92   1.77   1.68   1.63   1.50   1.29   1.89   1.76   1.68   1.63   1.59   1.50   1.29
wa,s,8     1.76   1.55   1.40   1.32   1.29   1.25   1.20   1.52   1.40   1.32   1.27   1.24   1.22   1.19

REFERENCES

APPEL, A. W. 1987. Garbage collection can be faster than stack allocation. Inf. Process. Lett. 25, 4, 275-279.
APPEL, A. W. 1989. Simple generational garbage collection and fast allocation. Softw. Pract. Exper. 19, 2 (Feb.), 171-184.
APPEL, A. W. 1990. A runtime system. Lisp Symb. Comput. 3, 4 (Nov.), 343-380.
APPEL, A. W. 1992. Compiling with Continuations. Cambridge University Press, Cambridge.
APPEL, A. W. AND JIM, T. Y. 1989. Continuation-passing, closure-passing style. In Proceedings of the 16th Annual ACM Symposium on Principles of Programming Languages (Austin, Tex.). ACM, New York, 293-302.
APPEL, A. W., MATTSON, J. S., AND TARDITI, D. 1989. A lexical analyzer generator for Standard ML. User manual distributed with Standard ML of New Jersey.
BALL, T. AND LARUS, J. R. 1992. Optimally profiling and tracing programs. In the 19th Symposium on Principles of Programming Languages. ACM, New York.
CHEN, J. B. AND BERSHAD, B. N. 1993. The impact of operating system structure on memory system performance. In the 14th Symposium on Operating Systems Principles. ACM, New York.
CHENEY, C. 1970. A nonrecursive list compacting algorithm. Commun. ACM 13, 11 (Nov.), 677-678.
CLEAVELAND, R., PARROW, J., AND STEFFEN, B. 1993. The Concurrency Workbench: A semantics-based tool for the verification of concurrent systems. ACM Trans. Program. Lang. Syst. 15, 1 (Jan.), 36-72.
CROWLEY, W. P., HENDRICKSON, C. P., AND RUDY, T. E. 1978. The SIMPLE code. Tech. Rep. UCID 17715, Lawrence Livermore Laboratory, Livermore, Calif. Feb.
CYPRESS. 1990. SPARC RISC User's Guide, 2nd ed. Cypress Semiconductor, San Jose, Calif.
DEC. 1990a. DECStation 3100 Desktop Workstation Functional Specification. Digital Equipment Corp., Palo Alto, Calif.
DEC. 1990b. KN02 System Module Functional Specification. Digital Equipment Corp., Palo Alto, Calif.
DEC. 1992. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1st ed. Digital Equipment Corp., Maynard, Mass.
DYBVIG, R. K., HIEB, R., AND BRUGGEMAN, C. 1990. Representing control in the presence of first-class continuations. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation (White Plains, N.Y.). ACM, New York, 66-77.
EKANADHAM, K. AND ARVIND. 1987. SIMPLE: An exercise in future scientific programming. Tech. Rep. Computation Structures Group Memo 273, MIT, Cambridge, Mass. July. Also available as IBM/T. J. Watson Research Center Rep. 12686.
FENICHEL, R. AND YOCHELSON, C. 1969. A LISP garbage-collector for virtual-memory computer systems. Commun. ACM 12, 11 (Nov.), 611-612.
HILL, M. D. 1988. A case for direct-mapped caches. IEEE Computer 21, 12 (Dec.), 25-40.
HILL, M. AND SMITH, A. 1989. Evaluating associativity in CPU caches. IEEE Trans. Comput. 38, 12 (Dec.), 1612-1630.
JOUPPI, N. P. 1993. Cache write policies and performance. In Proceedings of the 20th Annual International Symposium on Computer Architecture (San Diego, Calif.). ACM Press, New York, 191-201.
KANE, G. AND HEINRICH, J. 1992. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, N.J.
KESSLER, R. E. AND HILL, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10, 4 (Nov.), 338-359.
KOOPMAN, P. J., JR., LEE, P., AND SIEWIOREK, D. P. 1992. Cache behavior of combinator graph reduction. ACM Trans. Program. Lang. Syst. 14, 2 (Apr.), 265-277.
KRANZ, D., KELSEY, R., REES, J., HUDAK, P., PHILBIN, J., AND ADAMS, N. 1986. ORBIT: An optimizing compiler for Scheme. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction (Palo Alto, Calif.). ACM, New York, 219-233.
LARUS, J. R. 1990. Abstract execution: A technique for efficiently tracing programs. Softw. Pract. Exper. 20, 12 (Dec.), 1241-1258.
LARUS, J. R. AND BALL, T. 1992. Rewriting executable files to measure program behavior. Tech. Rep. 1083, Computer Sciences Dept., Univ. of Wisconsin-Madison, Madison, Wisc. Mar.
MILNER, R., TOFTE, M., AND HARPER, R. 1990. The Definition of Standard ML. MIT Press, Cambridge, Mass.
PATTERSON, D. A. AND HENNESSY, J. L. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, Calif.
PENG, C.-J. AND SOHI, G. S. 1989. Cache memory design considerations to support languages with dynamic heap allocation. Tech. Rep. 860, Computer Sciences Dept., Univ. of Wisconsin-Madison, Madison, Wisc.
PRZYBYLSKI, S. 1990. Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan Kaufmann, San Mateo, Calif.
READE, C. 1989. Elements of Functional Programming. Addison-Wesley, Reading, Mass.
REINHOLD, M. B. 1993. Cache performance of garbage-collected programming languages. Ph.D. thesis, Laboratory for Computer Science, MIT, Cambridge, Mass.
SLATER, M. 1991. PA workstations set price/performance records. Microprocess. Rep. 5, 6 (Apr.).
TARDITI, D. AND APPEL, A. W. 1990. ML-YACC, version 2.0. Distributed with Standard ML of New Jersey.
TARDITI, D. AND DIWAN, A. 1993. The full cost of a generational copying garbage collection implementation. In OOPSLA '93 Workshop on Memory Management and Garbage Collection.
WAUGH, K., MCANDREW, P., AND MICHAELSON, G. 1990. Parallel implementations from function prototypes: A case study. Tech. Rep. Computer Science 90/4, Heriot-Watt Univ., Edinburgh, U.K. July.
WILSON, P. R., LAM, M. S., AND MOHER, T. G. 1990. Caching considerations for generational garbage collection: A case for large and set-associative caches. Tech. Rep. EECS-90-5, Univ. of Illinois at Chicago, Chicago, Ill. Dec.
WILSON, P. R., LAM, M. S., AND MOHER, T. G. 1992. Caching considerations for generational garbage collection. In Conference on Lisp and Functional Programming (San Francisco, Calif.). ACM, New York, 32-42.
ZORN, B. 1991. The effect of garbage collection on cache performance. Tech. Rep. CU-CS-528-91, Univ. of Colorado at Boulder, Boulder, Colo. May.

Received December 1993; revised January 1995; accepted May 1995
