Supporting Dynamic Data Structures on Distributed-Memory Machines

ANNE ROGERS and MARTIN C. CARLISLE
Princeton University
JOHN H. REPPY
AT&T Bell Laboratories
and
LAURIE J. HENDREN
McGill University
Compiling for distributed-memory machines has been a very active research area in recent years. Much of this work has concentrated on programs that use arrays as their primary data structures. To date, little work has been done to address the problem of supporting programs that use recursive, pointer-based dynamic data structures. The techniques developed for array-based programs rely on the fact that arrays are statically defined and directly addressable; dynamic data structures do not have these properties, so new techniques must be developed. In this article, we describe an execution model for supporting programs that use dynamic data structures. This model uses a simple mechanism for migrating a thread of control based on the layout of heap-allocated data and introduces parallelism using a technique based on futures and lazy task creation. We intend to exploit this execution model with compiler analyses and automatic parallelization techniques. We have implemented a prototype system, called Olden, that runs on the Intel iPSC/860 and the Thinking Machines CM-5. We discuss our implementation and report on experiments with five benchmarks.

Categories and Subject Descriptors: D.1.3 [Software]: Concurrent Programming—parallel programming; D.3.4 [Software]: Processors—compilers; run-time environments

General Terms: Languages, Performance

Additional Key Words and Phrases: Dynamic data structures, futures, lazy task creation, SPMD execution, thread migration

This work was supported, in part, by NSF Grant ASC-9110766. M. C. Carlisle was supported by a National Science Foundation Graduate Fellowship and the Fannie and John Hertz Foundation. L. J. Hendren was supported, in part, by FCAR, NSERC, and the McGill Faculty of Graduate Studies and Research.

Authors' addresses: A. Rogers and M. C. Carlisle, Department of Computer Science, Princeton University, Princeton, NJ 08540; email: {amr; mcc}@cs.princeton.edu; J. H. Reppy, AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974; email: [email protected]; L. J. Hendren, School of Computer Science, McGill University, Montreal, Que., Canada, H3A 2A7; email: [email protected].

Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

(c) 1995 ACM 0164-0925/95/0300-0233 $03.50

ACM Transactions on Programming Languages and Systems, Vol. 17, No. 2, March 1995, Pages 233-263.
1. INTRODUCTION
Compiling for distributed-memory machines has been a very active area of research in recent years.1 Much of this work has concentrated on scientific programs that use arrays as their primary data structure and loops as their primary control structure. Such programs tend to have the property that the arrays can be partitioned into relatively independent pieces, and therefore the operations performed on these pieces can proceed in parallel. It is this property of scientific programs that has led to impressive results in the development of vectorizing and parallelizing compilers [Allen and Kennedy 1987; Allen et al. 1988; Wolfe 1989]. More recently, this property has been exploited by researchers investigating methods for automatically generating parallel programs for SPMD (Single-Program, Multiple-Data) execution on distributed-memory machines. In this article, we address the problem of supporting programs that operate on recursively defined dynamic data structures.

Such programs typically use list-like or tree-like dynamic data structures and have recursive procedures as their primary control structure. Before we examine why it is plausible to use the same techniques to generate SPMD programs for recursive programs that use dynamic data structures, let us first review the most important properties of scientific programs that use arrays and loops, and then point out the fundamental problems that prevent us from applying the same compilation techniques to pointer-based dynamic data structures.

From a compilation standpoint, the most important property of a distributed-memory machine is that each processor has its own address space; remote references are satisfied explicitly through passed messages, which are expensive. Therefore, arranging a computation so that most references are local is crucial to efficient execution. The aforementioned properties of scientific programs make them ideal applications for distributed-memory machines. Each group of related data can be placed on a separate processor, which allows operations on independent groups to be done in parallel with
little interprocessor communication.

The key insight of the most popular method for automatically parallelizing programs for distributed-memory machines is that the layout of a program's data should determine how the work in the program is assigned to processors. Typically, the programmer specifies a mapping of the program's data onto the underlying machine, and the compiler uses this mapping to decompose the program's computation into processes. The simplest and most popular policy for assigning work to processors is called the owner-computes rule: the processor that "owns" the target of an assignment statement (v = e), that is, the owner of v, is responsible for computing the value of e. Control statements, such as conditionals and loops, are executed by all processors. When the mapping of a particular reference cannot be determined statically, the compiler inserts code that computes it at run-time, a technique called run-time resolution, and it inserts code for interprocessor communication as needed. The code produced by this strategy can sometimes be improved substantially by restructuring the program using the arsenal of dependence analysis and loop restructuring techniques developed for vectorizing compilers [Allen and Kennedy 1987; Wolfe 1989].

1For example, see Amarasinghe and Lam [1993], Anderson and Lam [1993], Callahan and Kennedy [1988], Gerndt [1990], Hiranandani et al. [1991], Koelbel [1990], Rogers and Pingali [1989], and Zima et al. [1988].
Run-time resolution is applied to global array references: the processor responsible for a reference is determined at run-time from the element's name, that is, its index. The techniques used for scientific programs rely on the nice mathematical properties of array names. The name of every element of an array is known at compile-time, and a mapping function from names to processors can be used, so every processor knows, without communication, which processor owns a given element. These properties are also used to test the independence of array references, which determines which operations can be done in parallel.

Now, let us return to the problem of supporting programs that use dynamic data structures. Such programs share a nice property with scientific programs: their data structures can often be recursively partitioned into smaller, independent pieces. For example, a tree can be partitioned into its subtrees, and a list into its head and its tail. Independent tasks operating on these independent pieces can therefore proceed in parallel. This suggests that an approach similar to the one used for scientific programs might be used for dynamic data structures. The problem is that the nodes of a dynamic data structure do not have names with nice mathematical properties, so the techniques used for mapping array elements onto processors and for determining the independence of array references cannot be applied directly. First, a mechanism for naming the nodes of a dynamic data structure must be introduced before a mapping onto the processors can even be specified. Second, because the structures are dynamic in nature, the set of names cannot be determined at compile-time. Finally, determining the independence of operations on a dynamic data structure is substantially harder than it is for arrays.

A recent paper by Gupta [1992] addresses the naming problem by assigning every node of a dynamic data structure a global name that is determined by the node's position in the structure; for example, a breadth-first numbering can be used as a naming scheme for the nodes of a binary tree. As a node is added to a structure, it is registered with all the processors, and its name can then be used, as an array index is, to map the node onto a processor without run-time communication. This scheme leads to restrictions on how dynamic data structures can be used. Only one node can be added to a structure at a time, and structures cannot easily be combined (for example, two lists cannot be concatenated). Another ramification of naming a node by its position is that when a node's position changes, the node has to be renamed. If a new node is added to the front of a list, for example, the names of the rest of the nodes in the list have to be reassigned to reflect their change in position, and their mapping onto the processors may change as well. Furthermore, it becomes clear that, in general, determining the independence of operations on dynamic data structures cannot be done at compile-time and requires further investigation.

As in earlier approaches, we rely on the programmer to specify how a dynamic data structure is mapped onto the processors. In our approach, this mapping is done at run-time: a node is assigned to a processor when it is allocated, with the programmer supplying a processor name as part of each allocation request. A node's name is simply its global address, which consists of a processor name and a local address. Thus, the processor that owns a node can be determined from the node's name without communication. We believe that this simple, programmer-supplied mapping, combined with the execution model described in this article, makes it possible to support programs that use recursively defined, pointer-based dynamic data structures on distributed-memory machines.

Before presenting our execution model, let us review our basic assumptions about source programs and the target machine. We assume an SPMD model of execution: each processor runs an identical copy of the program and has its own local stack, and there is a single distributed heap. A processor owns the portion of the heap that resides in its local memory, and heap objects are allocated on a specific processor, which is named explicitly in the allocation request. Figure 1 gives code for allocating a balanced binary tree across a set of processors, and Figure 2 illustrates the resulting layout. The remainder of this article is organized as follows. Section 3 describes the two parts of our execution model. Section 4 presents a simple example. Section 5 describes our implementation of the execution model on the Intel iPSC/860 and the Thinking Machines CM-5. In Section 6, we report on experiments with our benchmarks. In Section 7, we briefly sketch related work, and in Section 8, we present our conclusions.

    tree *TreeAlloc (int level, int lo, int num_proc)
    {
        if (level == 0)
            return NULL;
        else {
            tree *new, *left, *right;

            new = (tree *) ALLOC(lo, sizeof(struct tree));
            left = TreeAlloc(level-1, lo+num_proc/2, num_proc/2);
            right = TreeAlloc(level-1, lo, num_proc/2);
            new->val = 1;
            new->left = left;
            new->right = right;
            return new;
        }
    }

Fig. 1. Allocation code.

Fig. 2. Allocation example.
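The contrast between position-based names (as in Gupta's scheme) and address-based names can be made concrete. The following sketch is ours, not from the paper: it gives each list node a position-style name computed from its place in the list and an allocation-time name that stands in for a global address, and shows that prepending a node shifts every position-based name while the allocation-time names stay fixed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A list node with two kinds of names:
 * - a position-based name, recomputed from the node's place in the list
 * - an allocation-time name, fixed when the node is created (simulating
 *   a <processor, local address> global name). */
typedef struct node {
    int alloc_name;        /* fixed at allocation time */
    struct node *next;
} node;

static int next_alloc_name = 0;

node *cons(node *tail) {
    node *n = malloc(sizeof(node));
    n->alloc_name = next_alloc_name++;
    n->next = tail;
    return n;
}

/* Position-based name: index of the node from the front of the list. */
int position_name(node *head, node *target) {
    int pos = 0;
    for (node *p = head; p != NULL; p = p->next, pos++)
        if (p == target)
            return pos;
    return -1;
}
```

Prepending with `cons` changes the position-based name of every existing node (forcing renaming in a position-based scheme), but leaves each node's allocation-time name untouched.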
3. EXECUTION MODEL
This section describes the two parts of our execution model: thread migration and thread splitting. Thread migration is a mechanism for migrating a thread of control through a set of processors based on the layout of heap-allocated data. Thread splitting, which is based on continuation-capturing operations, is used to introduce parallelism by providing work for a processor when the thread it is executing migrates.

3.1 Thread Migration
The basic idea of thread migration is to view the state of a thread of execution as consisting of a program counter, a stack, and the contents of the machine registers. When a thread executing on Processor P attempts to access a location residing in another processor's memory (say, Processor Q's), the thread migrates from P to Q. Migration entails packaging the current state of the thread, sending it to Q, and resuming execution on Q, where the access will be local. Recall that the name of a heap location is the same (a processor name and a local address) no matter which processor refers to it, so pointers can be used directly after a migration.

Migrating the entire thread state can be quite expensive, because the stack may be large. Instead, we migrate only the thread's current stack frame, the program counter, and the current contents of the registers; the rest of the thread's stack remains on P. Processor Q sets up a stack with the migrated frame and a special stub frame beneath it, loads the registers, and resumes execution of the thread. To accomplish this, we send the return address of the currently executing procedure as part of the migration message and replace the return address in the migrated frame with the address of a stub routine. When the migrated procedure completes, it returns into the stub, which packages the return value and the registers into a message, sends the message back to P, and restarts the thread at the original return address; in effect, the thread migrates back to its caller. Notice that only the information needed to complete the current procedure is migrated; the rest of the frame's contents are not needed.

In our implementation, nonlocal references are detected by an explicit test: the compiler inserts code at each heap pointer dereference that checks whether the address is local and, if it is not, calls a run-time routine that initiates a migration. The processor name encoded in each pointer makes this test inexpensive (see Section 5.2.1). We cannot use the hardware address-translation mechanism to trap nonlocal references, as virtual-memory-based systems do [Appel and Li 1991], because the processor name is encoded in the address itself rather than mapped by the memory system.

3.1.1 Allocation. Allocation is initiated by an explicit request: the programmer is responsible for specifying the processor on which each heap object resides, as an argument to the allocation routine. The compiler expands the allocation routine in-line. If the specified processor is not the executing processor, the thread explicitly migrates to that processor to ensure that the allocation can be completed locally, and it migrates
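A global address of the kind described above can be modeled as a processor name packed into the high bits of a word (Olden's concrete tag widths appear later, in Section 5.2.1). The sketch below is illustrative only, with made-up field widths; it shows that the owner of a heap location is computable locally, from the name itself, with a few bit operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical global-address encoding: an 8-bit processor name in the
 * high bits of a 64-bit word, a 56-bit local address in the low bits. */
#define TAG_SHIFT 56

uint64_t make_global(uint64_t proc, uint64_t local_addr) {
    return (proc << TAG_SHIFT) | local_addr;
}

uint64_t owner(uint64_t gaddr)      { return gaddr >> TAG_SHIFT; }
uint64_t local_part(uint64_t gaddr) { return gaddr & ((1ULL << TAG_SHIFT) - 1); }

/* A thread running on processor 'self' migrates only when the
 * referenced location is owned by some other processor. */
int must_migrate(uint64_t self, uint64_t gaddr) {
    return owner(gaddr) != self;
}
```

No table lookup or message is needed: the name carries its own mapping, which is exactly what makes the locality test cheap enough to insert at every dereference.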
back when it next references data on the original processor.

3.1.2 Discussion. Our system relies on the programmer to choose a data layout that places related data together on the same processor. The purpose of migrating the computation to the data, rather than the data to the computation, is to exploit locality: the thread moves a fixed amount of state once, rather than fetching remote data at each reference. If the programmer has chosen a good layout, for example, one that places a subtree on a single processor, then migrating the thread allows most of the computation to be performed on data owned by the processor on which the thread is executing, thereby substantially reducing the amount of interprocessor communication required. Recent work by Hsieh et al. [1993] supports this claim.
3.2 Thread Splitting
While thread migration moves the computation to the data, it does not introduce any parallelism into the computation. When a thread migrates, the processor it leaves behind becomes idle. In this section, we describe how we introduce parallelism by splitting a thread into two pieces when it migrates, so that the processor left behind can continue with work that does not depend on the migrated portion. Thread splitting is targeted at procedure calls, since calls are the natural place to chop a thread of control in a recursive program. This can be done using futures, which provide a relatively inexpensive mechanism for capturing continuations at key points in the program, either by a parallelizing compiler or by a programmer directly.

3.2.1 Futures. Our future mechanism is a variant of the future construct of traditional parallel Lisps [Halstead 1985]. In these Lisps, (future e) is an annotation that says that the expression e can be evaluated in parallel with its context. In effect, the evaluation of a future splits the thread of execution into two threads: a child, which evaluates e, and the parent. The result of this operation is a future cell, which serves as a synchronization point between the parent and the child. If the parent touches the future cell (that is, attempts to read its value) before the child is finished evaluating e, then the parent blocks. When the child finishes, it puts the result of evaluating e in the cell and restarts any blocked threads.

In many parallel Lisps, a future is effectively a task-creation operation: each future starts a new thread. Our approach is essentially a variant of the lazy-task-creation scheme of Mohr et al. [1991]. In Olden, the thread that executes a future does all of the work of the future's body itself; the future operation simply saves a continuation (the return continuation) on a work list. If a migration occurs while the body is executing, the processor left idle can grab a continuation from its work list and start executing the rest of the parent's computation in parallel with the migrated thread. Futures thus provide a fairly direct and inexpensive continuation-capturing mechanism for introducing parallelism. Our scheme is also similar to the workcrews paradigm proposed by Roberts and Vandevoorde [1989].

3.2.2 Details. Our implementation provides two operations: futurecall, which evaluates the future's body, and touch, which demands its value. In the remainder of this section, we describe these operations and their related data structures. The primary data structure is the future stack, a stack of future cells that is threaded through the stack frames and that also serves as the work list.
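The observable contract of futurecall and touch is that touch yields exactly the value the call would have produced. The sketch below is ours and collapses the mechanism to its degenerate no-migration case: the body runs immediately in the calling thread and fills the cell, so touch simply reads a filled cell. This is the cheap path Olden takes when no migration occurs; the stealing and blocking machinery is omitted here.

```c
#include <assert.h>

typedef enum { EMPTY, FULL } fstate;

typedef struct {
    fstate state;
    int value;
} future_cell;

/* Degenerate (no-migration) futurecall: evaluate the body at once,
 * in the calling thread, and fill the cell with the result. */
future_cell futurecall_int(int (*f)(int), int arg) {
    future_cell c;
    c.value = f(arg);
    c.state = FULL;
    return c;
}

/* Touch: read the value. In this sequential model the cell is always
 * already FULL, so no blocking is needed. */
int touch(future_cell *c) {
    assert(c->state == FULL);
    return c->value;
}

int square(int x) { return x * x; }
```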
Fig. 3. Future cell state transitions.

A future cell can be in one of four states (Figure 3 gives the allowable state transitions):

Empty: This is the initial state of a future cell. It contains the parent continuation and the information necessary to restart the parent.

Stolen: This is the state when the parent continuation has been stolen.

Waiting: This is the state when the parent continuation has been stolen and the future cell has subsequently been touched by a blocked thread (note: no future cell can be touched by more than one thread).

Full: This is the final state. It contains the result of the normal computation of the future.
computation. The
future
when
stack
allocate
the
The cell
the
routine,
the
cell’s
future
cell
stolen,
case, there called;
cell
is filled
If the
cell
a processor
run-time
cell from fiignal
the
that
changed
the
Touching
to
a future
do by
the
the
stack.
after
This
routine
If the
executes has
stack
the
been
future
of stack,
continuation state.
has
In the
routine
a thread
Stolen
MsgDi. spatch
migration,
does
is not
resume stolen,
from
is
is resumed.
as just
ch.
off the
corresponding system
pushing
return
an examination
or Waiting
run-time
to do, such
and
following
continuation
cell,
Upon
is popped
the
Stolen
waiting
future
cell
to
details).
routine.
result,
case,
are required
a new future
specified
otherwise
continuation
that The
calling
per future
computation.
ACM
then
MsgDispat
stack
parent
stores
attempted a touch
the
of the
to block.
Waiting,
work
routine
5 for
for allocating
return
In
instructions
(see Section
the
in the
the
stack.
extra
the
empty,
stealing
it pops
a future
continuation the
state
it calls
actual
in the
of the
cell.
future
To
cell
is
to Stolen.
putation to
the
has no work
from top
with
to do, and
state,
system
of continuations
calling
is either
more
Waiting
a few
result
and
normally;
cell
is nothing
in the
When
the
the
is Empty,
continues and
execution
only
is responsible
stack,
state.
execution
touch
future
the
been
and
the
occurs,
operation
on the
and
through
or stealing
f uturecall
the
the
is threaded
no migration
As
is not
touch
the
Full
continuation that
a result,
the
the
changes of the
MsgDi. spatch. and
causes
operation
touch state
The must
currently
executing
the state
of the future
touch model
in the requires
be done
of a future
will
cell,
and
that
Stolen
is attempted.
Transactions
on Programmmg
Languages
and Systems,
Vol
17, No
2, March
1995
of com-
from looks
only
by the future’s be either
thread
one
parent
Stolen for
other
touch
be
thread
or Full
of
when
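The four states and their transitions can be written down as a small table. The encoding below is our own rendering of the transitions as described in the text (steal: Empty to Stolen; touch of a stolen cell: Stolen to Waiting; arrival of the result: Empty, Stolen, or Waiting to Full), not code from Olden itself.

```c
#include <assert.h>

typedef enum { EMPTY, STOLEN, WAITING, FULL } cell_state;

/* Returns 1 if the transition is allowed by the rules in the text. */
int allowed(cell_state from, cell_state to) {
    switch (from) {
    case EMPTY:   return to == STOLEN || to == FULL;  /* steal, or normal return */
    case STOLEN:  return to == WAITING || to == FULL; /* touch blocks, or result arrives */
    case WAITING: return to == FULL;                  /* result arrives, blocked thread resumes */
    case FULL:    return 0;                           /* final state */
    }
    return 0;
}
```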
    int TreeAdd (tree *t)
    {
        if (t == NULL)
            return 0;
        else
            return (TreeAdd(t->left) + TreeAdd(t->right) + t->val);
    }

Fig. 4. TreeAdd.
3.2.3 Discussion. The design of futures in Olden was influenced heavily by our intent to use parallelizing compiler technology [Hendren 1990; Hendren and Nicolau 1990; Hendren et al. 1992; Larus 1991] to insert future annotations automatically. Both restrictions mentioned in the previous paragraph follow from this intention. These two restrictions allow us to allocate future cells on the stack rather than the heap, which helps make futures less expensive. Another important fact to note about the use of futures in Olden is that when
a process becomes idle, it steals work only from its own worklist. A process never removes work from the worklist of another process. The motivation for this design decision was our expectation that most futures captured by a process would operate on local data. Although allowing another process to steal this work may seem desirable for load-balancing purposes, it would simply cause unnecessary migrations. Instead, load balancing is achieved by a careful data layout. This is in contrast to Mohr et al.'s formulation, where an idle processor removes work from the tail of another process's work queue. The choice of using a queue versus a stack to hold pending futures is related to the question of who can steal work from a processor's worklist.
decomposition that results from using a work queue is desirable when work can be stolen by another process, because it tends to provide large granularity tasks. A stack-based implementation is appropriate for Olden, because when a process steals its own work, it is better for it to steal work that is needed sooner rather than later.
4. A SIMPLE EXAMPLE
To make the ideas of the previous sections more concrete, we present a simple example.3 Figure 4 gives the C code for TreeAdd, a prototypical divide-and-conquer program, which computes the sum of the values stored in the nodes of a binary tree. Figure 5 gives a semantically equivalent version of TreeAdd annotated with futures. In the annotated version, the recursive call on the left subtree has been replaced by a futurecall,4 which means that its result is not demanded until the touch, after the recursive call on the right subtree. To understand what this annotation means in terms of the program's execution, consider the case where TreeAdd, executing on Processor P, has been called with a tree node t that resides on P, and t has a left child tQ residing on another processor (Processor Q). When the thread attempts to
3We present more realistic benchmarks in Section 6.
4Note that using a futurecall on the right subtree is not cost effective, since there is very little computation between the return from TreeAdd and the use of its result.
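Stripped of the annotations, the TreeAdd of Figure 4 is ordinary C and can be checked directly. The harness below is ours (the tree constructor mirrors the allocation code of Figure 1, minus processor placement); the recursive sum is the figure's.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tree {
    int val;
    struct tree *left, *right;
} tree;

/* The recursive sum from Figure 4. */
int TreeAdd(tree *t) {
    if (t == NULL)
        return 0;
    else
        return TreeAdd(t->left) + TreeAdd(t->right) + t->val;
}

/* Build a complete binary tree of the given depth, with every
 * node's val set to 1, as the allocation code in Figure 1 does. */
tree *build(int level) {
    if (level == 0)
        return NULL;
    tree *n = malloc(sizeof(tree));
    n->val = 1;
    n->left = build(level - 1);
    n->right = build(level - 1);
    return n;
}
```

Because every val is 1, the sum of a depth-d tree is simply its node count, 2^d - 1, which makes the result easy to check.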
    int TreeAdd (tree *t)
    {
        if (t == NULL)
            return 0;
        else {
            tree *t_left;
            future_cell f_left;
            int sum_right;

            t_left = t->left;    /* this can cause a migration */
            f_left = futurecall (TreeAdd, t_left);
            sum_right = TreeAdd (t->right);
            return (touch(f_left) + sum_right + t->val);
        }
    }

Fig. 5. TreeAdd annotated with futures.
fetch the left child of t (the reference to t->left), the nonlocal reference causes the currently executing thread of control to migrate to Q, leaving P idle. At this point, P can steal the future cell associated with the futurecall and resume the captured continuation, which continues with the computation of the right subtree (the call to TreeAdd on t->right). Meanwhile, the migrated thread computes the sum of the left subtree on Q and migrates back when it has completed. The stolen computation executing on P continues until it attempts to touch the future cell, at which point it must wait until the computation of the left subtree has completed and the result has been returned. Please note that we have described the case in which a migration occurs; if both children of t are local, the futurecall behaves like an ordinary call, and neither a migration nor a steal takes place.

5. EXECUTION MODEL IMPLEMENTATION
We have implemented our execution model on the Intel iPSC/860 hypercube and the Thinking Machines CM-5. Olden consists of a compiler and a run-time system. The input to our compiler is a C program annotated with futurecalls, touches, and local annotations (described in Section 5.2). The compiler translates this source code into assembly code with embedded calls to the Olden run-time system; it uses a nonstandard stack discipline to support migration. The run-time system is responsible for handling migrations, futures, and touches, and for assigning work to the processor. We describe the run-time system first and then the compiler.

5.1 The Run-Time System
The main procedure of the run-time system is MsgDispatch, which starts and resumes computations on a processor. It handles migrations, returns from migrations, touches of stolen futures, and the stealing of work; incoming messages are given first priority, and stealing from the local future stack is attempted only when there are no messages to process. Figure 6 provides an overview of the run-time system.
Fig. 6. The run-time system. [Diagram: transitions among User Code, CallStub, and MsgDispatch, triggered by a migrate on reference and by a touch of a stolen future.]

The run-time system processes four different events: a migrate on reference, a migrate on return, a steal, and a touch of a stolen future. We discuss the first two in this subsection and the next, and the others in the discussion of futures. We focus on migration because its cost is likely to dominate; our implementation is designed to reduce the cost of migrations.

5.1.1 Processing a Migrate on Reference. When a thread attempts to dereference a nonlocal pointer, the generated code calls Migrate, passing it the nonlocal reference. Migrate packages the state of the thread into a migration message and sends it to the processor that owns the referenced data. The message contains the current stack frame, the contents of the callee-save registers, the return address of the current procedure, the frame size, the argument (the nonlocal reference), and some bookkeeping information. We allocate extra space in each stack frame for this bookkeeping information, which allows us to build the migration message in situ, sending it directly from the stack and thereby reducing the cost of migrations. Once the migration message has been sent, the sending processor calls MsgDispatch, which processes incoming messages and steals off work from the local future stack; sending the message first allows us to get work to other processors sooner.

When a migration message is received, MsgDispatch selects the migrated thread as the processor's next task. It allocates stack space for a stub frame (CallStub) and the migrated frame. Then it copies the frame from the message onto the stack, stores the return address, the frame size, and the callee-save register contents from the message into the stub frame, and changes the return address of the migrated procedure to point to CallStub. Finally, it loads the argument and callee-save registers with the values stored in the message,5 loads the program counter, and resumes execution of the migrated procedure. As a result, when the migrated procedure exits, it will go to CallStub.

5We bypass the SPARC register window mechanism in our CM-5 implementation.
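The in-situ packaging of a frame can be pictured as copying a contiguous block of thread state, together with its bookkeeping fields, into a message and reconstituting it on the receiver. The sketch below is ours, with invented field names and sizes and none of Olden's register or stub-frame handling; it only illustrates that a frame-plus-bookkeeping record survives a send/receive round trip byte-for-byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical migration message: bookkeeping fields followed by a
 * copy of the current stack frame (capped at 64 bytes for the sketch). */
typedef struct {
    uint32_t frame_size;
    uint32_t return_addr;   /* stand-in for a real code address */
    unsigned char frame[64];
} migration_msg;

/* Package the frame and bookkeeping information into the message. */
void package(migration_msg *m, const void *frame, uint32_t size, uint32_t ra) {
    m->frame_size = size;
    m->return_addr = ra;
    memcpy(m->frame, frame, size);
}

/* On the receiving processor: copy the frame back onto a stack. */
void unpack(const migration_msg *m, void *frame_out) {
    memcpy(frame_out, m->frame, m->frame_size);
}
```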
5.1.2 Processing a Migrate on Return. When a procedure whose frame was migrated exits, it returns to CallStub instead of returning directly to its caller. CallStub packages the procedure's return value, the contents of the callee-save registers, and the original return address into a MigrateReturn message, sends the message back to the processor from which the procedure migrated (in our example, back to P), deactivates the stub frame, and calls MsgDispatch. When the MigrateReturn message is received, the processor selects the returning thread as its next task: the callee-save registers and return values are pulled from the message, the return value is stored in the appropriate place, and execution resumes at the return address, in the same state as if the procedure had returned locally.

Note that a thread may migrate several times during the course of executing a single procedure. To avoid a chain of returns through intermediate processors, a forwarding optimization is used: if a procedure that has already migrated migrates again, the contents of its stub frame are sent as part of the new migration message, and the new stub points back to the original processor, so that the eventual return message is sent directly to it. This optimization is similar to the tail-forwarding optimization used in Concurrent Smalltalk [Horwat et al. 1989]. A similar tail-call optimization is used to avoid trivial returns: the compiler examines the code following a call to determine whether any work remains after it; if not, the return can bypass the intermediate frame.

5.2 The Compiler
The Olden compiler is derived from lcc [Fraser and Hanson 1995], an ANSI C compiler.6 We ported the backend to generate code for our target processors (the i860 and the SPARC) and added modifications to test pointers on all heap pointer dereferences and to call Migrate as necessary, to generate code for futures and touches, and to capture thread state. These modifications are controlled by a command-line option.

5.2.1 Testing Memory References. As previously mentioned, a thread attempts to migrate when it dereferences a nonlocal heap pointer, so it is necessary to check each heap pointer dereference to see if the pointer is local. In our implementation, each heap pointer contains a tag indicating the processor on which the referenced data resides.7 The generated code examines the tag to see if the pointer is local. If it is, the check code removes the tag, explicitly generating the local address, and the dereference proceeds; otherwise, the code calls Migrate, which moves the thread to the appropriate processor.

The additional overhead of testing each pointer dereference is three integer instructions and a conditional branch on the i860, and four integer instructions and a conditional branch on the SPARC. This overhead can be reduced by an optimizer observing that only the first reference in a sequence of references to the same pointer can cause a migration. For example, in TreeAdd, only the first reference, t->left, can cause a migration.

6lcc does not have an optimizer.
7The range of possible heap addresses dictates the maximum size of the processor tags. The CM-5 implementation uses the eight high-order bits to hold the tag, and the iPSC/860 implementation uses the seven high-order bits.
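The per-dereference check and the saving from a dominating reference can be counted explicitly. The sketch below is ours (it uses an 8-high-order-bit tag, as on the CM-5, but invented helper names); it contrasts testing every access through a pointer with testing once and reusing the untagged local address.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56
static int checks = 0;   /* count locality tests performed */

/* Test the tag; if the pointer is local, return the untagged address.
 * (In Olden, a nonlocal tag would instead call Migrate; this sketch
 * simply requires locality.) */
uint64_t check_local(uint64_t self, uint64_t p) {
    checks++;
    assert((p >> TAG_SHIFT) == self);
    return p & ((1ULL << TAG_SHIFT) - 1);
}

/* Naive: test on every one of n dereferences of the same pointer. */
void deref_n_naive(uint64_t self, uint64_t p, int n) {
    for (int i = 0; i < n; i++)
        (void) check_local(self, p);
}

/* Dominating reference: test once, then reuse the local address. */
void deref_n_dominated(uint64_t self, uint64_t p, int n) {
    uint64_t local = check_local(self, p);
    for (int i = 1; i < n; i++)
        (void) local;   /* subsequent accesses need no test */
}
```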
Supporting Dynamic Data Structures · 245

    int TreeAdd (tree *t)
    {
      if (t == NULL)
        return 0;
      else {
        tree *t_left, *local_t;
        future_cell f_left;
        int sum_right;

        local_t = local(t);    /* t is guaranteed to be local */
        t_left = local_t->left;
        f_left = futurecall(TreeAdd, t_left);
        sum_right = TreeAdd(local_t->right);
        return (touch(f_left) + sum_right + local_t->val);
      }
    }

Fig. 7. TreeAdd code using futures and the local annotation.

For example, in TreeAdd, only the first dereference of t can cause a migration; subsequent references through t are guaranteed to be local.8 That is, if one dereference of some pointer, say t, is dominated by another dereference of t, and there are no intervening assignments to t, then only the dominating reference can cause a migration. The Olden compiler allows us to take advantage of this observation with the local construct: the expression local(t) forces a migration if t is nonlocal and removes the tag, yielding a local address that can be used directly, with no special handling of subsequent references made through it. Figure 7 shows the result of applying this optimization to the TreeAdd example: all of the references in the body are indirect through the annotated pointer local_t, and only the evaluation of local(t) can cause a migration.

5.2.2 Compiling Futures. In this subsection, we examine our implementation of futures in detail. We review the basic ideas underlying futures, which we presented in Section 3.2, and then discuss the steps necessary to implement them and several optimizations. When the compiler encounters a futurecall, it generates code in-line to capture the state of the computation: the code stores the callee-save registers, the frame pointer, and the continuation's program counter in the future cell, initializes the cell's state to Empty, and pushes the cell onto the future stack. Then the call is generated. On return from the call, in-line code checks the state of the future cell: if it is Empty, the cell is popped from the future stack, the return value is stored, and execution continues.

8 Note that this observation relies on the fact that, on exit from the called routine, execution always returns to the original processor.

ACM Transactions on Programming Languages and Systems, Vol. 17, No. 2, March 1995.
246 · Anne Rogers et al.

In the case that the called function (or one of its descendants) migrated, the state of the cell on return will be Stolen or Waiting. If it is Stolen, the continuation has been stolen and is executing on another thread; the returning thread stores the return value in the cell, marks it Full, and calls MsgDispatch to look for work. If it is Waiting, then there is a waiting thread, which suspended on a touch of this cell; the returning thread stores the value and resumes the suspended thread immediately.

5.2.3 Optimizations. We have implemented a series of optimizations that reduce the overhead of future calls at the expense of increasing the cost of steals. This tradeoff is profitable because future calls are quite common (a program generates far more potential parallelism than the machine can exploit), while steals are much less common: a steal occurs only when a thread migrates.

The first optimization eliminates the store of the callee-save registers at each future call. These registers must be captured because the called function may use them, and the thread that executes a stolen continuation needs their original values on entry. We notice, however, that the values are needed only in the case where the future actually is stolen. Therefore, rather than storing the callee-save registers in the future cell at each future call, the run-time system recovers them lazily when stealing a future, by unwinding the stack. The future cell, which is stored in the frame of the calling procedure, contains the frame pointer and return address for its call, and the future cells form a linked list threaded through the stack. For each procedure containing a future call, the compiler generates a restore sequence, a series of instructions that reads the procedure's callee-save register stores back out of its frame; the address of the restore sequence for a call site is stored at a fixed offset from the call, so that it can be obtained from the return address saved in the frame. To perform unwinding, MsgDispatch traverses the chain of frames from the frame of the last executing procedure back to the frame containing the stolen future cell, executing the restore sequence for each frame and storing the captured register values in the future cell as it goes.

Figure 8 shows an example scenario: F performs a futurecall of G, G calls H, and H migrates. When H migrates, the stack is unwound by executing the restore sequences for H and then G, capturing the callee-save values that each saved in its frame.

A second optimization eliminates the store of the return address in the future cell at each call; during unwinding, the return address is simply read from the frame. Since the future cell is initially Empty and is expected to remain so, the test in the return sequence for the common case, in which the future was not stolen, is only one load and a conditional branch on the i860.
Fig. 8. Stack layout for unwinding: F performs futurecall(G, ...); G calls H; H migrates. Each frame records a frame pointer (fp) and a return address (ret addr).
Given that the link and state of a future cell are both small, we may eliminate yet another store. The link field of a future cell holds the pointer that threads the future stack, and the state field must represent the four states Empty, Full, Stolen, and Waiting. Because future cells are four-byte aligned, the bottom two bits of the link pointer are always zero, so we store the state information in the two low-order bits of the link field, and a separate state word is no longer necessary. By representing Empty as zero, we get the initialization of the cell essentially for free. On a steal, MsgDispatch changes the state of the cell from Empty to Stolen. Given these optimizations, compiling a futurecall generates only a short in-line sequence of instructions on both the iPSC/860 and the CM-5; the Appendix contains a sample of the generated code for the CM-5 implementation.

The touch annotation also generates only a few instructions in-line (five on the i860, including one load): the code examines the state of the future cell and, if the cell is Full, returns the stored value; otherwise the library routine Suspend is called. Suspend stores the program counter and the frame pointer in the future cell, marks the cell as Waiting, and calls MsgDispatch to find other work. When the stolen call returns and finds the cell marked Waiting, the run-time system loads the callee-save registers, the frame pointer, and the program counter from the cell and resumes execution of the suspended procedure; this is how a touching thread synchronizes with the value it is waiting for.
5.3 Stack Management

In the discussion so far, we have glossed over certain details of stack management. When a future is stolen, the stack frames belonging to the migrated thread must be preserved until the thread returns. Furthermore, multiple threads may allocate frames on the same processor; if we used a simple single-stack policy, a new thread's frames could overwrite a portion of the stack that must be preserved, such as the frames and continuation state of a migrated thread. To avoid this problem, we adopt a simplification of the spaghetti stack technique developed by Bobrow and Wegbreit [1973] for environments with multiple threads of control. Wegbreit's method splits a frame into two parts: a basic frame, which can be shared among several access modules, and an extension, which belongs to a single access module; associated with each basic frame is a counter of its active extensions.

ACM Transactions on Programming Languages and Systems, Vol. 17, No. 2, March 1995.
Since our execution model makes no use of coroutines, and the frames of a migrated thread are accessed by no other thread, we can simplify the method considerably. We maintain a single stack per processor, on which the frames of several threads may be interleaved; frames are always allocated contiguously at the end of the stack, and a global pointer marks that end. To implement deallocation, we reserve one word, called splink, at the end of each frame. On entry, a procedure stores zero at splink, marking its frame as live. When a procedure exits and its frame is at the end of the stack, the frame is deallocated by adjusting the stack pointer; we then traverse the splinks of the frames below, deallocating garbage frames, until we reach a live frame. When a procedure exits and its frame is not at the end of the stack (frames have been allocated above it, for example, by another thread), the frame cannot be freed immediately. Instead, it is deactivated: a pointer is placed at splink, marking the frame as garbage and allowing the deallocation walk to skip over it, and the frame's space is reclaimed later, when the frames above it exit. Since a null splink marks a live frame, the walk from the end of the stack deallocates all of the garbage frames that have accumulated there. Figure 9 provides a simple example: F is executing; when F calls G, a new frame is allocated at the end of the stack; and when G exits, its frame is deallocated, returning the stack to the initial state.9

Fig. 9. Simplified spaghetti stack, the simple case.

9 A similar method was developed by Hogen and Loogen [1993] for parallel implementations of functional and logic languages.
Fig. 10. Simplified spaghetti stack, the complex case.

Figure 10 provides a more complex example. Assume that the frame preceding F, labeled (e), is live and belongs to another thread of execution.
F calls G. G migrates, and the run-time system gives control to e. The frame e is not at the end of the stack; hence, when it exits, a pointer is placed at the end of its frame, marking it as garbage. The exit for d is similar. Again, a frame that exits while it is not at the end of the stack is deactivated rather than deallocated. When the topmost frame exits, the resulting list of garbage frames will be followed from the end of the stack, deallocating them until a live frame is reached.
5.3.1 Discussion. Since frames that exit out of order are not deallocated immediately, our scheme does not make optimal use of stack space: dead frames can be present in the middle of the stack, and, in principle, a stack could run out of capacity even though the space needed by the live frames is small. We could come closer to optimal use of space by performing compaction explicitly, copying live frames toward the bottom of the stack; we have not found this to be necessary. A useful metric of the cost of the scheme is the expected number of dead frames present in the stack during an actual run; in our benchmarks this number, and hence the cost of carrying garbage frames, is low.

The overhead of the simplified spaghetti stack over the basic stack scheme is small. On entry, one extra instruction is required, to store zero at splink. The exit sequence tests whether the frame is at the end of the stack and whether a garbage frame lies underneath; it requires four additional instructions (an add, one load, and two conditional branches) in the exit code for the i860 and six additional instructions (three adds, one load, and two conditional branches) for the CM-5. The deallocation walk itself runs only when garbage frames are actually found, so in the common case the exit sequence is all that executes.

One problem remains: because dead frames are reclaimed lazily, a thread's frames can accumulate on a particular processor faster than they are deallocated, so the run-time system must still check for stack overflow and extend the stack as necessary.
In our experience, the additional overhead incurred by the simplified spaghetti stack is small. We chose this design over the principal alternative, multiple disjoint stacks, for two reasons. The first reason is historical: our prototype implementation began with a compiler that used a simple "deallocate-on-return" stack policy, and the spaghetti stack made it easy to reuse the existing compiler, allocation system, and libraries with few changes. The second reason is that multiple disjoint stacks would lead to difficulties of their own: disjoint stacks must either be allocated large, wasting memory, or be grown on overflow, which requires copying; and heap-allocated stacks would lead to excessive heap growth, given the usage patterns of code compiled by Olden's compiler. Our design is best understood as a compromise between these approaches.
6. RESULTS

The previous section described our prototype implementations of Olden for the Intel iPSC/860 and the Thinking Machines CM-5. In this section, we report results for five benchmarks: TreeAdd, Power, Bitonic Sort, Traveling Salesman, and Voronoi Diagram. TreeAdd is a very simple benchmark. Power is computationally intensive and has ideal locality. Bitonic Sort is a classic divide-and-conquer sorting algorithm that has two levels of recursion and more-complex locality patterns; we report results for two versions, which we call Original Bitonic Sort and Bitonic Sort. The difference between the two is that Original performs extra calls to malloc; we include it because it illustrates the relationship between allocation and locality. Traveling Salesman is a classic divide-and-conquer algorithm with a relatively simple merging step. And finally, Voronoi Diagram is a divide-and-conquer algorithm that uses a different, more complex method for merging subresults. All five benchmarks use trees as their primary data structures.

We performed the experiments on a 16-node Intel iPSC/860 at Princeton University and on CM-5s at two National Supercomputing Centers: NPAC at Syracuse University and NCSA at the University of Illinois. The timings, reported in Tables I and II, are from single runs; the experiments were done in dedicated mode, and the timings did not vary significantly between runs. Each benchmark has three variants: a sequential implementation (sequential), which was compiled without the overhead of futures, pointer testing, and the spaghetti stack; a one-processor version of the parallel implementation (one), which includes this overhead; and the parallel implementation itself. The programs were hand-annotated with future calls, touches, and local annotations, and we annotated aggressively. We do not claim that these annotations could be generated automatically using existing compiler technology; the goal of these experiments is to show that the execution model allows efficient parallelization. We return to the problem of generating the annotations automatically in Section 7.

6.1 TreeAdd

TreeAdd is a very simple program that recursively walks a tree, summing the values in each subtree (see Figure 7). The tree, of size 1024K nodes, is a balanced binary tree allocated using the allocation scheme presented earlier; the subtrees rooted at a fixed depth are distributed equally among the processors. We report results for this benchmark in Figure 11.
Table I. iPSC/860 Timings in Seconds

                                              Processors
Benchmarks                      sequential    one        2        4        8       16
TreeAdd (1024K)                     1.94      2.62     1.31     0.66     0.33     0.17
Power (10,000)                    517.28    517.46   260.23   130.43    65.88    34.20
Original Bitonic Sort (32K)         4.87      5.69     3.23     2.29     2.04     2.09
Original Bitonic Sort (128K)       29.21     64.67    15.18     9.38     6.69     5.36
Bitonic Sort (32K)                  3.60      4.45     2.64     2.01     1.91     2.02
Bitonic Sort (128K)                17.46     21.21    11.98     7.96     6.21     5.05
Traveling Salesman (32,767)        65.79     66.02    33.49    17.48     9.61     5.77
Voronoi Diagram (64K)              13.60     20.17    13.06    11.35    16.00    33.16

Table II. CM-5 Timings in Seconds

                                              Processors
Benchmarks                      sequential    one        2        4        8       16       32
TreeAdd (1024K)                     4.49      5.94     2.99     1.50     0.75     0.37     0.19
Power (10,000)                    286.59    313.04   160.75    79.73    43.87    25.03    11.87
Original Bitonic Sort (32K)         8.50     10.79     5.80     3.49     2.40     1.88     1.67
Original Bitonic Sort (128K)       40.66     52.56    27.15    15.72     9.86     6.65     4.89
Bitonic Sort (32K)                  6.55      8.48     4.93     2.90     2.10     1.75     1.64
Bitonic Sort (128K)                31.83     41.18    21.81    13.08     8.52     5.97     4.40
Traveling Salesman (32,767)        43.35     45.45    22.58    11.72     6.47     3.91     2.74
Voronoi Diagram (64K)              48.46     62.88    38.03    25.87    24.83    39.33    69.30
Fig. 11. TreeAdd performance. (Curves: TreeAdd (860), TreeAdd wrt one (860), TreeAdd (CM5), TreeAdd wrt one (CM5); speedup plotted against the number of processors.)
Fig. 12. Power performance. (Curves: Power (860), Power (CM5); speedup plotted against the number of processors.)
The upper curves plot speedup with respect to the one-processor implementation; the lower two curves represent true speedup, that is, speedup with respect to the sequential implementation. Speedup with respect to the one-processor version is nearly perfect, while the true efficiencies are roughly 72% on the iPSC/860 and 74% on the CM-5: adding processors does not amortize the overhead of the parallel implementation, namely futures, pointer testing, and the spaghetti stack.

6.2 Power Pricing

Power solves the Power-System-Optimization problem [Lumetta et al. 1993]: given a power network, represented by a tree with a power plant at the root and customers at the leaves, use local information to determine the prices that optimize the use of the network. The benchmark executes a series of phases, each consisting of an upward pass, which propagates demand information from the customers toward the root, and a downward pass, which propagates pricing information from the root toward the leaves; phases are executed until convergence is reached. The structure of the tree can be stated as follows: the root has 10 main feeders; each feeder heads a line of 20 lateral nodes; each lateral is the head of a line of five branch nodes; and each branch has 10 leaves, which represent the customers, giving 10,000 customers in all. The tree is distributed using a block distribution based on an in-order numbering of the laterals: the laterals, together with the branches and customers below them, are divided evenly among the processors; each internal node is placed on the same processor as the head of its line; and the root is placed on processor 0. With this layout, each pass over the tree performs only on the order of P - 1 migrations, where P is the number of processors.

We report results for 10,000 customers in Figure 12. Power's performance is very similar to TreeAdd's: it displays near-perfect speedup, because there is real parallelism available in the tree, the computation at each node is substantial, and migrations are rare. The behavior of the whole program, including the time to build the tree, is very similar, but the tree itself is much smaller than TreeAdd's.
Fig. 13. Bitonic Sort performance. (Curves: Original BiSort (32K) (860), Original BiSort (128K) (860), Original BiSort (32K) (CM5), Original BiSort (128K) (CM5), BiSort (32K) (860), BiSort (128K) (860), BiSort (32K) (CM5), BiSort (128K) (CM5); speedup plotted against the number of processors.)
6.3 Bitonic Sort

Bitonic Sort [Bilardi and Nicolau 1989] sorts a set of integers using a binary tree to store the numbers. The benchmark has two phases: a tree-building phase, which creates a random set of integers and places them in a tree whose subtrees are evenly distributed among the processors, and the sort itself; the reported timings do not include the tree-building phase. The sort is a divide-and-conquer algorithm with two levels of recursion: two recursive calls sort the subtrees, one forward and one backward, so that the two sorted sequences together form a bitonic sequence, and a recursive bitonic merge then merges them to produce the result. The merge swaps subtrees as it proceeds and, at the top level, must move half of the numbers between the left and right subtrees, so it performs a substantial amount of work over and above the recursive sorts.

The main difference between the two implementations is in how they maintain the tree. Original Bitonic Sort allocates new subtrees during the merge phases, calling malloc, and the new allocations can easily skew the layout of the tree; the modified version swaps the numbers in place, avoiding the extra calls to malloc and maintaining the layout of the data. Figure 13 displays the results for both versions on two problem sizes (32K and 128K numbers), four curves for each machine. The extra overhead of Original accounts for a loss of about 30% relative to the modified version. Even so, both versions perform respectably; the benchmark does not show perfect speedup, because the merge phase does not have perfect locality and moving the numbers between processors is expensive. These results are comparable to Lumetta et al.'s results.
Fig. 14. Traveling Salesman performance. (Curves: TSP (860), TSP (CM5); speedup plotted against the number of processors.)
6.4 Traveling Salesman

This benchmark computes an estimate of the best hamiltonian circuit for the traveling-salesman problem. It is based on Karp's [1977] partitioning algorithm: the set of cities is divided into two subsets, alternately by X or Y coordinate (thereby dividing the set at the median); tours are computed recursively for the two subsets; and the two tours are merged to form the final result. Rather than computing an exact solution for the subproblems, which takes exponential time, we use the closest-point heuristic [Cormen et al. 1989] for small problem sizes. We assume that the cities are given in a balanced binary tree that reflects the recursive partitioning; the tree is laid out as in the previous benchmarks, with the subtrees at a fixed level distributed among the processors. We report the speedup obtained for 32,767 cities in Figure 14; these results do not include the cost of building the tree. The speedup is almost linear.

6.5 Voronoi Diagram

The Voronoi Diagram benchmark generates a random, uniformly distributed set of points and computes the Voronoi Diagram of the set using the divide-and-conquer algorithm presented by Guibas and Stolfi [1985] (see also Lee and Schachter [1980]); our implementation is based on one written by Muller [1993]. The points are sorted by X coordinate and stored in a balanced binary tree, so that dividing a set at its median is trivial, and the points are evenly distributed among the processors. To compute the Voronoi Diagram of the whole set, the algorithm recursively computes the Diagrams of the two subtrees and then merges them: the merge walks the convex hulls of the two sub-Diagrams, adding the edges that knit the two Diagrams together into the Diagram of the union.

In Figure 15, we report the speedup obtained for 64K points. These results do not include the cost of generating the points or of building the tree, for two reasons. First, building the tree requires the points to be sorted, a substantial computation in its own right. But more importantly, the later merge phases dominate the reported computation: the final merge is sequential, represents a substantial fraction of the total work, and operates on sub-Diagrams that reside on different processors.
Fig. 15. Voronoi Diagram performance. (Curves: Voronoi (860), Voronoi (CM5); speedup plotted against the number of processors.)
The available parallelism is swamped by the cost of the migrations that occur during these merges. Migrations are expensive in our current implementation,10 and the merges cause the computation to "ping-pong" between processors, generating many migration messages.
6.6 Discussion and Future Work

In early experiments, we observed superlinear speedup for some of the benchmarks. We traced this behavior to the storage allocation routines supplied with the iPSC/860's operating system, which suffer from quadratic behavior; dividing a problem among more processors reduced the number of allocations per processor. We removed the effect by managing the allocation of storage explicitly rather than calling the system routines for each node.

The difference between the sequential and one-processor times measures the overhead of our execution model: futures, pointer testing, and the spaghetti stack. Pointer testing is a significant component of this overhead; it accounts for roughly 44% of the overhead on the iPSC/860 and 24% on the CM-5 for a typical benchmark, and for as much as 80% in the worst case. It may be possible to reduce this cost by using the virtual-memory system to detect nonlocal references [Appel and Li 1991]: a reference to a nonlocal address would cause a trap, and a user-level trap handler would capture the reference and invoke the migration mechanism. The effectiveness of such a scheme depends heavily on the cost of servicing a user-level trap, which is high in current operating systems. The Wisconsin Wind Tunnel group's Tempest/Blizzard system [Schoinas et al. 1994] provides a related mechanism, registering user-level handlers that are invoked on references to nonlocal memory; Olden's pointer testing could be built on top of such support.

Migration itself is also expensive: a migration requires capturing and sending the state of the thread, and the cost of sending the message dominates.10 Finally, the experiments with Voronoi Diagram point out a problem with our model: computation migration alone is not appropriate for every pointer-based program on distributed-memory machines.

10 We have measured the cost of sending a 250-byte message, which is the average size of a migration message, as 496µs on the iPSC/860 and 546µs on the CM-5.
Our early results with software caching are encouraging. We have implemented a simple software caching scheme, using a user-invalidate protocol built on top of Active Messages [von Eicken et al. 1992], and we are exploring compile-time heuristics that choose between computation migration and software caching: for data structures that multiple processors reference simultaneously, caching can be much cheaper than migration. Which mechanism is best for a given program remains to be examined; initial results indicate that combining the two can reduce communication costs substantially.
7. AUTOMATIC PARALLELIZATION

This article has concentrated on Olden's execution model, which was designed to be an appropriate target for a parallelizing compiler. As mentioned earlier, our objective is to build a compiler that uses pointer analyses to insert the futurecall and touch annotations automatically. In this section, we briefly outline how the information such analyses provide can be used; see Rogers et al. for the details.

The analyses we have in mind [Hendren 1990; Hendren and Nicolau 1990; Hendren et al. 1992; Larus and Hilfinger 1988a; 1988b] were designed to approximate the relationships between the parts of recursively defined, pointer-based data structures. A typical analysis can determine, for example, that a structure is tree-like, that two pointers refer to disjoint subtrees, or that a list is noncircular; Hendren's path matrix analysis provides exactly this type of information for hierarchical structures such as trees and lists. For our purposes, the compiler must combine this information with an interprocedural analysis of the program to determine when calls can safely be executed in parallel.

Consider a typical divide-and-conquer procedure, which consists of a divide step, recursive calls, and a conquer step. To introduce futures safely, the compiler must determine that the recursive calls operate on disjoint pieces of the data structure (for example, disjoint subtrees), so that they may be executed in parallel. Based on the disjointness information provided by the analysis, the compiler can place a futurecall on each recursive call. It must then place the touches: to give the subcomputations the maximum opportunity to be overlapped with useful work, each touch should be placed as late as possible, at the latest point that precedes every use of the corresponding value. A typical conquer step consists of some precomputation that does not depend on the results of the recursive calls, followed by code that combines those results; the touches belong between the two, so that the precomputation overlaps the futures.

8. RELATED WORK

Programming distributed-memory machines is a very active area of research. In this discussion, we restrict our attention to the work that is particularly relevant to our own, namely, work that supports dynamic, pointer-based data structures on distributed-memory machines.

Linda [Carriero and Gelernter 1986] shares with Olden the goal of providing distributed data structures, but beyond that the two approaches are quite different. The Linda model provides a global shared address space (tuple space), but no control over the actual assignment of data to processors.

ACM Transactions on Programming Languages and Systems, Vol. 17, No. 2, March 1995.
Also, while Linda provides distributed data structures of arbitrary type, including arrays, it is an explicitly parallel model: the programmer must parallelize a program by hand. In Olden, we start with sequential semantics and let the compiler and run-time system control the parallel execution; the programmer annotates only the data structures. At present, Linda provides more flexibility in data layout than we do, but we believe that in many cases the automatic parallelization our model makes possible outweighs this.

"Para-functional programming" [Hudak and Smith 1986] shares with Olden the property that the sequential semantics of the computation are preserved. In both, annotations are used to control the mapping of a program onto the machine: in para-functional languages, a mapped expression can specify the exact processor on which an expression is evaluated, whereas in Olden we annotate allocations to determine the data layout, and the location of computations follows from the data. The main difference is that para-functional programming starts from a referentially transparent language, which makes preserving the sequential semantics trivial. Olden's dynamic data structures are also not restricted to hierarchical data sets; structures such as graphs are allowed.

Emerald [Jul et al. 1988] and Amber [Chase et al. 1989], developed at the University of Washington, are object-oriented languages that employ thread and
object migration mechanisms to improve locality. Emerald was designed for programming distributed systems. Amber permits an application program to use a network
of homogeneous
processors
as an integrated
multiprocessor.
These languages provide primitives for object location and mobility, and constructs to allow the programmer to indicate whether the thread or the object(s) should move to satisfy an invocation of a remote object. In cases where the programmer has not specified a preference, the compiler uses several heuristics to determine which mechanism is appropriate. Prelude, a language being developed at MIT [Hsieh et al. 1993; Weihl et al. 1991], includes a migration mechanism similar to ours.
The goal of the Prelude project is to develop a language and support system for writing portable parallel programs. Prelude is an explicitly parallel language that provides a computation model based on threads and objects. Annotations are added to a Prelude program to specify architecture-specific implementation details. These annotations specify which of several mechanisms should be used to implement an object or thread. The language supports a variety of mechanisms, including remote procedure call, object migration, and computation migration. The goals of the Olden and Prelude projects are different. Our focus is on effectively parallelizing sequential C programs; thread migration is simply part of the underlying model. Their goal is to develop a new language that facilitates writing portable parallel programs using computation migration as one possible tool.

Orca [Bal et al. 1992] also provides an explicitly
parallel
programming
model
based on threads and objects. Orca hides the distribution of the data from the programmer, but is designed to allow the compiler and run-time system to implement
shared objects efficiently. The Orca compiler produces a summary of how shared objects are accessed that is used by its run-time system to decide if a shared object should be replicated, and if not, where it should be stored. Operations on replicated and local objects are processed locally; operations on nonlocal objects are handled using a remote procedure call to the processor that owns the object.

Split-C [Culler et al. 1993] is a parallel extension of C that provides shared-memory, message-passing, and data-parallel abstractions to the programmer.
Split-C provides a global address space, with both global and local pointers. Data is allocated locally on a specific processor; a global pointer may point to any location in the machine, whereas a local pointer is guaranteed to point to data on the local processor. The programmer manipulates global pointers directly, and Split-C provides a variety of primitives for reading and writing (possibly remote) global data efficiently, including asynchronous operations that retrieve data while other work proceeds. The concept of a global address space is intended to decouple the description of an algorithm from the description of its data layout; in Split-C, a clear description of the layout of a data structure is necessary for writing efficient programs, and the programmer places data and follows global pointers explicitly, whereas in Olden the programmer provides allocation annotations and the run-time system manages the rest. In related work, Lumetta et al. [1993] describe the development of an optimized parallel program using Split-C.

The goal of the Concert project [Chien et al. 1993; Karamcheti and Chien 1993] is to provide efficient execution of fine-grained concurrent object-oriented programs. The Concert run-time system provides a globally shared object space: an object is allocated on a single processor, and objects communicate through messages (invocations), which may be synchronous or asynchronous. The language provides concurrent objects, inheritance, and first-class messages, and the run-time system uses lightweight threads and continuations to support fine-grain concurrency. The compiler and run-time system together support common idioms of concurrent object-oriented programming (such as RPC-style invocations and tail forwarding) efficiently. Concert shares with Olden the goal of supporting dynamic, fine-grained parallel computation, but its programming model, which is explicitly parallel and object-oriented, is quite different from Olden's.

Cid [Nikhil 1994], a recently proposed parallel extension of C, supports a threads-and-objects model of parallelism. A major design goal of Cid is that it run on existing hardware using existing C compilers, providing portability as well as efficiency. Cid threads are lightweight; the programmer may specify on which processor a thread should be run, to take advantage of locality, but once begun, a thread cannot migrate. Sharing is based on global pointers to objects, but these pointers cannot be dereferenced directly. Instead, a thread must explicitly request access to an object in one of several modes (for example, read-only) and is given a pointer to a local copy in return. The run-time system processes these requests, using locking to maintain consistency. Unlike Olden, which migrates the computation to the data automatically, the Cid programmer must name and acquire each element of data before using it; this makes it awkward, for example, to write programs that traverse a structure iteratively.
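The contrast with Olden's implicit migration can be made concrete with a small C sketch of the acquire/release discipline described above. The names (`cid_acquire`, `cid_release`, the mode constants) are hypothetical stand-ins for illustration, not Cid's actual API, and the "global pointer" is simulated within one address space:

```c
#include <assert.h>
#include <string.h>

typedef enum { MODE_READ, MODE_WRITE } access_mode;  /* access modes */

typedef struct { int data[4]; } obj_t;

/* A "global pointer" here is just a handle naming the master copy. */
typedef struct { obj_t *master; } global_ptr;

/* Request access: the caller never dereferences the global pointer
 * itself; it is handed a pointer to a local copy instead. */
static obj_t *cid_acquire(global_ptr g, access_mode mode, obj_t *local_buf) {
    memcpy(local_buf, g.master, sizeof(obj_t));  /* fetch a local copy */
    (void)mode;  /* a real system would take a lock appropriate to the mode */
    return local_buf;
}

/* Release: only write-mode copies are written back to the master. */
static void cid_release(global_ptr g, access_mode mode, obj_t *local_buf) {
    if (mode == MODE_WRITE)
        memcpy(g.master, local_buf, sizeof(obj_t));
}
```

Every traversal step in this style costs an explicit acquire/release pair, which is the source of the awkwardness noted above for iterative traversals.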
9. CONCLUSIONS

We have presented a new run-time model for supporting SPMD execution of programs that use pointer-based dynamic data structures on distributed-memory message-passing machines. To date, work on supporting SPMD execution has concentrated on array-based programs, and the run-time resolution techniques used to produce SPMD programs for arrays do not apply directly to pointer-based programs. We have noted a fundamental property of dynamic data structures: they must be traversed to be addressable. This property makes the simple local tests used by run-time resolution ineffective and therefore precludes applying run-time resolution directly to programs that use dynamic data structures.

Our approach avoids this problem by using thread migration, a technique that is closely matched to the local nature of computation on dynamic data structures. Rather than trying to compute which processor owns a piece of data, we migrate the thread of computation to the processor that owns the data; each processor decides whether it should execute a statement by determining if it owns the relevant data. Coupled with migration, our futurecall mechanism introduces parallelism by allowing the processors to execute different pieces of the computation simultaneously.

We have implemented this model as part of Olden, which runs on the Intel iPSC/860 and the Thinking Machines CM-5. Our performance results for the sample programs are encouraging, especially in light of the poor communication performance of current message-passing systems, and we believe that they will improve as communication systems improve. Our experiences also point out the importance of matching the execution strategy to the structure of the program; we plan to focus on this issue, and on better support for merging the results of divide-and-conquer programs, as part of our continuing research.
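The ownership test at the heart of this scheme can be sketched in C. Here, in the spirit of the model rather than Olden's actual layout, a "global address" is assumed to pack a processor number into its high bits; the shift constant and helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define PROC_SHIFT 24u            /* assumed: processor number in the high bits */
typedef uint32_t gaddr;           /* an encoded global address */

static unsigned owner(gaddr a) { return a >> PROC_SHIFT; }

static gaddr make_gaddr(unsigned proc, uint32_t off) {
    return ((gaddr)proc << PROC_SHIFT) | (off & ((1u << PROC_SHIFT) - 1u));
}

/* A processor executes a statement only if it owns the data;
 * otherwise the thread of computation migrates to the owner. */
static int should_execute(unsigned self, gaddr a) {
    return owner(a) == self;
}
```

The point of the encoding is that the ownership test is a cheap local comparison: no table lookup or communication is needed to decide whether to execute or migrate.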
APPENDIX: SAMPLE FUTURECALL CODE

The following is an example of the code sequence for futurecall in Olden's iPSC/860 implementation.

    // Intel i860 futurecall code sequence.
    // r14 is the future stack pointer; the future cell is on the
    // stack at r30, with an integer empty/full field.
            st.l    r0, 0(r30)         // Mark future cell empty
            st.l    r14, 4(r30)        // Store pointer to next cell
            st.l    fp, 12(r30)        // Store fp in future cell
            addu    0, r30, r14        // Push future cell on future stack
            call    _TreeAdd           // Call TreeAdd
            addu    32, r30, r30       // Delay slot: modify stack pointer
            .long   _TreeAdd.cs        // Pointer to code for the stolen case,
                                       // stored in the first slot after the call
    // Normal return (futurecall has not been stolen): store the
    // return value and continue.
            st.l    r16, 16(fp)        // Store return value
    L.14:
    // Abnormal return (future stolen): the return address had been
    // modified to point to _TreeAdd.cs, which stores the return
    // value in the future cell and calls the register-unwinding code.
    _TreeAdd.cs:
            st.l    r16, 16(r14)       // Store return value in future cell
            ld.l    12(fp), r19        // Load fp from future cell
            mov     r14, r16           // Pass fc pointer as argument to stolen
            orh     h%_TreeAdd.cs, r0, r18
            or      l%_TreeAdd.cs, r18, r18  // Address of stolen-case code
            br      ___stolen          // Call stolen (register-unwinding code)
            ld.l    4(r14), r14        // Delay slot: pop future cell from stack
The following is an example of the code sequence for futurecall in Olden's CM-5 implementation.

    ! CM-5 (SPARC) futurecall code sequence.
    ! %i5 is the future stack pointer; the future cell is located
    ! at %r10, with an integer empty/full field.
            st      %g0, [%r10]        ! Mark future cell empty
            st      %i5, [%r10+4]      ! Store pointer to next cell
            mov     %r10, %i5          ! Push future cell on future stack
            call    _TreeAdd           ! Call TreeAdd
            add     %o7, 12, %o7       ! Delay slot: modify return address to
                                       ! jump over the .word
            .word   __TreeAdd.cs       ! Pointer to code for the stolen case
    ! Normal return (futurecall has not been stolen):
            st      %o0, [%fp+16]      ! Store return value
    ! Abnormal return (future stolen): the steal routine had modified
    ! the return address to point at __TreeAdd.cs, which stores the
    ! return value in the future cell and calls the register-unwinding code.
    __TreeAdd.cs:
            st      %o0, [%i5+16]      ! Store return value in future cell
            ba      ___stolen          ! Call stolen (register-unwinding code)
            ld      [%i5+4], %i5       ! Delay slot: pop future cell from stack
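At the source level, the cells these sequences manipulate implement a simple full/empty protocol. The following sequential C sketch shows the intended futurecall/touch semantics (the names are ours for illustration; in this sketch the future is evaluated eagerly by the caller, whereas on the real machine the continuation may instead be stolen and the cell filled in later by another processor):

```c
#include <assert.h>

typedef struct { int full; int value; } future_cell;

static int tree_add_leaf(int n) { return n; }  /* stand-in for TreeAdd */

/* futurecall: mark the cell empty, run the call, then record the
 * result and mark the cell full. */
static void futurecall(future_cell *fc, int (*f)(int), int arg) {
    fc->full = 0;
    fc->value = f(arg);
    fc->full = 1;
}

/* touch: wait until the cell is full, then return its value.
 * In this sequential sketch the cell is always full by now. */
static int touch(future_cell *fc) {
    assert(fc->full);
    return fc->value;
}
```

The assembly above is the machine-level realization of exactly these two states: the normal return path stores the value directly, while the stolen path stores it through the future cell so a later touch can find it.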
ACKNOWLEDGMENTS

Dave Hanson and Chris Fraser answered many questions about their lcc implementation, and Dave Hanson suggested the use of spaghetti stacks. Kai Li provided the Intel iPSC/860 we used for our initial experiments. Mark Hill recommended TSP as a possible application. Steve Lumetta answered many questions about Power, and Jorg Müller wrote an initial implementation of it. The Voronoi Diagram implementation we used is based on an implementation written by L. Guibas and J. Stolfi that we obtained from Netlib. Joe Hummel suggested Barnes-Hut, and Mukund Raghavachari wrote an initial implementation of it, based on a version written by Josh Barnes. The National Center for Supercomputing Applications (NCSA) at the University of Illinois and the Northeast Parallel Architectures Center (NPAC) at Syracuse University provided access to their CM-5s, and we used the iPSC/860 at ORNL. The anonymous reviewers made many suggestions that helped to improve this article.
REFERENCES

ALLEN, R. AND KENNEDY, K. 1987. Automatic translation of FORTRAN programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542.

ALLEN, F., BURKE, M., CHARLES, P., CYTRON, R., AND FERRANTE, J. 1988. An overview of the PTRAN analysis system for multiprocessing. J. Parallel Distrib. Comput. 5, 5 (Oct.), 617-640.

AMARASINGHE, S. AND LAM, M. 1993. Communication optimization and code generation for distributed memory machines. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation. ACM, New York, 126-138.

ANDERSON, J. AND LAM, M. 1993. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation. ACM, New York, 112-125.

APPEL, A. AND LI, K. 1991. Virtual memory primitives for user programs. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 96-107.

BAL, H., KAASHOEK, M. F., AND TANENBAUM, A. 1992. Orca: A language for parallel programming of distributed systems. IEEE Trans. Softw. Eng. 18, 3 (Mar.), 190-205.

BILARDI, G. AND NICOLAU, A. 1989. Adaptive bitonic sorting: An optimal parallel algorithm for shared-memory machines. SIAM J. Comput. 18, 2, 216-228.

BOBROW, D. AND WEGBREIT, B. 1973. A model and stack implementation of multiple environments. Commun. ACM 16, 10, 591-603.

CALLAHAN, D. AND KENNEDY, K. 1988. Compiling programs for distributed-memory multiprocessors. J. Supercomput. 2, 2 (Oct.), 151-169.

CARLISLE, M., ROGERS, A., REPPY, J., AND HENDREN, L. 1994. Early experiences with Olden. In Languages and Compilers for Parallel Computing, 6th International Workshop, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds. Lecture Notes in Computer Science, vol. 768. Springer-Verlag, Berlin, 1-20.

CARRIERO, N., GELERNTER, D., AND LEICHTER, J. 1986. Distributed data structures in Linda. In Conference Record of the 13th Annual ACM Symposium on Principles of Programming Languages. ACM, New York, 236-242.

CHASE, J., AMADOR, F. G., LAZOWSKA, E., LEVY, H., AND LITTLEFIELD, R. 1989. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles. ACM, New York, 147-158.

CHIEN, A., KARAMCHETI, V., AND PLEVYAK, J. 1993. The Concert system: Compiler and run-time support for efficient fine-grained concurrent object-oriented programs. Tech. Rep. UIUCDCS-R-93-1815, Dept. of Computer Science, Univ. of Illinois, Urbana, Ill. June.

CORMEN, T., LEISERSON, C., AND RIVEST, R. 1990. Introduction to Algorithms. McGraw-Hill, New York.

CULLER, D., DUSSEAU, A., GOLDSTEIN, S. C., KRISHNAMURTHY, A., LUMETTA, S., VON EICKEN, T., AND YELICK, K. 1993. Parallel programming in Split-C. In Proceedings of Supercomputing '93. IEEE Computer Society Press, Los Alamitos, Calif., 262-273.

FRASER, C. W. AND HANSON, D. R. 1995. A Retargetable C Compiler: Design and Implementation. Benjamin/Cummings, Redwood City, Calif.

GERNDT, M. 1990. Automatic parallelization for distributed-memory multiprocessing systems. Ph.D. thesis, Univ. of Bonn, Bonn.

GUIBAS, L. AND STOLFI, J. 1985. Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams. ACM Trans. Graph. 4, 2, 74-123.

GUPTA, R. 1992. SPMD execution of programs with dynamic data structures on distributed memory machines. In Proceedings of the 1992 International Conference on Computer Languages. IEEE Computer Society Press, Los Alamitos, Calif., 232-241.

HALSTEAD, R. H., JR. 1985. Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4 (Oct.), 501-538.

HENDREN, L. J. 1990. Parallelizing programs with recursive data structures. Ph.D. thesis, Cornell Univ., Ithaca, N.Y.

HENDREN, L. J. AND NICOLAU, A. 1990. Parallelizing programs with recursive data structures. IEEE Trans. Parallel Distrib. Syst. 1, 1, 35-47.

HENDREN, L. J., HUMMEL, J., AND NICOLAU, A. 1992. Abstractions for recursive pointer data structures: Improving the analysis and transformation of imperative programs. In Proceedings of the SIGPLAN '92 Conference on Programming Language Design and Implementation. ACM, New York, 249-260.

HIRANANDANI, S., KENNEDY, K., AND TSENG, C. 1991. Compiler optimizations for FORTRAN D on MIMD distributed-memory machines. In Proceedings of Supercomputing '91. IEEE Computer Society Press, Los Alamitos, Calif., 86-100.

HOGEN, G. AND LOOGEN, R. 1993. A new stack technique for the management of runtime structures in distributed implementations. Tech. Rep. 3, RWTH Aachen, Aachen.

HORWAT, W., CHIEN, A., AND DALLY, W. 1989. Experience with CST: Programming and implementation. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation. ACM, New York, 101-108.

HSIEH, W., WANG, P., AND WEIHL, W. 1993. Computation migration: Enhancing locality for distributed-memory parallel systems. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 239-248.

HUDAK, P. AND SMITH, L. 1986. Para-functional programming: A paradigm for programming multiprocessor systems. In Conference Record of the 13th Annual ACM Symposium on Principles of Programming Languages. ACM, New York, 243-254.

JUL, E., LEVY, H., HUTCHINSON, N., AND BLACK, A. 1988. Fine-grained mobility in the Emerald system. ACM Trans. Comput. Syst. 6, 1, 109-133.

KARAMCHETI, V. AND CHIEN, A. 1993. Concert: Efficient runtime support for concurrent object-oriented programming languages on stock hardware. In Proceedings of Supercomputing '93. IEEE Computer Society Press, Los Alamitos, Calif., 598-607.

KARP, R. 1977. Probabilistic analysis of partitioning algorithms for the traveling-salesman problem in the plane. Math. Oper. Res. 2, 3 (Aug.), 209-224.

KOELBEL, C. 1990. Compiling programs for nonshared memory machines. Ph.D. thesis, Purdue Univ., West Lafayette, Ind.

LARUS, J. R. 1989. Restructuring symbolic programs for concurrent execution on multiprocessors. Ph.D. thesis, Univ. of California, Berkeley.

LARUS, J. R. 1991. Compiling Lisp programs for parallel execution. Lisp Symb. Comput. 4, 1, 29-99.

LARUS, J. R. AND HILFINGER, P. N. 1988a. Detecting conflicts between structure accesses. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation. ACM, New York, 21-34.

LARUS, J. R. AND HILFINGER, P. N. 1988b. Restructuring Lisp programs for concurrent execution. In Proceedings of the ACM/SIGPLAN PPEALS 1988, Parallel Programming: Experience with Applications, Languages and Systems. ACM, New York, 100-110.

LEE, D. T. AND SCHACHTER, B. J. 1980. Two algorithms for constructing a Delaunay triangulation. Int. J. Comput. Inf. Sci. 9, 3, 219-242.

LUMETTA, S., MURPHY, L., LI, X., CULLER, D., AND KHALIL, I. 1993. Decentralized optimal power pricing: The development of a parallel program. In Proceedings of Supercomputing '93. IEEE Computer Society Press, Los Alamitos, Calif., 240-249.

MOHR, E., KRANZ, D., AND HALSTEAD, R. H., JR. 1991. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Trans. Parallel Distrib. Syst. 2, 3 (July), 264-280.

MÜLLER, J. 1993. Parallelverarbeitung mittels Workstation-Clustern (in German). Diplomarbeit, Elektrotechnik, RWTH Aachen, Aachen.

NIKHIL, R. S. 1994. Cid: A parallel, "shared-memory" C for distributed-memory machines. In Proceedings of the 7th Annual Workshop on Languages and Compilers for Parallel Computing. August.

ROBERTS, E. S. AND VANDEVOORDE, M. T. 1989. WorkCrews: An abstraction for controlling parallelism. Tech. Rep. 42, DEC Systems Research Center, Palo Alto, Calif. April.

ROGERS, A. AND PINGALI, K. 1989. Process decomposition through locality of reference. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation. ACM, New York, 69-80.

ROGERS, A., REPPY, J., AND HENDREN, L. 1993. Supporting SPMD execution for dynamic data structures. In Languages and Compilers for Parallel Computing, 5th International Workshop, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds. Lecture Notes in Computer Science, vol. 757. Springer-Verlag, Berlin, 192-207.

SCHOINAS, I., FALSAFI, B., LEBECK, A., REINHARDT, S., LARUS, J., AND WOOD, D. A. 1994. Fine-grain access control for distributed shared memory. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 297-306.

VON EICKEN, T., CULLER, D., GOLDSTEIN, S. C., AND SCHAUSER, K. E. 1992. Active messages: A mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture. ACM, New York, 256-266.

WEIHL, W., BREWER, E., COLBROOK, A., DELLAROCAS, C., HSIEH, W., JOSEPH, A., WALDSPURGER, C., AND WANG, P. 1991. Prelude: A system for portable parallel software. Tech. Rep. MIT/LCS/TR-519, Massachusetts Institute of Technology, Cambridge, Mass.

WOLFE, M. 1989. Optimizing Supercompilers for Supercomputers. Pitman Publishing, London.

ZIMA, H., BAST, H., AND GERNDT, M. 1988. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Comput. 6, 1, 1-18.

Received February 1994; revised August 1994; accepted November 1994