Supporting Dynamic Data Structures on Distributed-Memory Machines

ANNE ROGERS and MARTIN C. CARLISLE, Princeton University
JOHN H. REPPY, AT&T Bell Laboratories
and LAURIE J. HENDREN, McGill University

Compiling for distributed-memory machines has been a very active research area in recent years. Much of this work has concentrated on programs that use arrays as their primary data structures. To date, little work has been done to address the problem of supporting programs that use pointer-based dynamic data structures. The techniques developed for array-based programs do not apply, because they rely on the fact that arrays are statically defined and directly addressable; recursively defined dynamic data structures do not have these properties, so new techniques must be developed. In this article, we describe an execution model for supporting programs that use pointer-based dynamic data structures. This model uses a simple mechanism for migrating a thread of control based on the layout of heap-allocated data, and it introduces parallelism using a technique based on futures and lazy task creation. We intend to exploit this execution model using compiler analyses and automatic parallelization techniques. We have implemented a prototype system, which we call Olden, that runs on the Intel iPSC/860 and the Thinking Machines CM-5. We discuss our implementation and report on experiments with five benchmarks.

Categories and Subject Descriptors: D.1.3 [Software]: Concurrent Programming: parallel programming; D.3.4 [Software]: Processors: compilers; run-time environments

General Terms: Languages, Performance

Additional Key Words and Phrases: Distributed-memory machines, dynamic data structures, futures

This work was supported, in part, by NSF Grant ASC-9110766. M. C. Carlisle was supported, in part, by a National Science Foundation Graduate Fellowship and the Fannie and John Hertz Foundation. L. J. Hendren was supported, in part, by FCAR, NSERC, and the McGill Faculty of Graduate Studies and Research.

ACM Transactions on Programming Languages and Systems, Vol. 17, No. 2, March 1995, Pages 233-263.

1. INTRODUCTION

Compiling for distributed-memory machines has been a very active area of research in recent years.1 Much of this work has concentrated on scientific programs that use arrays as their primary data structure and loops as their primary control structure. Such programs tend to have the property that the arrays can be partitioned into relatively independent pieces, and therefore the operations performed on these pieces can proceed in parallel. It is this property of scientific programs that has led to impressive results in the development of vectorizing and parallelizing compilers [Allen and Kennedy 1987; Allen et al. 1988; Wolfe 1989]. More recently, this property has been exploited by researchers investigating methods for automatically generating parallel programs for SPMD (Single-Program, Multiple-Data) execution on distributed-memory machines. In this article, we address the problem of supporting programs that operate on recursively defined dynamic data structures. Such programs

typically use list-like or tree-like data structures and have recursive procedures as their primary control structure. Before we examine why it is plausible to generate SPMD programs for recursive programs that use dynamic data structures, let us first review the most important properties of scientific programs that use arrays and loops, and then point out the fundamental problems that prevent us from applying the same techniques to programs that use pointer-based dynamic data structures.

From a compilation standpoint, the most important property of a distributed-memory machine is that each processor has its own address space; remote references are satisfied through explicitly passed messages, which are expensive. Therefore, arranging a computation so that most references are local is crucial to efficient execution. The aforementioned properties of scientific programs make them ideal applications for distributed-memory machines: each group of related data can be placed on a separate processor, which allows operations on independent groups to be done in parallel

with little interprocessor communication.

The key insight that led to the development of methods for automatically parallelizing programs for distributed-memory machines is that the layout of a program's data should determine how the work in the program is assigned to processors. Typically, the programmer specifies a mapping of the program's data onto the machine, and the compiler uses this mapping to decompose the program's computation into processes. The simplest and most popular method for allocating work to processors is called the owner computes rule: the processor that "owns" the target v of an assignment statement (v = e) computes the value of e and executes the assignment. Control statements, such as conditionals and loops, are executed by all processors. The compiler inserts code into the program to determine, at run-time, which processor executes a particular statement and to perform the necessary interprocessor communication; this strategy is called run-time resolution. The code produced by run-time resolution can sometimes be improved substantially using the arsenal of compile-time techniques, such as dependence analysis and loop restructuring, developed for vectorizing compilers [Allen and Kennedy 1987; Wolfe 1989].

1 For example, see Amarasinghe and Lam [1993], Anderson and Lam [1993], Callahan and Kennedy [1988], Gerndt [1990], Hiranandani et al. [1991], Koelbel [1990], Rogers and Pingali [1989], Wolfe [1989], and Zima et al. [1988].
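As a concrete illustration (a sketch of ours, not code from any of the cited systems), run-time resolution under the owner-computes rule might turn an array update into a guarded loop like the following; owner_of, fetch, store_local, and self are hypothetical names for the mapping query, the message-based remote read, the local write, and the executing processor's id.

extern int self;                          /* this processor's id        */
extern int owner_of(double *v, int i);    /* programmer-supplied mapping */
extern double fetch(double *v, int i);    /* remote read (via messages) */
extern void store_local(double *v, int i, double x);

void update(double *a, double *b, double c, int n)
{
    /* Control statements execute on every processor; only the owner
       of a[i] executes the assignment a[i] = b[i] + c.               */
    for (int i = 0; i < n; i++)
        if (owner_of(a, i) == self)
            store_local(a, i, fetch(b, i) + c);
}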

Run-time resolution works for scientific programs because of the nice mathematical properties of array names. Arrays are static: every element of an array has a global name, and an element's name, its index, is determined at compile-time. The name is used both for addressing and, through the programmer-supplied mapping function, for determining which processor owns the element; thus, every processor can test locally, without communication, whether it is responsible for a given statement. Furthermore, arrays can often be recursively partitioned into smaller, relatively independent pieces, and the independence of operations on those pieces can frequently be determined at compile-time from the mathematical properties of the index expressions.

Dynamic data structures, in general, do not exhibit these nice properties. The nodes of a dynamic data structure do not have global, statically known names; a node can be referenced only by traversing the structure. The problem of determining the independence of operations on a dynamic data structure is substantially harder than the corresponding problem for array references, and because the set of nodes changes at run-time, a compile-time mapping of nodes onto processors cannot be used.

A recent paper by Gupta [1992] suggests an approach of adding names to dynamic data structures so that techniques similar to those used for arrays can be applied: a node's name is determined by its position in the structure (for example, a breadth-first numbering can serve as a naming scheme for a binary tree, and a node's position from the head can name the elements of a list), and a mapping function from names to processors distributes the nodes. This scheme leads to restrictions on how dynamic data structures can be used. First, a node is assigned a name only as it is added to the structure, so the mapping must be done at run-time. Second, adding a node to a structure may change the position, and therefore the name, of other nodes. For example, if a node is added to the front of a list, the rest of the nodes in the list have to be renamed, and some may have to be reassigned to different processors. Further, two structures with independent names cannot easily be combined (for example, two lists cannot be concatenated) without renaming. Finally, renaming cannot be done without interprocessor communication, because node names must be known to all of the processors.

As in earlier approaches, we rely on the programmer to specify the mapping of a dynamic data structure onto the processors.

In our approach, the mapping is specified by associating a processor name with each allocation request, rather than by naming the nodes of the structure. We believe that the dynamic data structures used in most programs share an important property with arrays: they can be recursively partitioned into relatively independent pieces (for example, a tree node and its subtrees, or a list's head and its tail), and the programs that use them tend to be divide-and-conquer in nature. Thus, we see no fundamental problem in distributing such structures, and the work that uses them, over the processors of a distributed-memory machine. The processor that owns a piece of data is simply the processor that was named when the data was allocated.

Before presenting our execution model, let us review our basic assumptions. We use an SPMD model of execution: each processor runs an identical copy of the program and has its own local address space, which holds a stack and a heap. A processor owns the objects allocated in its heap, and objects do not move once they have been allocated. A heap address can be read as a pair <processor, local address>, so heap objects can be named from any processor; a processor can, however, dereference a heap pointer directly only if the object is local. Procedure arguments and stack-allocated data are private to a thread.

The rest of this article is organized as follows. Section 3 describes our execution model, which uses a simple thread-migration mechanism and a lazy form of task creation; Section 4 presents a simple example; Section 5 describes our implementation; in Section 6, we report on experiments with five benchmarks; Section 7 briefly sketches related work; and Section 8 presents our conclusions.

Figure 1 gives an example of programmer-directed allocation, and Figure 2 illustrates the resulting layout: TreeAlloc builds a balanced binary tree over a contiguous range of processors, where ALLOC takes the name of a processor and a size and allocates the requested space on the named processor.

tree *TreeAlloc (int level, int lo, int num_proc)
{
  if (level == 0)
    return NULL;
  else {
    struct tree *new, *left, *right;

    new = (struct tree *) ALLOC(lo, sizeof(struct tree));
    left = TreeAlloc(level-1, lo+num_proc/2, num_proc/2);
    right = TreeAlloc(level-1, lo, num_proc/2);
    new->val = 1;
    new->left = left;
    new->right = right;
    return new;
  }
}

Fig. 1. Allocation code.

Fig. 2. Allocation example.
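For instance, a driver along the following lines (hypothetical, but consistent with the signature in Figure 1) builds a depth-10 tree whose allocation starts on Processor 0, with the processor range halved at each level so that each sufficiently deep subtree lands on its own processor.

extern tree *TreeAlloc(int level, int lo, int num_proc);
extern int num_processors;      /* assumed run-time variable */

tree *build_tree(void)
{
    /* root on Processor 0; subtrees spread over the whole range */
    return TreeAlloc(10, 0, num_processors);
}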

3. EXECUTION MODEL

This section describes the two parts of our execution model: thread migration and thread splitting. Thread migration is a mechanism for migrating a thread of control through a set of processors based on the layout of heap-allocated data. Thread splitting, which is based on continuation-capturing operations, is used to introduce parallelism by providing work for a processor when the thread it is executing migrates.

3.1 Thread Migration

The basic idea is that when a thread of computation executing on one processor, P, attempts to reference a piece of data that resides on another processor, Q, the thread migrates from P to Q and continues executing there. We view the state of a thread's execution as consisting of the contents of its registers, its current stack frame, and its program counter. To migrate a thread, the sending processor P packages the thread's current stack frame, program counter, and registers into a message and sends it to Q; Processor Q sets up a stack frame, loads the contents of the message into the frame and the registers, and resumes the thread at the instruction that caused the migration, which now refers to Q's local memory.

Notice that we do not migrate the entire stack: only the current frame is sent. To accomplish this, the return address of the migrated frame is made to point to a stub routine, and the information needed to complete the return is kept in a stub frame on the sending processor. When the migrated procedure completes, the stub causes the thread to migrate back, and control then returns to the caller. Migrating the full stack is not needed, because the rest of the stack cannot be used until the current procedure returns, and it would be quite expensive, because each migration would then entail packaging and sending the entire stack.

In our implementation, the compiler inserts an explicit test before each heap reference, and a nonlocal reference traps to the run-time migration routine. Standard hardware address-translation mechanisms [Appel and Li 1991] cannot be used to detect nonlocal references, because the name of the owning processor is encoded in the address itself. Section 5 describes the test and the migration mechanism in detail.

3.1.1 Allocation. The programmer is responsible for specifying where data is allocated; each allocation request explicitly names a processor. Allocation is initiated by a call to a library routine, which the compiler expands in-line, including the test that detects that the request names another processor. A nonlocal allocation causes the thread to migrate to the named processor and to migrate back when the allocation has completed.
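For illustration only, the following struct is our own sketch of the state a migration message must carry; it is not Olden's actual message format, which is described in Section 5.1.

#include <stddef.h>

#define NREGS 32                   /* assumed register-file size */

struct migrate_msg {
    unsigned long pc;              /* faulting instruction; execution
                                      resumes here on the receiver       */
    unsigned long regs[NREGS];     /* the thread's register contents     */
    unsigned long ret_addr;        /* replaced by the stub's address     */
    size_t        frame_size;      /* size of the current frame          */
    unsigned char frame[];         /* the current frame only; the rest
                                      of the stack stays on the sender   */
};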

3.1.2 Discussion. The purpose of migrating the computation to the data, rather than fetching the data to the computation, is to exploit locality. Our system relies on the programmer to choose a data layout that places related data together on the same processor; for example, a layout that places a subtree on a single processor allows most of the work on that subtree to be performed without communication. If the programmer has chosen a good layout, then migrating the thread, rather than the data, can substantially reduce the amount of interprocessor communication, and thereby the execution time. Recent work by Hsieh et al. [1993] on migrating computation to fixed data supports this claim.

3.2 Thread Splitting

While thread migration moves the computation to the data, it does not introduce parallelism: when a thread migrates from Processor P to Processor Q, P is left idle. In this section, we describe a mechanism, based on capturing continuations, for introducing parallelism into the execution of a program. In effect, a migration chops the flow of control into two pieces: the thread that migrates and a continuation that is left behind. Our approach is to label, at key points in the program, work that can be done in parallel (either by a parallelizing compiler or, as in our prototype, by the programmer), so that when a thread migrates, the processor it leaves behind can grab one of these captured continuations and start executing it.

3.2.1 Futures. Our continuation-capturing mechanism is a variant of the future mechanism found in many parallel Lisps [Halstead 1985]. In these Lisps, (future e) is an annotation that says that the expression e can be evaluated in parallel with its context. The result of a future is a future cell, which serves as a synchronization point between the parent thread and the child thread that evaluates e. If a thread touches the future cell, that is, attempts to read its value, before the evaluation of e is finished, the touching thread blocks. When the evaluation of e finishes, the result is put in the cell, any blocked thread is restarted, and subsequent touches read the result directly.

Our design is influenced by the lazy-task-creation scheme of Mohr et al. [1991], in which futures are implemented as inexpensive continuation-capturing operations: executing a future pushes the continuation of the parent (the return continuation) onto a work list and starts evaluating the future's body directly, and an idle processor can steal the parent continuation from the work list. If no steal occurs, all of the work is done by one thread, and the future costs little more than a procedure call. Our scheme is also similar to the workcrews paradigm proposed by Roberts and Vandevoorde [1989].

3.2.2 Futures in Olden. In Olden, the future annotation on a procedure call is called a futurecall. Our implementation consists of two data structures and two operations. The data structures are the future cell and the future stack, which is a stack of future cells that also serves

as a work list. A future cell can be in one of four states (Figure 3 gives the allowable state transitions for a future cell):

Empty: This is the initial state of a future cell. It contains the information necessary to restart the parent.

Stolen: This is the state when the parent continuation has been stolen.

Waiting: This is the state when the parent continuation has been stolen and subsequently touched (note: a future cell can be touched by no more than one thread, which is blocked on the cell).

Full: This is the final state of a future cell. It contains the result of the computation.

Fig. 3. Future cell state transitions.
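As a reading aid, the cell and its four states can be pictured with the following C sketch; the field layout is ours, chosen for illustration (Section 5.2 describes how Olden actually packs this information).

typedef enum { EMPTY, STOLEN, WAITING, FULL } future_state;

typedef struct future_cell {
    future_state        state;
    struct future_cell *link;   /* threads the cell into the future stack */
    void               *cont;   /* parent continuation while Empty; the
                                   blocked toucher's continuation while
                                   Waiting                                 */
    long                value;  /* the result, once Full                   */
} future_cell;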

A futurecall allocates a new future cell on the stack, stores the parent continuation and the state Empty in the cell, and pushes the cell onto the future stack; the body of the future is then invoked with a normal procedure call. Upon return from the call, the state of the cell is examined. If the cell has not been stolen, it is popped off the stack, and execution continues normally; in this case, the only extra work required per future is a few instructions for allocating, pushing, checking, and popping the cell (see Section 5 for details). If the cell is Stolen, the routine stores the result in the cell, changes the state to Full, and calls MsgDispatch, the run-time system routine responsible for assigning new work to the processor; if the cell is Waiting, it additionally resumes the blocked continuation stored in the cell.

Stealing is done by MsgDispatch. When a processor has no work to do, such as just after the thread it was executing has migrated, it examines its future stack. If the stack is not empty, it pops the top future cell, changes the cell's state from Empty to Stolen, and resumes the parent continuation stored in the cell.

A touch examines the state of the future cell. If the state is Full, the touch reads the result directly, and execution continues. If the state is Stolen, the corresponding computation has not completed, and the touching thread must block: it stores its continuation in the cell, changes the state to Waiting, and calls MsgDispatch. The touch model in Olden requires that a future cell be touched only by the thread currently executing the future's parent; as a result, the state of a future cell must be either Stolen or Full when a touch is attempted.
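In outline, and ignoring the in-line code generation described in Section 5.2, a touch behaves like the following sketch; capture_continuation is a hypothetical primitive standing in for the real continuation capture.

extern void *capture_continuation(void);   /* hypothetical primitive */
extern void  MsgDispatch(void);            /* run-time scheduler     */

long touch(future_cell *f)
{
    if (f->state != FULL) {        /* must be STOLEN here (see text)  */
        f->cont  = capture_continuation();
        f->state = WAITING;        /* block this thread               */
        MsgDispatch();             /* run other work; the run-time
                                      system fills the cell and
                                      resumes us when the result
                                      arrives                          */
    }
    return f->value;               /* the cell is FULL at this point  */
}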

int TreeAdd (tree *t)
{
  if (t == NULL)
    return 0;
  else
    return (TreeAdd(t->left) + TreeAdd(t->right) + t->val);
}

Fig. 4. TreeAdd.

3.2.3 Discussion. The design of futures in Olden was influenced heavily by our intent to use parallelizing compiler technology [Hendren 1990; Hendren and Nicolau 1990; Hendren et al. 1992; Larus 1991] to insert future annotations automatically. Both of the restrictions mentioned in the previous paragraph follow from this intention. These two restrictions allow us to allocate future cells on the stack rather than the heap, which helps make futures less expensive.

Another important fact to note about the use of futures in Olden is that when a process becomes idle, it steals work only from its own worklist. A process never removes work from the worklist of another process. The motivation for this design decision was our expectation that most futures captured by a process would operate on local data. Although allowing another process to steal this work may seem desirable for load-balancing purposes, it would simply cause unnecessary migrations; instead, load balancing is achieved by a careful data layout. This is in contrast to Mohr et al.'s formulation, where an idle processor removes work from the tail of another process' work queue.

The choice of using a queue versus a stack to hold pending futures is related to the question of who can steal work from a processor's worklist. The breadth-first decomposition that results from using a work queue is desirable when work can be stolen by another process, because it tends to provide large granularity tasks. A stack-based implementation is appropriate for Olden, because when a process steals its own work, it is better for it to steal work that is needed sooner rather than later.

4. A SIMPLE EXAMPLE

To make the ideas of the previous sections more concrete, we present a simple example.3 Figure 4 gives the C code for TreeAdd, a prototypical divide-and-conquer program that computes the sum of the values stored in a binary tree. Figure 5 gives a semantically equivalent version of the program, annotated with futures. In this version, the left recursive call has been replaced by a futurecall,4 and a touch has been inserted where the result of that call is demanded.

To understand what this annotation means in terms of the program's execution, consider the case where TreeAdd, executing on Processor P, is recursively called on a left child tQ that is not local to P (tQ resides on Processor Q).

3 We present more realistic benchmarks in Section 6.
4 Note that using a futurecall on the right subtree is not cost effective, since there is very little computation between the return from TreeAdd and the use of its result.

int TreeAdd (tree *t)
{
  if (t == NULL)
    return 0;
  else {
    tree *t_left;
    future_cell f_left;
    int sum_right;

    t_left = t->left;    /* this can cause a migration */
    f_left = futurecall (TreeAdd, t_left);
    sum_right = TreeAdd (t->right);
    return (touch(f_left) + sum_right + t->val);
  }
}

Fig. 5. TreeAdd with futures.

When the call attempts to fetch the left child of tQ (the dereference of t in the assignment t_left = t->left), the reference traps, and the thread of control migrates to Processor Q, where it continues executing TreeAdd on tQ. Note that, once the thread is on Q, neither the reference to t->right nor the reference to t->val causes a further migration, since Q is the owner of tQ. Meanwhile, Processor P, which would otherwise become idle, steals the future associated with the futurecall that invoked TreeAdd on tQ and resumes the captured continuation: it continues with the computation that follows the futurecall, in parallel with the migrated thread. When the migrated call completes, the thread migrates back to P with its result. A touch of the future cell cannot complete until the corresponding computation has completed; if the result is not yet available, the touching thread must wait.

5. EXECUTION MODEL IMPLEMENTATION

We have implemented the execution model on the Intel iPSC/860, a hypercube machine, and on the Thinking Machines CM-5. Olden consists of two parts: a compiler and a run-time system. The input to the compiler is a C program annotated with futurecalls, touches, and local annotations (described in Section 5.2); the compiler translates it into assembly code with embedded tests for nonlocal references and with calls to the run-time system. The run-time system handles migrations, future stealing, touches, and the bookkeeping needed to support multiple active threads of computation. We describe the run-time system first and then the compiler.

5.1 The Run-Time System

The main procedure of the run-time system is MsgDispatch, which is responsible for assigning work to a processor. It processes events of four kinds: migrate on reference, migrate on return, steal a future, and touch. Figure 6 provides an overview of the run-time system.

Fig. 6. Overview of the run-time system. (The diagram shows user code interacting with the run-time system through migrate on reference, migrate on return, touch of a stolen future, and CallStub.)

Incoming messages are given first priority, because a message is likely to represent work that some other processor is waiting on, and processing it early allows us to hand work back sooner. If there are no messages, the processor steals a future from its own work list; if there is no work at all, we choose to spin, waiting for the next message to be received. In this subsection we focus on migration and on the bookkeeping information that allows us to construct migration messages cheaply; stealing and touching were described in Section 3.2, and the code generated for them is discussed in the next subsection.

5.1.1 Migrate on Reference. When a thread attempts to dereference a nonlocal pointer, it calls Migrate, which packages the thread's state and sends it to the processor that owns the data. We allocate extra space in each stack frame for bookkeeping information (the frame size, the return address, and an area for the callee-save registers), which allows us to construct the migration message in situ, reducing the cost of migrations. A migration message contains the current frame, the register contents, the program counter, and the bookkeeping information. Once the message has been sent, the sending processor calls MsgDispatch to look for new work.

When MsgDispatch selects an incoming migration message to process, it allocates space on its stack for a stub frame and for the migrated frame. It stores the frame size, the return address, and the callee-save registers from the message into the stub frame; copies the migrated frame into the newly allocated space; and adjusts the frame pointer appropriately. It stores the address of CallStub as the return address of the migrated procedure, so that when the migrated procedure exits, control will go to CallStub. Finally, it loads the argument registers (if they exist) and the program counter from the message and resumes execution of the migrated task.5

5 We bypass the SPARC register-window mechanism in our CM-5 implementation.
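Putting the pieces together, the dispatch loop we infer from this description looks roughly like the following sketch; the predicates are hypothetical, and future_cell is the type sketched in Section 3.2.

typedef struct future_cell future_cell;  /* sketched in Section 3.2 */

extern int  message_pending(void);
extern void process_message(void);       /* migration, return, or touch */
extern future_cell *future_stack_top;    /* this processor's work list  */
extern void steal(future_cell *top);     /* resume parent continuation  */

void MsgDispatch(void)
{
    for (;;) {
        if (message_pending())           /* messages get first priority */
            process_message();
        else if (future_stack_top != NULL)
            steal(future_stack_top);     /* steal only our own futures  */
        /* otherwise spin until a message arrives */
    }
}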

5.1.2 Migrate on Return. CallStub gains control when a migrated procedure exits. It copies the procedure's return value, the return address, and the contents of the callee-save registers stored in the stub frame into a return message; sends the message to the processor from which the procedure migrated; deactivates the stub frame; and calls MsgDispatch to select the next task. When the return message is processed on the receiving processor, the return values and the callee-save registers are pulled from the message, and execution resumes at the return address: the procedure returns in the same state as when it began execution, as if it had never migrated.

Note that if a migrated procedure migrates again, a chain of stub frames results, and the return retraces the chain, migrating several times. A simple optimization avoids this forwarding by sending the contents of the stub along on a subsequent migration, so that the eventual return goes directly back to the original processor; this is analogous to the forwarding optimization used in Concurrent Smalltalk [Horwat et al. 1989]. A similar optimization, in the spirit of a tail-call, applies when the migrated call is the last action of the procedure: the return can be sent directly to the appropriate processor instead of passing through trivial stubs.
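In the same illustrative style as the migration-message sketch in Section 3.1 (ours, not Olden's actual layout), a return message might carry:

#define NSAVED 8                       /* assumed callee-save count */

struct return_msg {
    long          value;               /* the procedure's return value  */
    unsigned long ret_addr;            /* saved in the stub frame when
                                          the frame migrated            */
    unsigned long callee_save[NSAVED]; /* restored before resuming the
                                          caller                        */
};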

5.2 The Compiler

The compiler is derived from lcc, an ANSI C compiler [Fraser and Hanson 1995]. We ported the backend to generate code for our target processors (i860 and SPARC) and added a command-line option to generate code for our execution model. The modifications consist of code to test pointers on all pointer dereferences and to call Migrate and the touch machinery as necessary.6

5.2.1 Testing Memory References. As previously mentioned, each heap pointer consists of a tag indicating the processor where the data resides7 and a local address. A process attempting to dereference a nonlocal pointer must migrate, so it is necessary to test each heap pointer dereference. We modified lcc to generate code that examines the tag of the pointer to see if the reference is local: if it is, the code removes the tag and dereferences the resulting local address; otherwise, it calls the library routine Migrate, explicitly generating the migration. The same check serves to reveal dereferences of null pointers.

The additional overhead of testing each pointer dereference is three integer instructions and a conditional branch on the i860, and four integer instructions and a conditional branch on the SPARC. This overhead can be reduced by observing that only the first reference in a sequence of references to the same pointer can cause a migration. For example, in TreeAdd (as discussed in Section 4), only the reference t->left can cause a migration.

6 lcc does not have an optimizer.
7 The range of possible heap addresses dictates the maximum size of the processor tags. The CM-5 implementation uses the seven high-order bits to hold the tag, and the iPSC/860 implementation uses the eight high-order bits.
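The test itself can be pictured as follows. This is a sketch of ours; the macro and variable names are hypothetical, and the real compiler emits the sequence in-line rather than as a function.

typedef struct tree { int val; struct tree *left, *right; } tree;

#define TAG_BITS  8                     /* eight on the iPSC/860       */
#define OWNER(p)  ((unsigned long)(p) >> (32 - TAG_BITS))

extern unsigned long self;              /* this processor's tag value  */
extern void Migrate(void *p);           /* run-time entry point        */

int get_val(tree *t)
{
    if (OWNER(t) != self)  /* compare the pointer's tag with our own   */
        Migrate(t);        /* nonlocal: the thread resumes on t's
                              owner, where the test now succeeds       */
    return t->val;         /* the generated code also strips the tag
                              before the load; omitted here            */
}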

int TreeAdd (tree *t)
{
  if (t == NULL)
    return 0;
  else {
    tree *t_left, *local_t;
    future_cell f_left;
    int sum_right;

    local_t = local(t);    /* this can cause a migration */
    t_left = local_t->left;
    f_left = futurecall (TreeAdd, t_left);
    sum_right = TreeAdd (local_t->right);
    return (touch(f_left) + sum_right + local_t->val);
  }
}

Fig. 7. TreeAdd with the local annotation.

A simple compiler optimization takes advantage of the fact that only the first dereference of a pointer can cause a migration: once the pointer has been dereferenced, subsequent references through it are guaranteed to be local.8 Olden provides the local construct to express this. The expression local(t) returns a local version of the pointer t, causing a migration iff t is nonlocal; all references through the resulting local pointer are guaranteed to be local and therefore require no tag test, so the compiler can use the address directly. In effect, a dominating dereference removes the tag checks from the references it dominates. Figure 7 shows the result of applying this optimization to the TreeAdd example; note that only the dereference of local_t in the assignment to t_left can cause a migration.

5.2.2 Compiling Futures. We now examine our implementation of futures in detail. We presented the basic ideas underlying futures in Section 3.2; in this subsection, we return to the topic and discuss several optimizations. When the compiler encounters a futurecall annotation, it generates code in-line to implement the future. The generated code performs the following steps: allocate a future cell on the stack; store the continuation (which consists of a frame pointer and a program counter) in the cell; initialize the cell's state to Empty; push the cell onto the future stack; and call the future's body as a normal procedure call. After the call, the generated code checks the state of the cell: if it is Empty, the cell is popped from the stack, the return value is stored, and execution continues.

8 This observation relies on the fact that function calls always return, on exit from the called routine, to the original processor.
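Schematically, and reusing the future_cell sketch from Section 3.2, the in-line expansion behaves like the following sketch of ours; the real expansion also stores the parent continuation in the cell and uses the packed representation described below, both of which are elided here.

extern future_cell *future_stack_top;  /* per-processor work list */
extern void MsgDispatch(void);
extern int  TreeAdd(tree *t);

int example(tree *x)        /* stands for f = futurecall(TreeAdd, x) */
{
    future_cell f;
    f.state = EMPTY;                /* cell lives in this stack frame */
    f.link  = future_stack_top;     /* push onto the future stack     */
    future_stack_top = &f;

    int r = TreeAdd(x);             /* the body runs as a normal call */

    if (f.state == EMPTY) {         /* parent was never stolen        */
        future_stack_top = f.link;  /* pop the cell and continue      */
        return r;
    }
    f.value = r;                    /* parent was stolen during a     */
    f.state = FULL;                 /* migration: publish the result  */
    MsgDispatch();                  /* and go look for new work       */
    return 0;                       /* not reached */
}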

In the case that the cell is not Empty when the call returns (that is, the future has been stolen), the in-line code must instead fill the cell and hand control to the run-time system, as described in Section 3.2. A future cell is stolen only if the function executing the future's body, or one of its descendants, migrates; stealing is therefore expected to be much less common than ordinary completion. Our optimizations exploit this fact: we make the expected case inexpensive at the expense of increasing the cost of steals. With these optimizations, the in-line overhead of a futurecall in the expected case, where the future is not stolen, is only seven instructions on the i860 (including four stores and one load).

5.2.3 Optimizations. The first optimization concerns the callee-save registers. When a future is stolen, the parent continuation must resume with the callee-save registers holding the values they had at the future call site, and by the time of the steal those registers may have been modified by the called function or by one of its descendants. Rather than save all of the callee-save registers at each future call, we have each function store the callee-save registers it uses in its own frame, and when stealing a future we perform register unwinding: the run-time system traverses the chain of frames, starting from the frame of the function that was executing last, restoring callee-save registers from each frame as it goes, until it reaches the frame containing the future cell. This moves the cost of saving registers from every future call to the much rarer steal. Consider the scenario shown in Figure 8: F performs futurecall(G, ...); G calls H; and H migrates. To steal the future, the processor restores the registers stored by H, and then those stored by G, thereby obtaining their values at the site of the future call in F.

A second optimization eliminates stores in the futurecall sequence by exploiting two facts: the future stack is threaded through the stack frames as a linked list, and future cells are four-byte aligned,

Fig. 8. A future stack scenario: F performs futurecall(G, ...); G calls H; H migrates. Each frame records a frame pointer (fp) and a return address (ret addr).

so the bottom two bits of the link field are guaranteed to be zero. We reuse these two bits to store the state of the future cell in the link field itself. By representing Empty as zero, the store that links a new cell into the future stack also marks it Empty, so we get the Empty state for free; the state can later be changed by setting the low bits, and the link is recovered by masking them off.

The touch of a future is also compiled in-line. The generated code examines the state bits: if the cell is Full, it contains the result, and the touch simply loads it; otherwise, the library routine Suspend is called. Suspend stores the program counter and the values of the callee-save registers needed to synchronize in the cell, marks the cell as Waiting, and calls MsgDispatch; when the result arrives and the cell is filled, the suspended thread is found and resumed. With both optimizations, a futurecall requires only five in-line instructions on the CM-5 in the expected case. (The Appendix gives sample code.)
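In C terms, the packing works like the following macros (names ours), applied to the future_cell sketched in Section 3.2, with the link word now carrying the state in its two low bits instead of a separate state field.

/* future_cell as sketched in Section 3.2, with state packed in link */
#define STATE_MASK 0x3UL          /* low bits free: cells are aligned */
#define CELL_STATE(c) ((unsigned long)((c)->link) & STATE_MASK)
#define CELL_LINK(c)  ((future_cell *) \
                       ((unsigned long)((c)->link) & ~STATE_MASK))
#define MARK(c, s)    ((c)->link = (future_cell *) \
                       ((unsigned long)CELL_LINK(c) | (unsigned long)(s)))

/* Because Empty is encoded as 0, the store that links a new cell into
   the future stack also puts it in the Empty state, for free. */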

5.3 Stack Management

So far, we have glossed over a problem in the management of the stack. When a thread migrates or a future is stolen, the migrated thread and the continuation that the processor picks up share the stack: the frames belonging to the migrated thread must be preserved, while the new thread of control allocates frames of its own. With a conventional single stack, the new frames could overwrite a portion of the stack that must be preserved. To avoid this problem, we adopt a simple stack management technique related to the spaghetti stacks of Bobrow and Wegbreit [1973].

Bobrow and Wegbreit's method splits a frame into two parts: a basic frame, which contains the local data of the procedure, and an extension, which contains control and access information. Frames can be shared among several environments; associated with each basic frame is a counter of the number of active extensions, and a frame is freed when its count returns to zero. Frames are allocated contiguously at the end of the stack whenever possible, but the use of a coroutine, for example, may cause the stack to be split, so that new frames go to a different stack segment. Since, in our model, no frame is ever shared by more than one thread of control, we can simplify the method considerably.

We maintain a single stack per processor, in which live frames and garbage frames may be interleaved, and a global end-of-stack pointer. One word at the end of each frame, called splink, is reserved for stack management; a live frame always maintains a null pointer at splink. New frames are allocated at the end of the stack. When a procedure exits and its frame is at the end of the stack, the end-of-stack pointer is simply adjusted, deallocating the frame; the exit code then follows the splink words of the frames below, deallocating any garbage frames, until it reaches a live frame, whose splink is null. When a procedure exits and its frame is in the middle of the stack, the frame cannot be deallocated immediately; instead, it is marked as garbage by placing a non-null pointer at splink. A frame that will never be resumed on this processor, such as the frame of a thread that has migrated, is deactivated in the same way. The garbage frames thus form linked lists that allow them to be reclaimed lazily.9

Figure 9 provides a simple example: when F calls G, a new frame is allocated at the end of the stack, and when G exits, the stack returns to its initial state. Figure 10 provides a more complex example. F calls G; G migrates, and the run-time system gives control to the continuation e. The frame e is not at the end of the stack; hence, when it exits, a pointer is placed at the end of its frame, marking it as garbage. The exit for d is similar. When G resumes execution and returns, the frame belonging to the migrated thread (the topmost frame) is deactivated, and when F finally exits, the list of garbage frames below it is followed, reducing the end of the stack past all of the dead frames.

Fig. 9. Simplified spaghetti stack: the simple case.

Fig. 10. Simplified spaghetti stack: a more complex case.

9 A similar method was developed by Hogen and Loogen [1993] in the context of parallel implementations of functional and logic languages.
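Our reading of the exit protocol can be summarized in the following sketch; the frame layout and helper names are hypothetical.

#include <stddef.h>

typedef struct frame frame;
extern char  *end_of_stack;             /* global end-of-stack pointer */
extern frame *garbage_link(char *top);  /* hypothetical: reads the splink
                                           word of the frame just below
                                           `top`; NULL means it is live */

void frame_exit(frame *f, size_t frame_size, void **splink)
{
    if ((char *)f + frame_size == end_of_stack) {
        end_of_stack = (char *)f;       /* pop this frame, then pop any */
        while (garbage_link(end_of_stack) != NULL)   /* garbage frames  */
            end_of_stack = (char *)garbage_link(end_of_stack);
    } else {
        *splink = f;                    /* mid-stack exit: mark the
                                           frame as garbage             */
    }
}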

5.3.1 Discussion. The overhead of the simplified spaghetti stack over a conventional stack is low: one additional instruction on procedure entry, to store zero at splink (a live frame maintains a null splink), and a short sequence in the procedure exit code to check whether the frame is at the end of the stack and to deallocate garbage frames (four additional instructions on the i860, and six, including an add, a load, and conditional branches, on the CM-5). Since dead frames in the middle of the stack are not deallocated immediately, a stack can grow larger than it would in a single-stack system, and the expected size of a stack becomes harder to predict: in principle, a processor could run out of stack space even when its live frames are small. We could achieve optimal stack use by performing compaction as necessary, copying live frames down over the garbage. In our experience, however, the space lost to dead frames is small, so the current run-time system does not compact the stack and does not check for the need to do so.

We chose this design over two alternatives: multiple disjoint stacks and heap-allocated frames. A principal reason is historical: when we started, our prototype used a simple deallocate-on-return policy, and the simplified spaghetti stack was developed as a compromise between that stack-based scheme and heap allocation of frames. Multiple disjoint stacks would not have made the design simpler, since stack usage patterns differ from program to program, so the best allocation policy could not be determined in advance, and overflow would lead to copying or to excessive growth; heap-allocated frames would have incurred more overhead than necessary on every call. Furthermore, keeping a single stack made it easy to use existing code and libraries that were not compiled by Olden's compiler.

6. RESULTS

The previous section described our prototype implementations of the execution model for the Intel iPSC/860 and the Thinking Machines CM-5. In this section, we report results for five benchmarks: TreeAdd, Power, Bitonic Sort, Traveling Salesman, and Voronoi Diagram. TreeAdd is a very simple divide-and-conquer program; Power is similar in structure but much more computationally intensive. Bitonic Sort is a classic divide-and-conquer algorithm with two levels of recursion; we experimented with two versions, which we call Original Bitonic Sort and Bitonic Sort, that differ in their data layout, and we include both because the difference illustrates the effect of locality. Traveling Salesman is also a divide-and-conquer algorithm, with a merging phase for combining subresults. And finally, Voronoi Diagram is a relatively complex program that uses a different pattern of recursion and includes an extra call to malloc. All five benchmarks use trees as their primary data structures.

We performed our experiments on the iPSC/860 at Princeton University and on the CM-5s at two Supercomputing Centers: NPAC at Syracuse University and NCSA at the University of Illinois. The timings for the iPSC/860 and the CM-5 are given in Tables I and II; the reported timings are averages over three runs. Timings on the CM-5 were done in dedicated mode; the iPSC/860 timings did not vary significantly between runs. Each benchmark includes three kinds of timings: a sequential implementation (sequential), which was compiled without the overhead of futures, pointer testing, or the spaghetti stack; a one-processor implementation (one), which includes this overhead; and runs on increasing numbers of processors.

The benchmark programs were hand-annotated with futurecalls, touches, and local annotations. The goal of this experiment is to show that the execution model allows efficient parallelization of programs that use pointer-based dynamic data structures. We claim that the annotated programs could be generated automatically from the raw C programs using existing compiler technology; we do not claim that our annotations are the most aggressive possible.

6.1 TreeAdd

TreeAdd is a very simple program that recursively walks a tree, summing the values stored in its nodes (see Figure 7). We report results for a tree of size 1024K nodes, allocated using the allocation scheme presented in Figure 1, with the subtrees equally distributed among the processors.
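As a worked reading of the first row of Table I below: on 16 iPSC/860 processors, TreeAdd runs in 0.17 seconds against a sequential time of 1.94 seconds, a speedup of 1.94/0.17, or about 11.4, and hence an efficiency of 11.4/16, or roughly 71%.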

Table I. iPSC/860 Timings in Seconds

                                           Processors
Benchmarks                    sequential    one       2       4       8      16
TreeAdd (1024K)                   1.94      2.62    1.31    0.66    0.33    0.17
Power (10,000)                  517.28    517.46  260.23  130.43   65.88   34.20
Original Bitonic Sort (32K)       4.87      5.69    3.23    2.29    2.04    2.09
Original Bitonic Sort (128K)     29.21     64.67   15.18    9.38    6.69    5.36
Bitonic Sort (32K)                3.60      4.45    2.64    2.01    1.91    2.02
Bitonic Sort (128K)              17.46     21.21   11.98    7.96    6.21    5.05
Traveling Salesman (32,767)      65.79     66.02   33.49   17.48    9.61    5.77
Voronoi Diagram (64K)            13.60     20.17   13.06   11.35   16.00   33.16
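As a similar worked example from Table I: Power on 16 processors runs in 34.20 seconds against 517.28 seconds sequentially, a speedup of 517.28/34.20, or about 15.1, for an efficiency of roughly 95%.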

Table II. CM-5 Timings in Seconds

                                           Processors
Benchmarks                    sequential    one       2       4       8      16      32
TreeAdd (1024K)                   4.49      5.94    2.99    1.50    0.75    0.37    0.19
Power (10,000)                  286.59    313.04  160.75   79.73   43.87   25.03   11.87
Original Bitonic Sort (32K)       8.50     10.79    5.80    3.49    2.40    1.88    1.67
Original Bitonic Sort (128K)     40.66     52.56   27.15   15.72    9.86    6.65    4.89
Bitonic Sort (32K)                6.55      8.48    4.93    2.90    2.10    1.75    1.64
Bitonic Sort (128K)              31.83     41.18   21.81   13.08    8.52    5.97    4.40
Traveling Salesman (32,767)      43.35     45.45   22.58   11.72    6.47    3.91    2.74
Voronoi Diagram (64K)            48.46     62.88   38.03   25.87   24.83   39.33   69.30

Fig. 11. TreeAdd performance. (Speedup versus number of processors on the iPSC/860 and the CM-5, plotted both against the sequential time and with respect to the one-processor time.)

Fig. 12. Power performance. (Speedup versus number of processors on the iPSC/860 and the CM-5.)

The two lower curves in Figure 11 plot true speedup, that is, sequential time/parallel time on P processors, where P is the number of processors; the two upper curves plot speedup with respect to the one-processor implementation, using the one-processor timing as the baseline. TreeAdd displays near-perfect linear speedup with respect to the one-processor implementation. The true efficiencies, 72% on the iPSC/860 and 74% on the CM-5, are lower, because TreeAdd does very little real work with which to amortize the overhead of the pointer tests, the futures, and the spaghetti stack. The benchmark nonetheless demonstrates that migrations and futures can exploit the parallelism available in a tree-structured computation.

6.2 Power

Power solves the Power-System-Optimization problem [Lumetta et al. 1993]: given a power network, represented by a tree, it uses pricing rather than direct control to optimize the community's use of power. The computation is structured as a series of phases, repeated until convergence is reached: an upward pass propagates demand information from the leaves (the customers) to the root, and a downward pass propagates pricing information from the root back to the leaves.

The input configuration can be stated as follows: the root (the power plant) has 10 main feeders; each feeder has 20 lateral nodes; each lateral is the head of a line of five branch nodes; and each branch has 10 leaves, the customers. In all, the tree has 1,201 internal nodes and 10,000 customers. The tree is distributed in the following way: the root is placed on Processor 0; the laterals are divided evenly among all of the processors in blocks, based on an in-order numbering; each branch is placed on the same processor as the head of its line; and the leaves are placed on the same processor as the corresponding branch.

We report speedups for 10,000 customers in Figure 12. Power displays near-perfect linear speedup over the available numbers of processors, because the main work is done at the nodes, very little information propagates between processors, and few migrations occur. Power's whole-program behavior (including the time to build the tree) is very similar to TreeAdd's, but on a much smaller scale, and our results are comparable to Lumetta et al.'s results.

6.3 Bitonic Sort

Bitonic Sort is a classic divide-and-conquer sorting algorithm. Our implementations use a binary tree to store the numbers to be sorted; the merge phase of the sort swaps the left and right subtrees of a node, and the recursive calls on the two subtrees can easily be performed by two processors in parallel. The same call graph is used in both versions of the program; the two versions differ in their data layout, such that having two processors perform a swap is expensive in the original layout but mostly local in the improved one. Maintaining the improved layout is itself expensive, as it requires an extra call to malloc, but it still pays off. We report results for the two problem sizes (32K and 128K numbers) in Figure 13; the figure does not include the time to build the tree, which accounts for a substantial amount of work, as much as the sort itself.

an un-

of work

for

per

an improved

displays

paral-

version.

a loss of about

would

sequences The

and

tree

this

Sort.

of Figure

A binary

to avoid

the

(128K

set of integers

then

placed

different Bitonic

original

for

benchmark

ACM

skew

problem half

and

been

phases

two

right

result.

to maintain

phase

recursive

bitonic

sorted

distributed

sorting

curves:

creates

sequences, have

of Original each

backward.

one

the

modified the

tree-building four

a random

and

are evenly

been for

allocates

algorithm

bitonic

The

has

that not

overhead,

moving

the

on a medium-sized malloc

System

is that

forward

to generate

only

13 contains

Bitonic

I

I

performance.

Our

1989]

one

implementation

necessary

lelism,

them

them

at a fixed

respectably

this

two

speedup

The

Sort

in Power.

Nicolau

sequences.

parallelizable

each

Bitonic

First,

merges

benchmark

of Figure for

and

to create

and

tree

on them:

then

on these

We

13.

between

of the

[Bilardi sorts

and

subtrees

I

Sort

Sort

forms

difference

leaves

I

Processors

Processors

tree.

I

12 16202428

not

locality

show for

on Programmmg

30%

of the

perfect

the

second

Languages

speedup.

speedup sort

Even

because

the

without cost

of

is expensive.

and Systems,

Vol

17, No

2, March

1995.

254

.

Anne

Rogers

et al.

16

-- -a --

TSP (860)

~

TSP (CM5)

1

/

4 1 I I

I

I

I

124

8

I

I

I

I

I

12 16 2024

28 32

Processors Fig.

Traveling

6.4 This

ever,

computes

Salesman

Karp’s

[1977]

solution et al.

the

These

tree

is laid

for

computing

results

do not

The

Voronoi

Diagram

generates

a random

points

that

Rather

than

on

an

In our

for

divides

computing

the of

implementation,

alternately

we use the

circuit implementation

how-

a set of points

an exponential

closest-point

exact

heuristic

[Cormen

final

subdiagrams, the

whole

depth 64K

the

points

displays

in a balanced the

dividing

presented

in Figure

distributed

tree-building

the

set

edges

to knit

set.

The

points

are

1. We report

points

in Figure

14.

phase.

the

tree

phase

sorted

Diagrams

at the

Schachter

the

to processors

and

the

two

hull the

subtrees

merges

them

of the

Voronoi

so that

points,

To compute

of the

convex

to form

1980]

for these

median),

along

together

assigned

Lee and Diagram

by X coordinate.

Voronoi walks

them

1985;

a Voronoi

of points

speedup

reported the

a substantial phases

obtained

speedup

tree

no parallelism

on Programrnmg

merge

Stolfi

two

Diagram

subtrees

at a

distributed.

15, we report The

and

computes

computes

adds

or building

Tmnsactmns

of the

binary

algorithm

and

merge

scheme uniformly

[Guibas and

The

almost

later

cost

result.

points.

represents

the

benchmark

are evenly

In Figure for

32,767

set of points

(thereby the

allocation

for

Diagrams

Diagram,

recursively

ACM

hamiltonian

based

[1993].

in a tree

MulIer,

best

is

by Muller

level.

the

include

are stored

a Voronoi to form

using a tour

Voronoi

and

of the

program

are given

at each

out

Building

since

cities

sizes, following

6.5

for

performance.

1989].

results

fixed

The

algorithm

coordinate

for small

The

The

Salesman

an estimation

problem.

partitioning

we assume

by X or Y

Traveling

Salesman

benchmark

Traveling

14.

used for

Languages

to represent

two

fraction reference

does

reasons. of the

subsets and Systems,

for not

building include

the First,

sets the

computation. alternately Vol

17, No

on 2, March

the the

Voronoi cost

Diagram

of generating

of points.

This

example

merge

phase

is sequential

But

more

importantly,

different 1995

processors,

the

Supporting -- *-

- Voronoi

- =-

Voronoi

Dynamic

Data Structures

255

o

(860) (CM5)

8-

6a % ~

4-

Wa .& . ...-.*---

2#.o

1-

i2

*--- ----a ------.-

4

8

i2

-o i6

Processors Fig.

available occur

parallelism

during

because

Voronoi

is swamped

these

merges.

messages

by

Diagram

the

Migrations

are expensive.

performance.

cost

of the

“ping-pong”

are expensive

in our

migrations

current

that

implementation,

10

Discussion and Future Work

6.6

In early

experiments,

marks.

We

location from

we observed

traced

this

routines.

the

explicitly.

The

overhead

futures,

and

benchmark

by

On

overhead.

Pointer

the

user-level

trap

this

handler

of such

a scheme

and

1991].

We

top

of the

Li

built

on

group

will

[Schoinas

user-level

the

solution

accounts

depend are

for

heavily

currently

that

1994],

manage

and

80%

of the

from

points

546ps

message,

on the

on Programmmg

may

Wisconsin

trap

a

[Appel

of Olden Wind

mechanism

be

and

effectiveness

a user-level

the

for

Tunnel

registering

reference. out

that

pointer-based

a 250-byte

It

an implementation

a simple

on a nonlocal

overof the

hardware The

a

80%

overhead.

references.

for

processor

of this and

translation

with

capturing

one

component 44%

effect

acquired

overhead

and

between

address

system

benchmark

the cost Of sending

sequential

nonlocal

provides

of parallelizing

Transa.t]ons

to

the

the

the

al-

memory

testing,

measure

on the cost of servicing

which

are invoked

as 496LM on the iPSC/860 ACM

the

for

experimenting

Diagram

to the problem

24%

by using and

managing

a significant

accounts

bench-

memory

allocating

of pointer

the

of these

We reduced

and

can

between

Tempest/Blizzard

Voronoi

10We have ~easured message,

it

cost

We

represents

overhead

problem:

times

the

stack.

testing

to detect

et al.

handlers

Finally,

testing

several iPSC/860’s

synchronization, a few

difference

for

of the

a related

includes

spaghetti

pointer

CM-5,

to reduce

from only

technique

the

speedup

behavior

a global

malloc

the

iPSC/860,

On

possible

requires

computing

the

superlinear

quadratic suffers

calling

of our

implementations. head.

CM-5

managing by

the

system

problems

storage

to

The

operating

of these

nal

15.

our

model

programs

which

is the average

is not

for

the fi-

distributed-

size Of a figratiOn

CM-5. Languages

and Systems,

Vol

17, No

2, March

1995

256

Anne

Rogers

et al.

machines.

Our

results

.

memory better

to use software

used

for

caching

are

to be examined We

have

Eicken

implemented

on the

performance.

computation

migration

7. AUTOMATIC This

article

was

are needed

to determine

and

it

in

structures

the

1988 b].

In

our

Hendren cally

type

parts

the when

can

RELATED

cessors

support

et

approaches space (tuple Tmnsactmns

objective

to build

compiler

we

between

mentioned as the

be executed

earlier,

target

for

analyses

a

that

in parallel

operations.

data

safely,

See Rogers

the

For

the to:

conquer

step

the

conquer

maximum

provides

on

tests

to

of precomputation,

part,

followed

if it is safe and

informa-

divide-and-conquer-

consists

step,

stati-

Based

appropriate

a typical

in parallel,

1990;

the relationships

of lists.

conquer

and

automatically. [Hendren

analysis

determine

to determine

to determine

noncircularity the

the

1988a;

structure,

analysis

defined

data and

Hilfinger

or futures

designed

procedure

be used

with

data

matrtx

we have

implementing

to ensure

program

A typical the

we

the

the

analysis

futures.

must

1992],

and to approximate

or

that

et al.

hierarchical

path

structure.

say

recursive

and

calls

previously

with

Larus

function

tree-like,

techniques

Hendren

1989;

of the

on the

analysis,

calls

the

programs

is to analyze

to introduce

for

for

we restrict al.’s

objective to

the

1990;

an interprocedural

analysis

of supporting

Carriero common

present

by

post-

to execute

the

determine

if the

precom-

place

touches

in

the

the

concurrency.

WORK

Programming goal

it

upon

[Larus

parallel

this

recursive

position

Lisp

to insert

be overlapped

possible

of necessity

the

build

Nicolau

of subtrees,

is safe

Our

and for

of the

procedure, the

As

use

touch

imperative

pieces

from

it

we can

are indeed

subcomputations

putation latest

[von to im-

dereference.

to

may and

to disjoint

disjointness

available

computation.

ACM

structures

recursive

various

cases,

1990],

by

At

choose

model. briefly

futurecall

compiler

refer

different

followed

pointer

intention

subcomputations the

Hendren

and Nicolau

determine

our

1990;

we plan

information

each

need

be used

automatically

be

substantially. Messages

can

encouraging.

that for

Active this

may

messages

processors costs

using

quite

we outline

analysis,

purposes,

about

multiple

how

execution

our

of parallelizing

information

if the data

between

8.

this

these

to use this

For

from

it

The

details.

computations

then

by

section,

place

context

both

on Olden’s heavily

restructuring

which

others,

communication

been

caching

and

migration.

cache

and

which

more

[Hendren

CURARE

have

software

case,

are exploring

heuristics

to

to support

proposed

tion

for

order

results

In this

is best

[1993]

In

and

this

data

user-invalidate

CM-5,

concentrated

compiler.

et al.

when

can reduce

compile-time

influenced

parallelizing

where

in

computation

PARALLELIZATION

has

design

and

they

Initial with

that

than

cheaper,

a simple

1992]

are experimenting

its

much

rather

simultaneously,

et al.

prove

indicate

caching

work

machines

discussion

[1986]

work

Olden,

a shared

is a very

to papers

pointer-based

with on

parallel

our

dynamic

on

data

distributed

namely, linked

that

structures.

on Pmgrammng

Languages

and Systems,

Vol

of research.

structures

on

a mechanism

for

In

are quite different. The Linda model space), but no control over the actual

area

particularly

Out

relevant

to

structures.

data

providing

active seem

the

details,

Linda

shares

distributed

however,

the

two

provides a global shared address assignment of data to processors.

17, No

2, March

1995.

a

pro-

Supporting Also,

while

cluding

Linda

arrays,

programmer

to

in Olden control automatic

and

with

let

the

have

annotated

both

distributed

by

parallel

hand.

The

structures

a run-time

data model

model

that

.

in-

requires

data

but

257

structures,

that

distributed

at present,

the

structures

we provide

we believe

“Para-functional

programming”

cases,

sequential

semantics

the

computation.

In

are

mapping

expression,

layout

makes

Emerald

[1986] In

to distribute

is that

oriented

and

annotations

data

which

a program

layout

Smith’s

the

the

ference

of arbitrary

it is an explicitly

to hierarchical

Olden.

are used

however, run

parallelize

data

flexibility sets,

Data Structures

exact

is amenable

to

parallelization.

Hudak

tations

the and

are restricted over

ilarity

allows

graphs,

Dynamic

and

preserving [Jul

whereas

determine

Hudak

et

languages

the Smith

location

1988]

developed

Amber

the

the The

a referentially

semantics

and

we annotate

sim-

and

anno-

programming,

specify

of computations.

from

has some

preserved,

para-functional which

in Olden,

start

sequential

al.

expressions,

model are

processor

to

allocations

and

other

dif-

main

transparent

language,

trivial. [Chase

at the University

et

al.

1989],

of Washington,

which

object-

are

employ

thread

and

object migration mechanisms to improve locality. Emerald was designed for programming distributed systems. Amber permits an application program to use a network

of homogeneous

processors

as an integrated

multiprocessor.

These

guages provide primitives for object location and mobility, and constructs the programmer to indicate whether the thread or the object(s) should satisfy

an invocation

that

mer has not specified which

mechanism

Prelude,

a remote

object.

the compiler

In cases where the program-

uses several heuristics

to determine

is appropriate.

a language

199 1], includes

references

a preference,

lan-

to allow move to

being

a migration

developed mechanism

at MIT similar

[Hsieh

et al.

to ours.

1993; Weihl

et al.

The goal of the Prelude

project is to develop a language and support system for writing portable parallel programs. Prelude is an explicitly parallel language that provides a computation model based on threads and objects. Annotations are added to a Prelude program to specify architecture-specific which of several mechanisms The language object

supports

migration,

lude projects

implementation details. These annotations specify should be used to implement an object or thread.

a variety

of mechanisms

and computation

are different.

Our

migration.

including

remote

procedure

The goals of the Olden

focus is on effectively

parallelizing

call,

and Pre-

sequential

C

programs; thread migration is simply part of the underlying model. Their goal is to develop a new language that facilitates writing portable parallel programs using computation Orca[Bal

migration as one possible et al. 1992] also provides

tool. an explicitly

parallel

programming

model

based on threads and objects. Orca hides the distribution of the data from the programmer, but is designed to allow the compiler and run-time system to implement

shared

shared objects object

should

objects

efficiently.

are accessed that be replicated,

The

Orca

compiler

produces

is used by its run-time

a summary

of how

system to decide if a shared

and if not, where it should

be stored.

Operations

on

replicated and local objects are processed locally; operations on nonlocal objects are handled using a remote procedure call to the processor that owns the object. Split-C [Culler et al. 1993] is a parallel memory, message-passing, and data-parallel ACM

Transactions

on Programmmg

extension

of C that

abstractions

to the

Languages

and Systems,

provides

shared-

programmer. Vol

17. No

%lit2, March

1995

258

.

Anne

C provides

a global

providing

both

tothelocal

Split-C

efficiently. In and

In

based

on

objects, and

like

This

compiler

and

through

message

of data,

structures goal

Concert

(such

control.

passing

execution

provides

threaded

and

Concert

composition,

of the

concert

project

1994],

a recently

also

from

allocating

data

writing

of

programs

Olden’s. Chien

1993]

object

provides

efficient

provides

concurrent space,

common

inheritance,

and

some

asynchronously parallel

messages,

provide

space is also

parallel

communicate

first-class

is to

for

shared

forwarding),

the

system

of fine-grained

tail

(invocations).

object

synchronous

and

a globally

and

for from

efficient

are single

parallel

different

and

places

The

in explicitly

Karamcheti

Olden

of an algorithm

routines and

1993;

as RPC

Objects

for

for

pointers

a global

structures.

to be used

remote)

global

data.

describe

provides

by point

(possibly

between

description

object,

to

programmer

necessary

data

model

et al.

support

idioms

concurrency

major

allocation

programs.

programming

and

of a data

is intended

[Chien

run-time

object-oriented

A

system

system

of its

abstraction

has a work

Concert

layout

toany

manipulate

the

[1993] the

of locality

guaranteed

difference

any

et al.

to decouple

are

point to

Split-C,

to retrieve

reading

object

Split-C

The

pointer

may

In

concept

pointers

is a major data.

Lumetta

of an optimized

asynchronous

an object.

pointers

the

a way

a clear

Local

of primitives

routines

of work,

a global

for

maintains

is allocated

provides

description

global

follows

provided

piece that

that

pointers.

a variety

work

work

uses the

a related

space

global

whereas,

way

Olden,

abstraction the

and

provides

The

Split-C.

et al.

address

local

processor,

location.

work

Rogers

and

collections

continuations.

support

for

fine-grain

concurrency. Cid

[Nikhil

locks

model

of parallelism.

mechanism

allows

the thread begun

should

This

a structure

is based

on global

directly.

Instead,

of several copy

in return.

maintains

Cid’s

on existing

but

may

(for

existing

goal

cannot

a global

they

locality

have while

object

mechanism

cannot

be dereferenced

and

goal

that

explicitly

using

a pointer

to a local

the

Cid

This

creation on which

once

of data

object

is that

compilers.

element

is given

locking,

thread

migrate

pointers

of Cid

a threads-and-

the

processing

and

use implicit

using

and

access to the

read-only)

design

run-time

programs

one

system be able

supports

to

portability,

efficiency.

CONCLUSIONS

JVe have

presented data

approach

we

resolution

techniques,

programs,

to pointer-based

able.

In

have

property

TransactIons

a new

structures

contrast,

ownership, ACM

these

requests

C, supports

advantage

also provides

A major

dynamic

This

threads

to take

Split-C,

objects

to

lightweight, a specific

Cid

example,

global

hardware

hamper

Cid

Unlike

programmer

modes

consistency.

run

9.

the

sharing

Olden,

it awkward

iteratively. pointers.

are

to name

Unlike

makes

extension

threads

programmer

be run.

execution.

traversing

the

proposed

Cid

and

approach on

noted

a fundamental

currently dynamic

programs

used

data

to produce

data

therefore

makes

Array

structures

structures

Languages

the and

precludes

17, No

2, March

1995

our

apply

to

new

run-time

array-based

directly

use of simple model

pointer-based

for

are

be traversed the

to

programs

structures

resolution

use

developing

trying

SPMD

must

Vol

In

with

data

run-time Systems,

that

machines.

problem

programs.

of dynamic

on Programmmg

supporting

message-passing

address-

be addressable. local

ineffective.

tests

for

Supporting

Our the

mechanism

dynamic

avoids

nature

decide

if it should

of the

data

migration

parallelism

and

have

used

couraging,

our

this

in light

We believe

systems

improve.

programs.

that

Our

computed

to

on

to split

of the

our

the

owns

thread

message-passing

will

processors

to focus

on this

to

issue

thread

Coupled

iPSC/860

with

introduces

and

Our

the

results

performance results

the

importance

performance

as part

piece

the

which

better

out

the

relevant

data.

programs.

achieve point

closely

processor

of computation.

poor

also

the

migrates

the

on the

sample

model

experiences

that

each

mechanism,

some

different

We plan

that

mechanism

run

if it owns

futurecall

execution

system

especially

machines.

sults

processors

implemented

processor

more

making

strategy

259

.

matching

than

by determining

is our

Structures

by

Rather

migration

to the

technique

by allowing

We have

a statement

Data

problems

structures.

we use a thread

automatically

thread

fundamental

data

execute

structure,

of computation the

these

of the

Dynamic

of our

CM-5, are

en-

of current

as communication of merging

re-

of divide-and-conquer continuing

research.

APPENDIX SAMPLE The

FUTURECALL

following

CODE

is an example

of the code sequence

for f uturecall

in Olden’s

iPSC/860

implementation.

II

Intel

//

Future

i.860 cell

//

r14

future

is

st.1

rO,

st.1

r14,

st.1

fp,

addu

O,

call

_TreeAdd

addu

32,

//

is

with

cell

stack

rl,

r14

rl

Return

return

value

pointer

12(r30) r30,

integer

r30

4(r30)

(Future

//

Mark

future

//

Store

next

//

Store

fp

//

Push

future

//

Call

TreeAdd

//

Delay

//

Modify

//

is

stored

//

the

first

had

cell

cell

slot

of

been

in

Store

//

Pass

fc

pointer

//

Load

fp

from

//

Call

stolen

//

address

12(fp),

r19 fp

orh

h%_TreeAdd.cs,

br

___stolen

or

l%_TreeAdd.cs,

rO,

_TreeAdd.cs ri6,

ld.1

4(r14),

16(r14)

r14

(Future

which point

to

st.1

register return

unwinding

code

value as future

argument

to

stolen

cell

r18

r18,r18

Normal Return

long st.1

to

stolen)

//

fp,

rl,

subsequent

16(fp)

ld.1

call

address,

Poi.nterto

mov

TreeAdd

return

//

r16,

empty/full

field

_TreeAdd.cs

st.1

//

in

0(r30)

Abnormal

long

Futurecall

has

not

been

II

pointer

//

Store

//

Pop

as

wi.th

register

unwinding

argument

stolen) to value

future

register in

future

unwinding

code

cell

stack

L.14: ACM Transact, orison Pro~ammlng

Lan@ages and Systems, Vol 17, N. 2, March 1995

260

.

The

Anne

Rogers

following

CiW5

is an

et al.

example

of the

code

sequence

return

value

for

futurecall

in

Olden’s

implementation. ! CM-5 Future

call

! Future

cell

is

! Future

stack

St

with

integer

located

at

register

mov

H5, [7k10+41 Xr10,%15

call

_TreeAdd;

add

%r10

is

%15

. word

field

this

fc

Call

st

__TreeAdd.cs

!

(Future

!

jump

%oO, [X15+16]

!

ld

[%15+41,%15

! Pop

register

has to

stack address

unwinding

been

steal to

store

of

return

code

return

! Pointer

St

top

modify

to

! Abnormal

%oO, [%fp+16]

at

routine,

! Pointer

___stolen;

ba

next

!

%07,12,%07

__TreeAdd.cs

. word

! Store ! Place

stolen)

% store register

return

ret

val

unwinding

code

value

future

cell

stack

ACKNOWLEDGMENTS Dave

Hanson

and

Dave

Hanson

suggested

computer

Systems

experiments. (NCSA)

and

provided

the

the

The

anonymous

initial

many

provided

made

helped

that

by Mark

to improve

by

CM-5s.

our

Josh

provided

an

as a possible

L.

Guibas

and provided

Hill,

this

for

of lllinois

Diagram

TSP

it.

Super-

we used

Lumetta

written

about

Intel

totheir

Voronoi

recommended

the

University

Steve

implementation, Miiller

questions and

the

access

suggested

suggestions

reviewers

at

of Barnes-Hut.

Jorg

Li

iPSC/860

centers

Hummel

many

Kai

to the

(NPAC)

Joe

at ORNL.

implementation.

answered stacks.

access

implementation

we obtained

fromNetlib

and

Supercomputing

University

an

ICC

provided

of Power.

application;

wrote

use of spaghetti

National

Syracuse

implementation

and

Fraser the

Division

The

Barnes

Stolfi,

Chris

Mukund

and

J.

aninitial

Raghavachari,

article.

REFERENCES ALLEN,

R.

AND

form. .4LLEN,

KENNEDY,

ACiLf F.,

Trans.

BCJRKE,

the PTRAN

K

1987.

P,ogmm.

M

system

translation

of

9, 4 (Oct.),

S~st.

P

, CYTRON,

, CHARLES,

analysis

Automatic

Lang.

FORTRAN

programs

AND FERRANTE,

R.,

for multiprocessing.

to

vector

491–542. J. 1988.

J. Para/le/Distrib.

An

Comptit.

overview

of

5, 5 (Oct.),

617–640. .AMARASINGHIZ,

memory

gramming ANDERSON,

parallel

Language APPEL,

A.

ming BILARDI,

machines. LI,

of G.

Transact,

In

Communication Proceedings

Global

In Proceedings Virtual

ACM, F.,

systems.

AND NICOLAU,

A.

IEEE 1989.

Lan~aCes

New

C’onjerence

York,

generation

for

on Pro-

126–138.

parallelism

and

locality

on scal-

on Programming

112–125.

foruserprogram,s.

/.$upport

In Proceedings

forprogrammang

Languages

of and

96–107. ~

Trans. Adaptive

SIAMJ.

code

’93

’93 Conference

York,

primitives

TANENBAuhi,

machines.

ProgaInmlng

York,

New

for

o.f the SIGPLAN

on Archatectura

New AND

ACM,

ACM,

and

SIGPLAN

optimizations

memory

conference M.

optimization of the

and Implementation.

1993,

1991.

distributed

orison

1993.

and Implementation. K.

Systems.

KAASHOEK,

forshared-memory ACM

hl,

International

Operating H.,

Destgn

LAM,

Destgn AND

the4th BAL,

AND

M

machines.

Language J.

able

AND LAM,

S.

distributed

1992.

Softw. bitonic

Comput.

and Systems,

Vol

Orca:

Eng.

Alanguagefor

18, 3 (Mar.),

sorting:

An

optimal

18, 2, 216-228. 17, No

2, March

parallel

1995

program-

190–205. parallel

algorithm

Supporting

BOBROW,

D.

AND

ments.

WE

CALLAHAN,

M.,

ROGERS,

Languages

and

D.

GELERNTER,

vol.

768.

Springer-

CARRIERO,

N.,

A.,

A.

ACM,

A.,

D.,

KARAMCHETI,

T.,

New

D.

EXCKEN,

FRASER,

environ-

for dktributed-memory

AND

HENDREN,

L.

6th International

D.

PADUA,

LEICHTER,

1994.

Eds.

J.

Annual

Early

multipro-

experiences

Lecture

1986.

ACM

E.,

with

in

In

Olden.

Workshop,

Notes

Distributed

data

Symposium

LEVY,

AND

U. BANER-

Computer

Science,

structures

on Principles

in

Linda.

o.f PTogTamming

of

AND

J.

grained

Principles.

1993.

Computer

The

Concert

New

J. 1989.

University

York,

Hlinois,

of

147–158. and

programs. of

The

Proceedings

system–compiler

Introduction

1989.

In

ACM,

object-oriented

Science, R.

R.

of multiprocesom.

concurrent

RIVEST,

AND LITTLEFIELD,

M.,

Systems

PLEVYAK,

fine-

H.

on a network

on Operating

C.,

programs

llachines:

progr anuning

E.,

DLJSSEAU,

T.,

AND YELICK,

C. W.

tation.

M.

Ph.

D.

GIJIBAS,

A.,

93. IEEE

1990.

run-

Tech.

Rep.

Ill.

June.

Urbana,

to Algorithms.

McGraw-Hill,

R.

1992.

guages.

IEEE

HALSTEAD,

execution

L.

Cornell

Ca.,

C CompileT:

VON

S.,

Proceedings

of

262–273.

Design

distributsd-memory

and Implemen-

multiprocessing

subdivisions

and

voronoi

J. 1990.

Trans.

HENDREN,

L.

Syst.

systems.

diagrams.

ACM

Trans,

ACM,

HIRANANDANI,

New

S., MIMD

G.

AND

structmes

’92 York,

W.,

CHIEN, In

distributed-memory on Principles

Lan-

232–241. symbolic

data

programs

A. 1992.

computation.

structures.

with

recursive

Abstractions

and transformation

Ph.

ACM

D.

data

for recursive

of imperative

on Programming

AND

TSENG,

R.

1993.

C.

thesis,

structures.

pointer

programs.

Language

AND

Design

data

In Proceed-

and Implemen-

of

stack

and PTactice

1989.

New

W.

systems. of Parallel

ACM Transactions

Tech.

for

o.f Supercomp

uting

1993.

technique

for

Rep.

3, RWTH

Experience

the SIGPLAN

ACM,

P., AND WEIHL, parallel

W.

optirnizations

FORTRAN

91. IEEE

Ca., 86–100.

A new

DALLY,

Compiler

Proceedings

In

implementations.

Proceedings

1991.

machines.

Los Alarnitos,

and Implementation. WANG,

distributed

1, 1, 35–47.

Conference

K.,

Press,

A.,

Ca.,

on

on Computer

249–260.

in distributed

plementation.

structures

Conference

recursive

Parallelizing

Syst.

memory

LO OGEN,

data

for concurrent

with

J., AND NICOLAU,

KENNEDY,

Society

programs

A. 1990.

dktributed

Computer

Los Alaruitos,

A language

the analysis

of the SIGPLAN

tation.

dynamic

(Oct.), 501-538.

7, 4

DistTib.

Improving

with

N.Y.

NICOLAU,

PaTallel

J., HUMMEL,

structures:

Press,

Parallelizing Ithaca,

L. J. AND

IEEE

programs

of the 1992 Inte’rnationa[

Multilisp:

Lang.

University,

HENDREN,

of

Society

JR. 1985.

Program.

HENDREN,

W.,

Los Alamitos,

In

Ca.

for

General

In Proceedings

Computer

R. H.,

TTans.

Design

LUMETTA,

Split-C.

Bonn.

J. 1985.

SPMD

machines.

HORWAT,

City,

A.,

in

~, 2, 74-123.

memory

on

Press,

A Retargetable

parallelization of

KRISHNA?AURTHY,

programming

Society

Redwood

Automatic

C.,

Parallel

D. R. 1995.

University

L. AND STOLFI,

ings

S.

1993.

Computer

AND HANSON,

thesis,

Graph. GUPTA,

GOLDSTEIN,

K.

Benjamin/Curnrnings,

GERNDT,

HSIEH,

multiple

236-242.

Dept.

Supercomputing

HOGEN,

of

York.

CULLER,

D

implementation

151–169.

LAZOWSKA,

efficient,

LEISERSON,

stack

261

.

1–20.

AND

York,

V.,

for

UIUCDCS-R-93-1815, CORMEN,

OLAU,

Symposium

support

and

Compiling

AND

o,f the 13th

Parallel

the 1%h ACM

model

Parallel

Berlin,

F. G.,

system

time

Nrc

New

J., AMADOR,

Amber

J.,

for

Verlag,

Recod

Languages.

CHIEN,

K. 1988.

REPPY,

GELERNTER,

Conference

A

2, 2 (Oct.),

Compilers

JEE,

1973.

Data Structures

16, 10, 591–603.

J, Supercomput.

CARLISLE,

CHASE,

B.

ACM

D. AND KENNEDY,

cessors.

In

GBREIT,

Commun.

Dynamic

York,

with

’89 Conference

management

of runtime

Aachen.

CST:

Programming

and

on Programming

im-

Language

101-108.

Computation

In Proceedings

migration: of th e ~th ACM

Programming.

on Programmmg

the

Languages

ACM,

New

and Systems,

Enhancing

locfllt

SIGPLAN

Symposium

York, Vol.

y for

239–248. 17, No

2, March

1995

262

.

Anne

HUDAK,

Rogers

P. AND SMITH,

multiprocessor

E.,

H.,

Trans.

V.

93. IEEE

lem in the plane. KOELBEL,

C. 1990.

University, LARUS,

A.

Math.

Oper.

York,



Symposium

on

mobility

in the Emerald

support

for

runtime

hardware.

Los Alamitos,

In

Ca.,

algorithms

Res. 2, 3 (Aug.),

for programming

ACM

243-254.

efficient

on stock

of partitioning

programs

Lafayette,

J. R. 1991.

Concert Press,

A paradigm

A. 1988. Fine-grained

1993.

Society

Compiling

West

New

6, 1, 109–133.

analysis

ng:

13th Annual

of the

Syst.

languages

Computer

R. 1977. Probabilistic

ACM,

AND BLACK,

I!.,

Comput.

programming

programnni

Record

Languages.

AND CHIEN,

object-oriented putzng

Conference

HUTCHDNSON,

ACM

KARAMCHETI,

KARP,

Pa-a-functional

In

oj PTogramm8ng

LEVY,

system.

L. 1986.

systems.

Pranczples JtJL,

et al.

conmumnt

Proceed ~ngs of SupeTcom -

598–607.

for the traveling-saleman

prob-

209–224.

for nonshared

memory

machines.

Ph. D. thesis,

Purdue

Ind.

Compiling

lisp

programs

for parallel

execution.

Lisp

Symb.

Comput.

4, 1,

29–99. LARUS,

J. R. 1989.

sors. Ph. LARUS,

J.

Restructuring

D. thesis,

R.

AND

of the

Implementation.

ACM,

cution.

In

New

Proceedings

Experience LEE,

with

Int.

LUMETTA,

J. Comput.

pricing:

93. IEEE MOHR,

E.,

(July),

J. 1993.

Iinda. R.

1994.

Dept.

Two

algorithms

D.,

Press,

programs

ACM,

for

New

York,

for constructing

program.

Los Alamitos,

and

concurrent

exe-

Programming: 100-110.

a delaunay

of parallel

triangu-

1993.

Decentralized

optimal

of SupeTcomputang

240–249.

JR. 1991.

programs.

I

In Proceedings

Ca.,

R. H.,

auf

A.

Cid:

AND

A

,

A parallel,

Lazy

IEEE

task

Trans.

creation

A technique

Parallel

Dist.

Syst.

2, 3

“shared-memory”

Science,

Cornell

New

J.,

1989.

In

6th International Operattng

, CtJLLER)

A mechanism

19th Annual

D.

for

Process

and

Research

for

A.

An

Center,

decomposition

Parallel

abstraction

Palo

Alto,

through

on Programming

L.

1993.

Supporting

C’ompzlers A.

for

Pro-

In

Computzng,

for

controlling

Ca. April.

locality

of reference.

Language

R.,

York,

Destgn

In

and

Im-

D.

Languages

for

5th

PADUA,

dynamic

Intemataonal

Eds.

Lecture

Notes

J.

, AND

WOOD,

192–207.

REINHARDT,

for distributed

execution

Machints:

AND

Berlin,

S,

K.,

shared

Support

LARUS,

memory.

for

R

In Proceedings

Programming

of the

Languages

and

297–306.

GOLDSTEIN,

int egratiW

SPMD

Parallel

NICOLAU,

on ATchitectura[

International

on Pmgrammmg

net work-

N.Y.

S.

C

, AND

communication

Sympostum

on

SCHAUSER,

and

ComputeT

K.

and Systems,

Vd

E

1992.

computation.

In

Architecture.

ACM,

256–2E6. Transactmns

und

machines.

Compilers

WorkCrews:

Springer-Verlag,

New E.,

express

69–80.

LEBECK,

ACM,

mit tels

distributed-memory

and

Ithaca,

1989.

Systems

access control

Conference

Systems. T

York,

757.

B.,

Fine-grain

T.

D. GELERNZ-ER, vol

FALSAFI,

University,

M.

’89 Conference

Languages

Science,

C for

on Languages

AND HENDREN,

U. BANERJEE,

I.,

EICKEN,

K.

ACM,

REPPY,

D. A. 1994.

sages:

42, DEC

PNGALI,

Computer

SCHOINAS,

Rep.

ation-clustern

RWTH-Aachen.

WoTkshop

of the SIGPLAN

structures.

workst

in Elektrotedmik,

Tech.

Workshop,

ACM

accesses.

Design

1988 — PaTallel

AND KHALIL,

of a parallel

Parallelveraxbeitung

plementation.

the

structure

Language

9, 3, 219–242. CULLER,

E. S. AND VANDEVOORDE,

Proceedings

in

between

Lisp

PPEALS

and Systems.

AND HALSTEAD,

o.f the ?’th Annual

parallelism.

data

X.,

granularity

of Computer

ROBERTS,

ROGERS,

A.,

Diplomarbeit

ceedings

ROGERS,

LI,

on multiproces-

264-280.

NIULLER,

IV IKHIL,

the

J. 1980. Sci.

conflicts

on Programming

Restructuring

Languages

Society

D.

Detecting

Conference

P. N. 1988b.

B.

execution

21-34.

development

Computer KflANz,

for increasing

York,

Inf.

L.,

The

1988a.

’88

Applications,

S., MtJRPHY,

power

P/.

for concurrent

Berkeley.

of the A CM/SIGPLAN

D. T. AND SCHACHTER, lation.

VON

P.

SIGPLAN

J. R. AND HILFINGER)

LARUS,

programs

of California,

HILFINGER,

Proceedings

In

symbolic

University

17, NO 2, March

1995

Active

mes-

Proceedings New

of

York,

Supporting

W’EIHL,

W.,

SPURGER

BREVVER, ,

MIT/LCS WOLFE, ZIMA,

h{ H.,

BAST,

February

E.,

AND

519, 1989.

parallelization.

Received

C,,

COLBROOK,

WANG,

P.

Massachusetts

Optimizing H.,

AND

1994; revised

ACM

DELLAROCAS, Prelude:

Institute

Supercompilers

GERNDT,

Panzlte~

A., 1991.

M,

Comptit.

August

Transactmns

Dynamic

1988. 6, 1,

1994;

of

Data

C.,

A

HSIEH,

system

Technology,

for

for

W.,

A

tool

for

263

.

JOSEPH,

portable

Cambridge,

SupemomputeTs.

SUPERB:

Structures

A.,

parallel

WALDsoftware.

Mass.

Pitman

Publishing,

semi-automatic

London.

MIMD/SIMD

1–18.

accepted

on Programmmg

November

Languages

and

1994

Systems,

Vol.

17, No

2, March

1995

Suggest Documents