Exploiting Heterogeneous Parallelism on a Multithreacled ...

25 downloads 0 Views 1MB Size Report
ainwl in t 11,s docmt~en[ are those of Tera. Computer. Company and. sI1,)uIc{ no( be int erpl et c+ as representing the offkial poli(:ies, rither expressed or i]lpliccl,.
Exploiting

Heterogeneous

on a Multithreacled Gail

Alverson

Brian

Rol>ert

Koblenz

Multiprocessor* Davicl

.Alvm-son

Allan T’era

Porterfielcl

Computer

Seattle,

USA

hunclreds

paper

describes

an integrated

and operating

geneons

parallelism,

threaded

system The

multiprocessor.

(multiple

operations

(multiple

jobs)

activities.

to smoothly ponent

interface

is primarily

its four

feedback

enable

resource

allocation

dated.

This

dynamic

[or managinx

mechanisms

resources

to chan~ia~

fosters

and the efficienl

ho{ 11h igli

expression

aging

u [iliza tioa

aad execu(ioa

the

is that,

of para-

it

from

pipeliuecl

allel

loops.

parallelisln

withiu

at, n~an~

an application

or co-scheduled Medium

operations

grain

parallelisn~,

each

was support

vancecl

Research

nology

Office

Projects

ARP.4

issued by

DARPA

The

and

those

of

O

views

Tera

Permission granted direct

the

to

t!tle

of

copying

the

ra~ige

to tightly

par-

the

To

‘92-7

01992

/92/D.

ACM

fee

and

copy

ainwl

order

DefeILw

$iieli(.,e Prograll) ract

in

rither

all are

ACM

the Its

otherwise,

or not

date of the

of

Code

Ad.

or

the

i]l~pliccl,

allel

are

for

copyright

notice

the

and

Association republish,

notice for

et c+ of

C.,

is given

/0188

This,

a fee

the

. ..$1.50

188

which

is programmed These

units.

nmltit,breaded

must

arcllinumber

multi-

by the processor Our

hard-

principle

processors

units grow

and with

be tolerated

attribute

is to

synchronizathe

for

size

of the

an effectively

largely

call

be used.

language each with

language with

creation

to the

influences The has

and

through require

and

time-sharing more

logical

defined

number

of

mem-

imperative

language

constructs

to specify

synchronization. of processors

the par-

access to a shared

control

is

program-

machine

an unbounded

uniform

system

parallel how easily

virtual

is a traditional data

number

program

of the overall

it presents

in turn,

parallel

The

softl\vare

of

memory

important

machine

unbo(lnded

USA -485 -6/92 /0007

and

hardware

some

uniprocessor.

lat,encies

machine

augmented

perm!sswn.

0-88791

virtual

t,lwead

each

set of functional

they

of the

supports

multiplexed

these

“processors”,

Computing

requtres

the

parallel

machine, nlost,

by our

is

or distributed

appear,

or to

material

made

and

and man-

addition,

a multithreaded

processor

from

Because

nler.

Nlo.

h] I) A972-90-C

t 11,s docmt~en[

expressed

this

In

attribute In

adopting

latency

t ion.

ory. part

paral-

for

to changing

important

a single

multiprocessor,

of

ai~d ‘le{.l-

sI1, )uIc{ no( be int erpl

and

poli(:ies,

copies

is by permlss!on

specific

S( at es

Cc,lit

cent

and soft-

runtime,

levels.

to adapt

a traditional

for

tolerate

Government.

advantage,

pubhcation

Machinery. andlor

the

mirlcr

Company

without

that

commercial

S.

WI

b51 2/2-4:

/C’MO

offkial

U.

copv

prov!ded

that

ICS

the

or

Unit

hardware

compiler,

different,

streams,

motivation

levels.

can

oil

h~forn~ari[~tl

No.

conclusio],s

Computer

as representing DARPA

Agency

Order

007,5.

OTI

ecl by tile

of

resource

heterogeneous

at

are then

onto

Tl]e research

spectrum and

are responsible

physical

Ike

scalable *This

this

components

most

imtruction

ware set contains

exploit

is multitbreacled.

ple streams

parallelism

simultane-

requirements.

Brie[ly,

of t he

Introduction

grain

to

grain-size

hardware,

cooperate

essentially

application

applications

run

We use the phrase

refer

to

parallelism

resource

of

Fine

vargrain

reqairemen(s

llelism.

An

to varying

of

can

fashion.

an integrated

system

t,ec.ture,

1

form

that

designed of the

components

u l>-

the

clmcribes

Each

operating

components

and nsage to be dynamically

adaptation

system

lelism.

its owa level of

between

paper

ware

each com-

in

sections)

widely

general, Coarse

lnents.

TIlis

jobs.

itself

para/lelwn with

more

activities,

in a lllllltiprograllll~led

require

we were able

ts. IVhile

presents

parallelism

the opel-

describes

independent

application

heterogeneous

coarse

across

largely

large

ously

and IIIntIIILe

while

in the design,

componell

responsible

activity,

machine

a job, parallelism

system

parallel

and available

The compiler within

on managing

the entire

(or

mu!ti -

) to very

of instructions,

and

parallelism

heteLo-

the exec II tie]] of J,ery fIHe

an ]nstructton

parallelism

ating systemfocuses

compiler,

to exploiting

I ale is (1 Ijlpc]iilcd

enabli]ls

parallel

By considering

archltec

within

focus on managing

architecture,

solntion

Smith

company

ied, runtime,

Callahan

Burton

ll~ashin~ton

Abstract This

Parallelism

The

fiction

is maintained when

the

processors

of an by

semantics than

the of

avail-

able

streams.

tained

by

The

the

one stream

with

We focus

of uniform

computation

that

What

masks

the hardware, —

that

heterogeneous

operating

in

and

across

How

can

resource

in

tency,

and

are central

idle,

parallelism.

system,

detecting

than they

and rum

parallelism

provide

that

as the

applications? the

changing The ing

resource

second

the

of

granm[8,

II].

program

can vary

shortl

wide

Good

a way

can

out

across

move

jobs,

to users. The

Tera

valleys

a full

We also

nent

must

of the system, to provide The

description

parallelism.

Section ing

The there stream

to our

Section

7 outlines

need

for

available each

each

own

The

needed

with

for

Section

in

of compiler

generated

the

blisy

within

of

processing.

At each

that

is “ready))

is

in the pipeline.

with

streams,

each

queue

are

time.

they

contains

of the

In HEP

to

and

addition

are usually

level

is backed issuing

task

switching

to multithreading, of both

bandwidth.

pipes

has its

stream

of a per-processor

is its pipeliniug

the

has

address

processor

instruction

instruction

increase

in

a virtual

a portion

con-

processor 128 instruction

same

processor’s

fast

hardware.

and

defines

on the

The

in

multipro-

Each

An instruction

counter

introduced

up to 16 processors

switch. domains

clomain. file.

[15],

multithreaded

Because

come

independent

a key

the memory

from and

the

different

interlocks

active

streams, are seldom

encountered.

5

Synchronization full/empty

en-

purpose by

of allocat-

arises

thrust

opera-

system

domain

enables

AI,lJs

whe]lever

location

each

that, sets

when it

and

condition

reattempted

on the

the

the atomicity

set

HEP

location

that,

it, empty, not, hold,

is shared

of operations

full, When

the turn

among

register

a

general

writes

similarly,

when

sets

includes

and

indivisibly

does

next

with

is implemented

A read,

stream’s

register

achieved

empty,

full.

location

the

is

to a location

reads

chronization

HEP

memory

A write

by an instruction

to ensure

189

and

the

plelnented

a processor,

ill (I)c I’acf->of

on

register,

Because

a si]tgle

hit

in

an instruction

tile

resollrces

HEP

executing

register

instructions

syllcllr(~llizatioll. col)certlti

main

of instruction

commercial

A protection

protection

ant]

4 out-

runtime

was

proces-

outstanding

queue

of the

16 protection

mechauism

;3 by

!%ction

the gen(ral

next,

bandwidth

each job

shared

other

in Section

designs

The

queue

number

Denelcor

by a program

to dy-

channels

systeln

[13].

associated

fraction

~vas the first

space;

conlpo-

to adapt

available

a large

elements

a large

streams.

conl-

of par-

parallelism.

systetll

for

busy

support,

in

design

multithreading

the machine

processing

A H El> system

concurrent,

ef[iciell(

pro-

access.

to hold

to the

by a high

jobs.

parallelism

providing

moved

cessor.

Systems

is insufficient to keep

on

oljeratiug

to different

Related general

needs

a

in a sin-

memory

shared-resource

of each

nected

level

architecture.

6 describe+

emphasis

fly the

1~~~,

for

anct resollrces.

is followed

fine-graine(l

special

and

feature

the Tera

This

of our

the

with

resources

2

due

contrasts

for very

discusses

vironment

]U order

locations

in the instruction

tom-

ii] formation

to com-

be

is small.

systems.

lines support then

in part

section

can

how

parallelism

integration

multithreaded a brief

describes

interval

multithreaded

in 1974

an element

up to 25(3 pro-

its l)articular

the

tick,

kept

early

is to use queues

processed

unobtrusive

strealns

time

entities

shared

memory

registerless

at each stage

clock

and

is intende(l

have

the set of communication

this

next

paper

to the oth(’r,

and

tions

w-

of the by Miller

design

HEP,

\vill

in application

Interestingly,

streall]

Illu Iti- user scientific

manages

identify

provide

changes

,f (Iynamic

tvi(lliu

One

multiple-stream

the

one find

plwfikw,

work,

:32,000

This

allelism.

namic

scale

approximately

of the systeln

that

and

of uniform

amount

refers

parallel

Ideally,

means

the

memory

and writing

space.

illusion

sor proposed

paral-

computation

our

Machines

size machine.

in a

in

reasonable

reflecting

ponent

Miller. the

programs

which

the

pro-

relatively

avai lab]e

to another

of one

vides

in a given

wherein

reading

to

architectures Scalable

increase,

Shared

paradigm

address

resources

streams.

memory.

be performed

through

gle shared

of another.

use in a large

with

over

requires

(ask

peaks

environment.

cessors,

one

in

parallelism

ill

resource

mos[

system,

commercial puting

frolll

the

shared

la-

latency.

concern-

numerical

variance

these

appears the

the

mall agenlellt,

Intuitively,

to fill

that

utilization

to resource

sources

used

machine

research

of magnitude

rapid

to

parallelism

Optimized

and

to smooth

solution

from

indicate

adapt

of its applications’?

available

by orclers

intervals.

show

lelism.

stems

the

Results

time

also

recluirements

question

amount

as a whole,

hardware

multiple

of processors

can

memory

processing

for multithreaded

scalable

communication

system,

the

among

number

includes

and

allowing

proportionately.

municate ●

simply

that

increases

latency

latency,

motivation

is to

of work

within

Resource

are shared

A primary

tlflesysten-

compiler,

using

latency.

synchronization

Rather

questions

is the role of each componentof

time

is mainwaiting

in another.

two

support

access

which

on answering

to systems



fiction

multithreading,

is im-

indivisibly the

syn-

instruction for

all streams reservation

on shared

is

execution.

registers.

of bits

MASA.

MASA

sign

that

ture.

A

[9] is a illllltitllret~(lecl

has many novel

streams

for

feature

snnilarities

feature

of

lightweight

is supported

which

is shared

as having several

registers

a new

stream

ation

yet

dure

invocation,

in the

of register

of register

registers,

with

their protects

the

its content,

context

of the

to be probed of the procedure

automatically.

at no additional

New

registers

performs

adds

The

are

As

cost.

to the family

scientific Unlike

the

private

need

for

allowing

with

faster

rates

clock

The

that,

can

of this

failed

of the

the

Horizol~

next

better

For example,

the

ation ters

for for

system

nization. and

The dresses

Tera

particularly from

explain,

success data

software

at the its

for

operation

management coarser

be easier

to write it

Alewife

provides for

(temporarily)

fail

the

a

synchrocause

stream

to

and

be

multithreading,

busy.

counters

Because on

the

scalable

programs

the than

the

locality,

Tera

other

makes

on the

for

the

Tera

in

the

application

it is important

to find

both

for

locality

the

hand,

parallelism while

some

grain

program

is important

find

performance,

cation

it

Tera.

program-

the

for

Alewife

in the

appli-

parallelism.

lookaTAM.

completion

The

Tllreadecl

is a compiler-based

loca-

lightweight

the

model

scheme purpose traffic,

under

Machine

to

grained

TAM

parallelism.

schecluling,

in the

control

and

set,

ware clesign; it, imtead

is an

these

and

execution

model

storage

The

[7]

tolerance

The

by making

instruction

(TAM)

latency

synchronization.

for fine

compiler

plicit,

Abstract

approach

synchronization,

lnecllanisll~,

much

of its ‘rera

RESER\~E

creation

has

similarly,

retry phase a new

places

all

management operations

model

is not

can be implemented

types

The

and

can take

coun-

ecuting

is used

iu

synchrointerface,

compiler

ex-

a hard-

on a variety

flexibility

ad-

TAM called threads

preclecessors,

cessor

190

with

however,

has

this

perform another

a “quantum”. that,

can

share

instruction

Al-

streams, flexible

about

system

and the

for

on the

Tera

given

ex-

provides

medium-grained

overhead

the pro-

fine-grained

par-

are scheduled

enough

by

parallelism,

adequately. level A

be

runtirne

level,

are

known

scheduling

streams At

policies

and

Tera

To minimize

hardware.

simple

the Tera

The

architectures.

policies

of information

program. units.

schedules

scheduling

advantage

gram the

parallel

the

same allelism,

of parallelism,

distinguishes

TAM

oper-

explicitly

(and unconventional)

Consequently,

operations,

of the machiue’s of all

AS

is cus-

parallelism,

a

two

de-

IIorizon.

the

to memory

levels,

fewer

programnler/system

The

and

Harclware

sharing

in

inherits

of strea]n

system,

in a

is done

though it is difilcult to compare models with architectures, we contrast several general features of the TAM with those of the Tera,.

includes

applies

the

of conventional

HEP

bits

in-

until or fails

location

that

uses coarse

about

good

neti~orli

heterogeneolls

trap

request

potentially,

a processor

a control

[J nlilie

trap

where

of traps.

however,

monitoring.

only

The

the

architecture

the cooperative

resources. system

from

CREATE

lookahead

cost

to keep

to

I)acli-off

the

de-

references.

executed

systems,

stores

require

that

instructions

stores.

co-

caches

scheduling

memory

and,

Alewife may

appears

on Iylelnory

a user-level

the

facilitate

with The

and

and

occur

nler/syst,em

}vit,hout

(he

a hard\vare

architecture

assuring

conjunction

blt

It

with

sets

memory

each

assumptions

may

lookahead

of streams,

Ioacls

will

to

for

is to lniuilnize

philosophy sections

tomized

waiting

provides

Tera

a 3-bit colllpiler

full/empty

reducing

The and

to

to sinlul-

operations

The

no

ill regis-

and

of subsequent,

proposes

also

AL(J,

Tera

the

registers,

potential

are

it.

a hard-

retaining

of the program

Stream

previous on

Loads

trap

machine

has its

eliminates

plus

multithreading,

stream

design

switched.

for

ure is horizontal],

synchronization

retry

Ho~izon on the

coutains

number

synchronized

Tera. the

a

HEP,

systems

parallelism

the an

without

Horizon

substantially

sign

architect

A

for

delayed

lnore

interlocks.

instruction. the

bits

outstanding

the

be issued

is used

reserved

instruction

register

to indicate

change

having

multiple

head

of the

This

a memory,

Each allow

hardware

The

of the

MIMD

strean]

and

Horizon

instruct,ioll initiate

to

HEP,

memory

set. and

operation.

for

each

full/empty

taneously

tions

a successor

data for

of

grain

a remote

bit

Because

HEP,

(M-bit

field

of shared

register

access.

each

[10],

to tolerate

scheme

attempt.

full/enlpty

computation.

own

ter

Horizon

Alewife

soft ware.

nization.

Horizon.

The

shared

locality

a given

synchronization

are passed

the

shared-

only

Effectiveness

coarse

from

stream

proce-

on

employs

structions

by the han-

for each

largely

Alewife

trapped

them,

is an-

distributed

not

for

directory

between

pends

latency,

caches

maintained

[1] of MIT

MIMD

community.

memory

includes

herency

cre-

architecture to the

multithreaded

ware

occurs,

This

Alewife

addition

to avoid

Alewife

share

a trap

handler.

a new stream

parameters registers

When

as a trap

rnemory

As well

streams

recent

seeks

This sets,

windows.

spawnecl

parent.

traps.

The

other

of parallel

bank

by creating

shared

available,

use

MASA’S

own

allows

Similarly,

is its

Alewife.

de-

architec-

and

is created

dler.

HiZP

calls

by

automatically

stream,

MASA

the

procedure

in the style

their

awhitecture

with

in

statically

a register

its

quantum

scheduling

co-scheduled set.

hierarchy

represents Since

the

a set on

of

a pr~

registers

are

shared,

one remotle

In contrast, register eight

the

set. remote

referellcf?

has only

However,

because

references

gle remote

does

Tera

reference

not, idle

one

a register

progranl

lookahead

allows

to be unresolved also

need

not, idle

a Tera

Tera

to

The

a sin-

Tera

memory

register

and

set.

architecture

implements with

interleaved

work

looks and

data *T

*T.

[14]

acteristics

of both

machines.

The

remote that

loads

due

Each The

is a hybrid parallel

von

architecture

and

adopting

Neumann

aims

to

syllcllrollizatrioll

and

limit

char-

memory

dat,afiow

the

latency,

cost

of

particularly

ha,,

synchronization

oshort

is that

the second

The

the

fields

data

longer,

having

been

relieved

activities

by

first,

processor.

ization

processor

hal)dles

the

processor

can

data

leveraging

the high

or

quelles

activities,

the

handles

other

processor.

lilc)r{~collll) of quick

The

fewer

the

provi(led

allow

aspects,

I{ISC

by currea(

ware

RIS(’

sor

ptc)-

the

set that

processor. which

is shared \’ery

exploited.

Similar

performs

a remote

no state

behind

load on

a continuation is returned.

with

register

logic

issuing

the

can

live

and

(after

the

more

may

restore

The

S( real])

allowing

when

costly,

them

the operation

upon

from

before the

has its Tera

There

merits,

system

We is its

allelism,

including

Although

other

levels

— generally

— Tera at all

appears

are a n u lnher

with

of approaches

IIlllltitl]reacli]lg believe

a principal

ability

to exploit

parallelism systems the LlniqLle

k;ach

fine

and/or

in its goal

strtng’tll heterogeneous

from

exploit

to tol-

the

busy

at jobs

simu It,aneous

parallelism medium to exploit

at grain

par-

any

jobs.

(the

levels

and

])ara]lelism

in

may

to execute

the

next

instruction

provision

of

The

out,

example,

instruction.

the

in

the

is filled

processor

128 streams the

streams

processor

is

per

pro-

processor run-

other

jobs

they

during

do-

enable

while

independently;

saturated

an

belongs

of protection

to keep

extra

the

When it

latency

streams, provision

in-

in each

streams

instruction

other

utilize

processor

be issued which

and

a new

its predecessors.

is necessary

and

as well,

to the

proces-

instruction

the

to

time.

associates

enable

periods

a

of higher

duetosynchronizatio

four

Two

of

trap

bits,

Access

the

nwait

used of the bit,

disabled

the

are used

to a word

values

bits

with

a, full/empty

these,

full/empty not

state

bit,,

instruction)

set aud

191

one

by

stream

enough

than

Since

memories stream

the

to execute

pipelined

with

from

set and are hardclock,

instruction.

the

to remain

data

of the

levels,

appli-

sand

contention).

Ilizatioll.

several

next

the average

to fully

bits.

of the

is ready

ready

a forwarding

trap

that

hardware

(for

Tera

of the

of the

Similar

swapped

word:

approach

to at-

in mapping

register

and

are

is more

latency

latency

be ex-

Although

necessary

tick

interfering

instructions

nlelnory Summary.

its

becomes

processor

erating

can

theextradornains

own

completes,

ning are

completes),

usually

set of

accounting

parallel.

their

a different

so that

a stream

and

every

network

processor

cessor

16 prodomain

processor

flexibility

a stream

there

mains,

restarting

on

Provided

hand,

them

the

without,

willh

limits in

are

for

an independent

each

is completely

full y utilized.

streains

since

to store

t,hf>

do not,

other

by

thereby

request,

OU the other

have

that

with

system

it to issue

instruction

Iraves

operations

registers.

be

values

load

continuation

locations,

of more

operations several

processor.

remote

tick

support

ftheprocessor,

each have selects

struction

a streatn

synchronimtion

to ~est,art,

file

on a

to be effectively

processor,

issuing

Thus,

running

and plane

Lo processors.

scheduled.

ancl

pro-

a request

A protection

resource

16 applications

interpretation

reg-

ryligllt\v(~igllt,

parallelism

or remote

the

response

remote

grain

asingle

the streams

to a dataflow

advantage

presents

lllakess trealllsve

fine

contains

consume

n~oclel

among

Tllissllarillg

allows

to take

*T

the

network

allows

has

applications

the operating

allows

SimilartoTAiM,

map

uti]izationo

Streams

processor

cessors,

ister

than

perspecarbitrary

provided

The

a

net-

to cross the bisection

Consequently,

peak

The

allow

processors

Tera

stream

16 distinct,

cations

synchron-

lllllltitllr(’aclillg

be a standard

speed

information. ecuting

to

128 streams,

a memory holding

messages

tain

of the and

by

[3].

processor’s

which

tick.

a remote

synch rouizatio]]

to the

be tolerated.

bandwidth

clolnains

the

can

processor

registers

data

from

each

clock

interconnected

is sufficient

for each processor

sharedprocessors

network

relative

a bisection

implements

{iteilltt~llsive

13ecause

all

mes-

a pipeline

a response

tection

A c,ommonshort,

syllcl~rollizat,ioll,

processor

processor:

typically

processors.

directly

activities.

of a join

request.

processor

asynchronous

processor

sagescorrespondingt memory

two

units

bandwidth

latency

Each

*T

like

placement

vides of’

a physically

multi-stream

interconnection

its

resulting

to barriers. node

activity

architecture,

Overview

multiprocessor

packet-switched tive

for

Architecture

per up

at, once,

3

set,.

thread

to tag

for

lightweight

the

by the

Regardless the

one

pointer location

of the

data

data

and

synchro-

memory

of the

pointer,

memory two

bit

is controlled

bits.

and

full/empty

access

if either by the

each

bit,,

trap

memory

state bits

is

access

fails

and

the

Tera full/empty the

issuing

supports

stream

three

bit,

where

of dependence

tral)s

nlocles

of’ interacting the

again,

lnode

with

pendence

the

is selected

to plan

by

(Normal)

Read

and

ess of the state bit

write

the memory

of the full/empty

worcl

bit;

regardl-

writes

The

(Future)

Read

is full;

and

writ,e

leavethe

bit

the

word

onl’y

when

pipeline.

and

(Synchronized) .the

the

empty \Wlen

to empty.

and

set the

a memory

ation

traps.

tional

unit

suing

bit

access

in the hardware The

it

retries

when

is retried

the stream are

do not

the word

only

is full the

and

worcl

tions

is

ing

to full.

fails

before

and

when

lVrite

that

done

several issued

with

other

wide

instructions,

ters,

and

trap

encountered

Tera

fLlllC-

streams

support,

z]rcllitt?ctllre,il for

mechanisms, in the

process

~vill

follo!ving

Very

from

several

The

wide

instructions

architecture Because

tion,

support it

detect,

creation,

cclun-

as they

and

present, this

typical also

C-op house

operations performed

The

Tera

head

oprrat.io]ls:

control

A-op

(MA~).

~~-op slot

integer

and

allows

ac]dressing

with Tera

A typical

is a F1,OAT-.ADDm1UL,

l’he

floating

poin(, point

instruction

cm

will

guarantees

are handled field

stream

in the

up to eight

operations

where

this

lookaheacl within

the same

(or

instruction through instruc

tioll.

the

val~ie

is that

it allows

a

allelism”

be

looka-

It

also

to be cc)-residellt

dependence

Iat$ellcy

allows

on loacls,

benefit

pipeline,

hancllm

any

by

optimal

obtained

and

data

en-

a stream

to

be

automati-

dominant, is

level

by executing

separate

using form

“loop

Additional

of par-

separate parallelism

blocks

(lolmpilers

mapping

of code

can

are also capable to

of this

is often

achieve

times

the number tasks

e.g. how

is deemecl

011 lllllltitllreaclecl

of

a form

of

with

froln

the

ancl

jobs,

han-

is furt]ler

complicated

sort

Consicler

an isolated

other

other the

in which parallel static

the exeand

loop:

the

mapping

what

serially.

activities

dynamic

not num-

parallelism

and executed

by uncertain parallel

and how

are scheduled

useful”

the tar-

a given

of iterations

architectures

is sharecl

onto

parameters for

influences

loops

“not

processor salne

on

In particular,

of various

lented,

parallelism

dependent

at compile-time.

parallelism

192

idle

programs

The

synchronization

architecture

is implen

of

can

scientific

compilers

concurrently. explicit

cution

colI-

existing

technology.

ber of processors,

to I.)e tolel-

instructions

in tile ant]

field

n]:ljor

the

activate

pipelining[12].

dependence

each

reserve

which

of parallelism

from

when

known

Iookalleacl

amounts

by detecting

The

inenlory

are

automatic

instructions

returns

is found

depen-

allows

and

RESERVE, the

counters

concurrently.

~

to

instructions

with

program

of a loop

M-op

ouLstal)cling

RESERVE

instruction

—- para]lelism

issue

of A and

activate

processor:

The

them

it is useful it employs.

allocate,

to assist

to au-

parallelism.

that

instructions,

exploited

inserting

each

The

QUIT

together

actions,

that

iterations

However,

nwmory

The

parallelism

arithmetic

iu

assign

work

features

introduced

extracted

get

Tile

hav-

increases

iclle state.

npeclcc]

of tile

of parallelislll.

and

computation.

harclware

one

instruc-

HEP,

fine-grained

a single

QUIT.

c.urren t compiler

k-

to

on

of programs.

when

The

ancl

ancl were

be executed

results

a a 3-hit

degree

stream

the

the

compiler’s

CREATE

b~paw:cl)

silll~llt,ai~(+o~]sly

a stream.

dles register

that,

be available

hold

often

operations

streams

to issue

clata

hardware

from

may

As seen with

harclware

provides

Significant

It can

calculations

be the

parallelization

cally

and

is flexible.

Iloatillg

con-

instructions

hardware

right,

the

for cletecting

three

par-

Whereas

the pipelines

ancl exploit

critical

vironments.

applica-

and

detect

streaws

to

parallelisln.

which operation

grain. easy

packec]

simple

a succeeding

dence

ated

a typical

fine

sequential

with

fine-grain

Parallelism

compiler

tile

novel,

of an apljlication’s

and

is a JUMP,

a very

responsible

filling

Fine-grained

CREATE,

of the Tera

reasonably

most

streams,

5

define

tick.

operations

trols

are

in parallel

three

clock

the

is solely

level

Arithmetic,

op is a. LOAD,

at

very

with

utilization.

Tera

pipelines

program,

in even

instructions

Memory,

by

to

the Tera. compiler

Tera

nlultiple

of a memory

pipelines.

pipelines

their

Bef’or(> we descn

are

Parallelism

parallelism

is tedious

and scheduling

A

and

unit

the Tera

streams

tomatically

sections,

Fine-grained

and

is set by the

instance

are filled

stream,

Cleallocate

4

late

is-

~cl~i{lillgits

be defined

each

depen-

too

value

also supports

functional

pipelines

multiple

The ftJle

lookahead for

satisfied out

times

instructions.

Otherfeatureso

with

finding

de-

the hardware

theoper-

in the, memory

interfere

differ

hardware

in the

ventional

full,

Reacl only

bit

than

The

can

Tera

allelism

instruction set

streams

rather

Making

allows

operation.

to full.

cell

the

compiler

set the

scheduling

constraints

sta~ling

operations.

in the instruction

aheacl,

dence

pointer.

bet, weeu memory

explicit

physical from problem

resources.

of’ streams

DOK = l,N X(K)= Q + Y(K)*(R*ZX(K+1O) ENDDO

+ T*ZX(K+ll))

compiler

additional forms

might

then

iterations

of the

A solution

this

not

them

if available. of

from able

if there

this

begins

loop

not

execute

mini-super

has

computers

it is considered

very

dynamic

parallel

activity

empty-set-full.

been

used

01) sev-

number

as Alliant

and

loop

configuration.

Our

as soou

allocation

of resources

even

across

to changes

among

jobs,

to 30 streillnsl many

is busy

then

the

loop

only

a few

and

for a simple ant]

streams serially

instructions

divide

parallel

[II(> loop

are acquired.

no addit)ou

executes

al streams will}

pro-

1 streaills

after

usual

X(k)=

of overhead:

holds

Each

loop,

the

stream

its share QUITS.

of the

Note

that

it is ready

loop

possibly

to be

in a dif-

instancesof

uniform

such

demand

rrut

usvd

on the

scalar

loops

proces-

soft~vare

ter

a successful

apply.

incorrect,

affect

other

~~ith

..

+ T*ZX(k+ll))

set

constant.

reserve

streams

as many

current

includes ( and

This

the

u ar(! instruc

tioll

as are avail al)ie, protection

do]llain.

Rl;-

i~-op

registers

and

attempts lip ‘Ille

to

.st

in the

to

u + .sl, llulllber

but

numbe~

It

also

has

but

a –TEST

RESERVE

fails.

of the

use the

collclitioa

afmay

will

not

to run.

size,

an alternative software

this

that cost

to

any

the

variant code

may

sets

rarely streams

The

case when

generate

a serial

at run

a more

a condition

the

is to

time

be costly

mechanism

u + St

reserved.

handle

one and

to select

opera-

reserve

were

to

to the

the

u-i- st are avail-

that

approach

reserve

of acquiring

unless

code-

addition

attempts

is to have

allocation

In

is very and

implemented

support

a parallel

code

Since

sion

are best

streams

one

cocle,

time

we also

reserve

generate

versions

expected

issued

approach

both

of streams.

whether

must

the processor, depending on t lie Progran}.

193

does not

indicating

is expected

job

this in

of parallelism

RESERVE t u St that

‘ As few as 10 or as n)any as SOstrea]~~scfJLIlfl}x: needed tsaturate

reserved

on following

convention

errant

overhead

operation,

al~le.

liable

the

is only

this

of parallelism low

forms

code the

and

counter

instruction

depends

a create

is

cur-

instructions

QUIT

counter

that

the

reserved

The

above

behavior

forms

a specific

compiler

RESERVE of the

in-

count

Otherwise,

The value

current

a CREATE

if the reserved

count.

Violating

RESERVLUPTO tion

t u st where

occurs

reserve.

having Some

streanls

WheIl

jobs.

Inany

1

instruction

the number

convention

cause

space.

holds

job.

cocle fragment

the

effective,

and the other that

the

the

job

numberof

to be CREATEd.

both

each

current

current,

increase

The

For

the

for

the

a fault,

streams

counter.

optimlzatlons

for

counters

holds

is incremented. to

decrements

Q + Y(k)*(R*ZX(k+lO)

in the job

than

count,

are

two

One

reserved

not, larger

d = DONE$-if(d > 1) quit

use by the

loop.

last

one such

provides

is issued,

to allow

is an immediate for

the

independent

a fairly

streams

of Streallls struction

CHUNK, kO store DONE$, s k = kO-1 while (--s > O) { create 11 k=k sp=sp n=n k+=ko} k=-1 11: load kO, CHUNK n = tnin(kO+k,n) while (++k < n) {

Tera

Se\reral

a write-when-

DONE$

the

to another

in a processor.

active

k>,

amilab

store

SERVE-UPTO

quits

hardware

loaded

kO = cei.ling(n/s)

The

in

except

applied

present

The

evenly

If the are

s =

loop

s rO 29

reserve.upto S+=l

I/

job.

should

is

as it completes

stream

and

write

variable

involved

as a stream

ferent

in avail-

shared

variable

each

Each read is a read-

each

sor. of the compiler

however

cessor

the

and

reserved

parallelism.

One strategy is to use up

The

decrements

nlo-

and

of streams

executa-

responsive

that

synchronization.

t+~l~ell-f~lll-set, -elllpty,

exploit,

to another.

is very

that

can

such

t.

register

Variables in upper case are in memory and variables with a $ indicate uses of the full/empty bit to provide

all

code

into

new stream. In this example, the CREATE instruction is simply sliipped if the RESERVE fails to allocate any streams.

hut

important

of machine

the system

generate

streams

approach

is to allow one

so that

However, will

compiler

additional

be independent

tivation

per-

ultra-lightweight the

This

where

s – 1

stream

loop.

require

the

iterations.

illlplel~lelltatioll

so that

each

at the moment.

is to have

does

convex,

N/s

loop

and

is stored

O.

CREATE instruction is pa.rameterized by a pcjump target and three registers passed to the

The

this

“created”

streams

execution

ble

are

approximately

are not, s — 1 idle

eral

implement

streams

reserved

rO is always

relati~’e The

successfully

llegister

which

in terms expensive to fall

will is still

fail

two

one and ver-

of code but

re-

back

on.

and

so the

low.

It

6

Medium-

Grained

Parallelism

full/clnpty that,

Tera’s

medium

larity

grain

of greater

ified

by the

bination

parallelism

than

user,

100 instructions

discovered

the

system.

to

wherever

of the

guage

can he spec-

compi]er,

express

or a com-

as much

It

should

imply

of hide

Future

variables

chronization

for

that

cle-

added

lan-

Achieve

efficient,

tomatic

loacl

need

not

number

and

of

generated

processor

balancing

utilization

with

blocks

lightweight

communication

through the

when

needed,

influence

and

the

scheduling

au-

empty

used

lvithout

variables

more

complex

erties,

how

the



The

straightforward

of

system this

language

mentation

on our

so

a

mentioned

be

used

illlplelllelltat,ioll

penal (ies that

lnay

lightweight

can

i[s parallellsln.

addresses

sectfiol]

these

describes

extensions

al)d

prop-

Explicit

parallel

variables

namic

LHheir inlple-

scribes

Parallelism programming

and

future

a location

that

of a computation. may

have

pointer) ifier

any and

[6],

parallel

will

A future activity

type

identified

and

its

When

a future

with

the

statement,

as unavailable

fioat$, type

to to

result

memory

quent

reads

future

completes,

future

variable

cell

of the

the appropriate and

the location

tmnslitemted

the

of future~

discussim]

create

or

paratld

our activity

language

extensio]ls

scltedufing

and

be created.

then

a write

to

variable

to full.

can

be-

Synchro-

base

on which

the

value

future bit

) aikd block,

vallle

of futures

language

goals.

vari-

;ubse-

uses a varial][ t o FORTRAN.

remains

are

the

to

load

afore-

describe

an

balancing

due

created

dynamically

amount

of time,

Instead,

approach

If a parallel

call

and

parallel

activity

activity

on the fact

to execute

blocks

shortly

constraints,

strategy

a new

can get

and a static

we use a dy-

and rely

to synchronization

through

01’ C, we l]avc

future

in

that

next after

is

it is

an efficient

parallel

io]].

activity

executing starts

on the job.

and

chores

for

of new

the

in

the

the

processors

allocation

pool. of

are

an un-

work-pool

from called

ready

to

the

pool.

reserves

some

mrtual

processors,

repeatedly Executing

additional

parallelism

style

ready

and

select a chore

instruction in the

genera-

chores.

the

Parallel ainls can

of virtual

Resource

to react

by adjusting

runtime

number

Virtual

from

chores

continuations. a

the runtime

streams,

compiler

creation,

uses

executing,

a chore.

the

corresponds

chores

finer-grained

runtime

The

On

which

by

of ready-to-execute

to work

result

activity

either

environment

of instruction

application

194

pool,

collection

for

parallel

created

statements.

runt.ime

run

be

a ready

a job

Tera

(lwcrI

mediun-grain can

number

tion

( tile

each

chores

streams

to the

111 addit

variables

satisfies

is impractical.

crucial.

may

the

available

It

activities

not

\Yhen

variable

Il%en

synchronized

that

accomplishes

of which

strategy

associated any

and

machine

decision

TIIe

a new

is written

is lnarked

Parallelism

policy

ordered

Ixc~ I llf:re is il]clcp{>llclcnt of language construct and coutcl l)e aserl t 0 supporl ~lcla 1asking or concurrent tlweacls syst,ems SUC1las PRESTO[.5] or lJi,l&[2]. the

full the

the state

an unpredictable

Changing

‘Though

until

structures

virtual

for

placed

(lual-

a future

full/empty

is set, to empt,y

unavailable

synchronized

Similarly

powerful

the

or

variables

illt,

future

result

starts,

( the

the

becomes

blocks

resets

hardware

a very

parallel

Again,

able. is marked

then

to

empty

to empty.

be

“synchro-

access an

we

can

also

These

provide

self-scheduling

We

de-

the

these

char,

the

is used

clirect

variable

pro-

need

running.

future

variable

receive (;,?

( e.g.

with

statement

througl]

A future

extended

primitive are

is fostered eventually

In our

the state

variable

synchronization

statements.

from

that

scheduling

several

hardwale.

Expressing

direct

read

buffer this

which

type.

synwork

bounded address

of

do not

synchronization.

Since

started

6.1

primitive

until

form but

qualifier

A

combination

that

or decrease

the ‘“lka

remainder

type

Implementing

execute To illustrate

To

and

nized

6.2

issues:

activities

algorithm

pro-

programmer

synchI’onization

between

and

synchronized

comes

and

exclusion

the variable’s

presents Support

now

incrementally;

— so that,

be concerned

resources

activities

may

a powerful

provide

bit.

variable

Support an unbounded number of parallel activities — so that an application’s parallelism need not be a certain

parallel

result

pipelining”

situations. any

variables

a full

on

with

nized”

resets

can be discovered

Any the

provide

a sync

full/enlpty

should:

dependent

for

“software

mutual

combined

of the

a parallel

for

have

fronl

cletails

to full).

waiting

ducer/consumer

an

configuration

implementation goals

encour-

parallelism

as possible,

and these

should

the

it is natural.

hardware

Together,

language

blocked

ceed.

well

programmer,

tails

and

by the

parallel

programmer

apphcation the

has a granu-

of both.

A general-purpose age

generally

bit, is toggled

were

its

to the

Allocation. dynamic

resources

dynamically processors

The needs

to these

increase it employs.

and

of an

needs. decrease This

func-

tionality

is us~ful

can vary

significantly

as the

Reasonably,

a large

cessors

a small

than

Resources

should

allelism

itors

to limit

relative

merits

pool

for

a period

be released their

the ratio

lifetime

pool

ready for

size of the

the

ready

acquired

contrast,

average

over

of high

inefficient

to the number

more

virtual

in

processors

scheme

either

succeed

When

a synchronizing

full/empty to

the

retries

it.

when

the runtime

different ating

the

processor.

on a processor

system

interact

lease)

with

such

the

the form

of the

call

a tealn,

While

cooper-

teal n creation

Tile

initialized.

be checkeci (team

system

resources

should

and

be granted

hlore the

uses to decide

that,

iike

counters

whether

{hat

the

is detailed

the

the

A chore

may

synchronized there

block

because

to complete

may

variable be many

in a variety

which chores

of ways and

nization

following:



If

the

should

chronizing

will memory

and

will

little

state.

goals

program

ing

executed

soon so that,

more

then

of syn-

tl)e

cost

and

If the periocl

of time,

waiting

strategy

Synchronization use of the

then

of

full/enlpty

uot

a heavier

bit

associal

and

memory

finds

) pool

for

data

bit or

memory

succeed

for

a Iollg

weight,

non

busy-

ed )vitll

t hrougi each

wor(l

for it

the

back

virtual

reference

be-

values

are

trap

to

access

that

a trap.

the

chore

that

failed (un-

to execute.

adopted

for [4].

to either

program

could

the trap

ready

storage need

bits

The

processor

explicitly

ad-

data

in the

to that,

avoids to

7

Coarse-Grain

the The poll

a check

unblock

another

Parallelism

operating

scheduling -–

fully

system

—- fa>t!:s,

schedules scheds,

of fine graiu

under

and

parallelism

the control

A task is the system-wide in the “rera

takes

of I-Structure

trap

failure, chore

the synchronization

is similar

location

later

is neces-

The

t attempts

placing

transactions

retry

and

a retry

has one of its

another

traps

The

of the

saved.

with

c.ontiuuation

i]nplelllelltatioll

memory

are

stream

often

stream

which

of

state

it immediately

the

fails, count

synchronization.

because

cell

a second

protec-

retry

of saving

chore

register

stream

the the

cost

blocking

each

ion attempt

facility.

blocked

s traps the

with

a

is desirable.

is accolnpiished

the

handler

The will

can of one

a

reference.

synchronization

rate

limit,

the

associated

location,

resources ●

normal

stream.

the

than

the

by

When

on each

succeecl, cost

no addi-

processor

handling

of the

counter

memory

the

sytlchro-

that

a quick

the

When

domain’s

state

on a list

use of the

efficient)

for

avoids

requests

enables

a syuchronizat

trap

a stream

the

This

data

time

at the

is set to balance

split-phase

,Since

accessing

to sllplmrt

Our

be used

a future

to access a

is in the wrong

blocking,

be

for

it, is trying

executilig

synchronization

busy-wait

normal

is waiting

it is Ilecessary

synchronization are the

it

protection

to synchronize,

or because

with

in hardware,

is incremented.

lightweight, the

blocked statement

new mem-

requests

is associated

time

to ilnplemeut

dress

Synchronization

hardware

with

are required;

register Each

value

set.

7.

same

instructions

count,

the

placed

additional

in Section

retry

WThen

ou the nat~lre

the

the

returns

automatically

Interleaving

automatically

limit

domain.

sary

RE,-

The and

network

issues

issuing

restoring

be glob-

at

retrying

status.

are interleaved

rate.

while. and

tick.

limit

interaction

cal]

for success.

allocation)

Efficient

runtlime

system

By

instruction

using

reqllires

I uust

but

tional

reaches

(re-

requests

will

a short is made

the request

queue

communication

retry.

its

it,

to acqllire

IH tlIal

failed

tion

the run[, ime

is because donlai

recently

A retry

on a

a failed a retry

at a fixed

the

continue

of the operating

system

of al) operating

operating

6.3

This

and

SERVE, must

of streams

operating

as protection

controlled

takes

A group

Retry

state,

-

attempts

access

wrong

in

synchronization

or be waiting

with

recluest

flooding

required

running

synchrohardware

retry-then-trap

synchronization

memory

is in the

processor

requests

per is

RESERVE and CRl~A’TE illstructious),

new teams,

resources

The limit, large-grain

systen]

a tleam indepeudentlly

(through

must

is called

(shrink)

the

ory

between jobs,

operating

seeks to adcl new streams

physical

can grow

ally

allocatiml

with

the the

a novel

for optimistic

most

immediately

bit

places

in resource

that

mon-

On any given processor, additional streams (for use as virtual processors) can be bound by the use of RESERVE and CREATE hardware instructions, Streams can be released with the QI_]IT instruction. Until the ruuti]ne reaches a stream limit imposed by the operating system, it can thus grow and shrink its stream resources by

Interaction

state, If not,

using

is tuned

to run

accordingly.

fairness

desired

immediately. action

tdie

the assumption

and acts

communicating directly with the hardware. ensures that the opera.t,iug system controls

is in the

place

software

Our

par-

Theruntime

bit

takes

scheme.

parallelism,

of chores waiting

of virtual

and

pro-

of low

If the

nization

execution.

a period

use.

memory.

pool

progranl,

its timely

during

of the number

ready

of the

coarse

grain

teams



resources

of the compiler unit, of resource

parallelism leaving

the

— streams

and runtime. assignment.

It provides all execution environment for running a program, Each task consists of one or more scheds. A sched

the of

195

is the operating ble unit.

The

separate

the

system’s

same main. tem

sched

is a group

processor,

of streams ) The

following model.

withil~

tlhat

a Lealll

that

Tera

processor

schedu

absorbs

the

sys-

TEAM

1-2

TEAM

a running

task

runtime.

Typically

a request,

current,

team

to

be

added

sometime

contaius

A

am units

of A that

coatains may

two

two

streams

of its

Team

than

Teams

Tearll

same

ent processors.

team

and

owus

I>e 011 the

1-1 ancf/or

Schecl ?.

separately.

‘~cilrl)

bul

sched

processor. nlay

l-~

processor

Each

2.1

SCIIWI 1 and

can be schwl uled

teims,

be on the

~-1.

schecis,

Sched

l-~.

typically

‘~[1[>

a dynamically

or

sized

set

di(fC>r(Jat

in

ping

slow,

ing

or

usage

pI’oceswl

and

Thr

I-?.

simplest

executing in the can

indepelldrllt

rise

and

tasks

without, vicles

lllis

physical

processor

with

costly

software

tlaSliS to run

main

for systeln

reservecl

uler

of the

is lla IIdlI~d

(ps-scheduler).

The

system’s

tasks.

lived, resource consuming,

example,

jobs submitted

system.

Sn~all

fewer

resources,

quick

turnaround

processor’s pb-scheduler

sched

time:

and

the

in-

used



tick

applicat,

This the

pro-

a slll~ill

ions para]lf>l, lLS am

ps-scl]fdlller.

sclled

correspond of

iIre

to

are

applications:

for

short,

ai]d

o[(en

F’vry

iise

I)etweell g(-]lorall},

A

the

tlte

lzmi

ready

by the

counter

to

tracks in a pr~

time

and

scaling

load

counter

due

can

be

of streams

it is advanced associated

every with

the

for

ratio

quantum

was

or pipeline

value

divided

measure are ready

to

counter

by a

of processor

is incremented

which

used

no stream

requests

of this

counter

are

phantom

in which

memory

time

system

uses the

in resource

every to issue.

provides

tick This

a measure

Iized, some

it may

determines add

curreutly

the operating

after

counters

allocations.

system

placement,

or rrsulned

A

parallelism.

operating

A ddi tiolla]ly, illflliencc

clock

are used

counters

a precise

of strean~s

changes the

the

196

minus

operating

processor

accessi-

every

number

streams

of issue slots

excess

sor is u ncleruti

denland

sI1(JII col]]nlands, >Ilarc(l

when

average

utilization. to long

Tile

divided

The

tasks.

is a user

sfretim

per-processor

provicles

n~tmber

dynamic

hatch

lived,

One

of average

as

t,aslis

job elllry

value

the

al]plica,tions

B{g .sc/Icd

hy the

the

due

[luantum

utilization.

a I)ig sched-

across

over

on a processor;

processor to isslte

time

The

The stream

domain.

hazards. s(’(

account-

of the processor

of active

the number

ready

sect, ions.

team

ad dltiona]

counts

illl(l

to obtain

by the number

l\%

of a task

tht

by each

the fraction

is obtained.

used

measure

last, de

finwgraill

team

protection

allows

anlollgs[

paral]cl

for exall)ple, domai

-– ‘l’era

tll~

wa a renlo(e

are not< ver)

protection

with

I{psources

schedulers

sched

similarly

by t]vo scl Irdulc’rs,

characterization

or little

rapid]}

clelta,

of active

by streams

this

swap-

purpose.

issue

issued

be

cumbersome

counters

By sampling

to

sys-

(and

is incremented

The

be

cannot

to track

counter

a

to

operating

this and

a

for

ask

as it

several

system

usage.

ask

parallelism

within

of instructions

tillle

remove)

can

the

provides

which

resource

domain.

to each

tasks

ill t~arlirr

and

I)J the

to taslw

with

dotllaills

(1]1.)-scll(scl{ll(~r)

big sc,hed tasks long

described

protection

applications

scheduler

operating

parallel,

it

between

processor.

ac t,ivit, ies,

techniques

Schecluling of ready

ill

tection

is

for

I]umber

t,[~~li

diff’ereut

nligrat,e

switclliug

per

and decreasec I using

coarser-grain

sched

sharing

context

clomains

up to 15 user

are expanded

iJet\veel]

parallelism

To foster

16 protection

behavior,

can

the

argo(’d

a siliglc

dyaalnic

resollrces

\\’c

within

processors

decreasing

parallelism.

isln

clocli,

It

for

the form

a team

otherwise the

by

can

requests

or

resources

(or

for

assist

‘l”]vcJ per-protection-domain

ac( ount,

by concilrreu{ly

(:111].)lictltiuted-lllel?lory II)UIti processor, [a Sc(llo W shrt W(I Memory flfttltij)roce.ss~}rs. Jiltlwer Acadenlic F’ablishcrs, 1991. biatowicz,

D.

1991.

August Ii.

Padua,

In Proceedings

Conjere,ux

1936 D. Chaikell.

K.

support:

Proceedings

References Agarwal,

D.

Fine-grain

synchronization

A.

and parallel

PPEALS

Parallel

Sah,

machine.

methods

[1’2]

[1]

Levy,

building

1$)~~

to give m the

support

An.

1988.

and

for

Langt(c{ges

of

Tera hardware silnulator, ~l(lltitllrt?:icl(’cl l)rocessors having the ability to mask tile ]ateucy of’ one stream with the activity of another --- I?VPU nlort’ than traditional

Culler,

In

development,

the

D.

osium

and

I(M

Archdec.

Smith. A future-based language for highly-parallel computer. In D. Gel-

Nicolau,

processor

job,

e;ir]y

of

pro-

systeul,

syste] l),

B.

Compdcrs

cla tailow

utli-

onto

of tllle

rul)tin)e

conlpiler,

tools

of processor

do its IJart,icu]ar

[8]

re-

systmn

that

and

A.

H.

for

September

D. Callahan

fipril

processors,

\f7r hclieve

oj the

Computer

In A CM/SIGPLAN

Corm.,

stract

collqjiler

applications

are lightly other

of \ irtlial

systems.

harclware

of

stream

to the operating

its aulnber

places

of’ tile

Lazowska,

euviroumeut

J. Wawrzynek,

sysWI]I inte-

release

on

1990.

fosters

by tbe

is ith

al]ility

reserve

or decrease

wbicb

systelll

inclutle: directly

the operating

lizatiou,

‘1’kra

of multiprocess-

Proceedings

Syrnpostum

E,

open

onit

In

1983.

Bershad,

erut.er,

‘S re.

shortage

be compensated

June

A critique

style.

The

machine

resources

tvre,

Iannucci.

a general-purpose

actliviLies

tenlporary

Internc(tional

ming

[6]

R.A.

nual

Haven,

generated

or are less uniform.

Neumann

An

in

Similarly,

activities:

shares

tasliS.

in one task

gration.

A

loops. parallelism

concurrently

ut ilizatiou,

operations

ancl

vou

[5] B.

bitls, per-

of anot

Suggest Documents