Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor*

Gail Alverson    Robert Alverson    David Callahan
Brian Koblenz    Allan Porterfield    Burton Smith

Tera Computer Company
Seattle, Washington, USA

Abstract

This paper describes an integrated architecture, compiler, runtime, and operating system solution to exploiting heterogeneous parallelism. The architecture is a multithreaded multiprocessor that supports the execution of very fine grain parallelism (multiple operations within an instruction), obtained, for example, from pipelined parallel loops. Medium grain parallelism (multiple instruction streams within an application or co-scheduled jobs) keeps the many hardware streams of each processor busy, and coarse grain parallelism (multiple jobs) keeps the machine as a whole busy. Each of the four components is primarily responsible for managing parallelism at its own level, and feedback mechanisms between the components enable resource allocation and usage to be dynamically adapted to changing requirements. This dynamic adaptation fosters both the smooth expression of many forms of parallelism and the efficient utilization and execution of parallel jobs.

*This research was supported in part by the Defense Advanced Research Projects Agency Information Science and Technology Office, ARPA Order No. 6512/2-4, Program Code No. OT10, issued by DARPA/CMO under Contract MDA972-90-C-0075. The views and conclusions contained in this document are those of Tera Computer Company and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the United States Government.
1 Introduction

The Tera computer system is a scalable shared memory multiprocessor designed to run a wide range of parallel applications. The machine is programmed in traditional imperative languages augmented with constructs to specify parallelism and synchronization. The parallel programming model it presents is that of a virtual machine with an effectively unbounded number of "processors", each with uniform access to a single shared memory. This fiction influences how easily parallelism can be expressed, because the programmer is largely freed from mapping logical activities onto a fixed number of physical processing units.

The hardware that supports this model is multithreaded. Each processor multiplexes many instruction streams over its functional units, so that the memory and synchronization latencies encountered by one stream can be tolerated by issuing instructions from other streams. The streams are, in effect, the virtual processors of the programming model; they are mapped onto the physical resources by the compiler, the runtime system, and the operating system.

Parallelism appears in applications at grain sizes that range widely. Fine grain parallelism is parallelism among the individual operations and instructions of a single stream; it is exploited by the pipelined functional units and by instruction lookahead. Medium grain parallelism is parallelism among the many streams of an application (or of co-scheduled jobs). Coarse grain parallelism is parallelism among essentially independent applications running in a multiprogrammed, time-shared environment. We use the phrase heterogeneous parallelism to refer to the presence of parallelism at many grain sizes at once.

Our principal design attribute is integration: the architecture, compiler, runtime system, and operating system were designed together, with each component responsible for managing parallelism at its own level. The compiler and runtime focus on managing parallelism within an application, while the operating system focuses on managing parallelism across jobs. Each component also provides, through well defined interfaces, the information that the other components require to do their jobs.
The parallelism available within a program can vary by orders of magnitude over its execution [8, 11], with peaks and valleys at intervals that are often short and unpredictable. Good utilization of the machine as a whole therefore requires a way to smooth out this variance: resources must be able to move from one job to another quickly, so that processors stay busy and users still see reasonable response time. The amount of parallelism needed is large; a full size Tera machine contains up to 256 processors and approximately 32,000 hardware streams.

In designing the system we focused on answering two questions:

● How can the latencies of memory access, communication, and synchronization be tolerated, so that the hardware is kept busy rather than idle?

● How can resource allocation adapt to the changing requirements of its applications, and what is the role of each component of the system in that adaptation?

The rest of this paper presents our answers. Section 2 contrasts our design with related multithreaded systems. Section 3 gives a brief description of the Tera architecture, with emphasis on its support for multithreading and fine-grained synchronization. The remaining sections describe how parallelism is managed at each grain: Section 5 covers fine-grained parallelism and the compiler, Section 6 covers medium-grained parallelism and the runtime system, and Section 7 outlines coarse grain parallelism and the operating system.

2 Related Work

The Denelcor HEP [13] was the first commercial multithreaded multiprocessor; interestingly, an early registerless multiple-stream processor for multi-user scientific computing had been proposed as long ago as 1974 by Miller. A HEP system contains up to 16 processors and a large number of data memories interconnected by a high bandwidth switch. Each processor supports up to 128 instruction streams, each with its own program counter and a portion of a large shared register file. At every clock tick an instruction from a stream that is "ready" is issued into the pipeline, so the latency of memory and of the pipeline itself is hidden by the activity of other streams. Sixteen protection domains allow the streams of a processor to be shared among several jobs.

Synchronization in the HEP is achieved with full/empty bits. Each memory location and each general purpose register includes a full/empty bit. A synchronizing read waits until the location is full, then indivisibly reads the value and sets the location empty; similarly, a synchronizing write waits until the location is empty, then writes the value and sets it full. When the condition does not hold, the operation is reattempted; the atomicity of the whole sequence is guaranteed by the hardware. Reservation bits on registers mark outstanding memory operations, so that an instruction needing a reserved register is not issued until the value has returned; because instructions from different streams are interleaved in the pipeline, such interlocks are seldom encountered.
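The full/empty discipline is easy to emulate in software on a conventional shared memory system, which may help make the semantics concrete. The following C sketch is only an illustration of the read-when-full-set-empty and write-when-empty-set-full behavior, built here from a mutex and condition variable; it is not how the HEP (or the Tera, below) implements it, since in those machines the state bit and the retry are handled directly by the hardware.

    /* Illustrative emulation of a full/empty word; not the hardware mechanism. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  changed;
        int             full;    /* the full/empty state bit (0 = empty, 1 = full) */
        long            value;   /* the data word itself                           */
    } fe_word;

    #define FE_WORD_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

    /* Synchronizing read: wait until full, take the value, leave the word empty. */
    long fe_read(fe_word *w) {
        pthread_mutex_lock(&w->lock);
        while (!w->full)
            pthread_cond_wait(&w->changed, &w->lock);
        long v = w->value;
        w->full = 0;
        pthread_cond_broadcast(&w->changed);
        pthread_mutex_unlock(&w->lock);
        return v;
    }

    /* Synchronizing write: wait until empty, store the value, leave the word full. */
    void fe_write(fe_word *w, long v) {
        pthread_mutex_lock(&w->lock);
        while (w->full)
            pthread_cond_wait(&w->changed, &w->lock);
        w->value = v;
        w->full = 1;
        pthread_cond_broadcast(&w->changed);
        pthread_mutex_unlock(&w->lock);
    }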
The Horizon architecture [10] inherits much of its design from the HEP, again pairing many streams per processor with full/empty bits in memory. Its most important additions are a larger degree of multithreading and an explicit-dependence lookahead field in each instruction, which tells the hardware how many subsequent instructions of the same stream are independent of the instruction's outstanding memory references. Lookahead allows a single stream to have several memory operations in flight at once, so latency can be tolerated with fewer streams. The Tera architecture, described in the next section, inherits substantially from Horizon.

MASA [9] is a multithreaded processor architecture designed for parallel symbolic computation; it has many novel features and some similarities to the Tera. Lightweight stream creation is supported directly by the hardware: a new stream, with its own register set, can be spawned for a procedure invocation at no additional cost, and register sets are managed somewhat in the style of register windows, with a statically defined division between registers private to a stream and registers shared with its parent. The hardware protects a register's content until it is probed, so results produced by the spawned stream are passed back to the parent automatically.
Alewife [1] of MIT is a scalable distributed shared memory multiprocessor. Coherence is maintained with caches and a directory scheme, and each memory word is augmented with full/empty bits for fine-grained synchronization. The processor supports a small number of hardware contexts and switches between them only on the relatively rare events that would otherwise stall it: a remote cache miss or a failed synchronization attempt causes a trap, and a trap handler switches the processor to other work. Because Alewife depends on caches to reduce memory traffic and latency, locality in the application is important to its performance; for the Tera, which relies on parallelism rather than caches to hide latency, it is more important to find parallelism in the application than to find locality. The two designs therefore make rather different demands on the programmer and the system software.

The Threaded Abstract Machine (TAM) [7] is a compiler-based execution and storage model for fine grained parallelism that grew out of the dataflow community. TAM makes all synchronization, scheduling, and storage management explicit in the compiled code, so latency tolerance and lightweight synchronization are achieved without special hardware, and the model can be implemented on a variety of conventional processors. Though it is difficult to compare models with architectures, we contrast several general features of the TAM with those of the Tera. TAM places much of the burden of scheduling on the compiler and schedules its lightweight threads together during a "quantum" to preserve locality in registers and caches; the Tera instead provides hardware streams, full/empty bits, and lookahead, so that the compiler and runtime can exploit parallelism at coarser as well as finer grain with relatively simple scheduling policies, and the programmer/system interface distinguishes fewer levels of the storage hierarchy.
*T [14] is a hybrid architecture that aims to combine the characteristics of both dataflow and von Neumann machines, adopting split-phase transactions to limit the cost of synchronization and to tolerate memory latency while leveraging the high performance of current RISC processor designs. Each *T node contains two processors. The synchronization processor handles the many quick, asynchronous activities: it receives incoming messages, performs remote memory requests, and manages the queues of ready activities. The data processor handles the longer-running computational activities, having been relieved of message handling by the first processor. When a remote load is issued, no state is left behind waiting for the response; a continuation is returned with the data, so the cost of an outstanding reference is small.

Summary. There are a number of approaches to multithreading, each with its own merits. Although other systems exploit parallelism at some levels, generally fine and/or medium grain parallelism within jobs, the Tera appears unique in its goal of exploiting heterogeneous parallelism at all levels simultaneously: fine and medium grain parallelism within a job, and coarse grain parallelism across jobs, any of which may be used to keep the machine busy.
3 Overview of the Tera Architecture

The Tera architecture [3] is a shared memory MIMD multiprocessor that is multithreaded in the tradition of the HEP and Horizon. Processors and memory units are interconnected by a scalable, packet-switched network whose bisection bandwidth is sufficient for each processor to issue a memory reference on every clock tick. Because the memory is physically distributed and messages must potentially cross the network, memory latency is large relative to the processor clock; there are no data caches, and the latency must be tolerated rather than avoided.

Each processor supports 128 instruction streams, each with its own register set, and at every clock tick the processor selects one ready stream and issues its next instruction into the pipeline. A single stream can have only a limited number of instructions in the pipeline at once, so several ready streams are needed to keep a processor busy; provided enough streams are ready, however, the pipeline stays filled and the processor does not idle while individual streams wait on memory, synchronization, or pipeline hazards. Each processor also has 16 protection domains. A stream belongs to exactly one domain, each domain has its own resource counters and accounting, and the extra domains enable the streams of other jobs to keep the processor busy during periods when one job cannot.

Two features reduce the number of streams required. First, each instruction contains a 3-bit explicit-dependence lookahead field indicating how many subsequent instructions of the same stream are independent of the instruction's memory references, so loads can be issued well ahead of the instructions that use their values and a single stream can have several memory operations outstanding. Second, the instructions are wide: each contains three operations, a memory operation (M-op), an arithmetic operation (A-op), and a control or second arithmetic operation (C-op). A typical instruction is a LOAD, a floating point multiply-add (FLOAT-ADD-MUL), and a JUMP.

Each word of memory is tagged with four access state bits: a full/empty bit, a forward bit that causes the word to be interpreted as a forwarding pointer, and two data trap bits that cause a user-level trap when a tagged word is accessed (useful for debugging, monitoring, and runtime support). The hardware supports three modes of interacting with the full/empty bit, selected by the issuing instruction and by bits in the pointer:

● (Normal) Read and write the memory word regardless of the state of the full/empty bit, leaving the bit unchanged.

● (Future) Read and write the word only when the full/empty bit is full, leaving it full. This mode is used for future variables, described in Section 6.

● (Synchronized) Read the word only when the bit is full, setting it to empty; write the word only when the bit is empty, setting it to full.

When a future or synchronized access finds the bit in the wrong state, the access fails and is retried automatically by the issuing stream; when a retry limit associated with the protection domain is reached, the access traps and the failure is handled in software. This retry-then-trap mechanism is discussed further in Sections 6 and 7. Stream management instructions (CREATE, QUIT, and RESERVE) and the per-domain stream counters are described together with the compiler's use of them in Section 5.
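As a rough illustration of how multithreading and lookahead combine to hide latency (the symbols and the sample numbers below are ours, chosen only for the example, not figures from the architecture): if the average memory latency is $L$ ticks and each stream can keep $d$ memory references in flight, roughly one plus its lookahead when every instruction carries a memory operation, then the number of ready streams a processor needs is about

    \[
      \text{streams needed} \approx \frac{L}{d},
      \qquad \text{e.g. } L = 70,\ d = 1 + 7 = 8
      \;\Rightarrow\; \text{about 9 ready streams per processor.}
    \]

Without lookahead ($d = 1$) the same hypothetical latency would require on the order of 70 ready streams, which is why the two mechanisms are complementary.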
4 Parallelism

Significant amounts of parallelism can be extracted from existing sequential scientific programs with existing compiler technology, by detecting data dependence and automatically parallelizing loops; additional forms of parallelism can be expressed explicitly by the programmer. How that parallelism is best mapped onto the machine, however, depends on parameters that are often uncertain at compile time: the number of iterations, the number of processors and streams available, and what other activities the job and the machine are executing all influence whether it is useful to run a given loop in parallel, how many streams are worth acquiring, and how iterations should be scheduled onto them. On a multithreaded machine the decision is further complicated because the processor is shared, at a very fine grain, with other streams of the same job and of other jobs. The next three sections describe how fine, medium, and coarse grain parallelism are managed in the face of this uncertainty.

5 Fine-Grained Parallelism

Fine-grained parallelism, parallelism among the operations of a single stream and among a handful of compiler-managed streams, is exploited automatically; the compiler is solely responsible for detecting it and for filling the processor pipelines. Whereas the HEP relied entirely on multiple streams to keep its pipelines busy, the Tera compiler can also overlap operations within one stream: it packs a memory operation, an arithmetic operation, and a control operation into each wide instruction, sets the 3-bit lookahead field by analyzing data dependence, and software pipelines loops [12] so that loads for later iterations are issued while earlier iterations compute. Because lookahead lets a stream keep several memory references outstanding, a modest number of streams per loop often suffices to keep a processor busy.

When a loop's iterations are independent, the compiler can in addition spread them across several hardware streams. The hardware assists with three instructions, CREATE, QUIT, and RESERVE, that allocate and deallocate streams without operating system intervention; they are described below. Consider an isolated parallel loop:
      DO K = 1,N
         X(K) = Q + Y(K)*(R*ZX(K+10) + T*ZX(K+11))
      ENDDO
The best mapping of this loop onto the machine depends on the number of iterations N, on how many streams are available in the job's protection domain at the moment, and on what else the job is doing.¹ If the machine is busy, the loop may profitably use only a few streams; if no additional streams are available at all, it should simply execute serially.

Our approach is to have the loop reserve as many streams as are available, up to a limit chosen by the compiler, divide the iterations evenly among them, and begin executing as soon as the streams are acquired. Each created stream executes approximately N/s iterations, where s is the number of streams obtained, and QUITs as it completes its share; the remaining stream continues past the loop. If no additional streams can be reserved, the original stream executes the entire loop itself, with no added overhead, so the scheme degrades gracefully.

Stream allocation is governed by two counters associated with each protection domain: one holds the number of streams currently active (it is incremented by each CREATE and decremented by each QUIT), and the other holds the number of streams reserved for the job. The convention is that the current count must not become larger than the reserved count; violating the convention causes a fault. Because the counters are implemented in hardware, this protection against errant programs costs very little. A RESERVE t u st instruction attempts to raise the reserved count to u + st streams and fails unless all u + st are available; RESERVE_UPTO t u st reserves as many streams as are available up to that number and returns the number obtained in register t. A TEST form merely sets a condition code when the reservation fails. Since a fault would be a costly way to handle so common a case, the compiler normally uses the UPTO or TEST forms and generates two versions of the loop body, one parallel and one serial, selecting between them at run time according to the number of streams actually obtained and whether the expected parallelism is worth the (already low) cost of acquiring them.

The code fragment below implements the loop above. Variables in upper case are in memory, and variables with a $ indicate uses of the full/empty bit to provide synchronization: each read is a read-when-full-set-empty and each write is a write-when-empty-set-full. The CREATE instruction is parameterized by a pc-relative jump target and three registers passed to the new stream; register r0 is always 0. In this example, the CREATE instructions are simply skipped if the RESERVE fails to allocate any streams.

          reserve_upto s r0 29
          s += 1
          k0 = ceiling(n/s)
          store CHUNK, k0
          store DONE$, s
          k = k0 - 1
          while (--s > 0) {
              create l1  k=k sp=sp n=n
              k += k0
          }
          k = -1
    l1:   load k0, CHUNK
          n = min(k0+k, n)
          while (++k < n) {
              X(k) = Q + Y(k)*(R*ZX(k+10) + T*ZX(k+11))
          }
          d = DONE$ - 1
          store DONE$, d
          if (d > 1) quit

¹ As few as 10 or as many as 80 streams could be needed to saturate the processor, depending on the program.
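For readers more familiar with conventional threads packages, the following C sketch (ours, not from the paper, and using POSIX threads rather than Tera streams) shows the same chunking idea: discover at run time how many workers are available, give each a contiguous block of about N/s iterations, and fall back to executing chunks serially when workers cannot be obtained. The data arrays and the helper num_workers_available() are hypothetical.

    /* Chunked parallel loop in portable C; an illustration of the scheme, not Tera code. */
    #include <pthread.h>

    #define N 100000
    extern double X[], Y[], ZX[];            /* hypothetical arrays; ZX has N+11 elements  */
    extern double Q, R, T;
    extern int num_workers_available(void);  /* hypothetical: extra workers worth using    */

    struct chunk { long lo, hi; };

    static void *body(void *arg) {
        struct chunk *c = arg;
        for (long k = c->lo; k < c->hi; k++)
            X[k] = Q + Y[k] * (R * ZX[k + 10] + T * ZX[k + 11]);
        return 0;
    }

    void parallel_loop(void) {
        int s = 1 + num_workers_available();       /* this thread plus any extra workers   */
        if (s > 64) s = 64;
        long k0 = (N + s - 1) / s;                 /* ceiling(N/s) iterations per chunk     */
        pthread_t tid[64];
        struct chunk c[64];
        int spawned[64] = {0};
        for (int i = 0; i < s; i++) {
            c[i].lo = (long)i * k0;
            c[i].hi = (c[i].lo + k0 > N) ? N : c[i].lo + k0;
        }
        for (int i = 1; i < s; i++)                /* try to spawn one worker per chunk     */
            spawned[i] = (pthread_create(&tid[i], 0, body, &c[i]) == 0);
        body(&c[0]);                               /* the original thread takes chunk 0     */
        for (int i = 1; i < s; i++) {
            if (spawned[i]) pthread_join(tid[i], 0);  /* wait, as DONE$ does in the listing */
            else            body(&c[i]);             /* failed create: run the chunk here   */
        }
    }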
6 Medium-Grained Parallelism

Tera's medium grain parallelism has a granularity of greater than 100 instructions. It can be specified by the user, discovered by the compiler, or a combination of both. A general-purpose parallel programming language and its implementation should:

● Support an unbounded number of parallel activities — so that an application's parallelism need not be expressed in terms of the physical resources of a particular machine configuration, and the programmer is encouraged to express as much parallelism as is natural.

● Achieve efficient processor utilization, with automatic load balancing and scheduling, so that the programmer need not be concerned with the details of mapping activities onto processors.

● Provide lightweight creation, communication, and synchronization, so that parallelism pays even when individual activities are small.

Together these goals imply that much of the management of parallelism must be dynamic, handled by the language implementation and the runtime rather than by the programmer.

6.1 Expressing Parallelism

Our base language is a variant of FORTRAN (we have a similar variant of C) extended with futures [6]. A future statement creates a new parallel activity and names a future variable that will eventually receive the activity's result. A future variable may have any base type (float$, for example) and is identified by a qualifier on the type. When the future is created, the full/empty bit of the result location is set to empty, marking the value as unavailable; a subsequent read of the variable blocks until the activity completes and writes the result, which sets the location to full. Synchronization on the availability of the value is thus provided directly by the hardware, with no additional storage or instructions.

We also provide synchronized variables, identified by a sync qualifier. A read of a synchronized variable blocks until the variable is full, returns its value, and resets the state to empty; a write blocks until the variable is empty, stores the value, and sets the state to full. Because the full/empty bit is toggled by each access, synchronized variables are a powerful primitive for producer/consumer relationships, mutual exclusion, and "software pipelining" of dependent activities; combined with futures and ordinary shared data, they cover most synchronization situations that arise in parallel programs, again with direct hardware support.
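As with the full/empty word shown earlier, the blocking behavior of a future is easy to mimic on a conventional system. The sketch below is ours, written in plain C with POSIX threads rather than in the FORTRAN/C extensions described above: it spawns one activity whose result lands in an initially empty cell, and the consumer's read blocks until the producer's write fills it (and, like Future-mode accesses, leaves it full).

    /* Illustration of future-style synchronization; not the Tera language implementation. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  filled;
        int             full;      /* plays the role of the full/empty bit */
        double          value;
    } future_cell;

    static future_cell r = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0.0 };

    static void future_write(future_cell *f, double v) {   /* the activity fills the cell  */
        pthread_mutex_lock(&f->lock);
        f->value = v;
        f->full  = 1;
        pthread_cond_broadcast(&f->filled);
        pthread_mutex_unlock(&f->lock);
    }

    static double future_read(future_cell *f) {            /* consumers block until full   */
        pthread_mutex_lock(&f->lock);
        while (!f->full)
            pthread_cond_wait(&f->filled, &f->lock);
        double v = f->value;                                /* the cell stays full          */
        pthread_mutex_unlock(&f->lock);
        return v;
    }

    static void *activity(void *arg) {                      /* the spawned parallel activity */
        (void)arg;
        future_write(&r, 2.0 * 3.0);                        /* hypothetical computation      */
        return 0;
    }

    double example(void) {
        pthread_t t;
        pthread_create(&t, 0, activity, 0);   /* "future r = activity()" in spirit          */
        /* ... unrelated work can proceed here ... */
        double v = future_read(&r);           /* blocks until the activity has written r    */
        pthread_join(t, 0);
        return v;
    }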
6.2 Implementing Parallelism

The runtime system implements these constructs with three kinds of entities: chores, virtual processors, and a work pool. Executing a future statement (or any other parallelism-generating construct) creates a chore, a ready-to-execute parallel activity represented compactly as a continuation. Chores are placed in a pool of ready work shared by the job. The runtime reserves some number of hardware streams to act as virtual processors; each virtual processor repeatedly selects a ready chore from the pool and runs it, either to completion or until it blocks. The mechanism is independent of language construct and could be used to support Ada tasking or concurrent threads systems such as PRESTO [5] or [2].

The number of virtual processors is not fixed. The runtime aims to react to changes in the application's parallelism by adjusting the number of virtual processors it employs: when the ready pool grows and streams are available in the job's protection domain, additional virtual processors are created; when the pool stays small for a period of time, virtual processors release their streams, so that resources acquired during a peak of parallelism are given back during the valleys for use by other teams and other jobs. The decision is based on simple, continuously maintained measures, such as the ratio of the size of the ready pool to the number of virtual processors averaged over an interval.
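A virtual processor in this scheme is essentially a loop over the ready pool. The C sketch below is our own schematic of that loop; the pool, chore, and release operations are hypothetical placeholders rather than the Tera runtime's interfaces, and it shows only the shape of the dispatch cycle described above.

    /* Schematic of a virtual processor's dispatch loop; not the actual runtime code. */
    typedef struct chore {
        void (*run)(void *state);    /* continuation: code pointer ...                 */
        void *state;                 /* ... plus the saved state it needs              */
        struct chore *next;
    } chore;

    /* Hypothetical, thread-safe ready-pool operations. */
    extern chore *pool_take(void);                 /* NULL if the pool is empty         */
    extern int    pool_stayed_empty_too_long(void);
    extern void   release_this_stream(void);       /* e.g. a QUIT on the Tera           */

    void virtual_processor(void) {
        for (;;) {
            chore *c = pool_take();
            if (c) {
                c->run(c->state);    /* runs to completion, or re-queues itself when it
                                        blocks on a full/empty synchronization          */
            } else if (pool_stayed_empty_too_long()) {
                release_this_stream();   /* shed the stream so others can use it        */
                return;
            }
        }
    }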
A chore may block because a synchronized or future variable it is trying to access is in the wrong state, or because it is waiting for other chores to complete. Blocking can be handled in a variety of ways, and the right choice depends on the expected waiting time:

● If the synchronizing memory reference will succeed soon, the waiting chore should simply retry the access; it will be executing again quickly, little state needs to be saved, and the cost of synchronization stays low.

● If the reference will not succeed for a long period of time, a heavier weight, non-busy-waiting strategy is appropriate: the chore's state is saved and the virtual processor turns to other ready work, so the stream is not wasted.

Our implementation obtains both behaviors from a single code sequence by using the hardware's retry-then-trap mechanism. A synchronizing memory reference that finds the full/empty bit in the wrong state fails and is retried automatically; the retries are issued at a fixed rate and are interleaved with the stream's other instructions, so a short wait costs almost nothing and failed requests do not flood the network or the memory. Each failed attempt increments a retry counter, and when the counter reaches a limit associated with the protection domain, the access traps. The trap handler saves the blocked chore's state (a continuation of only a few registers) on a list associated with the synchronization location, and the stream goes on to execute another chore. When the synchronizing write the chore was waiting for finally occurs, the blocked chore is placed back in the ready pool. This split-phase treatment of long waits is similar to that of I-structure transactions [4]. The retry limit is set to balance the cost of saving and restoring a chore's state against the cost of busy-waiting; its interaction with the operating system's resource accounting is described in Section 7.
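The division of labor can be caricatured in software terms as follows. This sketch is ours and is only illustrative: on the Tera the bounded retry is performed by the hardware and the "trap" invokes the user-level handler described above, whereas here everything is ordinary C and the wait-list handling is left to the caller.

    /* Software caricature of retry-then-trap: bounded spin, then report "block me". */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int full;                 /* stands in for the full/empty bit            */
        long       value;
    } sync_cell;

    /* One attempt at a read-when-full-set-empty.  True and *out filled on success.    */
    static bool try_sync_read(sync_cell *c, long *out) {
        if (atomic_exchange(&c->full, 0) == 1) {   /* was full: take it, leave it empty */
            *out = c->value;
            return true;
        }
        return false;
    }

    /* The retry phase.  On the Tera the hardware issues these retries and traps when
       the per-domain limit is exceeded; here, when we return false, the caller parks
       the chore's continuation on the cell's wait list (not shown) and runs another.  */
    bool sync_read_with_retry(sync_cell *c, int retry_limit, long *out) {
        for (int tries = 0; tries < retry_limit; tries++)
            if (try_sync_read(c, out))
                return true;             /* short wait: the cheap busy-wait sufficed   */
        return false;                    /* long wait: "trap" and block the chore      */
    }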
6.3 Interaction with the Operating System

A group of streams running within the same protection domain is called a team. The runtime must interact with the operating system only to acquire and release coarse grain resources (protection domains and the teams within them); finer grain changes are made directly. On any given processor, additional streams (for use as virtual processors) can be bound by the use of the RESERVE and CREATE hardware instructions, and streams can be released with the QUIT instruction. Until the runtime reaches a stream limit imposed by the operating system, it can thus grow and shrink its stream resources simply by communicating directly with the hardware, without a system call. The stream limit ensures that the operating system controls the allocation of processor resources among jobs and can preserve the desired fairness; requests to raise the limit, or for new teams or protection domains, are made to the operating system, which monitors resource usage and acts on the assumption that a job's recent behavior predicts its immediate needs.

7 Coarse-Grain Parallelism

The operating system schedules coarse grain parallelism: jobs and their large-grain resource demands. A task is the system-wide unit of resource allocation in the Tera system. It provides an execution environment for running a program. Each task consists of one or more scheds. A sched is the operating system's schedulable unit: a group of teams that are scheduled together, where, as above, a team is a group of streams within a single protection domain and therefore on a single processor. A task can span many processors through its scheds and teams, and it can grow and shrink as the parallelism of the program rises and falls.

For example, a running task might contain two scheds, where Sched 1 contains Teams 1-1 and 1-2 and Sched 2 contains Team 2-1. Teams of the same sched are on different processors, while teams of different scheds may share a processor; each sched is scheduled separately, so the task as a whole occupies a dynamically sized set of processors.
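The resource hierarchy just described (task, sched, team, stream) can be summarized with a small data-structure sketch. This is only our shorthand for the relationships in the text; it is not the operating system's representation, and the field names are invented.

    /* Illustrative summary of the scheduling hierarchy; field names are ours. */
    struct stream;                      /* one hardware instruction stream               */

    struct team {                       /* streams sharing one protection domain,        */
        int            processor;       /*   and therefore residing on one processor     */
        int            protection_domain;
        struct stream *streams;         /* grows and shrinks via RESERVE/CREATE/QUIT     */
        int            nstreams;
    };

    struct sched {                      /* the operating system's schedulable unit:      */
        struct team  *teams;            /*   a group of teams scheduled together,        */
        int           nteams;           /*   each team on a different processor          */
    };

    struct task {                       /* system-wide unit of resource allocation;      */
        struct sched *scheds;           /*   the execution environment for one program   */
        int           nscheds;
    };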
Scheduling is handled by two cooperating schedulers. The big scheduler manages long lived, resource consuming tasks (for example, batch jobs submitted through the job entry system), placing their scheds across processors and adjusting their resources over relatively long intervals; the big scheduler itself runs as a user-level task. Small tasks, which are short lived, use few resources, and need quick turnaround time (shell commands, for example), are handled entirely by a per-processor scheduler, which multiplexes them onto a small number of protection domains reserved for that purpose.

To foster rapid response to changes in parallelism, the mechanisms for adjusting resources are layered by cost. Streams within a protection domain are created and quit directly by user code, as described earlier. Protection domains on a processor (up to 15 of the 16 are available to user tasks) are granted and reclaimed by the per-processor scheduler. Teams migrate between processors, or are swapped, only under the coarser-grain and less frequent actions of the big scheduler. Inexpensive, frequent adjustments therefore track the rapid variations in a job's parallelism, while costly actions are needed only when its behavior changes persistently.

Resource decisions at both levels are driven by inexpensive hardware counters associated with each protection domain. One counter counts the instructions issued by the domain; divided by the elapsed clock ticks, it gives the fraction of the processor the task actually used. Another counter is advanced every tick by the number of the domain's streams that are ready to issue; divided by the ticks, it yields the average number of ready streams, a direct measure of the task's dynamic parallelism. A third counts phantoms, issue slots in which no stream of the domain was ready to issue because of memory latency or pipeline hazards. Together with the current and reserved stream counters described in Section 5, these give the operating system a precise and continuously updated characterization of each task's parallelism and utilization, obtained by sampling rather than by cumbersome software accounting.

The operating system uses the counters to decide where to place scheds and when to change allocations. If a processor is underutilized, it may add streams or protection domains to the tasks already there, or place additional scheds on it. If a task shows excess demand, with many ready streams relative to the issue slots it obtains, the system may expand it onto more processors; if a task's parallelism collapses, its extra teams and domains are reclaimed for other tasks.
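To make the counter arithmetic concrete (the symbols and sample numbers are ours, purely illustrative): over a sampling interval of $T$ ticks, let $I$ be the instructions issued by a domain and $R$ the sum over ticks of its ready streams. Then

    \[
      \text{utilization} = \frac{I}{T}, \qquad
      \overline{\text{ready streams}} = \frac{R}{T},
      \qquad \text{e.g. } T = 10^6,\ I = 4\times 10^5,\ R = 3\times 10^6
      \;\Rightarrow\; 40\%\ \text{of the processor, 3 streams ready on average.}
    \]

In this illustration, a large phantom count together with a small average of ready streams suggests the task cannot profit from more issue slots, while many ready streams and a utilization well below the task's share suggest it could use additional streams or another protection domain.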
8 Conclusions

The Tera system integrates a multithreaded architecture, a parallelizing compiler, a runtime system, and an operating system that were designed together to exploit heterogeneous parallelism. Multithreaded processors have the ability to mask the latency of one stream with the activity of another; even more than traditional designs, they therefore reward an environment in which parallelism of every available grain is delivered to the hardware. Each component of our system exploits the parallelism appropriate to its own level and exports the information that lets the other components do their jobs: the compiler generates fine grain parallelism and lightweight stream management code, the runtime grows and shrinks its pool of virtual processors to follow an application's parallelism, and the operating system uses inexpensive hardware counters to move coarse grain resources between tasks. A shortage of parallelism at one level can often be compensated by parallelism at another; similarly, a temporary shortage of parallelism in one task releases resources to others, fostering high utilization of the machine as a whole. The compiler, runtime, and operating system are in early development and currently run applications on the Tera hardware simulator. We believe the system integration described here will allow a wide range of applications, from highly parallel loops to jobs whose activities are less uniform, to use the machine well.

References

[1] A. Agarwal, D. Chaiken, J. Kubiatowicz, et al. The MIT Alewife machine: A large-scale distributed-memory multiprocessor. In Scalable Shared Memory Multiprocessors. Kluwer Academic Publishers, 1991.

[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In Proceedings of the 1990 International Conference on Supercomputing, June 1990.

[5] B. Bershad, E. Lazowska, and H. Levy. PRESTO: A system for object-oriented parallel programming. Software Practice and Experience, 1988.

[6] D. Callahan and B. Smith. A future-based parallel language for a general-purpose highly-parallel computer. In D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing. MIT Press, 1990.

[7] D. Culler, A. Sah, K. Schauser, T. von Eicken, and J. Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.