to reuse previously computed results from inspectors. (e.g. communication .... including sparse matrix linear solvers, adaptive computational fluid dynamics.
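Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse*
Ravi Ponnusamy†‡, Joel Saltz†, Alok Choudhary‡
†Computer Science Department, University of Maryland, College Park, MD 20742
‡Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY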
ecutor In
this paper,
we describe
compiler
can deal with
The first
mechanism
two new ideas
irregular
invokes
cedure via a set of compiler the user
to use program
tivity,
spatial
load.
The second
many
location
arrays
that
on-processor
buffer
sults for
these
The directives
elements
loop
from
loops
allow
We
it
data
copies
piler
generates
of when re-
with
arrays
These tern
is determined
time.
In these
by variable
movement
programmers
work,
map
sors. The
of data
to carry
can also be generated in a process
we call
per, we present where
compilers D
methods
Fortran
On distributed array
memory
The
inspector
cal memory element
related
loops with
by transforming
of code:
partitions
an inspector
loop
iterations,
off-processor
by a loop,
and builds
the original allocates
mation
distributed
tion.
Ia-
of graph
ments
a communication
needed
to produce connectivity,
and information
*This work was sponsored in part by ARPA (NAG-1-1485), NSF (ASC 9213821) and ONR (SC292-1-22913). Author Choudhary was also supported by NSF Young Investigator award (CCR-9357840). The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
ation
of
standardized
tioned.
produces
© 1993 ACM 0-8186-4340-4/93/0011
load.
code that, the
that
a data
have been de-
support
and compiler
users to specify can consist
spatial
of array
array
information,
elements
also generates
structure
that
elewith
the compiler a standardized
and
a (user
func-
of a descrip
location
generates to
the infor-
distribution
Based on user directives above
parti-
[24, 25, 19, 17, 2, 13].
associates
at runtime,
we
to each pro-
the different
a customized
representation
The compiler
of the mesh.
communication,
with
this information
For incompu-
in such a problem
elements
the runtime to allow
manner.
heuristics
associated
in the to par-
does not have a use-
interprocessor
years promising
aris as-
locations
pattern
array
need
data
of an irregular
structures
arbitrary
arrays
of processors.
advantageous
frequently
the data
needed
In our view,
tion
array
the in-
distributed
in an irregular
We have implemented
and an execu-
time
array
and memory
have been studied
transformations
data
are called
to the connectivity
and tradeoffs methods
large
the nodes
minimizes
In recent
veloped
indirect
any indirection
of distributed
arrays
need to assign
tioning
architectures,
reference
each inspector
memories
It is frequently
we partition
cessor.
coded usto Fortran
[28].
for each unique
accessed
problems
closely
may
for
arrays
processor
machine.
in a way that
implementation
make it possible
irregular
accesses can be handled
loop into two sequences tor.
that
com-
a record
have written
since the last
local
mesh are numbered
When
scheme,
machines,
the way in which
tational
pa-
to see whether
storage
ful correspondence
compiler In this
record
data
term
distributed
stance,
preprocessing [23].
and a prototype
tition
schedule
may
In this
iter-
The
maintains
is used to indirectly
between
to specific
distributed
of proces-
memory
compilation
handle
and
Long
signed
at run-
preproces-
memories
out runtime
extensions
only
loop
off-processor
locations).
array.
memory
partitioned
mys.
access pat-
out
structures the
techniques
to efficiently
[10] or Vienna
known carry
by a distributed run time
we demonstrate
ing a set of language
data
between
code needed
the data
values
cases,
sing to partition the
problems
schedules, associates
buffer
often results
was invoked.
In distributed
In sparse and unstructured
loops). that
computed
or intrinsic
that
to handle
method
at runtime,
90D loop
[26] and KALI
(irregular
previously
on-processor
runtime
arrays
that
code that,
to be partitioned
Introduction
compiler
may have been modified
spector 1
reuse
In the ex-
of transformation
conservative
array
data.
and computation
communication
distributed
checks this
ARF
kind
information
a Fortran
another
compiler
to
(e.g.
to a distributed
implementation.
The
referenced
possible
it is pos-
off-processor
communication
a simple
partitions,
performance 90D
propose inspectors
with
[21].
indirectly
ation
copies
a Fortran
with
from
partitions,
data
out
in
that
required
the actual
[16] used this
makes
inspectors
iteration
We present
mechanisms
that from
off-processor
locations).
pro-
connec.
method
results
schedules,
are carried
Center
1.9244
to prefetch phase,
compiler
and computational
to recognize
computed
associates
graph
conservative
a compiler
communication
information
mapping
to describe
of array
to reuse previously
(e.g.
directives.
HPF
effectively.
a user specified
is a simple
cases enables
sible
by which
computations
Choudhary$
Architectures
Syracuse,
schedule
and
Reuse*
parallel
20742
Abstract
Partitioning
Alok
Northeast
Department
of Maryland
Park,
Data
Schedule
Ponnusamytt
computer
for
then
specified)
code that,
at
is used to partition
passes parti-
runtime, loop it-
Pkwe
C Single statement loop: Loop L1
FORALL i = 1, N
   y(ia(i)) = x(ib(i)) + ... x(ic(i))
END FORALL
[Figure 2: Solving Irregular Problems. Phase A: generate GeoCoL graph, partition GeoCoL graph. Phase B: generate iteration graph, partition iteration graph. Phase C: remap arrays and loop iterations. Phase D: preprocess loops. Phase E: execute loops.]
C Sweep over edges: Loop L2
FORALL i = 1, N
   REDUCE (ADD, y(end_pt1(i)), f(x(end_pt1(i)), x(end_pt2(i))))
   REDUCE (ADD, y(end_pt2(i)), g(x(end_pt1(i)), x(end_pt2(i))))
END FORALL
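As an illustration of the inspector/executor transformation applied to a loop such as L1, the following Python sketch simulates one processor's view inside a single process. It is a minimal sketch under stated assumptions, not the CHAOS implementation: the block distribution, the function names (inspector, gather, executor), 0-based indexing, and the list-based stand-in for interprocessor communication are all hypothetical. Only the right-hand side x(ib(i)) + x(ic(i)) is evaluated; the scatter to y(ia(i)) is omitted.

```python
# Hypothetical single-process simulation of an inspector/executor pair for
#   FORALL i: y(ia(i)) = x(ib(i)) + x(ic(i))
# Assumptions (not from the paper): block distribution, 0-based indices,
# and plain list lookups standing in for interprocessor communication.

def owner_and_offset(g, block):
    """Block distribution: global index g -> (owning processor, local offset)."""
    return g // block, g % block

def inspector(my_rank, indirection_arrays, block):
    """Scan the indirection arrays once.

    Returns (schedule, localized). `schedule` lists each distinct
    off-processor global index exactly once, in ghost-buffer order.
    `localized` holds, per indirection array, indices into the
    concatenation [local block | ghost buffer].
    """
    ghost_slot = {}          # global index -> slot in the local+ghost array
    localized = []
    for ind in indirection_arrays:
        loc = []
        for g in ind:
            p, off = owner_and_offset(g, block)
            if p == my_rank:
                loc.append(off)                      # on-processor reference
            else:
                if g not in ghost_slot:              # fetch duplicates only once
                    ghost_slot[g] = block + len(ghost_slot)
                loc.append(ghost_slot[g])
        localized.append(loc)
    schedule = sorted(ghost_slot, key=ghost_slot.get)
    return schedule, localized

def gather(schedule, x_blocks, block):
    """Stand-in for communication: fetch the scheduled off-processor elements."""
    return [x_blocks[g // block][g % block] for g in schedule]

def executor(my_rank, x_blocks, schedule, localized, block):
    """Prefetch ghost copies, then run the loop body using local indices only."""
    x_local = x_blocks[my_rank] + gather(schedule, x_blocks, block)
    ib_loc, ic_loc = localized
    return [x_local[b] + x_local[c] for b, c in zip(ib_loc, ic_loc)]
```

Because the schedule depends only on the indirection arrays and the distribution, the inspector's result can be kept and passed to the executor on every later sweep in which neither has changed.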
piler
1: Example
Irregular
To our knowledge, is the first
to provide
the Vienna specify
this
Fortran
trol
Loops
kind
mations
describe
and language
statement
pendencies dition,
allowed
accumulation,
irregular
array
indirection
D syntax
pendencies. out
with
loop
second
90D
This time
compiler Our
paper technique
The first
concurrent
ad-
CHAOS; The
being
iteration
partitioners
an overview
transformations
methods
developed
and describe
fluid
as follows.
ture
in Sec-
the
effort.
We describe
extensions
data
demarcated
is called
irregular
(Figure
mapping
project
is called
the CHAOS
of the
earlier
them
first
on
support,
three
library. PARTI
(the
GeoCoL
onto
processors.
sections.
arrays
GeoCoL
graph
data
are
data
partitioned
structure)
with
decomposed
using
a particular
structure
calculates
five
of these steps here, and will
in later
distributed
distributed involves
steps in the figure
and computations
description
in detail
the
problems
our runtime
2). The
data
a brief
associated
the
data
a
access
set of loops.
is passed
how data
in
2, CHAOS data strucThe
to a partitioned.
arrays
should
be d~
In Phase B, the newly calculated array distributions are used to decide how loop iterations are to be partitioned
among
takes into
data
processors. of arrays
and
In Phase
5 we
(1) coordinate the storage
the
involves
we use to con-
calculation In Phase
and loop
D, we carry
out
interprocessor
C we carry
a shared
generating
account out
loop
the actual
iterations. the
preprocessing
data name
communication
needed
movement,
of, and access to, copies
and (3) support
struc-
This
access patterns.
remapping
In Secdata
In Section
standard
of
the run-
schedules.
used to couple
to deal with
tributed.
code
We set the context
support of clearly The
is a superset
using
patterns The
Univertemplates
3, we describe
to compilers.
the language
dy-
of the For-
of the compiler-generated parallelized version.
generate
work
known regular manner. In Phase A of Figure procedures can be called to construct a graph
and compiler
on simple
steps
Initially,
We use this
as part
phases.
support
concurrent machines
discuss
to
runtime
of a sequence
library
We provide
we carry
is similar
by Syracuse
results
of our compiler
which
loop codes.
to save communication
loop
in which
procedures
the procedures
related 8.
[21, 26, 23].
concern
de-
efficient
consist
CHAOS
major
For-
sections.
2. In Section
6 we
CHAOS
computational
Solving
directly
without
computational
dynamics
our
discuss
in Section
of
the runtime
library
loop is a single
references
second
our runtime
is organized
in Section
4 we describe
ture
or
that
de-
level of
1, we employ
is a loop
The
implementation
tion
present
that
array
in the following
reveal that the performance is within 10% of the hand the work
problems
We also assume that is indexed
loop
We have implemented sit y [9].
(e.g.
array
Overview
We have developed
irreg-
carried
of a single
in unstructured
to demonstrate
tran
etc).
two loops.
codes and molecular
transformations
Overview
In Section
the performance
the
of a single
loop
side reductions
in Figure
indirect
operations.
those loops found namics
only
that
as a result
shown
to depict The
reduction
2
memory loops
loop
7 and we conclude
2.1
transfor-
index.
In the example statement
hand
a distributed
partitioning. to characterize
We briefly
tion
runtime
to provide
We assume
the
max, min,
methods.
data
in
described
compiler
in the context
accesses occur
with
by the loop tran
are left
The
required
where
runtime
performance
Problems
Fortran.
support,
out
loop
Irregular
a user can also
strategies
above.
accesses are carried
multiple
2: Solving
compiler-linked
of our
com-
We also note that function.
extensions
described
Fortran
definition,
to Vienna
the runtime
new capabilities ular
of support.
transformation
here can also be applied
described
memory
distribution
and compiler
We will
the implementation
distributed
[28] language
a customized
support
Loops
Figure
present in this paper
Loops
E
Execute
FORALL
erations.
Remap
>
D
preprocess
x(end_pt2(i))))
Figure
Iterations
y(end_pt1(i)), Phase
f(x(end_pt1(i)),
and Loop
of off-processor
space.
This
to
(2) manage data,
preprocessing
schedules,
translating
array
indices
to access local
and allocating data.
buffer
distributed
processor
the
Finally,
earlier
of off-processor
to retrieve
data-sets
memories.
from
copies
....
data
space for copies of off-processor
It is also necessary
irregularly tion
local
phases
globally
from
indexed
the numerous
E we use informa-
to carry
out
CHAOS
and PARTI
adaptive
procedures including
necessary
computational
dynamics
codes
distributed
fluid
and
memory
dynamics
a prototype
linear
codes,
compiler
S5 ALIGN
solvers,
S6 . . .
molecular
[23] aimed
set
values
Language
decomposition
lar problems While
will
directives
Sup-
be presented
our work
will
sions could pilers
Fortran
D and HPF
from
a rich set of data decomposition
a definition
of such language
that
These
languages,
users explicitly
Fortran
D can
define
In Figure
attributes
which
size, dimension titioned
array.
processors.
using two declarations. POSITION. ity
declaration statement onto
and specifies
the user with
In addition,
a distribution array
Fortran
D statement
with
ALIGN.
In statement
titioned
equal
to each processor. distribution
between
The
which
Fortran-D
for the user to couple of partitioning from
scratch
separately
constructs process.
heuristics
is no standard
by run-
are not rich of the map ar-
While
available,
can represent
interface
3
dis-
gives the distribution
the generation
compilation
and the application
there
coding
a significant
between
the
are such
effort.
partitioners
codes.
D
Communication
3
called
Schedule
The cost of carrying in Figure duced
is produced
Reuse
The
by the inspector Compile
We propose
second
is to be mapped
analysis
schedules
a simple
once
and
needed
conservative
us to reuse the results an inspector
for loop
pro-
then
used
to reuse inspecupon
method
results
B, C and D
the information
is touched
cases allows from
(phases
when
is computed
time
tor communication
dimensional-
out an inspector
2) can be amortized
repeatedly.
is an executable
onto
a distribution
sized
In statement reg.
Array
irreg
An irregular
using
an integer
array;
ement
i of the
distribution
when
the
●
The
assigned
map is aligned (in
from
in
[12,
that
7].
in many
inspectors.
L can be reused
The zs long
is set equal
compiler
that
L have
generates
of when
intrinsic
may
scheme,
in loop
L have
the inspector
indirection
been
arrays
modified
since
associthe last
invocation.
record
this
referenced
since the last time
and
loop
code that
a Fortran have
used to indirectly
is specified
arrays
is no possibility with
inspector
is to be partitioned
is assigned
there ated
is par-
be used to specify
distribution
map(i)
irreg
reg
of data unchanged
was invoked,
A
S3, of Fig-
one block
S5, array
map will
S7) how distribution processors.
with
using
distributions remained
specify
decompositions
S4, decomposition blocks,
●
regu-
processors.
In statement
3, two of size N each, one dimensional
statement
map array
in Figure
the irregularly
has to be generated
a partitioned.
There
fixes the
a user can explicitly
is associated
are defined.
with
The
depicted
how to partition
elements.
a Fortran
a choice of several
is to be mapped
specific
into
the declarations
as:
D provides
lar distributions. how
array. of irreg
partitioners
processors.
Fortran
ure
in
is DECOM-
template.
a template
tributed
a wealth
is to be par-
declaration
Distribute how
Distribution
the significant
fixes the name,
is DISTRIBUTE.
with
ray to the program
require
a template
the array
array
D Irregular
it is not obvious
enough
an irregu-
array
A distribution
Decomposition
be found
The distribution
The first
and size of the distributed
Fortran,
specify of such
mapping
irreg
3: Fortran
pattern ning
is to be distributed.
is used to characterize
and way in which
between
may
D, one declares
of a distributed
exten-
specifications;
of distributed
some
irreg(map)
x,y with
difficulty
is that
D and Fortran
specified,
explicitly
an example
In Fortran
a distribution
to
partition
3, we present
declaration.
how data
be used
lar inter-processor
extensions as currently
The
D.
and com-
Vienna
Fortran
using
....
of Fortran
of languages
90) provide [10, 8].
of Fortran
language
and HPF.
(evolved
array
for irregu-
in the context
and analogous
be used for a wide range
such as Vienna
Fortran
in the context
be presented
D, the same optimizations
we employ
map
at
Figure The data
of
S7 DISTRIBUTE
Existing
reg
..
S8 ALIGN
of
reg(block)
map with
method
multiprocessors.
Overview port
2.2
reg(N),irreg(N)
S4 DISTRIBUTE
have been used in a vari-
sparse matrix
map(N)
S3 DECOMPOSITION
computation. et y of applications,
x(N),y(N)
S2 INTEGER
local
in Phase
the
S1 REAL*8
but
written reference
each inspector
at runtime
90D loop’s
to a distributed another checks
maintains
statements array
distributed this
runtime
a
or array that
is
array.
In
record
to
see whether any indirection arrays may have been modified since the last time the inspector was invoked.
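The conservative reuse check described here (a global modification counter, a per-array stamp, and a per-inspector record) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the compiler's generated code: the class names DistArray and Runtime, the method names, and the caller-supplied build callback are hypothetical, and only indirection-array modifications are tracked (a check that data distributions are unchanged would be handled analogously).

```python
# Hypothetical sketch of conservative schedule reuse. All names are
# illustrative; only the idea (stamp arrays with a global modification
# counter, rerun the inspector when a stamp changes) follows the text.

class DistArray:
    """Stand-in for a distributed array descriptor with a modification stamp."""
    def __init__(self, values):
        self.values = list(values)
        self.last_mod = 0   # value of the global counter at the last possible write

class Runtime:
    def __init__(self):
        self.nmod = 0            # global count of possible writes to distributed arrays
        self.records = {}        # loop id -> (stamps seen by inspector, cached schedule)
        self.inspector_runs = 0  # instrumentation for this example only

    def note_write(self, array):
        """Conservatively record that `array` may have been modified."""
        self.nmod += 1
        array.last_mod = self.nmod

    def get_schedule(self, loop_id, indirection_arrays, build):
        """Return the cached schedule unless an indirection array may have changed."""
        stamps = tuple(a.last_mod for a in indirection_arrays)
        record = self.records.get(loop_id)
        if record is not None and record[0] == stamps:
            return record[1]                 # reuse the earlier inspector result
        schedule = build(indirection_arrays) # rerun the inspector
        self.inspector_runs += 1
        self.records[loop_id] = (stamps, schedule)
        return schedule
```

The check is conservative in the sense that a write which happens to leave an indirection array's contents unchanged still forces the inspector to run again; it never reuses a stale schedule.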
to p, el-
to processor
In this
P.
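The map-array convention for irregular distributions, where map(i) = p assigns element i to processor p, can be illustrated with a short Python sketch. The function name build_translation_table, the 0-based indexing, and the rule that local offsets follow global order within each processor are assumptions made for this example, not details taken from the paper.

```python
# Hypothetical sketch: derive (owner, local offset) for every global index
# of an irregularly distributed array from a user-supplied map array.

def build_translation_table(map_array, nprocs):
    """map_array[i] = p means element i lives on processor p (0-based here).

    Local offsets are assigned in increasing global-index order on each
    processor; this ordering is an assumption of the sketch.
    """
    next_free = [0] * nprocs          # next unused local slot on each processor
    table = []
    for p in map_array:
        table.append((p, next_free[p]))
        next_free[p] += 1
    return table
```

For example, with map_array = [0, 1, 1, 0, 2, 1] on three processors, elements 0 and 3 land on processor 0 at local offsets 0 and 1, while elements 1, 2 and 5 land on processor 1 at offsets 0, 1 and 2.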
presentation,
an inspector
for
we assume
a forall
loop.
that
we are carrying
We also
assume
that
out all
indirect
array
the form loop
references
y(ia(i))
index
associated
A data
with
(among
of the
array
cess to the array’s a global data any array
with
Note
that
number
to the distributed of times
that
writes
array,
data
data
structure
the current current nmod.
modifies
In this
out,
first
perform
L has m data arrays,
ind~,
carried
out,
array.
arrays 1