their efficiency on the NAS Parallel Benchmarks. We also present a tool which automates detection of the constructs causing data congestions in Fortran array.
Automation of Data Traffic Control on DSM Architecture
Michael Frumkin, Haoqiang Jin, Jerry Yan 1
Numerical Aerospace Simulation Systems Division
NASA Ames Research Center
Abstract
Design
of distributed
distribute
data
for example, of parallel a good paper
having
application
we discuss
Benchmarks.
in Fortran
improving
data
are
in the
computer
OS and
very
the
to reconsider traps and
to avoid
traffic are
with
problems
difficult
factor
such
about
accessing
20%
Several
point
all functional
operations Table 1.
detect
the
1M/S {frumkin,
Traffic
data
blocking,
congestions.
efficiency
the
the
user
data
data
on
placement,
the
NAS
constructs
on code
In this Parallel
causing
data
transformations
for
code
and
data
location
with
the
computer
a machine
thrashing
and
padding
data
Many
performance
the
of the
new
misses
it runs
the
and
on
thread
[13]),
has
simple
other
data
interference
similar
of CFD (see
user
to identify
The
looking
and
The
difficult
privatization.
TLB
location
compiler
and
not
implementations
machine
data
the
architecture. are
constructs
best
on
architecture
variables
excessive
to express
depends
sharing
programming
Even
allow
with
false
and
locality,
cure.
which
the on
as cache
as poor
Control
constructs
varies
in performance.
contribute units
to data code
T27A-2,
to low efficiency
operations
are stalled
Approach
traffic
detection
to achieve
to understand
data
their
advises
user
to
using,
development
However,
the
duty
may
codes
have
achieve
spending
80%
of
data.
factors
of floating
data
dimensions
of peak
including evaluate
the
programs
simplifies
processors. from
to avoid
and
a result,
of his
array
3-4 difference
time
As
such
to diagnose
only
keep
access
a few
automates
codes
greatly
requires
and
programming 2.
performance
on
techniques,
of Data
for accessing
in memory
architecture
from
application.
few explicit
time
parallel
techniques
control
oriented
memory
user
develop
computers
a tool which
in the
Significance
There
size
array
traffic
liberates
performance
DSM
of such
present
computers incrementally
DSM
use various
page
We also
congestions
good
a number and
allows
threads.
on
and
(DSM)
and
Java
scalability
transposition
1
or
programs
program
memory
processors
OpenMP
flow in the data
shared
across
does busy.
not
allow
Second,
and
during
computation
traffic
optimization.
constructs NASA
suffering Ames
codes.
First,
an optimal
a larger
factor
waiting
for data,
The
from Research
of CFD to provide
data
first
step
congestions
Center,
Moffett
the balance mixture
comes
from
the fact
see Example in addressing and
identify
Field,
of the
number
of instructions that
the
1 in Section this the
CA
challenge data
94035-1000;
to many 2 and is to
congestion e-mail:
hjin, yan}@nas.nasa.gov
2 Under explicit constructs we mean such statements as the "register" qualifier in C or data distribution directives in HPF.
type
causing
mary
Data
(TLB)
loss Cache
misses,
to resolve
for
for reduction data
where
known
for some
known
use
possible
so far
padding
at compile allow
tools
leave
code
and
control
of the This which
We demonstrate suggest
a cure
and
LU
with
rhsz
average.
of NAS and
zsolve
nor
channel Even
For example,
it is
Parallel
corresponding
fixing
standard
with to
SP of NAS
for
size
architecture
then
way
are
implemented
the
codes
in the
problem
for
reporting
compilers
level optimizations and
others.
however, analysis.
such
On the
cannot
data
for
as loop level
perform
Many
problems
during
events,
the
inter-
compilers
deep
code
types
of analysis
searching identifies
can
not
tool's
and
ability
using
hardware
the
problems The helped
and
to resolve traffic
anal-
are
not
counters
construct.
Perfometer
user
on tool
have
was
to improve
the
data
time
the
data
simulated able
user. solutions
CFD
performance
The
inserts
poor
traffic
to resolve
affinity
traffic. tool
been
code and in nature
and suggests
warnings about
problematic three
to the
Some
data-to-computations
with
compile
receives
them
problems and
problems at the
to identify
and
perfex
affinity
possible
be evaluated
Benchmarks. operators
ways data
data-to-data
about
on
of a program
These tools allow to instrument the However these tools are diagnostic
and
user
relies
execution
including
a tool which
performance
Parallel
this
nest
traffic
[6, 12]. anomalies.
are evaluated time.
the
of the
data
problems
the
code. These statements code constructs at run
are
Compilers,
tool analyses
informs
statements
in BT and performance
facto
prefetching
the
of these counters constructs with
them.
traffic
of events
analyzing
we present
and
misses.
computer
traffic. page
in memory
TLB
have
data
of the
be applied.
zsolve
trans-
at all. statistics
analysis
de
or dependency
to identify
In this paper for resolving
time
and
worse for
to improve
contention
should
code
problems
optimization
target
Buffer remedy
placement/migration these
developed
in the
and
include
privatization.
to collect
built on the top identify the code
data
pipelining,
approach
for collecting
in the
and
or page
invalidations
NPB
be either
of publications
to resolve
reason
can
Pri-
[11].
optimizations
interprocedural
size
been
of 3-4
of that
tools
remedy
optimization,
of rhsz factor
Neither
to improve software
This
transformations
[7].
and
Lookaside a proper
for nonexpert
have
in spite
These
as full
Another which
cf.
Translation
have
of cache
paper:
is to choose
step
as page
of pages
computations version)
optimizations
such
placement
in this
(SC) misses,
such
for cache
metrics
second
In a number
grouping
easy
congestion
code.
a few techniques
reduction
in compilers
loop fusion,
usually
and
that
architecture.
change, ysis
time
in x-direction,
Many
traffic.
it is not
(OpenMP
improvements target
data
for
The
to the
and how the appropriate
Benchmarks been
them
four
Cache
(CI).
environment
misses,
in hand
identify
operators
apply
data
transposition
use
Secondary
program
include
techniques
have
and
of TLB
We
Invalidations
[4, 9, 10] and
techniques
these
misses,
Cache
controlling
addressed
and
(PDC)
and
or changing
Methods been
performance.
congestions
formation, mechanisms.
These
of the
applications traffic
of the
codes
in the
performing
constructs
data
data
and BT,
to SP
problems by 27%
in
2
Automation
of Detection
For controlling
data
puter
hierarchy
memory
and
traffic
cases
offsets, cache
and
however
and
data
data
access
invalidations.
on cache The
the
processor
the
memory
and
typical
such
user
across
can
be complicated
target
computer
in terms
of cache
metrics
shared
to the
awareness
a help
of the
in the
architecture. as cache
the
comIn
parameters,
such
data,
variations
specifics
of a tool detecting
data
data
array
misses
traffic
order
and
depends
of the
with the
code
execution
placement
and
data
could
advising
streaming
for avoiding the
coherence
tool intended
architecture
congestions
for avoiding
size for reducing
by cache
which
traffic
data
page
caused
of the computer
grouping
on initial
an optimal
interference
data
on data
reuse,
on choosing
problem
as accessing
movement
movement
by simple
is sensitive
a tool can advise
thread
in the
be formulated
Problems
on data
of such
expertise
characterized
and
by increasing access,
on reducing The
with
Such
information
Details
can
and
Traffic
threads.
to the
reduced them.
to have
require
movement
protocol
requirements
to resolve
may
strides
by different
be greatly
has
In a few cases,
coherency
of statements
user
in his application.
machine-dependent
many
the
of Data
ways
through
contention
number
of TLB
in
misses
issues.
to advise
is shown
in the following
example. Example version)
are shown
nested
loop
pages the
1. The
right
has and
pane,
execution Figure
time
and
tating
number
loops
of pages
of increase
Placement
in the
such Tool),
code
a tool
see
with
[2].
affinity key
affinity.
Two
data
run.
For a pair
stream the
often
affinity
data
items.
The
and
is the
the program
ability
Grouping improve
relation
affine
geometry
Cache
Miss
curve cache
(serial in the
number
2.
Merging
see
utilization, point
first
of memory
in Figure
expressions,
are
the
of the
Equation
data
Figure
and
the
1, total
instructions,
program that
if both
see
the
self interference
relation possibility lattice
there
of grouping array
the
traffic
with
affine that
exe-
statement
the
same
the
value
of
into a continuous
latency. ways
array
and datacapa-
instruction
groups many
and
control
loop nest
memory
are
ALIGN
affinity
same
referred
organizing
by hiding
of the
at the
anno-
data-to-data
HPF
data
in the same
and and
through
used
Align-
for automatic
data-to-data
elements
together
performance
are
Data
to identify
automatic
used
array
designed
affinities
it with
(Automatic
is able
to extract
of arrays
items
was tool
the
tool
affine
between
to ADAPT
The
express
of the
is a many-to-many
In [3] it is shown
on the
computations
of floating
ADAPT a.
for enabling
is a correspondence
loop index.
lhsz
number
Benchmarks
a large
of the
features
directives
affinity
the
see
improves
total
Originally
HPF
to-computations bilities.
relation
cache,
by adding
directives.
during
of the
it touches
recalculation accessed,
Parallel
curve.
to data-to-computations
affinity
NAS
optimization since
and
DISTRIBUTE
cuted
The
of primary
the
implemented
Data-to-data
of SP from
computations
utilization
in spite
in zsolve
1, left pane. down
nested
FORTRAN
and
nests
second
decreases
We have ment
slows
a poor the
2 lhsz_t
two
in Figure
actually
and first
first
In general,
to group
elements
affine depends
is a set of solutions
of the
[4].
a ADAPT is built on the top of CAPTools [8]. It uses a CAPTools generated data base, CAPTools analysis and some CAPTools utilities.
code

lhsz

    do j=1,ny
      do i=1,nx
        do k=1,nz
          cv(k)=ws(i,j,k)
          rhon(k)=SFunction(rho(i,j,k))
        end do
        do k=1,nz
          lhs(i,j,k,1)=0.0d0
          lhs(i,j,k,2)=-dttz2*cv(k-1)-dttz1*rhon(k-1)
          lhs(i,j,k,3)=1.0d0+c2dttz1*rhon(k)
          lhs(i,j,k,4)=dttz2*cv(k+1)-dttz1*rhon(k+1)
          lhs(i,j,k,5)=0.0d0
        end do
      end do
    end do

lhsz_t

    do k=1,nz
      do j=1,ny
        do i=1,nx
          lhs(i,j,k,1)=0.0d0
          lhs(i,j,k,2)=-dttz2*ws(i,j,k-1) &
                       -dttz1*SFunction(rho(i,j,k-1))
          lhs(i,j,k,3)=1.0d0 &
                       +c2dttz1*SFunction(rho(i,j,k))
          lhs(i,j,k,4)=dttz2*ws(i,j,k+1) &
                       -dttz1*SFunction(rho(i,j,k+1))
          lhs(i,j,k,5)=0.0d0
        end do
      end do
    end do
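As an illustration of why the loop order of a nest over a three-dimensional Fortran array matters for TLB misses, the following minimal sketch counts the distinct memory pages touched by one pass of the innermost loop. It is not part of the tool described in the paper. The function name, the array extents, and the page size (2048 elements, i.e. 16 KB pages of 8-byte values) are assumptions chosen only for the example.

```python
def pages_touched_by_inner_loop(dims, loop_order, page_elems):
    """Count distinct pages touched by one pass of the innermost loop.

    dims       -- extents (nx, ny, nz) of a column-major array a(i, j, k)
    loop_order -- loop indices from outermost to innermost, e.g. "jik"
    page_elems -- number of array elements per memory page (assumed value)
    """
    nx, ny, nz = dims
    # Column-major (Fortran) strides, measured in array elements.
    stride = {"i": 1, "j": nx, "k": nx * ny}
    extent = {"i": nx, "j": ny, "k": nz}
    inner = loop_order[-1]
    # Outer indices are held at 0; only the innermost index varies.
    pages = {(stride[inner] * t) // page_elems for t in range(extent[inner])}
    return len(pages)


dims = (64, 64, 64)   # hypothetical grid size
page_elems = 2048     # assumed: 16 KB pages holding 8-byte elements

print(pages_touched_by_inner_loop(dims, "jik", page_elems))  # 64
print(pages_touched_by_inner_loop(dims, "kji", page_elems))  # 1
```

With k innermost, every iteration lands on a different page, so a nest that reads several such arrays can quickly need more pages than a TLB holds. With i innermost, a whole pass stays within one page.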
1: Data
serial,
saves
a large large
Traffic
number
(right
control
affinity
most arrays
problems
are
profiles
of both
results
from
stencil
affinity
nest
along
over
all directed
relations
In order
ADAPT
relation
each
lists
is one-to-many
set of memory
to a statement affinity
graph data
locations c if the
has C and
affine
to it.
of program
referenced
in spite
nest
in the
affinity grids.
in different
The
relations and
allows
A
between
These
elements)
rule
an array
of
control
relations
used
of u used
a
statement. statement.
flow graph.
q from
vertices
program
program
statements,
in the
at address
D as the Many
data
all elements
We represent
set
datum
chain
time
by the
constant
and
rearranging
involved
discretization for arrays
pages By
we call
statements
to propagate
The
u forms
union the
an
of these
nest
for computation
affinity of each
mapping.
affinity. C be the
arrays
creates
2.
in each
dominated
nest
to an array
of q and Let
in the
leading
element
execution
see [2]. The
relation
u.
memory line.
is one-to-many
affinity
ordering
many
in Figure
with
of NPB2.3-
cache
shown
the
The
q and
graph.
path
loop
improving
between
paths
between
Data-to-computations
the
lhsz.f
are
immediately
rule,
per
of arrays
on structured
the chain
directed
through
by a set of vectors
to deduce
uses
relation
affinity
(i.e.
from Such
pair
applications
operators
taken
word
codes
for each
blocks
in CFD
one
resolved
relations
basic
by a stencil
relations.
of the same
in affinity in the
difference
can be approximated them
be deduced
case we observe
resulted
scan
only
The
can
(left)
in SFunction.
calculations it uses
this
all arrays
common
since
used
pane)
relation
and
since
misses
code
of FPI.
dependence
statement
Original
instructions
misses
of PDC
in number
The
point
of TLB
number
increase
Optimization.
few floating
computations
with
do end
(k+ i)
do
Figure
the
do
do
end end
end
(k)
program.
d is either of the
properties
4
parts can
by a bipartite and
graph
let D be the
We say a memory operand and
or result
location of c.
an arc connecting
be expressed
called
program
program
in terms
data,
d is affine
The each
i.e.
program statement
of the
affinity
Figure 2: The effect of the optimization on the performance of the lhsz nest. The performance of lhsx is used as a reference. The horizontal axis shows different types of events measured with the use of hardware counters. The vertical axis shows a normalized number of measured events. Here FPI stands for Floating Point Instructions, TLB for Translation Lookaside Buffer misses, PDC for Primary Data Cache misses, SC for Secondary Cache misses, and CI for Cache Invalidations.
graph.
For example,
c1 to c2. The
analysis
nests
data
and
loop
index
In most index
and
FFT
multiple
algorithm grids
points
with
nonlinear
most
nests
function
coefficients knowledge but
of the multiple of the
of the actual free
can
not
set
{-1,0,1},
Checking good
However,
cache temporal
some
traffic
(see
where
with
The of the of the
the
time.
The
core
working
of
with
enumerated
indicates
nests
I.
grids) the
specially
matrix
inside
at iteration
nests
tool
from
I is a vector
include
nest;
CI
connecting
at compile
nests
loop
of the
an exclusion
the
can be deduced thread
values property
be verified
without
unfriendly
necessary
analysis
arcs
on structured
working
coefficients
path
statements
the
known
and
order.
referenced
These
array.
further
with
If the
and
a case.
as
the
nests
at this
point.
In
representing
the
idx
multigrid
methods
where
of 2.
tool inserts the expression in the call such test run time test.
volve
element
a file; nests
the
the case
(I; idx(I))
coefficients
is not
any
function
numerical
tiles).
symbolic
from
is a direct
indexing
applications
in a precomputed
without
data
of ±dx function
of interference form
are
properties
is read
access
(CFD
is given
misses,
in any
In this
j, k) = i + 2 •j • k for kji
stored
functions linear
elements
coefficients Some
function
by
arrays.
of an array
this
Cache
c1 if there
of expressions
domain
where
idx
function
the
address
idx(i,
the
nest
Cache)
measured with the use of of measured events. Here
can be executed
be simplified by
of I with
the
access
with
are
function
and
as a pair
application
where
idx
can used
be expressed
applications
where
use
graph
is a memory
is linear
on a statement
independent
locations
in our
in our
grid
c2 are
affinity
can
cases
function
the
c2 depends
idx(I)
of the
few nests
a statement
memory
statements
SC stands
Invalidations.
c1 and
the
Instructions,
Cache
of the
and
are
the
secondary
Otherwise,
the
Point
Data
access
spatial conditions
condition
of the coefficients
knowing and
can
numerical
user
can
friendly
(see the
traffic obtains
In general,
[5] and
for cache
data the
the
patterns.
locality
only symbolic
noninterference of the
code
using
not
below) subsection
values
of the
warning
cache
friendly
be expressed
others
require
in a symbolic coefficients
at run
time.
computations in simple
can
on the
on generation
be expressed
the
computations
information
terms
be formulated
the We in[4]. and
checked. The first condition is simple: the coefficient at the innermost loop index is 1. Otherwise, a nonunit stride in memory access can cause underutilization of data loaded into the cache.
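The first condition can be sketched as a small check on the coefficients of a linear index function. This is a minimal illustration, not the tool's implementation. The dictionary representation of the coefficients, the function name, and the array extents are assumptions made for the example.

```python
def is_unit_stride(coeffs, loop_order):
    """Check that the coefficient at the innermost loop index is 1.

    coeffs     -- maps a loop index name to its coefficient in a linear
                  index function (hypothetical representation)
    loop_order -- loop index names from outermost to innermost
    """
    return coeffs[loop_order[-1]] == 1


# Column-major array a(nx, ny, nz): idx(i, j, k) = i + nx*j + nx*ny*k
nx, ny = 64, 64   # hypothetical extents
coeffs = {"i": 1, "j": nx, "k": nx * ny}

print(is_unit_stride(coeffs, ["j", "i", "k"]))  # False: k innermost, stride nx*ny
print(is_unit_stride(coeffs, ["k", "j", "i"]))  # True: i innermost, stride 1
```

When the check fails, consecutive iterations of the innermost loop use only one element from each cache line they load, which is the underutilization the text describes.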
The
other
Detection
of self
represents
the
array
and
sizes
a set the
the
sizes,
a test
Detection affine offset
and
of this
of both
Detection
nxa
requires
same
TLB_SIZE;
loop
exceeds
the
user
can
not
2.
arrays. • nya
block
TLB
the
gets
address
then the
misses.
accessed
high then
in this
a single
by different locality.
would
case)
i, j, k with
is known. symbolic
addrp(i,j, where
nest
p is the
coefficients
thread
condition
c > a(nx-
This read
cross are
when
interference
the
inter
array
represented
k = 1, 2, 3.
An
for example, bigger
misses
(as in Example
by
evaluation
if both
same
address
4. If the
accessed
can
arrays
array. 1) usually
nest.
Otherwise,
time
be formulated
condition
is checked
only
invalidations. copied
and
assume
function
to be true if both
then
conditions
as nonoverlapping
noninterference
are
innermost
test.
at the memory
lines
in the
be proved
be placed
program access
in the can
arrays
as a run
happens
of the
conditions
condition
cache
at
condition of the for
In the
into that
the
be a linear
running
"read/write"
arrays
of "read"
cache
array
anyway
parallelized
function
is satisfied
processor
case
secondary
of
of the
and
loop
(k-loop
nest
indices
a, b, c:
k) = ai + bj + ck + cwp number,
nx, O < j < ny, O