NASA Contractor Report 198308
ICASE Report No. 96-22

AN EVALUATION OF ARCHITECTURAL PLATFORMS FOR PARALLEL NAVIER-STOKES COMPUTATIONS

D. N. Jayasimha
M. E. Hayder
S. K. Pillay

NASA Contract No. NAS1-19480
March 1996

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001

AN EVALUATION OF ARCHITECTURAL PLATFORMS FOR PARALLEL NAVIER-STOKES COMPUTATIONS

D. N. Jayasimha
Department of Computer and Information Science
The Ohio State University
Columbus, OH 43210
[email protected]

M. E. Hayder*
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001
hayder@icase.edu

S. K. Pillay
Scientific Engineering Computing Solutions Office
NASA Lewis Research Center
Cleveland, OH 44142
spillay@lerc.nasa.gov
Abstract

We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies (the IBM SP and the Cray T3D). We investigate the impact of the various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
*This research was supported in part by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the second author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.
1 Introduction

Numerical simulations play an important role in the investigation of physical processes associated with many important problems. The suppression of jet exhaust noise is one such problem, and it will have a great impact on the success of the High Speed Civil Transport plane. The radiated sound emanating from the jet can be computed by solving the full (time-dependent) compressible Navier-Stokes equations. This computation can, however, be very expensive and time consuming. The difficulty can be partially overcome by limiting the solution domain to the near field where the jet is nonlinear and then using acoustic analogy (see [12]) to relate the far-field noise to the near-field sources. One still needs to compute the time-dependent flow field near the nozzle exit. We solve the Navier-Stokes equations to compute the time accurate flow field of an axisymmetric supersonic jet; this computation is itself very intensive and requires many hours of CPU time on the Cray Y-MP.

Recognizing this, a number of researchers [5, 10, 14, 18] have solved computationally intensive CFD (Computational Fluid Dynamics) problems on specific massively parallel architectures. With the advent of networks of workstations (NOWs), scientists now have the opportunity to parallelize computationally intensive codes and reduce turnaround time at a fraction of the cost of traditional supercomputers. The goal of this study is to implement the numerical model described above on a variety of parallel architectural platforms. The platforms chosen for this study represent a spectrum of parallel architectures: a cluster of workstations connected by different networks (the Lewis Advanced Cluster Environment (LACE) experimental testbed at NASA Lewis [9]), a shared memory multiprocessor with vector processors (the Cray Y-MP), and two distributed memory multiprocessors with different interconnection topologies (the IBM SP and the Cray T3D). One important architecture that has not been considered in our study is the cache-coherent distributed shared memory architecture, typified by the DASH multiprocessor [11].

Our application has been described in an earlier paper by the authors [6]. This paper differs from the earlier one in two important aspects: i) it is comprehensive, covering a gamut of architectures from low cost NOWs to expensive massively parallel processors, while the other examined the feasibility of NOWs as low cost alternatives to traditional supercomputers; ii) it focuses on the relationship of the performance results to the architectural characteristics of the networks and the processing nodes and to the programming tools. We have not laid emphasis on the physical aspects of the application or on the details of the numerical model, as we have done in the other paper, in keeping with the readership communities. For the sake of completeness, however, we have included the relevant details of the application from the other paper.

In the next section we briefly discuss the governing equations and the numerical model used in the study. Section 3 has a brief discussion of the parallel architectures used in the study, together with the tools used for parallelizing the application. The parallelization of the application is the subject of Section 4. Section 5 describes the experimental methodology. Section 6 presents a detailed discussion of the results. The paper concludes with a brief discussion of the lessons learned from this study and the issues that merit further investigation.

2 The Numerical Model

We solve the Navier-Stokes and the Euler equations to compute flow fields of an axisymmetric jet. The Navier-Stokes equations for such flows can be written in polar coordinates as
\[
\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial r} = S
\]

where

\[
Q = r\begin{pmatrix} \rho \\ \rho u \\ \rho v \\ e \end{pmatrix}, \quad
F = r\begin{pmatrix} \rho u \\ \rho u^{2} - \tau_{xx} + p \\ \rho u v - \tau_{xr} \\ \rho u H - u\tau_{xx} - v\tau_{xr} - \kappa T_{x} \end{pmatrix}, \quad
G = r\begin{pmatrix} \rho v \\ \rho u v - \tau_{xr} \\ \rho v^{2} - \tau_{rr} + p \\ \rho v H - u\tau_{xr} - v\tau_{rr} - \kappa T_{r} \end{pmatrix}, \quad
S = \begin{pmatrix} 0 \\ 0 \\ p - \tau_{\theta\theta} \\ 0 \end{pmatrix}.
\]

F and G are the fluxes in the x and r directions respectively, and S is the source term that arises in cylindrical polar coordinates; the \(\tau_{ij}\) are the shear stresses and the \(\kappa T_{j}\) are the heat fluxes. In the above equations p, \(\rho\), u, v, T, e and H denote the pressure, density, axial and radial velocity components, temperature, total energy and enthalpy.

We use the fourth-order MacCormack scheme, due to Gottlieb and Turkel [4], to solve the Navier-Stokes and Euler equations. The scheme uses one-sided differences (forward or backward) at each predictor or corrector step to compute the spatial derivatives. For the present computations, to compute time accurate solutions, the operator L in the equation \(Q^{n+1} = LQ^{n}\), where \(Q_t + F_x + G_r = S\), is split into one-dimensional operators and the scheme is applied to these split operators. We define \(L_1\) as a one-dimensional operator with a forward difference in the predictor and a backward difference in the corrector; its symmetric variant \(L_2\) uses a backward difference in the predictor and a forward difference in the corrector. For the one-dimensional model (split) equation \(Q_t + F_x = S\), the predictor step of \(L_1\) is written as

\[
\bar{Q}_i = Q_i^{\,n} - \frac{\Delta t}{6\Delta x}\left\{ 7\left(F_{i+1}^{\,n} - F_i^{\,n}\right) - \left(F_{i+2}^{\,n} - F_{i+1}^{\,n}\right) \right\} + \Delta t\, S_i
\]

and its corrector step is

\[
Q_i^{\,n+1} = \frac{1}{2}\left[ Q_i^{\,n} + \bar{Q}_i - \frac{\Delta t}{6\Delta x}\left\{ 7\left(\bar{F}_i - \bar{F}_{i-1}\right) - \left(\bar{F}_{i-1} - \bar{F}_{i-2}\right) \right\} + \Delta t\, \bar{S}_i \right].
\]

Similarly, the corrector step of \(L_2\), whose predictor uses the backward one-sided difference, becomes

\[
Q_i^{\,n+1} = \frac{1}{2}\left[ Q_i^{\,n} + \bar{Q}_i - \frac{\Delta t}{6\Delta x}\left\{ 7\left(\bar{F}_{i+1} - \bar{F}_i\right) - \left(\bar{F}_{i+2} - \bar{F}_{i+1}\right) \right\} + \Delta t\, \bar{S}_i \right].
\]

The scheme is fourth order accurate in the spatial derivatives when the one-dimensional sweeps are alternated [4]; for our computations the sweeps are arranged as

\[
Q^{n+1} = L_{1x} L_{1r}\, Q^{n}, \qquad Q^{n+2} = L_{2r} L_{2x}\, Q^{n+1}.
\]
This scheme is used for the interior points. In order to advance the solution near the boundaries, the fluxes are extrapolated outside the domain using a cubic extrapolation. We use characteristic boundary conditions at the artificial boundaries. In our implementation, we solve the following set of equations to get the solution at the new time for all boundary points on the outflow boundary:

\[
p_t - \rho c\, u_t = 0, \qquad
p_t + \rho c\, u_t = R_2, \qquad
p_t - c^2 \rho_t = R_3, \qquad
v_t = R_4
\]

where each \(R_i\) is determined by which combination of variables is specified and which is not. Whenever a combination is not specified, \(R_i\) is just those spatial derivatives that come from the Navier-Stokes equations. For further details see [6].
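To make the structure of the split sweep concrete, the following is a minimal C sketch of one L1 predictor-corrector pass of the 2-4 Gottlieb-Turkel scheme for a scalar model equation. It is only an illustration under stated assumptions (a scalar unknown instead of the vector Q, a hypothetical linear flux, zero source, and frozen end points in place of the cubic-extrapolation and characteristic boundary treatment described above); it is not the paper's Fortran implementation.

```c
#include <stdlib.h>

static double flux(double q)   { return 2.0 * q; }  /* hypothetical f(q) = a*q */
static double source(double q) { (void)q; return 0.0; }

/* One L1 sweep (forward-difference predictor, backward-difference corrector)
 * for q_t + f(q)_x = s(q) on n grid points with spacing dx and time step dt. */
void l1_sweep(double *q, int n, double dt, double dx)
{
    double *qp = malloc(n * sizeof *qp);   /* predicted values            */
    double *f  = malloc(n * sizeof *f);    /* fluxes at time level n      */
    double *fp = malloc(n * sizeof *fp);   /* fluxes of predicted values  */
    double c = dt / (6.0 * dx);
    int i;

    for (i = 0; i < n; i++) f[i] = flux(q[i]);

    /* Predictor: one-sided forward difference (uses points i+1 and i+2). */
    for (i = 0; i < n; i++) {
        if (i + 2 < n)
            qp[i] = q[i] - c * (7.0 * (f[i+1] - f[i]) - (f[i+2] - f[i+1]))
                         + dt * source(q[i]);
        else
            qp[i] = q[i];                  /* boundary treatment omitted */
    }

    for (i = 0; i < n; i++) fp[i] = flux(qp[i]);

    /* Corrector: one-sided backward difference (uses points i-1 and i-2). */
    for (i = 0; i < n; i++) {
        if (i - 2 >= 0)
            q[i] = 0.5 * (q[i] + qp[i]
                   - c * (7.0 * (fp[i] - fp[i-1]) - (fp[i-1] - fp[i-2]))
                   + dt * source(qp[i]));
        /* else: boundary points are handled by the extrapolation scheme */
    }

    free(qp); free(f); free(fp);
}
```

An L2 sweep mirrors this with the one-sided differences reversed; alternating the L1 and L2 sweeps in the two coordinate directions, as described above, restores the fourth order spatial accuracy.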
Figure 1: Axial momentum contours in an excited axisymmetric jet (Mach 1.5, Re = 9.36x10^6, 250 x 100 grid)

Figure 1 shows a contour plot of the axial momentum in an excited axisymmetric jet. Let r be the radius of the nozzle. The solution was obtained on a 250 x 100 grid for a domain of size 50r in the axial direction and 5r in the radial direction, after about 16,000 time steps. For all the other results presented in this paper we have used the same grid but run 5000 time steps, to keep the computing requirements of the experiments reasonable.

3 Parallel Computing Platforms

This section presents a brief discussion of the various platforms used in the study, together with the parallelization tools used.

3.1 The NOW
The LACE testbed, which is regularly upgraded, contains 32 RS6000 processor nodes (nodes 1-32) connected through several networks with different characteristics; node 0 is the file server and is meant for general use. All the nodes are connected through Ethernet (10 Mbits/sec (Mbps)), and nodes 9-24 are also interconnected through an FDDI interface with a peak bandwidth of 100 Mbps. It is convenient, for our purposes, to consider the cluster to be partitioned into a lower half (nodes 1-16) and an upper half (nodes 17-32); in addition to Ethernet, each half has its own dedicated network for parallel processing. The lower half has the faster RS6000/Model 590
CPUs (the CPU has a 66.5 MHz clock, 256KB data- and 32KB instruction caches) with the following networks interconnecting the nodes: an ATM network capable of a peak bandwidth of 155 Mbps and IBM's ALLNODE switch, referred to as ALLNODE-F (for fast), capable of a peak throughput of 64 Mbps per link. The upper half has the slower RS6000/Model 560 CPUs (the CPU has a 50 MHz clock, 64KB data- and 8KB instruction caches) and is connected through IBM's ALLNODE prototype switch, referred to as ALLNODE-S (for slow), capable of a peak throughput of 32 Mbps per link. The ALLNODE switch is a variant of the Omega interconnection network and is capable of providing multiple contentionless paths between the nodes of the cluster (a maximum of 8 paths can be configured between source and destination processors). The present setup does not permit the use of more than 16 processors using the faster networks. The nodes have varying main memory capacity (64 MB, 128 MB, 256 MB, and 512 MB). We have used the popular PVM (Parallel Virtual Machine) message passing library (version 3.2.2) to implement our parallel programs. We will refer to the LACE cluster with RS6000/Model 560 processors as the LACE/560 and that with the RS6000/Model 590 processors as the LACE/590.
3.2 Shared Memory Architecture

We used the Cray Y-MP/8, a vector multiprocessor with eight processors and a peak rating of approximately 2.7 GigaFLOPS, as the shared memory architecture in this study. Processes executing on different processors communicate through shared variables in a single address space. We parallelized the application on the Cray by exploiting the features of the parallelizing compiler and by using explicit directives like DOALL.

3.3 Distributed Memory Architectures

We parallelized the application on two distributed memory multiprocessors with different interconnection topologies, the IBM SP and the Cray T3D. The SP used in our study is the original SP1 whose software has been upgraded to make it function like a SP2; we will refer to this system simply as the SP. Each node of the SP is a RS6K/370 (with 32KB data and instruction caches). The nodes are interconnected through a network similar in topology to ALLNODE, which permits multiple contentionless paths between nodes [17]. We parallelized the application on the SP using MPL (Message Passing Library), IBM's native message passing library, and PVMe, a customized version of PVM (version 3.2) developed by IBM for the SP.

The Cray T3D is a distributed memory multiprocessor whose nodes are interconnected through a three dimensional torus network [15]. The machine used in our study has 64 nodes mapped onto an 8 x 4 x 2 torus, of which 16 processing nodes were available to us in single user mode. Each node has a clock speed of 150 MHz and a data cache of 8KB. Though the T3D supports multiple programming models, including a shared address space, we programmed the machine using the message passing paradigm with a customized version of PVM (version 3.2), without resorting to Cray's other models.
4 Parallelization

The factors which affect the performance of a parallel SPMD style program are listed below, together with the way they influenced our parallelization; the resulting optimizations are explained in the discussion that follows.

1. Single processor performance: the single processor optimizations described in the next section resulted in roughly an 80% improvement in performance.

2. Communication to computation ratio: the cost of communication depends on both the number of startups and the volume of data to be communicated. The startup cost of a message is usually 2-3 orders of magnitude higher than the per item (usually a byte) transfer cost, so one method to reduce the cost of communication is to group the data to be communicated into long vectors and send as much as possible with each startup.

3. Overlapped communication: since some amount of communication is inevitable, it is desirable that the communication be overlapped with computation as far as possible.

4. Bursty communication: bursty communication can temporarily overwhelm the capacity of the network and lead to increased waiting time. There is a subtle relationship between the startup cost and burstiness: grouping data to reduce the number of startups makes the communication more bursty, while communicating at a finer granularity increases the number of startups.

For the solution of the Navier-Stokes equations, hereafter referred to as Navier-Stokes, and of the Euler equations, hereafter referred to as Euler, each processor exchanges the boundary values of its subdomain with its two neighbors (left and right). To reduce the number of startups, all the boundary values of velocity and temperature are calculated first and then packaged into a single send; we use a similar scheme for the flux values that need to be communicated. The communication requirements of the application and the computation to communication ratios, on a per processor basis, are shown in Tables 1 and 2. It is seen that the communication volume of Euler is roughly 50% of that of Navier-Stokes although the number of startups is approximately the same, and that the ratio of computation to communication has an inverse relationship with the number of processors.

To give an idea of the communication requirements of the application, consider 10 workstations connected via Ethernet (a throughput of 10 Mbps), each executing at 20 MFLOPS. The computation time for Navier-Stokes is about 725 seconds (145,000/(10 x 20)), while a lower bound on the communication time, ignoring startups, is 1000 seconds: each processor communicates roughly 1000 Mb (Table 1) and the 10 processors share the 10 Mbps network, giving 1000 x 10/10 seconds. From this simple estimate it is clear that on such a network the communication can overwhelm the computation.
Table 1: Application Characteristics

  Appln    Total Comp.             Comm./Processor
           (in FP Ops (x 10^6))    Start-ups    Volume (MB)
  N-S      145,000                 80,000       120 (960 Mb)
  Euler     77,000                 20,000        64 (512 Mb)
Table 2: Computation-Communication Ratios

  No. of        FPs/Byte               FPs/Start-up
  Procs.    Nav-Stokes   Euler     Nav-Stokes    Euler
     2         604        601        906K        1925K
     4         302        301        453K         963K
     8         151        150        227K         481K
    16          76         75        113K         241K
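The entries of Table 2 follow directly from Table 1: with P processors, each processor performs (total FP ops)/P operations, while under the 1-D decomposition its per-processor communication volume and start-up count stay fixed. A small C check of this arithmetic (the constants are copied from the Navier-Stokes row of Table 1):

```c
#include <stdio.h>

int main(void)
{
    /* Table 1, Navier-Stokes: total computation in FP operations,
     * per-processor communication volume in bytes, start-ups per processor. */
    double total_fp  = 145000e6;
    double vol_bytes = 120e6;      /* 120 MB per processor */
    double startups  = 80000.0;

    for (int p = 2; p <= 16; p *= 2) {
        double fp_per_proc = total_fp / p;
        printf("P=%2d  FPs/Byte=%6.0f  FPs/Start-up=%9.0f\n",
               p, fp_per_proc / vol_bytes, fp_per_proc / startups);
    }
    /* P= 2: ~604 FPs/Byte and ~906K FPs/Start-up; P=16: ~76 and ~113K,
     * matching the Navier-Stokes columns of Table 2. */
    return 0;
}
```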
The application is parallelized by decomposing the domain into blocks along the axial direction only. Two dimensional partitioning was not attempted since a simple analysis shows that, for the chosen grid size, such a partitioning performs worse than a 1-D block partitioning. For example, with a 2-D partitioning on 16 processors (4 x 4 blocks), the ratio of the number of bytes transferred compared to 1-D partitioning is 1.25. This ratio will, of course, decrease when we increase the problem size. Another disadvantage of 2-D partitioning is that the number of start-ups is higher; for the above example, the corresponding ratio for the two partitionings is 1.6, and this ratio does not decrease with the problem size. Since the startup cost dominates the transmission cost in most current architectures, including the ones used in this study (the ratio is highest for LACE and least for the Cray T3D), and the average transmission volume per startup is only moderate (Table 1), we did not experiment with 2-D partitioning. The parallelization on the Cray Y-MP was done differently (it was much easier also) since it is a shared memory architecture: we did some hand optimization to convert some loops to parallel loops, used the DOALL directive, and partitioned the domain along the direction orthogonal to the sweep to keep the vector lengths large and to avoid non-stride access to most of the variables.
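Returning to the message packaging described earlier in this section, the following is a minimal C sketch of packing the velocity and temperature boundary columns into a single PVM message per neighbor. The PVM 3 calls are real, but the routine itself, the message tag, and the argument layout are hypothetical illustrations; the paper's actual implementation is in Fortran and differs in detail.

```c
#include "pvm3.h"

#define TAG_BOUNDARY 10        /* hypothetical message tag */

/* Pack this subdomain's boundary columns (u, v, T, each of length nr)
 * into one PVM message, send it to one neighbor, and unpack the
 * corresponding ghost columns received from that neighbor. */
void exchange_boundary(int neighbor_tid, int nr,
                       double *u_col, double *v_col, double *t_col,
                       double *u_ghost, double *v_ghost, double *t_ghost)
{
    /* One send instead of three: fewer start-ups, one slightly larger message. */
    pvm_initsend(PvmDataDefault);
    pvm_pkdouble(u_col, nr, 1);
    pvm_pkdouble(v_col, nr, 1);
    pvm_pkdouble(t_col, nr, 1);
    pvm_send(neighbor_tid, TAG_BOUNDARY);

    /* Blocking receive of the neighbor's packed boundary columns. */
    pvm_recv(neighbor_tid, TAG_BOUNDARY);
    pvm_upkdouble(u_ghost, nr, 1);
    pvm_upkdouble(v_ghost, nr, 1);
    pvm_upkdouble(t_ghost, nr, 1);
}
```

The flux values that must be exchanged are handled with the same pattern in a second message, as described above.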
5 Experimental Methodology

The performance indicator is the total execution time. All experiments were done in single user mode. The execution times reported are for single runs, since we found that the experiments were repeatable with negligibly small discrepancies; for example, with a single processor of the LACE, the deviation from the mean is about 1% or less ([6]). The single processor performance of the original code (Version 1) for both applications on an IBM RS6K (Model 560) workstation is shown in Figure 2.
Figure 2: Execution time on a single processor (RS6000/560) for Versions 1 to 5 of Navier-Stokes and Euler
We found that this version was running at 9.3 MFLOPS and that its performance was limited by poor usage of the memory hierarchy (the cache and the main memory). We modified the program with a number of optimizations, each of which yielded some improvement. Version 2 performed strength reduction: divisions, which are relatively expensive, were replaced by multiplications wherever feasible, and exponentiations involving expensive function calls were replaced by multiplications; this reduced the number of divisions from approximately 5.5 x 10^9 to 2.0 x 10^9. Version 3, the key optimization, improved cache performance by interchanging loops so that arrays are accessed in stride-1 fashion wherever possible. Version 4 collapsed multiple COMMON blocks into a single one. Version 5 incorporated a number of other optimizations, including improved register usage. The optimizations were applied in sequence, so Version 5 contains all the above mentioned optimizations; the execution times of the successive versions are illustrated in Figure 2. Overall, the single processor execution time was reduced by better than 50%, and an improvement of roughly 80% in the computation rate was achieved (from 9.3 MFLOPS to 16.0 MFLOPS).

We parallelized Version 5 of the application on the different computing platforms studied in this paper and measured the performance with an increasing number of processors (up to 8 with the Cray Y-MP and up to 16 with LACE, the IBM SP, and the Cray T3D). We studied the performance of LACE with four networks of differing characteristics, using "off-the-shelf" PVM as the message passing library. With the IBM SP, we have also studied the impact of parallelizing the application with two message passing libraries: MPL, IBM's native message passing library, and a customized version of PVM called PVMe. In all experiments, we have used Version 5 of the application unless mentioned otherwise.
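As an illustration of the stride-1 optimization mentioned above (Version 3), consider the loop interchange below. The original code is Fortran (column-major); the C sketch uses a row-major array, where the same problem appears with column-wise traversal. The array name and sizes are hypothetical.

```c
#define NX 250
#define NR 100

/* Before: the inner loop strides through memory with step NR,
 * touching a new cache line on almost every access. */
void scale_bad(double a[NX][NR], double s)
{
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < NX; i++)
            a[i][j] *= s;
}

/* After loop interchange: the inner loop walks consecutive
 * (stride-1) memory locations, so each cache line is fully used. */
void scale_good(double a[NX][NR], double s)
{
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NR; j++)
            a[i][j] *= s;
}
```

The strength reduction of Version 2 is the analogous replacement, inside such loops, of a division by a multiplication with a precomputed reciprocal.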
We have separated the execution time of each processor into two additive components: processor busy time and non-overlapped communication time. The processor busy time is itself composed of the actual computation time and the overheads associated with sending and receiving messages; an accurate separation of these would not be possible without special hardware and software monitoring tools. The non-overlapped communication time is the time the processor is waiting for a message.

As mentioned earlier, the velocity and temperature vectors to be communicated are combined into a single send. We have also experimented with versions that attempt to overlap communication with computation. In Version 6, each processor first calculates the velocity and temperature at the boundary of its subdomain and sends them to its nearest neighbors; the stresses and fluxes of the interior are then calculated while this communication is in progress, and only afterwards does the processor receive the neighbors' values and compute the boundary fluxes. Version 6 makes no special attempt to overlap the communication of the flux values. The variant which also overlaps computation with sending the fluxes is called Version 7: the flux values are computed and sent one "column" at a time, overlapping the sending of each column with the computation of the next and avoiding bursty communication. Figure 3 shows the timeline of a processor's activity for Versions 5 and 6.
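A rough sketch of the Version 6 style reordering, again using the C interface of PVM 3, is given below. The helper routines (compute_boundary_vel_temp and the others) and the tags are hypothetical placeholders for the paper's Fortran computations; the point is only the ordering of work and communication, since pvm_send returns as soon as the data is buffered.

```c
#include "pvm3.h"

#define TAG_VELTEMP 20   /* hypothetical message tags */
#define TAG_FLUX    21

extern void compute_boundary_vel_temp(void);
extern void compute_interior_stress_flux(void);
extern void compute_boundary_stress_flux(void);
extern void update_interior(void);
extern void update_boundary(void);
extern void pack_and_send(int tid, int tag);    /* wraps pvm_initsend/pk/send */
extern void recv_and_unpack(int tid, int tag);  /* wraps pvm_recv/upk          */

void time_step_version6(int left, int right)
{
    /* 1. Boundary velocity and temperature first, then send them. */
    compute_boundary_vel_temp();
    pack_and_send(left, TAG_VELTEMP);
    pack_and_send(right, TAG_VELTEMP);

    /* 2. Interior stresses and fluxes overlap with the messages in flight. */
    compute_interior_stress_flux();

    /* 3. Only now wait for the neighbors' velocity and temperature. */
    recv_and_unpack(left, TAG_VELTEMP);
    recv_and_unpack(right, TAG_VELTEMP);

    /* 4. Boundary fluxes, flux exchange, and the update. */
    compute_boundary_stress_flux();
    pack_and_send(left, TAG_FLUX);
    pack_and_send(right, TAG_FLUX);
    update_interior();
    recv_and_unpack(left, TAG_FLUX);
    recv_and_unpack(right, TAG_FLUX);
    update_boundary();
}
```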
Figure 3: Timeline of processor activity

We found that the execution time improvement with Versions 6 and 7 was either minimal or even worse in many experiments. Hence all our experiments were conducted with Version 5. We do mention, however, the impact of these versions on the different networks of LACE in Section 6.1. The next section presents a detailed discussion of the results from our experiments.
6 Results

The execution times of Navier-Stokes and Euler for each computing platform have been plotted as a function of the number of processors, using a log-log scale to facilitate meaningful presentation.

6.1 Performance of LACE
Figure 4: Navier-Stokes on LACE
Figure 4 shows the performance of Navier-Stokes on different networks of LACE: ALLNODE-F, ALLNODE-S, and the upper-half (LACE/560) Ethernet. The performance of the ATM (155 Mbps) and the FDDI (100 Mbps) networks is almost identical to that of ALLNODE-F (64 Mbps) and ALLNODE-S (32 Mbps) respectively; hence the execution times with ATM and FDDI are not shown. The close performance of these pairs, despite the difference in link speed, can be attributed to the following reason: the faster physical links of ATM and FDDI are balanced by the ability of ALLNODE to set up multiple contention-free paths between the nodes. With both ALLNODE switches, the execution time falls almost linearly with increasing number of processors; sublinearity effects begin to show, however, beyond 12 processors. ALLNODE-F is about 70%-80% faster than ALLNODE-S. This can be attributed to both an improved network
(which is twice as fast) and the superior performance of the 590 model (33% faster clock, data and instruction caches which are 4 times bigger, and a memory bus which is 4 times wider than the 560; these contribute to faster instruction execution, better cache hit ratios, and lower cache miss penalty respectively). Ethernet performance reaches its peak at 8 processors; beyond this, the communication requirements of the application overwhelm the network. The inability of Ethernet to handle traffic beyond 8 processors is shown by the following simple argument: Table 2 shows that with 8 processors, Navier-Stokes produces a byte for communication
after every 151 floating point operations on the average. Consider a 1 second interval with each processor operating at 20 MFLOPS. During this interval each processor produces 1.06 Mb for communication; from all 8 processors this translates to approximately 8.5 Mb/s on the Ethernet. The Ethernet is capable of supporting a peak bandwidth of 10 Mbps, and the throughput seen by an application will be a fraction of this peak. It is not surprising, therefore, that the performance on the Ethernet gets steadily worse beyond 8 processors.
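The same back-of-the-envelope estimate in code form; the only inputs are the assumed 20 MFLOPS sustained rate per node and the 151 FLOPs/byte figure from Table 2, with the 10 Mbps Ethernet peak as the comparison point.

```c
#include <stdio.h>

int main(void)
{
    double mflops_per_proc = 20e6;    /* assumed sustained rate per node   */
    double flops_per_byte  = 151.0;   /* Navier-Stokes at 8 processors     */
    int    nprocs          = 8;

    double bytes_per_sec = mflops_per_proc / flops_per_byte;   /* per node */
    double mbits_per_sec = bytes_per_sec * 8.0 / 1e6;

    printf("per node : %.2f Mb/s\n", mbits_per_sec);            /* ~1.06   */
    printf("all %d   : %.1f Mb/s offered to a 10 Mbps Ethernet\n",
           nprocs, nprocs * mbits_per_sec);                      /* ~8.5    */
    return 0;
}
```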
Figure 5: Components of execution time of LACE (Navier-Stokes)
Figure 5 aids in a more in-depth analysis of the performance of LACE. The execution time has been separated into its two additive components, processor busy time and non-overlapped communication time, for the two ALLNODE configurations and for Ethernet. It is seen that the processor busy time falls linearly with the number of processors; the difference between the LACE/590 and LACE/560 busy times can be attributed to the superior node characteristics of the 590, as explained above. With both ALLNODE switches, the non-overlapped communication time remains steady up to 10 or 12 processors, beyond which it begins to rise. With Ethernet, the communication time increases superlinearly with the number of processors, for the reason explained earlier.
Figure 6: Communication performance of Versions 5, 6, and 7 (Navier-Stokes; LACE)
Figures 6 and 7 show the performance of Versions 5, 6, and 7 with ALLNODE-S and Ethernet (the trends are similar with ALLNODE-F). The performance of Version 6, with only the velocity and temperature communication overlapped, is very close to that of Version 5 for both networks. Overlapping buys little here: the computations for the subdomain boundary and for the interior have to be kept separate, so the loop setup overheads are higher and there is some loss of temporal locality in the cache; these overheads offset any gain due to overlapping. Version 7 reduces bursty communication, but only at the cost of an increased number of startups. Since the startup cost dominates, this optimization also harms performance: Version 7 is appreciably worse than Version 5 with ALLNODE-S. Not surprisingly, Ethernet, which cannot handle bursty communication as well as ALLNODE-S, fares relatively better with Version 7, for the reasons explained in Section 4.

6.2 Comparative Performance

Figures 8 and 9 show the performance of the application on the four computing platforms we have chosen for this study: LACE, the Cray Y-MP, the Cray T3D, and the IBM SP.
Figure 7: Communication performance of Versions 5, 6, and 7 (Euler; LACE)
The performance of LACE is reported for ALLNODE-F and ALLNODE-S. Surprisingly, LACE, even with ALLNODE-S, outperforms the SP even though the former uses
off-the-shelf PVM and the latter uses MPL, IBM's native message passing library. (Our version of) MPL imposes a limit on the number of (non-blocking) send primitives that can be simultaneously we were factor MHz
active; we were therefore forced to use blocking send primitives, and we suspect this to be one factor in the relatively poor performance of the SP. Another contributor is the relatively small size of the SP node's data cache (32KB, 2-way set associative, compared to 64KB on the LACE/560 and 256KB on the LACE/590); the CPU clocks themselves are close (50 MHz for the 560, 62.5 MHz for the SP node, and 66.6 MHz for the 590). The performance of the SP on this application is intermediate between those of LACE and the Cray T3D.

Another surprising result is the performance of the Cray T3D, which is consistently worse than ALLNODE-F and worse even than ALLNODE-S for less than 8 processors (compare Figures 8 and 9 with the LACE results of Section 6.1, Figures 4 and 5), even though the T3D CPU's clock is about 3X that of the 560 and 2.3X that of the 590 and its peak rating is correspondingly higher. We attribute the T3D's poor performance on this application to its small, direct-mapped data cache of 8KB size (both the 560 and the 590 have 4-way set-associative data caches). These results stress the importance of the memory hierarchy design: a reasonably fast CPU with a large, set associative cache and a high bandwidth bus connecting the CPU, cache, and main memory contributes more to performance on this application than a very fast CPU with a small cache.
Figure 8: Execution time of Navier-Stokes on computing platforms
Figure 9: Execution time of Euler on computing platforms
Table 3: Speedup on the Architectural Platforms

  No. of Procs.   ALLNODE-S   ALLNODE-F   IBM SP   Cray T3D
       4             3.2         3.4        3.8       3.9
      16              -          7.5        7.9      13.3
Table 3 shows the speedups of Navier-Stokes at 4 and 16 processors, measured relative to the single processor execution time on each architecture. The Cray T3D has by far the best speedup characteristics: its speedup scales almost linearly with increasing number of processors, reaching 13.3 at 16 processors, while being only modestly better than the other architectures at 4 processors. This can be attributed to its superior interconnection network, which can sustain a peak transfer rate of 150 MB/sec per link. Considering the relatively modest communication requirements of the application, it is reasonable to expect this speedup trend to continue to a much larger number of processors. Both ALLNODE-F and the SP exhibit reasonable speedups up to 8 processors, with the curves flattening beyond that; the flattening effect is stronger with ALLNODE-S.

These speedup characteristics arise mainly from the overheads of the message passing libraries and of the network interfaces: data to be communicated is copied multiple times before transmission or reception, and the resulting setup cost, which is already large, makes the processor waiting time increase with the number of processors. If NOW architectures are to be feasible as massively parallel platforms, the message passing layer and the interface connecting the node to the physical network must be implemented efficiently; such efforts are already under way [1].

The Cray Y-MP has very good speedup characteristics: with 8 processors in single user mode we obtained a speedup of 7.1. Not surprisingly, with its superior processor, memory, and I/O speed, the Y-MP also has the fastest execution times. The Y-MP times were obtained with a profiling tool and include the I/O cost, while the times on the other platforms do not include the I/O overheads; observe also from Figures 8 and 9 that a 16 processor configuration runs the application faster than a single node of the Y-MP.
6.3 Comparison of Message Passing Libraries

Figures 10 and 11 compare the performance of the PVMe and MPL message passing libraries on the SP; the execution times have been separated into their non-overlapping computation and communication components. The graphs show that MPL is consistently faster than PVMe.
Figure 10: Comparison of MPL and PVMe (Navier-Stokes; IBM SP)

Observe that the processor busy times with the two libraries are approximately the same, but the non-overlapped communication time with MPL is lower than with PVMe, by approximately 40% for Euler and 75% for Navier-Stokes, attesting to our earlier observation that the setup overheads of the communication library have a large effect on performance. Note also that the non-overlapped communication time increases as the number of processors is increased even though the amount of communication per processor does not; this phenomenon, which is not seen on LACE (see Figure 4), implies that the overheads associated with the libraries (MPL and PVMe) are not negligibly small and grow with the number of processors.

Figure 11: Comparison of MPL and PVMe (Euler; IBM SP)

6.4 Load Balancing

Finally, how well is the application load balanced? The amount of computation is evenly distributed among the processors, but this may not always translate to a load balanced execution. We were able to measure the processor busy time, which does not include the time spent waiting for messages, on each processor of the SP for Navier-Stokes. Figure 12 shows these busy times with 16 processors; they are nearly equal, indicating that we were able to achieve an almost perfectly balanced execution.

7 Discussion and Conclusion

In this paper we have studied the computational, communication, and scalability characteristics of a typical CFD application on a variety of architectural platforms: a NOW (LACE), a traditional vector multiprocessor (the Cray Y-MP), and distributed memory multiprocessors (the IBM SP and the Cray T3D). The study highlights the importance of single processor performance and, in particular, of a proper design of the memory hierarchy: in spite of a fast processor, the performance of the T3D on this application suffers because of its small, direct-mapped cache, and this is one reason a NOW, still using off-the-shelf message passing libraries, can outperform it for the modest number of processors considered here.

The performance bottleneck on NOW architectures is the overhead involved in transferring a message between the processor and the physical network, that is, the message passing layer and the network interface. With fast networks and efficiently implemented message passing libraries, a NOW has the potential to be a cost-effective platform for parallelism of modest to medium levels. Parallelizing an application with message passing libraries is rather tedious and error-prone, but we believe the effort is worthwhile since good performance, reasonable scalability, and good load balancing are achievable.
Figure 12: Processor busy times (Navier-Stokes; IBM SP)

Resource limitations forced us to limit the study to 16 processors; we plan to extend it to larger multiprocessors as resources become available. For reasons mentioned in Section 4, we have not explored the effects of 2-D partitioning (decomposition along both the axial and radial directions); we hope to explore this, together with a larger domain and a finer mesh, in a future study. Another goal is to compute the jet noise directly from the flow field, to understand the physics of the problem better.

Acknowledgments

Part of this work was done while the first author was a Visiting Senior Research Associate at NASA Lewis Research Center during 1993-94. Simulations were done at NASA Lewis Research Center, Cleveland, OH, while the second author was in residence in the ICOMP program. The authors would like to thank Kim Ciula, Dale Hubler, and Rich Rinehart for their assistance with various aspects of the LACE and IBM SP architectures.
References

[1] Anderson, T. E., Culler, D. E., and Patterson, D. A. "A Case for NOW (Networks of Workstations)". IEEE Micro, vol. 15, no. 1, February 1995, pp. 54-64.

[2] Bailey, D. H., Barszcz, E., Dagum, L., and Simon, H. D. "NAS Parallel Benchmark Results". Technical Report NAS-94-001, NASA Ames Research Center, October 1994.

[3] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. "PVM 3 User's Guide and Reference Manual". Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, Oak Ridge, TN, 1993.

[4] Gottlieb, D. and Turkel, E. "Dissipative Two-Four Methods for Time Dependent Problems". Math. Comp., vol. 30, 1976, pp. 703-723.

[5] Hayder, M. E., Flannery, W. S., Littman, M. G., Nosenchuck, D. M., and Orszag, S. A. "Large Scale Turbulence Simulations of Three Dimensional Flows on the Navier-Stokes Computer". Computers and Structures, vol. 30, no. 1/2, 1988, pp. 357-364.

[6] Hayder, M. E. and Jayasimha, D. N. "Navier-Stokes Simulations of Jet Flows on a Network of Workstations". AIAA Journal, 1996, to appear.

[7] Hayder, M. E. and Turkel, E. "High Order Accurate Solutions of Viscous Problems". AIAA 93-3074, July 1993.

[8] Hayder, M. E., Turkel, E., and Mankbadi, R. R. "Numerical Simulations of a High Mach Number Jet Flow". 31st AIAA Aerospace Sciences Conference, AIAA 93-0653, January 1993.

[9] Horowitz, J. C. "Lewis Advanced Cluster Environment". Distributed Computing for Aerosciences Applications Workshop, NASA Ames Research Center, October 1993.

[10] Landsberg, A. M., Young, T. R., and Boris, J. P. "An Efficient, Parallel Method for Solving Flows in Complex Three Dimensional Geometries". 32nd AIAA Aerospace Sciences Conference, AIAA 94-0413, January 1994.

[11] Lenoski, D. E., et al. "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor". Proc. 17th Int'l Symposium on Computer Architecture, May 1990, pp. 148-159.

[12] Lighthill, M. J. "On Sound Generated Aerodynamically, Part I, General Theory". Proc. Roy. Soc. London, vol. A 211, 1952, pp. 564-587.

[13] Mankbadi, R. R., Hayder, M. E., and Povinelli, L. A. "Structure of Supersonic Jet Flow and Its Radiated Sound". AIAA Journal, vol. 32, no. 5, 1994, pp. 897-906.

[14] Morano, E. and Mavriplis, D. J. "Implementation of a Parallel Unstructured Euler Solver on the CM-5". 32nd AIAA Aerospace Sciences Conference, AIAA 94-0755, January 1994.

[15] Oed, W. "The Cray Research Massively Parallel System - Cray T3D". Technical Report, Cray Research GmbH, November 1993.

[16] Scott, J. N., Mankbadi, R. R., Hayder, M. E., and Hariharan, S. I. "Outflow Boundary Conditions for the Computational Analysis of Jet Noise". AIAA 93-4366, October 1993.

[17] Stunkel, C. B., Shea, D. G., Grice, D. G., Hochschild, P. H., and Tsao, M. "The SP1 High-Performance Switch". Scalable High Performance Computing Conference, May 1994, pp. 150-157.

[18] Venkatakrishnan, V. "Parallel Implicit Unstructured Grid Euler Solvers". 32nd AIAA Aerospace Sciences Conference, AIAA 94-0759, January 1994.