AN EVALUATION OF ARCHITECTURAL PLATFORMS FOR ...

NASA

Contractor

ICASE

Report

Report

198308

No. 96-22

ICA AN EVALUATION

OF ARCHITECTURAL

FOR PARALLEL

NAVIER-STOKES

COMPUTATIONS

D. N. Jayasimha M. E. Hayder S. K. Pillay

NASA

Contract

March

1996

No.

NAS1-19480

Institute for Computer Applications NASA Langley Research Center Hampton, Operated

National Space

VA

and Engineering

23681-0001

by Universities

Aeronautics

Space

and

Administration

Langley Hampton,

in Science

Research Virginia

Center 23681-0001

Research

Association

PLATFORMS

AN EVALUATION FOR PARALLEL

OF ARCHITECTURAL NAVIER-STOKES D.

Department

N. Jayasimha

of Computer The

PLATFORMS COMPUTATIONS

Ohio

and

State

Columbus,

Information

Science

University OH 43210

j [email protected]

ate.edu

M. E. Hayder* Institute

for Computer

Applications

NASA

Langley

in Science

Research

Hampton,

and

Engineering

Center

VA 23681-0001

hayder_icase.edu S. If. Scientific

Engineering NASA

Pillay

Computing

Lewis

Solutions

Research

Cleveland,

Office

Center

OH 44142

spillay_lerc.nasa.gov

Abstract We study tational the The

computational,

Dynamics

compressible

chosen

at NASA

memory investigate formance

Lewis),

the impact of the

performance strengths

of an and

a shared

memory

The

the

work speed

application

of the

--

single

the

of parallel

architecture

(the the

example

(the Cray

IBM

importance

processor computing

LACE

YMP),

SP and

by popular

of architectures,

of a Compu-

flow field

the cluster

induced

highlights

for good

characteristics

accurate

muItiprocessor connecting

on a variety

of each

time

of workstations

topologies

overheads also

the

a cluster

networks

and

and scalability

solves

on a variety

are

different

of various

processor

weaknesses

study

with

application

to the

which

equations,

for this

multiprocessors

for parallelization.

bandwidth

communication,

application,

Navier-Stokes

platforms

testbed

used

the

Fluid

we are

able

distributed T3D.

We

on the

per-

passing

of matching

performance.

experimental Cray

of workstations

using

platforms.

and

the

message

of a jet

the

libraries memory

By studying

the

to point

the

out

platforms.

*This research was supported in part by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the second author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton. VA 236810001.

1

Introduction

Numerical

simulations

sociated

with

problem

which

plane.

many

The

an important

important

will

have

radiated

(time-dependent) very

play

sound

and

time

in the

The

suppression

on the

success

problems.

a great

impact

emanating

compressible

expensive

role

from

the

Navier-Stokes

consuming.

The

investigation

jet

of jet of the

can

equations.

difficulty

ing the

time-dependent exit.

supersonic hours

flow field.

We solve

the

axisymmetric

of CPU

networks rationally

time

In this

Navier-Stokes jet.

on the

Our

This

Cray

With

Recognizing

(Computational

Fluid

this,

Dynamics)

platforms

chosen

a spectrum tensive

for this

of parallel

problems:

on specific

with

of workstations

connected

ronment

(LACE)

considered

the

have

vector

different

[9] experimental

in our study

architecture

that

memory

multiprocessors

model derived platforms.

all from

many

testbed).

is cache-coherent,

from

be the

analogy obtainnear

the

flow fields

and

requires

of a many

processors

14, 18] have

and

the

proposed the

networks

CFD

IBM (the

One

important

massively

parallel

Our

application

Research to solve

(the

studied

architectures.

Lewis

multiprocessor

topologies--

via

acoustic requires

parallel

[5, I0, parallel

NASA

been

however,

flow fields

intensive

full

by limiting

accurate

of massively

the

the opportunity to parallelize compuat a fraction of the cost of traditional

of researchers

architectures

a shared

a cluster

Cray

the

two

architecture

in-

distributed

Cray

Advanced

processors

represent

computationally

SP and

T3D,

Cluster

that

has

typified

in

described

Center,

YMP),

Lewis

CFD goal

and Envi-

not

by the

been DASH

[Ii].

earlier

paper

LACE

[6]. This

hensive

covering

architectures processors,

by

the

paper

as low

cost

ii) It focuses

works and the on the physical

authors

differs

a gamut

and communication

paper

in this

In the

next

from

alternatives on the

section

application.

and

the

subject

earlier

results

of a study

one in two important

while

the

to expensive

relationship

characteristics

processing aspects

the

the

other

of this aspects:

examined

the

supercomputers

of the

of the application,

performance to the

and results

architectural

application

i) It is compre-

feasibility

of NOW

massively to the

of two disparate relevant details

parallel

computation

aspects

nodes, and to the programming tools. We have not of the application or the the details of the numerical in keeping with the readership however, we have included the

on

of the net-

laid emphasis model as we communities. from the other

one.

the

tools

presented

of architectures

have done in the other paper For the sake of completeness,

is the

advent

a number

study,

memory

An

the

time

very

applications

this study is to implement the numerical above on a variety of parallel architectural

on such

to compute

is computationally

Y-MP.

can,

such

Transport

by solving

overcome

as-

is one

Civil

and then using This technique

we concentrate

of workstations (NOWs), scientists now have intensive codes and reduce turnaround time

supercomputers.

The

study

processes

noise

Speed

computation

can be partially

equations

code

exhaust

High

be computed

solution domain to the near field where the jet is nonlinear (see [12]) to relate the far-field noise to the near-field sources. nozzle

of physical

we briefly Section

used

discuss

3 has

the

governing

a discussion

of the

for parallelizing

of Section

4.

Section

the

application.

5 describes

the

equations

and

the

parallel

architectures

The

parallelization

experimental

numerical used of the

methodology.

model

in the

of

study

application Section

6

presents the

a detailed

lessons

2

from

The

of the

this study

Numerical

We solve the jet.

discussion

learned

results. and

the

The

paper

issues

that

concludes merit

with

further

a brief

discussion

of

investigation.

Model

Navier-Stokes

Navier-Stokes

and the Euler equations

equations

for such

flows LQ

OQ

to compute

flow fields

can be written,

of an axisymmetric

in polar

coordinates

as

= S

OF

OG

o-7 + _ + Or = S where

(;)

Q = r

pu

pu 2 - ,.rxz + P

F-r

puv pull

I

G_r

-

- "rxr

u'rx_ -

v'r_,,. -

gTx

puv - "r_ PV 2 - "5,. + P

pv H - UT_,. -- v'r,.,. -- aTr

S=

F and

G are the

fluxes

in the

arises in the cylindrical polar fluxes. In the above equations radial We use

velocity the

components, fourth-order

Navier-Stokes to compute compute

and time

spatial

the accurate

derivatives

x and

I°) 0 P - _oo 0

r directions

respectively,

and S is the

source

term

that

coordinates, _-_j are the shear stresses and tcTj are the heat p, p, u, v, T, e and H denote the pressure, density, axial and

temperature,

total

MacCormack Euler

)

scheme,

equations.

solutions.

at each predictor

due

This

It uses

energy

enthalpy.

to Gottlieb

scheme

one-sided or corrector 2

and

uses

and predictor

differences step.

Turket and

(forward

For the present

[4], to solve corrector

the steps

or backward) computations,

to

the operator L

in the

equation

LQ

one-dimensional operators and as a one-dimensional operator difference

in the

predictor dimension

corrector.

the

Its

corrector

=

step

Qn

Similarly

in L2 the

__

symmetric

6__x

variant

corrector

{7(F/__1

predictor

step

is

n

At 6Ax

{7(F:

step

_

F n)

At 7/_, + Q7- __ 6"_-'-x{ ( ,

Q_ = Qi the

L:

uses

a backward

_

(F_n+2

_

F__l)

scheme

symmetric

becomes variants

difference

step

}

__

two

in L1Q

in the

for the

one-

AtSi

-

-- .t_i-l)

F_,)

-

-- (P,:-]

(F?_,

-- __2)}

- FL_)}

+ At,..q,]

+ AtS,

is

At {7(y, - 2,+1)- (_P,+,-

This

into

as

Qg÷' =

and

Qt + Fz + Gr = S, is split

and a forward difference in the corrector. The predictor model/split equation Qt = Fz + S is written as

(_i

and

= S, or equivalently

the scheme is applied to these split operators. We define L1 with a forward difference in the predictor and a backward

fourth

order

[4]. For our

accurate

in the

computations,

the

spatial

+ Ats,]

derivatives

one dimensional

when

alternated

sweeps

with

are arranged

as

Q,_+I = LlxLlrQn Q,_+: = L_ L2=Q '_+1 This

scheme

the fluxes to compute

is used

for the

are extrapolated the

solution

interior

points.

outside on the

the outflow. In our implementation, at the new time for all boundary

In order

the domain

boundary.

to advance

to artificial

We use the

we solve points. Pt -

the

points

using

characteristic

the following

scheme

near

a cubic

extrapolation

boundary

set of equations

boundaries, condition

to get the

at

solution

pcut = 0

Pt + pcut

= R2

Pt - c2Pt = R3 V t =1_

where

P_ is determined

combination Stokes

is not

equations.

by which

specified, For further

variables

P_ is just details

those

4

are specified spatial

see [6]. 3

and which

derivatives

that

are

not.

come

Whenever

from

the

the

Navier-

X MOMENTUM

UR LEVELS

1.500 0.00 DEG 9.36x10"'6 1.50x10"'2 25_100

MACH ALPHA Re TIME GRID

1 Figure

Let r be the radius solution

of the radial

represent

a reasonable

time

3

Parallel

steps

This

section

with

the

3.1 The

in this

to keep

the

1 shows

size.

The

used

size of the

the

same

jet

of axial 50r

was obtained

requirements

Computing

plot

of size

The

result

we have

axisymmetric

a contour

for a domain

computing

tools

in an excited

a 250 x 100 grid.

paper,

a brief

momentum

in the domain

after

grid,

and

about

but

axial

run

from

direction the

chosen

16,000 the

the and

time

grid steps.

experiments

for

reasonable.

Platforms

discussion

of the

various

platforms

used

in the

study

together

used.

NOW LACE

testbed

(nodes or subsets

connection Mbits/sec

half

with problem

contains

nodes

Mbps.

Figure

equations

parallelization

nodes

use.

direction

results

5000

momentum

of the nozzle.

Navier-Stokes

5r in the

For all other

1: Axial

is regularly

1-32)

characteristics. (Mbps));

Nodes

and

connected

of them

and

present

nodes

are

through

for our purposes, an upper

990 through

half

has 32 RS6000

(node

0) which

is the

various

networks

with

use

a FDDI

to consider

(nodes

configuration

connected

is for general

9--24 are interconnected 1-16)

The

RS6000/Model

are

All the

one

It is convenient, (nodes

an

of them

upgraded.

17-32).

through

and

the

interface the nodes The

two

other

lower

file

with

Ethernet

a peak has

These

speed

and

networks

(10

to

"parallel"

bandwidth

to be partitioned half

server.

different

is dedicated

processor

into

RS6000/Model

of 100 a lower 590

CPUs (the CPU has a 66.5MHz clock, 256KB data- and 32KB instruction caches)with the following networksinterconnectingthe nodes:an ATM network capableof a peakbandwidth of 155Mbps and IBM's ALLNODE switch, referred to as ALLNODE-F (for fast), capable of a peak throughput of 64 Mbps per link. The upper half has the slower RS6000/Model 560 CPUs (the CPU has a 50 MHz clock, 64KB data- and 8KB instruction caches)and is connectedthrough IBM's ALLNODE prototype switch, referred to as ALLNODE-S (for slow), capableof a peak throughput of 32 Mbps per link. The ALLNODE switch is a variant of Omegainterconnectionnetwork and is capableof providing multiple contentionlesspaths betweenthe nodes of the cluster (a maximum of 8 paths can be configuredbetweensource and destination processors). The present setup doesnot permit the use of more than 16 processorsusing the faster networks. The nodes have varying main memory capacity (64 MB, 128 MB, 256 MB, and 512 MB). We have used the popular PVM (Parallel Virtual Machine) messagepassinglibrary (version 3.2.2) to implement our parallel programs. We will refer to the LACE cluster with RS6000/Model 560 processorsas the LACE/560 and thosewith the RS6000/Model590processorsas the LACE/590. 3.2

Shared

We used Y-MP/8

Memory

the

Cray

Y-MP/8,

has

a peak

rating

and the communication We parallelized

exploiting

the

and

the

as the the

with

been

IBM

SP in the network

parallelizing

compiler

paper.

It offers

explicit

The

(version Cray

dimensional has only

a CPU

3.2)

SP1

The

T3D

developed

is also

torus with

distributed

has

memory

16 processing

32KB

data

and

it function

nodes

network,

[15]. The

16 were available

models, customized

machine

speed

in single

we programmed version

by IBM

a distributed

a clock

Cray

address

is through

directives

space shared

in addition

multiprocessors-

nodes

(the

instruction like

of the

for the

to

user

(version

mode. using 3.2).

IBM

SP1

in each

node

is a

The

original

We will refer

interconnected

in topology

the

through

to ALLNODE,

system

to this

system

a variant

permits

using MPL a customized

of

multiple (Message version of

SP.

memory used

CPU

caches).

a SP2.

SP are

similar

multiprocessor

in our

of 150 MHz

the machine

of PVM

The

Cray.

contentionless paths between nodes. We parallelized the application Passing Library), IBM's native message passing library and PVMe, PVM

a single

processors

DOALL

on the

study.

Architecture

to make

[17]. This

for this

on different

by using

clock,

upgraded

processors,

executing

on two

IBM

a 50 MHz

software

Omega

The

vector

2.7 GigaFLOPS.

processes

application

T3D.

eight

application

Memory

the

Cray

RS6K/370 has

of the

Distributed

has

of approximately the

features

We parallelized

which

between

variables.

3.3

Architecture

and

Though the

study

has

a direct the T3D

message

with 64 nodes mapped supports

passing

the

topology

of a three

(8 × 4 x 2) (each cache

of 8KB),

multiple

paradigm

node

of which

programming

resorting

to Cray's

4

Parallelization

The 1.

factors

which

Single

affect

processor

higher

3.

cost:

volume than

startup

of data

the

cost

per

item

lead

depends

(usually data

on both

Usually,

temporarily

Some

amount

SPMD

byte)

and

transfer

as far

data

the above

startup

cost,

For the

solution

internal

subdomain

with

such leading

program

bursty

From

number

cost.

which

resulted

of communication

cost

One

into

as possible.

in

is 2-3 orders

method

startups of magnitude

to reduce

the

effect

of

long vectors.

it is desirable

that

Increasing

amount

the

of communication

communication

to increased is inevitable

multiple and

the

discussion

which

overlapping

communication

be

of overlapping,

then

we group

there

leads

to a higher

with

or right)

1 and

2. It is seen

roughly

50% of the

although

the

will then

the

solution

connected and

the

via

maximum

be approximately time,

the

ratio

are

shown

consider Ethernet.

the

transferred

effect

(145,000/(10

the

of startups, and

scheme

in Tables

same.

Note

on

throughput

some

computation bound

is 1000

seconds

(1000

x 10/10)!

for the

application

in units

the

idea

of 20 MFLOPS The

a lower

point

has

that

a network

while

floating

a for

to as Euler,

To give

for Ethernet. x 20))

along

of Navier-Stokes

to be executing

a reasonable

values

are shown

basis.

bound-

of communication

referred

is about

each the

We use a similar

requirements

processor

of 10 Mbps

processor

along

number

hereafter ratio

to communication per

communication.

application

volume

a per

Assume

725 seconds

ignoring

on

relationship

communication

temperature

send.

equations, volume

among bursty

the

of the

Navier-Stokes

throughput

of computation

operations/byte

requirements

to communication

in the

to as Navier-Stokes,

and

into a single

of Euler

time.

written

an inverse

and temperature

velocity

the communication

requirements

communication 2 shows

that

computation

of communication,

workstations processor

communication

computation

communication effects

and

referred

To reduce

all the

boundary are calculated and then packaged the flux values that need to be communicated.

usually

and

velocity,

neighbor.

waiting

are usually

relationship

hereafter

first,

throughput

process

startups.

computation,

its two flux values,

(left

is also

network's

and

programs

is a subtle

equations,

communication-

computational

There

the

cost

parallel

style.

communication

exchanges

overwhelm

of communication that

of Navier-Stokes

could communication

since

data)

number

it is clear

its appropriate

startups,

point

the

computation:

granularity

of burstiness

(single

between

the

optimizations

the startup

to be communicated

to finer

communication:

capacity

The

various

of startups.

4. Bursty

ary

cost

computation

could

number

this

communication with

however,

we will explain

communicated.

is to group

Overlapped

overlapped

below.

in performance.

2. Communication the

are listed

performance:

80% improvement

and

performance

of

of 10 per time on the Table

of floating

operations/startup

per

processor. The

application

is parallelized

tion

only.

dimensional

Two

by decomposing partitioning

was

the not

domain

by blocks

attempted

since

along a simple

the

axial

analysis

direcshows

Table 1: Application Characteristics Appln N-S Euler

Total Comp. Comm./Processor (in FP Ops (x 106)) Start-ups Volume (MB) 145,000 80,000 120 (960Mb) 77,000 20,000 64 (512Mb)

Table 2: Computation-Communication Ratios No. of Procs. 2 4 8 16

FPs/Byte FPs/Start-up Nav-Stokes Euler Nav-Stokes Euler 604 601 906K 1925K 302 301 453K 963K 151 150 227K 481K 76 75 113K 241K

that for the chosengrid size,sucha partioning performsworsethan a 1-D block partitioning. For example,with a 2-D partitioning of 16 processors(4 x 4 blocks), the ratio of the number of bytes transferredcomparedto 1-D partitioning is 1.25. This ratio will, of course,decrease when we increasethe problem size. Another disadvantagewith 2-D partitioning is that the number of start-ups is higher. For the above example, the correspondingratio for the two partionings is 1.6;this ratio doesnot decreasewith the problem size. Sincethe startup cost dominatesthe transmissioncostin mostcurrent architecturesincluding the onesusedin this study (the ratio is highest for LACE and least for Cray T3D) and the averagetransmission volume per startup is only moderate (Table 1), wedid not experiment with 2-D partitioning. The parallelization on the Cray Y-MP wasdone differently (it wasmuch easieralso)sinceit is a sharedmemory architecture: we did somehand optimization to convert someloops to parallel loops,usedthe DOALL directive, and partitioned the domain along the orthogonal direction of the sweepto keepthe vector lengths large and to avoid non-stride accessto most of the variables.

5 The single

Experimental performance user

The

using The

is the

execution

were repeatable from the mean

Experiments of LACE.

indicator

mode.

experiments the deviation

Methodology

a single

performance

total times

execution reported

time. are

All experiments

for single

runs

with negligibly small discrepancies. is about 1% or less ([6]). processor of the

were

done

original

code 7

on an IBM (Version

RS6K

were

since

we found

For example,

(Model

1) for both

conducted

560)

applications

with

that

in the

LACE,

workstation is shown

in Figure 2. 16OOO m

m

_, ....

NavJer-Stokes Euler

12OOO

m I

E

I

I

I

I

! I

C

O

I

I

I

I

I I

W

!

I

I

I

1

I

I

I

I

I

I

I

I

I

I

I

I

I

I

I

I

I

I

,

I

I

I

:

!

I

I

I

I

I

I

i

I

I

i

I

I

I X

! I !

!

I I

I

!

!

! I

I

I

I

4OOO

I

I

I !

:

!

I

I I

I

i

i

I

I

I

I

I

!

I

i

II

II

II

I

I

I

I

!

I

I

I

I!

I

I

!

I

I

I I

I

I

! I !

I

! I I

I

! I

!

I

!

I

! •

0

"--"

!

1

3

2

I

4

5

Version

Figure

We found memory was

most

hierarchy

the

(using

that

this was

loop

were

this version

running

a number

better replace (a

faster

usage

reduction

presented

of the

SP, and

Cray

We have

last

number

(RS6000/560)

by the

memory.

arrays

in stride-1

modified than

50%,

compared

presented of which

COMMON

blocks

yielded into

to 2.0

improvement

x 109 was

achieved-

of roughly

80%

5 on

different

section.

computing

On

each

of processors

(up

platform, to 8 with

platforms Cray

some

one

(Version

5),

feasible-

Version

2),

incorporated

in accordance

Y-MP,

improvement:

up

expensive

4).

All

9.3 MFLOPS

the

in

2. We experimented

Version

we measured

3 (the

resulted

are relatively

(from

in Figure 2. The optimizations were all the above mentioned optimizations.

possible

Version

paper),

a single

former

of the

performance

wherever

in the

wherever

the

cache called

to Version

following

since

performance

fashion

program,

order

feasible

poor

Improved

by multiplications

wherever

an overall

Version

in the

function

multiple

exponentiations

5.5 × 109 divisions

yielded

parallelized

the

limited

main

The

by approximately

processor

were the

in a different

by multiplication from

and

by accessing

by collapsing

MFLOPS) as illustrated that Version 5 contains We

application cache

modifications,

(replace

division

on a single

optimization).

performed

reduction

optimizations

the

achieved

of other

register

strength

time

of the

interchange

optimizations with

parts

involving

key and the

2: Execution

these

to 16.0

in sequence

with

execution

to 16 with

so

the

ideas

time

as

LACE,

IBM

T3D).

studied

the

performance

of LACE

with 8

four

networks

of differing

characteristics

a

using

"off-the-shelf'

PVM

as the

message

passing

library.

With

the impact of parallelizing the application with two message MPL and a customized version of PVM called PVMe. In all experiments, components:

wherever

processor

processor

busy

overheads

associated

components

time

is not

feasible, busy

is itself with

and

composed

sending

possible,

we have

time and

however,

Version with

5 of the

application

computation.

of the

interior

temperature for both

Version part

from

versions.

are combined into at a time to avoid

of the

make

any

overlapping

subdomain

its neighbors.

As mentioned

while Figure earlier,

a single send. We have bursty communication.

VERSION

could

execution

special

time

the

attempts

the

is waiting

processor

3 shows

stress

the timeline

the

The

software of these

monitoring time

and

tools.

of a processor

communication flux components

for the

velocity

of a processor's

"flux columns"

experimented This variant

and

to overlap

the

the two

two additive

separation

idle

by computing

native

time.

performance

include

studied

IBM's

time into

An accurate

hardware

also

5

nearest

each

VERSION

boundary

Calculate VEL, TEMP

Send

one

6

VEL,

TEMP

and

activity

with sending the flux columns is called Version 7.

Calculate

EL,

libraries-

computation

messages.

we have

SP, we have

communication

actual

receiving time

not

6 does

of each

vectors these

does

the

passing

non-overlapped

unless

The non-overlapped communication waiting for a message.

separated

the IBM

Send TEMP

VEL, TEMP

Receive

Calculate

VEL.

STRESS,

TEMP

$TRF.,S$,

FLUX

at interior

Receive

Calculate FLUX

VEL, TEMP Calculate

Overlapped Send

Commtmication

and

STRESS,

FLUX

at Boundary

Sead

FLUX

FLUX

Receive FLUX

C°mlmtati°n

Update

_I

Interior

Update Subdomam Receive FLUX

,- Bou_lxy

Upcl_'P Boundary

Figure We found

that

the

execution

3: Timeline

time

of processor

improvement

with

activity

Versions

6 and

7 were

or even worse in many experiments. Hence all our experiments were conducted 5. We do mention, however, the impact of these versions on different networks Section 6.1. The

next

section

presents

a detailed

discussion

of the

9

results

from

our

either

minimal

with Version of LACE in

experiments.

6 The

Results execution

number ingful

times

of Navier-Stokes

of processors

for each

and

computing

Euler

platform,

have

been

plotted

using

a log-log

scale

as a function

of the

to facilitate

mean-

presentation.

Performance

6.1

of LACE 10 4

lo • _P

eo

(3_E_ ALLNODE-F [3- - --E3ALLNODE-S _...... A LACE/560 Ethernet

102

,

101

,

I

10 Number of Processors

Figure Figure

4 shows

4: Navier-Stokes

the performance

of Navier-Stokes

F, ALLNODE-S, and the upper-half networks are almost identical with performance The

close

attributed is balanced or FDDI zmtwork. With

of the

performance

and

by (100

its ability Mbps)

FDDI

with

reason: to set up their

the execution

multiple

faster

are not

shown.

ATM,

and

and slower links

link speed

networks

of LACE-

ALLNODE-S paths

do not

multiple

permit

linearly

and

of ALLNODE

contention-free

time falls almost

effects begin to show, faster than ALLNODE-S.

on LACE ALLNODE-

The performance of the ATM and the FDDI and ALLNODE-S respectively. Hence the

networks

the

time

on different

Etheraet. ALLNODE-F

of ALLNODE-F

to the following

ALLNODE,

sublinearity 70%-80%

ATM

execution

with

while

increasing

FDDI

(64 Mbps/32 ATM

physical

number

(155 paths

be

Mbps) Mbps) in the

of processors-

however, beyond 12 processors. ALLNODE-F This can be attributed to both an improved 10

can

is about network

(which is twice asfast) and the superiorperformanceof the 590model (33%faster clock, data andinstruction cacheswhich are4 times bigger,and memory buswhich is 4 times wider than the 560- thesecontribute to faster instruction execution, better cachehit ratios, and lower cachemiss penalty respectively). Ethernet performancereachesits peak at 8 processorsbeyond this, the communication requirementsof the application overwhelm the network. The inability of Ethernet to handle traffic beyond 8 processorsis shown by the following simple argument: Table 2 shows that with 8 processors, Navier-Stokes produces a byte for communication

after

a 1 second

and

processor

interval produces

imately

8.5Mbs

the

performance

not

surprising,

it has each

1.06 Mb

from

151 floating operating

8 processors.

the

on the

Ethernet

by an application

therefore,

operations

at 20 MFLOPS.

for communication,

all the

seen

completed

processor

average.

gets

steadily

this

each

to approx10Mbps

bandwidth,

beyond

Consider

interval,

translates

of supporting

of this worse

average.

This

is capable

will be a fraction

performance

on the During

peak-

however;

it is

8 processors.

104

103

._E

13-.......

_---

I-

10 2

(_--O LACE/5g0 Processor busy time [3- - -El ALLNODE-F Non-overlapped Comm. LACE/560 Processor busy time A - - A ALLNODE-S Non-overlapped Comm. -_-.---_ Non-overlapped Comm. (Ethernet)

10' Number

Figure

Figure

5 aid in a more

separated the

into

processor

With between network

both

in depth

two additive busy

non-overlapped which

5: Components

time

ALLNODE

it begins

to rise.

of execution

analysis

of the

components falls

linearly

communication

time

switches, The

the two ALLNODE respectively.

of Processors

this

difference

time

performance

as explained with

the

increases time

in the

number

remains

previous

can

11

with

The

execution

section.

of processors.

steady busy

LACE)

of LACE.

superlinearly

in processor

configurations

(Navier-Stokes_

the

With number

It is seen

be attributed

and

superior

the

of processors.

the communication

to the

is

that

Ethernet,

up to 10 or 12 processors times

time

node

beyond times and

the

10 4

103

I-co

e_e Version - - .e Version Version =-- - -=Version Version - - _ Version

o

10=

101

I

,

i

,

,

5 5 6 6 7 7

ALLNODE-S Ethernet ALLNODE-S Ethernet ALLNODE-S Ethernet

I

i

1

lO Number of Processors

Figure

Figures (the

6 and

trends

6: Communication

7 show

are

the performance

similar

with

communication

and

Ethernet

and

startups.

With

communication broken can

into

cache

Version

7 attempts 5.

however.

Since

reducing increase.

bursty

6.2

have

The

8 and chosen

for this

slightly

5) is very does

computations

and

the loop

due

bursty

setup

not

and

ALLNODE-S

6 (with close

(only

the

overheads

to loss of temporal

the

former

of Ver-

number have

of

to be

computations

are higher.

locality.

overlapped

to that

increase

for the subdomain

boundary

communication

surprisingly,

Ethernet

of ALLNODE-S

can handle

communication

in Section

the

Ethernet

Further,

Consequently,

the these

to overlapping.

Not

ALLNODE-S

9 show

interior

performance

Comparative

Figures

due

to reduce

startups.

Version

for the

LACE)

of Version

Overlapping

6, since

communication),

any gain

performance

as explained Version

(Navier-Stokes;

5, 6, and 7 with

The

ALLNODE-S.

also degrades

offset

communication with

ones with

performance

overheads

computation,

separate

be overlapped

of Versions

ALLNODE-F).

sion

5 for both

optimization

only

at

the

performs

the

of increased

better

with

is appreciably

the communication

harms

cost

worse

requirements

performance

since

the

number

Version than

of the number

of

7 than

Version

5,

application, of startups

Performance the performance study--

LACE,

of the Cray

application

Y-MP,

12

Cray

on the T3D

four computing

and

IBM

SP. The

platforms performance

we

104

103

I..tO

Version o-- - .e Version

5 ALLNODE-S 5 I:thernet

Version o- - - -a Version

6 ALLNODE-S 6 El_ernet

Version 7 ALLNODE-S z, - - _ Version 7 Ethernet

10 2

10 _

1'0 Number

Figure

of LACE

is reported

Surprisingly,

7: Communication

for ALLNODE-F

LACE,

of Processors

even

with

optimization

and

(Euler;

LACE)

ALLNODE-S.

ALLNODE-S,

outperforms

SP even

though

the

former

uses

off-the-shelf PVM and the latter uses MPL, IBM's native message passing library. (Our version of) MPL imposes a limit on the number of (non-blocking) send primitives that can be simultaneously we were factor MHz

active-

forced

to the clock)

to use

blocking

relatively between

this limit

poor the

is lower

send

than

the

primitives.

performance.

The

560 (50 MHz)

and

We CPU

the

Another

surprising

worse

than

T3D's

CPU

and

result

has

and

a peak

rating

8KB and

performance.

which

size

ALLNODE-S, poor

is worse

than

is 2.3X

suspect

32KB).

[17].

These

in addition Poor results

A reasonably

they

single-processor stress fast CPU

the with

of the

performance and

have

13

data

rating

2-way

set

T3D

(32KB

which

than

of the

T3D

of superior

cache

set associative

cache

(62.5 to the

compared

4 and

5).

is consistently

8 processors. 590 and

560

instruction has

design

cache 64KB

caches

also been to the

and a high

The models,

small direct-mapped date caches of sizes

associative

on the

contributing

contributor

cache

for less

hence,

in speed

6.1 (Figures

of Cray

3X the

performance importance

to be one

Another

see Section

ALLNODE-S

a large,

this

application;

on the SP is intermediate

We attribute the T3D's poor performance to the (both the 560 and 590 have 4-way set-associative

256 KB respectively;

elsewhere

and

is the relatively

ALLNODE-F

respectively. of 8KB size sizes

of ALLNODE-F

of the

590 (66.6MHz).

poor performance of the SP is the relatively small to 64KB on LACE/560 and 256KB on LACE/590). For a comparison

requirements

of

reported overall

bandwidth

10 4 I

i

I03 _ "0- - -0

® I-c o

Cray Y-MP A- - ._ IBM SP (RS6K/370) B- - -Q ALLNODE-S

x

"

10 2

v .......-v Cray T3D _, - - e ALLNC)DE-F

101

,

,

,

,

,

,

i

,

i

J

,

10 Number of Processors

Figure

8: Execution

time

of Navier-Stokes

on computing

platforms

104

o

.__ I-c o

x

'" 102

101

.......

1JO

Number of Processors

Figure

9: Execution

time

of Euler

14

on computing

platforms

Table

3: Speedup

No. of Procs.

Architectural ALLNODE-S 3.2

4 16

bus

connecting

the

poorly

designed

Table

3 shows

The

with

CPU-cache

speedups, of each

munication

speedup

at 16 processors.

Cray

Y-MP

has

in single

user

from

computation

the

modestly

mode

best

architectures

be attributed

of about

simulates

with

increasing

that

data

number

of processors.

a message

for transmission be implemented

that

efficiently.

time

very

corresponding

good

the network

speedup

networks the and

the

between

the

If NOW

both Such

the

and

execution also

and

time

is the

effort

we were time

for both Y-MP

not

application architectures

15

and

arise

the

level and

poor from

switching the

physical

are to be feasible network under

way

and [1].

to separate Y-MP

node

that

the

scales With

(and

the

times

multiple

times

of the

as massively

the message

can

waiting

overheads layer

does

LACE/590

performance

in processor

mainly

context

time

the applications.

on a single

increase

connect

the

passing

a

beyond

rate

able

also,

the relatively

resulting

overheads

is already

in-

has

9 that

shown

a

modest with

ALLNODE-F 8 and

comhave

is only

transfer

which

8 processors

from

interconnection

networks

peak

of connect

libraries,

the

the

to continue

Figures

MB/sec

of an 8-node

passing

characteristics,

speedup

trend

from

charac-

can sustain

ALLNODE

flattening

(150

effect

3.5 with

These

is copied

or reception.

it is clear

of the

we obtained a speedup of 7.1. Observe faster than a single node of the Y-MP.

overheads

with

ALLNODE-S.

The the

CPU

architectures.

Not surprisingly,

speed

I/O

faster

various

to illustrate

observe

than

the execution

use message

setup

to be communicated

in transferring processors,

which

to large

the

performance

rapidly this

Also,

Considering

with

both

architectures.

performance.

not include the I/O overheads), with 16 processors is about 8_ With

degrade

better

includes

a speedup

tool which

Though

network

performs

time).

achieving

a profiling

its superior

(this

SP exhibit

to expect

14.6

to a much

the corresponding

ALLNODE-S.

cost)

by far the

and

they

on NOW

than

with

setup

processor 16 processors

application.

T3D 3.9

interfaces.

and that

reasonable

of processors

T3D small

memory

single

T3D

at 4 processors,

speedup

8 processors, relatively

of the

It is only

number better

to the

indicating

superior

of Navier-Stokes

4 processors Both

speedup,

requirements

reasonable creasing

with

Cray

13.3

performs

characteristics

architecture.

linear

memory

cache-main

relative

are shown

almost

slightly

speedup

IBM SP 3.8

7.9

main

and

measured

architecture, teristics

and

Platform

ALLNODE-F 3.4

7.5

cache

the

Characteristics

that

arise

network parallel library

a

6.3

Comparison

of Message

Figures

10 and

11 compare

libraries

on the

SP--

tation

and

the

the

performance

execution

communication

Passing

times

components.

Libraries

of the

PVMe

been

separated

graphs

show

have The

and

the

MPL

message

passing

into non-overlapping that

MPL

compu-

is consistently

faster

10 4

103

®

H

Processorbus_

e- --e

Processor

H

Non overlapped

comm

with MPL

13----[]

Non overlapped

comm

with PVMe

busy

time with PVMe

E

10=

101

i

lO Number of processors

Figure than

PVMe

Observe but

creases.

computation further

6.4

part

attesting

also

Load

to our

is evenly able

(Navier-Stokes;

number

since

with the setup

previous does

the

approximately

communication

of processors

phenomenon

(see Figure

and

it implies

overheads

4) where

the

observation

that

is not

though

number

IBM

the

that

there

the

MPL

small

communication

Note

This PVMe)

in-

overlapping

however

that

the

phenomenon

communication (and

for Euler.

negligibly

is increased

of communication. non-overlapped

40% only

actual

of processors.

SP)

is

increases,

library

does

not

on LACE.

Balancing

how well is the

We were

the

includes

of LACE

PVMe

of non-overlapped

communication

as well as PVM

Finally, cation

and

in case

perform

with

and

75% for Navier-Stokes

amount

is an interesting

of computation not seen

the

it decreases This

of MPL

by approximately

also that

that

10: Comparison

application

distributed to measure

but the

load this

processor

balanced?

may

not always

busy

times

16

The

amount

translate (this

time

of computation to a load does

balanced

not include

for the appliexecution. the processor

10 4

10 3

H

Processor

busy t_me with MPL

- e Processor busy t_me with PVMe H Non overlapped comm with MPL B- -- e Non overlapped comm with PVMe"

E

10 2

101

Number

Figure

waiting able

7

time)

for Navier-Stokes

to achieve

almost

NOW

networks

CFD

to circumvent

the

plication

and

level also

performance. good

of the

of a fast

A traditional error-prone

but

the

potential

fast

and

processor,

importance

12 shows

small, still using memory

hierarchy.

reason

that

we were

direct-mapped outperforms message

passing

multiprocessors,

17

characterisstudy

a message

A proper

the

to achieve

bottleneck

cache

poor

if the

implemented

between

performance the

indicates

architectures

are efficiently

available,

for relatively

The

parallel

libraries

processor

processors

the memory the

platforms.

in transferring

of single

RISC

that

is the

distributed

passing

involved

multiprocessor

with

SP)

and scalability

to be cost-effective

message

of the network.

and

an application

of architectural

layer

off-the-shelf cache

IBM

of the SP. Figure

overheads

We believe

vector

Parallelizing

the

(Euler;

communication,

on a variety

physical

fast,

performance.

in spite

size.

traditional

With

the performance

have

highlights

processor

PVMe

balancing.

the computational,

reasonably the

and

Conclusion

application

architectures are made

study

studied

of MPL

on each load

and

we have

tics of a typical

The

perfect

Discussion

In this paper that

11: Comparison

of processors

seems

design

performance

ap-

good to be

is critical of the

to

T3D,

cache. multiprocessors libraries

of modest is rather

this effort

tedious

is worthwhile

to medium and

even

since

good

1500

_

m

m

m

_

m

m

100o

®

E

£ a.

500

i i I !

0

,i 8

J

0

4

12

16

Processor Number

Figure

scalability

limitations

study

have

to larger

available. both

busy

times

(Navier-Stokes;

IBM

SP)

is achievable.

Resource the

12: Processor

axial

to understand directly domain

mentioned

and radial the

us to limit

multiprocessors

For reasons

the

forced

and

physics

of the

problem

to 16 processors.

parallelization

4, we have

A future

We plan

study

to other

in Section

directions.

from the flow field. and a finer mesh.

our

tools

not

better

and the

a finer effects

become

decomposition

the study mesh

to extend

as resources

explored

goal is to conduct

to explore

We hope

along

for a larger

to compute

of 2-D partitioning

domain

the jet with

noise

a larger

Acknowledgments Part at

of this NASA

author OH.

work

Lewis

was

done

Research

was in residence

while Center

in the

the

first

during

ICOMP

author

was

1993-94.

program

a Visiting

Simulations

at NASA

Lewis

Senior were

done

Research

The authors would like to thank Kim Ciula, Dale Hubler, and Rich assistance with various aspects of the LACE and IBM SP architectures.

18

Research while Center,

Rinehart

Associate the

second

Cleveland,

for

their

References [1] Anderson,

T. A.,

(Networks

[2] Bailey, NAS

D. H., Barszcz,

3 User's National

lems".

Micro,

W.,

vol. 30, no.

[6] Hayder,

M. E.

Network

[7] Hayder,

Jayasimha,

M. E. and

E.

Turkel,

Sciences

M. E. Turkel,

Number 1993.

Jet

[9] Horowitz,

Flow".

J. C.

Aerosciences

[10] Landsberg, Solving

"Lewis

A. M., Flows

Sciences

E. and

[11] Lenoski,

AIAA

D. E., et al.

Multiprocessor".

[12] Lighthill,

Int'l

[13] Mankbadi,

vol.

and

Its Radiated E. and

CM-5".

32rid

Order

Accurate

Conference,

AIAA R. R.

Aerospace

Ridge

Dependent

Prob-

Orszag,

S. A.

Computers

and

Three

Ames

Boris,

J.

Dimensional

94-0413,

P.

1952,

D.

of a High

AIAA

Efficient,

93-0653,

Mach

January

Computing October

Parallel 32nd

for

1993. Method

AIAA

for

Aerospace

1994.

Aerodynamically,

Coherence

AIAA

Povinelli, Journal,

"Implementation Aerospace

Sciences

19

L. A.

Protocol

May

1990,

Part

I, General

"The

for the

pp.

DASH

148-159. Theory".

Proc.

Structure

of Supersonic

vol. 32, no. 5, pp 897-906,

1994.

of a Parallel

Euler

Conference,

a

Problems".

pp. 564-587.

M. E. and

Sound".

on

1993.

Center,

Geometries".

January

Architecture,

Generated

"An

of Viscous

Distributed

Research

Flows

to appear.

Simulations

Conference,

Environment".

NASA

of Jet

1996,

July

"Numerical

Sciences

Cluster

T. R. and

93-3074,

on Computer

AIAA

Oak

D. M. and

Solutions

Conf.

Mavriplis,

V. "PVM

Computer".

voh 34, no. 4, April

Cache

R. R., Hayder,

Sunderam,

Simulations

Directory-Based

211,

Results".

for Time

Nosenchuck,

Navier-Stokes

"The

M. J. "On Sound

Soc. London,

"High

Workshop,

in Complex

Conference,

Journal,

Advanced

Young,

Benchmark

ORNL/TM-12187,

Methods

"Navier-Stokes

Mankbadi,

31st AIAA

Applications

R. and

Report

M. G.,

on the

D. N.

AIAA

Aerospace

NOW

1988, pp. 357-364.

of Workstations",

31st AIAA

[8] Hayder,

and

1/2,

for

pp. 54 - 64.

Parallel

Manchek,

Technical 1993.

Simulations

"A Case

1994.

J., Jiang,

W. S., Littman,

Turbulence

team,

1995,

H. D. "NAS

October

A., Dongarra,

NOW

February

L., Simon,

NAS-9_,-O01,

M. E., Flannery, Scale

Structures,

on the 1994.

IEEE

D. A.,

D. and Turkel, E. "Dissipative Two-Four Math. Comp. voh 30, 1976, pp. 703-723.

"Large

I14J Morano,

Patterson,

Guide and Reference Manual", Laboratory, Oak Ridge, TN,

[5] Hayder,

Flow

E.,

E., Dagum,

Report

A., Beguelin,

[4] Gottlieb,

D.

of Workstations)".

Technical

[3] Geist,

Roy.

Culler,

Unstructured AIAA

94-0755,

Jet

Solver January

[15] Oed,W. "The Cray ResearchMassivelyParallel System-Cray T3D", Cray [16] Scott,

Research

J. N., Mankbadi,

Conditions Conference, [17]

November

AIAA

R. R., Hayder,

93-4366,

Technical

Report,

1993.

for the Computational

Stunkel, C. B., Shea, Performance Switch", pp.

[18]

GmbH,

M. E. and

Analysis

October

of Jet

Hariharan

Noise".

S. I.

31st AIAA

"Outflow

Boundary

Aerospace

Sciences

1993.

D. G., Grice, D. G., Hochschild, P.H., Tsao, M. "The Scalable High Performance Computing Conference,

SP1 May

High1994,

150-157.

Venkatakrishnan, V. "Parallel Aerospace Sciences Conference,

Implicit AIAA

Unstructured Grid Euler 94-0759, January 1994.

2O

Solvers".

32nd

AIAA

REPORT

DOCUMENTATION

PAGE

Fo,_ Approved OMB No. 0704.0188

Public reportingburdenfor thiscollection of informationis estimatedto average! hour per responseincluding the time forreviewng nstruct ons searchngexisting data sources, gatheringandmaintaining the data needed,andcompletingand reviewingthecollection of information Sendcomments regardingthis burdenest mate or any otheraspectof this collectionof information,including suggestions for reducingthis burden,to Washington HeadquartersServices,Directoratefor Information Operationsand Reports.12! 5 Jefferson Davis Highway,Suite 1204. Arlington.VA 22202-4302.and to the Officeof Managementand Budget,PaperworkReduction Project (0704-0188). Washington,DC 20503, ].

AGENCY

USE ONLY(Leave

blank)

2. REPORT March

4. TITLE

AND

AN

1 3. REPORT ]

TYPE

Contractor

AND

DATES

COVERED

Report

SUBTITLE

5. FUNDING

EVALUATION

FOR

DATE 1996

OF

PARALLEL

ARCHITECTURAL

NUMBERS

PLATFORMS

NAVIER-STOKES

C NAS1-19480

COMPUTATIONS

WU

505-90-52-01

6. AUTHOR(S) D.

N.

M.

E.

S.

Jayasimha Hayder

K.

Pilla_

7. PERFORMING Institute Mail

ORGANIZATION

for Stop

Computer

132C,

Hampton,

VA

NAME(S)

NASA

Langley

VA

Final

and

Space

NAME(S)

AND

ADDRESS(ES)

Report

No.

96-22

10. SPONSORING/MONITORING AGENCY REPORT NUMBER

Administration

NASA ICASE

CR-198308 Report

No.

96-22

NOTES

Technical

Monitor:

Dennis

M.

Bushnell

Report

Submitted

to

Journal

Unclassified-U

Subject

of

Supercomputing.

60,

(Maximum

study

the

apphcation,

induced

which

solves

communication, the

architecture testbed

popular

message

time at

scalability field

of

different the cluster passing

topologies -of workstations

hbraries

used

the for

to

the processor speed a variety of architectures, platforms.

characteristics

a jet

using

chosen memory

the

of

a Computational

Fluid

compressible

for this study multiprocessor

IBM SP and the on the performance parallehzation. for

good we are

The

Cray of work

are (the

Dynamics;

Communication;

17. SECURITY CLASSIFICATION OF REPORT Unclassified NSN 7540-01-280-5500

Navier-Stokes a cluster of Cray YMP),

also

highhghts

Navier-Stokes

Interconnection Scalability;

and

Network; Shares

Euler Message

Equations; Passing

Network Library;

of

Work-

Computa-

Memory

18. SECURITY CLASSIFICATION OF THIS PAGE Unclassified

on

workstations (the and distributed

T3D. We investigate the application and

single processor performance. able to point out the strengths

Dynamics

equations,

the the

the

impact overheads

importance

of of

By studying the and weaknesses

15. NUMBER Fluid

Architectures; and

and flow

TERMS

Computational stations

accurate

platforms. The platforms NASA Lewis), a shared

matching the memory bandwidth performance of an apphcation on of each of the example computing

14. SUBJECT

CODE

200 words)

multiprocessors with networks connecting by

12b. DISTRIBUTION

61

computational,

a variety of parallel LACE experimental memory various

STATEMENT

nhmited

Category

: 13. ABSTRACT

tion

Center ICASE

12a. DISTRIBUTION/AVAILABILITY

We

8. PERFORMING ORGANIZATION REPORT NUMBER

Engineering

Center

11. SUPPLEMENTARY Langley

and

23681-0001

Research

Hampton,

ADDRESS(ES)

Science

Research

AGENCY

Aeronautics

Langley

in

23681-0001

9. SPONSORING/MONITORING National

AND

Applications

OF PAGES 22

16. PRICE CODE A03

19. SECURITY CLASSIFICATION OF ABSTRACT

20. LIMITATION OF ABSTRACT

Standard Form 298(Re¥. 2-89) Prescribedby ANSI Std Z39-18 298-102