SCALABILITY OF SPARSE DIRECT SOLVERS*

ROBERT SCHREIBER†

* Written May 1992.
† Research Institute for Advanced Computer Science, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035. This author's work was supported by the NAS Systems Division of NASA via Cooperative Agreement NCC 2-387 between NASA and the Universities Space Research Association (USRA). Work was performed at the Research Institute for Advanced Computer Science (RIACS), NASA Ames Research Center, Moffett Field, CA 94035. RIACS is operated by the Universities Space Research Association, The American City Building, Suite 311, Columbia, MD, (301) 730-2656.

Abstract. We shall say that a scalable algorithm achieves efficiency that is bounded away from zero as the number of processors and the problem size increase in such a way that the size of the data structures increases linearly with the number of processors. In this paper we show that the column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable. By considering message volume, node contention, and bisection width, one may obtain lower bounds on the time required for communication in a distributed algorithm. Applying this technique to distributed, column-oriented, full Cholesky leads to the conclusion that N (the order of the matrix) must scale with P (the number of processors) so that storage grows like P^2; so the algorithm is not scalable. Identical conclusions have previously been obtained by consideration of communication and computation latency on the critical path in the algorithm; these results complement and reinforce that conclusion. For the sparse case, we have experimental measurements that make the same point: for column-oriented distributed methods, the number of gridpoints (which is O(N)) must grow as P^2 in order to maintain parallel efficiency bounded above zero. Our sparse matrix results employ the "fan-in" distributed scheme, implemented on machines with either a grid or a fat-tree interconnect using a subtree-to-submachine mapping of the columns. The alternative of distributing the rows and columns of the matrix to the rows and columns of a grid of processors is shown to be scalable for the dense case. Its scalability for the sparse case has been established previously. To date, however, none of these methods has achieved high efficiency on a highly parallel machine. Finally, open problems and other approaches that may be more fruitful are discussed.

Key words. massively parallel, scalable, sparse Cholesky factorization, distributed-memory, parallel algorithms

AMS(MOS) subject classifications. 65F25, 65F50, 68R10

1. Introduction. The arrival of massively parallel, distributed-memory, MIMD, message-passing supercomputers makes this perhaps an opportune time for the scientific computing community to decide whether or not to continue to develop parallel sparse direct methods, or to give them up in favor of iterative methods. An efficient, highly parallel, scalable solution of the sparse linear system Ax = b by direct factorization remains, in some sense, undiscovered, despite prolonged and extensive investigation of highly parallel sparse Cholesky factorization by a number of researchers [2, 3, 4, 9, 10, 14, 15, 18, 19, 30].

Two lines of attack have been taken up to now. In the first class of methods, the columns of the matrix A and of its Cholesky factor L are mapped to processors: column j is held by processor map(j), where the value map(j) is determined as part of the solution process, and the computation is organized as a collection of column tasks, each of which is essentially a DAXPY. This column-oriented approach has been used by a number of researchers on MIMD message-passing machines [2, 3, 4, 9, 14, 18, 19, 30]. A second approach is to map the problem to the processors in two dimensions, distributing the rows and columns of the matrix along the rows and columns of a grid of processors. This approach is favored for the dense problem by Dongarra, Van de Geijn, and Walker [7] and by the LAPACK effort for distributed-memory architectures [1]; recently it has been proposed for the sparse problem by Venugopal and Naik [29], by Kratzer [15], and by Gilbert and Schreiber [10], who have shown it to be scalable.

FIG. 1. Microprocessor and supercomputer performance per CPU, 1985-1993.

The second approach has been implemented on MIMD message-passing machines; the author has also used it successfully for the dense problem on the Maspar MP-1, a massively parallel SIMD machine.

In this paper we investigate the scalability of these classes of methods for distributed sparse Cholesky factorization. By a scalable algorithm for this problem, we mean one that maintains efficiency bounded away from zero as the number P of processors grows and the problem size (in this case the number of gridpoints or the order of the matrix) grows linearly in P. We concentrate on the model problem arising from the 5-point finite difference stencil on an Ng x Ng grid.

We will show that the column-oriented methods cannot work well when the number of gridpoints (N = Ng^2) grows like O(P) or even O(P log P). We show that communication will make any column-oriented, distributed algorithm useless, no matter what the mapping of columns to processors. This is true because column-oriented distribution is very bad for dense problems of order N when N is not large compared with P. Two improvements seem to be required.
1. A two-dimensional wrap mapping of the dense frontal matrices, at least for those corresponding to fronts near the top of the elimination tree.
2. A "fan-out" submatrix Cholesky algorithm with multicast instead of individual messages.

It is reasonable to ask why one should be concerned with machines having thousands of processors. Figure 1 should illustrate the reasons for believing that supercomputer architecture is now making an inevitable and probably permanent transition from the modestly parallel machine to the highly parallel (257 - 4,096 processors) or massively parallel (4,097 - 65,536 processors) machine.

The following estimates of the hardware we expect to be achievable during the coming decade help motivate the work presented here:

• 1 Gflop processor chips, running at roughly 100 Mhz.
• Physically distributed memory, perhaps with the illusion of shared memory, and in any case with a nonuniform-access memory hierarchy. The bandwidth to nonlocal memory will be on the order of 100 bits per processor per clock; since a word (with its overhead bits) is 96 bits, or 12 bytes, this is roughly 100 Mwds/sec, so the ratio of computation speed to nonlocal memory speed will be in the 5 - 50 range. Patterson [23] gives comparable estimates.
• Interconnect topology may be a 2D or 3D grid or torus, or maybe a fat tree of the type of Leiserson [16].
• Communication latency for nonlocal access will be large compared with the processor cycle time, and interprocessor communication bandwidth will be a constraining resource.

This work builds on previous analyses of distributed matrix computations by a number of investigators. Notable efforts include the analysis of distributed, dense triangular solution by Li and Coleman [17]; the communication results of George, Liu, and Ng for sparse Cholesky on a hypercube [8]; the modelling of speedup in parallel sparse factorization by Ostrouchov, Heath, and Romine [22]; and the analysis of data communication in parallel architectures by Saad and Schultz [28]. The work of Leiserson on fat-trees [16] prefigures some of the analysis here. An interesting analysis of the effect of the memory system on dense Cholesky has been made by Rothberg and Gupta [26], and their recent work on block-oriented sparse factorization has come to conclusions similar to ours [27].

In Section 2 we introduce the problem and the column-mapped distributed Cholesky algorithms; Section 3 develops lower bounds on communication time; in Section 4 we compute these bounds for the dense problem with column mapping and use them to illustrate the difficulty; Section 5 extends the analysis, through an experiment in simulation, to the sparse case; in Section 6 we consider the problems that are still unresolved.

2. Distributed Cholesky. Cholesky factorization may be understood as the following program:

    program cholesky(a, n)
      for k = 1 to n do
        cdiv(k);
        for j = k + 1 to n do
          cmod(j, k);
        od
      od

Procedure cdiv(k) computes the k-th column of the factor, L.k, from the k-th column of A: it takes the square root of the diagonal element Akk and scales the k-th column by 1/√Akk. Procedure cmod(j, k) subtracts Ljk times the k-th factor column from the j-th column. The execution order of this program is not the only one possible. The true dependences require only that cmod(j, k) must follow cdiv(k), and that cdiv(k) must follow all the cmod(k, l) for l < k with Lkl ≠ 0. A second form of Cholesky is this:

    cholesky(a, n)
      for k = 1 to n do
        for l = 1 to k - 1 do
          cmod(k, l);
        od
        cdiv(k);
      od

The first form is sometimes called "submatrix" Cholesky and sometimes a "right-looking" method. The second form goes by the names "column" Cholesky or "left-looking". In the sparse case, sparsity is exploited within the vector operations cdiv and cmod; furthermore, most cmod operations are omitted altogether because the multiplying scalar Ljk is zero.
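To make the two forms concrete, here is a minimal, dense, unblocked sketch in Python (using NumPy) of cdiv, cmod, and both orderings. It is an illustration of ours, not the distributed algorithm studied below; the function names mirror the pseudocode above, and the test matrix is arbitrary.

    import numpy as np

    def cdiv(A, k):
        # Scale column k, from the diagonal down, by 1/sqrt(A[k, k]).
        A[k:, k] /= np.sqrt(A[k, k])

    def cmod(A, j, k):
        # Subtract L[j, k] times (the lower part of) column k from column j.
        A[j:, j] -= A[j, k] * A[j:, k]

    def cholesky_right_looking(A):
        # "Submatrix" / right-looking form: after cdiv(k), update all later columns.
        n = A.shape[0]
        for k in range(n):
            cdiv(A, k)
            for j in range(k + 1, n):
                cmod(A, j, k)
        return np.tril(A)

    def cholesky_left_looking(A):
        # "Column" / left-looking form: gather all updates to column k, then cdiv(k).
        n = A.shape[0]
        for k in range(n):
            for l in range(k):
                cmod(A, k, l)
            cdiv(A, k)
        return np.tril(A)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        B = rng.standard_normal((6, 6))
        A = B @ B.T + 6 * np.eye(6)       # a symmetric positive definite test matrix
        L = cholesky_right_looking(A.copy())
        assert np.allclose(L @ L.T, A)     # both forms reproduce A = L L^T
        assert np.allclose(cholesky_left_looking(A.copy()), L)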

The column-oriented distributed algorithm that we consider is the "fan-in" scheme [2], shown in Figure 2. Column j of A and of L is stored at processor map[j]. A processor that owns columns k with Ljk ≠ 0 computes the aggregate update to column j, the sum of all the cmod contributions it can form from its own columns, and sends this single aggregate column, rather than its individual updates, to processor map[j]. The owner of column j, alternatively, applies its own updates and the aggregate updates it receives, and then completes the cdiv(j). (As befits an MIMD machine, each processor executes the code of Figure 2 independently; myname denotes the executing processor.)

For the model problem, the factorization performs O(N^1.5) operations and the factor L has O(N log N) nonzero entries. A scalable distributed factorization should therefore run in time O(Ops/P) = O(N^1.5/P) with O(N log N / P) data per processor; in particular, when the number of gridpoints grows like O(P), or even like O(P log P), the data per processor stays essentially constant.

    fan-in(a, n, L, map)
      integer n, map[n];
      real a[n,n], L[n,n];
      mycols = { j | map[j] = myname };
      for j = 1 to n do
        if ( row[j, myname] ≠ ∅ || j ∈ mycols ) then
          t = 0;
          for k ∈ row[j, myname] do
            t = t + ajk (ajk, ..., ank)^T ;
          od
          if ( j ∉ mycols ) then
            Send aggregate update column t to processor map[j]
          else
            L.j = (ajj, ..., anj)^T - t;
            while not all aggregate updates for column j have been received do
              Receive an aggregate update column u[j, π];
              L.j = L.j - u[j, π];
            od
            L.j = L.j / √Ljj;
          fi
        fi
      od

FIG. 2. Fan-in distributed, column-oriented Cholesky.

3. Methodology. In order to assess the scalability of these methods we have found it useful to employ certain abstract, easily computed lower bounds on communication time. Our approach to these lower bounds is neither new nor deep; it is a straightforward accounting of the communication resources that a given computation must use.

Our model of the machine assumes that an undirected graph G = (W, L) is given; the vertices W are the processors and memories, and the edges L are the communication links. Let V ⊆ W be the set of all processors. Memories sit at some, possibly at all, of the vertices of the graph. The model includes message-passing machines with grid, torus, or hypercube interconnects, tree-structured machines like a CM-5, and shared-memory machines having physically distributed memory (the Tera machine, for example). (We ignore the processor-memory channels; local memory access is not charged as communication in this model.) We assume that the processors are identical and that the links are identical, and we ignore message start-up costs, accounting for data only, per word.

Let φ be the time for one floating-point operation, in seconds. Let β be the inverse bandwidth (slowness) of a link, in seconds per word. Let β0 be the inverse rate, in seconds per word, at which a processor can send or receive data. We expect that β0 and β will be roughly the same.

A distributed-memory machine performs a computation by exchanging information among its processors, by sending and receiving messages. Let M be the set of all messages communicated. Each message m ∈ M has a source processor src(m) and a destination processor dest(m), both elements of V, and |m| denotes the number of words in m. Each message takes a certain path of links from its source to its destination processor; let p(m) = (l1, l2, ..., ld(m)) be that path and d(m) its length. We assume that messages travel over shortest paths, so that d(m) is the distance in G from src(m) to dest(m). For any link l ∈ L, let M(l) = {m ∈ M | l ∈ p(m)} be the set of messages whose paths utilize l.

The completion time of the computation is obviously bounded below by each of the following quantities; the first three are computable from the message sizes and distances alone, while the last is characterized by the full set of paths p(M).

1. (Flux per link)

       (1/|L|) Σ_{m ∈ M} |m| d(m) β.

2. (Bisection width) Given disjoint sets V0, V1 ⊆ W, define sep(V0, V1) to be the size of a smallest set L' ⊆ L that is an edge separator of V0 and V1, and flux(V0, V1) to be the sum of |m| over all messages m whose source and destination lie on opposite sides of the separator. The bound is

       flux(V0, V1) / sep(V0, V1) · β.

3. (Arrivals/Departures (also known as node congestion))

       max_{v ∈ V} Σ_{dest(m) = v} |m| β0    and    max_{v ∈ V} Σ_{src(m) = v} |m| β0.

4. (Edge contention)

       max_{l ∈ L} Σ_{m ∈ M(l)} |m| β.
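As an illustration of how these quantities can be evaluated, the following Python sketch computes the four bounds for an arbitrary message set on a 2D grid, routing each message along a fixed row-then-column shortest path. The code and its names (Message, route, bounds) are ours, and the dimension-ordered routing is an assumption made only for the illustration; this is not the simulator of Section 5.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Message:
        src: tuple      # (row, col) of the source processor
        dest: tuple     # (row, col) of the destination processor
        words: int      # |m|

    def route(src, dest):
        """Row-then-column shortest path on a 2D grid; a link is the frozenset of its endpoints."""
        path, (r, c) = [], src
        while r != dest[0]:
            step = 1 if dest[0] > r else -1
            path.append(frozenset([(r, c), (r + step, c)])); r += step
        while c != dest[1]:
            step = 1 if dest[1] > c else -1
            path.append(frozenset([(r, c), (r, c + step)])); c += step
        return path

    def bounds(msgs, side, beta, beta0):
        """The four lower bounds of Section 3 for a side x side grid (not a torus)."""
        n_links = 2 * side * (side - 1)                 # |L| for a 2D grid
        flux = sum(m.words * len(route(m.src, m.dest)) for m in msgs)
        per_link = defaultdict(int)
        arrive, depart, across = defaultdict(int), defaultdict(int), 0
        for m in msgs:
            arrive[m.dest] += m.words
            depart[m.src] += m.words
            for l in route(m.src, m.dest):
                per_link[l] += m.words
            # bisection: words that must cross the vertical midline of the grid
            if (m.src[1] < side // 2) != (m.dest[1] < side // 2):
                across += m.words
        return {
            "flux_per_link": flux * beta / n_links,
            "bisection": across * beta / side,          # the midline cuts 'side' links
            "node_congestion": max(max(arrive.values()), max(depart.values())) * beta0,
            "edge_contention": max(per_link.values()) * beta,
        }

    if __name__ == "__main__":
        # toy example: every processor of a 4 x 4 grid sends 10 words to processor (0, 0)
        msgs = [Message((i, j), (0, 0), 10)
                for i in range(4) for j in range(4) if (i, j) != (0, 0)]
        print(bounds(msgs, side=4, beta=1.0, beta0=1.0))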

Of course, the actual communication time may be greater than any of these bounds; in particular, the wires are a shared resource, and their use by the messages needs to be scheduled, either dynamically or, when the set of messages M and the (time-independent) paths p(M) are known in advance, statically. With detailed knowledge of the schedule, better and more realistic bounds can be obtained, but we have found that kind of integrated analysis to be unnecessarily cumbersome; for the purposes of this paper the four bounds above provide the information we need.

4. Dense Cholesky. In this section we consider the distributed factorization of a dense, symmetric, positive definite matrix of order N. The dense problem matters for us because a substantial fraction of the work in a sparse factorization is the factorization of a final, dense problem (of order Ng for the model problem); doing dense, distributed Cholesky efficiently is a sine qua non for a scalable sparse algorithm.

4.1. Mapping columns. Assume that the columns of a dense, symmetric matrix of order N are mapped to processors cyclically: column j is stored in processor map(j) ≡ j mod P. Let us consider first a right-looking, fan-out distributed Cholesky, and examine the operation count, the critical path, and the scheduling of the tasks.

The factorization has N^3/3 multiply-adds. The critical path in the task DAG is the path cdiv(1), cmod(2,1), cdiv(2), cmod(3,2), ..., which has O(N) tasks; by making the column operation an atomic unit of computation we have lengthened the critical path to O(N^2) operations, since each cdiv or cmod on the path is itself O(N) multiply-adds. Execution time must therefore be at least

    max( N^3 φ / (3P),  N^2 φ ),

the first term coming from the operation count and the second from the critical path. Therefore, no matter how the scheduling is done, at most O(N) processors can be used efficiently.

Next, consider communication costs. Suppose that P is a perfect square and that the machine is a √P x √P grid or toroidal mesh. (This assumption is not necessary for our conclusions, but it simplifies things.) Consider a mapping of the computation in which the operation cmod(j, k) is performed by processor map(j), a fan-out method. After performing the operation cdiv(k), processor map(k) must send column k to all processors in {map(j) | j > k}.

              2D          3D
    Grid      (2/3)√P     P^(1/3)
    Torus     (1/2)√P     (3/4)P^(1/3)

TABLE 1. Average distance between two randomly chosen processors.
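These averages are easy to check numerically; the following short Python sketch (ours, purely illustrative) estimates the mean Manhattan distance between two uniformly random processors on a 2D grid and on a 2D torus and compares it with the (2/3)√P and (1/2)√P entries of Table 1.

    import random

    def mean_distance(side, torus, samples=200_000, seed=1):
        rng = random.Random(seed)
        total = 0
        for _ in range(samples):
            x1, y1 = rng.randrange(side), rng.randrange(side)
            x2, y2 = rng.randrange(side), rng.randrange(side)
            dx, dy = abs(x1 - x2), abs(y1 - y2)
            if torus:                       # wrap-around links shorten each hop count
                dx, dy = min(dx, side - dx), min(dy, side - dy)
            total += dx + dy
        return total / samples

    if __name__ == "__main__":
        side = 32                           # P = 1,024 processors
        sqrtP = side
        print("grid :", mean_distance(side, torus=False), "vs (2/3)sqrt(P) =", 2 * sqrtP / 3)
        print("torus:", mean_distance(side, torus=True),  "vs (1/2)sqrt(P) =", sqrtP / 2)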

Two possibilities present themselves. Processor map(k) may send column k separately to each of its destinations, as individual messages; or the destinations may be reached through a spanning tree rooted at map(k), each processor forwarding the data to its children in the tree (a multicast). With separate messages the same data traverse many links; with a multicast the data of column k cross each link of the tree only once.

In order to compute the flux bounds we need an estimate of the "average distance" that a message travels. Let us assume that source and destination are, in effect, randomly chosen processors; the average distances between two randomly chosen processors of 2D and 3D grids and tori are given in Table 1. The constants are smaller for tori than for grids, and they can be reduced somewhat if we are clever in assigning columns to processors, but only by a modest constant factor; so we will stick with these estimates, and fix our attention on 2D grids.

Consider separate messages first. If N >> P, almost every column must be sent to all P processors. Every element of the matrix (there are N^2/2 of them; the columns have (1/2)N words on average) therefore leaves its source processor about P times and travels (2/3)√P links on average, so the total flux is roughly (1/3)N^2 P √P word-links. The machine has |L| = 2P links and total bandwidth 2P/β, so the flux per link is about (1/6)N^2 √P words and the communication time is at least (1/6)N^2 √P β seconds. (If N is not much greater than P this overestimates the number of destinations, but only by a constant factor.) With a spanning tree multicast the data of column k cross each link of its tree at most once; the tree has about P links and uses most of the links of the machine, so the total flux drops to roughly (1/2)N^2 P word-links, the flux per link to (1/4)N^2 words, and the bound to (1/4)N^2 β seconds.

The arrivals bound does not depend on the communication scheme. Although each processor holds only O(N^2/P) words of the matrix, almost the whole factor, about (1/2)N^2 words, must arrive at every processor; the arrivals bound is therefore (1/2)N^2 β0 seconds.

For the bisection-width bound, consider the vertical midline of the machine, which cuts √P links and separates the processors into two halves. We may approximate the flux across this line by assuming that every column crosses it. With individual messages, a column crosses about (1/2)P times, once for each destination in the far half, so the total flux across the cut is about (1/4)N^2 P words; spread over the √P cut links this is (1/4)N^2 √P words per link, and the bound is (1/4)N^2 √P β seconds. With a spanning tree multicast each column's data cross each cut link at most once, the tree intersects the cut in about √P links, and the resulting bound, roughly N^2 β / (2√P) seconds, is weaker than the flux-per-link bound; with multicast the bisection width plays essentially no role.

We summarize these bounds, for 2D grids, in Table 2.

    Type of Bound       Lower bound          Communication Scheme
    Arrivals            (1/2) N^2 β0         either scheme
    Flux per link       (1/4) N^2 β          tree multicast
    Flux per link       (1/6) N^2 √P β       separate messages
    Bisection width     (1/2) N^2 β / √P     tree multicast
    Bisection width     (1/4) N^2 √P β       separate messages

TABLE 2. Communication Costs for Column-Mapped Full Cholesky.

FIG. 3. Iso-efficiency lines for dense Cholesky with column cyclic mapping, P = 1,024; separate messages.

FIG. 4. Iso-efficiency lines for dense Cholesky with column cyclic mapping, P = 1,024; tree multicast.

From the critical path, the average work per processor, and the communication bounds of Table 2, we have that the completion time is roughly

    max( N^3 φ / (3P),  N^2 φ,  N^2 β / 4 )

with tree multicast and

    max( N^3 φ / (3P),  N^2 φ,  N^2 √P β / 4 )

with separate messages. Contours of efficiency (in the case P = 1,024) are shown in Figures 3 and 4.

We can immediately conclude that without spanning tree multicast, this is a nonscalable distributed algorithm. We suffer a loss of efficiency as P is increased, with speedup limited to O(√P). Even with spanning tree multicast, we may not take P larger than O(Nφ/β) and still achieve high efficiency. For example, with β = 10φ and P = 1,000, we require N > 12,000 (72,000 matrix elements per processor) in order to achieve 50% efficiency. This is excessive for full problems and will prove to be excessive in the sparse case, too.
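A small calculation makes the point concrete. The sketch below is ours and only indicative: it treats the running time as compute time plus the binding communication bound of Table 2, an additive model with crude constants, so its numbers will not reproduce the worked example above exactly; what it shows is how much larger N must be, at fixed P, when separate messages are used instead of a spanning tree multicast.

    def time_column_mapped(N, P, phi, beta, multicast=True):
        """Crude completion-time model for column-cyclic dense Cholesky (Section 4.1)."""
        compute = max(N**3 * phi / (3 * P), N**2 * phi)          # work/P or critical path
        comm = N**2 * beta / 4 if multicast else N**2 * (P ** 0.5) * beta / 4
        return compute + comm

    def efficiency(N, P, phi, beta, multicast=True):
        ideal = N**3 * phi / (3 * P)                             # perfectly parallel work
        return ideal / time_column_mapped(N, P, phi, beta, multicast)

    def smallest_N_for(target, P, phi, beta, multicast=True):
        N = P
        while efficiency(N, P, phi, beta, multicast) < target:
            N += P                                               # crude search in steps of P
        return N

    if __name__ == "__main__":
        phi, beta, P = 1.0, 10.0, 1000       # beta = 10 phi and P = 1,000, as in the text
        for mc in (True, False):
            N = smallest_N_for(0.5, P, phi, beta, multicast=mc)
            print("tree multicast" if mc else "separate msgs ",
                  "N for 50% efficiency:", N, " elements/processor:", N * N // (2 * P))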

4.2. Mapping blocks. Dongarra, Van de Geijn, and Walker [7] have already shown that on the Intel Touchstone Delta machine (P = 528), mapping blocks is better than mapping columns. In such a mapping, we view the machine as a Pr x Pc grid and we map elements Aij and Lij to processor (mapr(i), mapc(j)). We assume cyclic mappings here: mapr(i) = i mod Pr, and similarly for mapc. In a right-looking method, two portions of column k are needed to update the block A(rows, cols): L(rows, k) and L(cols, k) (rows and cols are integer index vectors here). Again, we may send the data in the form of individual messages from the Pr processors holding the data to those processors that need it, or we may use multicast. The analysis of the preceding section may now be done for this mapping.

The compute time must be at least

    N^2 φ max( N / (3 Pr Pc),  1 / (2 Pr) );

the first term is the operation count N^3/3 divided by P = Pr Pc, and the second comes from the longest path in the task graph, which has N^2/(2 Pr) multiply-adds. The messages travel only along the rows and columns of the processor grid, so the paths p(m) are simple and the flux per link and edge contention are easy to estimate. The resulting bounds, with Pr and Pc both equal to √P, are summarized in Table 3.

    Type of Bound       Lower bound           Comment
    Arrivals            O( N^2 β0 / √P )
    Edge contention     O( N^2 β / √P )       tree multicast
    Edge contention     O( N^2 β )            separate messages

TABLE 3. Communication Costs for Torus-Mapped Full Cholesky.

Note that with this mapping and with tree multicast, the communication time per processor drops like O(P^{-1/2}), and we may take P = O(N^2) with efficiency bounded away from zero, so that storage per processor is O(1); the two-dimensional mapping is scalable for the dense problem, even when β >> φ. (This approach to dense, distributed Cholesky is due to O'Leary and Stewart, who described it in 1985 [21].) Contours of efficiency for P = 1,024 (Pr = Pc = 32) are shown in Figures 5 and 6.
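To illustrate the data movement implied by the two-dimensional cyclic mapping, the following sketch (ours; the helper names are hypothetical) computes the owner of each element under mapr and mapc and, for a given k, the set of processors that need some part of column k.

    def mapr(i, Pr): return i % Pr
    def mapc(j, Pc): return j % Pc

    def owner(i, j, Pr, Pc):
        """Processor grid coordinates holding A[i, j] and L[i, j]."""
        return (mapr(i, Pr), mapc(j, Pc))

    def receivers_of_column(k, n, Pr, Pc):
        """Processors that need some element of column k of L at step k.

        The update of A[i, j] (i >= j > k) needs L[i, k] and L[j, k], so the
        processor (mapr(i), mapc(j)) must receive those two elements."""
        need = set()
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                need.add(owner(i, j, Pr, Pc))
        return need

    if __name__ == "__main__":
        n, Pr, Pc, k = 64, 4, 4, 3
        print("owner of L[10, 3]:", owner(10, 3, Pr, Pc))
        dests = receivers_of_column(k, n, Pr, Pc)
        # Every processor participates here, but each needs only the O(n/Pr) elements
        # of column k that lie in its own processor row or column, not the whole column.
        print(len(dests), "of", Pr * Pc, "processors need part of column", k)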

5. Distributed sparse Cholesky. The interesting questions, of course, concern the sparse problem. The best way to extend the results of the last section would be to do just the same sort of analysis for the sparse factorization; but the structure of the sparse factor, even for the model problem, proved to be dauntingly complicated, and an analytical model was not attempted. Instead, an experiment was done: the fan-in, column-oriented, distributed sparse Cholesky described above was simulated in software, and its communication and computation statistics were collected.

The simulator was written in Matlab 4.0 (and run on a Sun workstation), using the sparse matrix capabilities of Matlab [11]. First, the sparse matrix of the model problem on an Ng x Ng grid was generated and ordered by nested dissection, and the elimination tree and the structure of the factor L were computed. Next, the columns of L were "assigned" to the processors of the mesh by the subtree-to-submesh mapping, which is defined recursively as follows:
1. the top-level separator is mapped cyclically to the whole machine;
2. the left subtree is mapped recursively to the left half-machine;
3. the right subtree is mapped recursively to the right half-machine.
Then the factorization was simulated: for each column k, all of the aggregate update that a processor can compute is computed and sent to processor map(k), where column k of L is completed and stored. Finally, the statistics of the run were collected. The simulator collects
• a vector of operation counts per processor;
• a vector of counts of arriving words per processor;
• the total flux of data in word-links;
• the flux of data (in words) crossing the horizontal and vertical midlines of the machine.
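A minimal sketch of the subtree-to-submachine idea follows, assuming an elimination tree given by its separators; the tree representation and function names are ours and are not those of the Matlab simulator.

    def subtree_to_submachine(tree, procs):
        """Assign each column to a processor.

        tree is a dict: {"sep": [columns of the top separator],
                         "left": subtree or None, "right": subtree or None}.
        procs is the list of processors available to this subtree.
        Separator columns are dealt cyclically over procs; the two subtrees
        are mapped recursively to the two halves of procs."""
        assignment = {}
        for i, col in enumerate(tree["sep"]):
            assignment[col] = procs[i % len(procs)]
        half = max(1, len(procs) // 2)
        if tree["left"]:
            assignment.update(subtree_to_submachine(tree["left"], procs[:half]))
        if tree["right"]:
            assignment.update(subtree_to_submachine(tree["right"], procs[half:] or procs[:half]))
        return assignment

    if __name__ == "__main__":
        # A toy elimination tree: separator {6, 7} on top, two leaf subtrees below it.
        toy = {"sep": [6, 7],
               "left":  {"sep": [0, 1, 2], "left": None, "right": None},
               "right": {"sep": [3, 4, 5], "left": None, "right": None}}
        print(subtree_to_submachine(toy, procs=[0, 1, 2, 3]))
        # Separator columns 6 and 7 are dealt over the whole machine (processors 0 and 1);
        # columns 0-2 are confined to processors {0, 1}, columns 3-5 to processors {2, 3}.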

FIG. 5. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; separate messages.

FIG. 6. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; tree multicast.

FIG. 7. Four lower bounds; Pr = Pc = 8.

FIG. 8. Four lower bounds; Ng = 31.

FIG. 9. Scaled communication and load balance with Ng = (1/2)P.

Figure 7 shows the computational load on the processors (Ops per Proc.), the bisection width bound, the maximum number of words arriving at any processor, and the average flux of words per machine link as a function of the grid size Ng with Pr = Pc = 8; there are three data points on each curve, for grids of size Ng = 15, 31, and 63. The slope of the Ops per Processor curve is greater than that of the communication curves, as expected, and when Ng >> P efficiency will be good. Figure 8 shows the behavior of these four metrics as P increases and Ng is fixed at 31. Now the operations per processor curve drops as 1/P, the communication curves do not, and efficiency is very poor when P is not much smaller than Ng.

The results for the dense case lead us to suspect that efficiency will be roughly constant if the ratio Ng/P is fixed. Figures 9 and 10 show two measures of efficiency over a range of values of Ng and P, with the ratio fixed at one half and at two. These curves are nearly flat, which confirms the main result of this work: one must scale the number of gridpoints as the square of the number of processors in order to have efficiency bounded above zero as P is increased. Thus, the method is not scalable by our earlier definition.

Recently, Thinking Machines Corporation has introduced a highly parallel machine with a "fat-tree" interconnect scheme. A fat tree is a binary tree of nodes. Leaves are processors and internal nodes are switches. The link bandwidth increases geometrically with increasing distance from the leaves. We simulated column-mapped sparse Cholesky for a fat tree with bandwidth that doubles at each tree level. Columns were mapped in a subtree-to-subtree manner:

FIG. 10. Scaled communication and load balance with Ng = 2P.

FIG. 11. Scaled communication and load balance for fat trees, with Ng ∝ P.

1. the top-level separator was mapped cyclically to the whole machine;
2. the left subtree was mapped recursively to the left half-machine;
3. the right subtree was mapped recursively to the right half-machine.
The same set of statistics was collected; they are shown in Figure 11. Clearly, our conclusions hold for fat trees as well as meshes. Perhaps this is surprising, since average interprocessor distance is now O(log P) and the bisection bandwidth of the machine is O(P) instead of O(√P). This is additional evidence that column-mapped methods are not scalable for highly parallel machines.

6. Further work. This work should be extended in several ways.

• Experimental performance data should be taken from actual distributed dense and sparse Cholesky and compared with our predictions.
• Variants that map the sparse matrix data in some form of two-dimensional cyclic map, as has been suggested by Gilbert and the author [10], by Kratzer [15], and by Venugopal and Naik [29], should also be scrutinized experimentally.
• The whole Cholesky factorization can be viewed as a DAG whose nodes are arithmetic operations and edges are values. (An n-input SUM operator should be used so as not to predefine the order of updates to an element.) Let us call this the computation DAG. The ultimate problem is to assign all the nodes to processors in such a way that the completion time is minimized. The computation DAG is quite large. Methods that work with an uncompressed representation of this DAG suffer from excessive storage costs. (This idea is quite like the very old one of generating straight-line code for sparse Cholesky, in which the size of the program is proportional to the number of flops, and hence is larger than the matrix and its factor.) Of course, Cholesky DAGs have underlying regularity that allows for compressed representations. One such representation is the structure of L. Others, smaller still, have been derived from the supernodal structure of L and are usually only as large as a constant multiple of the size of A. All approaches to the problem to date have employed an assignment of computation to processors that is derived from the structure of L rather than from the computation DAG. None has succeeded. It is not known, however, if this failure is due to a poor choice of assignment, or alternatively if any assignment based only on the structure of L must in some way fail, or indeed whether there is any assignment for sparse Cholesky computation DAGs that will succeed. These issues require some investigation.
• In these proceedings, Ashcraft proposes a new class of column-oriented methods in which the assignment of work to processors differs from the assignment used in the algorithms we have investigated [5]. His approach may make for a substantial reduction in the flux per link and bisection width requirements of the method, and so it should be investigated further. We note, however, that it will not reduce the length of the critical path, since it is based on the same task graph as all column-oriented methods.
• It appears that the scalable implementation of iterative methods is much easier than it is for sparse Cholesky. Indeed, even naive distributed implementation of attractive iterative methods is quite efficient.

is quite

pings

of gridpoints

Total

flux is kept

subgrids and

processor.

Fichtner

[24], and

grids.

at some

When all that

gridpoints been

Recent

locality)

good

example

been

provided

impediment

are even

more

subspace

suitable

Bjcrstad

and

decomposition found

this

can

for irregular

the

can

number have

of also

be viewed

as

to take advantage

of

environment. methods

that

reside

be annoying;

that

(which

designed

difference

that

even

preconditioners

methods

[6], who

of finite

that

may

at worst,

parallel

methods

domain

Skogen

solution

products

compact

Annaratone,

it clear

in the distributed-memory

of paxallel

to the efficient

dot is,

fully

gridpoints

cost,

map-

products.

by mapping

[25] makes

are used, tolerable

simple

of matrix-vector count

preprocessing

Useful,

grid,

[12], Pommerell,

Wang

decomposition

Krylov

of the power by

and

them

a regular

of the grid connect

of Hammond

methods

domain

the class of preconditioned

operation

Simon,

to make

Finally,

calculation

but supportable

subspace

we require

fast

work

Pothen,

with

of the edges

grow like P log P, not P_.

developed.

spatial

allows

fraction

noticeable

Krylov

For example,

so that most

same

be done, but

to processors to a small

to processors,

on the

efficient.

P

=

equations

A

has recently was

no

with

16,384

Ng equal

to

direct

solvers

only 640. We conclude be made machines.

by admitting

competitive

that

it is not yet clear

at all for highly

(P > 256)

whether and

sparse

massively

(P > 4096)

can

parallel

REFERENCES

[1] E. ANDERSON, A. BENZONI, J. DONGARRA, S. MOULTON, S. OSTROUCHOV, B. TOURANCHEAU, AND R. VAN DE GEIJN, LAPACK for distributed memory architectures: progress report, in Parallel Processing for Scientific Computing, SIAM, 1992.
[2] C. ASHCRAFT, S. C. EISENSTAT, AND J. W. H. LIU, A fan-in algorithm for distributed sparse numerical factorization, SIAM J. Scient. Stat. Comput. 11 (1990), pp. 593-599.
[3] C. ASHCRAFT, S. C. EISENSTAT, J. W. H. LIU, AND A. H. SHERMAN, A comparison of three column-based distributed sparse factorization schemes, Research Report YALEU/DCS/RR-810, Comp. Sci. Dept., Yale Univ., 1990.
[4] C. ASHCRAFT, S. C. EISENSTAT, J. W. H. LIU, B. W. PEYTON, AND A. H. SHERMAN, A compute-ahead fan-in scheme for parallel sparse matrix factorization, in D. Pelletier, editor, Proceedings, Supercomputing Symposium '90, pp. 351-361. Ecole Polytechnique de Montreal, 1990.
[5] C. ASHCRAFT, The fan-both family of column-based distributed Cholesky factorization algorithms, These proceedings.
[6] P. BJØRSTAD AND M. D. SKOGEN, Domain decomposition algorithms of Schwarz type, designed for massively parallel computers, Proceedings of the Fifth International Symposium on Domain Decomposition. SIAM, 1992.
[7] J. DONGARRA, R. VAN DE GEIJN, AND D. WALKER, A look at scalable dense linear algebra libraries, Proceedings, Scalable High Performance Computer Conference, Williamsburg, VA, 1992.
[8] A. GEORGE, J. W. H. LIU, AND E. NG, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Comput. 10 (1989), pp. 287-298.
[9] A. GEORGE, M. T. HEATH, J. W. H. LIU, AND E. NG, Solution of sparse positive definite systems on a hypercube, J. Comput. Appl. Math. 27 (1989), pp. 129-156.
[10] J. R. GILBERT AND R. SCHREIBER, Highly parallel sparse Cholesky factorization, SIAM J. Scient. Stat. Comput., to appear.
[11] J. R. GILBERT, C. MOLER, AND R. SCHREIBER, Sparse matrices in MATLAB: design and implementation, SIAM J. Matrix Anal. Appl. 13 (1992), pp. 333-356.
[12] S. W. HAMMOND, Mapping Unstructured Grid Computations to Massively Parallel Computers, PhD thesis, Dept. of Comp. Sci., Rensselaer Polytechnic Institute, 1992.
[13] S. W. HAMMOND AND R. SCHREIBER, Mapping unstructured grid problems to the Connection Machine, in P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 11-30. MIT Press, 1992.
[14] M. T. HEATH, E. NG, AND B. W. PEYTON, Parallel algorithms for sparse linear systems, SIAM Review 33 (1991), pp. 420-460.
[15] S. G. KRATZER, Massively parallel sparse matrix computations, in P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 178-186. MIT Press, 1992. A more complete version will appear in J. Supercomputing.
[16] C. E. LEISERSON, Fat-trees: universal networks for hardware-efficient supercomputing, IEEE Trans. Comput. C-34 (1985), pp. 892-901.
[17] GUANGYE LI AND THOMAS F. COLEMAN, A parallel triangular solver for a distributed memory multiprocessor, SIAM J. Scient. Stat. Comput. 9 (1988), pp. 485-502.
[18] M. MU AND J. R. RICE, Performance of PDE sparse solvers on hypercubes, in P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 345-370. MIT Press, 1992.
[19] M. MU AND J. R. RICE, A grid based subtree-subcube assignment strategy for solving PDEs on hypercubes, SIAM J. Scient. Stat. Comput. 13 (1992), pp. 826-839.
[20] A. T. OGIELSKI AND W. AIELLO, Sparse matrix algebra on parallel processor arrays, These proceedings.
[21] D. P. O'LEARY AND G. W. STEWART, Data-flow algorithms for parallel matrix computations, Comm. ACM 28 (1985), pp. 840-853.
[22] L. S. OSTROUCHOV, M. T. HEATH, AND C. H. ROMINE, Modelling speedup in parallel sparse matrix factorization, Tech. Report ORNL/TM-11786, Mathematical Sciences Section, Oak Ridge National Lab., December 1990.
[23] D. PATTERSON, Massively parallel computer architecture: observations and ideas on a new theoretical model, Comp. Sci. Dept., Univ. of California at Berkeley, 1992.
[24] C. POMMERELL, M. ANNARATONE, AND W. FICHTNER, A set of new mapping and coloring heuristics for distributed-memory parallel processors, SIAM J. Scient. Stat. Comput. 13 (1992), pp. 194-226.
[25] A. POTHEN, H. D. SIMON, AND L. WANG, Spectral nested dissection, Report CS-92-01, Comp. Sci. Dept., Penn State Univ. Submitted to J. Parallel and Distrib. Comput.
[26] E. ROTHBERG AND A. GUPTA, The performance impact of data reuse in parallel dense Cholesky factorization, Stanford Comp. Sci. Dept. Report STAN-CS-92-1401.
[27] E. ROTHBERG AND A. GUPTA, An efficient block-oriented approach to parallel sparse Cholesky factorization, Stanford Comp. Sci. Dept. Tech. Report, 1992.
[28] Y. SAAD AND M. H. SCHULTZ, Data communication in parallel architectures, Parallel Comput. 11 (1989), pp. 131-150.
[29] S. VENUGOPAL AND V. K. NAIK, Effects of partitioning and scheduling sparse matrix factorization on communication and load balance, Proceedings, Supercomputing '91, pp. 866-875. IEEE Computer Society Press, 1991.
[30] E. ZMIJEWSKI, Limiting communication in parallel sparse Cholesky factorization, Tech. Report TRCS89-18, Dept. of Comp. Sci., Univ. of California, Santa Barbara, CA, 1989.
