Scalability of Sparse Direct Solvers

Robert Schreiber

The Research Institute of Advanced Computer Science is operated by Universities Space Research Association, The American City Building, Suite 311, Columbia, MD 244, (301) 730-2656

Work reported herein was supported in part by the NAS Systems Division of NASA via Cooperative Agreement NCC 2-387 between NASA and the University Space Research Association (USRA). Work was performed at the Research Institute for Advanced Computer Science (RIACS), NASA Ames Research Center, Moffett Field, CA 94035.
SCALABILITY OF SPARSE DIRECT SOLVERS*

ROBERT SCHREIBER†
Abstract. We shall say that a scalable algorithm achieves efficiency that is bounded away from zero as the number of processors and the problem size increase in such a way that the size of the data structures increases linearly with the number of processors. In this paper we show that the column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable. By considering message volume, node contention, and bisection width, one may obtain lower bounds on the time required for communication in a distributed algorithm. Applying this technique to distributed, column-oriented, full Cholesky leads to the conclusion that N (the order of the matrix) must scale with P (the number of processors) so that storage grows like P^2; the algorithm is therefore not scalable. Identical conclusions have previously been obtained by consideration of communication and computation latency on the critical path in the algorithm; these results complement and reinforce that conclusion. For the sparse case, we have experimental measurements that make the same point: for column-oriented distributed methods, the number of gridpoints (which is O(N)) must grow as P^2 in order to maintain parallel efficiency bounded above zero. Our sparse matrix results employ the "fan-in" distributed scheme, implemented on machines with either a grid or a fat-tree interconnect using a subtree-to-submachine mapping of the columns. The alternative of distributing the rows and columns of the matrix to the rows and columns of a grid of processors is shown to be scalable for the dense case; its scalability for the sparse case has been established previously. To date, however, none of these methods has achieved high efficiency on a highly parallel machine. Finally, open problems and other approaches that may be more fruitful are discussed.
Key words. sparse Cholesky factorization, massively parallel computer, distributed memory, scalable parallel algorithms

AMS(MOS) subject classifications: 65F25, 65F50, 68R10.

1. Introduction. The arrival of massively parallel, distributed-memory, MIMD, message-passing supercomputers makes this an opportune time for the community of researchers working on parallel sparse matrix methods to decide whether or not to continue to concentrate on direct methods for the solution of sparse linear systems Ax = b, or to give up in favor of iterative methods. An efficient, scalable, highly parallel sparse Cholesky factorization remains, despite extensive and prolonged search, perhaps undiscovered. Two different lines of attack have been taken up to now.

The first approach is to map the columns of the sparse matrix A and of its Cholesky factor L to processors: column j is held by processor map(j), and the computation is organized as a collection of column tasks built from DAXPY operations. This column-oriented, message-passing-machine-oriented approach has been proposed and investigated by a number of researchers [2, 3, 4, 9, 14, 18, 19, 30]. A second approach is to map the data in two dimensions, assigning the rows and columns of the matrix to the rows and columns of a grid of processors. This way of mapping is favored by Dongarra, Van de Geijn, and Walker [7] and by the distributed-memory LAPACK effort [1] for the dense problem, and it has been used for the sparse problem by Venugopal and Naik [29]. Recently, Gilbert and Schreiber [10] and Kratzer [15] have also proposed and investigated two-dimensional mappings of the sparse problem.

* Written May 1992.
† Research Institute for Advanced Computer Science, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035. This author's work was supported by the NAS Systems Division via Cooperative Agreement NCC 2-387 between NASA and the University Space Research Association (USRA).
[Figure: log-scale plot of microprocessor and supercomputer performance per CPU, 1985 - 1993.]

FIG. 1. Microprocessor and supercomputer performance per CPU.
These methods have been implemented on MIMD message-passing machines; the author has also used this approach successfully for the dense problem on the Maspar MP-1, a massively parallel SIMD machine.

In this paper we investigate the scalability of these classes of methods for distributed sparse Cholesky factorization. By a scalable algorithm for this problem, we mean one that maintains efficiency bounded away from zero as the number P of processors grows and the problem size (in this case the number of gridpoints, or the order of the matrix) grows roughly linearly in P. We concentrate on the model problem arising from the 5-point finite difference stencil on an Ng x Ng grid.

We will show that the column-oriented methods cannot work well when the number of gridpoints (N = Ng^2) grows like O(P) or even O(P log P). We show that communication will make any column-oriented, distributed algorithm useless, no matter what the mapping of columns to processors. This is true because column-oriented distribution is very bad for dense problems of order N when N is not large compared with P. Two improvements seem to be required:
1. A two-dimensional wrap mapping of the dense frontal matrices, at least for those corresponding to fronts near the top of the elimination tree.
2. A "fan-out" submatrix Cholesky algorithm with multicast instead of individual messages.

It is reasonable to ask why one should be concerned with machines having thousands of processors. Figure 1 should illustrate the reasons for believing that supercomputer architecture
is now making an inevitable and probably permanent transition from the modestly parallel (257 - 4,096 processors) to the highly or massively parallel (4,097 - 65,536 processors) regime. The following estimates of the supercomputer architecture we expect later in the decade help motivate the work presented here:

• 1 Gflop processors, built from multiple chips and running at roughly 100 Mhz.
• Physically distributed memory, perhaps with hardware that provides the illusion of shared memory; the memory hierarchy and the latency of nonlocal access will be a constraining resource.
• An interconnect that may be a 2D or 3D grid or torus, or maybe a fat tree, with multiple interprocessor links per processor.
• Communication bandwidth for nonlocal data of roughly 100 Mwds/sec per processor, on the order of 100 bits per cycle (a word here is 12 bytes - 96 bits). Thus, the ratio of computation speed to communication speed will be in the 5 - 50 range, and communication, not computation, will be the scarce resource during a sparse factorization.

This paper builds on the work of previous investigators. Ostrouchov, Heath, and Romine [22] model speedup in parallel sparse factorization; George, Liu, and Ng [8] give communication results for sparse Cholesky on a hypercube; Li and Coleman [17] analyze a dense triangular solver for distributed-memory machines; Saad and Schultz [28] analyze data communication in parallel architectures; and the fat-tree paper of Leiserson [16] prefigures some of the accounting used here. Rothberg and Gupta have recently made an interesting analysis of the effect of data reuse in dense Cholesky [26], and their block-oriented sparse method [27] comes to conclusions similar to ours. Patterson [23] provides observations on massively parallel computer architecture that support the machine estimates above.

The remainder of the paper is organized as follows. In Section 2 we introduce the problem and the column-mapped distributed Cholesky factorization; Section 3 develops lower bounds on communication time; in Section 4 we compute these bounds for the dense case with column mapping; Section 5 extends the study to the sparse case; and in Section 6 we consider the problems that are still unresolved.

2. Distributed sparse Cholesky. Cholesky factorization may be understood as the following program:

    cholesky(a, n)
        for k = 1 to n do
            cdiv(k);
            for j = k + 1 to n do
                cmod(j, k);
            od
        od
Procedure cdiv(k) computes the square root of the diagonal element Akk and scales the kth column of A by 1/sqrt(Akk) to produce the kth column L.k of the factor; procedure cmod(j, k) subtracts Ljk times the kth factor column from the jth column. The execution order of this program is not the only one possible. The true dependences require only that cmod(j, k) must follow cdiv(k) and that cdiv(k) must follow all the cmod(k, l) for l < k and Lkl != 0. A second form of Cholesky is this:

    cholesky(a, n)
        for k = 1 to n do
            for l = 1 to k - 1 do
                cmod(k, l);
            od
            cdiv(k);
        od
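Both forms compute the same factor. The following runnable Python sketch of the dense case (an illustration only; the paper's pseudocode is language-neutral) implements cdiv and cmod and both orderings.

    import numpy as np

    def cdiv(L, k):
        """Take the square root of the diagonal and scale column k by it."""
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]

    def cmod(L, j, k):
        """Subtract L[j,k] times (the lower part of) column k from column j."""
        L[j:, j] -= L[j, k] * L[j:, k]

    def cholesky_right_looking(A):      # "submatrix" form: cdiv(k), then cmod(j, k) for j > k
        L, n = np.tril(A).astype(float), A.shape[0]
        for k in range(n):
            cdiv(L, k)
            for j in range(k + 1, n):
                cmod(L, j, k)
        return L

    def cholesky_left_looking(A):       # "column" form: cmod(k, l) for l < k, then cdiv(k)
        L, n = np.tril(A).astype(float), A.shape[0]
        for k in range(n):
            for l in range(k):
                cmod(L, k, l)
            cdiv(L, k)
        return L

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        B = rng.standard_normal((6, 6))
        A = B @ B.T + 6 * np.eye(6)     # symmetric positive definite test matrix
        L1, L2 = cholesky_right_looking(A), cholesky_left_looking(A)
        assert np.allclose(L1, L2) and np.allclose(L1 @ L1.T, A)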
The first form is sometimes called "submatrix" Cholesky and sometimes a "right-looking" method. The second form goes by the names "column" or "left-looking". In the sparse case, sparsity is exploited within the vector operations cdiv and cmod. Furthermore, most cmod operations are omitted altogether because the multiplying scalar
Ljk is zero.

In the column-oriented distributed methods, column j of A and of the factor L is stored at the processor map(j) to which it is assigned; cdiv(k) is performed by processor map(k). The cmod(j, k) updates destined for column j may be sent to map(j) individually, or they may instead be accumulated at the sending processor and communicated as a single aggregate update, as in the fan-in method shown in Figure 2. (As befits an MIMD machine, each processor runs its own copy of the code, restricted to the columns in its own set mycols.) For a scalable method the running time should be O(Ops/P), where Ops is the sequential operation count, which is O(N^1.5) for the model problem, whose factor has O(N log N) nonzeros; the data per processor should then grow like O(N log N / P). Whether the column-oriented methods can achieve this is the question taken up in the following sections.
    fan-in(a, n, L, map)
        integer n, map[n];
        real a[n, n], L[n, n];
        mycols = { j | map[j] = myname };
        for j = 1 to n do
            if ( row[j, myname] != {} || j in mycols ) then
                t = 0;
                for k in row[j, myname] do
                    t = t + Ljk * (Ljk, ..., Lnk)^T;
                od
                if ( j not in mycols ) then
                    Send aggregate update column t to processor map[j]
                else
                    L.j = (Ajj, ..., Anj)^T - t;
                    while not all aggregate updates u[j, pi] for column j have been received do
                        Receive an aggregate update column u[j, pi];
                        L.j = L.j - u[j, pi];
                    od
                    L.j = L.j / sqrt(Ljj);
                fi
            fi
        od

FIG. 2. Fan-in distributed, column-oriented Cholesky. (Here row[j, myname] denotes the set of columns k in mycols, k < j, with Ljk != 0.)
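A small sequential simulation of this scheme, written in Python for the dense case (the sparse case simply omits zero multipliers), may make the aggregation clearer. The processor count, the cyclic column map, and the crude word counts below are illustrative assumptions, not part of the algorithm of Figure 2.

    import numpy as np

    def fan_in_cholesky(A, P):
        """Simulate fan-in column Cholesky with a cyclic column map.

        Returns the factor L and, for each simulated processor, the number of
        words it sends as aggregated update columns.
        """
        n = A.shape[0]
        L = np.tril(A).astype(float)
        owner = lambda j: j % P                      # cyclic column map
        words_sent = [0] * P

        for j in range(n):
            # Each processor aggregates the updates it can compute locally
            # from the already-factored columns k < j that it owns.
            aggregates = {}
            for p in range(P):
                t = np.zeros(n - j)
                contributes = False
                for k in range(j):
                    if owner(k) == p and L[j, k] != 0.0:
                        t += L[j, k] * L[j:, k]
                        contributes = True
                if contributes:
                    aggregates[p] = t
                    if p != owner(j):                # one aggregated message per processor
                        words_sent[p] += n - j
            # The owner of column j applies all aggregates, then does cdiv(j).
            col = L[j:, j] - sum(aggregates.values(), np.zeros(n - j))
            col[0] = np.sqrt(col[0])
            col[1:] /= col[0]
            L[j:, j] = col

        return L, words_sent

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        B = rng.standard_normal((8, 8))
        A = B @ B.T + 8 * np.eye(8)                  # symmetric positive definite test matrix
        L, sent = fan_in_cholesky(A, P=4)
        print("max factorization error:", np.max(np.abs(L @ L.T - A)))
        print("words sent per processor:", sent)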
3. Methodology. In order to assess the communication costs of distributed Cholesky, we have found it useful to employ certain abstract lower bounds. Our approach to communication lower bounds is neither new nor, possibly, the best available; it is a straightforward accounting of the data flux through the machine.

Our model assumes that the communication hardware of a distributed-memory machine is given as an undirected graph G = (W, L). Let V be the subset of W consisting of all processors and let L be the set of all communication links; we assume that memories are situated at some, possibly all, of the vertices of the graph. The model includes grid-structured and hypercube message-passing machines as well as tree-structured machines like a CM-5, and it applies to shared-memory machines having physically distributed memories with nonuniform access costs (the Tera machine, for example), in which case the processor-memory channels are the links. We assume that all processors are identical and that all links are identical, and that local memory is at most a few links deep, so that accounting for traffic between a processor and its own memory is not needed. (We ignore communication start-up costs in this model.)

The units of communication are the words of data sent as messages. Let beta be the inverse bandwidth (slowness) of a link, in seconds per word. Let beta0 be the inverse rate at which a processor can send or receive data, in seconds per word. Let phi be the time for one floating-point operation, in seconds. We expect that beta0 and beta will be roughly the same on the machines considered here.
A distributed-memory machine performs a distributed computation by sending information between processors as messages. Let M be the set of all messages of the computation. Each message m in M has a source processor src(m) and a destination processor dest(m); |m| denotes the number of words communicated, that is, the size of m. Let p(m) = (l1, l2, ..., ld(m)) be the path taken by message m through the links of the machine from its source to its destination processor, and let d(m) be the length of this path; we assume that each message takes a shortest path. For a link l in L, let M(l) = { m in M | l in p(m) } be the set of messages whose paths utilize l. For disjoint sets V0, V1 contained in W, define the separator

    sep(V0, V1) = min { |L'| : L' contained in L is an edge separator of V0 and V1 }

and the flux

    flux(V0, V1) = sum of |m| over all messages m exchanged between V0 and V1.

The completion time of the computation is bounded below by each of the following four easily computable quantities, which depend only on the message set M and, for the first and last, on the paths p(M).

1. (Flux per link)

       ( sum over m in M of |m| d(m) ) / |L| * beta.

2. (Bisection width) For any disjoint V0, V1 contained in W,

       flux(V0, V1) / sep(V0, V1) * beta.

3. (Arrivals/Departures (also known as node congestion))

       max over v in V of ( sum over m with dest(m) = v of |m| ) * beta0,  and
       max over v in V of ( sum over m with src(m) = v of |m| ) * beta0.

4. (Edge contention)

       max over l in L of ( sum over m in M(l) of |m| ) * beta.
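As an illustration, the following Python sketch evaluates these four bounds for a given message set on a small 2D torus. The torus construction, the routing (one shortest path per message, found with NetworkX), the parameter values, and the toy message set are assumptions made for the example only.

    import collections
    import networkx as nx

    def torus(side):
        """side x side torus; every vertex is a processor."""
        G = nx.Graph()
        for x in range(side):
            for y in range(side):
                G.add_edge((x, y), ((x + 1) % side, y))
                G.add_edge((x, y), (x, (y + 1) % side))
        return G

    def lower_bounds(G, messages, beta, beta0, bisection=None):
        """messages: list of (src, dest, words). Returns the four bounds in seconds."""
        flux = 0.0                                  # total word-links
        per_link = collections.Counter()            # words routed over each link
        arrive, depart = collections.Counter(), collections.Counter()
        for s, d, w in messages:
            path = nx.shortest_path(G, s, d)        # one shortest path per message
            flux += w * (len(path) - 1)
            for u, v in zip(path, path[1:]):
                per_link[frozenset((u, v))] += w
            arrive[d] += w
            depart[s] += w
        bounds = {
            "flux per link": flux / G.number_of_edges() * beta,
            "arrivals/departures": max(max(arrive.values()), max(depart.values())) * beta0,
            "edge contention": max(per_link.values()) * beta,
        }
        if bisection is not None:                   # (V0, cut size) supplied by the caller
            V0, cut = bisection
            cross = sum(w for s, d, w in messages if (s in V0) != (d in V0))
            bounds["bisection width"] = cross / cut * beta
        return bounds

    if __name__ == "__main__":
        side = 4
        G = torus(side)
        # toy message set: every processor sends 100 words to processor (0, 0)
        msgs = [((x, y), (0, 0), 100)
                for x in range(side) for y in range(side) if (x, y) != (0, 0)]
        left = {(x, y) for x in range(side // 2) for y in range(side)}
        print(lower_bounds(G, msgs, beta=1e-7, beta0=1e-7, bisection=(left, 2 * side)))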
Of course, the actual communication time may be greater than any of these lower bounds. In particular, the communication resources of the machine (the wires) have to be scheduled, either dynamically or, when the requirements are known in advance, statically. With detailed (i.e. time-dependent) knowledge of the paths and of the schedule, better, integrated bounds can be obtained; we consider that approach to be unnecessarily cumbersome. We prefer bounds that can be computed from knowledge only of the set of messages M, the assignment of tasks to processors, and, when it is determined statically, the paths p(M). For the purposes of this paper we have found the four bounds above adequate; in the sparse experiments of Section 5 no use is made of the edge-contention bound.

4. Dense Cholesky. We consider first the communication requirements of dense Cholesky factorization of a matrix of order N. Since a substantial fraction of the work in a sparse factorization is spent in dense Cholesky of the frontal matrices, efficiency in the dense problem is a sine qua non for a scalable sparse algorithm.
4.1. Mapping columns. Assume that the N columns are mapped to the P processors cyclically: column j is stored in processor map(j) = j mod P. We first examine the operation count and the scheduling of the tasks. Dense Cholesky of a matrix of order N performs N^3/3 multiply-adds, and any one column is involved in at most O(N^2) of them. The critical path of the task DAG, cdiv(1), cmod(2, 1), cdiv(2), cmod(3, 2), ..., has length O(N) and carries O(N^2) operations, no matter how the columns are mapped. By making a column the atomic unit of computation we have therefore limited the parallelism: execution time must be at least

    max( N^3 phi / (3P), N^2 phi ),

where the first term comes from the operation count and the second from the critical path. Thus at most O(N) processors can be used efficiently.

Next, consider communication costs. (It is not necessary for our conclusions, but it simplifies things, to fix attention on two-dimensional grid or toroidal machines; suppose that the machine is a sqrt(P) x sqrt(P) grid.) Consider a mapping of the computation in which the operation cmod(j, k) is performed by processor map(j) (a fan-out method). After performing the operation cdiv(k), processor map(k) must send column k to all processors { map(j) | j > k }.
             Grid             Torus
    2D       (2/3) sqrt(P)    (1/2) sqrt(P)
    3D       P^(1/3)          (3/4) P^(1/3)

    TABLE 1
    Average distance between two randomly chosen processors.
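The averages in Table 1 are easy to check numerically; the following sketch (an illustration only, not part of the paper's analysis) enumerates all processor pairs on a small 2D grid and torus.

    import itertools

    def avg_distance(p, torus=False):
        """Average routing distance between two distinct nodes of a p x p grid or torus."""
        def dist1(a, b):
            d = abs(a - b)
            return min(d, p - d) if torus else d
        nodes = list(itertools.product(range(p), repeat=2))
        pairs = [(u, v) for u, v in itertools.product(nodes, nodes) if u != v]
        return sum(dist1(u[0], v[0]) + dist1(u[1], v[1]) for u, v in pairs) / len(pairs)

    if __name__ == "__main__":
        for p in (8, 16, 32):
            # ratios approach 2/3 (grid) and 1/2 (torus) as p grows, as in Table 1
            print(p, avg_distance(p) / p, avg_distance(p, torus=True) / p)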
Two possibilities present themselves: the destinations of column k may be reached by separate messages, sent sequentially by processor map(k) to each destination, or column k may be multicast through a spanning tree whose root is map(k).

To compute the flux bounds we need to know the distances that the data travel. Let us assume that the positions of the roots map(k) and of the destinations are essentially random, so that we may use the average distance between two randomly chosen processors (Table 1). In 2D this is (2/3)sqrt(P) for a grid and (1/2)sqrt(P) for a torus; in 3D the constants change only by a modest factor. So we will stick with 2D machines and, to keep the constants simple, take the average distance to be (1/2)sqrt(P).

With separate messages, essentially the whole factor, (1/2)N^2 words, must be delivered to roughly every processor: each column is sent to about P destinations, so the total message volume is about (1/2)N^2 P words and, with an average distance of (1/2)sqrt(P), the total flux is about (1/4)N^2 P sqrt(P) word-links. A 2D toroidal machine has |L| = 2P links, so the flux-per-link bound is roughly (1/8)N^2 sqrt(P) beta seconds. With spanning tree multicast, each column traverses the links of its spanning tree only once; the total flux drops to about (1/2)N^2 P word-links and the flux-per-link bound drops to roughly (1/4)N^2 beta seconds.

The arrivals bound behaves differently. If N >> P, almost all of the factor must arrive at every processor, so the arrivals bound is roughly (1/2)N^2 beta0 seconds, and multicast does not reduce it. (If N is comparable to P the constant drops, since a processor holding only late columns needs only part of the earlier ones.)

For the bisection bound, consider the cut at the vertical midline of the machine, which separates the processors into halves and consists of roughly sqrt(P) links. Since the maps are essentially random, we may approximate the flux across the line by assuming that every column crosses it. With individual messages, each column crosses about (1/2)P times, for a total flux of crossings of (1/4)N^2 P words and a bound of roughly (1/4)N^2 sqrt(P) beta seconds. Even with multicast, each spanning tree must intersect the cut, in at least one and at most about (1/2)sqrt(P) of its edges; if we are clever, each column crosses only a small constant number of times, for a bound of roughly (1/2)N^2 beta / sqrt(P) seconds. (The shape of the spanning trees plays a role here, so a realistic constant would be somewhat larger.) Thus, with multicast, the bisection bound is weaker than the flux-per-link bound. We summarize these bounds in Table 2.
    Type of bound        Lower bound                 Communication scheme
    Arrivals             (1/2) N^2 beta0             either
    Flux per link        (1/4) N^2 beta              tree multicast
    Flux per link        (1/8) N^2 sqrt(P) beta      separate messages
    Bisection width      (1/2) N^2 beta / sqrt(P)    tree multicast
    Bisection width      (1/4) N^2 sqrt(P) beta      separate messages

    TABLE 2
    Communication costs for column-mapped full Cholesky.
[Figure: efficiency contours versus n/P; no broadcast, P = 1,024, column mapped.]

FIG. 3. Iso-efficiency lines for dense Cholesky with column cyclic mapping; separate messages.
[Figure: efficiency contours versus n/P; broadcast, column mapped.]

FIG. 4. Iso-efficiency lines for dense Cholesky with column cyclic mapping, P = 1,024; tree multicast.
From the critical path, the average work per processor, and the bisection width bounds, we have that the completion time is roughly

    max( N^3 phi / (3P), N^2 phi, (1/4) N^2 beta )            with tree multicast, and
    max( N^3 phi / (3P), N^2 phi, (1/4) N^2 sqrt(P) beta )    with separate messages.

Contours of efficiency (in the case P = 1,024) are shown in Figures 3 and 4.

We can immediately conclude that without spanning tree multicast, this is a nonscalable distributed algorithm. We suffer a loss of efficiency as P is increased, with speedup limited to O(sqrt(P)). Even with spanning tree multicast, we may not take P too large relative to N and still achieve high efficiency. For example, with beta = 10 phi and P = 1,000, we require N > 12,000 (72,000 matrix elements per processor) in order to achieve 50% efficiency. This is excessive for full problems and will prove to be excessive in the sparse case, too.
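The iso-efficiency contours of Figures 3 and 4 can be reproduced from this time model. The Python sketch below does so under the bound expressions as reconstructed above; the values of phi and beta and the numerical constants are illustrative assumptions, not measurements.

    def column_cholesky_time(N, P, phi, beta, multicast=True):
        """Completion-time estimate for column-cyclic dense Cholesky (reconstructed model)."""
        work = N**3 * phi / (3.0 * P)           # average arithmetic per processor
        critical_path = N**2 * phi              # the chain cdiv(1), cmod(2,1), cdiv(2), ...
        if multicast:
            comm = 0.25 * N**2 * beta           # flux-per-link bound, spanning tree multicast
        else:
            comm = 0.25 * N**2 * P**0.5 * beta  # bisection bound, separate messages
        return max(work, critical_path, comm)

    def efficiency(N, P, phi, beta, multicast=True):
        t_seq = N**3 * phi / 3.0
        return t_seq / (P * column_cholesky_time(N, P, phi, beta, multicast))

    if __name__ == "__main__":
        phi = 1.0e-9                            # illustrative processor speed
        beta = 10 * phi                         # illustrative link slowness, beta = 10 phi
        for P in (64, 1024, 16384):
            for n_over_P in (1, 4, 10):         # the range plotted in Figures 3 and 4
                N = n_over_P * P
                print(f"P={P:6d} N/P={n_over_P:2d} "
                      f"multicast={efficiency(N, P, phi, beta, True):.2f} "
                      f"separate={efficiency(N, P, phi, beta, False):.2f}")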
4.2. Mapping blocks. Dongarra, Van de Geijn, and Walker [7] have already shown that on the Intel Touchstone Delta machine (P = 528), mapping blocks is better than mapping columns for dense matrices. In such a mapping, we view the machine as a Pr x Pc grid of processors and we map the elements Aij and Lij to processor (mapr(i), mapc(j)). We assume cyclic mappings here: mapr(i) = i mod Pr, and similarly for mapc. In a right-looking method, two portions of column k are needed to update the block A(rows, cols): L(rows, k) and L(cols, k) (rows and cols are integer index vectors here). Again, we may send the data in the form of individual messages from the Pr processors holding the data to those processors that need it, or we may use multicast.

The analysis of the preceding section may now be done for this mapping. Now the compute
time must be at least

    max( N^3 phi / (3P), N^2 phi / (2 Pr), N^2 phi / (2 Pc) ),

since the longest path in the task graph now carries about N^2/(2 Pr) multiplies and N^2/(2 Pc) multiply-adds. To compute the communication bounds we again need information about the paths p(m) taken by the messages; assuming multicast or separate messages along processor rows and columns, the bounds (with Pr = Pc = sqrt(P)) are summarized in Table 3.

    Type of bound       Lower bound                Communication scheme
    Arrivals            O( N^2 beta0 / sqrt(P) )   either
    Edge contention     O( N^2 beta / sqrt(P) )    tree multicast
    Edge contention     O( N^2 beta )              separate messages

    TABLE 3
    Communication costs for torus-mapped full Cholesky.

With Pr and Pc both O(sqrt(P)), this mapping, used with multicast along processor rows and columns, is scalable and efficient: communication per processor and per link is O(N^2 / sqrt(P)) words while computation per processor is O(N^3 / P) operations, so the relative cost of communication drops like O(P^{-1/2}) when the matrix order grows like P. Note that we may take P = O(N^2), so that storage per processor is O(1) and the data structures grow linearly with the machine, as our definition of scalability requires. (When beta >> phi the constant in the efficiency is reduced, but the algorithm remains scalable; it is essentially the data-flow algorithm of O'Leary and Stewart [21].) Contours of efficiency for P = 1,024, with Pr = Pc = 32, are shown in Figures 5 and 6.
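To make the data movement of the block mapping concrete, the following sketch lists, for one step k of a right-looking update, which processors hold column k and which need a piece of it. The choice Pr = Pc = 4 and the cyclic maps are illustrative assumptions; the sketch shows the mapping only, not the full factorization.

    def mapr(i, Pr):            # cyclic row map, as in Section 4.2
        return i % Pr

    def mapc(j, Pc):            # cyclic column map
        return j % Pc

    def destinations_for_column(k, n, Pr, Pc):
        """Processors that need parts of column k of L in a right-looking update.

        The block element A(i, j), i, j > k, is updated by L(i, k) * L(j, k) and lives
        on processor (mapr(i), mapc(j)).  Hence L(i, k) must reach every processor
        column and L(j, k) must reach every processor row: two multicasts per step.
        """
        owners = {(mapr(i, Pr), mapc(k, Pc)) for i in range(k + 1, n)}
        need_row_piece = {(mapr(i, Pr), pc) for i in range(k + 1, n) for pc in range(Pc)}
        need_col_piece = {(pr, mapc(j, Pc)) for j in range(k + 1, n) for pr in range(Pr)}
        return owners, need_row_piece | need_col_piece

    if __name__ == "__main__":
        owners, dests = destinations_for_column(k=3, n=64, Pr=4, Pc=4)
        print(len(owners), "processors hold column 3;", len(dests), "of 16 need a piece of it")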
5. Distributed sparse Cholesky. The interesting questions, of course, are about the sparse problem. The best way to extend the results of the last section would be an analytical model of the communication in sparse factorization; but even for the model problem, the structure of the sparse factor makes an analytical approach of the kind used for the full case dauntingly complicated. Instead, we have done an experiment in which the communication requirements are measured by simulation.

The experiment was as follows. For an Ng x Ng model grid problem, the sparse matrix A was generated, ordered by nested dissection, and the elimination tree and the nonzero structure of the Cholesky factor L were computed, all in Matlab 4.0, which has extensive sparse matrix software [11]. The columns of L were then assigned to the processors of a mesh machine by the subtree-to-submesh mapping, which is defined recursively:
  1. the columns of the top-level separator are mapped cyclically to the whole machine;
  2. the left subtree of the elimination tree is mapped recursively to the left half-machine;
  3. the right subtree is mapped recursively to the right half-machine.
Finally, the fan-in distributed Cholesky of Figure 2 was simulated (the simulation software, also written in Matlab, runs on a Sun workstation) in order to determine, for each column k, which processors compute aggregate updates and which messages are needed by processor map(k). The simulation collects the following statistics:
  • a vector of operation counts per processor;
  • a vector of counts of arriving words per processor;
  • the total flux of data in word-links;
  • the flux of data (in words) crossing the horizontal and vertical midlines of the machine.
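A minimal sketch of the mapping step of this experiment is given below. It builds the 5-point model problem with SciPy and uses a simple hand-rolled recursive dissection of the grid in place of Matlab's ordering and elimination-tree machinery, so it should be read as an illustration of the subtree-to-submachine idea rather than a reproduction of the simulation.

    import numpy as np
    from scipy.sparse import diags, identity, kron

    def model_problem(ng):
        """5-point Laplacian of an ng x ng grid (the model problem of the paper)."""
        T = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(ng, ng))
        return kron(identity(ng), T) + kron(T, identity(ng))

    def subtree_to_submachine(gridpoints, procs, colmap):
        """Recursive subtree-to-submachine mapping (an illustration, not the paper's code).

        gridpoints: list of (x, y) points in the current subdomain
        procs:      list of processor ids forming the current submachine
        colmap:     dict, filled in with gridpoint -> processor
        """
        if len(procs) == 1 or len(gridpoints) <= 1:
            for g in gridpoints:
                colmap[g] = procs[0]
            return
        xs = sorted({g[0] for g in gridpoints})
        ys = sorted({g[1] for g in gridpoints})
        axis = 0 if len(xs) >= len(ys) else 1        # cut along the longer dimension
        coords = xs if axis == 0 else ys
        mid = coords[len(coords) // 2]
        separator = [g for g in gridpoints if g[axis] == mid]
        left = [g for g in gridpoints if g[axis] < mid]
        right = [g for g in gridpoints if g[axis] > mid]
        # 1. separator columns are mapped cyclically to the whole submachine
        for i, g in enumerate(separator):
            colmap[g] = procs[i % len(procs)]
        # 2./3. the two subtrees are mapped recursively to the two half-machines
        half = len(procs) // 2
        subtree_to_submachine(left, procs[:half], colmap)
        subtree_to_submachine(right, procs[half:], colmap)

    if __name__ == "__main__":
        ng, P = 15, 16
        A = model_problem(ng)
        points = [(x, y) for x in range(ng) for y in range(ng)]
        colmap = {}
        subtree_to_submachine(points, list(range(P)), colmap)
        counts = np.bincount(list(colmap.values()), minlength=P)
        print("matrix order:", A.shape[0], " gridpoints per processor:", counts)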
FIG. 5. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; separate messages.
FIG. 6. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; tree multicast.
FIG. 7. Four lower bounds; Pr = Pc = 8.
FIG. 8. Four lower bounds; Ng = 31.
FIG. 9. Scaled communication and load balance with Ng = (1/2)P.
Figure 7 shows the computational load on the processors (Ops per Proc.), the bisection width bound, the maximum number of words arriving at any processor, and the average flux of words per machine link as a function of the grid size Ng with Pr = Pc = 8; there are three data points on each curve, for grids of size Ng = 15, 31, and 63. The slope of the Ops per Processor curve is greater than that of the communication curves, as expected, and when Ng >> P efficiency will be good. Figure 8 shows the behavior of these four metrics as P increases and Ng is fixed at 31. Now the operations per processor curve drops as 1/P, the communication curves do not, and efficiency is very poor when P is not much smaller than Ng.

The results for the dense case lead us to suspect that efficiency will be roughly constant if the ratio Ng/P is fixed. Figures 9 and 10 show two measures of efficiency over a range of values of Ng and P, with the ratio fixed at one half and at two. These curves are nearly flat, which confirms the main result of this work: one must scale the number of gridpoints as the square of the number of processors in order to have efficiency bounded above zero as P is increased. Thus, the method is not scalable by our earlier definition.
Recently, Thinking Machines Corporation has introduced a highly parallel machine with a "fat-tree" interconnect scheme. A fat tree is a binary tree of nodes; leaves are processors and internal nodes are switches. The link bandwidth increases geometrically with increasing distance from the leaves. We simulated column-mapped sparse Cholesky for a fat tree with bandwidth that doubles at each tree level. Columns were mapped in a subtree-to-subtree manner:
FIG. 10. Scaled communication and load balance with Ng = 2P.
FIG. 11. Scaled communication and load balance for fat trees, with Ng proportional to P.
1. the top-level separator was mapped cyclically to the whole machine;
2. the left subtree was mapped recursively to the left half-machine;
3. the right subtree was mapped recursively to the right half-machine.
The same set of statistics was collected; they are shown in Figure 11. Clearly, our conclusions hold for fat trees as well as meshes. Perhaps this is surprising, since the average interprocessor distance is now O(log P) and the bisection bandwidth of the machine is O(P) instead of O(sqrt(P)). This is additional evidence that column-mapped methods are not scalable for highly parallel machines.

6. Further work. This work should be extended in several ways.
• Experimental performance data should be taken from actual distributed dense and sparse Cholesky codes and compared with our predictions.
• Variants that map the sparse matrix data in some form of two-dimensional cyclic map, as has been suggested by Gilbert and the author, by Kratzer, and by Venugopal and Naik, should also be scrutinized experimentally.
• The whole Cholesky factorization can be viewed as a DAG whose nodes are arithmetic operations and whose edges are values. (An n-input SUM operator should be used so as not to predefine the order of updates to an element.) Let us call this the computation DAG. The ultimate problem is to assign all the nodes to processors in such a way that the completion time is minimized. The computation DAG is quite large, and methods that work with an uncompressed representation of it suffer from excessive storage costs. (This idea is quite like the very old one of generating straight-line code for sparse Cholesky, in which the size of the program is proportional to the number of flops, and hence is larger than the matrix and its factor.) Of course, Cholesky DAGs have underlying regularity that allows for compressed representations. One such representation is the structure of L. Others, smaller still, have been derived from the supernodal structure of L and are usually only as large as a constant multiple of the size of A. All approaches to the problem to date have employed an assignment of computation to processors that is derived from the structure of L rather than from the computation DAG. None has succeeded. It is not known, however, if this failure is due to a poor choice of assignment, or alternatively if any assignment based only on the structure of L must in some way fail, or indeed whether there is any assignment for sparse Cholesky computation DAGs that will succeed. These issues require some investigation.
• In these proceedings, Ashcraft proposes a new class of column-oriented methods in which the assignment of work to processors differs from the assignment used in the algorithms we have investigated. His approach may make for a substantial reduction in the flux per link and the bisection width requirements of the method, and so it should be investigated further. We note, however, that it will not reduce the length of the critical path, since it is based on the same task graph as all column-oriented methods.
• It appears that the scalable implementation of iterative methods is much easier than it is for sparse
Cholesky. Indeed, even a naive distributed implementation of attractive iterative methods is quite efficient. For the class of preconditioned Krylov subspace methods, all that we require are fast matrix-vector products, preconditioner solves, and dot products. Total flux is kept to a small fraction of the computation cost by mapping compact subgrids of a regular grid to processors, so that most of the edges of the grid connect gridpoints that reside at the same processor; with this kind of spatial locality, even the dot products, which can be annoying, are at worst a tolerable cost. Useful, fully parallel mappings of irregular grids to processors can also be found at some preprocessing cost; good examples are provided by the work of Hammond [12], of Pommerell, Annaratone, and Fichtner [24], and of Pothen, Simon, and Wang [25], which makes it clear that this preprocessing can be done with a supportable amount of work. When simple preconditioners are used, the operation count of the iterative solution grows in such a way that the number of gridpoints need grow only like P log P, not P^2, to maintain efficiency. Finally, domain decomposition methods, which can be viewed as preconditioners designed to take advantage of the distributed-memory environment, are even more suitable; a good example has recently been provided by Bjørstad and Skogen [6], who developed domain decomposition methods of Schwarz type for massively parallel computers and found that communication was no noticeable impediment even with P = 16,384 and Ng equal to only 640.

We conclude by admitting that it is not yet clear whether sparse direct solvers can be made competitive at all for highly (P > 256) and massively (P > 4096) parallel machines.
REFERENCES

[1] E. ANDERSON, A. BENZONI, J. DONGARRA, S. MOULTON, S. OSTROUCHOV, B. TOURANCHEAU, AND R. VAN DE GEIJN, LAPACK for distributed memory architectures: progress report, In Parallel Processing for Scientific Computing, SIAM, 1992.
[2] C. ASHCRAFT, S. C. EISENSTAT, AND J. W. H. LIU, A fan-in algorithm for distributed sparse numerical factorization, SIAM J. Scient. Stat. Comput. 11 (1990), pp. 593-599.
[3] C. ASHCRAFT, S. C. EISENSTAT, J. W. H. LIU, AND A. H. SHERMAN, A comparison of three column-based distributed sparse factorization schemes, Research Report YALEU/DCS/RR-810, Comp. Sci. Dept., Yale Univ., 1990.
[4] C. ASHCRAFT, S. C. EISENSTAT, J. W. H. LIU, B. W. PEYTON, AND A. H. SHERMAN, A compute-ahead fan-in scheme for parallel sparse matrix factorization, In D. Pelletier, editor, Proceedings, Supercomputing Symposium '90, pp. 351-361. Ecole Polytechnique de Montreal, 1990.
[5] C. ASHCRAFT, The fan-both family of column-based distributed Cholesky factorization algorithms, These proceedings.
[6] P. BJØRSTAD AND M. D. SKOGEN, Domain decomposition algorithms of Schwarz type, designed for massively parallel computers, Proceedings of the Fifth International Symposium on Domain Decomposition. SIAM, 1992.
[7] J. DONGARRA, R. VAN DE GEIJN, AND D. WALKER, A look at scalable dense linear algebra libraries, Proceedings, Scalable High Performance Computer Conference, Williamsburg, VA, 1992.
[8] A. GEORGE, J. W. H. LIU, AND E. NG, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Comput. 10 (1989), pp. 287-298.
[9] A. GEORGE, M. T. HEATH, J. W. H. LIU, AND E. NG, Solution of sparse positive definite systems on a hypercube, J. Comput. Appl. Math. 27 (1989), pp. 129-156.
[10] J. R. GILBERT AND R. SCHREIBER, Highly parallel sparse Cholesky factorization, SIAM J. Scient. Stat. Comput., to appear.
[11] J. R. GILBERT, C. MOLER, AND R. SCHREIBER, Sparse matrices in MATLAB: design and implementation, SIAM J. Matrix Anal. Appl. 13 (1992), pp. 333-356.
[12] S. W. HAMMOND, Mapping Unstructured Grid Computations to Massively Parallel Computers, PhD thesis, Dept. of Comp. Sci., Rensselaer Polytechnic Institute, 1992.
[13] S. W. HAMMOND AND R. SCHREIBER, Mapping unstructured grid problems to the Connection Machine, In Piyush Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 11-30. MIT Press, 1992.
[14] M. T. HEATH, E. NG, AND B. W. PEYTON, Parallel algorithms for sparse linear systems, SIAM Review 33 (1991), pp. 420-460.
[15] S. G. KRATZER, Massively parallel sparse matrix computations, In P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 178-186. MIT Press, 1992. A more complete version will appear in J. Supercomputing.
[16] C. E. LEISERSON, Fat-trees: universal networks for hardware-efficient supercomputing, IEEE Trans. Comput. C-34 (1985), pp. 892-901.
[17] GUANGYE LI AND THOMAS F. COLEMAN, A parallel triangular solver for a distributed memory multiprocessor, SIAM J. Scient. Stat. Comput. 9 (1988), pp. 485-502.
[18] M. MU AND J. R. RICE, Performance of PDE sparse solvers on hypercubes, In P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 345-370. MIT Press, 1992.
[19] M. MU AND J. R. RICE, A grid based subtree-subcube assignment strategy for solving PDEs on hypercubes, SIAM J. Scient. Stat. Comput. 13 (1992), pp. 826-839.
[20] A. T. OGIELSKI AND W. AIELLO, Sparse matrix algebra on parallel processor arrays, These proceedings.
[21] D. P. O'LEARY AND G. W. STEWART, Data-flow algorithms for parallel matrix computations, Comm. ACM 28 (1985), pp. 840-853.
[22] L. S. OSTROUCHOV, M. T. HEATH, AND C. H. ROMINE, Modelling speedup in parallel sparse matrix factorization, Tech. Report ORNL/TM-11786, Mathematical Sciences Section, Oak Ridge National Lab., December, 1990.
[23] D. PATTERSON, Massively parallel computer architecture: observations and ideas on a new theoretical model, Comp. Sci. Dept., Univ. of California at Berkeley, 1992.
[24] C. POMMERELL, M. ANNARATONE, AND W. FICHTNER, A set of new mapping and coloring heuristics for distributed-memory parallel processors, SIAM J. Scient. Stat. Comput. 13 (1992), pp. 194-226.
[25] A. POTHEN, H. D. SIMON, AND L. WANG, Spectral nested dissection, Report CS-92-01, Comp. Sci. Dept., Penn State Univ. Submitted to J. Parallel and Distrib. Comput.
[26] E. ROTHBERG AND A. GUPTA, The performance impact of data reuse in parallel dense Cholesky factorization, Stanford Comp. Sci. Dept. Report STAN-CS-92-1401.
[27] E. ROTHBERG AND A. GUPTA, An efficient block-oriented approach to parallel sparse Cholesky factorization, Stanford Comp. Sci. Dept. Tech. Report, 1992.
[28] Y. SAAD AND M. H. SCHULTZ, Data communication in parallel architectures, Parallel Comput. 11 (1989), pp. 131-150.
[29] S. VENUGOPAL AND V. K. NAIK, Effects of partitioning and scheduling sparse matrix factorization on communication and load balance, Proceedings, Supercomputing '91, pp. 866-875. IEEE Computer Society Press, 1991.
[30] E. ZMIJEWSKI, Limiting communication in parallel sparse Cholesky factorization, Tech. Report TRCS89-18, Dept. of Comp. Sci., Univ. of California, Santa Barbara, CA, 1989.