Sparse Matrix Computations on Parallel Processor Arrays

Andrew T. Ogielski and William Aiello
Bell Communications Research, Morristown, NJ 07960

July 22, 1991. Revised May 1, 1992.

Abstract
We investigate the balancing of distributed compressed storage of large sparse matrices on a massively parallel computer. For fast computation of matrix-vector and matrix-matrix products on a rectangular processor array with efficient communications along its rows and columns we require that the nonzero elements of each matrix row or column be distributed among the processors located within the same array row or column, respectively. We construct randomized packing algorithms with such properties, and we prove that with high probability they produce well-balanced storage for sufficiently large matrices with a bounded number of nonzeros in each row and column, but no other restrictions on structure. Then we design basic matrix-vector multiplication routines with fully parallel interprocessor communications and intraprocessor gather and scatter operations. Their efficiency is demonstrated on the 16,384-processor MasPar computer.
Keywords: distributed data structures, linear algebra, load balancing, parallel algorithms, randomized algorithms, sparse matrices.
1 Introduction
Efficient computation in the data-parallel mode is achieved with distributed data structures that balance the processors' computation load and promote maximum parallelism of data communications. For parallel dense matrix linear algebra routines [6, 3, 9, 7, 4, 12, 2] a balanced distribution of computations to processors can be achieved together with very
regular interprocessor communication patterns. In contrast, in the design of algorithms for unstructured sparse matrices, the data compression scheme may conflict with efficient communication. For instance, if a large matrix is evenly partitioned into blocks in correspondence with the processor array, the communications are similar to the dense case, but the computation time will be determined by the block with the largest number of nonzeros, which may be unacceptably large. Alternatively, if the nonzeros are densely packed into the processor array without regard for the row and column structure, the benefit of even processor load may be negated by overwhelming communication costs.

Universal distributed data structures that guarantee both fast computation and efficient compression for all sparse matrices have not been found. Therefore, we associate classes of sparse matrices with appropriate data structures and computational kernels. For unstructured sparse matrices it is desirable to have few classes, defined by simple and easily verified properties. Here we will only request that no row or column of a sparse matrix is too dense, with no other restrictions on matrix structure. We assume that the matrix remains unchanged during computation, and that it is accessed sufficiently often to justify preprocessing for well-balanced storage; this is the case in the family of Lanczos algorithms for the symmetric eigenvalue problem, in conjugate gradient methods, and in many other numerical algorithms [7].

Suppose we have an $m$ by $n$ two-dimensional processor array with efficient communications along its rows and columns. Many parallel computers can be configured in this way. The processing element in the $i$th row and the $j$th column will be called $PE(i,j)$. Let $A$ be a sparse $M \times N$ matrix, $M \geq m$ and $N \geq n$. We will say that an assignment of matrix elements to the processor array preserves the integrity of the matrix if for every row (column) all its nonzero elements are placed into processors lying in a single row (column) of the array. In such an assignment each processor stores a submatrix, and the data communications required by linear operations can be carried out in parallel.

The problem of finding an assignment which preserves the integrity of the matrix and minimizes the largest processor load is NP-complete. This can easily be shown by a reduction from bin packing [5]. Nonetheless, we show that if one does not require a deterministic, optimal solution, then a fast random assignment which preserves the matrix integrity and comes very close to optimal can be achieved with high probability.

We analyze two schemes for random assignments which preserve the integrity. In the simplest random assignment scheme each matrix row index is randomly and independently assigned a row of the processor array, and each matrix column index is randomly and independently assigned a column of the array. The dimensions of the submatrices stored in different processors will in general be different. That loss of uniformity is corrected in the second, more restricted scheme. Suppose that before the loading a random permutation of matrix rows and a random permutation of
matrix columns are performed, and then the permuted matrix is partitioned into blocks of size $\lceil M/m \rceil \times \lceil N/n \rceil$ (the rightmost and lowest blocks may be smaller). This results in an $m \times n$ matrix of blocks which are assigned in the natural way to the $m \times n$ processor array. The second scheme is very convenient for algorithm design, but a priori it is not obvious that it can achieve as good load balancing as the first method.

Suppose that the matrix $A$ has $T$ nonzero elements and at most $R$ ($C$) nonzeros in each row (column). Let $|P_{i,j}|$ denote the number of nonzeros assigned to processor $PE(i,j)$; it is easily seen that for any assignment scheme we must have $\max_{i,j} |P_{i,j}| \geq \lceil T/mn \rceil$. We prove that for sufficiently large $T$ and sufficiently small $R$ and $C$ either assignment scheme produces a well-balanced load with high probability. This is stated by showing that in both schemes $\Pr\{\max_{i,j} |P_{i,j}| \geq (1+\epsilon)\lceil T/mn \rceil\}$ is bounded from above by $\exp\{-O(h(\epsilon))\}$, where $h(x) = (1 + x^{-1})\ln(1+x) - 1$. The function $h(x)$ is strictly monotonically increasing, with $h(x) \approx x/2$ for $x \to 0$ and $h(x) \approx \ln(x)$ for $x \to \infty$. As far as we know this is the first result on parallel sparse matrix computations with provably good storage efficiency for unstructured sparse matrices. The exact statement of the theorem and its proof is given in Section 2. The proof for the first scheme makes repeated use of Bennett's inequality for large deviations from the expected value of a sum of independent random variables [1]. The second scheme uses an extension of this inequality to a special case of dependent random variables due to Hoeffding [8]. Incidentally, very large randomly sparse matrices (where each element independently is nonzero with a fixed probability) with high probability give a balanced load in the second scheme even without the row and column permutations (apply Bennett's inequality directly to block submatrices).

Once submatrices are assigned to processor elements, the nonzeros are stored in a compressed format. Although the structure (i.e., the location of nonzeros) of the submatrices stored in different PEs may vary, this does not impede the parallelism of data movement and execution of matrix primitives if scatter and gather techniques are used: data can be transmitted in parallel in regular patterns to buffers in the PEs and then scattered to the proper memory locations, or vice versa. For this we assume that the parallel computer supports indirect addressing, that is, the PEs can store pointers into their own memories. With indirect addressing the scatter and gather operations can be executed in parallel even on data-parallel machines, since each PE can access a different memory location in a single instruction. In Section 3 we assume a balanced distribution of nonzeros and show one possible implementation of the basic matrix-vector kernels for computation of $Ax$ or $y^T A$ and their extensions to blocks of vectors, assuming only the data-parallel computer model. We also present examples both of the efficiency of packing and of the performance of the kernels on a particular data-parallel computer, the 16,384-processor MasPar MP-1216.

The NP-completeness of the optimum integrity-preserving matrix assignment problem does not preclude the possibility that a deterministic, polynomial time algorithm may be
able to produce an assignment that provably comes within a constant factor of the optimum. Finding such an algorithm is an interesting question for future research. We also expect that randomized storage of data in distributed data structures preserving the favorable communication patterns will be useful in other applications.
2 Balanced Loading
To distinguish the matrix row and column indices from the processor array indices we always use capital letters $I, J, \ldots$ for the matrix, and lower case $i, j, \ldots$ for the processors. For brevity in this section we use $P_{i,j}$ to denote the processor $PE(i,j)$. Also, we will denote the set of integers from $a$ to $b$ inclusive by $[a, b]$.

We consider assignments of nonzero elements of an $M$ by $N$ sparse matrix $A$ to processors in an $m$ by $n$ array (where $M \geq m$ and $N \geq n$) which preserve the integrity of matrix rows and columns. Any such assignment by definition can be described by two mappings: the row mapping $\rho: [0, M-1] \to [0, m-1]$, and the column mapping $\gamma: [0, N-1] \to [0, n-1]$. A matrix element $A_{I,J}$ is assigned to $P_{\rho(I),\gamma(J)}$; therefore the processor $P_{i,j}$ stores the submatrix of $A$ given by the rows in $\rho^{-1}(i)$ and the columns in $\gamma^{-1}(j)$. Let $|P_{i,j}|$ denote the number of nonzeros in this submatrix. We analyze two fast probabilistic algorithms for the construction of row and column mappings that attempt to keep all the $|P_{i,j}|$ as close to the optimum value $\lceil T/mn \rceil$ as possible.

To describe the first, let $F_{M,m}$ be the uniform distribution on all functions from $[0, M-1]$ to $[0, m-1]$. We draw $\rho$ according to $F_{M,m}$ and $\gamma$ according to $F_{N,n}$. An explicit method to construct such random mappings is to assign a uniform random value from $[0, m-1]$ to $\rho(I)$ independently for each $I \in [0, M-1]$, and to assign a uniform random value from $[0, n-1]$ to $\gamma(J)$ independently for all $J \in [0, N-1]$.

The second type of random mapping is as follows. Let $G_{M,m}$ be the uniform distribution on all functions from $[0, M-1]$ to $[0, m-1]$ such that $|\rho^{-1}(i)| = \lceil M/m \rceil$ for $i \in [0, (M \bmod m) - 1]$ and $|\rho^{-1}(i)| = \lfloor M/m \rfloor$ for $i \in [M \bmod m, m-1]$. The row mapping $\rho$ is drawn according to $G_{M,m}$ and the column mapping $\gamma$ is drawn according to $G_{N,n}$. This leads to the assignment scheme mentioned in the Introduction, since one way to generate a mapping according to $G_{M,m}$ is to take a random permutation $\pi$ of $[0, M-1]$ and break the permuted sequence into $M \bmod m$ consecutive subsequences of length $\lceil M/m \rceil$ and $m - (M \bmod m)$ subsequences of length $\lfloor M/m \rfloor$. Alternatively, we may use $\rho(I) = \pi(I) \bmod m$. This restriction on the size of each submatrix actually forces a slight inefficiency, as we will see. For ease of notation in handling this let $m'$ and $n'$ be $M/\lceil M/m \rceil$ and $N/\lceil N/n \rceil$, respectively, when we are dealing with the distributions $G_{M,m}$ and $G_{N,n}$, and let them be simply $m$ and $n$ when dealing with the distributions $F_{M,m}$ and $F_{N,n}$.
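For concreteness, a minimal serial C sketch of the two constructions of a row mapping might look as follows (the function names and the use of rand() are illustrative only, and are not part of the MPL implementation described in Section 3.1; column mappings are generated the same way with $N$ and $n$):

```c
#include <stdlib.h>

/* Row mapping drawn from F_{M,m}: each row index I gets an
 * independent, uniformly random processor row rho[I] in [0, m-1]. */
void draw_F(int *rho, int M, int m)
{
    for (int I = 0; I < M; I++)
        rho[I] = rand() % m;          /* illustrative; a better RNG is advisable */
}

/* Row mapping drawn from G_{M,m}: apply a random permutation pi of
 * [0, M-1] and set rho[I] = pi[I] mod m, so that every processor row
 * receives either ceil(M/m) or floor(M/m) matrix rows. */
void draw_G(int *rho, int M, int m)
{
    int *pi = malloc(M * sizeof(int));
    for (int I = 0; I < M; I++)
        pi[I] = I;
    for (int I = M - 1; I > 0; I--) { /* Fisher-Yates shuffle */
        int J = rand() % (I + 1);
        int t = pi[I]; pi[I] = pi[J]; pi[J] = t;
    }
    for (int I = 0; I < M; I++)
        rho[I] = pi[I] % m;
    free(pi);
}
```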
We can now state our main theorem. As defined before, $T$ is the total number of nonzeros of $A$, $R$ is the largest number of nonzeros in a row, and $C$ is the largest number of nonzeros in a column.

Theorem. If the row and column assignments are chosen either according to $F_{M,m}$ and $F_{N,n}$, respectively, or according to $G_{M,m}$ and $G_{N,n}$, respectively, then for any $\epsilon > 0$ the probability
$$\Pr\left\{ \max_{i,j} |P_{i,j}| \geq (1+\epsilon)^2 \frac{T}{m'n'} \right\}$$
is at most the minimum of
$$mn \exp\!\left(-\frac{\epsilon T}{m'R}\, h(\epsilon)\right) + Mn \exp\!\left(-\frac{\epsilon R}{n'}\, h(\epsilon)\right) + n \exp\!\left(-\frac{\epsilon T}{n'C}\, h(\epsilon)\right)$$
and
$$mn \exp\!\left(-\frac{\epsilon T}{n'C}\, h(\epsilon)\right) + Nm \exp\!\left(-\frac{\epsilon C}{m'}\, h(\epsilon)\right) + m \exp\!\left(-\frac{\epsilon T}{m'R}\, h(\epsilon)\right),$$
where $h(x) = (1 + x^{-1})\ln(1+x) - 1$.

Proof: We first consider the case when the pair of index mappings $(\rho, \gamma)$ is drawn according to $F_{M,m} \times F_{N,n}$. The proof involves three applications of a large deviation theorem due to Bennett [1]. When $(\rho, \gamma)$ is drawn according to $G_{M,m} \times G_{N,n}$ the proof follows immediately once we argue that a certain extension of Bennett's inequality due to Hoeffding [8] applies.

For the statement of Bennett's inequality let $\xi_0, \xi_1, \ldots, \xi_{M-1}$ be independent random variables, bounded from above, and with finite first and second moments. Let $S = \sum_{I=0}^{M-1} \xi_I$, and let $\mathrm{Var}(S)$ be its variance.
Bennett's Inequality. For all $d \geq 0$ such that $\xi_I - E(\xi_I) \leq d$ for all $I$, and for all $\lambda \geq 0$,
$$\Pr\{S \geq E(S) + \lambda\} \leq \exp\!\left(-\frac{\lambda}{d}\, h\!\left(\frac{\lambda d}{\mathrm{Var}(S)}\right)\right).$$

We need an application of Bennett's inequality to the following. Let $W = \{W_0, \ldots, W_{M-1}\}$ be a set of $M$ objects, each having a well defined nonnegative "weight" $|W_I|$. Define the weight of a subset $X \subseteq W$ as the sum of the weights of the elements of $X$. Suppose we put each $W_I \in W$ independently and uniformly into one of $m$ bins, $\{B_0^W, B_1^W, \ldots, B_{m-1}^W\}$. Formally, each bin $B_i^W$, $i \in [0, m-1]$, is the set of objects $W_I$ such that $\rho(I) = i$, where $\rho$ is drawn according to $F_{M,m}$. The weight of each bin, $|B_i^W|$, is the sum of the weights of the elements in the bin. This can be written as $|B_i^W| = \sum_{I=0}^{M-1} \chi(\rho(I) = i)\,|W_I|$, where $\chi(\rho(I) = i)$ is the indicator variable which is one when $\rho(I) = i$ and zero otherwise. Since we are placing each object independently and uniformly into the bins, for any fixed bin $i$ the events that $W_0, W_1, \ldots, W_{M-1}$ are placed into bin $i$ are independent Bernoulli events with probability of success $1/m$. Said formally, since $\rho$ is drawn according to $F_{M,m}$, for any fixed bin
$i \in [0, m-1]$, the 0-1 random variables $\chi(\rho(0) = i), \chi(\rho(1) = i), \ldots, \chi(\rho(M-1) = i)$ are independent Bernoulli variables with probability of success $1/m$. Hence, the weight of each bin, $|B_i^W|$, has the same distribution as a weighted sum of i.i.d. Bernoulli variables (although $|B_0^W|, \ldots, |B_{m-1}^W|$ are themselves dependent random variables). If we know a $w > 0$ such that $0 \leq |W_I| \leq w$ for all $I \in [0, M-1]$ then the expected value, $E(|B_i^W|) = (1/m)\sum_{I=0}^{M-1} |W_I| = |W|/m$, and the variance,
$$\mathrm{Var}(|B_i^W|) = \sum_{I=0}^{M-1} \mathrm{Var}\big(\chi(\rho(I) = i)\,|W_I|\big) = (1/m)(1 - 1/m)\sum_{I=0}^{M-1} |W_I|^2,$$
m exp , (1 , 1 =m)w h wm I jWI j : m exp , w h wm I jWI j Unfortunately, in the course of the remainder of the proof, we may not know enough about W to compute MI , jWI j or MI , jWI j exactly. Nonetheless, we will know an upper bound W > 0 on the sum of the weights, MI , jWI j W . This implies an upper bound of wW on M , jW j . To see this note that M , jW j w M , jW j since 0 jW j w. This, I I I I I I I in turn, is at most wW by the bound on the sum of the weights. Now we can get a simple upper bound on the probability that one of the m bins has large weight: W jg (1 + ) W W jg jW j + W Pr max fj B Pr i2max fj B i i i2 ; m, ; m, m m m W h () : () m exp , mw (
[0
)
(
!)
P
1]
P
P
=0
P
=0
1
1
P
=0
1
1
=0
1
[0
P
2
)
(
2
2
P
=0 P
2
2
!)
(
=0
1
(
)
[0
1]
1]
The last inequality follows due to the fact that $h$ is monotone.

We return now to the problem of assigning submatrices of $A$ to processors in the array. Let the weight of an element $A_{I,J}$, denoted $|A_{I,J}|$, be one if $A_{I,J}$ is a nonzero and zero otherwise. Let $A_{I,\cdot}$ be the $I$th row of $A$ and let $|A_{I,\cdot}|$ be the number of ones in the $I$th row. Let $R$ be an upper bound on the weights of the rows of $A$. Similarly, let $A_{\cdot,J}$ be the $J$th column of $A$ and $|A_{\cdot,J}|$ the number of ones in the $J$th column. Let $C$ be an upper bound on the weights of the columns of $A$.

For the purposes of the analysis we make the column assignment, $\gamma$, first and then the row assignment, $\rho$. Let $V_{I,j}$, $I \in [0, M-1]$, $j \in [0, n-1]$, be the submatrices formed by the column assignment. That is, $V_{I,j} = \{A_{I,J} \mid J \in \gamma^{-1}(j)\}$. This defines the matrix $V$. Let $V_{\cdot,j}$ be the $j$th column of $V$. It is the submatrix of $A$ consisting of all the columns $A_{\cdot,J}$ such that $\gamma(J) = j$. That is, the column assignment assigns the objects $W = \{A_{\cdot,J} \mid J \in [0, N-1]\}$ uniformly and randomly to the bins $\{V_{\cdot,j} \mid j \in [0, n-1]\}$. We can apply $(*)$ to bound the weight of the largest bin by recalling that $A_{\cdot,J}$ has weight at most $C$, since it is simply a column of $A$, and $|W| = \sum_J |A_{\cdot,J}|$ is simply the total number of nonzeros of $A$, which is at most $T$. Hence,
$$\Pr_{F_{N,n}}\left\{ \max_{j \in [0, n-1]} |V_{\cdot,j}| \geq \frac{T}{n}(1 + \epsilon_3) \right\} \leq n \exp\!\left(-\frac{\epsilon_3 T}{nC}\, h(\epsilon_3)\right).$$
A similar analysis can also be applied on a row by row basis. For a fixed row $I \in [0, M-1]$, $\gamma$ assigns the $I$th row of $A$, $W = \{A_{I,J} \mid J \in [0, N-1]\}$, uniformly and independently to the $I$th row of $V$, $\{V_{I,j} \mid j \in [0, n-1]\}$. For each $I \in [0, M-1]$ we can use $(*)$ to bound the weight of the largest bin, $\max_j |V_{I,j}|$, by recalling that each $|A_{I,J}|$ is at most one and $|W|$ is simply $|A_{I,\cdot}|$, which is at most $R$. Hence,
$$\Pr_{F_{N,n}}\left\{ \max_{j \in [0, n-1]} |V_{I,j}| \geq \frac{R}{n}(1 + \epsilon_2) \right\} \leq n \exp\!\left(-\frac{\epsilon_2 R}{n}\, h(\epsilon_2)\right).$$
Applying this to each row $I$ and using the union bound we get
$$\Pr_{F_{N,n}}\left\{ \max_{\substack{I \in [0, M-1] \\ j \in [0, n-1]}} |V_{I,j}| \geq \frac{R}{n}(1 + \epsilon_2) \right\} \leq Mn \exp\!\left(-\frac{\epsilon_2 R}{n}\, h(\epsilon_2)\right).$$
We complete the assignment of submatrices of $A$ to processors by making the row assignments. Rows of $V$ are assigned to rows of $P$. That is, $P_{i,\cdot}$ is the submatrix composed of the rows of $V$, $V_{I,\cdot}$, such that $\rho(I) = i$, where $\rho$ is drawn by $F_{M,m}$. Said differently, for a fixed column $j \in [0, n-1]$, the objects $\{V_{I,j} \mid I \in [0, M-1]\}$ are assigned uniformly and independently to the bins $\{P_{i,j} \mid i \in [0, m-1]\}$. We can apply $(*)$ so long as we have upper bounds on the weights $|V_{I,j}|$ and the sum of the weights $\sum_{I=0}^{M-1} |V_{I,j}|$.

Assume for now that after the column assignment (but before the row assignment) we have $|V_{I,j}| \leq v$ for all $I \in [0, M-1]$ and $j \in [0, n-1]$, and that $\sum_{I=0}^{M-1} |V_{I,j}| = |V_{\cdot,j}| \leq V$ for all $j \in [0, n-1]$. Call the former event $E_v$, the latter event $E_V$, and their conjunction $E_{v,V}$. For each column $j \in [0, n-1]$ we can apply $(*)$ to the random row assignment:
$$\Pr_{F_{M,m}}\left\{ \max_{i \in [0, m-1]} |P_{i,j}| \geq \frac{V}{m}(1 + \epsilon_1) \,\Big|\, E_{v,V} \right\} \leq m \exp\!\left(-\frac{\epsilon_1 V}{vm}\, h(\epsilon_1)\right).$$
Applying this to every column we get
$$\Pr_{F_{M,m}}\left\{ \max_{\substack{i \in [0, m-1] \\ j \in [0, n-1]}} |P_{i,j}| \geq \frac{V}{m}(1 + \epsilon_1) \,\Big|\, E_{v,V} \right\} \leq mn \exp\!\left(-\frac{\epsilon_1 V}{vm}\, h(\epsilon_1)\right).$$
To achieve a similarly small probability for $\max_{i,j} |P_{i,j}|$ being large without the conditioning event $E_{v,V}$ we need only show that the probability that $E_{v,V}$ is not true is very small. That is,
$$\Pr\{F\} = \Pr\{F \mid E_{v,V}\}\Pr\{E_{v,V}\} + \Pr\{F \mid \bar{E}_{v,V}\}\Pr\{\bar{E}_{v,V}\} \leq \Pr\{F \mid E_{v,V}\} + \Pr\{\bar{E}_{v,V}\},$$
where $F$ is the event that $\max_{i,j} |P_{i,j}| \geq \frac{V}{m}(1 + \epsilon_1)$. If we choose $V = (1 + \epsilon_3)T/n$ and $v = (1 + \epsilon_2)R/n$ then we already have an upper bound for $\Pr\{\bar{E}_v \cup \bar{E}_V\}$. Putting everything together we get
$$\Pr\left\{ \max_{i,j} |P_{i,j}| \geq \frac{T}{mn}(1 + \epsilon_3)(1 + \epsilon_1) \right\} \leq mn \exp\!\left(-\frac{\epsilon_1 (1 + \epsilon_3) T}{(1 + \epsilon_2) m R}\, h(\epsilon_1)\right) + Mn \exp\!\left(-\frac{\epsilon_2 R}{n}\, h(\epsilon_2)\right) + n \exp\!\left(-\frac{\epsilon_3 T}{nC}\, h(\epsilon_3)\right).$$
The choice $\epsilon_1 = \epsilon_2 = \epsilon_3$ gives a more tractable (but not necessarily optimal) bound. Of course, we can do the whole analysis by conditioning on the results of the row assignment first. This gives us the same bound as the one above with the roles of $M$ and $N$, $m$ and $n$, and $R$ and $C$ interchanged. The actual bound is the minimum of these two.

The proof for the case where the row and column assignments are made according to $G_{M,m}$ and $G_{N,n}$, respectively, is nearly identical except that we need an extension of Bennett's inequality, due to Hoeffding [8], to the following case. As before, let $W$ be a set of $M$ objects $W = (W_0, \ldots, W_{M-1})$ with weights $|W_I|$. Let $\hat{B}^W$ be the random variable for the set of objects gotten by randomly selecting $r$ objects from $W$ without replacement and let $B^W$ be the set of objects gotten by randomly selecting $r$ objects from $W$ with replacement. Note that
$$E(|\hat{B}^W|) = E(|B^W|) = (r/M)\sum_{K=0}^{M-1} |W_K| = r|W|/M.$$
Also observe that
$$\mathrm{Var}(|B^W|) = (r/M)\sum_{K=0}^{M-1} \big(|W_K| - |W|/M\big)^2,$$
but that
$$\mathrm{Var}(|\hat{B}^W|) = \frac{M - r}{M - 1}\, \mathrm{Var}(|B^W|).$$
Bennett-Hoeffding Inequality. For any $d \geq 0$ such that $|W_I| - |W|/M \leq d$ for all $I \in [0, M-1]$, and any $\lambda \geq 0$,
$$\Pr\left\{ |\hat{B}^W| \geq E(|\hat{B}^W|) + \lambda \right\} \leq \exp\!\left(-\frac{\lambda}{d}\, h\!\left(\frac{\lambda d}{\mathrm{Var}(|B^W|)}\right)\right).$$
We apply this bound to the situation where we place exactly $\lceil M/m \rceil$ objects into bins $\hat{B}_i^W$, $i \in [0, (M \bmod m) - 1]$, and exactly $\lfloor M/m \rfloor$ objects into bins $\hat{B}_i^W$, $i \in [M \bmod m, m-1]$. Formally, $\hat{B}_i^W = \{W_I \mid I \in \rho^{-1}(i)\}$, where $\rho$ has distribution $G_{M,m}$. By symmetry we know that each $\hat{B}_i^W$, $i \in [0, (M \bmod m) - 1]$, has the same distribution as $\hat{B}^W$ above with $r = \lceil M/m \rceil$, and each $\hat{B}_i^W$, $i \in [M \bmod m, m-1]$, has the same distribution as $\hat{B}^W$ above with $r = \lfloor M/m \rfloor$. Using algebra similar to that used for deriving $(*)$ from Bennett's inequality, it is straightforward to derive the following bound from the Bennett-Hoeffding inequality: for any $\bar{W} \geq |W|$, any $w > 0$ such that $|W_I| \leq w$ for all $I \in [0, M-1]$, and any $\epsilon \geq 0$,
$$\Pr_{G_{M,m}}\left\{ \max_{i \in [0, m-1]} |\hat{B}_i^W| \geq \frac{\bar{W}\lceil M/m \rceil}{M}(1 + \epsilon) \right\} \leq m \exp\!\left(-\frac{\epsilon \bar{W}\lceil M/m \rceil}{wM}\, h(\epsilon)\right).$$
This is the same bound as $(*)$ when one substitutes the $m'$ and $n'$ as previously defined. Hence, the remainder of the proof follows exactly as before. $\Box$
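To get a feel for the magnitudes involved, the theorem's first bound can be evaluated numerically. The following C sketch does so for a hypothetical matrix (the parameter values below are illustrative only, not taken from the experiments of Section 3.1; $m' = m$ and $n' = n$ as in the $F$ scheme):

```c
#include <stdio.h>
#include <math.h>

/* h(x) = (1 + 1/x) ln(1+x) - 1, the rate function appearing in the theorem */
static double h(double x)
{
    return (1.0 + 1.0 / x) * log(1.0 + x) - 1.0;
}

int main(void)
{
    /* Hypothetical problem parameters, for illustration only */
    double T = 5.0e7;              /* total number of nonzeros       */
    double R = 8000.0, C = 8000.0; /* max nonzeros per row / column  */
    double M = 1.0e5;              /* number of matrix rows          */
    double m = 128.0, n = 128.0;   /* processor array dimensions     */

    for (double eps = 0.5; eps <= 2.0; eps += 0.5) {
        /* First of the two bounds, with m' = m and n' = n (F scheme) */
        double bound = m * n * exp(-(eps * T / (m * R)) * h(eps))
                     + M * n * exp(-(eps * R / n)       * h(eps))
                     +     n * exp(-(eps * T / (n * C)) * h(eps));
        printf("eps = %.2f   h(eps) = %.4f   Pr{max load >= (1+eps)^2 T/mn} <= %.3e\n",
               eps, h(eps), bound);
    }
    return 0;
}
```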
3 Data-Parallel Sparse Matrix-Vector Multiplication
So far, we have analyzed randomized algorithms for the balanced assignment of the nonzeros of a sparse matrix to a rectangular processor array. Such assignments preserve the alignment of matrix rows and columns for the design of efficient parallel sparse matrix routines. In this section we consider the design of the basic sparse matrix-vector multiplication kernels for parallel processor arrays under the restrictive conditions of the data-parallel (SIMD) machine model. Program-parallel (MIMD) machines are more powerful, and naturally include the SIMD model. The minimum characteristics of a data-parallel computer model required here are:

1. There are $p$ processing elements (PEs) interconnected by a communication network. Every PE has its own, identically organized local memory. It is assumed that the network is configured as a (virtual) two-dimensional rectangular grid, with efficient communication among the PEs along any row or column of the grid.

2. Each PE independently can be in the active or inactive state, which depends on the local data and may change from instruction to instruction.

3. There is a separate processor (controller) executing the program and broadcasting instructions which are synchronously evaluated in all active PEs. It is assumed that an indirect addressing feature is available: the PEs can store local pointers to their local memories, thus each PE can access a different memory location in a single instruction (illustrated below).
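For concreteness, a serial emulation of one such indirectly addressed gather instruction might look as follows (illustrative only; the array sizes and names are placeholders, not part of the MPL implementation):

```c
#define NPE   4      /* number of emulated PEs (illustrative)      */
#define LOCAL 8      /* words of local memory per PE               */

/* One data-parallel "gather" step: every active PE p executes
 * dst[p][k] = mem[p][ptr[p][k]] -- the same broadcast instruction,
 * but each PE dereferences its own locally stored pointer. */
void gather_step(double dst[NPE][LOCAL], double mem[NPE][LOCAL],
                 int ptr[NPE][LOCAL], int active[NPE], int k)
{
    for (int p = 0; p < NPE; p++)      /* conceptually simultaneous */
        if (active[p])
            dst[p][k] = mem[p][ptr[p][k]];
}
```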
We will concentrate on the second (i.e., random permutation) assignment scheme. While both schemes produce a balanced load under the same assumptions, and, with a balanced load, lead to similar matrix algorithms, only the second scheme guarantees an upper limit on the dimensions of the submatrices allocated to each processor. This simplifies memory management and algorithm design for SIMD computers. Therefore, suppose that for a large sparse matrix $A$ the random row and column permutations result in an acceptably balanced distribution of the nonzeros of $A$ to the PEs. If the row permutation is represented by an $M \times M$ matrix $P$, and the column permutation is represented by an $N \times N$ matrix $Q$, the assignment of nonzeros to PEs considered in Section 2 can be written as $(PAQ^T)_{I,J} \to PE(I \bmod m, J \bmod n)$. In most matrix problems one may do the computations with $PAQ^T$ instead of $A$ and undo the permutations at the end. This is straightforward; therefore from now on we will ignore the permutations for the sake of simplicity, and we will set $P = I_M$ and $Q = I_N$.

Each PE stores the nonzeros of a sparse submatrix of $A$ in a compressed data structure. The choice of compression scheme may be dictated by the PE architecture, available local memory, or other considerations [13, 14]. Here we use a simple symmetric scheme which allows for fast computation of both right and left multiplications, $y = Ax$ and $x^T = y^T A$. The processors' memory is therefore organized as follows: the nonzeros allocated to $PE(i,j)$ are stored in three aligned arrays $a[k]$, $r[k]$, $c[k]$, $k = 0, 1, \ldots, |P_{ij}| - 1$, where $a$ is the matrix element, and $r$, $c$ are its row and column indices, respectively.

It is not efficient to store dense vectors as matrices with one row or column. For communication efficiency and good load balance it is preferable to distribute the components of a vector $x = (x_0, x_1, \ldots, x_{L-1})$ to processors according to a multi-layer lexicographic scheme. In each PE define a local array $u[\,]$ of length $\lceil L/mn \rceil$. For row-wise lexicographic storage renumber the processors with a single index, so that $PE(i,j)$ gets the index $k = i \cdot n + j$, and place the component $x_J$ in array element $u[\lfloor J/mn \rfloor]$ in the PE with index $k = J \bmod mn$, for $J = 0, 1, \ldots, L-1$. For column-wise lexicographic placement renumber the processors so that $PE(i,j)$ gets the index $k = j \cdot m + i$, and proceed as before.

We will discuss only the routine for computation of $y = Ax$ in some detail. Obvious modifications are required for the computation of $y^T A$, and for the extension to blocks of dense vectors. The vector $x$ is distributed among the processors in row-wise lexicographic order and stored in the local arrays $u[\,]$, while the vector $y$ is distributed in column-wise lexicographic order and stored in the local arrays $v[\,]$. The multiplication routine requires an auxiliary local accumulator array $acc[\,]$ of length $|P_{ij}|$, and an auxiliary local buffer array $buf[\,]$ in each $PE(i,j)$, and proceeds in several phases. For transparency the pseudocode below is written for the case when the PE array dimensions divide the matrix dimensions, i.e., $M = rm$ and $N = sn$, and each PE has sufficient memory for a buffer array of length $\max(r, s)$. We note that indirect addressing is critical for data-parallel execution of the scatter and gather steps.
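For concreteness, the per-PE compressed storage and the row-wise lexicographic placement rule described above can be sketched in C as follows (the struct and function names are illustrative, not those of the MPL implementation):

```c
/* Compressed storage held by one PE(i,j): three aligned arrays with the
 * nonzero value and its global row and column indices. */
typedef struct {
    int     nnz;   /* |P_ij|, number of nonzeros stored in this PE */
    double *a;     /* a[k]: value of the k-th stored nonzero       */
    int    *r;     /* r[k]: its global row index                   */
    int    *c;     /* c[k]: its global column index                */
} pe_submatrix;

/* Row-wise lexicographic placement of component x_J on an m-by-n array:
 * PE(i,j) carries the single index i*n + j, and x_J goes to local slot
 * u[J / (m*n)] of the PE with index J mod (m*n). */
void place_rowwise(int J, int m, int n, int *pe_index, int *local_slot)
{
    *pe_index   = J % (m * n);
    *local_slot = J / (m * n);
}
```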
Matrix-vector multiplication $y = Ax$

1. Distribute vector components: for $k = 0, \ldots, s-1$, in parallel in each array column $j = 0, \ldots, n-1$: $temp = k \bmod m$; $PE(temp, j)$ sends $u[\lfloor k/m \rfloor]$ to all PEs in column $j$; every PE copies the received value into $buf[k]$.

2. Scatter: every $PE(i,j)$ in parallel: for $k = 0, \ldots, |P_{ij}| - 1$: $acc[k] \leftarrow buf[(c[k] - j)/n]$.

3. Multiply: every $PE(i,j)$ in parallel: for $k = 0, \ldots, |P_{ij}| - 1$: $acc[k] \leftarrow acc[k] \cdot a[k]$.

4. Gather: every $PE(i,j)$ in parallel: for $k = 0, \ldots, r-1$: $buf[k] = 0$; for $k = 0, \ldots, |P_{ij}| - 1$: $temp = (r[k] - i)/m$; $buf[temp] \leftarrow buf[temp] + acc[k]$.

5. Row sums: for $k = 0, \ldots, r-1$, in parallel in each array row $i = 0, \ldots, m-1$: compute the sum of $buf[k]$ along row $i$, and copy the sum to $v[\lfloor k/n \rfloor]$ in $PE(i, k \bmod n)$.

When the routine completes execution, the local array element $v[l]$ in $PE(i,j)$ stores the vector component $y_{lmn + jm + i}$, as determined by the column-wise lexicographic order. The number of parallel operations is as follows: $\lceil N/n \rceil$ vector copy steps; $\max_{i,j} |P_{ij}|$ each of the scatter, multiply, and gather steps; and $\lceil M/m \rceil$ row sum evaluations. In practice, it may be more efficient to employ systolic techniques for the first and last stages, rather than use broadcast along array columns and segmented scan-adds, respectively. The distribute/scatter and gather/sum steps may be iterated if there is not enough memory for a long buffer array. Multiple iterations can be efficiently managed with pointers when the arrays $a$, $r$ and $c$ are sorted in the order of increasing row index, $r[0] \leq r[1] \leq \cdots \leq r[|P_{ij}| - 1]$, and in the order of increasing column index via an auxiliary local pointer array $p[\,]$ such that $c[p[0]] \leq c[p[1]] \leq \cdots \leq c[p[|P_{ij}| - 1]]$.
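A serial C sketch of the purely local phases 2-4, as executed within a single $PE(i,j)$, might look as follows (illustrative only; the actual routine is written in MPL and runs these loops simultaneously in all PEs, with phases 1 and 5 supplying and collecting the contents of $buf[\,]$):

```c
/* Local work of one PE(i,j) for phases 2-4 of y = Ax: scatter the
 * buffered x-components into the accumulator, multiply by the stored
 * nonzeros, and gather partial row sums back into buf[].
 * On entry buf[k] holds x_{k*n+j} (phase 1); on exit buf[t] holds this
 * PE's contribution to matrix row t*m + i (consumed by phase 5).
 * rloc = M/m is the number of local row slots. */
void local_phases(int i, int j, int m, int n,
                  int nnz, const double *a, const int *r, const int *c,
                  double *acc, double *buf, int rloc)
{
    /* Phase 2: scatter, acc[k] <- buf[(c[k]-j)/n] via indirect addressing */
    for (int k = 0; k < nnz; k++)
        acc[k] = buf[(c[k] - j) / n];

    /* Phase 3: multiply, acc[k] <- acc[k] * a[k] */
    for (int k = 0; k < nnz; k++)
        acc[k] *= a[k];

    /* Phase 4: gather, accumulate into buf[(r[k]-i)/m] */
    for (int k = 0; k < rloc; k++)
        buf[k] = 0.0;
    for (int k = 0; k < nnz; k++)
        buf[(r[k] - i) / m] += acc[k];
}
```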
3.1 A Practical Implementation
In order to demonstrate the practicality of the data structures and algorithms proposed in this report we have implemented the matrix multiplication routines and load balancing on a commercially available computer, the MasPar MP-1216. This is a data-parallel machine with 16,384 RISC processors. Each processor has 64 kbytes of local memory, and operates on 4-bit wide data fields, with floating point instructions implemented in microcode. There are two separate communication networks: a two-dimensional toroidal mesh, and a global router. Only the mesh network is used in the matrix calculations, and on the MP-1216 the PEs are connected as a $128 \times 128$ array. The programs have been written in MPL [11], which is a data-parallel extension of the ANSI standard C language [10]. Several parallel algorithms for dense matrix multiplication have been implemented and analyzed on this computer in Ref. [2].

We have found that the performance of the load balancing algorithm is much better in practice than guaranteed by our theorem. The reason for this is that in the proof some dependencies have been bounded by overcounting, and some bounds have been relaxed to obtain a compact final formula. An extended empirical study of our assignment scheme for a variety of structures and sizes is beyond the scope of this report; nonetheless, as an illustration we do present data for some large matrices, both unstructured and highly structured. Let $P_0 = \lceil T/mn \rceil$ be the perfectly balanced load per PE. For the matrices described below we have estimated the distribution function $\Pr\{\max_{i,j} |P_{ij}|/P_0 < \beta\}$, which succinctly illustrates the performance of the randomized allocation algorithm. The probability distribution is the $G_{M,m} \times G_{N,n}$ of Section 2, corresponding to independently drawn random row and column permutations. Empirical distribution functions were obtained from a few hundred independent assignments for each matrix.

For the first example we take a $25{,}629 \times 56{,}530$ matrix representing the frequencies of words occurring in more than two articles from the Academic American Encyclopedia. This is a sparse, unstructured matrix with $T = 2{,}843{,}956$ nonzeros, $C = 13{,}904$ and $R = 2{,}168$. However, despite the presence of some quite dense rows and columns the random permutation assignment works reasonably well: an estimate of $\Pr\{\max_{i,j} |P_{ij}|/P_0 < \beta\}$ for the $128 \times 128$ MasPar array is shown in Figure 1. We see that a maximum load of at most $2P_0$ can be achieved in one attempt with probability about .22. This probability increases sharply when the requirement on the maximum load is relaxed. Qualitatively similar results have been obtained for other large sparse matrices representing word frequencies in different document databases.

For a completely different example we consider a square $200{,}000 \times 200{,}000$ banded matrix with a half bandwidth of 100. Although for such special matrices one would rather design a different distributed data structure, it is instructive to see the power of randomization: a direct tiling mapping of this matrix onto the $128 \times 128$ PE array would produce
$\max_{i,j} |P_{ij}| = 151{,}350$, compared to $P_0 = 1221$. However, the random permutation assignment does very well (see Figure 1): with a probability of success of over .99 we obtain an assignment that deviates from the perfectly balanced load by less than 15%.

In order to assess the performance of our implementation of the sparse matrix-vector multiplication routine, and to estimate the fraction of time spent on interprocessor communication and non-numerical operations, we compare the performance of the routine to the machine's peak floating point computing speed. All performance figures are for the double precision (64 bit) floating point format for matrix and vector components. According to custom, we also characterize the routine's performance in terms of the number of floating point operations per second (flops), including all non-numerical operations in the time measurement. For a standard of the machine peak rate we consider two array operations, $c[i] \leftarrow a[i] + b[i]$ and $c[i] \leftarrow a[i] \cdot b[i]$, executing in parallel in the PEs without interprocessor communication. The timing of these instructions for long arrays on a 16,384 processor MasPar gives an average peak rate of 250 Mflops (register operations are about twice as fast, but not proper for comparisons with routines involving array references). For the sparse matrix-vector multiplication routine the flop rate is determined by the number of nonzeros, $T$, and in accordance with [7] is defined as $2T/\tau$, where $\tau$ is the routine execution time.

The highest performance has been achieved with the perfectly balanced PE load (i.e., $\max_{i,j} |P_{ij}| = T/16{,}384$) for large dense matrices cast in the sparse data structure, giving 116 Mflops, that is, about 45% of the peak machine rate determined above. For sparse matrices the ratio $\max_{i,j} |P_{ij}|/P_0$ quantifies the performance loss resulting from load imbalance. Our fastest implementation of the sparse matrix-vector multiplication on the MasPar utilizes only the nearest-neighbor communications in the vector distribution and row sum collection stages. We have analyzed its performance, obtaining a formula for the execution time $\tau$,
$$\tau = \tau_1\, m \lceil N/mn \rceil + \tau_2 \max_{i,j} |P_{ij}| + \tau_3\, n \lceil M/mn \rceil.$$
The first term accounts for the vector distribution, the second for the scatter, multiply and gather steps, and the last for the row sum collection. The expected operation times $\tau_i$, measured on a $128 \times 128$ machine, are $\tau_1 = 17\,\mu s$, $\tau_2 = 290\,\mu s$, and $\tau_3 = 84\,\mu s$ (microseconds). The formula for $\tau$ predicts the actual execution time quite accurately for arbitrary matrices and imperfectly balanced loads, with a small uncertainty due to the dependence of the course of the scatter and gather operations on the data. For instance, for the first word frequency matrix described above we obtained 48 Mflops, and for the banded matrix we obtained 70 Mflops. On this machine, for a fixed $T$ the flop rate decreases with increasing matrix dimensions $M$ and $N$; thus for extremely sparse matrices with very large dimensions another algorithm could offer better performance.
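The timing model is easy to evaluate. The following C sketch (using the measured $\tau_i$ quoted above, but with an illustrative, hypothetical problem size and load) predicts the execution time and the corresponding flop rate on a $128 \times 128$ array:

```c
#include <stdio.h>

/* Predicted execution time (seconds) of the y = Ax kernel on an m-by-n
 * PE array, using the model
 *   tau = tau1 * m*ceil(N/(m*n)) + tau2 * maxload + tau3 * n*ceil(M/(m*n)),
 * with the per-step times measured on the 128x128 MasPar. */
static double predict_time(long M, long N, long maxload, int m, int n)
{
    const double tau1 = 17e-6, tau2 = 290e-6, tau3 = 84e-6; /* seconds */
    long vsteps = (long)m * ((N + (long)m * n - 1) / ((long)m * n));
    long rsteps = (long)n * ((M + (long)m * n - 1) / ((long)m * n));
    return tau1 * vsteps + tau2 * maxload + tau3 * rsteps;
}

int main(void)
{
    /* Placeholder problem: dimensions, nonzeros, and maximum PE load */
    long M = 200000, N = 200000, T = 20000000, maxload = 1400;
    double t = predict_time(M, N, maxload, 128, 128);
    printf("predicted time %.3f s, rate %.1f Mflops\n", t, 2.0 * T / t / 1e6);
    return 0;
}
```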
Acknowledgment

We thank Tom Leighton and Michael Berry for useful discussions, and Sue Dumais for examples of word-frequency matrices.
References

[1] G. Bennett. Probability inequalities for the sum of independent random variables. J. Amer. Statist. Assoc., 57:33-45, 1962.

[2] P. Bjørstad, F. Manne, T. Sørevik, and M. Vajtersic. Efficient matrix multiplication on SIMD computers. University of Bergen Technical Report, 1991.

[3] E. Dekel, D. Nassimi, and S. Sahni. Parallel matrix and graph algorithms. SIAM J. Comput., 10:657-675, 1981.

[4] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32:54-135, 1990.

[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York, NY, 1979.

[6] W. M. Gentleman. Some complexity results for matrix computations on parallel processors. J. ACM, 25:112-115, 1978.

[7] Gene H. Golub and Charles F. Van Loan. Matrix Computations, 2nd ed. The Johns Hopkins Univ. Press, Baltimore, 1989.

[8] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13-30, 1963.

[9] H. J. Jagadish and T. Kailath. A family of new efficient arrays for matrix multiplication. IEEE Trans. Comp., 38:149-155, 1989.

[10] B. W. Kernighan and D. M. Ritchie. The C Programming Language, 2nd Edition. Prentice Hall, Inc., Englewood Cliffs, NJ, 1988.

[11] MasPar MP-1 Standard Programming Manuals. MasPar Computer Corporation, Sunnyvale, CA, 1991.

[12] J. M. Ortega, R. G. Voigt, and C. H. Romine. A bibliography on parallel and vector numerical algorithms. In Parallel Algorithms for Matrix Computations, pages 125-197. SIAM, Philadelphia, 1990.
[13] S. Pissanetzky. Sparse Matrix Technology. Academic Press, London, 1984.

[14] Y. Saad. Krylov subspace methods on supercomputers. SIAM J. Sci. Stat. Comput., 10:1200-1232, 1989.