A Distribution Independent Algorithm for the Reduction to Tridiagonal Form Using One-Sided Rotations

Markus Hegland
Research School for Information Sciences and Engineering
Australian National University

Abstract

A scalable algorithm for the reduction of symmetric matrices to tridiagonal form is developed. It uses one-sided rotations instead of similarity transforms, which allows a data distribution independent implementation with low communication volume. Timings on the Fujitsu AP 1000 and VPP 500 show good performance.

Appeared in "Proceedings of the IEEE First International Conference on Algorithms And Architectures for Parallel Processing", Brisbane, Australia, 19-21 April 1995, vol. 1, pp. 286-289.

1 Introduction

For the parallel solution of symmetric eigenvalue problems, both Jacobi's method [5, 6] and reduction methods [5], which first compute a tridiagonal matrix orthogonally similar to the original matrix, are used successfully. The first class requires a large number of floating point operations but is readily implemented in parallel, whereas the second class requires large amounts of communication. Block methods can increase efficiency to some extent, see [2, 3, 1]. The basic step of both classes of algorithms is an orthogonal similarity transform:

    A_i -> Q_i A_i Q_i^T.

It was seen [6] that the Jacobi method based on one-sided transformations

    B_i -> B_i Q_i^T

allows better vectorization and requires less communication. The intermediate matrices B_i are defined as factors of the A_i: A_i = B_i^T B_i.

We suggest applying the same one-sided idea to the reduction methods. The algorithm will form part of the subroutine library for the distributed memory Fujitsu VPP 500. Library subroutines often allow the user little freedom in the choice of the distribution of the data to the local memories of the processors. The one-sided algorithms allow a large range of distributions and perform equally well on all of them.

2 Reduction algorithms

2.1 Two-sided reduction

The basic reduction algorithm for symmetric matrices [4] is, in matrix notation:

Algorithm 2.1 (Reduction to tridiagonal form)

    T_1 := A[1, 1]
    a_1 := A[2 : n, 1]
    A_1 := A[2 : n, 2 : n]
    for i := 1, ..., n - 1
        Find an orthogonal Q_i such that Q_i a_i = ||a_i|| f_i
        Ã_i := Q_i A_i Q_i^T
        T_{i+1} := [ T_i             ||a_i|| e_i ]
                   [ ||a_i|| e_i^T   Ã_i[1, 1]   ]
        a_{i+1} := Ã_i[2 : n - i, 1]
        A_{i+1} := Ã_i[2 : n - i, 2 : n - i]
    end for

Here A[s : t, u : v] denotes the submatrix of A consisting of rows s to t of columns u to v, etc. Furthermore, e_i = (0, ..., 0, 1)^T ∈ R^i, f_i = (1, 0, ..., 0)^T ∈ R^{n-i}, and || · || denotes the Euclidean norm. The computationally expensive part of step i of the reduction algorithm is the computation of Ã_i := Q_i A_i Q_i^T, which is done in three steps:

    y_i := A_i u_i
    v_i := y_i - 1/2 (y_i^T u_i) u_i
    Ã_i := A_i - u_i v_i^T - v_i u_i^T

Thus step i needs 6(n - i)^2 floating point operations, and so this algorithm requires a total of Σ_{i=1}^{n-2} 6(n - i)^2 + O(n^2) = 2n^3 + O(n^2) floating point operations. This can be reduced, but only at the cost of non-unit stride data access. One third of the time is spent in matrix vector products and the rest in rank-2 updates. The overall mean vector length is 2n/3. Performance of the algorithm can be improved by using block algorithms [2].

2.2 One-sided reduction

In finite element problems and in statistical applications, A is usually given as A = B^T B for some m by n matrix B. Such a factorization can also be obtained for every symmetric positive definite matrix by Cholesky factorization. The two-sided algorithm assembles A first, whereas the one-sided algorithm assembles A only as it is needed for the reduction and applies the orthogonal transformations to B instead of A. As there is no need to store the matrix A, the one-sided algorithm also has definite advantages for sparse problems. In matrix notation, the one-sided algorithm is:

Algorithm 2.2 (One-sided reduction)

    N_1 := B[1 : n, 1]
    B_1 := B[1 : n, 2 : n]
    for i := 1, ..., n - 1
        a_i := B_i^T N_i[1 : n, i]
        Find Q_i such that Q_i a_i = ||a_i|| f_i
        B̃_i := B_i Q_i^T
        N_{i+1} := [ N_i   B̃_i[1 : n, 1] ]
        B_{i+1} := B̃_i[1 : n, 2 : n - i]
    end for

The same algorithm can be used to obtain the bidiagonal form needed for the singular value decomposition of a general matrix B ∈ R^{m,n} by doing a QR factorization of N. For the orthogonal transformations, Householder matrices [4], i.e., Q_i = I - τ_i u_i u_i^T, are used. An essential step in the computations is the evaluation of B̃_i = B_i Q_i, which proceeds in two steps:

    y_i := B_i u_i
    B̃_i := B_i - y_i u_i^T

The floating point operation count of this variant is 3n^3 + O(n^2), which is only slightly higher than what is needed in the two-sided case. Note, however, that the n^3 operations needed to assemble A are included in this count. The overall mean vector length is n.
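As a concrete illustration, here is a serial NumPy sketch of Algorithm 2.2. It returns the transformed factor N, whose Gram matrix N^T N is the tridiagonal matrix; the function name and sign conventions are my own, and unit-norm Householder vectors are used rather than the paper's scaling:

```python
import numpy as np

def one_sided_tridiag(B):
    """One-sided reduction (a serial sketch of Algorithm 2.2): apply
    Householder transforms to B from the right so that N^T N is
    tridiagonal, where N collects the transformed columns."""
    B = np.array(B, dtype=float)
    m, n = B.shape
    N = B[:, :1].copy()            # N_1 := first column of B
    Bi = B[:, 1:].copy()           # B_1 := remaining columns
    for i in range(n - 1):
        ni = N[:, -1]
        a = Bi.T @ ni              # a_i = B_i^T n_i (assembly as needed)
        na = np.linalg.norm(a)
        if na > 0.0:
            s = np.sign(a[0]) if a[0] != 0.0 else 1.0
            u = a.copy()
            u[0] += s * na
            u /= np.linalg.norm(u)
            y = Bi @ u             # y_i := B_i u_i
            Bi = Bi - 2.0 * np.outer(y, u)   # B~_i := B_i Q_i^T
        N = np.hstack([N, Bi[:, :1]])        # append new column n_{i+1}
        Bi = Bi[:, 1:]
    return N
```

Note that A is never formed: only the single column a_i = B_i^T n_i is assembled at each step, which is what makes the method attractive for sparse B.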

3 A Parallel Implementation

3.1 Processor Model and Programming

The one-sided reduction is implemented on the Fujitsu VPP 500 vector-parallel computer with up to 222 processors connected by a crossbar switch, with 1.6 Gflop/s peak performance per node. Alternatively, a Fujitsu AP 1000 is used with 128 Sparc nodes connected by a two-dimensional torus network. One of the main features of both computers is their distributed memory. The algorithms mainly use data from local processor memory and avoid communication. The SPMD programming model is used, with computation and communication alternating.

3.2 Communication Requirements of the Reduction Algorithms

For column-wise distributed matrices, the two-sided algorithm requires communication at step i for

  - the computation of Q_i (reduction and broadcast),
  - the matrix vector product A_i u_i,
  - the rank-2 update (broadcast of u_i and y_i).

If the assembly step is included, another broadcast is required. The one-sided reduction needs communication for

  - the matrix vector product B_i^T n_i (broadcast of n_i),
  - the computation of Q_i,
  - the matrix vector product B_i u_i and the rank-1 update,

if the matrices are distributed column-wise. Thus the one-sided algorithm requires only one broadcast per step (which can even be partially avoided by doing some redundant computations), whereas the two-sided algorithm does three broadcasts per step.

3.3 Distribution invariance

In general, the amount of communication is dictated by the algorithm and the way the data is distributed over the processors. In order to get good load balancing, the matrix B has to be distributed such that the local memory of every processor contains the same number of columns. This distribution can be blocked, cyclic or any other, as long as this requirement is satisfied. Now let B' be the matrix which contains the columns of B chosen interleavedly from the local memories. Thus B' is cyclically distributed and, furthermore, B' = BP for a permutation matrix P. As in the end only the spectral properties of A = B^T B are of interest, and B'^T B' = P^T A P shares them, the matrix B' is used for the computations instead of B. The attractive feature of B' is its cyclic distribution; note that no redistribution or rearrangement of B was needed, only a reinterpretation.
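The reinterpretation amounts to a small index computation: given any balanced column distribution, the columns of B' are read off interleavedly across the local memories. The helper below and its names are illustrative, not part of the library:

```python
def cyclic_view(local_cols):
    """Return the global column order of B' obtained by taking columns
    interleavedly from the local memories.  local_cols[k] lists the
    global column indices stored on processor k (balanced distribution).
    No data is moved; only the meaning of "column j of B'" changes."""
    p, q = len(local_cols), len(local_cols[0])
    return [local_cols[k][j] for j in range(q) for k in range(p)]
```

For a blocked distribution of 8 columns over 2 processors, `cyclic_view([[0, 1, 2, 3], [4, 5, 6, 7]])` yields `[0, 4, 1, 5, 2, 6, 3, 7]`: column j of B' lives on processor j mod p, i.e., B' is cyclically distributed regardless of the original distribution.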

3.4 The Parallel or Distributed One-Sided Reduction

In the parallel implementation of the one-sided reduction, the computations in step i are done in two stages: the first stage does the reduction within each processor without any communication; the second stage does the reduction "across processors", which involves some communication. In the following, let f denote the first standard basis vector of R^{q'} for the appropriate q'. Also, the "empty" array [ ] is used, with norm zero, and X^(k) denotes the portion of matrix X on processor k.

Algorithm 3.1 (Parallel one-sided reduction)

    N_1^(1) := B^(1)[1 : n, 1];   N_1^(k) := [ ], k = 2, ..., p
    B_1^(1) := B^(1)[1 : n, 2 : n];   B_1^(k) := B^(k), k = 2, ..., p
    broadcast n_1 := N_1^(1)
    for i := 1, ..., n - 1

        stage 1: computation without communication
            a_i^(k) := (B_i^(k))^T n_i
            Find Q_i^(k) such that Q_i^(k) a_i^(k) = ||a_i^(k)|| f
            B_i^(k) := B_i^(k) (Q_i^(k))^T

        stage 2: computation and communication mixed
            â_i^T := (||a_i^(1)||, ..., ||a_i^(p)||)
            Find Q̂_i such that Q̂_i â_i = ||a_i|| f
            B̂_i := [ B_i^(1)[:, 1], ..., B_i^(p)[:, 1] ] Q̂_i^T
            B̃_i^(k) := [ B̂_i^(k), B_i^(k)[:, 2 : q'] ]   (q' is the number of columns of B_i^(k))
            broadcast n_{i+1} := B̃_i[:, 1]

        then, as in the original algorithm:
            N_{i+1} := [ N_i   B̃_i[1 : n, 1] ]
            B_{i+1} := B̃_i[1 : n, 2 : n - i]

    end for

Stage 1 is exactly the same as in the one-processor case, and Householder transforms are used. The crucial part is stage 2, as it involves communication. The implementation of this part organizes the computations such that the broadcast of n_{i+1} becomes unnecessary, at the cost of some redundant computations. Computational and communication steps appear interleaved. Stage 2 is now discussed in detail for the case of 4 processors. For the last few steps fewer processors are active; here, however, a typical step is displayed where all processors are involved (i.e., i < n - 3). Furthermore, to keep the notation simple, it is assumed that column i + 1 is in the local memory of processor 1.

Algorithm 3.2 (Stage 2 for 4 processors)

    data exchange between processors 1 <-> 2 and 3 <-> 4:
        j_1 = 2, j_2 = 1, j_3 = 4, j_4 = 3
        β^(k) := α^(j_k);   w^(k) := n^(j_k)

    computational step using Givens rotations:
        find c^(k), s^(k) such that

            [  c^(k)  s^(k) ] [ α^(k) ]   [ sqrt(α^(k)^2 + β^(k)^2) ]
            [ -s^(k)  c^(k) ] [ β^(k) ] = [            0            ]

        if k = 2, 4 then
            b^(k) := -s^(k) n^(k) + c^(k) w^(k)
        end if
        n^(k) := c^(k) n^(k) + s^(k) w^(k)
        α^(k) := sqrt(α^(k)^2 + β^(k)^2)

    data exchange between processors 1 <-> 3 and 2 <-> 4:
        j_1 = 3, j_2 = 4, j_3 = 1, j_4 = 2
        β^(k) := α^(j_k);   w^(k) := n^(j_k)

    computational step using Givens rotations:
        find c^(k), s^(k) such that

            [  c^(k)  s^(k) ] [ α^(k) ]   [ sqrt(α^(k)^2 + β^(k)^2) ]
            [ -s^(k)  c^(k) ] [ β^(k) ] = [            0            ]

        if k = 3 then
            b^(k) := -s^(k) n^(k) + c^(k) w^(k)
        end if
        if k = 1 then
            b^(k) := c^(k) n^(k) + s^(k) w^(k)
        end if
        n^(k) := c^(k) n^(k) + s^(k) w^(k)
        α^(k) := sqrt(α^(k)^2 + β^(k)^2)

This stage computes n^(k) := B̃_i[:, 1] and α^(k) := ||a_i|| on every processor. Rather than computing n_{i+1} on one processor and broadcasting it, some computations are duplicated and n_{i+1} is computed simultaneously on all 4 processors. The communication pattern is that of a collect operation.
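The recursive-doubling structure of stage 2 can be simulated for any power-of-two number of processors; each list entry below plays the role of one processor's local state. This is a sketch of the idea (the butterfly exchange pattern and function name are my own), not the paper's implementation:

```python
import numpy as np

def stage2_combine(alphas, ncols):
    """Combine local norms alpha^(k) = ||a_i^(k)|| and local columns
    n^(k) with Givens rotations in log2(p) pairwise exchange rounds, so
    that afterwards every processor holds the global norm ||a_i|| and
    the same combined column -- no separate broadcast is needed.
    Requires p to be a power of two."""
    p = len(alphas)
    alphas = [float(a) for a in alphas]
    ncols = [np.asarray(c, dtype=float).copy() for c in ncols]
    dist = 1
    while dist < p:
        nxt_a, nxt_n = alphas[:], [c.copy() for c in ncols]
        for k in range(p):
            j = k ^ dist                        # exchange partner j_k
            beta, w = alphas[j], ncols[j]       # received from partner
            r = np.hypot(alphas[k], beta)       # sqrt(alpha^2 + beta^2)
            if r > 0.0:
                c, s = alphas[k] / r, beta / r  # Givens coefficients
                nxt_n[k] = c * ncols[k] + s * w
            nxt_a[k] = r
        alphas, ncols = nxt_a, nxt_n
        dist *= 2
    return alphas, ncols
```

After log2(p) rounds every "processor" holds α = sqrt(Σ_k α_k^2) and the column Σ_k α_k n^(k) / α, duplicating the small amount of work that a broadcast would otherwise require.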

3.5 Performance

The time needed for communication in step i is log_2(p)(τ + (n + 1)β), where τ denotes the message startup time and β the transfer time per word, and so the total communication costs are

    t_comm = log_2(p)(τ + (n + 1)β)(n - 2)p.

As the n_{i+1} are computed on all the processors, there is also a parallel overhead of the order O(n^2) log_2(p)p, but this is negligible compared to the communication costs. Isoefficiency is obtained if

    n = C log_2(p)p.

This is approximately confirmed on the AP 1000, where the following Mflop/s rates were obtained. In order to be able to compare performance across algorithms, we define performance as a scaled inverse time, r_n = 2n^3/t_n.

      p      n    r (Mflop/s)
    128   3200    105
     64   1344     59
     32    576     32
     16    256     15.4
      8    128      7.3
      4     64      3.8
      2     32      1.7
      1     16      1.0

    Table 1: performance r on the AP 1000 with p processors and problem size n

So far the algorithm has been tested on a 1 and 2 processor VPP 500, where the performance is 857 Mflop/s (n = 512) and 1.9 Gflop/s (n = 1024), respectively. Thus on the AP 1000 about 1/6 of peak performance is obtained, while on the VPP 500 we get over 1/2 of peak performance. Performance on the AP 1000 can be increased further by taking the memory hierarchy into account. In comparison, the two-sided method combined with block algorithms was used in [3, 1], where for n = 3000 between 10 and 20 percent of peak performance was obtained on a 128 processor Intel Delta.
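Under the isoefficiency relation n = C log_2(p) p, the ratio n / (p log_2 p) should be roughly constant across the scaled runs. A quick back-of-envelope check against the Table 1 problem sizes (my own verification, not from the paper):

```python
import math

# Problem sizes n used for each processor count p in Table 1 (p > 1).
table = {128: 3200, 64: 1344, 32: 576, 16: 256, 8: 128, 4: 64, 2: 32}

# Isoefficiency n = C log2(p) p  =>  C = n / (p log2(p))
C = {p: n / (p * math.log2(p)) for p, n in table.items()}
for p in sorted(C):
    print(f"p = {p:3d}: C = {C[p]:.2f}")
```

For the larger configurations the constant settles near C ≈ 3.5; for small p the minimum sensible problem size dominates and the ratio is larger, consistent with the "approximately confirmed" wording above.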

4 Conclusions

A new algorithm was suggested for the reduction to tridiagonal form of matrices given as A = B^T B. Compared with traditional methods, this algorithm has longer vector lengths, needs fewer operations for such matrices, and reduces communication. In addition, it is scalable. Timings of implementations on both the Fujitsu VPP 500 and the AP 1000 showed good performance and confirmed the scalability analysis. Finally, the algorithm could prove useful for sparse problems arising from the finite element method.

Acknowledgements

The algorithm has been developed in the Area 4 joint ANU/Fujitsu mathematical software project and will be included in Fujitsu's scientific subroutine library SSL II/VPP. I would like to thank M. Osborne, ANU, for suggesting that I study the reduction to tridiagonal form and for encouragement to pursue the line of one-sided reduction. Furthermore, I thank B. B. Zhou and M. Kahn, ANU, for helpful discussions and their introductions to the one-sided Jacobi algorithms. The software is tested and improved by M. Kahn. Finally, I would like to thank J. J. Dongarra, Univ. Tennessee, for kindly sending me his manuscript [1].

References

[1] Jaeyoung Choi, Jack J. Dongarra, and David W. Walker, "The design of a parallel, dense linear algebra software library: Reduction to Hessenberg, tridiagonal and bidiagonal form", submitted to SIAM J. Sci. Comp., 1994.

[2] Jack J. Dongarra, Sven J. Hammarling, and Danny C. Sorensen, "Block reduction of matrices to condensed forms for eigenvalue computations", Tech. report, Argonne National Laboratory, 1987.

[3] Jack J. Dongarra and Robert A. van de Geijn, "Reduction to condensed form for the eigenvalue problem on distributed memory architectures", Tech. report, Dept. Comp. Sci., Univ. Tennessee, 1991.

[4] Gene H. Golub and Charles F. Van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press, 1989.

[5] Beresford N. Parlett, The Symmetric Eigenvalue Problem, Prentice Hall, 1980.

[6] Bing Bing Zhou and Richard Peirce Brent, "A parallel ordering algorithm for efficient one-sided Jacobi SVD computations", Proc. Sixth IASTED-ISMM International Conference on Parallel and Distributed Computing and Systems, pp. 369-372, 1994.
