Comparison of Different Parallel Modified Gram-Schmidt Algorithms

Gudula Rünger and Michael Schwind

Department of Computer Science, Technical University Chemnitz, 09107 Chemnitz, Germany
{ruenger,schwi}@informatik.tu-chemnitz.de
Abstract. The modified Gram-Schmidt algorithm (MGS) is used in many fields of computational science as a basic building block for problems in numerical linear algebra. In this paper we describe different parallel implementations (blocked and unblocked) of the MGS algorithm and show how overlapping communication and calculation can increase the performance by up to 38 percent on the two different cluster platforms which were used for the performance evaluation.
1 Introduction
The modified Gram-Schmidt (MGS) algorithm solves the problem of decomposing a matrix A ∈ R^{m×n} (here we consider the case n ≤ m) into matrices Q ∈ R^{m×n} and R ∈ R^{n×n} such that A = QR. The matrix Q computed by the algorithm is an orthogonal matrix composed of orthogonal vectors Q = {q_1, ..., q_n}; the matrix R is upper triangular.

The algorithms presented in this article were mainly developed for the full-rank QR problem (m = n), which arises in the derivation of Lyapunov vectors and exponents in computational physics [1], where the MGS algorithm is part of a repeated orthogonalization process inside an integration process. But there are many other areas of application for the MGS algorithm, for example the solution of linear least squares problems [2] or as part of the iterative solution of linear systems. Besides the MGS algorithm there exist several other algorithms for QR decomposition, for example the QR algorithm with Householder reflectors or the classical Gram-Schmidt algorithm (CGS). The Householder algorithm has a higher accuracy in the orthogonal vectors q_i, so that the norm ‖I − Q^T Q‖ is in the range of machine accuracy (I is the identity matrix), but it needs more floating-point operations, 8/3 m³ compared to 2m³ for MGS (m = n). The CGS and MGS algorithms are mathematically equivalent but behave differently with respect to the orthogonality of the computed matrix Q. For the MGS algorithm the norm ‖I − Q^T Q‖ can be predicted by an upper bound [2], but no such bound exists for the CGS algorithm.

The algorithm used in this article is called the row-wise MGS algorithm because it constructs the R matrix of the QR decomposition row by row.
There exists another version of the MGS algorithm, the column-wise version, but it is difficult to derive an efficient parallel implementation for the column-wise algorithm.

The parallel algorithms are realized in the SPMD programming style with MPI as communication library for message passing. The parallel realization of the algorithms uses a block-cyclic mapping over a two-dimensional processor grid of dimension (P_M × P_N), with P_M processor rows and P_N processor columns. We have chosen this layout because experiments have shown that it leads to the best performance and speedup values on the underlying cluster platforms. To use today's computers with deep memory hierarchies most efficiently, we have developed two variants of Level-3 implementations of the algorithm based on the sequential algorithm described in [2, 3]. The distinction between Level-2 and Level-3 algorithms is based on the operations which the algorithms use: the Level-2 algorithms use matrix-vector operations, and the Level-3 algorithms use matrix-matrix operations in addition [4].

Early work on parallel Gram-Schmidt orthogonalization is described in [5–7]. Recent work on the parallel modified Gram-Schmidt algorithm often concentrates on the case of a matrix to orthogonalize with m ≫ n. In this situation row-wise and block or cyclic column-wise distributions are often used. Our investigations have shown that for square matrices a block-cyclic distribution over a two-dimensional processor grid gives the best performance for the Level-3 algorithms. In [8, 9] different parallel block Gram-Schmidt algorithms for a row-wise distribution have been presented, where the vectors are grouped into blocks of non-constant size to obtain an accuracy similar to MGS. Our algorithm in this work uses the iterated classical Gram-Schmidt algorithm [10] to increase accuracy. In [11] different partitioning schemes for MGS, including row-wise, block and cyclic column-wise partitionings, were analyzed; their ring communication (pipelined algorithm) is adapted in our parallel Level-3 algorithm (Algorithm 4) for the double-cyclic distribution.
2 Sequential Algorithms

2.1 Level-2 MGS
The MGS algorithm orthogonalizes a matrix A through a series of transformations of the column vectors a_i, i = 1, ..., n, with previously computed orthogonal vectors q_j (1 ≤ j < i) in the following way:

    q_i^j = q_i^{j-1} − (q_i^{j-1}, q_{j-1}) q_{j-1} = (I − q_{j-1} q_{j-1}^T) q_i^{j-1} = P_{j-1} q_i^{j-1},   (1)

with q_i = q_i^i / ‖q_i^i‖ and q_i^0 = a_i, where the norm ‖q_i^i‖ is the element r_ii and the inner product in Formula (1) is the element r_ij of R.
The superscript j indicates that a vector has been transformed with the vectors q_1, ..., q_j. When a vector q_i^0 = a_i has been transformed by i − 1 orthogonal vectors q_j,
1 ≤ j < i, it is normalized and stored as the new vector q_i. The norm ‖q_i^i‖ is the element r_ii and the inner product (·, ·) in Formula (1) is the element r_ij of R. The orthogonalization can be seen as a series of multiplications with the transformation matrices P_j (see Formula (1)). Algorithm 1 shows the pseudo-code of the row-wise algorithm, which uses a Matlab-like notation to represent sub-matrices. In a first step, the matrix A is copied into Q, so that the algorithm applies all transformations to Q. The vector q_i is normalized in the i-th step of the main loop and then used to transform the vectors q_{i+1}, ..., q_n with matrix-vector operations. For the transformation a vector r of size n − i is calculated. This vector is then used for a rank-1 update of the column vectors q_{i+1}, ..., q_n. The symbol "−=" in Algorithm 1 means that the matrix on the left is replaced by itself minus the matrix on the right. The norm ‖q‖ and the vector r form the elements i to n of the i-th row of the matrix R and are stored in lines 4 and 10. Algorithm 1 requires 2mn² floating-point operations.
Algorithm 1: MGS

Procedure MGS(A, Q, R)
begin
1    Q = A
2    for i = 1 to n do
3      q = Q[:, i]
4      R[i, i] = ‖q‖
5      q = q / ‖q‖
6      Q[:, i] = q
7      if i < n then
8        r = Q[:, i+1 : n]^T · q
9        Q[:, i+1 : n] −= q · r^T
10       R[i, i+1 : n] = r^T
end

Algorithm 2: BMGS

Procedure BMGS(A, Q, R)
begin
1    Q = A
2    for i = 1 to n Step b do
3      Q̄ = Q[:, i : i+b−1]
4      ICGS(Q̄, R[i : i+b−1, i : i+b−1])
5      Q[:, i : i+b−1] = Q̄
6      if i < n − b then
7        R̄ = Q̄^T · Q[:, i+b : n]
8        R[i : i+b−1, i+b : n] = R̄
9        Q[:, i+b : n] −= Q̄ · R̄
end
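To make the pseudo-code of Algorithm 1 concrete, the following sketch expresses the row-wise MGS in NumPy. It is an illustration only, not the paper's implementation (which uses BLAS kernels via libgoto and MPI); the function name mgs and the use of NumPy are our own choices.

```python
import numpy as np

def mgs(A):
    """Row-wise modified Gram-Schmidt as in Algorithm 1: A = Q @ R."""
    m, n = A.shape
    Q = A.astype(float).copy()            # line 1: Q = A
    R = np.zeros((n, n))
    for i in range(n):                    # line 2: loop over columns
        q = Q[:, i]
        R[i, i] = np.linalg.norm(q)       # line 4: r_ii = ||q||
        q = q / R[i, i]                   # line 5: normalize
        Q[:, i] = q                       # line 6
        if i < n - 1:                     # lines 7-10: update trailing columns
            r = Q[:, i + 1:].T @ q        # inner products (row i of R)
            Q[:, i + 1:] -= np.outer(q, r)
            R[i, i + 1:] = r
    return Q, R
```

For a random input, e.g. Q, R = mgs(np.random.rand(300, 300)), the residuals ‖A − QR‖ and ‖I − Q^T Q‖ should both be small for well-conditioned A.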
2.2 Level-3 MGS
A Level-3 formulation of the MGS algorithm can be derived by replacing many transformations of Q through P_j (see Formula (1)) with transformations by matrices L_k, which are defined as follows:

    L_k = I − Q̄_k Q̄_k^T,   1 ≤ k < n/b.   (2)
The block matrix Q̄_k consists of b orthogonal vectors Q̄_k = (q_{i1}, ..., q_{i2}), i1 = (k − 1)·b + 1, i2 = kb, which can be calculated with a Level-2 algorithm. Without loss of generality n can be assumed to be a multiple of b. To obtain numerical properties comparable to the Level-2 MGS algorithm it is important to use a Level-2 algorithm which minimizes the norm ‖Q̄_k^T Q̄_k − I‖. [2] suggests using the iterated classical Gram-Schmidt algorithm (ICGS) for the Level-2 transformation, which orthogonalizes the columns of Q̄_k a second time with the classical Gram-Schmidt algorithm (CGS) depending on an easy-to-compute re-orthogonalization criterion [10].
Algorithm 3: applyTransform

Procedure applyTransform(Q^l, R^l, Q̄^l, s, e)
1    if MyRank owns parts of columns of Q between s and e then
2      - generate a local block of inner products with a local matrix multiplication of the parts of Q̄^l transposed and the parts of Q^l in the range from s to e, and store it in R̄^l
3      - sum R̄^l in the process column and broadcast the result in the process column
4      - local rank-k update of Q^l with Q̄^l and R̄^l
5      - locally store parts of R̄^l into R^l
Algorithm 4: PBMGS

Procedure PBMGS(A^l, Q^l, R^l)
begin
1    - copy local parts of A^l into Q^l
2    for i = 1 to n − b Step b do
3      k = (i − 1)/b mod P_N
4      if MyRank is in process column k then
5        - Level-2 factorization of Q in the range of vectors i to i + b − 1; store the result in Q̄^l and the generated elements of R
6      if i < n − b then
7        - broadcast Q̄^l within the process row; the roots of the broadcast are the processes of process column k
8        - applyTransform(Q^l, R^l, Q̄^l, i + b, n)
end
Algorithm 5: PBMGS2

Procedure PBMGS2(A^l, Q^l, R^l)
begin
1    - copy local parts of A^l into Q^l
2    for i = 1 to n − b Step b do
       k = (i − 1)/b mod P_N
3      if MyRank is in process column k then
4        if i > b then
5          - applyTransform(Q^l, R^l, Q̄^l_old, i, i + b − 1)
6        - Level-2 factorization of Q in the range of vectors i to i + b − 1; store the result in Q̄^l_new
7        - start broadcast of Q̄^l_new asynchronously in the process row; the roots of the broadcast are the processes of process column k
8      if i > b AND i < n − b then
9        - applyTransform(Q^l, R^l, Q̄^l_old, i + b, n)
10     - end broadcast of Q̄^l_new
11     Q̄^l_old = Q̄^l_new
end
A pseudo-code implementation of a Level-3 version of the MGS algorithm is presented in Algorithm 2. In line 4 the ICGS algorithm is called to form Q̄, which has been copied from columns i to i + b − 1 of Q in line 3. While calculating Q̄ the upper triangular elements of a b × b sub-matrix of R are created and stored in R[i : i+b−1, i : i+b−1]. The block matrix Q̄ is copied back into Q after the orthogonalization in line 5. After the ICGS phase, the trailing matrix Q[:, i+b : n] is transformed by calculating a block of inner products through a matrix-matrix multiplication in line 7 of Algorithm 2. The resulting matrix R̄ has dimension b × (n − i − b + 1) and stores the elements of the sub-matrix R[i : i+b−1, i+b : n] of R in line 8. In line 9 a rank-k update of the columns i + b to n of Q is made with Q̄ and R̄ to complete the block transformation.

The number of floating-point operations of the Level-3 MGS algorithm depends on the number of orthogonalization steps in the ICGS step. If only one orthogonalization is required, the number of floating-point operations is nearly the same as for MGS; if every vector needs a second orthogonalization step, the number of floating-point operations is about 2mn(n + b). Figure 2 D shows the ratio between the number of re-orthogonalizations and the number of columns for different 300 × 300 random matrices with different condition numbers. The matrices have been generated with the routine DLATMS of the LAPACK [12] testing library. A value of 1 means that every vector has to be orthogonalized twice in the ICGS step, whereas a value of 0 means that every vector has to be orthogonalized only once. There is a strong dependence of this ratio on the block size and the condition number.
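The following sketch shows how Algorithm 2 can be built on top of an ICGS kernel. The re-orthogonalization test used here (accept a column once its norm does not drop below 1/√2 of its norm before the CGS step) is one common choice; the paper follows the criterion of Hoffmann [10], which may differ in detail, and all names below are ours.

```python
import numpy as np

def icgs(Qb, omega=1.0 / np.sqrt(2.0)):
    """Orthonormalize the b columns of Qb in place; return the b x b R-block."""
    m, b = Qb.shape
    Rb = np.zeros((b, b))
    for j in range(b):
        v = Qb[:, j].copy()
        norm_before = np.linalg.norm(v)
        for _ in range(2):                        # at most one re-orthogonalization
            s = Qb[:, :j].T @ v                   # CGS projections on previous columns
            v -= Qb[:, :j] @ s
            Rb[:j, j] += s
            norm_after = np.linalg.norm(v)
            if norm_after > omega * norm_before:  # little cancellation: accept
                break
            norm_before = norm_after
        Rb[j, j] = np.linalg.norm(v)
        Qb[:, j] = v / Rb[j, j]
    return Rb

def bmgs(A, b):
    """Blocked MGS as in Algorithm 2; assumes b divides n."""
    m, n = A.shape
    Q = A.astype(float).copy()                             # line 1
    R = np.zeros((n, n))
    for i in range(0, n, b):                               # line 2
        R[i:i + b, i:i + b] = icgs(Q[:, i:i + b])          # lines 3-5 (slice is a view)
        if i < n - b:                                      # line 6
            Rbar = Q[:, i:i + b].T @ Q[:, i + b:]          # line 7
            R[i:i + b, i + b:] = Rbar                      # line 8
            Q[:, i + b:] -= Q[:, i:i + b] @ Rbar           # line 9
    return Q, R
```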
3 Parallel Algorithms
We describe only the parallel implementation of the two Level-3 algorithms, but in the next section we also show the performance of the Level-2 algorithm. The parallel Level-2 algorithm can be derived by setting the block length of the Level-3 algorithm to one. For the parallel Level-3 algorithm we have chosen the column-block length of the underlying block-cyclic mapping as the block length of the Level-3 algorithm. This reduces the number of messages to send in the Level-2 phase and is typical of many other parallel software packages.

A pseudo-code of the parallel Level-3 algorithm is given as Algorithm 4; it is a straightforward parallelization of Algorithm 2. A superscript l denotes the local parts of a distributed matrix or vector. In the following we describe the main steps of Algorithm 4:

1. ICGS step: This step performs the Level-2 transformations in lines 4-5. Since the block length of the algorithm is equal to the block length of the distribution, this step is carried out in only one process column k = (i − 1)/b mod P_N, which changes cyclically in every step. The ICGS step requires communication with combined reduction/broadcast operations for building vector norms and inner products of column vectors through matrix-vector products.
2. Broadcast: The matrix Q̄ generated in the previous step is needed to form the product in line 7 of the sequential algorithm. Since Q̄ is located in only one column, every process in the process column k which holds Q̄ must distribute its local parts of Q̄ to all other processes in its process row.

3. Transformation: The transformation of the columns of Q in the range i + b to n with R̄ and Q̄ in lines 6 to 9 of the sequential Algorithm 2 is described in Algorithm 3.

The parameters of Algorithm 3 for the transformation of the matrix Q are the local parts of Q̄, R, Q, and a range (global indices) of column vectors of Q to which the transformation is applied. Line 1 of Algorithm 3 checks whether a processor column runs out of data. The steps of Algorithm 3 are as follows:

1. Matrix multiplication: The matrix multiplication of line 7 (Alg. 2) can be parallelized by decomposing the inner products which build the global matrix product. It consists of two steps:
   (a) Local matrix multiplication: Build local inner products through a local matrix multiplication of the local parts of Q̄ (transposed) with local parts of Q.
   (b) Reduction/broadcast: A summation of the local inner products by a global communication operation in the process columns yields the global inner products. Since the result is needed in the next step, a broadcast of the results is done within the process columns.

2. Local rank-k update: The operand matrices Q̄ and R̄ are used to perform a global Level-3 update of the matrix Q through local updates.

3. Building R: When the parameters of the distribution (grid dimension, block length) are the same for the matrices Q and R, this step needs no redistribution of the matrix R̄, which holds parts of R. Some care must be taken when copying local parts of R̄ into local parts of R: since the row-block length may differ from b, rows of R̄ may have to be stored on different process rows.
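A conceptual mpi4py/NumPy sketch of this transformation step, together with the block-cyclic ownership computation used in Algorithm 4, is given below. The paper's implementation is in C with MPI, a ring broadcast and a separate tree reduce/broadcast; here a single Allreduce over a process-column communicator is used for brevity, and all names (owner_process_column, apply_transform, col_comm, ...) are ours.

```python
import numpy as np
from mpi4py import MPI

def owner_process_column(i, b, PN):
    """Process column owning the panel that starts at global column i
    (1-based), for block length b and PN process columns (line 3 of Alg. 4)."""
    return ((i - 1) // b) % PN

def apply_transform(Qbar_loc, Qtrail_loc, col_comm):
    """Local part of Algorithm 3: Qbar_loc holds this process's rows of the
    broadcast panel, Qtrail_loc its rows of the trailing columns, and
    col_comm is a communicator over the P_M processes of one process column."""
    # (a) local block of inner products (partial sums over the local rows)
    Rbar_loc = Qbar_loc.T @ Qtrail_loc
    # (b) sum over the process column; Allreduce combines the reduction and
    #     the subsequent broadcast described in the text
    col_comm.Allreduce(MPI.IN_PLACE, Rbar_loc, op=MPI.SUM)
    # (c) local rank-k update of the trailing columns
    Qtrail_loc -= Qbar_loc @ Rbar_loc
    return Rbar_loc    # local copy of R[i:i+b-1, i+b:n], to be stored into R
```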
3.1 Communication Patterns
Efficient communication is essential for reaching high efficiency on today's hardware architectures. A property of the parallel algorithms presented above is that communication takes place only within the process rows or within the process columns, which can potentially proceed independently. In [11] the runtime of the algorithms is analyzed for the strict column-block-cyclic and the row-block-wise case. For the column-cyclic distribution they use a ring broadcast for communication and for the row-wise case a tree implementation for reduce/broadcast. Since our algorithms use the block-cyclic distribution, we combine both communication patterns: for the broadcast within process rows we use a ring broadcast, and for the reduce/broadcast within the process columns we use a tree implementation.
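A simplified, non-pipelined sketch of such a ring broadcast with mpi4py is shown below; the actual implementation may segment the message to pipeline it, and the function name ring_bcast is ours.

```python
from mpi4py import MPI

def ring_bcast(buf, root, row_comm):
    """Pass buf (a NumPy array of identical shape on every process of
    row_comm) around the ring, starting at root."""
    rank = row_comm.Get_rank()
    size = row_comm.Get_size()
    if size == 1:
        return
    right = (rank + 1) % size
    left = (rank - 1) % size
    if rank == root:
        row_comm.Send(buf, dest=right)      # root injects the message
    else:
        row_comm.Recv(buf, source=left)     # receive from the left neighbour
        if right != root:                   # forward unless the ring is closed
            row_comm.Send(buf, dest=right)
```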
3.2 Communication/Calculation Overlap
In the parallel algorithm PBMGS there is a strict order between computation and communication, but it is possible to overlap the two. [13] presents a scheme for block matrix factorizations and gives results for QR, LU and Cholesky factorizations. This scheme was specified for pure column-block-cyclic distributions but can be extended to block-cyclic distributions as well. The idea of [13], transferred to Level-3 MGS, is not to transform the column vectors of Q with Q̄ on all process columns fully in one step. The process column which is responsible for the next Level-2 transform updates only a small stripe of length b with Q̄ and then performs the Level-2 transform on that stripe. The result of this transform, Q̄_new, is sent via an asynchronous broadcast while the other process columns perform the update with the old Q̄. In Algorithm 5 we adapt the idea of [13] to the Level-3 modified Gram-Schmidt orthogonalization but implement it over a two-dimensional processor grid using a block-cyclic distribution. Algorithm 5 uses the block matrices Q̄_old and Q̄_new. The matrix Q̄_old is used for the Level-3 transformations in the i-th step of the main loop, while Q̄_new is calculated and distributed in that step. The asynchronous broadcast is implemented through a sequential broadcast using standard MPI_Isend/MPI_Irecv routines for asynchronous communication.
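The overlap can be sketched with non-blocking point-to-point calls as follows; this mirrors the MPI_Isend/MPI_Irecv approach described above, reuses the apply_transform helper sketched earlier, and the names start_row_bcast, Qbar_new_loc and Qbar_old_loc are ours.

```python
from mpi4py import MPI

def start_row_bcast(buf, root, row_comm):
    """Start a simple asynchronous (linear) broadcast within a process row;
    returns the pending requests so communication can overlap computation."""
    rank, size = row_comm.Get_rank(), row_comm.Get_size()
    if rank == root:
        return [row_comm.Isend(buf, dest=d) for d in range(size) if d != root]
    return [row_comm.Irecv(buf, source=root)]

# Schematic use inside the main loop of Algorithm 5 (PBMGS2):
#   reqs = start_row_bcast(Qbar_new_loc, root=k, row_comm=row_comm)   # line 7
#   apply_transform(Qbar_old_loc, Qtrail_loc, col_comm)               # lines 8-9
#   MPI.Request.Waitall(reqs)                                         # line 10
#   Qbar_old_loc = Qbar_new_loc                                       # line 11
```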
4 Performance Evaluation
The performance evaluation has been carried out on two different machines: a Beowulf cluster (CLiC) with 528 Pentium III processors running at 800 MHz connected with Fast Ethernet, and an SMP cluster of 16 dual XEON nodes running at 2 GHz connected with SCI in a 4 × 4 torus. The CLiC cluster uses LAM version 6.5.2 and the XEON cluster uses ScaMPI for message passing. Both clusters use libgoto [14] for local operations.

The performance in GFlop/s for 16 processors and different sizes of the input matrix is shown in Figure 1. For these measurements only one processor per node is used on the XEON cluster. A label It=1 means that the curve was measured with one orthogonalization per vector in the ICGS step, and It=2 means there have been two orthogonalizations per vector. The runtime of a typical run of the algorithms lies between these two curves, depending on the input matrix. In all figures it can be seen that the flop rate of the Level-3 algorithms increases with larger matrix sizes because the ratio of communication time to computation time shifts towards computation. The maximum flop rate measured with 16 processors is 38.4 GFlop/s on the XEON cluster and 4.73 GFlop/s on the CLiC for the PBMGS2 algorithm (see Figure 1 A2, B2). This is about 60% of the total peak performance on the XEON and about 37% on the CLiC cluster, the peak performance for 16 processors being 64 GFlop/s on the XEON and 12.8 GFlop/s on the CLiC. The performance of the algorithms would be higher for larger matrix dimensions, but the largest matrix dimension in Figure 1 (6000 × 6000) is the maximum needed for the application of the algorithms described in [1].
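As a reading aid for these figures, the reported GFlop/s values follow from the 2mn² operation count of Section 2.1 (re-orthogonalizations in the ICGS step add to this); the small helper below is hypothetical and not part of the paper's measurement code.

```python
def gflops(m, n, seconds):
    """GFlop/s based on the 2*m*n^2 flop count of (B)MGS."""
    return 2.0 * m * n * n / seconds / 1e9

# e.g. a 6000 x 6000 factorization completing in 12 s corresponds to
# gflops(6000, 6000, 12.0) == 36.0 GFlop/s.
```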
For both Level-3 algorithms the optimal processor grid configuration (diagrams A2, B2) is the square configuration for 16 processors. As can be seen, the row-wise and the column-block-cyclic distributions are not optimal for the Level-3 algorithms. In the square configuration the algorithm with the overlap of communication and calculation (Alg. 5) reaches an up to 38% better performance on the XEON cluster (m = n = 1500, It = 2) and an up to 30% better performance on the CLiC cluster (m = n = 1500, It = 2) than Algorithm 4 (see Figure 1 A2, B2). It can be observed that algorithm PBMGS2 increases the percentage of peak floating-point performance by up to 10% on the XEON and up to 4.7% on the CLiC cluster compared to algorithm PBMGS. While the performance of PBMGS2 with the optimal grid configuration is better than that of algorithm PBMGS (Alg. 4), there is a performance loss for PBMGS2 with the column-block-cyclic distribution, since the sequential broadcast in the process row, which should be overlapped with the calculation, takes too long in this configuration for the overlap to pay off.

The curves with two iterations per vector (It=2) in the ICGS step show a slight performance decrease for both Level-3 implementations (PBMGS, PBMGS2). This is expected since there are more floating-point operations and more communication. For the row-wise distribution the decreased performance can be explained by the extra communication introduced by the additional inner products. For the column-block-cyclic distribution (grid: 1 × 16) idle waiting occurs since all process columns wait for the result of the ICGS step, which is calculated in a single process column. For the square process grid both effects, the higher communication time and the higher idle-waiting time, cause the difference in performance between the curves with one orthogonalization (It = 1) and with two orthogonalizations per vector (It = 2) on both machines.

Figure 2 C1 shows the performance of the algorithms for different processor numbers of the clusters in percent of the peak floating-point performance normalized to the processor number. The performance has been determined for different block sizes and grid configurations and the highest performance is shown. The CLiC cluster reaches a smaller utilization of the processors than the XEON cluster for both Level-3 algorithms when more than one processor is used. The reason for this is the worse ratio of the performance of local operations to the network bandwidth on the CLiC.

Figure 2 C2 shows the parallel speedup for the measurements of Figure 2 C1. The highest speedups for the algorithms PBMGS and PBMGS2 on the XEON cluster are 19.8 and 22.6, which corresponds to a parallel efficiency of about 61.9% and 70.6%, respectively. On the CLiC cluster with 64 processors the highest speedup is 19.6 for PBMGS and 21 for PBMGS2. The parallel efficiency on the CLiC is much lower: 30.7% (PBMGS) and 32.8% (PBMGS2) for 64 processors. On the CLiC cluster with more than 64 processors the performance of algorithms PBMGS and PBMGS2 is equal; the local calculations seem to be too short to overlap with the dominating communication time of the sequential column broadcast.
The performance of the Level-2 algorithm is not higher than 10% of the peak floating-point performance on both clusters (see Figures 1 and 2 C1). The reason for this is the poor cache utilization of the Level-2 operations, which stalls the processors on memory operations. With an increasing number of processors the aggregate memory bandwidth increases, and the Level-2 algorithm therefore shows a high speedup on the CLiC, as seen in Figure 2 C2. On the XEON cluster the PMGS algorithm shows a speedup drop for 24 and 32 processors because in this range two processors per node are used: since matrix-vector operations on the XEON cannot utilize the cache efficiently and the two processors per node share the memory bandwidth, the speedup drops.
[Figure 1: six panels of GFlop/s over matrix size (×1000) for 16 processors; top row (A1) CLIC row-wise, (A2) CLIC optimum, (A3) CLIC column-block-cyclic; bottom row (B1) XEON row-wise, (B2) XEON optimum, (B3) XEON column-block-cyclic; curves: PMGS, PBMGS It=1, PBMGS It=2, PBMGS2 It=1, PBMGS2 It=2.]
Fig. 1. Performance in GFlop/s for 16 processors for different matrix sizes on the CLiC (top) and on the XEON (bottom) for the row-wise (grid: 16×1), optimal and column-block-cyclic distribution (grid: 1×16) (from left to right). The optimal grid is the 4×4 grid for PBMGS and PBMGS2 and the 1×16 grid for PMGS
5 Conclusion
We have developed a Level-3 modified Gram-Schmidt algorithm (PBMGS2) which has the advantage of overlapping communication and calculation. The performance of PBMGS2 has been compared with a Level-2 (PMGS) and a Level-3 implementation (PBMGS) which were derived from a straightforward parallelization. A comparison of the two Level-3 algorithms on the two different
[Figure 2: (C1) performance in percent of peak for an m = n = 6000 matrix over the number of processors and (C2) speedup over the number of processors (curves: CLIC/XEON for PMGS, PBMGS, PBMGS2, plus ideal speedup in C2); (D) ratio of re-orthogonalizations to columns over the block length b for condition numbers 1 up to 1e7.]
Fig. 2. Performance in percent of the total peak floating-point performance for the specific processor number (left) and speedup (middle) for different numbers of processors for an m = n = 6000 matrix. Right: ratio between re-orthogonalizations in the ICGS step and the number of columns of the input matrix for the Level-3 algorithms for different block lengths and different condition numbers of an m = n = 300 random input matrix
clusters shows that the algorithm with overlapping of communication and calculation can increase the performance by up to 38% compared to the parallel Level-3 implementation without communication/calculation overlap. We have also shown that the often described row-wise and column-block-cyclic distributions for parallel modified Gram-Schmidt are not optimal for the Level-3 algorithms (PBMGS, PBMGS2) for the full-rank QR problem on cluster platforms, and we have realized a more efficient two-dimensional block-cyclic distribution for the Level-3 algorithms.
References

1. G. Radons, G. Rünger, M. Schwind, and H. Yang. Parallel Algorithms for the Determination of Lyapunov Characteristics of Large Nonlinear Dynamical Systems. In Extended Abstracts: PARA'04 Workshop on State-of-the-Art in Scientific Computing, Copenhagen, Denmark, 2004. CDROM.
2. Å. Björck. Numerics of Gram-Schmidt Orthogonalization. Linear Algebra Appl., 197–198:297–316, 1994.
3. W. Jalby and B. Philippe. Stability Analysis and Improvement of the Block Gram-Schmidt Algorithm. SIAM Journal on Scientific and Statistical Computing, 12:1058–1073, 1991.
4. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990.
5. E. L. Zapata, J. A. Lamas, F. F. Rivera, and O. G. Plata. Modified Gram-Schmidt QR Factorization on Hypercube SIMD Computers. Journal of Parallel and Distributed Computing, 12:60–69, 1991.
6. H. De Meyer, C. Niyokindi, and G. Vanden Berghe. The Implementation of Parallel Gram-Schmidt Orthogonalisation Algorithms on a Ring of Transputers. Computers and Mathematics with Applications, 25:65–72, 1993.
7. D. P. O'Leary and P. Whitman. Parallel QR Factorization by Householder and Modified Gram-Schmidt Algorithms. Parallel Computing, 16:99–112, 1990.
8. D. Vanderstraeten. A Generalized Gram-Schmidt Procedure for Parallel Applications. http://citeseer.ist.psu.edu/vanderstraeten97generalized.html.
9. D. Vanderstraeten. An Accurate Parallel Block Gram-Schmidt Algorithm without Reorthogonalization. Numer. Linear Algebra Appl., 7(4):219–236, 2000.
10. W. Hoffmann. Iterative Algorithms for Gram-Schmidt Orthogonalization. Computing, 41:334–348, 1989.
11. S. Oliveira, L. Borges, M. Holzrichter, and T. Soma. Analysis of Different Partitioning Schemes for Parallel Gram-Schmidt Algorithms. Parallel Algorithms and Applications, 14(4):293–320, April 2000.
12. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
13. K. Dackland, E. Elmroth, and B. Kågström. A Ring-Oriented Approach for Block Matrix Factorizations on Shared and Distributed Memory Architectures. In PPSC, pages 330–338, 1993.
14. K. Goto and R. van de Geijn. On Reducing TLB Misses in Matrix Multiplication. Technical Report TR-2002-55, 2002.