Sparse Approximate Inverses for Preconditioning of Linear Equations
Thomas Huckle
TU München, Institut für Informatik
January 14, 1997

1. Sparse Approximate Inverses and Linear Equations

We consider the problem of solving a system of linear equations $Ax = b$ in a parallel environment. Here, the $n \times n$ matrix $A$ is large, sparse, unstructured, nonsymmetric, and ill-conditioned. The solution method should be robust, easy to parallelize, and applicable as a black box solver. Direct solution methods like Gaussian elimination are not very effective in a parallel environment. This is caused by the sequential nature of the computation and solution of a triangular factorization $A = LR$ with lower and upper triangular matrices $L$ and $R$. Therefore, iterative solution methods like GMRES, BiCGSTAB, or QMR (see [2]) are often preferable. For many important iterative methods the convergence depends heavily on the location of the eigenvalues of $A$. Therefore, the original system $Ax = b$ is replaced by an equivalent system $MAx = Mb$ or the system $AMz = b$, $x = Mz$. Here, the matrix $M$ is called a preconditioner and has to satisfy three conditions: $AM$ (or $MA$) should have a `clustered' spectrum, $M$ should be efficiently computable in parallel, and the product of $M$ with a vector should be fast to compute in parallel. Often used preconditioners are block Jacobi preconditioners, polynomial preconditioners, or incomplete LU decompositions of $A$ [2]. But these preconditioners either lead to unsatisfactory convergence or are hard to parallelize. A very promising approach is the choice of sparse approximate inverses for preconditioning, $M \approx A^{-1}$ with $M$ sparse [10,4,3,7,6,8]. Then only matrix-vector multiplications with $M$ appear in the basic iterative scheme, and it is not necessary to solve a linear system with $M$ as in the incomplete LU approach.
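To make the role of $M$ concrete, here is a small usage sketch (Python with SciPy assumed; the diagonal matrix below merely stands in for a computed sparse approximate inverse, and all names are illustrative): the Krylov solver only ever applies $M$ through matrix-vector products.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

n = 1000
A = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Stand-in for a sparse approximate inverse M ~ A^{-1} (here: Jacobi scaling).
M = sp.diags(1.0 / A.diagonal(), format="csr")

# The solver applies M only via matrix-vector products, so any sparse M works.
x, info = bicgstab(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
```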
Obviously, $A^{-1}$ is in general a full matrix, and hence not every sparse matrix $A$ admits a good sparse approximate inverse $M$. But the following scheme for computing a sparse approximate inverse provides at least information about the quality of the approximation [7]. We can compute such a matrix $M$ by solving a minimization problem of the form $\min \|AM - I\|$ for a given sparsity pattern of $M$. By choosing the Frobenius norm we arrive at an analytical problem that is very easy to solve. Furthermore, in view of
$$\min \|AM - I\|_F^2 \;=\; \sum_{k=1}^{n} \min \|AM_k - e_k\|_2^2 ,$$
this minimization problem can be solved columnwise for $M_k$ and is therefore embarrassingly parallel. First, we consider $M$ with a prescribed sparsity pattern, e.g. $M = 0$, $M$ a diagonal matrix, or $M$ with the same sparsity pattern as $A$ or $A^T$. We get the columnwise minimization problems $\min \|AM_k - e_k\|_2$, $k = 1, 2, \ldots, n$, with a prescribed sparsity pattern for the column vector $M_k$. Let us denote by $J_k$ the small index set of allowed nonzero entries in $M_k$, and the "reduced" vector of the nonzero entries by $\hat M_k := M_k(J_k)$. The corresponding submatrix of $A$ is $A(:, J_k)$, and most of the rows of $A(:, J_k)$ will be zero in view of the sparsity of $A$. Let us denote the row indices of nonzero rows of $A(:, J_k)$ by $I_k$, the corresponding submatrix by $\hat A = A(I_k, J_k)$, and the corresponding reduced vector by $\hat e_k = e_k(I_k)$. Hence, for the $k$-th column of $M$ we have to solve the small least squares problem
$$\min \|\hat A \hat M_k - \hat e_k\| .$$
There are mainly three different approaches for solving this LS-problem. We can compute a QR-decomposition of $\hat A$ based on (1) Householder matrices or (2) the Gram-Schmidt process, or we can solve (3) the normal equations $\hat A^T \hat A \hat M_k = \hat A^T \hat e_k$ iteratively using the preconditioned conjugate gradient algorithm. For the general case it is not possible to prescribe a promising sparsity pattern without causing $J_k$ and $I_k$ to become very large. This would result in large LS-problems and a very expensive algorithm. Therefore, for a given index set $J_k$ with optimal solution $M_k(J_k)$ we need a dynamic procedure to find new promising indices that should be added to $J_k$. Then we update $I_k$ and solve the enlarged LS-problem until the residual $r_k = AM_k - e_k$ is small enough or until $J_k$ gets too large. In general the start sparsity pattern should be $J_k = \emptyset$; only for matrices with nonzero diagonal entries can we set $J_k = \{k\}$ in the beginning.
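As a minimal illustration of the reduced problem, the following sketch (Python with NumPy/SciPy assumed; the function name and interface are hypothetical) extracts $\hat A = A(I_k, J_k)$ and $\hat e_k = e_k(I_k)$ for one column with a prescribed pattern $J_k$ and solves the small least-squares problem with a dense solver, rather than with one of the three update schemes discussed below.

```python
import numpy as np
import scipy.sparse as sp

def spai_column(A_csc, k, Jk):
    """Solve min ||A(I_k, J_k) m - e_k(I_k)|| for one column of M (sketch)."""
    Jk = np.asarray(Jk)
    # I_k = indices of the nonzero rows of A(:, J_k)
    Ik = np.unique(A_csc[:, Jk].nonzero()[0])
    A_hat = A_csc[Ik][:, Jk].toarray()          # small dense submatrix
    e_hat = (Ik == k).astype(float)             # e_k restricted to I_k
    m_hat, *_ = np.linalg.lstsq(A_hat, e_hat, rcond=None)
    # Scatter the reduced solution back into a sparse column of M
    Mk = sp.csc_matrix((m_hat, (Jk, np.zeros(len(Jk), dtype=int))),
                       shape=(A_csc.shape[0], 1))
    return Mk, Ik
```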
We use a hierarchy of three different criteria for finding new promising indices to be added to $J_k$. As a global a priori criterion for $M$ we only allow indices that appear in $(A^T A)^m A^T$ for a given $m$, e.g. $m = 0, 1, 2, 3$, or $4$. As a heuristic justification for this restriction let us consider the equation
$$A^{-1} = (A^T A)^{-m-1} (A^T A)^m A^T ,$$
where the diagonal entries of $(A^T A)^{-m-1}$ are nonzero (moreover, the maximum element of this matrix is a diagonal entry), and therefore the sparsity pattern of $(A^T A)^m A^T$ is contained in the sparsity pattern of $A^{-1}$ (here we neglect possible cancellation). Similarly, by considering the Neumann series for $A^{-1}$, the sparsity pattern of $(I + A)^m$ also seems to be a good a priori choice for the sparsity structure of $A^{-1}$. As a compromise, the structure of $(I + A + A^T)^m$ can be used. Such global criteria are very helpful for distributing the data to the corresponding processors in a parallel environment. For the maximum allowed index set $J_{\max}^k$ we get a row index set $I_{\max}^k$ and a submatrix $A(I_{\max}^k, J_{\max}^k)$, which represents the part of $A$ that is necessary for the corresponding processor to compute $M_k$. If one processor has to compute $M_k$ for several $k \in K$, then this processor only needs the submatrix of $A$ that is given by the column indices $\bigcup_{k \in K} J_{\max}^k$ and the row indices $\bigcup_{k \in K} I_{\max}^k$.

Now let us assume that we have already computed an optimal solution $M_k$ with residual $r_k$ of the LS-problem relative to an index set $J_k$. As a second, local a priori criterion we consider only indices $j$ with $(r_k^T A e_j)^2 > 0$. We will see later that this condition guarantees that the new index set $J_k \cup \{j\}$ leads to a smaller residual $r_k$. The final selection of new indices out of the remaining index set, after applying the a priori criteria, is ruled by (a) a one-dimensional minimization $\min_{\mu_j} \|A(M_k + \mu_j e_j) - e_k\|$, or (b) the full minimization problem $\min_{J_k \cup \{j\}} \|A \tilde M_k - e_k\|$. In case (a) we consider
$$\min_{\mu_j \in \mathbb{R}} \|A(M_k + \mu_j e_j) - e_k\|_2 \;=\; \min_{\mu_j \in \mathbb{R}} \|r_k + \mu_j A e_j\|_2 \;=:\; \rho_j .$$
For every j the solution is given by
$$\mu_j = -\,\frac{r_k^T A e_j}{\|A e_j\|_2^2}
\qquad\text{and}\qquad
\rho_j^2 = \|r_k\|_2^2 - \frac{(r_k^T A e_j)^2}{\|A e_j\|_2^2} \, .$$
Hence, indices with $(r_k^T A e_j)^2 = 0$ lead to no improvement in the 1-D minimization. We arrange the new possible indices $j$ according to the size of their corresponding residuals $\rho_j$.
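A small sketch of criterion (a), assuming for simplicity that $A$ is available as a dense NumPy array (in practice only the sparse columns $Ae_j$ are touched); the function name is illustrative.

```python
import numpy as np

def candidate_scores_1d(A, r_k, candidates):
    """For each candidate j, compute the 1-D minimizer mu_j and rho_j^2 (sketch)."""
    scores = {}
    rr = r_k @ r_k
    for j in candidates:
        Aej = A[:, j]
        proj = r_k @ Aej
        if proj == 0.0:
            continue                       # local a priori criterion: no improvement
        denom = Aej @ Aej
        scores[j] = (-proj / denom, rr - proj**2 / denom)   # (mu_j, rho_j^2)
    # Smallest rho_j^2 first: these are the most profitable indices
    return dict(sorted(scores.items(), key=lambda kv: kv[1][1]))
```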
In case (b) we want to determine the optimal residual value $\sigma_j$ that we get by minimizing $\|A \tilde M_k - e_k\|$ over the full index set $J_k \cup \{j\}$ with known index set $J_k$ and a new index $j \notin J_k$. Surprisingly, it is not too expensive to derive $\sigma_j$ from this higher-dimensional minimization [6]. The additional cost for every $j$ is mainly one additional product with the orthogonal matrix related to the old index set $J_k$. It holds [6,8]
$$\sigma_j^2 = \|r_k\|_2^2 - \frac{(r_k^T A e_j)^2}{\|A e_j\|_2^2 - \|Y^T \hat A \hat e_j\|_2^2}\, ,$$
with $\hat A = Q \binom{R}{0} = Y R$ and $Q = (\,Y \;\; Z\,)$. Again we order the possible new indices according to the size of $\sigma_j$. Similarly to (a), indices with $(r_k^T A e_j)^2 = 0$ lead to no improvement in the residual and can be neglected. Now we have a sorted list of possible new indices. Starting with the smallest $\rho_j$ (resp. $\sigma_j$) we can add one or more new indices to $J_k$ and solve the enlarged LS-problem. Numerical examples show that it saves operations if more new indices are added per step. Additionally, many evaluations of $\rho_j$ or $\sigma_j$ can be saved if we allow only a limited number of new index candidates $j$. To this end we order the index candidates $j$ according to their appearance as column indices in rows of $A$ that correspond to the maximum entries of the residual $r_k$. For example, if the $m$-th entry of $r_k$ has the maximum absolute value, then we first consider the column indices in the $m$-th row of $A$. These indices are necessary to achieve a reduction of the maximum entry of $r_k$. In the same way, new indices are added by considering the second largest entry of $r_k$, and so on, until a prescribed number of candidates is reached.
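The following sketch (Python/NumPy, dense linear algebra, illustrative names and default parameters) puts the pieces together for one column: it ignores the restriction to the rows $I_k$ and the QR updating discussed in the next section, and simply re-solves the least-squares problem after each index selected by criterion (a).

```python
import numpy as np

def spai_column_adaptive(A, k, max_fill=20, tol=0.3):
    """Adaptive pattern search for column M_k via criterion (a) (dense sketch)."""
    n = A.shape[0]
    e_k = np.zeros(n)
    e_k[k] = 1.0
    Jk = [k] if A[k, k] != 0 else []
    M_k = np.zeros(n)
    while True:
        if Jk:
            m, *_ = np.linalg.lstsq(A[:, Jk], e_k, rcond=None)
            M_k = np.zeros(n)
            M_k[Jk] = m
        r_k = A @ M_k - e_k
        if np.linalg.norm(r_k) <= tol or len(Jk) >= max_fill:
            return M_k
        # Pick the index with the smallest rho_j^2 among the remaining candidates
        best_j, best_rho2 = None, np.inf
        for j in range(n):
            if j in Jk:
                continue
            Aej = A[:, j]
            proj = r_k @ Aej
            if proj == 0.0:
                continue                      # cannot reduce the residual
            rho2 = r_k @ r_k - proj**2 / (Aej @ Aej)
            if rho2 < best_rho2:
                best_j, best_rho2 = j, rho2
        if best_j is None:
            return M_k                        # no profitable index left
        Jk.append(best_j)
```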
2. Computational Aspects

In the dynamic procedure for finding new profitable indices, the access to the matrix $A$, resp. $\hat A$, is columnwise for computing the inner products $r_k^T A e_j$. Furthermore, for every row of $A$ we need the list of occurring column indices; this is equivalent to finding all indices $j$ with $r_k^T A e_j \neq 0$. Hence, $A$ should be stored in Compressed Column Storage, and additionally we should have the vector of column indices of the Compressed Row Storage [2].
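In SciPy terms, the two access patterns might look as follows (a sketch; the matrix is just random test data).

```python
import scipy.sparse as sp

A_csc = sp.random(1000, 1000, density=0.01, format="csc", random_state=0)
A_csr = A_csc.tocsr()      # only its indptr/indices arrays are really needed

j = 42
col_j = A_csc[:, j]        # columnwise access, e.g. for r_k^T (A e_j)

i = 7                      # e.g. the row of the largest residual entry
row_cols = A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i + 1]]
# row_cols lists the candidate column indices j with A(i, j) != 0
```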
Iterative Solution

As iterative solution method (3), the conjugate gradient algorithm is applied to the normal equations $\hat A^T \hat A \hat M_k = \hat A^T \hat e_k$. No further evaluations are necessary if the two matrices are not multiplied out, i.e. the product $\hat A^T \hat A$ is never formed explicitly. As a preconditioner we can use $\mathrm{diag}(\hat A^T \hat A)$. The matrix-vector multiplications in the cg-method can be done in sparse mode in two steps, $y = \hat A x$ and $z = \hat A^T y$, where $x$, $y$, and $z$ are small dense vectors. The advantage of this approach is that we need no additional memory, no updates, and no old information if we want to solve the LS-problem for an enlarged index set $J_k$. Furthermore, sometimes we want to find factorized sparse approximate inverses that minimize, for example,
$$\|A M_1 \cdots M_l - I\|_F \, .$$
Hence, in every step, for given $M_1, \ldots, M_{l-1}$, we have to compute the new sparse approximate inverse of $A M_1 \cdots M_{l-1}$. If we solve the resulting least-squares problems iteratively, we can avoid forming the explicit product $A M_1 \cdots M_{l-1}$, which will be much denser than the original $A$. The disadvantage of this iterative method is that we cannot reuse old results connected with the index set $J_k$ when solving the enlarged LS-problem. This makes the method more expensive than the QR-based solvers, especially if we add only one new index in every step. Hence, this iterative solution method is attractive only if more than one new index per step is added, because then there are fewer LS-problems to solve.
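A sketch of this approach for one small dense submatrix $\hat A$ (NumPy assumed; the function name is illustrative): CG is applied to the normal equations without forming $\hat A^T \hat A$, with $\mathrm{diag}(\hat A^T \hat A)$ as preconditioner.

```python
import numpy as np

def cgnr(A_hat, e_hat, tol=1e-10, maxit=200):
    """Preconditioned CG on A_hat^T A_hat m = A_hat^T e_hat (sketch)."""
    m = np.zeros(A_hat.shape[1])
    d = np.einsum("ij,ij->j", A_hat, A_hat)   # diag(A_hat^T A_hat)
    d[d == 0.0] = 1.0
    r = A_hat.T @ (e_hat - A_hat @ m)         # residual of the normal equations
    z = r / d
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        y = A_hat @ p                         # the two products y = A_hat x ...
        q = A_hat.T @ y                       # ... and z = A_hat^T y
        alpha = rz / (p @ q)
        m += alpha * p
        r -= alpha * q
        if np.linalg.norm(r) <= tol:
            break
        z = r / d
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return m
```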
Householder Orthogonalization

Now let us consider the use of an implicit QR-decomposition based on Householder matrices. First, we assume that at each step we add only one new index to $J_k$. Then we begin with the index set $J_k^1 = \{j_1\}$ and $I_k^1$, and we have to compute the QR-decomposition of $\hat A_1 = A(I_k^1, j_1)$. In the Householder approach we use one elementary Householder matrix $H_1 = I - 2 q_1 q_1^T$ that transforms the matrix $\hat A_1$ via $H_1 \hat A_1 = \binom{R_1}{0}$ to upper triangular form. In the second step we add one profitable new index $j_2$ to $J_k^1$, which leads to the new matrix
$$\hat A_2 = \begin{pmatrix} \hat A_1 & \hat B_1 \\ 0 & \hat B_2 \end{pmatrix},$$
where $\hat B_1$ is the part of the new column that corresponds to indices in $I_k^1$, while $\hat B_2$ is related to new indices that are only induced by the new column $j_2$. Now we have to update the QR-decomposition. Therefore, we have to compute the QR-decomposition of
$$\begin{pmatrix} \binom{R_1}{0} & \binom{H_1 \hat B_1}{\hat B_2} \end{pmatrix}
= \begin{pmatrix} R_1 & \tilde B_1 \\ 0 & \tilde B_2 \end{pmatrix} .$$
We compute the new Householder vector $q_2$ related to the matrix $\tilde B_2$ with $H_2 \tilde B_2 = \binom{R_2}{0}$. This leads to the equation
$$\begin{pmatrix} 1 & 0 \\ 0 & H_2 \end{pmatrix}
\begin{pmatrix} R_1 & \tilde B_1 \\ 0 & \tilde B_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & H_2 \end{pmatrix}
\begin{pmatrix} H_1 & 0 \\ 0 & I \end{pmatrix} \hat A_2
= \begin{pmatrix} R_2 \\ 0 \end{pmatrix} .$$
We can write this equation in a more convenient form by adding zeros to the vectors $q_1$ and $q_2$, extending them to the row length of $\hat A_2$. Then we get
$$\tilde H_2 \tilde H_1 \hat A_2 = \begin{pmatrix} R_2 \\ 0 \end{pmatrix}
\qquad\text{with}\qquad
(\,\tilde q_1 \;\; \tilde q_2\,) = \begin{pmatrix} q_1 & 0 \\ 0 & q_2 \end{pmatrix},$$
where $\tilde q_1$ carries the kernel $q_1$ followed by zeros for the new rows, and $\tilde q_2$ is zero above its kernel $q_2$. If we continue to add new indices to $J_k$, and to extend the vectors $q_i$ and $\tilde q_i$ to the corresponding row length, then we get the matrix
$$\hat A_m = \begin{pmatrix}
\hat A_1 & \ast & \cdots & \ast \\
0 & B_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ast \\
0 & \cdots & 0 & B_m
\end{pmatrix} \qquad (\ast)$$
where $B_l$ denotes the part of column $j_l$ in the rows it newly introduces, and $\ast$ stands for its entries in the rows already present, and Householder vectors of the form
$$\tilde H_m \cdots \tilde H_1 \hat A_m = \begin{pmatrix} \tilde R_m \\ 0 \end{pmatrix}
\qquad\text{with}\qquad
(\,\tilde q_1 \;\; \tilde q_2 \;\; \cdots \;\; \tilde q_m\,) =
\begin{pmatrix}
q_1 & 0 & \cdots & 0 \\
0 & q_2 & \ddots & \vdots \\
\vdots & 0 & \ddots & 0 \\
\vdots & \vdots & \ddots & q_m \\
0 & 0 & \cdots & 0
\end{pmatrix} .$$
This matrix is lower triangular, and additionally the last entries in every column are zero. Hence, we have to store only the short nonzero kernels $q_l$ and the lengths of the corresponding vectors. Every multiplication with $\tilde H_l$ can then be restricted to the nonzero segment, avoiding superfluous arithmetic operations. In this way we can use the sparsity of $A$ in the Householder matrices $\tilde H_l$. To solve the LS-problem, we have to compute
$$Q^T \hat e_k = \tilde H_m \cdots \tilde H_1 \hat e_k \, .$$
We can update this vector for every new index $j_{m+1}$, which takes only one product with the new Householder matrix $\tilde H_{m+1}$. If we add more than one new index per step, we can obtain the same sparsity structure in $\hat A_m$ and the Householder matrices $\tilde H_l$ if we partition the index set $I_k$ in the form $I_k = I_k^1 \cup I_k^2 \cup \cdots \cup I_k^m$, where $I_k^l$ is the set of new row indices induced by the new column index $j_l$. This approach is numerically stable and can be used in connection with the index strategy (a). To use criterion (b) for choosing new indices, an explicit expression for the first columns $Y$ of the orthogonal matrix $Q$ is needed. This leads to additional costs, and Gram-Schmidt methods, which automatically generate an explicit representation of $Y$, are better suited for this case.
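As a small illustration of this storage scheme, the following sketch (NumPy; names illustrative) applies $Q^T \hat e_k = \tilde H_m \cdots \tilde H_1 \hat e_k$ using only the stored kernels $q_l$ and their starting positions, so that each reflection touches nothing but its nonzero segment. After adding a new index one simply appends the new pair and applies it to the already transformed right-hand side.

```python
import numpy as np

def apply_householder_kernels(kernels, x):
    """Apply Q^T = H~_m ... H~_1 to x; kernels is a list of (start, q_l) pairs
    with each kernel q_l of unit norm (sketch)."""
    y = np.array(x, dtype=float)
    for start, q in kernels:                  # in the order l = 1, ..., m
        seg = y[start:start + len(q)]         # only these rows are touched
        seg -= 2.0 * q * (q @ seg)            # (I - 2 q q^T) on the segment
    return y
```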
Gram-Schmidt Orthogonalization

Here, in the original form, we orthogonalize the new column $A e_j$ against all the previously orthogonalized columns. Again, let us assume that the index set $I_k$ is of the form $I_k = I_k^1 \cup \cdots \cup I_k^m$ with $I_k^l$ corresponding to the column index $j_l$. The QR-decomposition is of the form
$$\hat A = Q \begin{pmatrix} R \\ 0 \end{pmatrix}
= (\,Y \;\; Z\,) \begin{pmatrix} R \\ 0 \end{pmatrix} = Y R \, .$$
The orthogonalization of the new column is evaluated by
$$r_j = Y^T (\hat A \hat e_j)\, , \qquad q_j = \hat A \hat e_j - Y r_j\, , \qquad r_{jj} = \|q_j\|\, , \qquad q_j = q_j / r_{jj}\, .$$
Then the new matrix $R$ is given by
$$R_j = \begin{pmatrix} R_{j-1} & r_j \\ 0 & r_{jj} \end{pmatrix},$$
and the matrix $Y$ is built up from the vectors $q_j$ and is of the form $(\ast)$ with the same sparsity structure as $\hat A$.
Unfortunately, the classical Gram-Schmidt process is numerically unstable, and in many examples it will be necessary to use a stable variant like the modified Gram-Schmidt algorithm or iteratively refined methods like the algorithm introduced by Daniel, Gragg, Kaufman, and Stewart [5,6]. Note that with these Gram-Schmidt-like approaches $Y$ will have the special sparsity pattern $(\ast)$ if we order the index set $I_k$ relative to the new columns of $\hat A$. For the LS-problem we have to solve a linear system with $R$ and the right-hand side $Y^T \hat e_k$, where $Y$, the first part of $Q$, is given explicitly. For the multiplications with $Y^T$ and $Y$ one can use the sparsity structure of $Y$ if one stores a vector containing the number of nonzero entries in every column of $Y$, resp. $\hat A$. In the same way, for evaluating $\sigma_j$ we need one matrix-vector product $Y^T (\hat A \hat e_j)$ and can take advantage of the structure of $Y$ and the sparsity of $\hat A \hat e_j$. Hence, for criterion (b) this approach is very favourable, but to be numerically stable we need robust generalizations of the Gram-Schmidt orthogonalization.
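A sketch of the Gram-Schmidt update of the thin factorization $\hat A = Y R$ when one reduced column is added (NumPy assumed; classical Gram-Schmidt for brevity, whereas a modified or reorthogonalized variant would be used in practice, and growth of $I_k$ would be handled by padding $Y$ with zero rows).

```python
import numpy as np

def gram_schmidt_extend(Y, R, new_col):
    """Extend A_hat = Y R by the new reduced column A_hat e_j (sketch)."""
    if Y is None:                              # first column of the factorization
        r_jj = np.linalg.norm(new_col)
        return (new_col / r_jj).reshape(-1, 1), np.array([[r_jj]])
    r_j = Y.T @ new_col                        # r_j = Y^T (A_hat e_j)
    q_j = new_col - Y @ r_j                    # q_j = A_hat e_j - Y r_j
    r_jj = np.linalg.norm(q_j)
    q_j = q_j / r_jj
    R_new = np.block([[R, r_j.reshape(-1, 1)],
                      [np.zeros((1, R.shape[1])), np.array([[r_jj]])]])
    return np.column_stack([Y, q_j]), R_new

# The LS solution for the current pattern then follows from back substitution
# of R M_hat_k = Y^T e_hat_k.
```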
3. The symmetric case

Case A: The matrix A is positive definite

Let us first consider a symmetric positive definite matrix $A = L_A^T L_A$ with Cholesky factor $L_A$. For preconditioning we want to find a sparse approximate Cholesky factor $L$ of the inverse, $M = L L^T \approx A^{-1}$. In [10], the Frobenius norm minimization of $\|L_A L - I\|_F$ is used. Here, we want to minimize the following functional, first introduced by Kaporin [9,1]:
$$\min \frac{(1/n)\,\mathrm{trace}(L^T A L)}{\det(L^T A L)^{1/n}}$$
relative to a given sparsity pattern of $L$. If we restrict $L$ to be of diagonal form, then the solution of this problem is given by $L = \mathrm{diag}(A)^{-1/2}$, and therefore $\mathrm{diag}(L^T A L) = I$. Hence, this last equation also holds if we minimize relative to some other sparsity pattern of $L$ that covers the diagonal structure. For general triangular $L$ with prescribed sparsity pattern we denote the columns of $L$ by $L_k$, the index set of allowed entries of $L_k$ by $J_k$, and $\tilde J_k := J_k - \{k\}$. Then
$$\frac{\sum_{k=1}^{n} L_k^T A L_k}{(L_{11} L_{22} \cdots L_{nn})^{2/n}}
= \frac{\sum_{k=1}^{n} \left( L_{kk}^2 A_{kk} + 2 L_{kk}\, L_k(\tilde J_k)^T A(\tilde J_k, k) + L_k(\tilde J_k)^T A(\tilde J_k, \tilde J_k)\, L_k(\tilde J_k) \right)}{(L_{11} L_{22} \cdots L_{nn})^{2/n}}$$
is to be minimized. This leads to the conditions
$$L_k(\tilde J_k) = -\,L_{kk}\, A(\tilde J_k, \tilde J_k)^{-1} A(\tilde J_k, k)$$
and
$$L_{kk}^2 = \frac{1}{A_{kk} - A(\tilde J_k, k)^T A(\tilde J_k, \tilde J_k)^{-1} A(\tilde J_k, k)} \, .$$
Again, we can compute the solution $L$ columnwise in parallel by solving for each $L_k$ a small positive definite linear system related to the submatrix $A(\tilde J_k, \tilde J_k)$, where $\tilde J_k$ is the set of allowed nonzero entries without the diagonal entry. We can derive a dynamic formulation of this method by starting with the diagonal sparsity pattern for $L$, i.e. $J_k = \{k\}$. For an already computed solution $L_k$ we want to determine new profitable indices to be added to $J_k$. To this end we consider the 1D-minimization problem
$$\min_{\mu_j} \frac{(1/n)\,\mathrm{trace}\!\left( (L^T + \mu_j e_k e_j^T)\, A\, (L + \mu_j e_j e_k^T) \right)}
{\det\!\left( (L^T + \mu_j e_k e_j^T)\, A\, (L + \mu_j e_j e_k^T) \right)^{1/n}} \, .$$
Note that we have $j > k$ ($L$ is lower triangular), and hence the determinant is constant with respect to $\mu_j$. The solution is given by
$$\mu_j = -\,\frac{A_j^T L_k}{A_{jj}}\, ,$$
and the reduction of the above functional depends on the value of
$$\tau_j = -\,\frac{\big(A(j, J_k)\, L_k(J_k)\big)^2}{A_{jj}} \, .$$
Hence, we can determine new indices for $L_k$ independently of the other columns; the set of possible new indices is reduced to indices $j \notin J_k$, $j > k$ with $A_j^T L_k \neq 0$; and we can compare the possible indices $j$ by considering the values $\tau_j$. The linear systems in $A(\tilde J_k, \tilde J_k)$ can be solved either in sparse mode or by updating a Cholesky factorization for each enlarged index set.
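A sketch of the columnwise computation for Case A (NumPy, dense submatrices, illustrative function name), following the two formulas above.

```python
import numpy as np

def fsai_column(A, k, Jk):
    """Column k of the approximate Cholesky factor L for SPD A and pattern Jk
    (all indices >= k, including k itself); a dense sketch."""
    Jt = [j for j in Jk if j != k]             # J~_k = J_k without the diagonal
    Lk = np.zeros(A.shape[0])
    if Jt:
        y = np.linalg.solve(A[np.ix_(Jt, Jt)], A[Jt, k])   # A(J~,J~)^{-1} A(J~,k)
        Lkk = 1.0 / np.sqrt(A[k, k] - A[Jt, k] @ y)
        Lk[Jt] = -Lkk * y
    else:
        Lkk = 1.0 / np.sqrt(A[k, k])
    Lk[k] = Lkk
    return Lk
```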
Case B: The matrix A is indefinite

For symmetric indefinite $A$ we want to find a symmetric sparse preconditioner by solving $\min \|AM - I\|_F$ for symmetric $M$ with prescribed sparsity pattern. If we order the unknowns in a vector, we get
$$\min \left\|
\begin{pmatrix} A(I_1, J_1) & & & \\ & A(I_2, J_2) & & \\ & & \ddots & \\ & & & A(I_n, J_n) \end{pmatrix}
\begin{pmatrix} M_1(J_1) \\ M_2(J_2) \\ \vdots \\ M_n(J_n) \end{pmatrix}
- \begin{pmatrix} e_1(I_1) \\ e_2(I_2) \\ \vdots \\ e_n(I_n) \end{pmatrix}
\right\|
= \min \|B x_M - e\| \, .$$
But in view of the symmetry, the entries of the vector $x_M$ are no longer independent; e.g. the entry $M_{12} = M_{21}$ occurs in the vector $M_1(J_1)$ and in $M_2(J_2)$, as $2 \in J_1$ and $1 \in J_2$. We collect the independent variables in a shorter vector
$$\tilde x_M = \big(\, \tilde M_1(\tilde J_1) \;\; \tilde M_2(\tilde J_2) \;\; \cdots \;\; \tilde M_n(\tilde J_n) \,\big)^T ,$$
where $\tilde J_k$ contains only entries $j \geq k$. The relation between the original vector of unknowns $x_M$ and the reduced vector $\tilde x_M$ is described by a rectangular matrix $P$ of the form $x_M = P \tilde x_M$. Here the columns of $P$ are all zero except for one entry equal to $1$ if the corresponding component $\tilde x_M(i)$ represents a diagonal element of $M$, or two entries equal to $1$ if $\tilde x_M(i)$ represents a symmetric pair of off-diagonal elements of $M$; $x_M(i)$ denotes the $i$-th component of the vector $x_M$. This leads to a least squares problem with the normal equations
$$P^T B^T B P\, \tilde x_M = P^T B^T e \, .$$
In many examples the matrix $P^T B^T B P$ is large but has a bounded
condition number. Therefore, we can use the conjugate gradient method for solving this least squares problem. The solution vector $\tilde x_M$ then defines the preconditioning matrix $M$. In comparison with the nonsymmetric case, instead of solving $n$ small and independent least squares problems directly, we have to solve only one large sparse least squares problem of size $O(n)$ iteratively. This is easily parallelizable and effective as long as the matrix $P^T B^T B P$ is well-conditioned. In a natural way we can also define a block Jacobi preconditioner for this matrix by considering the diagonal blocks corresponding to $\tilde J_k$, $k = 1, \ldots, n$. As in the unsymmetric case, we can determine new profitable indices by considering the one-dimensional minimization problem (a). But here, every new off-diagonal entry $q$ in the index set $J_k$ also leads to an enlargement of the index set $J_q$ by $k$. With a larger index set we can again compute the matrices $B$ and $P$ and solve the least squares problem by the cg-algorithm.
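A rough sketch of Case B (NumPy/SciPy assumed, dense $A$ and list-based patterns for simplicity, illustrative names): it assembles the block diagonal matrix $B$ and the 0/1 matrix $P$, and solves the normal equations with CG through a LinearOperator; scattering the stacked solution back into $M$ is omitted.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

def symmetric_spai(A, pattern):
    """Solve P^T B^T B P x~ = P^T B^T e for a symmetric pattern (sketch).
    pattern[k] is the sorted list J_k of allowed nonzero rows of column k."""
    n = A.shape[0]
    blocks, rhs, offsets, off = [], [], [], 0
    for k in range(n):
        Jk = pattern[k]
        Ik = np.unique(A[:, Jk].nonzero()[0])
        blocks.append(A[np.ix_(Ik, Jk)])
        rhs.append((Ik == k).astype(float))
        offsets.append(off)
        off += len(Jk)
    B = sp.block_diag(blocks, format="csr")
    e = np.concatenate(rhs)

    # P maps the reduced unknowns (pairs i >= k) to the stacked vector x_M
    rows, cols, col = [], [], 0
    for k in range(n):
        for pos, i in enumerate(pattern[k]):
            if i < k:
                continue                       # handled by the pair from column i
            rows.append(offsets[k] + pos)
            cols.append(col)
            if i != k:                         # the symmetric partner M_ki
                rows.append(offsets[i] + pattern[i].index(k))
                cols.append(col)
            col += 1
    P = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(off, col))

    BP = B @ P
    normal_op = LinearOperator((col, col), matvec=lambda v: BP.T @ (BP @ v))
    x_tilde, info = cg(normal_op, BP.T @ e)
    return P @ x_tilde                         # stacked columns M_k(J_k)
```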
References

[1] Axelsson,O.: Iterative Solution Methods, Cambridge University Press, 1994.
[2] Barrett,R., Berry,M., Chan,T., Demmel,J., Donato,J., Dongarra,J., Eijkhout,V., Pozo,R., Romine,C., van der Vorst,H.: Templates for the solution of linear systems: building blocks for iterative methods, SIAM, Philadelphia, 1994.
[3] Chow,E., Saad,Y.: Approximate inverse preconditioners for general sparse matrices, Research Report UMSI 94/101, University of Minnesota Supercomputing Institute, Minneapolis, Minnesota, 1994.
[4] Cosgrove,J.D.F., Diaz,J.C., and Griewank,A.: Approximate inverse preconditioning for sparse linear systems, Intl. J. Comp. Math. 44, pp. 91-110, 1992.
[5] Daniel,J.W., Gragg,W.B., Kaufman,L.C., and Stewart,G.W.: Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, Mathematics of Computation 30, pp. 772-795, 1976.
[6] Gould,N.I.M., Scott,J.A.: On approximate-inverse preconditioners, Technical Report RAL 95-026, Rutherford Appleton Laboratory, Chilton, England, 1995.
[7] Grote,M., Huckle,T.: Parallel preconditioning with sparse approximate inverses, SIAM J. Sci. Comput., to appear.
[8] Huckle,T.: Efficient Computation of Sparse Approximate Inverses, TU München, Institut für Informatik, TUM-I9608; SFB-Report 342/04/96 A, 1996.
[9] Kaporin,I.E.: New Convergence Results and Preconditioning Strategies for the Conjugate Gradient Method, Numerical Linear Algebra Appl. 1(2), pp. 179-210, 1994.
[10] Kolotilina,L.Yu., Yeremin,A.Yu.: Factorized sparse approximate inverse preconditionings I. Theory, SIAM J. Matrix Anal. Appl. 14, pp. 45-58, 1993.