COMPUTING THE SPARSE INVERSE SUBSET: AN INVERSE MULTIFRONTAL APPROACH

YOGIN E. CAMPBELL AND TIMOTHY A. DAVIS
Technical Report TR-95-021, Computer and Information Sciences Department, University of Florida, Gainesville, FL, 32611 USA. October, 1995.
Key words. Sparse matrices, symmetric multifrontal method, supernode, Takahashi equations, inverse multifrontal method, inverse frontal matrix, inverse contribution matrix, inverse assembly tree, Zsparse.

AMS (MOS) subject classifications. 05C50, 65F50, 65F05.

Abbreviated title: Multifrontal Sparse Inverse Subset

Abstract.
We present the symmetric inverse multifrontal method for computing the sparse inverse subset of symmetric matrices. The symmetric inverse multifrontal approach uses an equation presented by Takahashi, Fagan, and Chin to compute the numerical values of the entries of the inverse, and an inverted form of the symmetric multifrontal method of Duff and Reid to guide the computation. We take advantage of related structures that allow the use of dense matrix kernels (the level 2 and level 3 BLAS) in the computation of this subset. We discuss the theoretical basis for this new algorithm, give numerical results for a serial implementation, and demonstrate its performance on a Cray-C98.
1. Introduction. We address the problem of computing the sparse inverse subset (Zsparse) of a symmetric matrix $A$, using the $LDU$ factorization of $A$ and an equation relating the $LDU$ factors to $Z = A^{-1}$ presented by Takahashi, Fagan, and Chin in [19]. The sparse inverse subset is defined as the set of entries in the upper triangular part of $Z$ in locations given by the nonzero entries in the factorized matrix: that is, $Z_{\rm sparse} = \{z_{ij} \mid (U)_{ij} \neq 0\} \subseteq Z$. The entries in Zsparse are useful in many practical applications, such as approximating the condition number of symmetric positive-definite matrices and estimating the variances of the fitted parameters in the least-squares data-fitting problem. Prior results on computing Zsparse and other related inverse subsets, based on the two equations of Takahashi, Fagan, and Chin, are found in the articles by Erisman and Tinney [13] and by Betancourt and Alvarado [2]. Erisman and Tinney in [13] proved that both Zsparse and the subset of entries on the diagonal of $Z$ can be evaluated without computing any inverse entry from outside of Zsparse. In [2] Betancourt and Alvarado give a parallel algorithm to compute Zsparse and the full inverse. Neither of these two articles considered the use of dense matrix operations, such as matrix-vector or matrix-matrix multiplications, in the computations. Our numerical results show that there is a significant improvement in the performance of the Zsparse algorithm when the level 2 (matrix-vector) and level 3 (matrix-matrix) BLAS [8] operations are used in the implementation. In this paper we develop a new method, the symmetric inverse multifrontal method [3], to compute Zsparse. We use one of the equations from [19] (henceforth referred to as the Takahashi equation(s)) to compute the numerical values of the inverse elements, and an inverted form of the symmetric multifrontal method of Duff and Reid [12] to
guide the computation, and we take advantage of related structures that allow the use of dense matrix kernels in the innermost loops. We show that the results in [13] and [2] can be easily derived using this formulation. In the multifrontal method the frontal matrix and the assembly tree are the two key constructs. We introduce two similar constructs, the inverse frontal matrix and the inverse assembly tree, and discuss their relationship to the frontal matrix and assembly tree, respectively. We show how the computation of Zsparse can be formulated in terms of inverting inverse frontal matrices, using the inverse assembly tree to specify the data dependencies.

An outline of the paper follows. In Section 2, we discuss relevant aspects of one of the Takahashi equations and give a small example showing its use. We briefly review the symmetric multifrontal factorization method in Section 3. For a more detailed treatment of multifrontal factorization we refer the reader to [12, 17]. The inverse multifrontal algorithm is developed in Sections 4 and 5. In Section 4, the fundamentals of the inverse multifrontal approach are presented, culminating in an algorithm to compute Zsparse based on matrix-vector operations. This algorithm is extended in Section 5 to include matrix-matrix operations based on inverting supernodes. Performance results from an implementation of a block-structured form of the Zsparse algorithm on the Cray-C98 are discussed in Section 6. Conclusions and avenues for future work are given in Section 7.

2. The Takahashi equation. Takahashi, Fagan, and Chin in [19] presented two equations for computing the inverse of a general matrix $A$, using its $LDU$ factorization. The equations are

(2.1)
$$Z = D^{-1}L^{-1} + (I - U)Z$$
and (2.2)
$$Z = U^{-1}D^{-1} + Z(I - L),$$
where $Z = A^{-1}$ and $A = LDU$ ($L$, $U$, and $D$ are unit lower triangular, unit upper triangular, and diagonal matrices, respectively). When $A$ is (numerically) symmetric, $U = L^T$ and $Z$ is symmetric. In this paper we consider symmetric matrices and, therefore, only use Equation (2.1). (For notational convenience we continue to use $U$ instead of $L^T$.) We refer to Equations (2.1) and (2.2) as the Takahashi equations. The matrices involved in Equation (2.1) are illustrated in Fig. 2.1; shaded areas in this figure represent nonzero elements. The following are some useful observations concerning the Takahashi equation and the matrices involved in it. The product $D^{-1}L^{-1}$ is a lower triangular matrix with $(D^{-1}L^{-1})_{ii} = (D^{-1})_{ii}$. This is used to avoid computing $L^{-1}$ when evaluating elements on the diagonal and upper triangular part of $Z$. The matrix $(I - U)$ is strictly upper triangular, since $U$ is unit upper triangular. Computationally, the most useful feature of Equation (2.1) is that we can use it to compute $Z$ (more precisely, the upper triangular part of $Z$) without having to first find $L^{-1}$. Using the two previous observations, and restricting the inverse elements to the diagonal and upper triangular part of $Z$, Equation (2.1) can be restated as follows:

(2.3)
$$z_{ij} = d_{ij}^{-1} - \sum_{k>i}^{n} u_{ik} z_{kj}, \qquad i \le j,$$
Fig. 2.1. Illustration of the Takahashi equation, Equation (2.1).
where the notation $y_{ij} = (Y)_{ij}$ is used (in particular, $d_{ij}^{-1}$ denotes $(D^{-1})_{ij}$, which is nonzero only when $i = j$). The elements of $Z$ can be computed in reverse Crout order [9]. That is, evaluate in order the elements in rows $n, n-1, \ldots, 1$. (In each row we only need to evaluate entries in the upper triangular part of $Z$.) When computing the entries of a given row, the order in which the diagonal entry is computed (relative to the other entries in the row) is important, because it depends on a subset of the entries in the row. The other entries in the row are independent of each other and can therefore be computed in any order. Erisman and Tinney show in [13] that the sparse subset can be computed in terms of $U$ and other inverse elements from within the sparse subset only. This important result allows one to safely ignore all other inverse elements when computing the sparse subset.

2.1. An Example. We use a small example to illustrate the use of Equation (2.3) in computing inverse elements. Consider the symmetric matrix $A$ and its corresponding filled matrix $(L+U)$ shown in Equations (2.4) and (2.5), respectively. The $\times$'s represent the original entries of $A$, while the $\circ$'s are entries due to fill-in.

(2.4)
$$A = \begin{pmatrix} \times & & \times & \times \\ & \times & \times & \\ \times & \times & \times & \\ \times & & & \times \end{pmatrix}$$
(2.5)
$$L+U = \begin{pmatrix} \times & & \times & \times \\ & \times & \times & \\ \times & \times & \times & \circ \\ \times & & \circ & \times \end{pmatrix}$$
Using Equation (2.3) and a reverse Crout computational order, the set of Equations (2.6) gives the sequence used to compute the elements of the sparse subset:

(2.6)
$$\begin{aligned}
z_{44} &= d_{44}^{-1} \\
z_{34} &= -u_{34} z_{44} \\
z_{33} &= d_{33}^{-1} - u_{34} z_{43} \\
z_{23} &= -u_{23} z_{33} \\
z_{22} &= d_{22}^{-1} - u_{23} z_{32} \\
z_{14} &= -u_{13} z_{34} - u_{14} z_{44} \\
z_{13} &= -u_{13} z_{33} - u_{14} z_{43} \\
z_{11} &= d_{11}^{-1} - u_{13} z_{31} - u_{14} z_{41}
\end{aligned}$$

(Note that by symmetry $z_{ij} = z_{ji}$.) Observe that this is only a partial ordering. For example, the entries $z_{11}$, $z_{13}$, and $z_{14}$ can be computed in any order with respect to $z_{22}$ and $z_{23}$. We can also use Equation (2.3) to evaluate elements outside the sparse subset, for example: $z_{24} = -u_{23} z_{34} - u_{24} z_{44}$ and $z_{12} = -u_{13} z_{32} - u_{14} z_{42}$.
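To make the recurrence concrete, the following sketch applies Equation (2.3) to the 4-by-4 example above, computing the entries of Zsparse in reverse Crout order and checking them against a dense inverse. It is a minimal illustration of ours, not the paper's code; the dictionary layout and the function name are assumptions made for brevity.

```python
import numpy as np

def sparse_inverse_subset(u, d, pattern, n):
    """Compute Zsparse via Equation (2.3): z_ij = d_ij^{-1} - sum_{k>i} u_ik z_kj.

    u: {(i, k): value} nonzeros of the unit upper triangular factor U;
    d: {i: d_ii} pivots; pattern: {i: sorted list of U_i''} (1-based).
    """
    z = {}
    for i in range(n, 0, -1):                      # rows n, n-1, ..., 1
        for j in pattern[i]:                       # off-diagonal entries first
            z[i, j] = -sum(u[i, k] * z[min(k, j), max(k, j)]  # z_kj = z_jk
                           for k in pattern[i])
        # the diagonal entry is computed last: it depends on the other
        # entries of its own row (see the discussion above)
        z[i, i] = 1.0 / d[i] - sum(u[i, k] * z[i, k] for k in pattern[i])
    return z

# The example of Equations (2.4)-(2.6), with arbitrary numerical values
u = {(1, 3): 0.5, (1, 4): -0.25, (2, 3): 0.8, (3, 4): 0.1}
d = {1: 2.0, 2: 1.0, 3: 4.0, 4: 0.5}
pattern = {1: [3, 4], 2: [3], 3: [4], 4: []}
z = sparse_inverse_subset(u, d, pattern, 4)

# Check against a dense inverse: A = U^T D U, since L = U^T here
U = np.eye(4)
for (i, k), v in u.items():
    U[i - 1, k - 1] = v
Zfull = np.linalg.inv(U.T @ np.diag([d[i] for i in range(1, 5)]) @ U)
assert all(np.isclose(z[i, j], Zfull[i - 1, j - 1]) for (i, j) in z)
```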
3. The symmetric multifrontal approach. The direct method for solving the linear system of equations $Ax = b$ (where $A$ is symmetric) using the $LU$ or Cholesky factorization of $A$ is usually a two-step algorithm. In the first step, the coefficient matrix $A$ is factorized into $LU$ (or $LL^T$ if Cholesky factorization is used), where $L$ and $U$ are lower and upper triangular matrices, respectively. The second step involves solving the two triangular systems $Ly = b$ for $y$ (forward substitution) and $Ux = y$ for $x$ (backward substitution). The symmetric multifrontal method of Duff and Reid [11, 12] (see also Liu [17]) is an efficient method to compute the $LU$ factorization of $A$, especially on machines with a memory hierarchy. This factorization method is based on the use of dense matrix kernels in the innermost loops. Dense submatrices called frontal matrices are formed as the multifrontal algorithm progresses. One or more steps of Gaussian elimination are done on each of these frontal matrices. In general, the symmetric multifrontal algorithm consists of a symbolic analysis phase and a numerical factorization phase. In the analysis phase a fill-reducing pivot ordering algorithm (such as the approximate minimum degree algorithm [1] or the minimum degree algorithm [14]) is used to establish the pivot order and data structures. In addition, the precedence relationships among the frontal matrices that are used in the numerical phase are established and given by the assembly or elimination tree [9, 16]. In this phase only the pattern of $A$ is used. The numerical work to actually compute the $LU$ factors is done in the numerical factorization phase. The assembly tree is used to guide the computation in this phase.

We use the example matrix $A$ shown in Equation (3.1) to highlight the basic multifrontal constructs. The $p$'s and $\times$'s represent original entries of $A$, while the $\circ$'s are entries due to fill-in.

(3.1) [A 9-by-9 symmetric matrix $A$ with pivots $p_1, \ldots, p_9$ on the diagonal; its off-diagonal pattern is not reproduced here.]
(3.2) [The filled matrix $L+U$ corresponding to $A$ in Equation (3.1), with pivots $p_1, \ldots, p_9$ on the diagonal; the $\circ$ entries mark fill-in.]
Fig. 3.1. General structure of a frontal matrix: the pivot, the off-diagonal pivot row, the off-diagonal pivot column, and the contribution matrix.
We assume that the matrix has already been permuted so that the pivots lie on the diagonal: $p_1, \ldots, p_9$. Its fill structure is also determined and shown in Equation (3.2). In the simplest case (no supernodes), nine frontal matrices, $F_1, \ldots, F_9$, corresponding to the nine pivots, $p_1, \ldots, p_9$, are used in the $LU$ factorization of $A$. The row/column index pattern of frontal matrix $F_i$ is defined by the row/column index pattern of pivot row/column $i$ of the filled matrix. Let $U_i$ be the column pattern of $F_i$. Then, $U_i = \{j \mid u_{ij} \neq 0,\ j \geq i\}$. We find it convenient to partition the column index set $U_i$ into a pivotal subset $U_i'$ and a non-pivotal subset $U_i''$, where

(3.3)    $U_i' = \{i\}, \qquad U_i'' = \{j \mid u_{ij} \neq 0,\ j > i\}.$

The row pattern $L_i$ is defined and partitioned similarly. For symmetric matrices $L_i = U_i$. For our example matrix, $U_1'' = \{2, 8, 9\}$ and $U_4'' = \{7, 8, 9\}$. The partitioning of the index set $U_i$ induces a natural partitioning of the frontal matrix $F_i$ into four parts: the pivot element, the off-diagonal pivot row, the off-diagonal pivot column, and the contribution matrix $C_i = \{(F_i)_{jk} \mid j, k \in U_i''\}$. Figure 3.1 illustrates this general partitioning scheme. The assembly tree can be constructed using the parent-child relationship given by [16]

$${\rm parent}(i) = \min\{j \mid l_{ji} \neq 0,\ j > i\} = \min\{j \mid j \in L_i''\}.$$

Figure 3.2 shows the assembly tree for the filled matrix in Equation (3.2). The node numbers correspond to the labels of frontal matrices, and the arrows specify the dependency relationships among frontal matrices. For example, frontal matrix $F_8$ must be factored after $F_2$ and $F_7$, because $F_2$ and $F_7$ contain update terms to the pivot row and pivot column of $F_8$.
Fig. 3.2. Assembly tree for the filled matrix in Equation (3.2) (root: node 9; arrows point from child to parent).
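For concreteness, the parent relation ${\rm parent}(i) = \min U_i''$ is easy to compute once the filled pattern is known. The short sketch below (ours; the names are illustrative) does so for the 4-by-4 example of Section 2.1, since the full 9-by-9 pattern is not reproduced here.

```python
def assembly_tree_parents(upper_pattern, n):
    """Given the filled pattern as U_i'' sets (excluding the diagonal),
    return the parent map of the assembly tree, using
    parent(i) = min{ j : j in U_i'' } (valid since L_i = U_i here)."""
    return {i: (min(upper_pattern[i]) if upper_pattern[i] else None)  # None = root
            for i in range(1, n + 1)}

# Filled pattern of the 4-by-4 example from Section 2.1
upper_pattern = {1: {3, 4}, 2: {3}, 3: {4}, 4: set()}
print(assembly_tree_parents(upper_pattern, 4))
# {1: 3, 2: 3, 3: 4, 4: None}: nodes 1 and 2 are children of 3, and 3 of 4
```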
4. The symmetric inverse multifrontal method. We introduce the basic concepts and constructs of the symmetric inverse multifrontal method by considering the computation of the sparse inverse subset, Zsparse. We show that the computation of Zsparse can be formulated in terms of constructs similar to those used in the multifrontal method. These include the inverse frontal matrix and the inverse assembly tree.

4.1. The inverse frontal matrix. Let $\mathcal{F}$ be the set of frontal matrices used in the multifrontal $LDU$ factorization of a (numerically) symmetric matrix $A$. (In this case $U = L^T$, but we continue to use $U$ instead of $L^T$ for notational convenience.) For every frontal matrix $F_i \in \mathcal{F}$, we define a corresponding inverse frontal matrix $\bar{F}_i$, with row and column patterns identical to those of $F_i$. That is, $\bar{F}_i$ has row index pattern $L_i$ and column index pattern $U_i$. We use the same notation, $L_i$ and $U_i$, to denote the index patterns of both $F_i$ and $\bar{F}_i$. Although $F_i$ and $\bar{F}_i$ have the same structure, we define an element of $\bar{F}_i$ to be an element of the inverse matrix $Z$, i.e., $(\bar{F}_i)_{kj} = z_{kj} \in Z$. The set of elements defined by $\bar{F}_i$ is thus a subset of $Z$. Let $\bar{\mathcal{F}}$ represent the set of corresponding inverse frontal matrices. We partition an inverse frontal matrix $\bar{F}_i$ into an inverse pivot row, an inverse pivot column, and an inverse contribution matrix, in much the same way that its corresponding frontal matrix $F_i$ is partitioned. The general partitioned structure of the frontal and inverse frontal matrices is illustrated in Figure 4.1. Let $Z_i'$ be the set of elements in the inverse pivot row of $\bar{F}_i$. Then, $Z_i' = \{z_{ij} \mid j \in U_i,\ j \geq i\}$. It is easy to show, by the equivalence in structure of $F_i$ and $\bar{F}_i$, that the sparse inverse subset (Zsparse) is given by

(4.1)
$$Z_{\rm sparse} = \bigcup_{i \in \bar{\mathcal{F}}} Z_i' = \{z_{ij} \mid u_{ij} \neq 0,\ j \geq i\}.$$
(Note that since Zsparse is symmetric, we only consider entries in its upper triangular part.) Equation (4.1) states that the set of entries in Zsparse is the union of the sets of entries in the inverse pivot rows of all inverse frontal matrices in $\bar{\mathcal{F}}$.
Fig. 4.1. Equivalence of the structures of $F_i$ and $\bar{F}_i$: the pivot row, pivot column, and contribution matrix of $F_i$ correspond to the inverse pivot row, inverse pivot column, and inverse contribution matrix of $\bar{F}_i$.
Fig. 4.2. The corresponding inverse assembly tree for the filled matrix in Equation (3.2); it has the same structure as the assembly tree of Figure 3.2, with the dependency arrows reversed (from parent to child).
Consequently, to compute Zsparse we in effect need to compute the entries in the inverse pivot rows of every inverse frontal matrix. We refer to the process of computing the entries in the inverse pivot row of an inverse frontal matrix as "inverting" or "inversion" of the inverse frontal matrix.

4.2. The inverse assembly tree. As mentioned in our review of the symmetric multifrontal method (Section 3), an assembly tree is used to guide the factorization process. For any given matrix $A$, we define the corresponding inverse assembly tree to have a structure identical to that of the assembly tree used in the factorization of $A$, except that the direction of the parent-child dependency arrows is reversed and the node labels refer to inverse frontal matrices. As we shall show, the inverse assembly tree is used to guide the inversion process in a way analogous to how the assembly tree is used to guide the factorization process. Figures 3.2 and 4.2 show the assembly and inverse assembly trees, respectively, for the filled matrix in Equation (3.2). Note that the flow of data, represented by the dependency arrows, is from parent to children in the inverse assembly tree (Figure 4.2), as opposed to the usual children-to-parent flow in the assembly tree (Figure 3.2). As discussed later, this "dependency reversal" captures the top-down traversal of the inverse assembly tree used in the computation of Zsparse, compared to the bottom-up assembly tree traversal used in the factorization algorithm. There is also an inverse assembly process from parent to child, instead of the child-to-parent assembly process in the multifrontal factorization method.
Fig. 4.3. The frontal and inverse frontal constructs for the matrix in Equation (3.2): the assembly and inverse assembly trees, shown side by side with the frontal matrices $F_1$ (indices $\{1,2,8,9\}$) and $F_2$ (indices $\{2,8,9\}$) and their inverse counterparts $\bar{F}_1$ and $\bar{F}_2$. The # positions mark entries assembled from child to parent; the + positions mark entries inverse-assembled from parent to child.
The inverse assembly tree can also be used in a parallel algorithm to compute Zsparse. We report results on such an algorithm in [5]. In Figure 4.3 we give a side-by-side illustration of the multifrontal and inverse multifrontal constructs for the matrix in Equation (3.2). Note that, whereas assembly in the multifrontal method takes place from the contribution matrix of a child node (the # positions in $F_1$) to the parent node (the # positions in $F_2$), inverse assembly is from the parent node (the + positions in $\bar{F}_2$) to the inverse contribution matrix of the child node (the + positions in $\bar{F}_1$).

4.3. The theoretical development. Consider using Equation (2.3) to compute entries of Zsparse. Let

(4.2)
$$T_{ij} = \{-u_{ik} \mid u_{ik} \neq 0,\ k > i\}.$$
Since $Z$ is usually full, even though $A$ may be sparse, the set $T_{ij}$ specifies the set of nonzero $u_{ik} z_{kj}$ terms in the summation. These are the values of the $U$ factor that actually "contribute" to the value of $z_{ij}$. The following is the key result that relates $T_{ij}$ to the index pattern of frontal matrix $F_i$.

Claim:
(4.3)    $$T_{ij} = \{-u_{ik} \mid k \in U_i''\}.$$
Proof: By definition (see Equation (3.3)), $\{k \mid u_{ik} \neq 0,\ k > i\} = U_i''$. Substituting this into Equation (4.2) gives the result. □

In effect, Equation (4.3) implies that, in computing an inverse element, the $k$ indices in the summation in Equation (2.3) are constrained to the non-pivotal index set of frontal matrix $F_i$. We refer to $T_{ij}$ as the direct u-dependency set of $z_{ij}$. The direct z-dependency set of $z_{ij}$, $Z_{ij}$, is defined similarly: $Z_{ij} = \{z_{kj} \mid u_{ik} \neq 0,\ k > i\}$. As for the direct u-dependency set, this can be rewritten in terms of the index set of the inverse frontal matrix $\bar{F}_i$:

(4.4)
$$Z_{ij} = \{z_{kj} \mid k \in U_i''\}.$$
Equation (4.4) expresses the set of inverse values directly used to compute $z_{ij}$ in terms of the non-pivotal index set of inverse frontal matrix $\bar{F}_i$. (Here we made use of the equivalence in index patterns of $F_i$ and $\bar{F}_i$.) Equations (4.3) and (4.4) state two of the central results of the inverse multifrontal method. Using the $T_{ij}$ and $Z_{ij}$ sets as row and column vectors, respectively, we can rewrite Equation (2.3) in the inner-product form

(4.5)    $$z_{ij} = d_{ij}^{-1} + T_{ij} Z_{ij}, \qquad i \leq j.$$
We now discuss five important results that follow immediately from the application of Equations (4.3)-(4.5) and the definition of the inverse frontal matrix.

First, note that the direct u-dependency set $T_{ij}$ is independent of the index $j$, since the set defined by Equation (4.3) is independent of $j$. (This is true even for $z_{ij} \notin Z_{\rm sparse}$ [4].)

Second, the direct z-dependency set of an inverse entry $z_{ij} \in Z_{\rm sparse}$, $i \neq j$, is identical to column $j$ of the inverse contribution block of $\bar{F}_i$. This can be easily proved. Proof: Recall that, by definition, the set of entries in column $j$ of the inverse contribution block of $\bar{F}_i$ is given by $(\bar{C}_i)_j = \{z_{kj} \mid k \in U_i''\}$, where $j \in U_i''$. The fact that $(\bar{C}_i)_j = Z_{ij}$ follows trivially from the definition of $Z_{ij}$ given by Equation (4.4). □ (Note that for $i = j$, $Z_{ij}$ is not in the inverse contribution block.) Figure 4.4 illustrates the relationship among the frontal and inverse frontal matrices and the direct dependency vectors.

The third result concerns the relationship between the direct z-dependency set of the diagonal entry and the other entries in the inverse pivot row. We claim that the direct z-dependency set of $z_{ii}$ is equal to the set of off-diagonal inverse pivot row entries of $\bar{F}_i$. Proof: Recall that the set of off-diagonal inverse pivot row entries of $\bar{F}_i$ is given by $Z_i'' = \{z_{ik} \mid k \in U_i''\}$. The direct z-dependency set of $z_{ii}$ is obtained from Equation (4.4) and is given by $Z_{ii} = \{z_{ki} \mid k \in U_i''\}$. Substituting $z_{ki} = z_{ik}$ (by symmetry of $Z$) we get the result $Z_{ii} = \{z_{ik} \mid k \in U_i''\} = Z_i''$. □ This implies that $z_{ii}$ depends directly on all the other entries in the inverse pivot row of $\bar{F}_i$.
Fig. 4.4. Relating $\bar{F}_i$, $F_i$, $T_{ij}$, and $Z_{ij}$: $z_{ij}$ is the dot product of the row vector $T_{ij}$ (from $F_i$) and the column vector $Z_{ij}$ (from $\bar{F}_i$).
From an implementation perspective, this of course means that $z_{ii}$ should be the last element computed when $\bar{F}_i$ is inverted.

The fourth result makes the connection between the direct z-dependency set of the entire off-diagonal inverse pivot row and the inverse contribution matrix. Let $Z_i$ be the direct z-dependency set of all entries in the off-diagonal pivot row of $\bar{F}_i$, i.e., of $Z_i''$. Then the following is true: $Z_i = \bar{C}_i$. That is, the direct z-dependency set of the set of off-diagonal entries of an inverse pivot row is equal to the set of entries in the inverse contribution block. Proof: From Equation (4.4),

$$Z_i = \bigcup_{j \in U_i''} Z_{ij} = \bigcup_{j \in U_i''} \{z_{kj} \mid k \in U_i''\} = \{z_{kj} \mid k, j \in U_i''\}.$$

The last set expression is by definition equal to $\bar{C}_i$. □ From here on we use $\bar{Z}_i$ for the inverse contribution matrix of $\bar{F}_i$, instead of the $\bar{C}_i$ notation used earlier.

The fifth and final result concerns the direct z-dependency set of all entries in Zsparse. Erisman and Tinney in [13] showed that the only inverse entries that an entry in Zsparse depends on also belong to Zsparse. In our terminology, this is equivalent to saying that the direct z-dependency set of Zsparse is equal to Zsparse. We prove this result next. Let direct(Zsparse) be the direct z-dependency set of Zsparse; we claim that direct(Zsparse) = Zsparse. Proof: By Equation (4.1),
$${\rm direct}(Z_{\rm sparse}) = {\rm direct}\Big(\bigcup_{i \in \bar{\mathcal{F}}} Z_i'\Big) = \bigcup_{i \in \bar{\mathcal{F}}} {\rm direct}(Z_i'),$$
where $Z_i'$ is the set of elements in the inverse pivot row of $\bar{F}_i$ and ${\rm direct}(Z_i')$ is the direct z-dependency set for the entries in $Z_i'$. Using the third and fourth results just discussed,

(4.6)    $${\rm direct}(Z_i') = \{z_{jk} \mid j \in U_i'',\ k \in U_i\} = \{z_{ji} \mid j \in U_i''\} \cup \{z_{jk} \mid j, k \in U_i''\}.$$

Note that the set $\{z_{ji} \mid j \in U_i''\} = \{z_{ij} \mid j \in U_i''\} = Z_i''$; clearly, these entries all belong to Zsparse. The set $\{z_{jk} \mid j, k \in U_i''\} = \bar{Z}_i$, the inverse contribution block of $\bar{F}_i$.
Every element in the inverse contribution block of $\bar{F}_i$ is also an element of Zsparse, for the following simple reason. Since the structure of $\bar{F}_i$ is equivalent to the structure of its corresponding frontal matrix $F_i$, for every $z_{jk}$ in the inverse contribution block of $\bar{F}_i$ there is a nonzero update to the entry $u_{jk}$ in the factor matrix $U$ (or $l_{jk}$ in the factor matrix $L$ if $j > k$): this obviously means that $u_{jk}$ (or $l_{jk}$) is nonzero. By definition $z_{jk} \in Z_{\rm sparse}$ if $u_{jk}$ is nonzero; therefore every entry of $\bar{Z}_i$ must also be an entry of Zsparse. This is true for every inverse frontal matrix; therefore:

$${\rm direct}(Z_{\rm sparse}) = {\rm direct}\Big(\bigcup_{i \in \bar{\mathcal{F}}} Z_i'\Big) = \bigcup_{i \in \bar{\mathcal{F}}} {\rm direct}(Z_i') = Z_{\rm sparse}. \ \Box$$
It is also easy to show that the direct z-dependency set of the sparse subset consisting of the entries on the diagonal of $Z$ is equal to Zsparse. Proof: Recall that ${\rm direct}(z_{ii}) = z_{ii} \cup Z_i'' = Z_i'$. (We have included $z_{ii}$ in its own z-dependency set to make the proof simpler.) Therefore,

$$\bigcup_{i \in \bar{\mathcal{F}}} {\rm direct}(z_{ii}) = \bigcup_{i \in \bar{\mathcal{F}}} Z_i' = Z_{\rm sparse}. \ \Box$$
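The containment direction of the fifth result is easy to check mechanically on a small case. The sketch below verifies, for the 4-by-4 example of Section 2.1, that every direct z-dependency of an entry in Zsparse again lies in Zsparse (up to symmetry); the `pattern` dictionary is the same assumed structure used earlier.

```python
# U_i'' sets for the 4-by-4 example (1-based indices, upper pattern)
pattern = {1: {3, 4}, 2: {3}, 3: {4}, 4: set()}

# Zsparse as a set of (i, j) positions, i <= j, from Equation (4.1)
zsparse = ({(i, i) for i in pattern} |
           {(i, j) for i in pattern for j in pattern[i]})

def direct(i, j):
    """Direct z-dependency set of z_ij, from Equation (4.4):
    {z_kj : k in U_i''}, folded into the upper triangle by symmetry."""
    return {(min(k, j), max(k, j)) for k in pattern[i]}

# Closure: the dependencies never leave the sparse subset
assert set().union(*(direct(i, j) for (i, j) in zsparse)) <= zsparse
```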
4.4. The matrix-vector form of the Takahashi equation. The fourth result discussed in the previous subsection shows that the computation of the off-diagonal entries of inverse pivot row $i$ directly involves only the entries in the inverse contribution matrix of $\bar{F}_i$. We also know, by the second result, that these dependencies are "structured," in the sense that $z_{ij}$ depends only on column $j$ of the inverse contribution matrix of $\bar{F}_i$. This allows us to restate Equation (4.5) in matrix-vector form for the special case where we are only concerned with computing the entries in the inverse pivot row of $\bar{F}_i \in \bar{\mathcal{F}}$:

(4.7)
$$Z_i'' = T_i \bar{Z}_i,$$
$$z_{ii} = d_{ii}^{-1} + T_i Z_{ii} = d_{ii}^{-1} + T_i (Z_i'')^T,$$

where we have dropped the unnecessary $j$ subscript on the direct t-dependency vector.

4.5. Computational cost. Observe that $|T_{ij}| = |Z_{ij}| = |U_i''|$, where $|x|$ is the number of elements in the set/vector/matrix $x$. By Equation (4.5) it is easy to see that the direct cost, in terms of multiply-add pairs of floating point operations, to compute $z_{ij}$ is $|T_{ij}| = |Z_{ij}| = |U_i''|$. This direct cost only represents the cost to compute $z_{ij}$ once all the inverse entries on which it depends have previously been computed. Clearly, the overall floating point operation cost in computing $z_{ij}$ must also include the floating point work done in computing the entries on which $z_{ij}$ depends, the indirect cost. Since inverse pivot row $i$ has $|U_i''| + 1$ entries, the direct cost to compute it is $|U_i''|(|U_i''| + 1)$. To compute Zsparse we need to compute the entries of all inverse pivot rows (see Equation (4.1)). Using the fact that Zsparse can be computed without computing any element from outside this set, we get the following result for the total number of multiply-add pairs of operations required to compute Zsparse:
$$\sum_{i=1}^{n} |U_i''|\,(|U_i''| + 1),$$

which is the same as the number of operations required for the factorization (recall that $U = L^T$). If, for example, $LDU$ has a tridiagonal pattern, then $|U_i''| = 1$ ($i = 1, \ldots, n-1$), $|U_n''| = 0$, and the cost of computing Zsparse for this system is $2(n-1)$ multiply-add pairs. In general, if $LDU$ is a band matrix of bandwidth $2m + 1$, then
$$|U_i''| = m, \qquad 1 \leq i \leq n-m,$$
and
$$|U_i''| = n - i, \qquad n-m+1 \leq i \leq n.$$
The number of multiply-add pair operations to compute Zsparse is then

$$\sum_{i=1}^{n-m} m(m+1) + \sum_{i=n-m+1}^{n} (n-i)(n-i+1) = m(m+1)(3n-2m-1)/3.$$

When $m$ is much less than $n$ (the typical case) this computation is of order $O(m^2 n)$.
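As a quick sanity check on the closed form, the snippet below compares it against direct summation for a few $(n, m)$ pairs.

```python
def band_cost_direct(n, m):
    # Direct summation of multiply-add pairs for a band matrix of
    # bandwidth 2m+1: |U_i''| = m for i <= n-m, and n-i afterwards.
    return (sum(m * (m + 1) for _ in range(1, n - m + 1)) +
            sum((n - i) * (n - i + 1) for i in range(n - m + 1, n + 1)))

def band_cost_closed(n, m):
    # m(m+1)(3n-2m-1) is always divisible by 3
    return m * (m + 1) * (3 * n - 2 * m - 1) // 3

for n, m in [(10, 1), (100, 5), (1000, 30)]:
    assert band_cost_direct(n, m) == band_cost_closed(n, m)
```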
4.6. The inverse multifrontal Zsparse algorithm. There are still two important issues to be addressed: (a) how to efficiently locate (assemble) the elements of the direct z-dependency sets (except, of course, $Z_{ii} = Z_i''$), and (b) the order in which the inverse elements should be computed, taking into account the z-dependencies inherent in the Takahashi equation. Note that the locations of the direct u-dependency sets are completely specified by Equation (4.3).

4.6.1. Assembling the direct z-dependency set. Let node $p$ be the parent of node $i$ in the inverse assembly tree. The following assertions are true: (a) $j \in U_i''$ implies $j \in U_p$; (b) for every $z_{ij}$ in the off-diagonal pivot row of $\bar{F}_i$, $Z_{ij}$ is a subset of column $j$ of $\bar{F}_p$; and (c) the inverse contribution block of $\bar{F}_i$ is a subset of the entries in the parent node $\bar{F}_p$.

Assertion (a) follows immediately from the general properties of the index sets, one of which is that $U_i'' \subseteq U_p$ ([16] and [12]). □

The proof of assertion (b) is as follows. By Equation (4.4) the direct z-dependency set of $z_{ij}$ is $Z_{ij} = \{z_{kj} \mid k \in U_i''\}$. Using the fact that $U_i'' \subseteq U_p$, we can write $U_p = U_i'' \cup Q$ (for some, possibly empty, index set $Q$). Then,

$$\{z_{kj} \mid k \in U_p\} = \{z_{kj} \mid k \in U_i''\} \cup \{z_{kj} \mid k \in Q\},$$

from which we get $Z_{ij} = \{z_{kj} \mid k \in U_i''\} \subseteq \{z_{kj} \mid k \in U_p\}$. Since from (a) we have $j \in U_p$, we get the result that $Z_{ij}$ is a subset of column $j$ of $\bar{F}_p$. □

Assertion (c) is an immediate consequence of assertions (a) and (b): $\bar{Z}_i = \bigcup_{j \in U_i''} Z_{ij}$ is a subset of the entries in $\bar{F}_p$, since every $j \in U_i''$ belongs to $U_p$ by assertion (a), and in these cases $Z_{ij}$ is a subset of column $j$ of $\bar{F}_p$ by (b). □ Assertion (c) in essence assures us that the inverse contribution block of $\bar{F}_i$ can always be assembled from the parent of node $i$ in the inverse assembly tree.
Fig. 4.5. Algorithm to compute Zsparse:

    while (computing Zsparse) do
        inverse ordering: select the next inverse frontal matrix, $\bar{F}_i$, based on a top-down traversal of the inverse assembly tree
        inverse assembly: assemble the inverse contribution matrix $\bar{Z}_i$ from its parent node
        inversion: compute inverse pivot row $i$ using Equation (4.7)
    end while
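A minimal dense sketch of this loop, for the simple one-pivot-per-node case, is given below. It follows Figure 4.5 and Equation (4.7), but, as a simplification of our own, it gathers the inverse contribution matrix from a global table of computed entries rather than from an explicit parent data structure, and it ignores the memory management of Section 6; all names are illustrative.

```python
import numpy as np

def invert_all(n, Uii, U_rows, Dinv):
    """Compute Zsparse by inverting simple inverse frontal matrices top-down.

    Uii[i]:    sorted non-pivotal index set U_i'' of node i (1-based).
    U_rows[i]: NumPy vector of the entries u_{ik}, k in Uii[i].
    Dinv[i]:   the reciprocal pivot d_ii^{-1}.
    """
    z = {}                       # entries of Zsparse, keys (i, j) with i <= j
    # Processing pivots in reverse order is a valid top-down traversal of
    # the inverse assembly tree, since parent(i) > i (Section 3).  A real
    # code would instead walk the tree depth-first and assemble from the
    # parent node; here we gather from the global table for simplicity.
    for i in range(n, 0, -1):
        cols = Uii[i]
        t = -U_rows[i]           # direct u-dependency vector T_i
        Zbar = np.array([[z[min(k, j), max(k, j)] for j in cols]
                         for k in cols]).reshape(len(cols), len(cols))
        zrow = t @ Zbar          # Z_i'' = T_i Zbar_i (a level 2 BLAS operation)
        zii = Dinv[i] + t @ zrow # z_ii = d_ii^{-1} + T_i (Z_i'')^T, computed last
        for j, v in zip(cols, zrow):
            z[i, j] = v
        z[i, i] = zii
    return z
```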
This is analogous to the situation in the symmetric multifrontal method, where the contribution matrix of a child node can always be assembled into its parent node. Being able to assemble the inverse contribution matrix from the parent node allows for an implementation that can take advantage of dense kernels. With the appropriate data structure, the inverse contribution block can first be assembled from its parent and the matrix-vector product found in Equation (4.7) done using the level 2 BLAS. Furthermore, applying the inverse assembly process recursively, we get the important result that the inverse assembly tree (with dependency arrows from parent to children) captures the dependency relationships among the inverse frontal matrices. The computation of Zsparse then becomes a simple matter of following the dependency arrows of the inverse assembly tree, assembling the inverse contribution matrix and doing the actual inversion using Equation (4.7) at every node. Note that, in contrast to the multifrontal factorization algorithm, the assembly in the inverse multifrontal algorithm is a copy operation requiring no floating-point operations. This algorithm is given in Figure 4.5.

5. The supernodal inverse multifrontal method. The discussion of the inverse multifrontal method in the previous section focused on simple inverse frontal matrices containing one inverse pivot row and column. However, multifrontal codes take advantage of related structures in the index pattern to form supernodes consisting of several pivot rows and columns per frontal matrix. The use of these supernodes permits more efficient use of the memory hierarchy (cache and vector registers, for example) found in high-performance computer architectures. We show in this section how to extend the simple algorithm for computing Zsparse to take advantage of the supernodes formed during the $LDU$ factorization.

5.1. The supernodal inverse constructs. Our definition of a supernode comes from a theorem given by Liu, Ng, and Peyton in [18]:

Theorem 5.1 ([18]). The column set $S = \{i, i+1, \ldots, i+m\}$ is a supernode of the matrix $L$ if and only if $S$ is a maximal set of contiguous columns such that $i+j-1$ is a child of $i+j$ in the elimination tree (assembly tree in our case) for $j = 1, 2, \ldots, m$, and $\eta(L_{*,i}) = \eta(L_{*,i+m}) + m$, where $\eta(v)$ gives the number of nonzeros in the vector $v$.
We label a supernode with the minimum column index in $S$. In the symmetric multifrontal method the column set $S$, defined by Theorem 5.1, specifies the pivot columns, and an equivalent set specifies the pivot rows. The simple inverse multifrontal approach is extended in a natural way to include supernodes.
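Theorem 5.1 translates directly into a linear scan over the columns of $L$. The sketch below groups columns into supernodes given the elimination tree and the column counts; it is a plausible rendering of the theorem's two conditions, not code from the paper.

```python
def find_supernodes(n, parent, colcount):
    """Group columns 1..n into supernodes per Theorem 5.1.

    parent[j]:   parent of column j in the elimination/assembly tree.
    colcount[j]: number of nonzeros in column j of L, i.e. eta(L_{*,j}).
    Returns a list of maximal runs [first, last] of contiguous columns.
    """
    supernodes = [[1, 1]]
    for j in range(2, n + 1):
        first = supernodes[-1][0]
        # extend the current run if j-1 is a child of j and the column
        # counts shrink by exactly one per column within the run
        if parent[j - 1] == j and colcount[first] == colcount[j] + (j - first):
            supernodes[-1][1] = j
        else:
            supernodes.append([j, j])
    return supernodes

# For the 4-by-4 example: parent = {1: 3, 2: 3, 3: 4} and
# colcount = {1: 3, 2: 2, 3: 2, 4: 1} give [[1, 1], [2, 2], [3, 4]],
# i.e. columns 3 and 4 form a supernode.
```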
Fig. 5.1. Block structure of the supernodal $F_i$ and $\bar{F}_i$: the pivot block, the off-diagonal pivot row block, the off-diagonal pivot column block, and the contribution matrix, together with their inverse counterparts $Z_i'$, $Z_i''$, $\bar{Z}_i$, and the block $T_i''$.
The supernodes and supernodal assembly tree of the $LU$ factorization have their corresponding inverse supernodes and inverse supernodal assembly tree, defined analogously to the simple nodes discussed in Section 4. We continue to use the same notation for the supernodal frontal and inverse frontal matrices as was introduced in Section 4, with the understanding that we mean the supernodal structures. The supernodes are partitioned into the same four parts as the simple nodes, except that the single pivot is now a block matrix, and both the off-diagonal pivot row and pivot column are now block matrices. Figure 5.1 illustrates the general structures and partitioning of the supernodal frontal/inverse frontal matrices. It should be noted that now $U_i' = \{i, i+1, \ldots, i+m\} = S$.

5.2. The supernodal algorithm for Zsparse. The matrix-vector form of the Takahashi equation, Equation (4.7), can be easily extended to take advantage of the supernodal structures. The following Equations (5.1) give the extended form used to compute the entries in the inverse pivot rows of $\bar{F}_i$:

(5.1)
$$Z_i'' = T_i'' \bar{Z}_i,$$
$$(Z_i'')_r = (Z_i'')_r + (T_i')_{[r,\,r+1:i+m]} (Z_i'')_{[r+1:i+m,\,j_1:j_c]}, \qquad i \leq r < i+m,$$
$$Z_i' = D_i^{-1} + T_i'' (Z_i'')^T,$$
$$(Z_i')_{[r,\,r+1:i+m]} = (Z_i')_{[r,\,r+1:i+m]} + (T_i')_{[r,\,r+1:i+m]} (Z_i')_{[r+1:i+m,\,r+1:i+m]}, \qquad i \leq r < i+m,$$
$$(Z_i')_{r,r} = (Z_i')_{r,r} + (T_i')_{[r,\,r+1:i+m]} (Z_i')^T_{[r,\,r+1:i+m]}, \qquad i \leq r < i+m.$$

We use the colon notation found in [15], where $X_{[p:q,\,r:s]}$ represents the submatrix with row indices ranging from $p$ to $q$ and column indices from $r$ to $s$. If $p = q$ then $X_{[p,\,r:s]}$ is a row vector with column indices $r$ through $s$. Similarly, $X_{[p:q,\,r]}$ is a column vector with row indices $p$ through $q$. For $U_i' = \{i, i+1, \ldots, i+m\}$ and $U_i'' = \{j_1, j_2, \ldots, j_c\}$, the matrices $Z_i'$, $Z_i''$, $\bar{Z}_i$, $D_i^{-1}$, and $T_i''$ are defined as follows (see also Figure 5.1):

$$Z_i' = (\bar{F}_i)_{[i:i+m,\,i:i+m]},$$
$$Z_i'' = (\bar{F}_i)_{[i:i+m,\,j_1:j_c]},$$
Fig. 5.2. The block-partitioned $\bar{F}_i$, shown as an 8-by-8 grid of blocks ($m_1 = 5$ pivot-block rows and $m_2 = 3$ contribution rows in this example); only the upper "staircase" part is stored and computed.
$$\bar{Z}_i = (\bar{F}_i)_{[j_1:j_c,\,j_1:j_c]},$$
$$D_i^{-1} = (D^{-1})_{[i:i+m,\,i:i+m]}, \quad {\rm and}$$
$$T_i'' = (F_i)_{[i:i+m,\,j_1:j_c]} = -U_{[i:i+m,\,j_1:j_c]},$$
respectively. There are three potential problems with using Equations (5.1) in an implementation. First, a modest number of unnecessary floating point operations may be done, due to the third subequation in (5.1), where partial computation of entries in the lower triangular part of $Z$ is done. The idea in using this subequation is to be able to take advantage of the level 3 BLAS operations. We can rewrite the subequation in terms of the less efficient level 2 BLAS operations, but we need to balance the desire for high megaflop performance (using the level 3 BLAS) against the increase in CPU time as the inverse pivot blocks get bigger and more unnecessary operations are done. The second potential problem is that, for larger problems where the supernodes can become relatively large, the block sizes can become larger than the "optimum" level 3 BLAS operating size. This optimum block size will, of course, depend on hardware characteristics such as the cache size and vector register size. Clearly, it would be advantageous to restrict the block sizes so that the computation fits in cache and/or makes effective use of the vector registers. The third concern has to do with memory usage. In order to take advantage of the level 3 BLAS we need to use more memory than is strictly necessary. This can be serious for the larger problems if the block size is set too large.
Fig. 5.3. The block-partitioned algorithm to compute Zsparse:

    while (computing Zsparse) do
        1: inverse ordering: get the next supernode, $i$, from a top-down traversal of the inverse supernodal assembly tree
        2: inverse assembly: assemble the "staircase" upper triangular part of $\bar{Z}_i$, the sparse inverse contribution matrix of $\bar{F}_i$, from its parent node
        3: inversion: compute the inverse pivot rows using Equation (5.2)
    end while
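For illustration, the inversion step for a single supernode can be sketched as follows (cf. Equations (5.1)). To keep the sketch short and unambiguous we write the within-pivot-block updates entry-wise, using the symmetry of $Z$ explicitly, instead of the rectangular level 3 form of (5.1); the strictly lower triangle of the returned pivot block holds partial values that are simply ignored, mirroring the unnecessary-operations remark above. All names and the dense data layout are assumptions of ours.

```python
import numpy as np

def invert_supernode(Tpp, Tpc, Dinv, Zbar):
    """Invert one supernodal inverse frontal matrix (cf. Equations (5.1)).

    Tpp:  (s, s) strictly upper triangular part of T_i' (entries -u_rk
          between the s pivot columns of the supernode).
    Tpc:  (s, c) block T_i'' (entries -u_rj to the c non-pivotal columns).
    Dinv: length-s vector of reciprocal pivots d_rr^{-1}.
    Zbar: (c, c) inverse contribution matrix gathered from the parent.
    Returns the inverse pivot block Zp (upper triangle valid) and the
    off-diagonal inverse pivot row block Zpc.
    """
    s = len(Dinv)
    Zpc = Tpc @ Zbar                              # matrix-matrix: contributions from Zbar_i
    for r in range(s - 2, -1, -1):                # within-block updates, bottom row first
        Zpc[r, :] += Tpp[r, r + 1:] @ Zpc[r + 1:, :]
    Zp = np.diag(Dinv) + Tpc @ Zpc.T              # matrix-matrix: T_i'' (Z_i'')^T
    for r in range(s - 2, -1, -1):
        for q in range(r + 1, s):                 # off-diagonal entries of pivot row r
            Zp[r, q] += sum(Tpp[r, k] * Zp[min(k, q), max(k, q)]
                            for k in range(r + 1, s))  # z_kq = z_qk by symmetry
        Zp[r, r] += Tpp[r, r + 1:] @ Zp[r, r + 1:]     # diagonal entry last
    return Zp, Zpc
```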
To control the negative impact of these three problems, we use the two-dimensional partitioning scheme shown in Figure 5.2 for each of the inverse frontal matrices. The upper "staircase" diagonal form of the matrix still involves some unnecessary computation and unnecessary storage, but these can both be controlled by regulating the size of the blocking parameter. Assuming that the blocking size, $b$, is the same in both dimensions, we get the block-partitioned form of Equations (5.1):

(5.2)
$$Z_{\mathbf{ij}} = D_{\mathbf{ij}}^{-1} + \sum_{\mathbf{k}=\mathbf{i}+1}^{\mathbf{j}} T_{\mathbf{ik}} Z_{\mathbf{kj}} + \sum_{\mathbf{k}=\mathbf{j}+1}^{\mathbf{m}} T_{\mathbf{ik}} (Z_{\mathbf{jk}})^T, \qquad \mathbf{i} \leq \mathbf{j},$$
$$(Z_{\mathbf{ij}})_r = (Z_{\mathbf{ij}})_r + (T_{\mathbf{ii}})_{[r,\,r+1:b]} (Z_{\mathbf{ij}})_{[r+1:b,\,1:b]}, \qquad 1 \leq r < b,$$
$$(Z_{\mathbf{ii}})_{rr} = (Z_{\mathbf{ii}})_{rr} + (T_{\mathbf{ii}})_{[r,\,r+1:b]} (Z_{\mathbf{ii}})^T_{[r,\,r+1:b]}, \qquad 1 \leq r < b,$$

where the bold-type indices are local block indices, $\mathbf{m} = \mathbf{m_1} + \mathbf{m_2}$ with $m_1$ and $m_2$ defined by Equation (5.3), $1 \leq \mathbf{i} \leq \mathbf{m_1}$, $1 \leq \mathbf{j} \leq \mathbf{m}$, and $T_{\mathbf{ij}} = -U_{\mathbf{ij}}$.
0
00
17 z-value area
active area
Free Memory
parent area
. Partitioning of memory
Fig. 6.1
grows and shrinks as inverted parent nodes are allocated or deallocated, respectively. A point to note here is that a stack structure is the \natural" data structure to use for the parent memory area when the inverse assembly tree is traversed in a depth- rst manner. Natural here meaning that when F i is to be inverted its parent lies at the top of the stack and, therefore, no overhead is incurred in locating the parent node. If a breadth- rst traversal is used however, then a queue would be a better data structure. We used a depth- rst traversal scheme. The memory area in Figure 6.1 labeled the active area is used as the actual center of computation. When F i is to be inverted, memory of size b2[m1 (m1 + 1)=2 + m2 (m2 + 1)=2 + m1 m2 ] (see Section 5 for de nitions of m1 and m2 ) is allocated to accommodate its staircase upper diagonal form. The location of the active area is xed with respect to the top of the left-stack. A preprocessing phase is used to determine the maximum size to which the stacks and active area can grow. After an inverse frontal matrix, such as F i , is inverted, the input factors corresponding to the pivot row of Fi can be overwritten with the inverse entries in the inverse row of F i . Signi cant savings in the amount of stack memory used can result if this overwriting is allowed. We therefore provide the user with a parameter to control this feature. 6.2. The scalar Zsparse algorithm. To serve as a basis for comparison we implemented a \scalar" version of the Zsparse algorithm, shown in Figure 6.2. The implementation of this algorithm allowed the use of level 1 BLAS operations (saxpy and dot product, with vectorized gather/scatter operations), but no level 2 or 3 BLAS operations. The algorithm proceeds by computing elements in rows n through 1, step 1. Within each row, say i, the only o-diagonal elements in Zsparse are those with indices from the index pattern of row i of the factor matrix U, i.e. Ui . This is done in step 2 for the o-diagonal entries and in step 6a for the diagonal entry (zii ). Note that zii is evaluated after all the o-diagonal entries in row i have been evaluated. Step 3a involves a saxpy operation, while lines 3b and 6a involve dot products. 6.3. Numerical results. Statistics for the test matrices used in our numerical experiments are shown in Table 6.1. Most of these matrices are part of the HarwellBoeing sparse matrix collection [10]. The sizes of the matrices range from small (the 100-by-100 Nos4 matrix) to fairly large (74652-by-74752 for the Finan512 matrix). The matrices were rst factored using a early modi ed form, with strict diagonal pivoting, of UMFPACK Version 2.0 [7, 6]. The supernodal tree information was also generated by UMFPACK. In general, any supernodal or multifrontal Cholesky factorization algorithm can be used. Table 6.2 gives the execution times for the scalar and block-partitioned Zsparse implementations. For the block-partitioned algorithm we give CPU times for block sizes 8, 32, 64, 128, and 256. In the column labeled \peak m ops of block *" the
18 1: 2: 3: 3a: 3b: 4: 5: 6: 6a:
for i = n to 1 for k 2 Ui for j 2 (Ui and j > k) zij = zij ? uik zkj zik = zik ? uij zkj end for zik = zik ? uik zkk end for?1 00
00
zii = dii for k 2 Ui zii = zii ? uik zik 00
end for end for
. The scalar Zsparse algorithm
Fig. 6.2
Table 6.1
Information on test matrices
matrix Nos4 Plat1919 Bcspwr10 Bcsstk28 Bcsstk25 Finan512
jLj (103 ) 100 structural eng. 0.8 1919 oceanography 83.4 5300 electric power 40.0 4410 structural eng. 468.8 15439 structural eng. 2425.6 74752 economics 9564.8 n discipline
best mega op performance is given (the block size for which this occurs is shown in parentheses). The worst CPU performance for all the matrices occurs for a block size of 8; for block sizes smaller than 8 we got even larger CPU times for all the matrices. This no doubt is due to the poor performance of the BLAS on the smaller-sized blocks and the increasing overhead involved in creating the larger number of blocks. But even for a block size of 8 the inverse multifrontal block-partitioned algorithm has a clear performance advantage over the scalar algorithm for all but the smallest sized matrix (Nos4) and the very sparse Bcspwr10 matrix. As the block size increases the CPU times for the block algorithm typically decreases until some optimum block size is reached, after which the CPU times begin to rise. One may have expected that a block size of length equal to the length of the vector registers (128) would give the best CPU performance. However we see from Table 6.2 that a block size of 64 gives the best CPU performance (times in bold). We attribute this to the increasing number of unnecessary operations that are performed on the diagonal blocks. As the block size increases the ratio of unnecessary to necessary operations also increases, resulting in a relative increase in the CPU times. This explanation is further supported by the fact that the highest mega op performance occurs for a block size of 128 (for the larger matrices at least) implying that the BLAS are more ecient but more operations are being done.
19 Table 6.2
Results for the scalar and multifrontal Zsparse runs
matrix Nos4 Plat1919 Bcspwr10 Bcsstk28 Bcsstk25 Finan512
CPU time (sec) peak m ops of scalar block block * 8 32 64 128 256 0.003 0.007 0.007 0.007 0.007 0.007 1 (8) 1.022 0.268 0.161 0.155 0.155 0.155 33 (64) 0.203 0.405 0.387 0.387 0.387 0.387 2 (32) 13.01 1.70 0.64 0.59 0.62 0.66 127 (128) 170.9 19.52 5.64 5.01 5.14 5.29 187 (128) 2469.1 325.5 52.5 42.25 44.05 47.29 320 (128)
7. Conclusions. We introduced the symmetric inverse multifrontal approach to computing the sparse inverse subset. This method uses the concept of inverting an inverse frontal matrix using dense matrix kernels. Similarities between the inversion of an inverse frontal matrix and the familiar concept of factorizing a frontal matrix used in a multifrontal formulation of Gaussian elimination were highlighted. It should be pointed out that it is not essential that the LDU factorization be done by the multifrontal method for the inverse multifrontal method to be used. All we really need is the pattern of nonzeros in the factorized matrix; even the supernodal structures can be reconstructed from the lled-matrix pattern information [18]. The essential theoretical results, stated in Equations (4.3) and (4.4), and the ensuing observations, give a precise description of the locations of the direct dependency sets when inverting an inverse frontal matrix. This allows the use of level 2 and level 3 BLAS in the implementation. The high performance that we achieve in computing Zsparse based on the blockpartitioned inverse multifrontal algorithm is evident from the CPU times and m ops rates found in Table 6.2. The use of the higher level BLAS primitives is certainly the dominant contributing factor in the speedup over the scalar algorithm (which only takes advantage of gather/scatter vector operations). The presentation in this paper focused on the special case of computing Zsparse where the matrix A and, therefore, its inverse Z, are numerically symmetric. The theoretical development and implementation are very similar for computing Zsparse when A and Z are unsymmetric in value but have symmetric nonzero patterns. We discuss a parallel implementation of our algorithm in [5], and an extension to arbitrary subsets of Z in [4]. An open problem remains in adapting the inverse multifrontal method to the case where the coecient matrix has been reduced to block-triangular form before the factorization is done. 8. Acknowledgements. Support for this project was provided by the National Science Foundation (DMS-9223088 and DMS-9504974), and by Cray Research, Inc., through the allocation of supercomputing resources. REFERENCES
[1] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Analysis and Applications, (to appear). Also CISE Technical Report TR-94-039.
[2] R. Betancourt and F. L. Alvarado. Parallel inversion of sparse matrices. IEEE Transactions on Power Systems, PWRS-1(1):74-81, 1986.
[3] Y. E. Campbell. Multifrontal algorithms for sparse inverse subsets and incomplete LU factorization. PhD thesis, Computer and Information Science and Engineering Department, Univ. of Florida, Gainesville, FL, November 1995. Also CISE Technical Report TR-95-025.
[4] Y. E. Campbell and T. A. Davis. On computing an arbitrary subset of entries of the inverse of a matrix. Technical Report TR-95-022, Computer and Information Science and Engineering Department, Univ. of Florida, 1995.
[5] Y. E. Campbell and T. A. Davis. A parallel implementation of the block-partitioned inverse multifrontal Zsparse algorithm. Technical Report TR-95-023, Computer and Information Science and Engineering Department, Univ. of Florida, 1995.
[6] T. A. Davis and I. S. Duff. A combined unifrontal/multifrontal method for unsymmetric sparse matrices. Technical Report TR-95-020, Computer and Information Science and Engineering Department, Univ. of Florida, 1995.
[7] T. A. Davis and I. S. Duff. An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM J. Matrix Analysis and Applications, (to appear). Also CISE Technical Report TR-94-038.
[8] J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.
[9] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford Science Publications, New York, NY, 1991.
[10] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection (Release 1). Technical Report RAL-92-086, Rutherford Appleton Laboratory, Didcot, Oxon, England, Dec. 1992.
[11] I. S. Duff and J. K. Reid. The multifrontal solution of unsymmetric sets of linear equations. SIAM J. Sci. Statist. Computing, 5(3):633-641, 1984.
[12] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Trans. Math. Software, 9:302-325, 1983.
[13] A. M. Erisman and W. F. Tinney. On computing certain elements of the inverse of a sparse matrix. Communications of the ACM, 18:177-179, March 1975.
[14] A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive-Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[15] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD and London, UK, second edition, 1990.
[16] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl., 11(1):134-172, 1990.
[17] J. W. H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review, 34(1):82-109, March 1992.
[18] J. W. H. Liu, E. G. Ng, and B. W. Peyton. On finding supernodes for sparse matrix computations. SIAM J. Matrix Anal. Appl., 14(1):242-252, January 1993.
[19] K. Takahashi, J. Fagan, and M. Chin. Formation of a sparse bus impedance matrix and its application to short circuit study. 8th PICA Conference Proceedings, Minneapolis, MN, pages 177-179, June 4-6, 1973.
Note: all University of Florida technical reports in this list of references are available in postscript form via anonymous ftp to ftp.cis.ufl.edu in the directory cis/tech-reports, or via the World Wide Web at http://www.cis.ufl.edu/~davis.