THE JOURNAL OF CHEMICAL PHYSICS 128, 104105 (2008)
Recursive inverse factorization

Emanuel H. Rubensson,1,a) Nicolas Bock,2,b) Erik Holmström,3 and Anders M. N. Niklasson2,4,c)

1 Department of Theoretical Chemistry, School of Biotechnology, Royal Institute of Technology, SE-10691 Stockholm, Sweden
2 Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
3 Instituto de Física, Universidad Austral de Chile, Casilla 576, Valdivia, Chile
4 Applied Materials Physics, Department of Materials Science and Engineering, Royal Institute of Technology, SE-10044 Stockholm, Sweden

a) Electronic mail: [email protected]
b) Electronic mail: [email protected]
c) Electronic mail: [email protected]
(Received 31 December 2007; accepted 31 January 2008; published online 12 March 2008)

A recursive algorithm for the inverse factorization S^{-1} = ZZ^* of Hermitian positive definite matrices S is proposed. The inverse factorization is based on iterative refinement [A. M. N. Niklasson, Phys. Rev. B 70, 193102 (2004)] combined with a recursive decomposition of S. As the computational kernel is matrix-matrix multiplication, the algorithm can be parallelized and the computational effort increases linearly with system size for systems with sufficiently sparse matrices. Recent advances in network theory are used to find appropriate recursive decompositions. We show that optimization of the so-called network modularity results in an improved partitioning compared to other approaches, in particular when the recursive inverse factorization is applied to overlap matrices of irregularly structured three-dimensional molecules. © 2008 American Institute of Physics. [DOI: 10.1063/1.2884921]

I. INTRODUCTION
In electronic structure calculations based on, for example, Hartree–Fock or Kohn–Sham theory, the solution of the generalized symmetric matrix eigenvalue equation

    F x = λ S x   (1)

is a key problem. Here, F is the potential matrix and S is the positive definite basis set overlap matrix. The electron density is uniquely defined by the occupied eigenspace of the matrix pair (F, S). Usually, the electron density is represented by a set of orthogonal vectors that span this subspace or by the density matrix D, which is the projection matrix for orthogonal projections onto this subspace. The problem of finding the electron density for a given potential can be solved by computing the eigenvectors of (F, S) that constitute a basis for the occupied eigenspace or by direct construction of the density matrix using eigenspace methods such as density matrix purification1–9 or density matrix minimization.9–15 The advantage of the eigenspace methods is that the computational effort increases linearly with system size for sufficiently sparse systems. In general, this type of method requires the generalized eigenvalue problem in Eq. (1) to be transformed to standard form,

    F̂ y = λ y,   Z y = x,   (2)

by a congruence transformation,16

    S → Z^* S Z = I,   (3)
    F → Z^* F Z ≡ F̂.   (4)
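As a concrete illustration of Eqs. (2)-(4), the following Python/NumPy sketch (function and variable names are ours, not from the paper) reduces the generalized problem to standard form with a given inverse factor Z, maps the eigenvectors back, and checks the result against SciPy's generalized solver:

```python
import numpy as np
from scipy.linalg import eigh

def solve_via_congruence(F, S, Z):
    """Solve Fx = lambda Sx given an inverse factor Z with Z* S Z = I."""
    F_hat = Z.conj().T @ F @ Z       # congruence transformation, Eq. (4)
    lam, Y = np.linalg.eigh(F_hat)   # standard problem F_hat y = lambda y, Eq. (2)
    return lam, Z @ Y                # back-transformation x = Zy, Eq. (2)

# small random test; any inverse factor works, e.g., the inverse Cholesky factor
rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)                  # Hermitian positive definite S
F = rng.standard_normal((n, n)); F = (F + F.T) / 2
Z = np.linalg.inv(np.linalg.cholesky(S)).T   # satisfies Z* S Z = I
lam, X = solve_via_congruence(F, S, Z)
assert np.allclose(lam, eigh(F, S, eigvals_only=True))
```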
The purpose of this paper is to present an algorithm for the construction of the inverse factor Z whose computational cost increases linearly with system size for sufficiently sparse systems and which can be parallelized. In the following, we attack the somewhat more general problem in which the matrix S is Hermitian positive definite. The inverse factor Z in Eq. (3) is not unique. Common examples include the inverse Cholesky factor17 and the inverse square root.18 The inverse Cholesky factor can be efficiently obtained by the AINV algorithm.19 Variants of the AINV algorithm include stabilized schemes that avoid possible breakdowns,20 a blocked variant that increases efficiency by the use of optimized linear algebra routines for submatrix-submatrix operations,21 and a recursive variant suitable for hierarchic matrix data structures.22 While the AINV algorithms are efficient for small to medium-sized systems, they exhibit an unsatisfactory increase of computational cost as the system size grows. The computational cost can be reduced if elements with magnitude smaller than some drop tolerance are removed from the inverse factor during the computation. This approach has been successful in producing efficient preconditioners for the solution of linear systems.19–21 Nevertheless, with dropping of matrix elements it is difficult to control the error in the computed solution, which is of particular importance for the congruence transformation. In addition, the AINV algorithms are difficult to parallelize because of the interdependence between the columns of the inverse factor in the algorithmic procedure. If an approximate inverse factor is known, the so-called iterative refinement method proposed by Niklasson23 can be
used to improve the accuracy of the approximate inverse factor. The iterative refinement scheme systematically improves the approximate inverse factor by annihilation of the factorization error to any desired order. The method is described in detail in the following section. The remaining challenge is how to obtain, with reasonable computational cost, an approximate inverse factor such that the convergence of the iterative refinement is guaranteed. In this paper, we address this challenge by combining the iterative refinement method of Ref. 23 with a recursive decomposition of the overlap matrix. The result is a recursive inverse factorization algorithm for Hermitian positive definite matrices that is parallelizable and that scales linearly with system size for sufficiently sparse systems. We provide a proof that the algorithm always converges.

The computational kernel of the recursive inverse factorization method is matrix-matrix multiplication. There are three main reasons why algorithms based on matrix-matrix multiplications are favorable for large-scale electronic structure calculations.

Sparsity. For large systems, the matrices F, D, S, and Z often contain many negligible matrix elements. Sparsity can be utilized relatively easily in matrix-matrix multiplication implementations. The sparsity depends strongly on the dimensionality of the system: for a given number of atoms, the matrices will, for example, be more sparse if the atoms are arranged in a one-dimensional chain than in a three-dimensional cluster.

Parallelism. Matrix-matrix multiplication can be parallelized. Parallelization is a more difficult task for complex operations such as diagonalization and inverse Cholesky decomposition.

A single computational kernel. Matrix-matrix multiplications are used in several parts of the calculation, e.g., in density matrix purification and iterative refinement. If this kernel is optimized, several parts of the calculation are accelerated at once.
II. ITERATIVE REFINEMENT

The iterative refinement method23 can be used to iteratively improve an approximate inverse factor of a Hermitian positive definite matrix. In this section, we briefly review the method in a more compact and generalized form and show that iterative refinement can be seen as a recursive generalization of Löwdin's symmetric orthonormalization. In the following, we use the notation ‖A‖_2 for the Euclidean norm of the matrix A, defined as the largest singular value of A. Let Z_0 be an approximate inverse factor of S such that

    ‖Z_0^* S Z_0 − I‖_2 < 1.   (5)

The iterative refinement method constructs a sequence of matrices Z_i, i = 1, 2, ..., such that the factorization error ‖δ_i‖_2 → 0 as i → ∞, where the factorization error matrix is

    δ_i = Z_i^* S Z_i − I.   (6)

The sequence of Z_i can be written as

    Z_{i+1} = Z_i Σ_{k=0}^{m} b_k δ_i^k,   i = 0, 1, ...,   (7)

where

    b_0 = 1,   b_k = −[(2k − 1)/(2k)] b_{k−1},   k = 1, 2, ..., m.   (8)

The b coefficients

    b̄ = [ 1   −1/2   3/8   −5/16   35/128   −63/256   231/1024   −429/2048   ⋯ ]   (9)

are determined from the condition that the factorization error after one iteration should vanish up to order m: ‖δ_{i+1}‖_2 = O(‖δ_i‖_2^{m+1}). If we let Z_0 = I and m = ∞ in Eq. (7), we obtain

    Z = Σ_{k=0}^{∞} b_k (S − I)^k,   (10)

which is the Taylor expansion of S^{−1/2} around I. This expansion was used by Löwdin for symmetric orthonormalization of S.18 Thus, the iterative refinement method can be seen as a recursive generalization of the original Löwdin orthonormalization. Instead of using a single truncated Taylor expansion of the inverse square root of the overlap matrix, we recursively employ truncated expansions of the factorization error matrix to obtain successively improved inverse factors. The main advantages of using iterative improvement are that accumulated errors in the computation of matrix polynomials are annihilated in each iteration and that the order to which the factorization error vanishes grows rapidly with the number of employed matrix-matrix multiplications. Unless the starting guess Z_0 commutes with S, the iterative procedure in Eq. (7) does not necessarily converge to the inverse square root, but to some inverse factor. There are two ways to construct δ_i. The first is given by

    δ_i = δ_0 + Z_i^* S (Z_i − Z_0) + (Z_i − Z_0)^* S Z_0.   (11)
This gives the local refinement method of Ref. 23. If the refinement of Z_0 is local, the matrix-matrix multiplications can be carried out with an O(1) effort using sparse matrix algebra. In the second approach, δ_i is constructed explicitly by Eq. (6), which gives a method equivalent to the ordinary iterative refinement of Ref. 23. In that report, the polynomial expansion is written as an expansion in X_i = Z_i^* S Z_i:

    Z_{i+1} = Z_i Σ_{k=0}^{m} a_k X_i^k,   i = 0, 1, ....   (12)

The a and b coefficients are related by

    a_k = Σ_{j=k}^{m} (−1)^{(k−j) mod 2} \binom{j}{k} b_j,   k = 0, 1, ..., m.   (13)
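In code, the refinement of Eqs. (8), (12), and (13) takes only a few lines. The following dense NumPy sketch (our own naming; no truncation of small elements, and a plain Horner evaluation instead of the sparse or Paterson–Stockmeyer variants discussed below) iterates until the factorization error norm drops below a tolerance:

```python
import numpy as np
from math import comb

def refine_inverse_factor(S, Z0, m=4, tol=1e-12, max_iter=50):
    """Iterative refinement (Ref. 23) of an approximate inverse factor Z0,
    assuming the convergence condition ||Z0* S Z0 - I||_2 < 1 of Eq. (5)."""
    b = [1.0]                                   # b coefficients, Eq. (8)
    for k in range(1, m + 1):
        b.append(-(2 * k - 1) / (2 * k) * b[-1])
    a = [sum((-1) ** ((k - j) % 2) * comb(j, k) * b[j]
             for j in range(k, m + 1)) for k in range(m + 1)]   # Eq. (13)
    I = np.eye(S.shape[0])
    Z = Z0
    for _ in range(max_iter):
        X = Z.conj().T @ S @ Z                  # X_i = Z_i* S Z_i
        if np.linalg.norm(X - I, 2) < tol:      # ||delta_i||_2, Eq. (6)
            break
        P = a[m] * I                            # Horner evaluation of
        for k in range(m - 1, -1, -1):          # the polynomial sum_k a_k X_i^k
            P = P @ X + a[k] * I
        Z = Z @ P                               # Eq. (12)
    return Z
```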
Note that the a coefficients change when m varies, whereas the b coefficients do not. The matrix polynomial in Eq. (7) can be evaluated efficiently by the method proposed by Paterson and Stockmeyer in Ref. 24. Using this method, optimal polynomial degrees are

    1, 2, 4, 6, 9, 12, 16, 20, 25, ...,   (14)
in the sense that these are the largest degrees that can be obtained with 0, 1, 2, ... matrix-matrix multiplications, respectively. For some polynomials, however, even more efficient evaluation methods exist. For example, the polynomial x^{2^n} can be evaluated with n multiplications by repeated squaring. For this reason, a small polynomial degree used in each refinement iteration gives a rapidly increasing total polynomial degree as the iterative procedure progresses. On the other hand, each iteration requires three to five additional matrix-matrix multiplications, independent of the polynomial degree, to evaluate the multiplication by Z_i in Eq. (7) and to form δ_i by either Eq. (6) or Eq. (11). Hence, a constant overhead is added to each polynomial evaluation, making small polynomial degrees unfavorable. Our preliminary investigations show that the largest orders to which the factorization error vanishes for a given number of multiplications are obtained for polynomial degrees listed in Eq. (14) around m = 9. We note that the matrix polynomial evaluation method by Paterson and Stockmeyer has previously been used in electronic structure calculations to evaluate Chebyshev polynomials approximating the Heaviside step function.25
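A minimal sketch of the Paterson–Stockmeyer scheme for a matrix polynomial Σ_k a_k X^k (names and the blocking parameter s are ours): the powers X², ..., X^s are formed once, and the polynomial is evaluated by Horner's rule in X^s, so a degree-m polynomial costs about s + m/s − 2 multiplications instead of m − 1.

```python
import numpy as np

def paterson_stockmeyer(a, X, s):
    """Evaluate p(X) = sum_k a[k] X^k by splitting p into chunks of
    degree < s and applying Horner's rule in X^s (Ref. 24)."""
    n = X.shape[0]
    pows = [np.eye(n), X]                 # X^0, X^1, ..., X^s
    for _ in range(s - 1):
        pows.append(pows[-1] @ X)
    m = len(a) - 1
    P = np.zeros_like(X)
    for j in range(m // s, -1, -1):       # Horner step in Y = X^s:
        chunk = sum(a[j * s + r] * pows[r]          # c_j(X), degree < s
                    for r in range(min(s, m - j * s + 1)))
        P = P @ pows[s] + chunk           # P <- P Y + c_j(X)
    return P
```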
III. BINARY RECURSIVE ALGORITHM
In this section, we combine iterative refinement with a recursive decomposition of the overlap matrix. First, we propose a general recursive algorithm. Then, we propose a so-called binary principal submatrix decomposition to be used together with the recursive algorithm and show that the recursive algorithm always converges when such a decomposition is used. This is the key result of this paper.
FIG. 1. Binary principal submatrix tree for the partitioning of S. Each S^{(i)} is block diagonal with S^{(i−1)}_{11}, ..., S^{(i−1)}_{2^i 2^i} on the diagonal.
A. Recursive algorithm

Consider the following recursive decomposition of S in l levels:

    S ≡ S^{(0)} = S^{(1)} + Δ^{(1)},
    S^{(1)} = S^{(2)} + Δ^{(2)},
    ⋮
    S^{(l−1)} = S^{(l)} + Δ^{(l)},   (15)

and let Z^{(i)} denote an inverse factor of S^{(i)}, i = 0, 1, ..., l. We want this decomposition to meet two conditions for all i in order to be useful:

(1) Z^{(i+1)} should be sufficiently close to Z^{(i)}, so that the iterative refinement method can be used to obtain Z^{(i)} if Z^{(i+1)} is known.
(2) It should be significantly cheaper to compute the inverse factor Z^{(i+1)} than the inverse factor Z^{(i)}.

Let

    δ^{(i)} ≡ Z^{(i)*} S^{(i−1)} Z^{(i)} − I = Z^{(i)*} Δ^{(i)} Z^{(i)},   i = l, l−1, ..., 1.   (16)

The first condition is then satisfied if

    ‖δ^{(i)}‖_2 < 1,   i = l, l−1, ..., 1.   (17)

If Eq. (17) is fulfilled, the inverse factor Z of S can be constructed by subsequent iterative refinement of Z^{(i)}, i = l, l−1, ..., 1. That is,

    Z^{(l)} → Z^{(l−1)} → ⋯ → Z^{(1)} → Z^{(0)} ≡ Z,   (18)

where each arrow indicates an iterative refinement. The convergence and the efficiency of this recursive procedure depend on the decompositions S^{(i−1)} = S^{(i)} + Δ^{(i)}, i = 1, 2, ..., l. In the following section, we propose a recursive binary principal submatrix decomposition of S which meets both conditions presented above.

B. Binary principal submatrix decomposition

One way of choosing the decompositions S^{(i−1)} = S^{(i)} + Δ^{(i)}, i = 1, 2, ..., l, is to make a recursive binary principal submatrix decomposition of S. Let S be partitioned as

    S = S^{(0)} = S^{(1)} + Δ^{(1)} =
        \begin{pmatrix} S^{(0)}_{11} & 0 \\ 0 & S^{(0)}_{22} \end{pmatrix}
      + \begin{pmatrix} 0 & S^{(0)}_{12} \\ S^{(0)}_{21} & 0 \end{pmatrix}.   (19)
Without loss of generality, we have assumed here that S^{(0)}_{11} and S^{(0)}_{22} are leading and trailing principal submatrices of S^{(0)}.26 Other binary principal submatrix decompositions can be used after a permutation of the rows and columns of S. Continuing like this, we get the following formula for S^{(i)} and Δ^{(i)}, i = 2, 3, ..., l:
    S^{(i)} = \begin{pmatrix} S^{(i−1)}_{11} & & \\ & \ddots & \\ & & S^{(i−1)}_{2^i 2^i} \end{pmatrix},   Δ^{(i)} = S^{(i−1)} − S^{(i)},   (20)

where S^{(i−1)}_{j−1,j−1} and S^{(i−1)}_{j,j} are leading and trailing principal submatrices of S^{(i−2)}_{j/2,j/2}, j = 2, 4, 6, ..., 2^i. This gives a binary principal submatrix tree, as illustrated in Fig. 1. Since S^{(i)} is block diagonal with 2^i blocks on the diagonal, the computation of Z^{(i)} is trivially parallel over up to 2^i processors.
FIG. 2. Recursive computation of Z using the binary principal submatrix tree illustrated in Fig. 1. Here, Z^{(i)}_{jj} is an inverse factor of S^{(i−1)}_{jj}. Each Z^{(i)} is block diagonal with Z^{(i)}_{11}, ..., Z^{(i)}_{2^i 2^i} on the diagonal.
Figure 2 illustrates the recursive computation of Z. Here, Z^{(i)}_{jj} is computed by iterative refinement of

    \begin{pmatrix} Z^{(i+1)}_{2j−1,2j−1} & 0 \\ 0 & Z^{(i+1)}_{2j,2j} \end{pmatrix}.
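The complete recursive procedure can be sketched in a few lines of dense NumPy (names ours; refine_inverse_factor is the sketch from Sec. II). A production implementation would instead use sparse blocked matrices and dispatch the two recursive calls to separate processors; the leaf factorization is here an inverse Cholesky factor, but any inverse factor will do.

```python
import numpy as np

def recursive_inverse_factor(S, nmin=64, m=4, tol=1e-12):
    """Recursive inverse factorization sketch: returns Z with Z Z* = S^{-1}."""
    n = S.shape[0]
    if n <= nmin:
        # leaf of the binary principal submatrix tree
        return np.linalg.inv(np.linalg.cholesky(S)).conj().T
    h = n // 2   # binary principal submatrix decomposition, Eq. (19)
    Z11 = recursive_inverse_factor(S[:h, :h], nmin, m, tol)
    Z22 = recursive_inverse_factor(S[h:, h:], nmin, m, tol)
    Z0 = np.zeros_like(S)
    Z0[:h, :h], Z0[h:, h:] = Z11, Z22
    # by Theorem 1 below, ||Z0* S Z0 - I||_2 < 1, so refinement converges
    return refine_inverse_factor(S, Z0, m=m, tol=tol)
```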
There are many possible decomposition schemes. The great advantage of a binary principal submatrix decomposition is that convergence is guaranteed. The following theorem assures that the recursive algorithm always converges when the binary principal submatrix decomposition is used.

Theorem 1. Let S be a Hermitian block tridiagonal positive definite matrix partitioned as

    S = S_1 + Δ_1 =
        \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_n \end{pmatrix}
      + \begin{pmatrix} 0 & B_1 & & \\ B_1^* & 0 & \ddots & \\ & \ddots & \ddots & B_{n−1} \\ & & B_{n−1}^* & 0 \end{pmatrix}.   (21)

Let Z_{A_i}^* A_i Z_{A_i} = I be inverse factorizations of A_i, i = 1, 2, ..., n, and let

    Z_1 = \begin{pmatrix} Z_{A_1} & & \\ & \ddots & \\ & & Z_{A_n} \end{pmatrix}.   (22)

Then, ‖Z_1^* S Z_1 − I‖_2 < 1.

See Appendix A for a proof. In particular, Theorem 1 is valid for any Hermitian positive definite matrix S partitioned as

    S = S^{(1)} + Δ^{(1)} =
        \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}
      + \begin{pmatrix} 0 & B_1 \\ B_1^* & 0 \end{pmatrix}.   (23)

Hence, if inverse factors Z_{A_1} and Z_{A_2} of A_1 and A_2 are known, the iterative refinement method can be used to construct an inverse factor Z of S using Z^{(1)} = diag(Z_{A_1}, Z_{A_2}) as starting guess. This theorem is the foundation of the recursive algorithm with binary principal submatrix decomposition, assuring that it will always converge.

C. Illustrative example

FIG. 3. Illustrative example network.

Consider the network in Fig. 3 and let S be the corresponding connection matrix with ones on the diagonal and 0.1 for all other connections. This particular choice makes S positive definite. An example of a recursive binary decomposition of this network is illustrated in Fig. 4. For this illustrative example, ‖δ^{(i)}‖_2 ≈ 0.1 for all levels i = 1, 2, 3, 4.

FIG. 4. Recursive binary decomposition of the illustrative example network. The figure indicates which network connections of S are kept for each matrix of Eq. (15). Compare with the tree structure in Fig. 1. Note that S^{(i−1)} = S^{(i)} + Δ^{(i)}, i = 1, 2, 3, 4, is fulfilled.

IV. BINARY SUBDIVISION SCHEMES

The factorization error ‖δ^{(i)}‖_2, and hence the number of iterative refinement iterations needed for convergence, depends on how the two principal submatrices are chosen in the binary decomposition. A common way to partition matrices in electronic structure calculations is to use a spatial subdivision of the molecule.22,27 The spatial subdivision is based on the coordinates of the basis function centers. The molecule is divided into two parts across its largest dimension, and this procedure is repeated recursively until the submatrices have the desired size. This way of partitioning the matrix is not general, since it assumes that the matrix has some correspondence to a molecular geometry.

A. Network modularity optimization
The overlap matrix can be seen as a connection matrix for a network, where the magnitude of each matrix element, corresponding to the overlap between a pair of basis functions, determines the strength of the connection between two network nodes. A binary principal submatrix decomposition of the matrix corresponds to a division of the network into two communities. A common way to quantify the quality of a certain community division of a network is to introduce a quality function that gives a value reflecting the "quality of split."28 In order to achieve a good split, the problem is transformed into the optimization of this quality function over divisions of the network. The modularity, introduced by Girvan and Newman,29 is a popular such quality-of-split function. Although this approach may have some drawbacks,30 the community divisions that are obtained appear to give valuable information.31,32 Examples of methods for optimizing the modularity are extremal optimization by Monte Carlo,32,33 simulated annealing,34 and the spectral algorithm.35 A well-known fast method is the greedy algorithm,36 which scales as O(N log² N) with the number of nodes N in the network. In the greedy algorithm, the communities are merged step by step until all nodes are in one single community. Each step is optimized in terms of the modularity when
merging C communities into C − 1 communities. However, the greedy algorithm is known not to find the largest possible modularity value.33,37 In this article, we use a method that improves on the accuracy of the greedy algorithm. The method utilizes repetitive, suboptimal merge steps in order to optimize the modularity. It has been shown to be computationally as efficient as the greedy algorithm, yet able to find better modularity values, comparable to those of the extremal optimization method.38 Note that this algorithm tries to maximize the modularity for every possible number of communities C = N − 1, ..., 2, and not only for the C value that gives the global maximum of the modularity. In the context of this paper, we are only interested in binary decompositions, i.e., C = 2. Possibly more efficient methods exist for this special case.
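As a concrete example of a modularity-based binary split, the following sketch implements the spectral method of Ref. 35 (rather than the greedy-type algorithm of Ref. 38 that is actually used in this work): the nodes are divided according to the sign of the leading eigenvector of the modularity matrix. For the overlap matrix, the weighted adjacency matrix A would hold the Frobenius norms of the atom-atom blocks, with zero diagonal; applying the split recursively to the resulting index sets yields the permutation and the binary principal submatrix tree of Fig. 1.

```python
import numpy as np

def modularity_bisect(A):
    """Binary community division via the leading eigenvector of the
    modularity matrix B = A - k k^T / 2m (Ref. 35).
    A: symmetric weighted adjacency matrix with zero diagonal."""
    k = A.sum(axis=1)                    # weighted node degrees
    B = A - np.outer(k, k) / k.sum()     # modularity matrix
    w, V = np.linalg.eigh(B)
    s = V[:, -1] >= 0                    # sign pattern of leading eigenvector
    return np.where(s)[0], np.where(~s)[0]
```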
B. Examples

We illustrate in Figs. 5–7 three different ways to do the binary decomposition. Our test system is the overlap matrix of a two-dimensional zigzag graphene nanoribbon in the STO-3G basis set. The Cartesian coordinates were obtained by replicating a perfect honeycomb pattern with a C–C length of 1.42 Å and H terminations with a C–H length equal to 1.01 Å.39–41 The overlap matrices throughout this article were constructed with the ERGO quantum chemistry program.42 The three subdivision schemes we compare are spatial subdivision, network modularity optimization, and random subdivision. In all cases, we do not subdivide the atom blocks; i.e., all basis functions centered on the same atom are restricted to be in the same community. The main reason for this restriction is that it gives more illustrative figures. The restriction also reduces the computational effort needed to do the subdivisions, particularly for the modularity optimization, which is applied here to the matrix containing the Frobenius norms of all atom-atom blocks. Nevertheless, more optimal subdivisions could possibly be found without this restriction.

For the highly regular system considered in this section, the spatial subdivision scheme (Fig. 5) finds binary subdivisions that give relatively small initial factorization errors ‖δ^{(i)}‖_2, i = 1, 2, 3. The network modularity optimization (Fig. 6) finds nonintuitive divisions that are as good as or better than the spatial subdivision. We expect the network modularity optimization to be particularly useful for more irregularly structured molecules, which will be investigated in the next section.
FIG. 5. Spatial subdivision: in this figure, spatial subdivision is used to construct the binary principal submatrix tree of the overlap matrix. Panel (a) shows molecule connections that are kept for each matrix in Eq. (15) and the corresponding factorization errors ‖δ‖_2 for each level. The number of iterations k needed to converge the iterative refinement of Z^{(i+1)} → Z^{(i)} to an accuracy of ‖δ_k^{(i+1)}‖_2 < 10^{−10}, i = 2, 1, 0, using a polynomial order of m = 1 in the iterative refinement algorithm is indicated as well. Panel (b) depicts the nonzero structure of the reordered atom block overlap matrix.
FIG. 6. Network modularity optimization: in this figure, optimization of the network modularity is used to construct the binary principal submatrix tree of the overlap matrix. Panel (a) shows molecule connections that are kept for each matrix in Eq. (15) and the corresponding factorization errors ‖δ‖_2 for each level. The number of iterations k needed to converge the iterative refinement of Z^{(i+1)} → Z^{(i)} to an accuracy of ‖δ_k^{(i+1)}‖_2 < 10^{−10}, i = 2, 1, 0, using a polynomial order of m = 1 in the iterative refinement algorithm is indicated as well. Panel (b) depicts the nonzero structure of the reordered atom block overlap matrix.
FIG. 7. Random permutation: in this figure, a random construction of the binary principal submatrix tree of the overlap matrix is used. Panel (a) shows molecule connections that are kept for each matrix in Eq. (15) and the corresponding factorization errors ‖δ‖_2 for each level. The number of iterations k needed to converge the iterative refinement of Z^{(i+1)} → Z^{(i)} to an accuracy of ‖δ_k^{(i+1)}‖_2 < 10^{−10}, i = 2, 1, 0, using a polynomial order of m = 1 in the iterative refinement algorithm is indicated as well. Panel (b) depicts the nonzero structure of the reordered atom block overlap matrix.
Also, the random permutation (Fig. 7) produces subdivisions that are useful for this particular system. Nevertheless, it is still advisable to use a more sophisticated scheme, since smaller initial factorization errors can be obtained with small computational effort. We note also that the spatial subdivision and the network modularity optimization schemes reduce the bandwidth of the matrix. This suggests that bandwidth reduction methods such as the Cuthill–McKee method43 could be useful for constructing the permutation. The Cuthill–McKee method, however, does not utilize the magnitude of the matrix elements. The methods investigated here utilize the magnitude of the matrix elements either implicitly or explicitly and are therefore expected to give smaller initial factorization errors.
V. ADDITIONAL EXAMPLES

We have seen that spatial subdivision is almost as good a choice as the more sophisticated network modularity based approach for the regularly structured graphene nanoribbon. We will now investigate how the different subdivision approaches compare for more irregularly structured three-dimensional systems. In our molecule test set, we have included four molecules from the Protein Data Bank to represent molecules of relevance in current biochemical research.44–48 These molecules have three-dimensional geometries. To test the effect of increasing system size, we have also included the graphene nanoribbon with 10×3 benzene units used in the previous section and a ten times larger ribbon with 100×3 benzene units. The molecules in the test set are listed in Table I.
TABLE I. Molecule test set. The graphene coordinates were obtained by replicating perfect honeycomb patterns as explained in Sec. IV B. The other geometries were obtained from the Protein Data Bank (Ref. 44).

    Abbreviation   Molecule
    G10            Graphene, 10×3 units
    G100           Graphene, 100×3 units
    1EDP           Endothelin (Ref. 45)
    1A1U           P53 fragment (Ref. 46)
    2CRD           Charybdotoxin (Ref. 47)
    1CIS           Chymotrypsin inhibitor 2 (Ref. 48)
The values of the factorization errors, ‖δ^{(i)}‖_2, determine the speed of convergence. The importance of these values decreases as i increases, because the matrix S^{(i)} at a certain level i consists of 2^i disconnected submatrices, which makes the computational cost significantly smaller at larger levels i. Therefore, we focus on ‖δ^{(1)}‖_2 as the key value for fast convergence. In Fig. 8, we compare the ‖δ^{(1)}‖_2 values for the different subdivision schemes. The figure shows that for irregularly structured three-dimensional molecules it is beneficial to use the partitioning algorithm based on the network modularity. We note also that increasing the system size has no effect on the ‖δ^{(1)}‖_2 value. In Fig. 9, we compare the number of iterations needed to converge the iterative refinement at the highest level for three different basis sets. Here, the basis sets 3-21G and 6-31G give condition numbers of S around 10^5, while the condition numbers for STO-3G lie around 10^2. This explains the increased number of iterations needed to converge the iterative refinement when the larger basis sets are used.
FIG. 8. (Color online) The factorization error ‖δ^{(1)}‖_2 as a function of molecule and subdivision scheme. The molecule set is given in Table I.

FIG. 9. (Color online) Number of iterations needed to converge the iterative refinement at the highest level using three different basis sets. Iterative refinement with seventh order of convergence (m = 6) and the network modularity subdivision scheme is used.
VI. STABILITY

The stability of the iterative refinement is related to the accuracy that is used in matrix operations. The required accuracy should be determined from the condition that the factorization error decreases in each iteration, i.e., that ‖δ^{(i+1)}‖_2 < ‖δ^{(i)}‖_2. Otherwise, the factorization error may stagnate or end up larger than one, leading to divergence. The main source of error in matrix operations is the removal of small matrix elements, done to reduce the computational effort. In the following, we let δ^{(i+1)} denote the factorization error matrix that would result from one refinement step if exact arithmetic were used in the matrix operations, and δ̃^{(i+1)} the factorization error matrix obtained when truncation of small matrix elements is applied. Assume now that

    ‖δ̃^{(i+1)} − δ^{(i+1)}‖_2 ≤ ε.   (24)

Then, according to Weyl's theorem on eigenvalue movement caused by a Hermitian perturbation (see, for example, Corollary 4.10 in Ref. 16),

    ‖δ̃^{(i+1)}‖_2 ≤ ‖δ^{(i+1)}‖_2 + ε.   (25)

Therefore, if we can express δ^{(i+1)} in terms of δ^{(i)}, we can ensure that ‖δ̃^{(i+1)}‖_2 < ‖δ^{(i)}‖_2 by choosing a proper value of ε. To fulfill Eq. (24), truncation of small elements can be done in line with Ref. 49. In exact arithmetic, the decrease of the factorization error from one iteration to the next can be calculated exactly from the b coefficients. The factorization error matrix after a step is given by

    δ^{(i+1)} = Σ_k c_k (δ^{(i)})^k,   (26)

where

    c_k = −2 Σ_{j=0}^{k−m−1} b_j b_{k−j} − 2 Σ_{j=0}^{k−m−2} b_j b_{k−j−1}.   (27)

See also Table II, where the c coefficients are given for m = 1, 2, ..., 6. Since c_k = 0 for k = 0, 1, ..., m and since Σ_k |c_k| = 1 for all m, it follows that

    ‖δ^{(i+1)}‖_2 ≤ ‖δ^{(i)}‖_2^{m+1}.   (28)

Hence, to avoid stagnation, we can choose

    ε < ‖δ^{(i)}‖_2 − ‖δ^{(i)}‖_2^{m+1}.   (29)

We note that the polynomials in Eq. (26) have previously been presented for m = 1, 2, 3, 4, expressed in the shifted variable ϵ = δ^{(i)} + 1 (Ref. 50). A subject for further research is to choose the truncation thresholds for the removal of small matrix elements optimally, so that the computational effort is minimized while stability is maintained. We note also that for very large condition numbers it may not be possible to achieve accurate enough matrix operations to avoid stagnation or divergence. In such a case, however, one might need to reconsider the choice of basis set.
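The rows of Table II can be generated directly from Eqs. (8) and (27); the short script below (our own naming, with exact rational arithmetic) reproduces them and the normalization Σ_k |c_k| = 1 used in the bound of Eq. (28).

```python
from fractions import Fraction

def b_coeffs(kmax):
    """b_k from the recursion of Eq. (8), as exact fractions."""
    b = [Fraction(1)]
    for k in range(1, kmax + 1):
        b.append(Fraction(-(2 * k - 1), 2 * k) * b[-1])
    return b

def c_coeffs(m):
    """c_k of Eq. (27) for k = 0, ..., 2m+1; c_k = 0 for k <= m."""
    b = b_coeffs(2 * m + 1)
    return [-2 * sum(b[j] * b[k - j] for j in range(k - m))
            - 2 * sum(b[j] * b[k - j - 1] for j in range(k - m - 1))
            for k in range(2 * m + 2)]

print(c_coeffs(2))   # m = 2 row of Table II: 0, 0, 0, 5/8, -15/64, 9/64
assert all(sum(map(abs, c_coeffs(m))) == 1 for m in range(1, 7))
```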
TABLE II. Coefficients c_k for the factorization error expressed as in Eq. (26) for different polynomial orders m. Entries not shown are zero. Note that Σ_k |c_k| = 1 for all m.

    m = 1:  c_2 = −3/4,  c_3 = 1/4
    m = 2:  c_3 = 5/8,  c_4 = −15/64,  c_5 = 9/64
    m = 3:  c_4 = −35/64,  c_5 = 7/32,  c_6 = −35/256,  c_7 = 25/256
    m = 4:  c_5 = 63/128,  c_6 = −105/512,  c_7 = 135/1024,  c_8 = −1575/16384,  c_9 = 1225/16384
    m = 5:  c_6 = −231/512,  c_7 = 99/512,  c_8 = −2079/16384,  c_9 = 385/4096,  c_10 = −4851/65536,  c_11 = 3969/65536
    m = 6:  c_7 = 429/1024,  c_8 = −3003/16384,  c_9 = 1001/8192,  c_10 = −3003/32768,  c_11 = 9555/131072,  c_12 = −63063/1048576,  c_13 = 53361/1048576
VII. ILL-CONDITIONED SYSTEMS, A STRUGGLE IN VAIN?
Many basis sets consisting of Gaussian functions of the form αe^{−βx²} give rise to ill-conditioned overlap matrices, especially when the number of basis functions per atom is large, distances between atoms are short, or the β values are small. Ill-conditioned systems generally give larger initial factorization errors and an increased number of iterations needed for convergence, as can be seen in Fig. 9. We have tried three different approaches to remedy this. The first idea is to scale the Δ^{(i)} matrices with a factor γ ∈ [0, 1] such that the ‖δ^{(i)}‖_2 value is guaranteed to be smaller than or equal to some predefined value. The inverse factor of S^{(i)} + γΔ^{(i)} is computed several times, with γ successively increased until γ = 1 and the inverse factor of S^{(i−1)} is obtained. The second idea is to solve a well-conditioned system where the same basis set is used but with larger exponents, and then use the solution of this problem as a starting guess for the ill-conditioned problem. The third idea is to use a nonlinear conjugate gradient scheme (the Polak–Ribière method51) to produce an improved starting guess for the iterative refinement scheme.
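The first idea can be sketched as follows (names ours; refine_inverse_factor is the sketch from Sec. II). A fixed γ schedule is used for simplicity, whereas the text describes increasing γ adaptively so that ‖δ^{(i)}‖_2 never exceeds a predefined value:

```python
def gamma_continuation(S_i, Delta_i, Z, gammas=(0.25, 0.5, 0.75, 1.0)):
    """Reach the inverse factor of S^{(i-1)} = S^{(i)} + Delta^{(i)} through
    a sequence of better-conditioned problems S^{(i)} + gamma Delta^{(i)}."""
    for gamma in gammas:
        Z = refine_inverse_factor(S_i + gamma * Delta_i, Z)
    return Z
```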
We applied the conjugate gradient scheme to the following functional and gradient:

    Ω(Z) = ‖Z^* S Z − I‖_F^2 = Tr[(Z^* S Z − I)^* (Z^* S Z − I)],
    ∇_{Z^*} Ω(Z) = 2 S Z (I − Z^* S Z).   (30)
In Fig. 10, we compare the conjugate gradient method with iterative refinement. The iterative refinement is clearly superior in both the well-conditioned and the more ill-conditioned case. Our preliminary investigations of the three approaches above do not indicate any significant improvement in computational efficiency compared to using iterative refinement alone. However, the methods may turn out to be useful for improving the stability of the method for very ill-conditioned systems.
FIG. 10. (Color online) Convergence of the nonlinear conjugate gradient algorithm (dashed lines) compared to iterative refinement (solid lines). The test system is the graphene nanoribbon with 10×3 benzene units depicted in Fig. 5. We used two different basis sets, STO-3G (crosses) and 6-31G (circles), and iterative refinement with second order of convergence (m = 1). The starting guesses for both methods were obtained by a binary principal submatrix decomposition using the network modularity optimization.
For example, using the scaling of the Δ^{(i)} matrices, one can keep the factorization error far from 1, eliminating the risk that eigenvalues of δ leave the [−1, 1] interval.
VIII. SUMMARY
We have presented a recursive inverse factorization method for the computation of inverse factors of Hermitian positive definite matrices. A binary principal submatrix tree is used for the recursive decomposition of the matrix, and we have proved that the algorithm always converges with such a decomposition. Since the algorithm is based on matrix-matrix multiplications, the computational effort increases linearly with system size for systems with sufficiently sparse matrices, and the implementation of the algorithm can be parallelized.

The use of blocked sparse data structures has significantly improved the performance of programs for large-scale electronic structure calculations.22,27,52 These data structures, however, depend on subdivision schemes that are able to squeeze out zero elements from the blocked representation. In the recursive inverse factorization, the use of an efficient subdivision scheme also reduces the initial factorization error. While it is relatively straightforward to bring about nearly optimal subdivisions for regular systems, it can be considerably more difficult for irregular three-dimensional systems. In this work, recent advances in network theory have been used to find binary subdivisions in the recursive decomposition of the overlap matrix. We have shown that optimization of the so-called network modularity results in an improved partitioning compared to other approaches, especially in the case of irregularly structured three-dimensional systems.

We would like to stress that the recursive inverse factorization algorithm was developed with parallel computer architectures and very large systems in mind. Sparsity in the matrices and/or an optimized parallel implementation of matrix-matrix operations are imperative for the efficiency of the algorithm. For small to medium-sized calculations running on a single processor, existing inverse Cholesky algorithms19–22 are likely to be more efficient.

The work presented in this article was motivated by the need to transform the generalized eigenvalue problem occurring in large-scale Hartree–Fock and Kohn–Sham calculations to standard form by a congruence transformation, but our method may also have applications in other research areas. The construction of preconditioners for the solution of linear systems comes to mind.
ACKNOWLEDGMENTS
The authors are grateful for the enlightening atmosphere provided by the International Ten Bar Café. We also gratefully acknowledge the support of the U.S. Department of Energy through the LANL LDRD/ER program for this work. E.H.R. thanks Elias Rudberg for valuable comments and acknowledges support from Pieter Swart and the Los Alamos mathematical modeling and analysis student program.
APPENDIX A: PROOF OF THEOREM 1
Here we will prove Theorem 1.

Definition 1. Let

    X = \begin{pmatrix} A & B \\ C & D \end{pmatrix}   (A1)

and let A be invertible. Let (X/A) = D − CA^{−1}B denote the Schur complement of A in X.

Theorem 2. A Hermitian matrix X is positive definite if at least one principal submatrix of X is positive definite together with its Schur complement. Also, if X is positive definite, every principal submatrix of X is positive definite together with its Schur complement.

See Ref. 26 for a proof. We will now use this theorem to prove Theorem 1.

Proof of Theorem 1. Let

    U_1 = S,   U_i = (U_{i−1}/X_{i−1}),   i = 2, ..., n,   (A2)

where X_i is the conforming leading principal submatrix of U_i. Now, since U_i is the Schur complement of X_{i−1} in U_{i−1}, U_i is positive definite if U_{i−1} is positive definite. U_1 is positive definite. Hence, U_i and X_i are positive definite for all i. A general formula for X_i is

    X_1 = A_1,   X_i = A_i − B_{i−1}^* X_{i−1}^{−1} B_{i−1},   i = 2, ..., n.   (A3)

We note that Z_1^* S Z_1 − I = Z_1^* Δ_1 Z_1 and let

    V_1 = I + Z_1^* Δ_1 Z_1,   V_i = (V_{i−1}/Y_{i−1}),   i = 2, ..., n,   (A4)

where Y_i is the conforming leading principal submatrix of V_i. A general formula for Y_i is

    Y_1 = I,   Y_i = I − Z_{A_i}^* B_{i−1}^* Z_{A_{i−1}} Y_{i−1}^{−1} Z_{A_{i−1}}^* B_{i−1} Z_{A_i},   i = 2, ..., n.   (A5)

However, then,

    Y_i = Z_{A_i}^* X_i Z_{A_i},   i = 1, ..., n,   (A6)

and since Z_{A_i} has full rank and X_i is positive definite, Y_i is positive definite for all i. By Theorem 2, V_i is positive definite if V_{i+1} = (V_i/Y_i) and Y_i are positive definite, but since V_n = Y_n, V_i is positive definite for all i. In particular, V_1 = I + Z_1^* Δ_1 Z_1 is positive definite. Analogously, it can be shown that I − Z_1^* Δ_1 Z_1 is positive definite. Hence, ‖Z_1^* Δ_1 Z_1‖_2 < 1. □
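A quick numerical illustration of Theorem 1 (a sketch with arbitrary sizes and scalings, chosen so that S is safely positive definite): build a random Hermitian block tridiagonal S, assemble Z_1 from inverse Cholesky factors of the diagonal blocks, and verify that the factorization error norm is below one.

```python
import numpy as np

rng = np.random.default_rng(0)
nb, bs = 5, 10                       # number of blocks and block size
n = nb * bs
S = np.zeros((n, n))
for i in range(nb):                  # diagonal blocks A_i (positive definite)
    A = rng.standard_normal((bs, bs))
    S[i*bs:(i+1)*bs, i*bs:(i+1)*bs] = A @ A.T + 2 * np.eye(bs)
for i in range(nb - 1):              # off-diagonal blocks B_i, kept small
    B = 0.05 * rng.standard_normal((bs, bs))
    S[i*bs:(i+1)*bs, (i+1)*bs:(i+2)*bs] = B
    S[(i+1)*bs:(i+2)*bs, i*bs:(i+1)*bs] = B.T
Z1 = np.zeros((n, n))                # Z_1 = diag(Z_{A_1}, ..., Z_{A_n}), Eq. (22)
for i in range(nb):
    blk = slice(i*bs, (i+1)*bs)
    Z1[blk, blk] = np.linalg.inv(np.linalg.cholesky(S[blk, blk])).T
assert np.linalg.norm(Z1.T @ S @ Z1 - np.eye(n), 2) < 1   # Theorem 1
```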
APPENDIX B: ALTERNATIVE INITIAL APPROXIMATE INVERSE FACTOR

An alternative initial approximate inverse factor for the iterative refinement procedure is

    Z_0 = \sqrt{2 / (λ_max + λ_min)} I,   (B1)

where λ_max and λ_min are the largest and smallest eigenvalues of S, respectively. It follows that

    ‖Z_0^* S Z_0 − I‖_2 = ‖ 2S/(λ_max + λ_min) − I ‖_2 = (λ_max − λ_min)/(λ_max + λ_min) < 1.   (B2)

We note that with this choice of starting guess, the initial factorization error will be close to 1 for ill-conditioned systems. Iterative refinement using this starting guess is equivalent to the scaling of the overlap matrix suggested by Jansik et al.50 In the same paper, the authors propose "intermediate scaling" as an aid to further improve convergence of the iterative refinement. Intermediate scaling could also be combined with the recursive algorithm described in the present article. We claim, however, contrary to Ref. 50, that intermediate scaling works best for odd polynomial orders m, since the eigenvalue spectrum of δ is in these cases folded over itself, which makes more extensive scaling possible. This can serve as an explanation for the results presented in Ref. 50, where the intermediate scaling in fact has a larger effect for m = 1 than for m = 2 (in the notation of Ref. 50: for m = 2 than for m = 3).
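In code, the guess of Eq. (B1) is a one-liner (a sketch; in practice the exact eigenvalues would be replaced by cheap bounds, e.g., from Gershgorin's theorem or a few Lanczos steps). By Eq. (B2), its initial factorization error is (κ − 1)/(κ + 1) in terms of the condition number κ = λ_max/λ_min.

```python
import numpy as np

def scaled_identity_guess(S):
    """Starting guess Z0 of Eq. (B1)."""
    lam = np.linalg.eigvalsh(S)      # ascending eigenvalues of S
    return np.sqrt(2.0 / (lam[0] + lam[-1])) * np.eye(S.shape[0])
```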
1. R. McWeeny, Proc. R. Soc. London, Ser. A 235, 496 (1956).
2. A. H. R. Palser and D. E. Manolopoulos, Phys. Rev. B 58, 12704 (1998).
3. A. M. N. Niklasson, Phys. Rev. B 66, 155115 (2002).
4. A. M. N. Niklasson, C. J. Tymczak, and M. Challacombe, J. Chem. Phys. 118, 8611 (2003).
5. A. Holas, Chem. Phys. Lett. 340, 552 (2001).
6. D. A. Mazziotti, Phys. Rev. E 68, 066701 (2003).
7. E. H. Rubensson and H. J. A. Jensen, Chem. Phys. Lett. 432, 591 (2006).
8. H. J. Xiang, W. Z. Liang, J. Yang, J. G. Hou, and Q. Zhu, J. Chem. Phys. 123, 124105 (2005).
9. D. K. Jordan and D. A. Mazziotti, J. Chem. Phys. 122, 084114 (2005).
10. X.-P. Li, R. W. Nunes, and D. Vanderbilt, Phys. Rev. B 47, 10891 (1993).
11. J. M. Millam and G. E. Scuseria, J. Chem. Phys. 106, 5569 (1997).
12. M. Challacombe, J. Chem. Phys. 110, 2332 (1999).
13. H. Larsen, J. Olsen, P. Jørgensen, and T. Helgaker, J. Chem. Phys. 115, 9685 (2001).
14. Y. Shao, C. Saravanan, M. Head-Gordon, and C. A. White, J. Chem. Phys. 118, 6144 (2003).
15. K. R. Bates, A. D. Daniels, and G. E. Scuseria, J. Chem. Phys. 109, 3308 (1998).
16. G. W. Stewart and J. Sun, Matrix Perturbation Theory (Academic, Boston, 1990).
17. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. (The Johns Hopkins University Press, Baltimore, 1989).
18. P.-O. Löwdin, Adv. Phys. 5, 1 (1956).
19. M. Benzi, C. D. Meyer, and M. Tuma, SIAM J. Sci. Comput. 17, 1135 (1996).
20. M. Benzi, J. K. Cullum, and M. Tuma, SIAM J. Sci. Comput. 22, 1318 (2000).
21. M. Benzi, R. Kouhia, and M. Tuma, Comput. Methods Appl. Mech. Eng. 190, 6533 (2001).
22. E. H. Rubensson, E. Rudberg, and P. Sałek, J. Comput. Chem. 28, 2531 (2007).
23. A. M. N. Niklasson, Phys. Rev. B 70, 193102 (2004).
24. M. S. Paterson and L. Stockmeyer, SIAM J. Comput. 2, 60 (1973).
25. W. Liang, C. Saravanan, Y. Shao, R. Baer, A. T. Bell, and M. Head-Gordon, J. Chem. Phys. 119, 4117 (2003).
26. E. V. Haynsworth, Linear Algebra Appl. 1, 73 (1968).
27. C. Saravanan, Y. Shao, R. Baer, P. N. Ross, and M. Head-Gordon, J. Comput. Chem. 24, 618 (2003).
28. J. H. Ward, J. Am. Stat. Assoc. 58, 236 (1963).
29. M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 99, 7821 (2002).
30. S. Fortunato and M. Barthélemy, Proc. Natl. Acad. Sci. U.S.A. 104, 36 (2007).
31. D. Wilkinson and B. A. Huberman, Proc. Natl. Acad. Sci. U.S.A. 101, 5241 (2004).
32. C. P. Massen and J. P. K. Doye, Phys. Rev. E 71, 046101 (2005).
33. J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005).
34. C. O. Dorso, A. Medus, and G. Acuna, Physica A 358, 593 (2005).
35. M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 103, 8577 (2006).
36. A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
37. L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, J. Stat. Mech.: Theory Exp. 2005, P09008.
38. N. Bock, E. Holmström, and J. Brännlund, arXiv:0711.1603.
39. Y. Miyamoto, K. Nakada, and M. Fujita, Phys. Rev. B 59, 9858 (1999).
40. Y.-W. Son, M. Cohen, and S. Louie, Nature (London) 444, 347 (2006).
41. E. Rudberg, P. Sałek, and Y. Luo, Nano Lett. 7, 2211 (2007).
42. E. Rudberg, E. H. Rubensson, and P. Sałek, ERGO, version 1.5, a quantum chemistry program for large scale self-consistent field calculations (2007).
43. E. Cuthill and J. McKee, Proceedings of the 24th National Conference ACM, 1969 (unpublished), pp. 157–172.
44. The Protein Data Bank (http://www.pdb.org).
45. N. H. Andersen, C. P. Chen, T. M. Marschner, S. R. Krystek, Jr., and D. A. Bassolino, Biochemistry 31, 1280 (1992); PDB ID: 1EDP.
46. M. McCoy, E. S. Stavridi, J. L. Waterman, A. M. Wieczorek, S. J. Opella, and T. D. Halazonetis, EMBO J. 16, 6230 (1997); PDB ID: 1A1U.
47. F. Bontems, B. Gilquin, C. Roumestand, A. Ménez, and F. Toma, Biochemistry 31, 7756 (1992); PDB ID: 2CRD.
48. P. Osmark, P. Sørensen, and F. M. Poulsen, Biochemistry 32, 11007 (1993); PDB ID: 1CIS.
49. E. H. Rubensson and P. Sałek, J. Comput. Chem. 26, 1628 (2005).
50. B. Jansík, S. Høst, P. Jørgensen, and J. Olsen, J. Chem. Phys. 126, 124104 (2007).
51. J. Nocedal and S. J. Wright, Numerical Optimization (Springer-Verlag, New York, 1999).
52. M. Challacombe, Comput. Phys. Commun. 128, 93 (2000).