Fast matrix scaling, fine-coarse grid partitioning, and highly parallel preconditioning

I. Kaporin, Computing Center RAS, [email protected]
Abstract
A reordering of a symmetric positive definite (SPD) matrix A related to row-norm balancing symmetric scaling is considered. A fine/coarse grid splitting procedure is then introduced which constructs a 2 by 2 block splitting of the matrix with the leading block possessing a prescribed degree of diagonal dominance. The hierarchical permutation obtained by recursive application of the above procedures to the corresponding approximate reduced matrices (i.e. Schur complements) is then used to construct an improved version of the Inverse Incomplete Cholesky (IIC) preconditioned Conjugate Gradient method for the solution of the related linear equation system Ax = b. Numerical results are given for several hard-to-solve SPD matrices from the University of Florida collection which confirm the improvement of the IIC preconditioning quality due to the application of the hierarchical reordering.

Introduction

Below we discuss how to use the scaling of a (sparse) matrix to the doubly stochastic form for the purpose of extracting a "fine grid" from the set of the associated graph vertices. The obtained procedures are used for the construction of efficient parallelizable preconditionings for the conjugate gradient solution of linear algebraic systems
\[
A x = b. \qquad (1)
\]
Formally, the extraction of the fine grid corresponds to a 2x2 splitting of the symmetric positive definite matrix (e.g. the discrete Laplacian) which provides a well-conditioned leading block A_{11} of a possibly large size. It is observed numerically that the scaled matrix, reordered to a non-increasing main diagonal, demonstrates such remarkable properties as a smooth decrease of diagonal dominance in the successive leading matrices, as well as a tendency of the latter matrices to be of block diagonal form with very small blocks (after an appropriate reordering). For practical purposes (e.g. in multilevel preconditionings), one can fix the diagonal dominance degree and then construct the corresponding leading block satisfying this requirement. We present and analyse a fast procedure for the scaling of symmetric positive definite matrices to the doubly stochastic form, and discuss its application to grid reorderings as well as to the construction of efficient multilevel preconditionings.

This work was financially supported by the target program P-14 of the Presidium of R.A.S. and by the grant NSh-4096.2010.1.

1 Fast symmetric scaling

Scaling algorithms are very important for the processing of real-world matrices, where the magnitudes of the elements may vary widely. Proper scaling strategies for symmetric matrices and methods for their construction can be found in the papers [4, 1, 2]. The algorithm is applicable to any symmetric matrix and has the following form.

Input: symmetric matrix A; iteration number bound iscl >= 0.
(1) Initialization: d_i = 1, i = 1, ..., n.
(2) Iterations:
    for k = 1, ..., iscl:
        for i = 1, ..., n:
            s_i := sum_j |a_{ij}| d_j
        end for
        for i = 1, ..., n:
            d_i := d_i / sqrt(s_i)
        end for
    end for
(3) Exit: the (approximately) scaled matrix is A_S = D A D, where D = Diag(d).

The iterations may also be terminated by the criterion
\[
\frac{\max_i (d_i s_i)}{\min_i (d_i s_i)} \le 1 + \varepsilon, \qquad 0 < \varepsilon \ll 1. \qquad (2)
\]
Note that typically the iterations converge quickly (the left-hand side of the latter condition normally behaves as 1 + O(2^{-k}), where k is the iteration number) if the original matrix is positive definite. It can be shown that this algorithm is equivalent to the Row Simultaneous Scaling (RSS) method [2] applied to the matrix A.
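For concreteness, the following is a minimal NumPy/SciPy sketch of the above scaling iteration, including the optional termination test (2). The function name and the exact placement of the early exit are illustrative and are not part of the original implementation.

```python
import numpy as np
import scipy.sparse as sp

def symmetric_scaling(A, iscl=10, eps=1e-3):
    """Row-norm balancing scaling of a symmetric matrix (Section 1).

    Returns the vector d such that, with D = diag(d), the absolute
    row sums of D*A*D are (approximately) equalized."""
    A = sp.csr_matrix(A)
    absA = abs(A)                      # matrix of |a_ij|
    d = np.ones(A.shape[0])
    for _ in range(iscl):
        s = absA @ d                   # s_i = sum_j |a_ij| d_j
        ds = d * s
        if ds.max() <= (1.0 + eps) * ds.min():
            break                      # termination criterion (2)
        d = d / np.sqrt(s)             # d_i := d_i / sqrt(s_i)
    return d

# Usage: A_S = D A D with D = diag(d)
# d = symmetric_scaling(A); A_S = sp.diags(d) @ A @ sp.diags(d)
```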
2 Multilevel ordering via 2x2-splitting

The key component of the considered matrix-dependent multilevel ordering is the recursive construction of a 2x2-block splitting with a diagonally dominant leading block. Let the current reduced matrix be initialized as A^{(1)} = A, and its dimension be n^{(1)} = n. At the m-th level, the scaled and reordered matrix is constructed as
\[
A_{SR}^{(m)} = P^{(m)} D^{(m)} A^{(m)} D^{(m)} (P^{(m)})^T
= \begin{bmatrix} A_{11}^{(m)} & A_{12}^{(m)} \\ A_{21}^{(m)} & A_{22}^{(m)} \end{bmatrix},
\qquad A_{21}^{(m)} = (A_{12}^{(m)})^T,
\]
where n^{(m)} = n_1^{(m)} + n^{(m+1)} determines the 2x2 block splitting, the diagonal matrix D^{(m)} determines the row and column norm-balancing symmetric scaling, P^{(m)} is a permutation matrix, and A_{11}^{(m)} is a diagonally dominant matrix of order n_1^{(m)}.

2.1 Scaling and reordering of the reduced matrix

Here we use the iterative evaluation of the scaling matrix as described above in Section 1 and a two-stage construction of the permutation matrix similar to that of [2, 10]. According to (2), the scaled matrix A_S^{(m)} = D^{(m)} A^{(m)} D^{(m)} satisfies the row-sum equalizing condition
\[
\sum_{j=1}^{n^{(m)}} \bigl| (A_S^{(m)})_{ij} \bigr| = \omega_i, \qquad (3)
\]
\[
\frac{\max_i \omega_i}{\min_i \omega_i} \le 1 + \varepsilon, \qquad 0 < \varepsilon \ll 1, \qquad (4)
\]
where the small parameter \varepsilon determines the precision of the scaling.

At the next stage, a "strengthening" of the leading entries of the main diagonal is performed by the symmetric permutation A_{SR}^{(m)} = P^{(m)} A_S^{(m)} (P^{(m)})^T, which simply sorts the diagonal entries of A_S^{(m)} in non-increasing order. Finally, similar to [10], the diagonally dominant leading block is formed in the course of scanning through the main diagonal using the direct check of the condition
\[
\sum_{j \ne i} \bigl| (A_{11}^{(m)})_{ij} \bigr| \le (1 - \theta)\, \bigl| (A_{11}^{(m)})_{ii} \bigr|, \qquad 0 < \theta < 1, \qquad (5)
\]
sequentially for i = 1, 2, 3, ..., where \theta is a prescribed diagonal dominance parameter. If condition (5) is satisfied, then the current row index is included into the leading block; otherwise, it goes to the trailing block A_{22}^{(m)}. Clearly, the resulting matrix A_{SR}^{(m)} satisfies the same scaling properties (3)-(4).

2.2 Construction of the next reduced matrix

In order to construct the next level, a reduced matrix is formed. For the ease of computations, one replaces the matrix A_{11}^{(m)} by its strengthened diagonal D_{11}^{(m)}. The diagonal correction is made by adding the absolute value of each nonzero off-diagonal element |(A_{11}^{(m)})_{ij}| to the i-th and j-th diagonal positions. Hence, the matrix A_{SR}^{(m)} is transformed to \tilde A_{SR}^{(m)}, which is also positive definite and can be factorized as
\[
\tilde A_{SR}^{(m)}
= \begin{bmatrix} D_{11}^{(m)} & A_{12}^{(m)} \\ A_{21}^{(m)} & A_{22}^{(m)} \end{bmatrix}
= \begin{bmatrix} I_1 & 0 \\ L_{21} & I_2 \end{bmatrix}
  \begin{bmatrix} D_{11}^{(m)} & 0 \\ 0 & A^{(m+1)} \end{bmatrix}
  \begin{bmatrix} I_1 & U_{12} \\ 0 & I_2 \end{bmatrix},
\]
where the reduced matrix is defined as
\[
A^{(m+1)} = A_{22}^{(m)} - A_{21}^{(m)} \bigl( D_{11}^{(m)} \bigr)^{-1} A_{12}^{(m)}, \qquad (6)
\]
and is also SPD. Hence, the recursion can be continued by applying all the above described steps to A^{(m+1)}.
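To make the construction of a single level concrete, here is a SciPy sketch of the scaling, diagonal sorting, dominance scan (5), and reduced-matrix formation (6). It reuses the symmetric_scaling sketch from Section 1. The interpretation of the dominance test during the scan (restricting the off-diagonal sum to rows already accepted into the leading block) is an assumption, since the text only specifies a "direct check" of (5).

```python
import numpy as np
import scipy.sparse as sp

def split_level(A, theta=0.1, iscl=10, eps=1e-3):
    """One level of the fine/coarse 2x2 splitting of Section 2 (sketch)."""
    A = sp.csr_matrix(A)
    d = symmetric_scaling(A, iscl, eps)          # Section 1 sketch
    D = sp.diags(d)
    AS = (D @ A @ D).tocsr()                     # scaled matrix A_S

    perm = np.argsort(-AS.diagonal())            # non-increasing diagonal
    ASR = AS[perm, :][:, perm].tocsr()           # reordered matrix A_SR
    diag = ASR.diagonal()

    # Scan the diagonal, accepting row i while the dominance test (5) holds.
    # Assumption: the off-diagonal sum runs over the rows already accepted
    # into the leading block.
    fine, accepted = [], set()
    for i in range(ASR.shape[0]):
        row = ASR.getrow(i)
        off = sum(abs(v) for j, v in zip(row.indices, row.data) if j in accepted)
        if off <= (1.0 - theta) * abs(diag[i]):
            fine.append(i)
            accepted.add(i)
    fine = np.array(fine, dtype=int)
    coarse = np.array([i for i in range(ASR.shape[0]) if i not in accepted], dtype=int)

    A11 = ASR[fine, :][:, fine]
    A12 = ASR[fine, :][:, coarse]
    A21 = ASR[coarse, :][:, fine]
    A22 = ASR[coarse, :][:, coarse]

    # "Strengthened" diagonal of A11: each |off-diagonal| entry is added to the
    # two corresponding diagonal positions; for the scaled SPD block this
    # equals the absolute row sum of A11.
    d11 = np.asarray(abs(A11).sum(axis=1)).ravel()
    A_next = (A22 - A21 @ sp.diags(1.0 / d11) @ A12).tocsr()   # eq. (6)
    return perm, fine, coarse, A_next
```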
3 The use of multilevel reordering in IIC

As a result of the procedure formulated in Section 2, a symmetric permutation
\[
A_R = P A P^T \qquad (7)
\]
of the original matrix A is constructed (accompanied by its block splitting into levels). This permutation is then used in the construction of the so-called Incomplete Inverse Cholesky (IIC) preconditioning, which is defined by the decomposition G A_R G^T = I_n + E_G, where G is a nonsingular lower triangular matrix (e.g., with a prescribed sparsity pattern) and E_G is the IIC error matrix, which is "small" in a certain appropriate sense (e.g., E_G can be equal to a small-norm matrix plus a low-rank matrix). Some preliminary results for matrices related to finite-difference operators on square grids can be found in [8], where such "hierarchical" permutations were constructed analytically by "geometrical" reasoning. The corresponding preconditionings (based on the use of sparser rows for the initial levels in the ordering) have shown very competitive quality in numerical tests. However, using the multilevel reordering presented above, one can apply the IIC in a black-box mode. Thus, one can construct a general-purpose SPD solver which possesses a very high potential for parallelization, since the IIC-CG iterations involve only elementary vector operations and matrix-vector multiplications.

3.1 The construction of the IIC preconditioner

The preconditioner H \approx A_R^{-1} can be written in the factorized form H = G^T G, where G is a nonsingular lower triangular matrix with positive diagonal and prescribed sparsity pattern. In [6, 8] the following construction of the point Incomplete Inverse Cholesky factorization was proposed. Let the positions of the structurally nonzero elements of the i-th row of G be
\[
(G)_{i, j_i(1)}, \; \ldots, \; (G)_{i, j_i(m_i)}, \qquad 1 \le i \le n,
\]
where 1 \le j_i(1) < \ldots < j_i(m_i) = i is the i-th list of the corresponding column indices. Then the minimum of the special K-condition number K(H A_R), where
\[
K(M) = \bigl( n^{-1}\, \mathrm{trace}\, M \bigr)^n / \det M, \qquad (8)
\]
is attained with
\[
(G)_{i, j_i(p)} = \frac{(S_i^{-1})_{p,\, m_i}}{\sqrt{(S_i^{-1})_{m_i,\, m_i}}}, \qquad p = 1, \ldots, m_i.
\]
Here S_i is the m_i \times m_i principal submatrix of A_R determined by the i-th subset of column indices, that is,
\[
(S_i)_{p,q} = (A_R)_{j_i(p),\, j_i(q)}, \qquad 1 \le p \le m_i, \quad 1 \le q \le m_i.
\]
Note that by the theoretical results of [6, 8], the K-optimality of the IIC preconditioning corresponds to the minimization of an appropriate upper bound on the number of PCG iterations.
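A direct, dense-per-row sketch of this construction is given below: each row of G is obtained by solving a small SPD system with the principal submatrix S_i. The function name is illustrative, and a practical code would use a Cholesky factorization of S_i and exploit sparsity instead of forming dense blocks.

```python
import numpy as np
import scipy.sparse as sp

def iic_factor(AR, pattern):
    """Row-wise K-optimal IIC factor G with G*AR*G^T ~ I (Section 3.1).

    pattern[i] is the sorted index list j_i(1) < ... < j_i(m_i) = i of the
    allowed (lower triangular) nonzero columns of row i."""
    A = sp.csr_matrix(AR)
    n = A.shape[0]
    rows, cols, vals = [], [], []
    for i in range(n):
        J = list(pattern[i])                  # last entry must equal i
        S = A[J, :][:, J].toarray()           # principal submatrix S_i
        e = np.zeros(len(J)); e[-1] = 1.0
        z = np.linalg.solve(S, e)             # z = S_i^{-1} e_{m_i}
        g = z / np.sqrt(z[-1])                # (G)_{i,j_i(p)} = z_p / sqrt(z_{m_i})
        rows += [i] * len(J)
        cols += J
        vals += list(g)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
```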
3.2 Multilevel IIC structure

Obviously, the quality of the IIC preconditioning may essentially depend on the reordering (7) of the matrix A. It appears that the "natural" orderings are far from optimal when used with the IIC technique. Thus we propose, first, to use the multilevel orderings described above and, second, to use denser rows of G for the nodes belonging to higher levels.

The potential advantage of the multilevel reorderings can be seen already for matrices of order n = 3. Let us fix the sparsity structure of G to be the same as that of the lower triangle of the factorized matrix (A or A_R, respectively). If A is tridiagonal with \alpha \ne 0 and A_R is its reordering to the arrowhead form, that is,
\[
A = \begin{bmatrix} 1 & \alpha & 0 \\ \alpha & 1 & \alpha \\ 0 & \alpha & 1 \end{bmatrix},
\qquad
A_R = \begin{bmatrix} 1 & 0 & \alpha \\ 0 & 1 & \alpha \\ \alpha & \alpha & 1 \end{bmatrix},
\]
then the IIC error term is always nonzero in the first case, and is exactly zero in the second case.

However, it was observed in numerical tests that any artificial enforcement of a dependence between the density of a row and the corresponding level number (e.g. of the type used in [8]) is not efficient. Instead, the following IIC(\tau) construction is proposed (cf. [7]), which allows one to considerably reduce the complexity of the preconditioner application while keeping the preconditioning quality (in terms of the reduction of the K-condition number (8)) almost the same. Let the column indices of the i-th row of G be the same as those of the lower triangle of the matrix A_R = P A P^T. (If A_R is very sparse, then the sparsity structure of the squared matrix A_R^2 can be used, etc.) First, the corresponding IIC factor G is evaluated. Second, one uses a small parameter 0 < \tau < 1 in order to truncate the initial sparsity structure according to the criterion
\[
|(G)_{ij}| < \tau\, (G)_{ii}, \qquad (9)
\]
i.e., the updated sparsity structure is obtained by excluding the nonzero positions (i, j) satisfying (9). Finally, the obtained sparsity structure is used to recalculate the IIC factor G. Such a preconditioning will be further referred to as IIC(\tau).

It appears that the truncated sparsity structure of G has a clear tendency to contain denser rows as the level number grows. A very typical situation is shown in Fig. 1, where the number of nonzeroes in a row is plotted versus the row number for the IIC(0.01) factorization applied to the matrix "bcsstk17".
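A sketch of the IIC(\tau) truncation just described: the factor is built on the initial pattern, entries failing criterion (9) are discarded, and the factor is recomputed on the reduced pattern. It reuses the iic_factor sketch from Section 3.1; the helper name is, again, illustrative.

```python
def iic_tau(AR, pattern, tau=0.01):
    """Two-pass IIC(tau) construction (Section 3.2, criterion (9))."""
    G = iic_factor(AR, pattern).tocsr()
    diag = G.diagonal()
    truncated = []
    for i in range(G.shape[0]):
        row = G.getrow(i)
        # drop positions (i, j) with |G_ij| < tau * G_ii, always keep the diagonal
        keep = [j for j, v in zip(row.indices, row.data)
                if j == i or abs(v) >= tau * diag[i]]
        truncated.append(sorted(keep))
    return iic_factor(AR, truncated)
```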
4 Numerical testing of the multilevel IIC-CG solver

Numerical testing of the method has been done using a subset of 24 symmetric positive definite matrices of size n > 10000 taken from the University of Florida sparse matrix collection [11]. Note that many of these test problems cannot be efficiently solved by the existing standard preconditioned iterative methods. Below we describe the test matrix set used for the numerical experiments, define the solution statistics used to measure the solver performance, and present the results obtained with the multilevel IIC-CG solver.

4.1 Test problem set

First we describe some general properties of the subset of 24 coefficient matrices A (see Table 1) chosen from the University of Florida collection [11] (further referred to as the "UFL subset"). The University of Florida collection contains more than 1600 samples, so we made our choice using at least one of the following three criteria: (a) the matrix has already been used in earlier publications on the topic; (b) the matrix determines a hard-to-solve linear system; (c) the system can be solved by our IIC-CG method in a moderate time (less than a couple of minutes) and within the available workspace (limited to 2 Gbytes).

In Table 1, some basic statistics on the test matrices are given: "n" is the order of the coefficient matrix A; "nz(A)" is the total number of its nonzero elements; "||A_S||" is the spectral norm of the diagonally scaled coefficient matrix A_S = D_A^{-1/2} A D_A^{-1/2}; "cond(A_S)" is the spectral condition number of the diagonally scaled matrix, i.e. cond(A_S) = \lambda_{max}(A_S)/\lambda_{min}(A_S). The latter two quantities serve as an indicator of whether problem (1) with A is really hard to solve.

Remark 4.1 The data presented in the last two columns of Table 1 were obtained using the J-CG method with the fine residual-norm stopping criterion ||r_k|| <= 10^{-12} ||r_0||.

Table 1. A subset of SPD matrices from the UFL collection
matrix        n        nz(A)     ||A_S||   cond(A_S)
bodyy6        19366    134208     2.00     0.18E+05
bcsstk18      11948    149090     4.63     0.56E+05
bcsstk25      15439    252241     4.53     0.49E+08
cvxbqp1       50000    349968     3.00     0.13E+07
bcsstk17      10974    428650     6.70     0.16E+07
gridgena      48962    512084     2.09     0.14E+06
apache1       80800    542184     2.00     0.11E+07
g2_circuit   150102    726674     2.00     0.23E+06
olafu         16146   1015156    15.21     0.43E+09
gyro_k        17361   1021159     4.55     0.29E+09
bcsstk36      23052   1143140     8.02     0.12E+10
msc23052      23052   1142686     8.04     0.13E+10
msc10848      10848   1229776     9.11     0.14E+07
raefsky4      19779   1316789     5.55     0.22E+12
cfd1          70656   1825580     6.80     0.34E+06
oilpan        73752   2148558     5.92     0.11E+09
vanbody       47072   2329056     9.54     0.37E+09
ct20stif      52329   2600295     6.68     0.44E+09
nasasrb       54870   2677324     5.92     0.28E+08
cfd2         123440   3085406     7.69     0.15E+07
s3dkt3m2      90449   3686223     5.10     0.31E+11
thread        29736   4444880    19.79     0.90E+10
s3dkq4m2      90449   4427725     4.14     0.17E+11
apache2      715176   4817870     2.00     0.13E+07

In Table 2, the iteration results obtained with the standard J-CG solver are given, which characterize the complexity of solving these problems by the simplest (and readily parallelizable) iterative method. In Table 3, we present the iteration results obtained with IC2(\tau)-CG, one of the best general-purpose SPD iterative solvers [9]. Note, however, that the code implementing the corresponding incomplete factorization preconditioning cannot be efficiently parallelized. In both cases, the original matrix A was reordered to reduce its bandwidth. Such preprocessing increases the megaflops rate of the matrix-vector multiplication and (normally) improves the performance of the IC2 preconditioning.

Hereafter, the null initial guess (x_0 = 0) and the stopping criterion
\[
\| r_{it} \| < 10^{-8} \| r_0 \| \qquad (10)
\]
were used in the preconditioned CG iterations. In many cases, the test problem was supplied only as the coefficient matrix without any test right-hand side. In order to process the data uniformly, in the remaining cases we ignored the supplied right-hand side as well. Therefore, each test problem (1) was solved with the right-hand side b = A x_*, where the test solution vector was defined as x_*(i) = i/n, i = 1, ..., n.
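The test setup can be summarized by the following sketch of a preconditioned CG loop with the zero initial guess, the stopping rule (10), and the synthetic right-hand side b = A x_* described above; applying the IIC preconditioner H = G^T G amounts to two sparse matrix-vector products. This is an illustrative reimplementation, not the original Fortran code.

```python
import numpy as np

def pcg(A, b, apply_prec=lambda r: r, rtol=1e-8, maxiter=200000):
    """Minimal preconditioned CG with stopping rule ||r_k|| < rtol * ||r_0||."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    r0 = np.linalg.norm(r)
    for k in range(1, maxiter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < rtol * r0:
            return x, k
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# Synthetic test problem: b = A x_* with x_*(i) = i/n and zero initial guess.
# n = A.shape[0]
# x_star = np.arange(1, n + 1) / n
# b = A @ x_star
# x, iters = pcg(A, b, apply_prec=lambda r: G.T @ (G @ r))   # H = G^T G
```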
4.2 Computational environment
Table 3. Iteration results for the IC2(0.01)-CG method
matrix      dens.   #iters.   #flops      time
bodyy6      0.766       5     0.28E+07      0.03
bcsstk18    1.297      56     0.31E+08      0.31
bcsstk25    1.465     402     0.31E+09      2.00
cvxbqp1     3.117     287     0.61E+09      5.33
bcsstk17    1.128      88     0.11E+09      0.75
gridgena    1.485     212     0.36E+09      2.11
apache1     1.588     156     0.34E+09      2.45
g2_circuit  2.796      75     0.34E+09      3.00
olafu       0.723     547     0.11E+10      5.27
gyro_k      1.218     554     0.15E+10      8.83
bcsstk36    1.152    1090     0.29E+10     15.59
msc23052    1.142    1069     0.29E+10     14.97
msc10848    1.078     172     0.64E+09      5.44
raefsky4    1.092     783     0.24E+10     13.23
cfd1        1.807     193     0.14E+10     13.02
oilpan      1.497     673     0.42E+10     24.31
vanbody     1.035     322     0.18E+10     11.05
ct20stif    1.117     139     0.11E+10      9.25
nasasrb     0.954     582     0.34E+10     17.66
cfd2        1.952     181     0.23E+10     19.91
s3dkt3m2    1.035    1698     0.14E+11     73.50
thread      0.618    5340     0.40E+11    162.56
s3dkq4m2    0.979    1456     0.14E+11     69.02
apache2     2.010     263     0.56E+10     37.05
For the calculations, a desktop computer with an AMD Athlon 64 X2 Dual Core Processor 4200+ (2.21 GHz, 2 Gbytes main memory) was used under the MS Windows XP (v.2002, Service Pack 2) operating system with the Intel Fortran Compiler 6.0 for Windows. The compilation options /Ox /G6 /Qxi /Qip *.for were specified on the command line used to build the program. The average performance of the Fortran code implementing the IIC-CG solver was typically near 120 Mflops, where one flop corresponds to the scalar multiply-add pair z = a*x + y.
4.3 Solution results and discussion

Now we present the results obtained with the IIC-CG iterative method, which possesses both fast convergence and good parallelizability. For the solution, the PCG method was used with the preconditioning defined by the multilevel IIC factorization of the matrix P A P^T,
\[
G (P A P^T) G^T = I_n + E_{P,G},
\]
where E_{P,G} is the error term. The IIC factorization truncation parameter was fixed as \tau = 0.01, and the initial sparsity structure for G was chosen the same as that of the lower triangle of P A P^T (see Sect. 3.2).

The multilevel ordering parameters were set as follows: the scaling tolerance parameter \varepsilon = 0.001 in (4); the scaling iteration number limit iscl = 10; the diagonal dominance parameter \theta = 0.1 in (5).
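For illustration, the following driver wires the earlier sketches together with the parameter values just quoted (\varepsilon = 0.001, iscl = 10, \theta = 0.1, \tau = 0.01). The recursion termination rule (min_size) and the way the per-level permutations are composed are assumptions; only the parameter values themselves are taken from the text.

```python
import numpy as np
import scipy.sparse as sp

def multilevel_permutation(A, theta=0.1, iscl=10, eps=1e-3, min_size=100):
    """Recursively apply split_level and concatenate the fine-grid indices."""
    n = A.shape[0]
    if n <= min_size:
        return np.arange(n)
    perm, fine, coarse, A_next = split_level(A, theta, iscl, eps)
    sub = multilevel_permutation(A_next, theta, iscl, eps, min_size)
    # level-m fine nodes first, then the recursively ordered remaining nodes
    return np.concatenate([perm[fine], perm[coarse][sub]])

def lower_triangle_pattern(AR):
    """Column-index lists of the lower triangle (diagonal included)."""
    L = sp.tril(sp.csr_matrix(AR)).tocsr()
    return [sorted(L.getrow(i).indices.tolist()) for i in range(AR.shape[0])]

# p = multilevel_permutation(A, theta=0.1, iscl=10, eps=1e-3)
# AR = sp.csr_matrix(A)[p, :][:, p]
# G = iic_tau(AR, lower_triangle_pattern(AR), tau=0.01)
# x, iters = pcg(AR, b[p], apply_prec=lambda r: G.T @ (G @ r), rtol=1e-8)
```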
Table 4. IIC(0.01)-CG method: original ordering
matrix      dens.   #iters.   #flops      time
bodyy6      0.682     361     0.12E+09      0.76
bcsstk18    0.693     345     0.11E+09      0.81
bcsstk25    0.640    3693     0.19E+10     11.35
cvxbqp1     0.986    2499     0.25E+10     20.81
bcsstk17    0.780     592     0.52E+09      4.11
gridgena    0.554    2341     0.25E+10     14.67
apache1     0.743     980     0.14E+10      8.84
g2_circuit  0.986     721     0.17E+10     11.87
olafu       0.617    1733     0.31E+10     21.17
gyro_k      0.725    2819     0.54E+10     35.46
bcsstk36    0.742    1790     0.39E+10     25.79
msc23052    0.736    1652     0.36E+10     26.03
msc10848    0.746     755     0.23E+10     37.48
raefsky4    0.796    2702     0.69E+10     43.79
cfd1        0.892     886     0.35E+10     23.50
oilpan      0.721    1296     0.54E+10     36.20
vanbody     0.673    6252     0.26E+11    158.56
ct20stif    0.724    2863     0.14E+11     89.23
nasasrb     0.718    2464     0.12E+11     76.20
cfd2        0.902    1505     0.10E+11     64.01
s3dkt3m2    0.599    4652     0.30E+11    174.26
thread      0.448    4111     0.29E+11    229.43
s3dkq4m2    0.604    4391     0.34E+11    195.62
apache2     0.871    1886     0.25E+11    156.15
Table 2. Iteration results for the J-CG method
matrix      #iters.   #flops      time
bodyy6         553    0.16E+09      0.81
bcsstk18       841    0.21E+09      0.98
bcsstk25      7968    0.30E+10     12.69
cvxbqp1       2621    0.20E+10     11.72
bcsstk17      2596    0.13E+10      4.39
gridgena      2986    0.27E+10     15.28
apache1       1171    0.14E+10      8.28
g2_circuit    1059    0.20E+10     13.75
olafu        24220    0.28E+11    118.53
gyro_k        6210    0.72E+10     22.08
bcsstk36     15315    0.20E+11     65.01
msc23052     13543    0.18E+11     57.72
msc10848      3333    0.44E+10     12.27
raefsky4       290    0.43E+09      1.39
cfd1          1621    0.39E+10     16.00
oilpan       17681    0.48E+11    233.91
vanbody       5742    0.16E+11     52.61
ct20stif      1563    0.47E+10     16.48
nasasrb       8557    0.27E+11     89.31
cfd2          3171    0.13E+11     61.01
s3dkt3m2     46567    0.21E+12    725.91
thread      162625    0.76E+12   2119.34
s3dkq4m2     34519    0.18E+12    793.30
apache2       3672    0.39E+11    243.42
In Tables 3-5 we list the following performance measures for the solver:
dens.: the preconditioner density (G G^T ~ A in the IC2 case), defined as nz(G)/nz(D_A + U_A);
#iters.: the iteration number corresponding to the stopping criterion (10);
#flops: the total number of scalar multiply-add operations z = a*x + y;
time: the total solution time in seconds.

Table 5. IIC(0.01)-CG method: multilevel ordering
matrix      dens.   #iters.   #flops      time
bodyy6      0.683     345     0.12E+09      0.82
bcsstk18    0.556     358     0.11E+09      0.90
bcsstk25    0.490    2940     0.14E+10     10.20
cvxbqp1     0.988    2339     0.23E+10     29.93
bcsstk17    0.733     553     0.49E+09      4.57
gridgena    0.554    2064     0.22E+10     15.20
apache1     0.711     682     0.95E+09      6.81
g2_circuit  0.985     730     0.17E+10     14.46
olafu       0.608    1562     0.29E+10     27.23
gyro_k      0.749    1910     0.39E+10     37.00
bcsstk36    0.616    1165     0.24E+10     24.71
msc23052    0.615    1287     0.27E+10     26.73
msc10848    0.749     482     0.21E+10     41.39
raefsky4    0.777    1538     0.40E+10     40.11
cfd1        0.896     773     0.31E+10     45.03
oilpan      0.762     868     0.38E+10     47.67
vanbody     0.609    3699     0.15E+11    178.38
ct20stif    0.654    2763     0.13E+11    163.60
nasasrb     0.618    2018     0.96E+10    109.57
cfd2        0.911    1199     0.81E+10    139.32
s3dkt3m2    0.710    3864     0.27E+11    239.45
thread      0.448    2156     0.16E+11    229.01
s3dkq4m2    0.563    3310     0.25E+11    244.21
apache2     0.838    1272     0.17E+11    126.70

In Tables 4 and 5, the iteration results are given for the IIC(0.01)-CG method with the original ordering (denoted IICO) and with the multilevel ordering of A (denoted IICM), respectively. It is clearly seen that:

(a) The IIC-CG methods are much more efficient than the J-CG method with respect to the total number of operations. Moreover, since the IIC-CG iteration number is also greatly reduced compared to J-CG, the former may be more efficient in a parallel implementation due to the proportionally smaller number of (global) scalar product operations.

(b) The IIC-CG methods can be competitive even with the IC2-CG method with respect to the total number of operations, which is especially clear for the matrices "bcsstk36", "msc23052", "oilpan", and, especially, "thread". At the same time, the IC2-CG method "as is" cannot be efficiently parallelized.

(c) The somewhat larger solution times in Table 5 (as compared to Table 4) can be explained by an imperfect implementation of IIC-CG with the multilevel reordering. In the code, the coefficient matrix is used in the reordered form, which results in a considerable reduction of the megaflops rate of the matrix-vector multiplication (due to a much higher rate of cache misses). One may expect an essential acceleration of the computations if the IIC-CG code is rewritten to reduce the bandwidth of A (with the corresponding reordering of G).

As is clearly seen from Tables 4 and 5, the new multilevel construction of the IIC preconditioning makes it possible to further reduce the space occupied by the preconditioner. At the same time, this new version considerably improves the preconditioning quality, so that the new method has a smaller number of iterations and a lower arithmetic complexity in almost all cases considered here (except for the IC2 preconditioning, which is not suitable for parallel implementation).

In Figs. 2-4, the most important data from the above tables are visualized to illustrate the relative performance of the different preconditionings. Note also that all the matrices were processed by the same general-purpose IIC algorithms with default tuning.

Table 6. Comparison of the CG total solution time (in seconds) for the ARMS, AMG, and multilevel IIC(0.01) preconditionings
matrix        ARMS       AMG      IICM
bodyy6        74.7       0.1       0.8
bcsstk18       2.2      11.2       0.9
bcsstk25      10.0       3.4      10.2
cvxbqp1      643.8     287.9      29.9
bcsstk17      28.3     527.3       4.6
gridgena      14.5       3.4      15.2
gyro_k       174.7      20.9      37.0
bcsstk36    2323.1    2835.1      24.7
msc23052    2425.8    2320.9      26.7
msc10848   10381.7      24.2      41.4
cfd1         523.8    4370.8      45.0
oilpan       983.7    1317.5      47.7
vanbody      480.4    2859.8     178.4
nasasrb     failed    3596.0     109.6

Table 7. Comparison of the CG iteration numbers for the ARMS, AMG, and multilevel IIC(0.01) preconditionings
matrix        ARMS      AMG      IICM
bodyy6        8385        3       345
bcsstk18       497      165       358
bcsstk25       168      162      2940
cvxbqp1      10694     2018      2339
bcsstk17       705     4290       553
gridgena       274       81      2064
gyro_k         946      398      1910
bcsstk36      5648     4486      1165
msc23052      5754     3357      1287
msc10848     19829      433       482
cfd1          1198     1426       773
oilpan        1694     3938       868
vanbody       2526     1413      3699
nasasrb     failed     4711      2018
Figure 1: Number of nonzeroes per row in the IIC factor of the "bcsstk17" matrix in the multilevel reordering
Figure 3: Number of iterations for 24 problems
Figure 2: Relative space used by the preconditioner for 24 problems

Figure 4: Number of arithmetic operations for 24 problems
4.4 Comparison with other multilevel preconditionings
We now compare the performance of the IICM(0.01)-CG method with the data for the ARMS and AMG preconditionings given in [12] (Tables 9 and 10 there, respectively). In the latter paper, the tests were run on a Dual Intel Xeon 3.0 GHz (2 GB RAM) computer, which allows for a direct comparison of timings. Note that such methods as ARMS (Algebraic Recursive Multilevel Solver, [12, 13]) or AMG (Algebraic MultiGrid, [12, 14]) are often referred to as the best ones for the solution of linear algebraic systems arising in the corresponding application areas.

In Tables 6 and 7 we reproduce the total solution times and the CG iteration numbers, respectively. It is clearly seen that our version of the IIC preconditioning may outperform these ARMS and AMG implementations by a factor of up to 100 and more. The iteration count for the IIC preconditioning is also quite competitive. Note that the frequently mentioned convergence of multigrid methods in about 10 iterations independent of the problem size can be observed only for a certain restricted class of matrices, such as discrete Laplacians. Clearly, such problems as "cvxbqp1", "bcsstk36", "msc23052", "oilpan", etc. are not recognised by these algorithms as easy-to-solve ones.

4.5 Conclusions

Delivering a reasonable compromise between space overhead, iteration number, and parallelizability, the Multilevel Incomplete Inverse Cholesky preconditioned CG method can be considered a promising candidate for implementation as a highly parallel general-purpose SPD linear solver. Moreover, with properly optimized matrix-vector multiplication kernels, this solver can be competitive even for sequential computations.

References

[1] O. E. Livne and G. H. Golub, Scaling by Binormalization. Numer. Alg., 35, 97-120, 2004.
[2] I. Kaporin, Scaling, Reordering, and Diagonal Pivoting in ILU Preconditionings. Russian Journal of Numerical Analysis and Mathematical Modelling, 22, no. 4, 341-375, 2007.
[3] M. H. Schneider and S. A. Zenios, A comparative study of algorithms for matrix balancing. Operations Research, 38, 439-455, 1990.
[4] D. Ruiz, A Scaling Algorithm to Equilibrate Both Rows and Columns Norms in Matrices. Rutherford Appleton Laboratory Tech. Rep. RAL-TR-2001-034, Oxon, UK, September 10, 2001.
[5] P. R. Amestoy, I. S. Duff, D. Ruiz, and B. Ucar, A Parallel Matrix Scaling Algorithm. In: High Performance Computing for Computational Science - VECPAR 2008, Lecture Notes in Computer Science, vol. 5336, 2008, 301-313.
[6] I. Kaporin, On preconditioned conjugate-gradient method for solving discrete analogs of differential problems. Diff. Equat., 1990, 26(7), 897-906.
[7] I. E. Kaporin, Explicitly preconditioned conjugate gradient method for the solution of nonsymmetric linear systems. Int. J. Computer Math., 40, 1992, 169-187.
[8] I. Kaporin, New convergence results and preconditioning strategies for the conjugate gradient method. Numer. Linear Algebra Appl., 1994, 1(2), 179-210.
[9] I. E. Kaporin, High quality preconditioning of a general symmetric positive definite matrix based on its U^T U + U^T R + R^T U decomposition. Numer. Linear Algebra Appl., 1998, 5, 484-509.
[10] I. E. Kaporin, Multilevel ILU preconditionings for general unsymmetric matrices. In: Numerical Geometry, Grid Generation, and High Performance Computing (V. A. Garanzha, Yu. G. Evtushenko, B. K. Soni, and N. P. Weatherill, eds.), Procs. Int. Conf. NUMGRID/VORONOI-2008, Moscow, 10-13 June 2008, 150-157.
[11] T. A. Davis, University of Florida sparse matrix collection, 2008. http://www.cise.ufl.edu/research/sparse/matrices
[12] S. MacLachlan and Y. Saad, A greedy strategy for coarse-grid selection. SIAM J. Sci. Comput., 2007, 29(5), 1825-1853.
[13] Y. Saad and B. Suchomel, ARMS: An algebraic recursive multilevel solver for general sparse linear systems. Numer. Linear Algebra Appl., 2002, 9, 359-378.
[14] J. W. Ruge and K. Stuben, Algebraic multigrid (AMG). In: Multigrid Methods (S. F. McCormick, ed.), Frontiers Appl. Math., vol. 3, SIAM, Philadelphia, 1987, 73-130.