A Parallel Preconditioned Conjugate Gradient Package for Solving Sparse Linear Systems on a Cray Y-MP*

Michael A. Heroux, Phuong Vu, Chao Yang
Mathematical Software Research Group, Cray Research, Inc.
655F Lone Oak Drive, Eagan, MN 55121
Abstract
In this paper we discuss current activities at Cray Research to develop general-purpose, production-quality software for the efficient solution of sparse linear systems. In particular, we discuss our development of a package of iterative methods that includes Conjugate Gradient and related methods (GMRES, ORTHOMIN and others) along with several preconditioners (incomplete Cholesky and LU factorization and polynomial). Vector and parallel performance issues are discussed as well as package design. Also, benchmarks on a wide variety of real-life problems are presented to assess the robustness and performance of the methods implemented in our software. For symmetric positive definite problems, we also compare the performance of the preconditioned conjugate gradient code with our parallel implementation of the multifrontal method for sparse Cholesky factorization.
1 Introduction

The efficient solution of a linear system, Ax = b, where A is a large, sparse matrix, is an important and costly step in a variety of engineering and scientific applications. The preconditioned conjugate gradient method [10, 19, 8] and extensions of this method [25, 27, 23, 22] have been very successful at solving certain classes of these problems. If the matrix A is a banded or block matrix or has some other structure, then efficient implementations of these methods exploit the matrix structure to achieve very good performance [3, 4, 15, 14]. However, if there is no known structure to the matrix, then performance can be severely degraded if the proper data structures are not chosen. In an attempt to provide easy access to a variety of iterative methods which are efficiently implemented, we have developed a package of preconditioned conjugate gradient methods which are written to provide good performance for general sparse matrices on shared memory, vector/parallel machines such as a Cray Y-MP. For the purposes of this paper, we call this package CrayPCG. The iterative methods implemented in CrayPCG have three basic types of operations which consume most of the computing time. They are:
*To appear in IMACS Journal of Applied Numerical Mathematics.
1. Sparse matrix times a vector.

2. Setup and solution of sparse triangular systems.

3. Vector updates and dot products.

Type 1 operations are of the form y = Ax or y = A^T x, where A is a square, sparse matrix. Type 2 operations involve computing triangular matrices L and U and solving systems of the form x = L^{-1}b or x = U^{-1}b, where L and U are lower and upper triangular matrices, respectively, both being sparse. Type 3 operations are of the form y = αx + y and β = (x, y), where α and β are scalars and x and y are vectors.

One should note several things about these three types of operations. First, type 3 operations are independent of the data structures used to store the sparse matrices A, L and U. Also, type 3 operations are dense linear algebra operations which perform very well on vector machines and which, for large problems, can be executed in parallel via "stripmining" (a small sketch of stripmined type 3 kernels is given at the end of this section). For vector updates, stripmining involves partitioning the vectors x and y into subvectors x_1, x_2, ... and y_1, y_2, ... and computing y_i = αx_i + y_i, i = 1, 2, ..., in parallel. How x and y are partitioned depends on the problem size, the number of processors available and whether or not the user is in a dedicated or multi-user environment. For large enough problems, since the y_i subvectors can be updated completely independently, one should see nearly linear speedup for this type of operation. For dot products, stripmining involves the same type of partitioning, but now we must compute partial sums β_i = (x_i, y_i) which must be added together to obtain β. On a vector computer this implies some overhead because typically the computation of each β_i requires the same amount of scalar arithmetic as a single-processor computation of β. Thus, if we use eight processors we are doing at least eight times more scalar arithmetic than if we used a single processor. However, for moderate to large sized problems, this overhead is relatively small and significant speedup can be obtained.

Unlike type 3 operations, the performance and implementation of type 1 and type 2 operations depend greatly on the sparsity patterns of the matrices and the data structures used to store them. Because of this, these operations are the most difficult to optimize and by far the most expensive steps of the solution process. Thus, most of the effort in optimizing CrayPCG was put into developing good vector and parallel implementations of type 1 and type 2 operations. Our implementations of these operations are for general sparse matrices and make no assumptions about the structure. It is possible to improve performance for type 1 and type 2 operations if the matrices have some special property such as being banded or block. In these cases, using a data structure which exploits the special property can be very beneficial. Although we do not presently support any specialized data structures, we have designed our package with the capability of easily replacing our routines for type 1 and type 2 operations with routines written by the user.

In this paper we describe the features of CrayPCG and discuss our implementation of the key steps of the iterative process. In particular, we discuss the vector and parallel performance of the iterative, matrix-vector product and preconditioning routines. Timing results from an 8-processor Cray Y-MP are given using problems from the Harwell-Boeing test set [5].
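As a concrete illustration of the type 3 kernels, the following C sketch stripmines a vector update and a dot product. It is only a sketch under stated assumptions: OpenMP is used as a stand-in for the Cray autotasking directives the package actually relied on, and the routine names are invented for this example.

    #include <stddef.h>

    /* Stripmined "type 3" kernels: the vectors are cut into contiguous
     * strips and each strip is handled by one processor.  OpenMP is used
     * only as a stand-in for Cray autotasking; names are illustrative.   */

    /* y = alpha*x + y, strip by strip */
    void axpy_strip(size_t n, double alpha, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* beta = (x, y): each strip produces a partial sum beta_i which is
     * then reduced into the global result.                              */
    double dot_strip(size_t n, const double *x, const double *y)
    {
        double beta = 0.0;
        #pragma omp parallel for reduction(+:beta) schedule(static)
        for (size_t i = 0; i < n; i++)
            beta += x[i] * y[i];
        return beta;
    }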
2 Description of CrayPCG

CrayPCG provides a core package of optimized preconditioned conjugate gradient (PCG) and PCG-related iterative methods along with a selection of preconditioners. A variety of methods and preconditioners is provided in an attempt to compensate for the lack of robustness of any single method and preconditioner. We provide six iterative methods in CrayPCG. Each method is a well-known, proven technique for certain classes of matrices. The algorithms we use are based in part on those used in SLAP, a general-purpose iterative package [26]. Throughout this paper we will refer to these six methods by the corresponding three-letter abbreviations listed below.
Available Iterative Methods

    PCG  Preconditioned Conjugate Gradient Method.
    CGN  Conjugate Gradient method applied to the equations A A^T y = b, x = A^T y (Craig's Method).
    BCG  Bi-conjugate Gradient Method.
    CGS  (Bi)-conjugate Gradient Squared Method.
    OMN  Orthomin/Generalized Conjugate Residual (GCR) Method.
    GMR  Generalized Minimum Residual (GMRES) Method.

Two basic types of preconditioning are available in CrayPCG. The first type is explicit scaling of the linear system by a diagonal matrix. This type of preconditioning can be applied directly since it does not disturb the structure of the matrix. Also, it is inexpensive to apply and almost always improves the convergence rate. We will refer to this type of preconditioning simply as "scaling." Six types of scaling are available.
Available Scaling Options
1. Symmetric diagonal.
2. Symmetric row-sum.
3. Symmetric column-sum.
4. Left diagonal.
5. Left row-sum.
6. Right column-sum.

The second type of preconditioning is the usual implicit preconditioning where the preconditioned linear system is never explicitly formed but the preconditioner is applied at each iteration. We will refer to this type of preconditioning simply as "preconditioning." Five types of preconditioning are available.
Available Preconditioners
1. Implicit diagonal scaling.
2. Incomplete Cholesky factorization.
3. Incomplete LU factorization.
4. Neumann polynomial preconditioning.
5. Least-squares polynomial preconditioning.
3 Implementation of the Iterative Methods

This section describes our implementation of the basic iterative schemes. We discuss how we use reverse communication to make the iterative routines independent of the data structures needed to perform the preconditioning and matrix-vector (MV) product. We also discuss how we exploit parallelism in these routines.
3.1 Use of Reverse Communication
In most iterative schemes the application of the preconditioner and the computation of the MV product are by far the most time-consuming parts of the solution process. In comparison, the vector update and dot product operations are of minor importance and consume only a small fraction of the computation time. Also, these operations are independent of the data structures used by the preconditioner and matrix-vector product. Thus, it is advantageous to separate the vector update and dot product operations from the preconditioning and MV operations to allow greater flexibility. By doing this, the preconditioning and MV routines could be replaced without modifying the iterative routine. Similarly, the same preconditioner or MV routine could be used for several different iterative routines without major changes.

One approach to providing this kind of "open architecture" is called direct communication. With direct communication, the user is told the calling sequences of the preconditioning and MV routines called by the iterative routine. Then the user can load their own routines to do these operations as long as they conform to the given conventions. In many cases this is effective. However, in some cases, requiring the user to conform to a given calling sequence can be too restrictive and will almost always require some kind of translation of arguments by the user in order to conform to the convention. Examples of packages that use this approach are NSPCG [16] and SLAP [26]. A thorough discussion of this approach can be found in [2].

Another approach, the one we take, is called reverse communication. This approach places the preconditioning and MV operations outside of the iterative routines. When the iterative routine needs to apply preconditioning or compute an MV operation, it sets a flag with a value indicating the type of operation needed and returns control to the calling routine. The calling routine should then perform the requested operation and then re-call the iterative routine. By doing this, the iterative routine need have no knowledge of the matrix or data structures being used. There is the added expense of the extra returns and calls, but this cost is minimal for most problems. Reverse communication can also be used to keep track of the convergence of the iterative method. However, to avoid explicitly handling the error estimate, the user can allow the iterative routine to compute its own error estimate in the "natural" (cheapest) way. Each iterative method has one or more ways of computing an error estimate that involves at most one dot product for each iteration.
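To make the control flow concrete, here is a small, self-contained C sketch of a reverse-communication loop. The job codes, the routine rich_rc (a toy fixed-sweep Richardson iteration used only as a stand-in for the package's solvers) and all names are invented for this illustration and are not the CrayPCG interface; the point is only that the caller, not the solver, performs the matrix-vector products.

    #include <stdio.h>

    /* Invented job codes and routine names; not the CrayPCG interface. */
    enum { JOB_DONE = 0, JOB_MATVEC = 1 };
    #define N 4

    typedef struct { int step, sweep; double r[N]; } rc_state;

    /* One call advances the "solver" (here, 50 Richardson sweeps
     * x <- x + (b - A x)) until it either needs y = A * (*x_req)
     * or is finished.  The caller owns the matrix.                  */
    static int rich_rc(rc_state *s, const double *b, double *x,
                       double **x_req, const double *y)
    {
        if (s->step == 1) {               /* y holds A*x from the caller */
            for (int i = 0; i < N; i++) s->r[i] = b[i] - y[i];
            for (int i = 0; i < N; i++) x[i] += s->r[i];
            s->step = 0;
            if (++s->sweep >= 50) return JOB_DONE;
        }
        *x_req = x;                       /* request y = A*x             */
        s->step = 1;
        return JOB_MATVEC;
    }

    int main(void)
    {
        /* small diagonally dominant test problem */
        const double A[N][N] = {{1.0, 0.1, 0.0, 0.0}, {0.1, 1.0, 0.1, 0.0},
                                {0.0, 0.1, 1.0, 0.1}, {0.0, 0.0, 0.1, 1.0}};
        const double b[N] = {1.0, 2.0, 3.0, 4.0};
        double x[N] = {0.0}, y[N], *p;
        rc_state s = {0, 0, {0.0}};

        while (rich_rc(&s, b, x, &p, y) == JOB_MATVEC) {
            for (int i = 0; i < N; i++) { /* the caller does the MV      */
                y[i] = 0.0;
                for (int j = 0; j < N; j++) y[i] += A[i][j] * p[j];
            }
        }
        for (int i = 0; i < N; i++) printf("x[%d] = %g\n", i, x[i]);
        return 0;
    }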
3.2 Parallel Execution
Standard conjugate gradient-type algorithms are typically sequential in nature in the sense that one step of the algorithm must be completed before another step can be executed. Thus, execution of several steps simultaneously is prohibited. Variations do exist for increasing the parallelism in iterative methods by rearranging the steps so that some can be performed simultaneously at the cost of more operations per iteration [24, 11, 15, 14]. However, we have not found this necessary. Instead we have found it more useful to exploit the parallelism in each step of the algorithm. This is true for two reasons. First, for general sparse matrices, the preconditioning and matrix-vector product steps usually consume the major portion of the execution time. Thus, it is much more important to exploit parallelism within each of these steps. Secondly, for a small number of processors, parallel execution of the vector update and dot product operations via stripmining is quite effective. This approach is not appropriate when the problem dimension is small, but for moderate and large sized problems, the speedup is significant (see Tables 6–11).
4 Implementation of the Matrix-Vector Product

One of the most important operations for any iterative method is the computation of the matrix-vector (MV) product y = Ax where A and x are given. Since A is typically large and sparse, it is almost always stored in some format which eliminates the need to store most or all of the zero terms in the matrix. If A is well-structured, for example banded, then the structure can usually be exploited to achieve good performance rates on a vector/parallel machine. However, if there is no exploitable structure to the matrix, performance can be very poor. In this case care must be taken when choosing the data structure for the MV operation.
4.1 Common Data Structures
There are several common data structures for general sparse matrices. Among them are the compressed sparse column (CSC), compressed sparse row (CSR), and ELLPACK-ITPACK (ELL)¹ formats. In the CSC format, the matrix A is represented by three arrays AMAT, ROWIND and COLSTR. Let NCOL be the column dimension of A and NZA be the number of non-zero elements in A. Then AMAT is an NZA-length scalar array containing the nonzero elements of A stored column-by-column. ROWIND is an NZA-length integer array such that each element contains the row index of the corresponding element of AMAT. COLSTR is an (NCOL+1)-length integer array such that the jth element of COLSTR points to the start of the jth column of A in AMAT for j = 1, ..., NCOL. The last element of COLSTR is defined to be COLSTR(NCOL+1) = NZA + 1. For example, let

        [ 11   0   0  14   0 ]
        [  0  22  23  24  25 ]
    A = [  0  32  33   0   0 ]                                          (1)
        [ 41  42   0  44   0 ]
        [  0  52   0   0  55 ]
¹The abbreviations CSC, CSR and ELL are consistent with those used in SPARSKIT [20].
then A would be stored in CSC format as follows.

    AMAT   = ( 11 41 22 32 42 52 23 33 14 24 44 25 55 )
    ROWIND = (  1  4  2  3  4  5  2  3  1  2  4  2  5 )
    COLSTR = (  1  3  7  9 12 14 )

If A is symmetric then only the lower triangle needs to be stored. The CSR format is analogous to CSC but is row oriented. Both formats, and variations on these formats, are used in a variety of packages.

The ELL format [16] uses a scalar array COEF and an integer array JCOEF, both of dimension NROW-by-MAXNZ, where NROW is the number of rows in A and MAXNZ is the maximum of the number of nonzero terms in each row of A. Each row of COEF contains the nonzero elements of the corresponding row of A with the diagonal element in the first column. If the number of nonzero terms in a row of A is less than MAXNZ, then the corresponding row of COEF is padded with zeroes at the end. Each element of JCOEF contains the column index of the corresponding element of COEF. Thus, for the matrix A above, the ELL representation would be

           [ 11  14   0   0 ]            [ 1  4  1  1 ]
           [ 22  23  24  25 ]            [ 2  3  4  5 ]
    COEF = [ 33  32   0   0 ] ,  JCOEF = [ 3  2  4  3 ]
           [ 44  41  42   0 ]            [ 4  1  2  4 ]
           [ 55  52   0   0 ]            [ 5  2  5  5 ]

The CSC and CSR formats are very natural and quite efficient in terms of storage. However, MV operations using these formats do not perform very well on vector machines because the vector lengths are determined by the number of nonzero terms in each column (CSC) or row (CSR). Thus, even for very large problems, vector lengths are typically very small and performance is poor. Typical performance rates for these MV kernels are 5–30 MFLOPS on a single processor of a Cray Y-MP.

On one processor, the performance rates for the CSC format are typically about 10–20% better than rates for the CSR format. This is because the Y-MP architecture does not have hardware support for short vector reduction operations. Thus, the typically short dot products needed with the CSR format do not perform well. However, on multiple processors, the opposite is true because the CSR format has a natural parallel structure which allows very good speed-up for the MV kernel even for small problems. This natural parallelism is a result of the fact that row operations can be computed independently. The same is not true for the CSC format. In either case, however, the performance is not good and some other format should be used.

MV operations using the ELL format eliminate the problem of short vectors since vectors are of length NROW. Thus, performance rates of well over 100 MFLOPS can be achieved on one processor. Also, a natural row partition of A exists, allowing good parallel speed-up for large problems.
However, if the number of nonzero terms in each row of A is not fairly uniform, then a large amount of zero-fill can occur and actual performance can be severely degraded.

Other data structures have been shown to work well for general sparsity patterns. One of them is the stripe structure [13] which is similar to the ELL format. Another is the sparse diagonal format [6] which is particularly suited to symmetric matrices. A third is the ITPLUS format [7] which is a combination of the ELL format and the CSR format.
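For reference, the two basic kernels just discussed can be written as follows. This is a minimal C sketch using 0-based indices (the text above uses 1-based Fortran conventions); it is not the CrayPCG code, but it shows why CSC leads to indexed vector updates and CSR to short dot products.

    /* CSC: one sparse AXPY per column; vector length = nonzeros in the
     * column, and y is updated through indirect addressing.             */
    void csc_matvec(int nrow, int ncol, const double *amat,
                    const int *rowind, const int *colstr,
                    const double *x, double *y)
    {
        for (int i = 0; i < nrow; i++) y[i] = 0.0;
        for (int j = 0; j < ncol; j++)
            for (int k = colstr[j]; k < colstr[j + 1]; k++)
                y[rowind[k]] += amat[k] * x[j];
    }

    /* CSR: one short dot product per row; rows are independent, which is
     * what gives the format its natural parallelism.                     */
    void csr_matvec(int nrow, const double *amat, const int *colind,
                    const int *rowstr, const double *x, double *y)
    {
        for (int i = 0; i < nrow; i++) {
            double s = 0.0;
            for (int k = rowstr[i]; k < rowstr[i + 1]; k++)
                s += amat[k] * x[colind[k]];
            y[i] = s;
        }
    }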
4.2 CrayPCG MV Data Structure
To eliminate both the problem of short vectors and zero-fill, we have used a variation of the jagged diagonal (JAG) format [1, 18] for the MV operations in CrayPCG. With this format the matrix A is replaced by the matrix Ā = P A P^T, where P is a permutation matrix chosen so that the rows of Ā are ordered by decreasing number of nonzero terms per row. The matrix Ā is then represented by a scalar array AMAT and the integer array COLIND, both of length NZA, and the integer array COLSTR of length MAXNZ + 1, where NZA and MAXNZ are as before. AMAT contains the nonzero elements of Ā stored column-wise. The first column of AMAT contains the NROW diagonal elements of Ā. The next column of AMAT contains the first non-diagonal, nonzero term in each row of Ā, and this continues through the MAXNZ columns. The array COLSTR points to the start of each column in AMAT and COLSTR(MAXNZ+1) = NZA + 1. Each element of COLIND contains the column index of the corresponding element of AMAT, except for the first NROW elements, which contain the number of nonzero terms in each row. For the matrix A in Equation 1, an appropriate matrix P would be

        [ 0  1  0  0  0 ]
        [ 0  0  0  1  0 ]
    P = [ 0  0  1  0  0 ] .
        [ 1  0  0  0  0 ]
        [ 0  0  0  0  1 ]

Then Ā would be

                  [ 22  24  23   0  25 ]
                  [ 42  44   0  41   0 ]
    Ā = P A P^T = [ 32   0  33   0   0 ] ,
                  [  0  14   0  11   0 ]
                  [ 52   0   0   0  55 ]

and AMAT, COLIND and COLSTR would be

    AMAT   = ( 22 44 33 11 55 24 42 32 14 52 23 41 25 )
    COLIND = (  4  3  2  2  2  2  1  1  2  1  3  4  5 )
    COLSTR = (  1  6 11 13 14 )

Using this data structure we are able to eliminate short vectors since the vector lengths increase as the problem size increases. We are also able to eliminate zero-fill. The cost of finding and applying P can be high but, because it is done once, the cost is usually recovered by the improved performance of the MV operations. There is a version of the JAG format for symmetric matrices using only the lower triangle of the matrix. However, in the following discussion we limit the presentation to the case where all the nonzero terms are stored.
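A minimal C sketch of the product y = Ā x in this layout is given below, using 0-based indexing (the arrays above are 1-based). The routine name and argument list are illustrative only; it assumes, as in the description above, that the first jagged column holds the diagonal and that the first NROW entries of COLIND hold row counts rather than column indices.

    /* y = Abar * x for the jagged-diagonal layout described above.
     * njag = MAXNZ; colstr has njag+1 entries with colstr[0] = 0, so the
     * diagonal column occupies amat[0..nrow-1].                          */
    void jag_matvec(int nrow, int njag, const double *amat,
                    const int *colind, const int *colstr,
                    const double *x, double *y)
    {
        for (int i = 0; i < nrow; i++)       /* diagonal column          */
            y[i] = amat[i] * x[i];
        for (int c = 1; c < njag; c++) {     /* remaining jagged columns */
            int start = colstr[c], len = colstr[c + 1] - colstr[c];
            for (int i = 0; i < len; i++)    /* long, stride-1 vector    */
                y[i] += amat[start + i] * x[colind[start + i]];
        }
    }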
4.3 Parallel MV Operations
Let A be an n × n matrix and assume for simplicity that P = I so that Ā = A. Then parallel processing of the MV operation is achieved by determining a partition of A consisting of q horizontal segments such that

        [ A_1 ]
        [ A_2 ]
    A = [  .  ]                                                         (2)
        [  .  ]
        [ A_q ]

where A_i is an n_i × n rectangular matrix, i = 1, ..., q, and n = n_1 + n_2 + ... + n_q. These segments of A can be applied simultaneously, thus allowing parallel execution. Usually q ≥ p, where p is the number of processors. Below we present the algorithms which are based upon this partition and then discuss how to determine an appropriate partition.

The MV routines in CrayPCG are based upon a kernel routine which performs the following operations using a horizontal section of the matrix A, where A is stored in jagged diagonal format:

    z_i = y_i + α A_i x,        z = y + α A_i^T x_i.

Here α is a scalar, x_i, y_i and z_i are the n_i-length sections of n-length vectors corresponding to A_i, and x, y and z are n-length vectors. Assuming that we have a partition of A, we can compute z = y + αAx in parallel using the following simple algorithm.
Parallel Matrix-vector Product Algorithm:

    do parallel i = 1, ..., q
        z_i = y_i + α A_i x
    enddo
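As an illustration, the following C sketch computes the segmented product z = y + αAx with a parallel loop over the segments. For brevity the segments are taken from a CSR matrix rather than the jagged-diagonal segments used in the package, rbeg[0..q] is an assumed array of segment boundaries, and OpenMP's dynamic schedule plays the role of the self-scheduling discussed next.

    /* z = y + alpha*A*x computed segment by segment (CSR used here for
     * brevity; the package applies jagged-diagonal segments).  Each
     * segment i covers rows rbeg[i]..rbeg[i+1]-1.                       */
    void par_matvec(int q, const int *rbeg,
                    const double *amat, const int *colind, const int *rowstr,
                    double alpha, const double *x, const double *y, double *z)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < q; i++) {
            for (int r = rbeg[i]; r < rbeg[i + 1]; r++) {
                double s = 0.0;
                for (int k = rowstr[r]; k < rowstr[r + 1]; k++)
                    s += amat[k] * x[colind[k]];
                z[r] = y[r] + alpha * s;
            }
        }
    }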
With this algorithm, the scheduling of processors is automatic since each iteration will be passed to a processor as soon as a processor becomes available. This self-scheduling is necessary for good load balancing since the work across segments is usually not uniform. Using this algorithm we achieve good parallel performance on large problems.

To compute the transpose operation in parallel we use the same partition of A, but now the segments are vertical and therefore each processor computes a partial sum which must be added to the final result. Thus, we need p work vectors w_i, i = 1, ..., p, of length n, where p is the maximum number of physical processors available. We also need a p-length integer vector f to keep track of available work vectors: f(i) is 1 if the ith work vector is available, otherwise f(i) is 0. Then the algorithm is as follows.
Parallel Matrix-transpose-vector Product Algorithm:

    z = y
    f(j) = 1, j = 1, ..., p
    do parallel i = 1, ..., q
        guard 1
            j = location of the first 1 in f
            f(j) = 0
        end guard 1
        w_j = α A_i^T x_i
        guard
            z = z + w_j
        end guard
        guard 1
            f(j) = 1
        end guard 1
    enddo

The guard/end guard statements in this algorithm indicate code regions which cannot be executed by more than one processor simultaneously. The two guarded regions labeled with 1 are treated as a single guarded region. Using this algorithm, good speed-up can be obtained, but there is some extra cost introduced by the full vector updates of z and by the potential access conflicts due to the guarded regions. One should note that this algorithm does allow self-scheduling which, as mentioned above, is important since the amount of work at each iteration can vary substantially. Also, it is possible to use z in place of w_p and thereby reduce the number of work vectors by one. However, this causes unnecessary synchronization when adding up the partial sums and can cause significant degradation in performance.

In order to compute the MV operation efficiently, we need to determine a partition of the matrix which balances the load across the available processors but is not too expensive to compute. Due to the irregular structure it is not feasible, and sometimes not possible, to divide the work evenly among all the processors. Instead some kind of heuristic is needed. One approach would be to divide the rows up evenly between processors, but this tends to make the work in each segment decrease as we go down the matrix since the rows are sorted by decreasing number of nonzero elements. We have found that a better method is to divide the matrix so that each horizontal segment has approximately the same number of nonzero elements, that is, approximately NZA/p elements. However, to handle special cases and improve load balancing, we subject this division to the condition that the number of rows in each segment be between two empirically determined integers nl and nu. By making the number of rows in a segment greater than nl, we attempt to maintain good vector performance. By keeping a cap on the number of rows in a segment, we find that the load balancing for large problems is much better, especially in a non-dedicated environment where it is not certain how many processors may be available at any given time. A sketch of this heuristic is given just before Table 1.

Table 1 gives timing information for a few sample matrices from the Harwell-Boeing test set. The format abbreviations are those defined in Subsection 4.1, except for JAG*, which is an assembly language version of JAG. All other routines were written in Fortran. A * indicates matrices for which the ELL representation could not fit into core memory because of too much zero-fill. All results were obtained on a Cray Y-MP8/832 (8 processors, 32 Mw memory, S/N 1001, 6.4 ns clock) in dedicated mode.
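The following C sketch shows one way such a nonzero-balanced partition could be computed; the routine name, the bound parameters nl and nu, and the exact tie-breaking rules are illustrative assumptions, not the CrayPCG implementation.

    /* Partition nrow rows into segments holding roughly nza/p nonzeros
     * each, subject to nl <= rows-per-segment <= nu.  rowcnt[r] is the
     * number of nonzeros in row r; rbeg must have room for nrow+1
     * entries.  Returns the number of segments q and fills rbeg[0..q]. */
    int partition_rows(int nrow, long nza, int p, int nl, int nu,
                       const int *rowcnt, int *rbeg)
    {
        long target = nza / p, acc = 0;
        int q = 0, rows = 0;
        rbeg[0] = 0;
        for (int r = 0; r < nrow; r++) {
            acc += rowcnt[r];
            rows++;
            int full = (acc >= target && rows >= nl) || rows >= nu;
            if (full && r < nrow - 1) {      /* close the current segment */
                rbeg[++q] = r + 1;
                acc = 0;
                rows = 0;
            }
        }
        rbeg[++q] = nrow;                    /* last segment              */
        return q;
    }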
                       No-Trans                     Transpose
    Format    1 CPU    8 CPU  Speedup     1 CPU    8 CPU  Speedup

    1138-BUS
    CSR         3.4     17.5      5.1       3.8     12.9      3.4
    CSC         3.8     12.9      3.4       3.4     17.5      5.1
    ELL        24.4     48.1      2.0       4.7     15.3      3.3
    JAG        69.6    138.7      2.0       5.9     17.6      3.0
    JAG*       91.4    272.6      3.0      13.7     40.9      3.0

    AUDI
    CSR        25.8    192.2      7.4      42.9    269.0      6.3
    CSC        42.9    269.0      6.3      25.8    192.2      7.4
    ELL           *        *        *         *        *        *
    JAG        73.4    535.9      7.3      41.8    138.2      3.3
    JAG*       83.7    604.0      7.2      55.5    148.7      2.7

    BCSSTK30
    CSR        31.9    241.2      7.6      49.5    339.1      6.9
    CSC        49.5    339.1      6.9      31.9    241.2      7.6
    ELL        34.7    241.1      6.9      12.1     79.3      6.6
    JAG        80.4    556.1      6.9      43.5    230.7      5.3
    JAG*       94.0    631.0      6.7      59.0    258.0      4.4

    ORSREG-1
    CSR         6.2     42.7      6.9       7.2     40.5      5.6
    CSC         7.2     40.5      5.6       6.2     42.7      6.9
    ELL       127.8    585.7      4.6      17.0     68.5      4.0
    JAG       121.1    515.5      4.3      10.8     53.3      4.9
    JAG*      128.4    572.8      4.5      24.4     78.3      3.2

    SHERMAN3
    CSR         3.8     26.4      6.9       4.3     24.5      5.7
    CSC         4.3     24.5      5.7       3.8     26.4      6.9
    ELL        73.9    428.2      5.8       9.2     39.3      4.3
    JAG       119.5    567.3      4.7      10.7     38.0      3.6
    JAG*      126.4    585.0      4.6      23.8     54.8      2.3

Table 1: MFLOPS for Matrix-vector routines
In all cases in Table 1, even if the matrix is symmetric, all non-zero elements are stored. Usually, if the matrix is symmetric, storage space can be saved by storing only the lower (or upper) triangle of the matrix. However, we do not recommend this for two reasons. One, the MV performance is much poorer for the symmetric storage and, two, because only the lower (upper) triangle of the matrix is stored, there is an implied transpose operation in order to obtain the action of the upper (lower) triangle of the matrix. Thus, we need to apply the parallel matrix-transpose-vector algorithm of Subsection 4.3, which requires p extra work vectors. Often this extra workspace is more than the extra space used by storing the non-zero terms of the upper (lower) triangle of the matrix.
5 Implementation of Preconditioners

All of the iterative methods provided (except CGN) attempt to solve the system Ax = b by working with the preconditioned system (M_L^{-1} A M_R^{-1})(M_R x) = M_L^{-1} b, where M_L^{-1} and M_R^{-1} are the left and right preconditioning matrices, respectively. CGN replaces the system Ax = b with A A^T y = b and then applies the preconditioned conjugate gradient method to this system. The solution to the original system is then x = A^T y. The preconditioning matrices M_L^{-1} and M_R^{-1} are determined by the scaling and preconditioning options selected. Let M_L = S_L P_L and M_R = P_R S_R. Then the scaling option selected determines S_L and S_R and the preconditioning option determines P_L and P_R. The definitions of S_L, S_R, P_L and P_R are given below.
5.1 Scaling Methods
All of the scaling methods explicitly apply diagonal matrices to the linear system. Let sgn be the signum function, n be the dimension of A and I be the identity. Then below we define the diagonal terms of these matrices in terms of the original matrix A.

    Symmetric diagonal scaling:    (s_L)_ii = (s_R)_ii = sgn(a_ii) sqrt(|a_ii|).
    Symmetric row-sum scaling:     (s_L)_ii = (s_R)_ii = sqrt( sum_{j=1..n} |a_ij| ).
    Symmetric column-sum scaling:  (s_L)_jj = (s_R)_jj = sqrt( sum_{i=1..n} |a_ij| ).
    Left diagonal scaling:         (s_L)_ii = a_ii,  S_R = I.
    Left row-sum scaling:          (s_L)_ii = sum_{j=1..n} |a_ij|,  S_R = I.
    Right column-sum scaling:      S_L = I,  (s_R)_jj = sum_{i=1..n} |a_ij|.

Let B = S_L^{-1} A S_R^{-1} denote the result of applying one of the above scalings. Then for the symmetric scalings, B is symmetric if A is symmetric. Also, if either diagonal scaling is applied, then the diagonal terms of B are all 1. If left row-sum scaling is applied, then the sup-norm of B is 1. Similarly, if right column-sum scaling is applied, then the 1-norm of B will be 1.
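As an illustration, the symmetric diagonal scaling can be applied to a CSR matrix as in the C sketch below. This is only a sketch: the routine name and storage format are assumptions (CrayPCG applies scaling through its own data structures), diag[] is assumed to hold the diagonal of A, and after solving the scaled system the unscaled solution is recovered as x = S_R^{-1} x̃.

    #include <math.h>

    /* In-place symmetric diagonal scaling B = S_L^{-1} A S_R^{-1} with
     * (s)_ii = sgn(a_ii)*sqrt(|a_ii|); b is scaled to S_L^{-1} b.       */
    static double sgn(double v) { return (v < 0.0) ? -1.0 : 1.0; }

    void sym_diag_scale(int nrow, double *amat, const int *colind,
                        const int *rowstr, const double *diag, double *b)
    {
        for (int i = 0; i < nrow; i++) {
            double si = sgn(diag[i]) * sqrt(fabs(diag[i]));
            b[i] /= si;
            for (int k = rowstr[i]; k < rowstr[i + 1]; k++) {
                int j = colind[k];
                double sj = sgn(diag[j]) * sqrt(fabs(diag[j]));
                amat[k] /= (si * sj);
            }
        }
    }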
5.2 Preconditioning Methods
All of the following preconditioners are available for either left or right preconditioning. Two-sided preconditioning is presently not available. Below we define the preconditioning matrices for left preconditioning; in all cases P_R = I. The definitions for right preconditioning are analogous.

    Implicit diagonal scaling:       P_L = diag(a_11, a_22, ..., a_nn).
    Incomplete Cholesky (IC):        P_L = L(k) L(k)^T.
    Incomplete LU (ILU):             P_L = L(k) U(k).
    Neumann polynomial:              P_L^{-1} = sum_{i=0..l} (I - A)^i.
    Least-squares polynomial:        P_L = (q_l(A))^{-1}.

Note that implicit diagonal scaling is functionally equivalent to the explicit diagonal scaling techniques described above, except when the iterative method is CGN, in which case explicit diagonal scaling scales the matrix A by the inverse of the diagonal of A while implicit diagonal scaling scales the matrix A A^T by the inverse of the diagonal of A A^T.

The parameter k for IC and ILU preconditioning indicates the level of fill allowed in the incomplete factors, which can be defined recursively. Level fill 0 means that the only nonzero terms allowed in the factors are terms that correspond to nonzero terms in A. Thus, the nonzero pattern of L(0) for IC and of L(0) and U(0) for ILU is the same as that of A. Level fill k for k > 0 means that the only nonzero terms allowed in the factors are terms from the previous levels and terms created by terms from the previous levels (see [1]). If CGN is used then the incomplete factors are the same, i.e., L(k)U(k) approximates A, but the definition of the ILU preconditioner corresponds to A A^T instead of A.

The parameter l for the polynomial preconditioners indicates the degree of the polynomial. The polynomial q_l is the polynomial in A of degree less than or equal to l that best approximates the inverse of A in a least-squares sense. The implementation is based on a paper by Saad [21]. For the polynomial preconditioners to be effective, some kind of explicit scaling should be applied. Also, these preconditioners are usually effective only for symmetric, positive definite matrices. If CGN is the chosen iterative method, then the above polynomials are in A A^T instead of A.
5.3 Implementation of the Polynomial Preconditioners
Implementation of the polynomial preconditioners is really a repeated application of the MV routine with the addition of a vector update that can also be executed in parallel. Thus the performance characteristics of the polynomial preconditioners are closely tied to those of the MV operation. The partitioning of the matrix used for parallel processing is the same as that used by the MV operation.

In the case of the least-squares polynomial preconditioner, we have a setup phase in which we must compute the coefficients of the polynomial. For this preconditioner, we assume that the matrix A is symmetric positive definite. Thus its spectrum lies in an interval I = [α, β] on the positive real axis. To compute the polynomial coefficients, we need to provide an estimate of I and compute the coefficients based on this estimate. A simple estimate for the left endpoint of I is α = 0. A reasonable estimate for the right endpoint, which is very cheap to compute, is provided by Gershgorin's Circle Theorem, as shown in [21]. Using these estimates and a table of precomputed coefficients which depend on the degree l, we compute the coefficients cheaply and store them for later use.
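For example, applying the degree-l Neumann preconditioner defined in Subsection 5.2 amounts to the recurrence w ← v + (I − A)w repeated l times, i.e. l MV products plus vector updates. A minimal C sketch follows; csr_matvec is the CSR product sketched in Subsection 4.1, t is a caller-supplied scratch vector, and the names are illustrative.

    #include <string.h>

    void csr_matvec(int nrow, const double *amat, const int *colind,
                    const int *rowstr, const double *x, double *y);

    /* w = sum_{i=0..l} (I - A)^i v via the recurrence w <- v + (I - A)w.
     * Each pass costs one sparse MV plus a vector update, which is why
     * the preconditioner's performance tracks that of the MV kernel.    */
    void neumann_apply(int n, int l, const double *amat, const int *colind,
                       const int *rowstr, const double *v, double *w, double *t)
    {
        memcpy(w, v, (size_t)n * sizeof(double));
        for (int k = 0; k < l; k++) {
            csr_matvec(n, amat, colind, rowstr, w, t);   /* t = A*w        */
            for (int i = 0; i < n; i++)
                w[i] = v[i] + w[i] - t[i];               /* w = v + (I-A)w */
        }
    }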
5.4 Implementation of the IC/ILU Preconditioners
The vector performance of the IC and ILU preconditioners for general sparse matrices is typically not very good. This is because the basic operation is either an indexed vector update or an indexed dot product on a short vector. Some performance gains can be obtained on multiple processors since the factors are sparse and many columns and/or rows can be processed independently. However, the main reason for using incomplete factorizations is that they are quite robust, often solving problems which cannot be solved by scaling or polynomial preconditioners. This is especially true of ILU factorization for non-symmetric problems.

We have implemented two versions of the IC and ILU preconditioners: one for single-processor execution and the other for parallel execution. In either version the computation of the factors is a two-step process. First we compute the symbolic factorization, which is the non-zero structure of the factors. Given the symbolic factorization, the numerical factorization is computed similarly to the standard Cholesky or LU factorization except that the only values computed are those which correspond to the non-zero structure of the symbolic factorization.

In the single-processor version of IC preconditioning, we compute the factor L column-by-column and store it in compressed sparse column (CSC) format as described in Subsection 4.1. Once L is computed, we obtain L^T in CSC format by simply making a copy of L and storing it in compressed sparse row (CSR) format. Storing L^T is expensive in terms of storage, but it is done to enhance the performance of the backsolve. By having a copy of both L and L^T we can perform the forward and back solves using indexed vector updates, which offer the best performance on a single processor. The single-processor version of ILU preconditioning is similar to IC. We compute the factorization using the column-oriented JKI version of LU factorization [17]. L and U are both stored in CSC format for good performance on a single processor.

In the parallel version of IC preconditioning, we compute the symbolic factorization in the same way as we do in the single-processor version and we still compute the numerical factorization in a column-wise fashion. Only now we use level scheduling to exploit the sparse nature of L and determine columns of L which are independent of each other and can be processed in parallel [1, 9]. For example, let the structure of L be as follows, where x indicates a non-zero value off the diagonal and the values on the diagonal indicate the level to which each column belongs:

    str(L) = [ 1  0  0  0  0  0  0  0 ]
             [ x  2  0  0  0  0  0  0 ]
             [ 0  x  3  0  0  0  0  0 ]
             [ 0  x  0  3  0  0  0  0 ]
             [ 0  0  x  x  4  0  0  0 ]
             [ x  0  0  0  0  2  0  0 ]
             [ x  x  0  0  0  0  3  0 ]
             [ 0  x  x  0  0  0  0  4 ]

For this L, the numerical factorization of column 1 is performed first. After column 1 is done, columns 2 and 6 can be computed in parallel since they are dependent only on column 1. Then columns 3, 4 and 7 can be computed in parallel and finally columns 5 and 8. Columns that can be processed in parallel are said to belong to the same level. In practice, determining the level scheduling is inexpensive and the speed-up can be substantial, especially for level 0 fill. As the fill increases, however, the independence of the columns decreases, leading to less parallelism. Tables 2 and 3 show timing results using level scheduling.
    Problem      k  nlevl     tfac1    tfac8  speedup    tsol1    tsol8  speedup
    1138-BUS     0     21    0.0148   0.0023     6.43   0.0020   0.0010     2.00
    1138-BUS     1     31    0.0159   0.0027     5.89   0.0021   0.0011     1.91
    1138-BUS     2     38    0.0168   0.0030     5.60   0.0021   0.0012     1.75
    AUDI         0    589   25.1596   3.3568     7.50   0.1442   0.0633     2.28
    AUDI         1    703   25.4232   3.4464     7.38   0.1467   0.0545     2.69
    AUDI         2    731   25.9542   3.4996     7.42   0.1520   0.0545     2.79
    BCSSTK30     0   5513    7.7167   2.0143     3.83   0.0827   0.0827     1.00
    BCSSTK30     1   6043    7.9569   2.1845     3.64   0.0857   0.0886     0.97
    BCSSTK30     2   6243    8.1137   2.2616     3.59   0.0874   0.0903     0.97

Table 2: Level schedule statistics in seconds for IC(k)
To obtain parallel execution of the solve phase of IC preconditioning we again want to use level scheduling. However, now we need to know which rows of L and L^T can be processed in parallel. Thus, we want to have L and L^T in CSR format. But we already have this, since L^T in CSC format, which is what we computed above, can be interpreted as L in CSR format and, similarly, L in CSC format can be interpreted as L^T in CSR format. Since level scheduling for the solve phase is inexpensive, one would expect the speed-up to be substantial. However, this is not the case since we have switched from the CSC format to the CSR format, thereby trading vector performance for parallel performance. Thus, good speed-up over single-processor performance cannot be expected. In fact, in most cases 8-processor performance is less than two times better than single-processor performance.

Parallel execution of ILU preconditioning is analogous to that of IC except that, after the factorization step, L and U must be explicitly converted from CSC format to CSR format, adding to the overhead costs. In fact, the added cost of converting to CSR format along with the cost of computing the level scheduling for the solve phase makes the overall setup time for ILU approximately the same for eight processors as for one. It is, however, still worth doing since some gain will be seen in the solve phase (see Tables 8 and 11 in Section 6).

It is possible to use other data structures to store the triangular system for the solve phase. In particular, the jagged diagonal scheme has been used with varying degrees of success [1, 18]. In this approach, the off-diagonal elements in a level (a level is a set of rows which can be processed simultaneously) are stored in JAG format. Then solving the triangular system can be thought of as a sequence of MV operations where the row dimension of each matrix is the number of rows in the corresponding level (see [1]). However, we have not found this approach useful for several reasons. First, converting from the CSR format to JAG format adds to the already high setup cost. This is particularly bad for IC/ILU preconditioning since, when these preconditioners work, typically few iterations (fewer than 100) are needed to reach convergence. Thus, it is difficult to recover any extra overhead introduced in the setup phase. The second reason is that, except for specially structured problems, the number of rows per level is too small to get substantial vector performance.
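A minimal C sketch of the level-scheduled forward solve Ly = b is given below. It assumes the level sets have already been computed (levptr/levrow list the rows of each level), that the strictly lower triangular part of L is stored by rows in lval/lcol/lptr with the diagonal kept separately in ldiag, and it uses OpenMP in place of Cray autotasking; all names are illustrative.

    /* Level-scheduled forward solve L*y = b: levels are processed in
     * order, and the rows within one level depend only on earlier
     * levels, so they can be solved in parallel.                        */
    void lsolve_levels(int nlev, const int *levptr, const int *levrow,
                       const double *lval, const int *lcol, const int *lptr,
                       const double *ldiag, const double *b, double *y)
    {
        for (int lev = 0; lev < nlev; lev++) {
            #pragma omp parallel for schedule(static)
            for (int k = levptr[lev]; k < levptr[lev + 1]; k++) {
                int r = levrow[k];
                double s = b[r];
                for (int p = lptr[r]; p < lptr[r + 1]; p++)
                    s -= lval[p] * y[lcol[p]];   /* strictly lower part   */
                y[r] = s / ldiag[r];
            }
        }
    }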
    Problem      k  nlevl  nlevu    tfac1    tfac8  speedup    tsol1    tsol8  speedup
    ORSREG-1     0     45     45   0.0434   0.0066     6.58   0.0038   0.0021     1.81
    ORSREG-1     1     73     73   0.0483   0.0078     6.19   0.0039   0.0023     1.70
    ORSREG-1     2    109    109   0.0562   0.0098     5.73   0.0040   0.0027     1.48
    SHERMAN1     0     28     28   0.0104   0.0018     5.78   0.0015   0.0009     1.67
    SHERMAN1     1     54     54   0.0112   0.0023     4.87   0.0015   0.0011     1.36
    SHERMAN1     2     96     96   0.0121   0.0040     3.03   0.0015   0.0015     1.00
    SHERMAN3     0     50     50   0.1851   0.0461     4.02   0.0072   0.0038     1.89
    SHERMAN3     1     82     82   0.1936   0.0267     7.25   0.0074   0.0040     1.85
    SHERMAN3     2    139    139   0.2008   0.0304     6.61   0.0075   0.0045     1.67

Table 3: Level schedule statistics in seconds for ILU(k)
6 Computational Results

In this section we present computational results using CrayPCG and, when appropriate, our multifrontal direct solver. The tests were run on a Cray Y-MP8/832 (8 processors, 32 Mw memory, S/N 1001, 6.4 ns clock) in dedicated mode using sample problems from the Harwell-Boeing test set. Tables 4 and 5 describe the test problems we use, giving the condition numbers of the original and diagonally scaled matrices. For those matrices which are symmetric positive definite (SPD), we show results for both the PCG and multifrontal codes. However, we are somewhat limited in the test problems we can use to compare the iterative and direct methods fairly, because most of the Harwell-Boeing problems are pattern-only and do not specify values. This is a problem since the rate of convergence of PCG is value-dependent. We have included results for some pattern-only matrices using artificial values, namely AUDI, BCSSTK30, and CUBE12, only because they are some of the larger problems in the test set.

Tables 6, 7 and 8 give results for the SPD matrices. We compare four different methods:

    PCG/Diag:      Conjugate Gradient (CG) with explicit, symmetric diagonal scaling.
    PCG/IC(k):     CG with explicit, symmetric diagonal scaling and level-k fill Incomplete Cholesky factorization.
    PCG/LSP(k):    CG with explicit, symmetric diagonal scaling and degree-k Least-squares Polynomial preconditioning.
    Multifrontal:  Multifrontal method for sparse Cholesky factorization.

Table 6 gives the solution time in seconds. Table 7 gives the number of iterations for the PCG runs, the absolute error of the computed solution against the exact solution, and the speedup obtained on eight processors compared to one. As a stopping criterion for the PCG runs, we required that the pseudo-residual error be less than 10^-10. A dagger (†) indicates a test which did not converge in n iterations, where n is the problem dimension. A double dagger (‡) indicates a test for which the IC factorization could not be computed because, in the process of the factorization, a diagonal element became too small, causing a floating-point divide error. Table 8 breaks down the solution time for two of the sample runs as follows:
    Symmetric Positive Definite Problems
    1138-BUS   Admittance Matrix, 1138 Bus Power System, D. J. Tylavsky, July 1985.
    AUDI       C-3 Gesamtstruktur.
    BCSSTK16   Stiffness Matrix - Corps of Engineers Dam.
    BCSSTK17   Symmetric Stiffness Matrix - Elevated Pressure Vessel.
    BCSSTK25   Symmetric Stiffness Matrix - 76 Story Skyscraper.
    BCSSTK28   Solid Element Model (MSC NASTRAN).
    BCSSTK30   Stiffness Matrix for Off-shore Generator Platform (MSC NASTRAN).
    BCSSTM12   Symmetric Mass Matrix, Ore Car (Consistent Masses).
    CUBE12     Symmetric Pattern, 27 Point Operator on 12 by 12 by 12 Grid.
    NOS3       Symmetric Matrix, FE Approximation to Biharmonic Operator on Plate.

    Non-symmetric Problems
    JPWH-991   Unsymmetric Matrix from Philips LTD, J. P. Whelan, 1978.
    ORSIRR-1   Oil Reservoir Simulation Matrix for 21x21x5 Irregular Grid.
    ORSIRR-2   Oil Reservoir Simulation Matrix for 21x21x5 Irregular Grid.
    ORSREG-1   Oil Reservoir Simulation Matrix for 21x21x5 Full Grid.
    SHERMAN1   Black Oil Simulator, Shale Barriers, 10 by 10 by 10 Grid, One Unknown.
    SHERMAN2   Thermal Simulation, Steam Injection, 6 by 6 by 5 Grid, Five Unknowns.
    SHERMAN3   Black Oil, IMPES Simulation, 35 by 11 by 13 Grid, One Unknown.
    SHERMAN4   Black Oil, IMPES Simulation, 16 by 23 by 3 Grid, One Unknown.
    SHERMAN5   Fully Implicit Black Oil Simulator, 16 by 23 by 3 Grid, Three Unknowns.

Table 4: Description of Test Matrices
    Problem     Dimension  # Non-zeros      κ(A)     κ(D^{-1/2} A D^{-1/2})
    1138-BUS         1138         2596   1.22E+07        8.87E+05
    AUDI            58126      1437027   7.76E+00        7.51E+00
    BCSSTK16         4884       147631   7.04E+09        7.01E+02
    BCSSTK17        10974       219812   1.95E+10        2.84E+06
    BCSSTK25        15439       133840   3.48E+12        6.42E+07
    BCSSTK28         4410       111717   2.61E+09        2.03E+07
    BCSSTK30        28924      1036208   3.25E+01        1.01E+01
    BCSSTM12         1473        10566   4.68E+05        2.30E+03
    CUBE12           8640       495620   2.12E+01        9.23E+00
    NOS3              960         8402   4.27E+04        4.14E+04
    JPWH-991          991         6027   2.41E+02        2.05E+02
    ORSIRR-1         1030         6858   9.94E+06        6.90E+03
    ORSIRR-2          886         5970   6.77E+04        6.51E+03
    ORSREG-1         2205        14133   7.37E+03        7.14E+03
    SHERMAN1         1000         3750   4.64E+03        1.89E+03
    SHERMAN2         1080        23094   6.25E+11        1.25E+07
    SHERMAN3         5005        20033   6.90E+16        2.34E+04
    SHERMAN4         1104         3786   2.32E+03        1.75E+03
    SHERMAN5         3312        20793   4.21E+03        1.02E+03

Table 5: Statistics for Test Matrices
    NCPU:      Number of processors.
    Sym Fac:   Symbolic factorization time.
    Num Fac:   Numerical factorization time.
    Tot Fac:   Total factorization time.
    Prec:      Time spent applying the preconditioner.
    MV Setup:  Time spent converting from CSC to JAG format and computing the partition for the parallel MV.
    MV:        Time spent computing the matrix-vector product.
    Iter:      Time spent in the reverse communication iterative routine.
    Tot Time:  Total time used to solve the problem.

Tables 9, 10 and 11 give results for the non-symmetric matrices. All matrices were scaled explicitly with left diagonal scaling and then, where indicated, ILU(k) preconditioning was used. We did try polynomial preconditioning for all non-symmetric tests, but it worked for only one problem, so we do not report those results. GMR and OMN were used with 10 and 40 truncation vectors, as indicated. Table 11 breaks down the solution time in the same way as Table 8. As before, the stopping criterion required that the pseudo-residual error be less than 10^-10. Also, a dagger (†) indicates a test which did not converge in n iterations, where n is the problem dimension.
7 Conclusions

In this paper we have described an implementation of a set of preconditioned conjugate gradient methods. We have shown that, by using reverse communication, we can place the steps of the solution process which depend on the data structure outside of the iterative routine, giving greater flexibility in software development. We have discussed a variety of data structures for sparse matrices and shown that, for the matrix-vector product operation, significant vector and parallel performance can be obtained even for unstructured problems if the proper data structures are used.

On the negative side, we have shown that good performance for parallel implementations of IC and ILU preconditioners is difficult to obtain for unstructured sparse matrices. As is shown in Table 7, the eight-processor timings are typically only about two times faster than the one-processor timings. In part, this is because the data structures needed to compute the factors and determine the level scheduling do not work well when solving the triangular systems. More suitable data structures are available but, when IC and ILU preconditioning work, the number of iterations is typically too small to recover the conversion cost. Thus, it must be concluded that parallel implementations of IC and ILU as presented here are of marginal value.

Despite its poor parallel performance, ILU preconditioning is still very valuable since it provides robustness. However, for the set of problems studied here, IC preconditioning does not solve any problems that the other preconditioners cannot solve. In general this is not always true, but as the condition number estimates of Table 5 show, diagonal scaling alone can significantly reduce the condition number of problems which initially are poorly conditioned.
                  PCG/Diag           PCG/IC(0)          PCG/LSP(10)        Multifrontal
    Problem     1 CPU   8 CPU      1 CPU   8 CPU      1 CPU   8 CPU      1 CPU   8 CPU
    1138-BUS    0.186   0.083      0.384   0.195      0.188   0.076      0.070   0.071
    AUDI        2.340   1.348     29.503  14.212      8.588   2.210     33.470  18.570
    BCSSTK16    1.592   0.310      1.576   0.861      2.578   0.447      2.308   1.360
    BCSSTK17   26.875   4.988          ‡       ‡     46.300   8.046      3.850   2.788
    BCSSTK25   62.479  11.611          ‡       ‡     80.962  13.116      5.975   4.560
    BCSSTK28        †       †          ‡       ‡     49.637   7.540      1.229   0.837
    BCSSTK30    1.557   0.609     10.197   4.282      5.430   1.326     14.830   8.838
    BCSSTM12    0.213   0.098      0.135   0.106      0.253   0.072      0.152   0.140
    CUBE12      0.848   0.274      2.966   2.019      2.984   0.657     15.520   4.874
    NOS3        0.107   0.054      0.177   0.175      0.133   0.040      0.126   0.104

Table 6: Solution times in seconds for PCG and Multifrontal Methods

                     PCG/Diag               PCG/IC(0)            PCG/LSP(10)        Multifrontal
    Problem      #It    Error  Spd      #It    Error  Spd      #It    Error  Spd      Error  Spd
    1138-BUS    1006    10^-2  2.2      148    10^-8  2.0      168    10^-8  2.5     10^-10  1.0
    AUDI          17    10^-9  1.7        8    10^-9  2.1        9   10^-10  3.9     10^-13  1.8
    BCSSTK16     228    10^-9  5.1       46    10^-9  1.8       35    10^-9  5.8     10^-10  1.7
    BCSSTK17    2921    10^-7  5.4        ‡        ‡    ‡      488    10^-7  5.8      10^-8  1.4
    BCSSTK25    7719    10^-1  5.4        ‡        ‡    ‡     1729    10^-3  6.2      10^-6  1.3
    BCSSTK28       †        †    †        ‡        ‡    ‡      802    10^-7  6.6      10^-8  1.5
    BCSSTK30      21    10^-9  2.6        8   10^-10  2.4        9   10^-10  4.1     10^-13  1.7
    BCSSTM12     401    10^-7  2.2       17    10^-9  1.3       54    10^-7  3.5     10^-11  1.1
    CUBE12        24   10^-10  3.1        9   10^-10  1.5        9   10^-11  4.5     10^-12  3.2
    NOS3         240    10^-9  2.0       52   10^-10  1.0       33    10^-9  3.3     10^-10  1.2

Table 7: Number of Iterations, Absolute Error and Speedup
    1138-BUS
    Method       NCPU  Sym Fac  Num Fac  Tot Fac    Prec  MV Setup     MV    Iter  Tot Time
    PCG/Diag        1                               0.012     0.010  0.079   0.074     0.186
    PCG/IC(0)       1    0.001    0.017    0.022    0.324     0.010  0.012   0.011     0.384
    PCG/IC(1)       1    0.007    0.018    0.030    0.242     0.010  0.008   0.008     0.303
    PCG/IC(2)       1    0.008    0.019    0.031    0.215     0.010  0.007   0.007     0.276
    PCG/LSP(4)      1                               0.124     0.010  0.028   0.025     0.194
    PCG/LSP(8)      1                               0.143     0.010  0.016   0.015     0.190
    PCG/LSP(10)     1                               0.146     0.010  0.013   0.012     0.188
    PCG/Diag        8                               0.006     0.009  0.016   0.047     0.085
    PCG/IC(0)       8    0.001    0.009    0.013    0.153     0.009  0.004   0.012     0.195
    PCG/IC(1)       8    0.008    0.009    0.021    0.126     0.009  0.003   0.008     0.172
    PCG/IC(2)       8    0.009    0.010    0.022    0.119     0.009  0.003   0.007     0.164
    PCG/LSP(4)      8                               0.041     0.009  0.010   0.027     0.092
    PCG/LSP(8)      8                               0.045     0.009  0.006   0.016     0.081
    PCG/LSP(10)     8                               0.044     0.009  0.005   0.014     0.076

    BCSSTK16
    Method       NCPU  Sym Fac  Num Fac  Tot Fac    Prec  MV Setup     MV    Iter  Tot Time
    PCG/Diag        1                               0.010     0.062  1.431   0.062     1.592
    PCG/IC(0)       1    0.005    0.468    0.555    0.628     0.062  0.293   0.013     1.576
    PCG/IC(1)       1    0.768    0.500    1.355    0.592     0.062  0.267   0.011     2.313
    PCG/IC(2)       1    0.780    0.530    1.403    0.567     0.062  0.249   0.011     2.317
    PCG/LSP(4)      1                               1.884     0.062  0.467   0.020     2.461
    PCG/LSP(8)      1                               2.205     0.062  0.273   0.012     2.580
    PCG/LSP(10)     1                               2.254     0.062  0.224   0.010     2.578
    PCG/Diag        8                               0.003     0.053  0.211   0.028     0.310
    PCG/IC(0)       8    0.005    0.129    0.214    0.528     0.053  0.043   0.006     0.861
    PCG/IC(1)       8    0.758    0.137    0.982    0.516     0.053  0.039   0.005     1.668
    PCG/IC(2)       8    0.770    0.155    1.017    0.518     0.053  0.038   0.005     1.716
    PCG/LSP(4)      8                               0.275     0.053  0.068   0.009     0.460
    PCG/LSP(8)      8                               0.354     0.053  0.042   0.005     0.508
    PCG/LSP(10)     8                               0.332     0.053  0.032   0.004     0.469

Table 8: Timing breakdown for PCG
    JPWH-991
                     Diag                    ILU(0)                  ILU(2)
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG         74   0.060   0.042      23   0.205   0.114      12   0.191   0.160
    CGN        399   0.271   0.137      46   0.375   0.185      16   0.222   0.178
    CGS         41   0.025   0.019      14   0.090   0.075       7   0.128   0.141
    GMR(10)    140   0.051   0.041      27   0.091   0.075      11   0.123   0.134
    GMR(40)     74   0.048   0.043      22   0.083   0.071      11   0.123   0.134
    OMN(10)    140   0.066   0.051      27   0.097   0.080      11   0.123   0.134
    OMN(40)     74   0.073   0.065      22   0.087   0.075      11   0.123   0.134

    ORSIRR-1
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG          †       †       †      56   0.507   0.232      18   0.235   0.178
    CGN          †       †       †     266   2.249   0.912      53   0.548   0.320
    CGS        943   0.369   0.194      40   0.207   0.137      14   0.161   0.157
    GMR(10)      †       †       †      77   0.209   0.142      23   0.152   0.150
    GMR(40)      †       †       †      63   0.192   0.138      19   0.144   0.143
    OMN(10)      †       †       †      77   0.231   0.154      23   0.159   0.157
    OMN(40)      †       †       †      63   0.213   0.153      19   0.148   0.147

    ORSIRR-2
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG        855   0.579   0.286      56   0.435   0.199      18   0.218   0.154
    CGN          †       †       †     247   1.798   0.755      52   0.464   0.279
    CGS        690   0.240   0.129      36   0.163   0.113      13   0.134   0.134
    GMR(10)      †       †       †      76   0.178   0.125      23   0.131   0.131
    GMR(40)      †       †       †      62   0.162   0.120      19   0.124   0.124
    OMN(10)      †       †       †      76   0.194   0.137      23   0.135   0.135
    OMN(40)      †       †       †      62   0.178   0.138      19   0.126   0.126

    ORSREG-1
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG        419   0.671   0.252      69   1.329   0.556      18   0.548   0.344
    CGN          †       †       †     461   8.152   3.219      44   1.000   0.540
    CGS        232   0.188   0.087      45   0.495   0.298      11   0.317   0.402
    GMR(10)    612   0.375   0.300      85   0.496   0.307      18   0.305   0.262
    GMR(40)    497   0.563   0.475      69   0.452   0.297      18   0.308   0.266
    OMN(10)    701   0.591   0.257      85   0.549   0.323      18   0.314   0.192
    OMN(40)    497   0.916   0.494      69   0.501   0.303      18   0.315   0.264

Table 9: Solution Times of Iterative Methods for Non-symmetric Problems, Part I
    SHERMAN1
                     Diag                    ILU(0)                  ILU(2)
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG        569   0.310   0.188      50   0.346   0.180      22   0.189   0.128
    CGN          †       †       †     249   1.578   0.744      65   0.456   0.284
    CGS        396   0.121   0.087      39   0.170   0.120      15   0.103   0.104
    GMR(10)      †       †       †      99   0.216   0.153      31   0.107   0.108
    GMR(40)      †       †       †      62   0.160   0.110      23   0.094   0.095
    OMN(10)      †       †       †      99   0.242   0.173      31   0.116   0.115
    OMN(40)      †       †       †      62   0.181   0.138      23   0.100   0.100

    SHERMAN2
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG          †       †       †      16   0.205   0.117       5   0.743   0.731
    CGN          †       †       †       †       †       †       †       †       †
    CGS          †       †       †       9   0.097   0.083       3   0.708   0.713
    GMR(10)      †       †       †      18   0.099   0.085       5   0.705   0.709
    GMR(40)      †       †       †      15   0.092   0.080       5   0.705   0.711
    OMN(10)      †       †       †      18   0.103   0.087       5   0.705   0.711
    OMN(40)      †       †       †      16   0.097   0.081       5   0.705   0.710

    SHERMAN3
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG          †       †       †      96   2.029   1.366      50   1.881   0.983
    CGN          †       †       †     545  15.678   6.961     165   5.154   2.479
    CGS          †       †       †      94   1.872   0.955      47   1.253   0.912
    GMR(10)      †       †       †    1005  11.196   4.927     325   3.346   1.288
    GMR(40)      †       †       †     322   3.459   2.113     116   1.602   1.085
    OMN(10)      †       †       †    1016  10.384   5.213     325   3.800   2.134
    OMN(40)      †       †       †     322   4.026   1.911     116   3.704   2.005

    SHERMAN5
    Method     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU     #It   1 CPU   8 CPU
    BCG       2334   4.226   1.751      36   0.808   0.428      20   0.854   0.616
    CGN          †       †       †     126   2.390   1.141      46   1.320   0.907
    CGS          †       †       †      29   0.485   0.278      18   0.699   0.580
    GMR(10)      †       †       †      97   0.727   0.546      41   0.740   0.626
    GMR(40)      †       †       †      37   0.406   0.277      20   0.616   0.523
    OMN(10)      †       †       †      97   0.805   0.522      41   0.777   2.617
    OMN(40)      †       †       †      37   0.444   0.252      20   0.627   0.522

Table 10: Solution Times of Iterative Methods for Non-symmetric Problems, Part II
    SHERMAN3
    Method       NCPU  Sym Fac  Num Fac  Tot Fac    Prec  MV Setup     MV    Iter  Tot Time
    CGS/Diag        1                               0.497     0.046  2.838   2.479    †5.933
    CGS/ILU(0)      1    0.028    0.211    0.239    1.475     0.047  0.054   0.046     1.873
    CGS/ILU(1)      1    0.077    0.217    0.294    0.948     0.046  0.034   0.030     1.363
    CGS/ILU(2)      1    0.145    0.227    0.373    0.772     0.046  0.027   0.024     1.253
    CGS/LSP(4)      1                               4.976     0.046  1.030   0.899     6.984
    CGS/LSP(8)      1                               5.998     0.046  0.642   0.560     7.271
    CGS/LSP(10)     1                               5.954     0.046  0.514   0.447     6.984
    CGS/Diag        8                               0.171     0.048  0.772   0.970    †2.037
    CGS/ILU(0)      8    0.046    0.080    0.126    0.771     0.049  0.016   0.017     0.989
    CGS/ILU(1)      8    0.077    0.087    0.164    0.539     0.048  0.009   0.011     0.783
    CGS/ILU(2)      8    0.146    0.120    0.266    0.570     0.048  0.007   0.009     0.912
    CGS/LSP(4)      8                               1.172     0.048  0.268   0.332     1.862
    CGS/LSP(8)      8                               1.443     0.048  0.167   0.195     1.878
    CGS/LSP(10)     8                               1.396     0.048  0.131   0.158     1.756
    OMN/Diag        1                               0.264     0.045  1.542   5.904    †7.801
    OMN/ILU(0)      1    0.027    0.208    0.236    8.575     0.045  0.313   1.197    10.384
    OMN/ILU(1)      1    0.076    0.214    0.291    5.485     0.045  0.200   0.765     6.802
    OMN/ILU(2)      1    0.144    0.225    0.369    2.885     0.045  0.101   0.388     3.800
    OMN/LSP(4)      1                               7.448     0.045  1.540   5.908   †14.988
    OMN/LSP(8)      1                              14.580     0.046  1.566   5.978   †22.218
    OMN/LSP(10)     1                              18.060     0.046  1.566   5.977   †25.695
    OMN/Diag        8                               0.086     0.048  0.407   1.983    †2.569
    OMN/ILU(0)      8    0.080    0.101    0.182    4.529     0.048  0.078   0.403     5.256
    OMN/ILU(1)      8    0.077    0.086    0.163    3.231     0.048  0.054   0.255     3.767
    OMN/ILU(2)      8    0.147    0.117    0.265    1.748     0.048  0.023   0.126     2.223
    OMN/LSP(4)      8                               1.532     0.048  0.356   1.865    †3.845
    OMN/LSP(8)      8                               2.884     0.048  0.344   1.870    †5.192
    OMN/LSP(10)     8                               3.637     0.048  0.369   1.886    †5.986

Table 11: Timing breakdown for CGS and OMN(10).
In these cases the robustness of IC preconditioning is not needed. For problems not studied here, we have seen that variants of IC preconditioning like the shifted IC preconditioner of [12] are quite robust and solve problems which cannot be solved by diagonal or polynomial preconditioning. One of these variants will be installed in future versions of CrayPCG.

In comparing CrayPCG to the multifrontal solver, CrayPCG is competitive for moderately well to well conditioned problems. However, in the absence of better preconditioners, CrayPCG cannot compete with the multifrontal solver for ill-conditioned problems. With the shifted IC factorization mentioned above, we have a robust preconditioner which is advantageous for problems which are too large for the multifrontal solver, but its vector/parallel performance is poor. Thus, for ill-conditioned problems, more robust and efficient preconditioners are needed for CrayPCG to be competitive with a good direct solver for symmetric positive definite problems.

Overall, the vector performance of the routines in CrayPCG is adequate, with the exception of IC and ILU preconditioning. However, except in the case of polynomial-preconditioned CG, the parallel performance of CrayPCG is somewhat disappointing. Clearly, for a large number of processors, different techniques must be considered, especially for non-symmetric problems where the present polynomial preconditioners are ineffective.
References

[1] Edward Charles Anderson. Parallel implementation of preconditioned conjugate gradient methods for solving sparse systems of linear equations. Master's thesis, Center for Supercomputing Research and Development, University of Illinois, August 1988.

[2] Steven F. Ashby and Mark K. Seager. A proposed standard for iterative linear solvers. Technical report, Lawrence Livermore National Laboratory, January 1990.

[3] Cleve Ashcraft and Roger G. Grimes. On vectorizing incomplete factorization and SSOR preconditioners. SIAM J. Sci. Stat. Comput., 9:122–151, 1988.

[4] O. Axelsson and V. Eijkhout. Vectorizable preconditioners for elliptic difference equations in three space dimensions. Journal of Computational and Applied Mathematics, 27:299–321, 1989.

[5] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Sparse matrix test problems. ACM Transactions on Mathematical Software, 15(1):1–14, March 1989.

[6] Jocelyne Erhel. Sparse matrix multiplication on vector computers. International Journal of High Speed Computing, 2(2):101–116, 1990.

[7] P. Fernandes and P. Girdinio. A new storage scheme for an efficient implementation of the sparse matrix-vector product. Parallel Computing, 12:327–333, 1989.

[8] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, second edition, 1989.

[9] Steven W. Hammond and Robert Schreiber. Efficient ICCG on a shared memory multiprocessor. Technical report, Research Institute for Advanced Computer Science (RIACS), May 1989.
[10] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. National Bureau of Standards, 49:409–436, 1952.

[11] William Jalby, Ulrike Meier, and Ahmed Sameh. The behaviour of conjugate gradient based algorithms on a multi-vector processor with a memory hierarchy. Technical report, Center for Supercomputing Research and Development (CSRD), February 1987.

[12] T. A. Manteuffel. An incomplete factorization technique for positive definite linear systems. Math. Comp., 34:473–497, 1980.

[13] Rami Melhem. Parallel solution of linear systems with striped sparse matrices. Parallel Computing, 6:165–184, 1988.

[14] Gerard Meurant. The conjugate gradient method on vector and parallel supercomputers. Technical report, CEA, Centre d'Etudes de Limeil-Valenton, July 1989.

[15] Gerard Meurant. Iterative methods for multiprocessor vector computers. Computer Physics Reports, 11:51–80, November 1989.

[16] Thomas C. Oppe, Wayne D. Joubert, and David R. Kincaid. NSPCG User's Guide. Center for Numerical Analysis, The University of Texas at Austin, December 1988.

[17] James M. Ortega. Introduction to Parallel and Vector Solution of Linear Systems. Frontiers of Computer Science. Plenum Press, New York and London, 1988.

[18] Gaia Valeria Paolini and Giuseppe Radicati Di Brozolo. Data structures to vectorize CG algorithms for general sparsity patterns. BIT, 29:703–718, 1989.

[19] J. K. Reid. On the method of conjugate gradients for the solution of large sparse systems of linear equations. In J. K. Reid, editor, Large Sparse Sets of Linear Equations. Academic Press, 1971.

[20] Youcef Saad. SPARSKIT: a basic tool kit for sparse matrix computations. Preliminary version.

[21] Youcef Saad. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM J. Sci. Stat. Comput., 6(4):865–881, October 1985.

[22] Youcef Saad. Krylov subspace methods on supercomputers. SIAM J. Sci. Stat. Comput., 10(6):1200–1232, November 1989.

[23] Youcef Saad and Martin H. Schultz. Conjugate gradient-like algorithms for solving nonsymmetric linear systems. Math. Comp., 44(170):417–424, April 1985.

[24] Youcef Saad and Martin H. Schultz. Parallel implementations of preconditioned conjugate gradient methods. Technical report, Yale University, October 1985.

[25] Youcef Saad and Martin H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856–869, July 1986.

[26] Mark K. Seager. A SLAP for the masses. Technical report, Lawrence Livermore National Laboratory, December 1988.
[27] Peter Sonneveld. CGS, a fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 10(1):36–52, January 1989.