Iterative methods for complex symmetric systems with multiple right-hand sides
V. Simoncini and E. Gallopoulos

December 1993 CSRD Report No. 1322 March 1994 revision

Dipartimento di Fisica, Università degli Studi di Bologna, Italy.
Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, Illinois 61801.

Abstract. The aim of this paper is to introduce and analyze block and nonblock iterative methods to solve $A[x^{(1)}, \ldots, x^{(s)}] = [b^{(1)}, \ldots, b^{(s)}]$, where $A$ is complex symmetric. The block methods are based on short-term recurrences combined with quasi-minimization of the residual. The inner iteration of the nonblock method applies one residual polynomial to several right-hand sides. Implementation of all methods on a shared-memory parallel computer is discussed and results from numerical experiments are reported. Comparisons with a single right-hand side solver are included, and conditions under which each approach demonstrates superior performance are analyzed.

Key words. complex symmetric matrices, multiple right-hand sides, block methods, conjugate gradients, quasi-minimal residual, residual polynomial, sparse matrices, Lanczos algorithm, electromagnetics, parallel processing

AMS subject classifications. 65F10, 65Y20

1. Introduction. The design of methods for the solution of complex symmetric systems with multiple right-hand sides,

(1) $AX = B$,

$A \in \mathbb{C}^{n \times n}$, $A^T = A$, $X, B \in \mathbb{C}^{n \times s}$, and $s \ll n$, is a challenging problem with important applications in areas such as electromagnetics and chemistry [4, 22, 24, 34, 39, 43]. When $n$ is very large, iterative methods become necessary. Two important issues need to be addressed.

i. Matrix $A$ cannot always be put in the form $e^{i\theta}(T + \sigma I)$, where $T = T^H$ and $\theta \in \mathbb{R}$. Therefore, it is not possible, in general, to build iterates based on short recurrences and at the same time satisfy monotonicity properties for some measure of the error [8, 42].

ii. There are $s$ systems and they all share the same coefficient matrix $A$. Is it possible to exploit this fact and design methods that are more effective than applying a good single right-hand side solver to each system independently?

One approach for (i) is to extend over the complex field techniques used for real nonsymmetric systems [11]; examples are CG on the normal equations, biconjugate gradient (BiCG), and truncated Orthomin; see [3, 32, 38, 39] for references from the electromagnetics literature. If we abandon the short recurrence requirement, we could also use the complex version of the generalized minimum residual method of Saad and Schultz (GMRES) [30]; see [4]. Joly and Meurant [18] review several iterative methods and report on experiments for general complex matrices. Related methods are presented in [16, 23]; the method of [20] uses polynomial acceleration on an equivalent set of coupled real systems. It is worth noting that CG-type methods have been proposed and used for the single right-hand side case of (1) in electromagnetics without the usual step of converting to normal equations; see for example [32, 33] and the analysis of some of these methods in [40]. It seems, however, that the most thorough study of the problem to date is due to Freund [10]. He proposes and analyzes algorithms that are based on a complex symmetric version of the Lanczos recursions and produce residuals which satisfy a quasi-minimization property.
Key is the use of the indefinite bilinear form $\langle x, y \rangle = x^T y$ (termed "natural inner product" by Craven [6]), instead of the standard inner product $(x, y) = x^* y$ (the superscript "*" denotes the conjugate transpose). Breakdowns due to quasi-null vectors, that is, vectors $x \neq 0$ such that $x^T x = 0$, are handled using look-ahead. It was shown in [10] that when $A$ is diagonalizable and the starting residual can be written as a linear combination of non quasi-null eigenvectors of $A$, then incurable breakdowns cannot occur in the complex symmetric nonblock Lanczos algorithm. To simplify our discussion, we will be assuming from now on that the starting residuals have no projection in the directions of quasi-null eigenvectors.

Therefore, in addressing (ii), which is the primary motivation for our work, we start from the methods described in [10] and propose iterative techniques that allow the sharing of information during the solution of (1). An ideal multiple right-hand side method should solve for all systems at a rate that is in general much better than, and at worst comparable with, solving each system independently. We consider two approaches. One is to apply block iteration to construct the solution by making use of information extracted from all the systems in each step. The nonblock approach is to choose one system at a time and use the extracted information to help the remaining systems proceed toward a solution. We discussed the real nonsymmetric problem in [36]. The methods we present here are different from those of [36] even though the objectives are the same. Block methods have been investigated in several contexts: [25, 26, 35] discuss the case of real linear systems; [14, 17, 31] the eigenvalue problem; and [2, 15] block methods in control.

Our paper is structured as follows. In Sections 2.1 and 2.2 the block quasi-minimum residual implementations of CG-like and Lanczos methods are presented; a modified version of the block Lanczos algorithm is described in Section 2.3.
In Section 3, we propose a nonblock approach to solve problem (1), and discuss the theoretical justification in Section 3.2. Computational and memory requirements are presented in Section 4. Section 5 contains numerical experiments and actual performance results on a shared-memory parallel computer.

For the rest of the paper, unless otherwise stated, $A \in \mathbb{C}^{n \times n}$ will be assumed nonsingular. $\sigma_i(A)$ will indicate the $i$th singular value of $A$, starting from the largest. We denote by $0_m$ and $I_m$ the zero and identity matrices of order $m$, by $0_{n,s}$ the zero rectangular matrix, and by $I_{n,s} = [e_1, \ldots, e_s]$ the leading $n \times s$ section of the identity matrix, where $e_j$ denotes the $j$th unit vector. $\mathbb{P}_k$ denotes the space of scalar-valued complex polynomials of maximum degree $k$, and $\bar{\mathbb{P}}_k$ the subset of $\mathbb{P}_k$ with elements satisfying $p(0) = 1$. Similarly,

$\mathbb{P}_{k,s} := \{ \zeta_0 + \zeta_1 \lambda + \cdots + \zeta_k \lambda^k \mid \zeta_0, \ldots, \zeta_k \in \mathbb{C}^{s \times s} \}$

will denote the space of matrix-valued complex polynomials of maximum degree $k$ and order $s$, and $\bar{\mathbb{P}}_{k,s}$ the subset of $\mathbb{P}_{k,s}$ with elements satisfying $\zeta_0 = I_s$. Unless otherwise specified, $\|\cdot\|$ will denote the Euclidean norm for vectors and the Frobenius norm for matrices. We denote $n$ by $s$ matrices using both their name and column vector notation, e.g., $V_k$ and $[v_k^{(1)}, \ldots, v_k^{(s)}]$. We use semicolons to delimit adjacent rows of matrices, e.g., $[\alpha_{11}, \alpha_{12}; \alpha_{21}, \alpha_{22}]$ for a matrix of order 2. For a matrix with elements $t_{i,j}$ we denote by $t_{:,j}$ the $j$th column and by $t_{i,:}$ the $i$th row. We denote by $\langle \cdot, \cdot \rangle_L$ the bilinear map defined by

(2) $\langle X, Y \rangle_L := X^T L Y$, for $X, Y \in \mathbb{C}^{n \times s}$ and symmetric $L \in \mathbb{C}^{n \times n}$.

We omit the subscript whenever $L = I_n$. Matrices or vectors of random elements are picked from a uniform distribution in $[0, 1]$. Function RAND$(k, m)$ returns a $k$ by $m$ matrix of real random values. Except where noted, operation counts correspond to complex floating-point arithmetic.
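The difference between the bilinear map (2) and the standard inner product can be seen on a two-element vector. The following is a minimal sketch in Python with NumPy; the helper name `bilinear` is an assumption made for illustration and does not appear in the paper.

```python
import numpy as np

def bilinear(X, Y, L=None):
    """Return <X, Y>_L = X^T L Y (plain transpose, no conjugation).

    `bilinear` is a hypothetical helper; L defaults to the identity.
    """
    X = np.asarray(X, dtype=complex)
    Y = np.asarray(Y, dtype=complex)
    if L is None:
        return X.T @ Y
    return X.T @ (np.asarray(L, dtype=complex) @ Y)

# A quasi-null vector: nonzero, yet x^T x = 0.
x = np.array([1.0, 1.0j])
quasi_null_value = bilinear(x, x)   # 1*1 + (1j)*(1j) = 0
hermitian_value = np.vdot(x, x)     # x^* x = 2 (np.vdot conjugates its first argument)
```

Because the form is indefinite, a vanishing value of `bilinear(x, x)` does not imply that `x` is zero, which is the source of the breakdowns mentioned in the introduction.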

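2. Block methods. We first discuss some design issues for Krylov-based block methods. In analogy with the single right-hand side solvers of [10], after $k$ steps of our block methods, and assuming that no breakdown has occurred, two sets of matrices would have been generated, namely $\mathcal{P}_k = [P_0, \ldots, P_k]$ and $\mathcal{V}_k = [V_0, \ldots, V_k]$, with $P_i, V_i \in \mathbb{C}^{n \times s}$. These matrices satisfy

(3) $A \mathcal{P}_{k-1} = \mathcal{V}_k T_k$,

with the block tridiagonal matrix $T_k$

(4) $T_k = \begin{pmatrix} \delta_0 & \varsigma_1 & & \\ \eta_0 & \delta_1 & \ddots & \\ & \eta_1 & \ddots & \varsigma_{k-1} \\ & & \ddots & \delta_{k-1} \\ & & & \eta_{k-1} \end{pmatrix}$, $T_k \in \mathbb{C}^{(k+1)s \times ks}$; $\delta_i, \varsigma_i, \eta_i \in \mathbb{C}^{s \times s}$.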

The approximate solution of (1) is given by $X_{k+1} = X_0 + \mathcal{P}_{k-1} y$, for some matrix $y \in \mathbb{C}^{ks \times s}$. Then the residual can be written as $R_{k+1} = R_0 - \mathcal{V}_k T_k y$. For the remainder of this paper we will be assuming that the algorithms we present are not subject to incurable breakdowns for the data of interest; see [10] for a discussion of the single right-hand side case. In exact arithmetic, and if breakdown has not occurred, the above approach yields the solution in at most $\lceil n/s \rceil$ steps [26]. In practice $n/s$ is very large, and we would like to obtain an approximate solution earlier. We observe that our algorithms use the bilinear map (2) and, since no minimization is performed, the residual error curves might be erratic. We thus introduce a block quasi-minimization procedure and investigate its implementation and effect for complex symmetric versions of block Lanczos and CG-like methods.

Let $V_0 := B - AX_0$ with $A$ complex symmetric. Defining $E_k := [I_s, 0_s, \ldots, 0_s]^T \in \mathbb{R}^{ks \times s}$, the residual can be written as $R_{k+1} = \mathcal{V}_k (E_{k+1} - T_k y)$. In general matrix $\mathcal{V}_k$ is not unitary, so a diagonal scaling matrix $W_k = \mathrm{diag}(\omega_0, \ldots, \omega_k)$ is used to normalize the columns of $\mathcal{V}_k = [V_0, \ldots, V_k]$, with $\omega_i = \mathrm{diag}(\|v_i^{(1)}\|, \ldots, \|v_i^{(s)}\|)$; it follows that $R_{k+1} = \mathcal{V}_k W_k^{-1} W_k (E_{k+1} - T_k y)$. Then $y$ is chosen in order to solve

(5) $\min_{y \in \mathbb{C}^{ks \times s}} \| W_k (E_{k+1} - T_k y) \|$.

Note that other choices, yielding different convergence histories, are possible for the scaling matrix $W_k$; for example, we can choose $W_k$ to be block diagonal. Problem (5) is solved using the factorization $W_k T_k = Q_k R_k$ with $Q_k$ unitary and $R_k$ upper triangular. Matrix $T_k$ is block Hessenberg, thus the QR decomposition is done using Givens rotations. Clearly, these steps are of higher computational complexity than the corresponding ones for the usual QMR, as one has to deal with order-$s$ matrices instead of scalars. For example, in generating $R_k$, the triangular matrix $\eta_k$ needs to be annihilated at each iteration; see [36, 41] for a similar procedure applied to block GMRES. The approximate solution at step $k$ becomes $X_{k+1} = X_0 + \mathcal{S}_k t_k$, with $\mathcal{S}_k = \mathcal{P}_{k-1} R_k^{-1}$ and $t_k = Q_k^T E_{k+1}$. Exploiting the banded structure of $W_k T_k$, $X_{k+1}$ can be updated at each iteration by $X_{k+1} = X_k + S_{k+1} \tau_k$, after the computation of the new components $S_{k+1}$ and $\tau_k$ of $\mathcal{S}_{k+1} = [S_1, \ldots, S_{k+1}]$ and $t_k = [\tau_0^T, \ldots, \tau_k^T]^T$ [10].

We are now in a position to design algorithms based on quasi-minimization of block CG-like and Lanczos-like procedures for complex symmetric matrices. These algorithms correspond to specific choices for $T_k$, $\mathcal{V}_k$, and $\mathcal{P}_k$. We note that block Lanczos and CG algorithms for Hermitian systems were first described in [26].
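The quasi-minimization problem (5) is an ordinary weighted least-squares problem with $s$ right-hand sides. The sketch below sets it up for a small block bidiagonal $T_k$ and solves it with a library QR factorization; the paper instead updates the factorization incrementally with Givens rotations, and all sizes and the random weights here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s, k = 2, 3                                    # illustrative block size and step count

# Block bidiagonal T_k of size (k+1)s x ks: identity blocks on the
# diagonal and minus identity blocks on the subdiagonal.
T = np.zeros(((k + 1) * s, k * s), dtype=complex)
for j in range(k):
    T[j * s:(j + 1) * s, j * s:(j + 1) * s] = np.eye(s)
    T[(j + 1) * s:(j + 2) * s, j * s:(j + 1) * s] = -np.eye(s)

W = np.diag(rng.uniform(0.5, 2.0, size=(k + 1) * s))   # stand-in for the scaling W_k
E = np.zeros(((k + 1) * s, s), dtype=complex)
E[:s, :] = np.eye(s)                                   # E_{k+1} = [I_s, 0, ..., 0]^T

# Solve min_y || W (E - T y) ||_F via the reduced QR factorization W T = Q R.
Q, R = np.linalg.qr(W @ T)
y = np.linalg.solve(R, Q.conj().T @ (W @ E))
quasi_residual = W @ (E - T @ y)
```

Since the Frobenius norm separates by columns, each column of `y` solves its own least-squares problem with the same matrix, which is why one factorization serves all $s$ right-hand sides.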

2.1. Block QMRCG-like method. In this section we design a CG-like algorithm for complex symmetric systems, based on the bilinear map (2), for suitable choice of $L$, and we combine it with a quasi-minimization procedure in order to improve the behavior of the residual norm. From $R_{k+1} = R_k - A P_k \alpha_k$, the quantities in (3) are $\mathcal{V}_k = [R_0, \ldots, R_k]$ and $\mathcal{P}_k = [P_0 \alpha_0, \ldots, P_k \alpha_k]$. Matrix $T_k$ is block bidiagonal with $\delta_k = I_s$ and $\eta_k = -I_s$. By using the weight matrix $W_k = \mathrm{diag}(\omega_0, \ldots, \omega_k)$ we actually compute $\delta_k = \omega_k$, $\eta_k = -\omega_{k+1}$, with $\omega_k = \mathrm{diag}(\|r_k^{(1)}\|, \ldots, \|r_k^{(s)}\|)$. The new approximate solution is updated using the factors of the QR decomposition of $W_k T_k$. The details of the algorithm are shown below. The leading floating-point operation counts of the iteration steps sum to $12ns^2 + 9ns + \frac{25}{3}s^3$ per iteration; the estimate of the total cost per step is presented in Section 4.

Algorithm 2.1: BQMRCG$(A, B, X_0, L)$
  $R_0 = B - AX_0$; $S_0 = 0_{n,s}$
  $[\gamma_0, P_0] = \mathrm{ORTH}(R_0)$
  $\omega_0 = \mathrm{diag}(\|r_0^{(1)}\|, \ldots, \|r_0^{(s)}\|)$; $\tau_0 = \omega_0$
  $\rho_0 = \langle R_0, R_0 \rangle_L$; $G_{-1} = I_{2s}$
  for $k = 0, 1, \ldots$
    $\mu_k = \langle AP_k, P_k \rangle_L$; $\alpha_k = \mu_k^{-1} \gamma_k^T \rho_k$
    $R_{k+1} = R_k - A P_k \alpha_k$
    $\omega_{k+1} = \mathrm{diag}(\|r_{k+1}^{(1)}\|, \ldots, \|r_{k+1}^{(s)}\|)$
    $\varsigma_k = 0_s$; $\delta_k = \omega_k$; $\eta_k = -\omega_{k+1}$
    $[\varsigma_k; \delta_k]^T = G_{k-1} [\varsigma_k; \delta_k]^T$
    $[G_k, \delta_k, \tau_{k+1}, \tau_k] = \mathrm{GIVENS1}(\delta_k, \eta_k, \tau_k)$
    $S_{k+1} = (P_k \alpha_k - S_k \varsigma_k) \delta_k^{-1}$
    $X_{k+1} = X_k + S_{k+1} \tau_k$
    $\rho_{k+1} = \langle R_{k+1}, R_{k+1} \rangle_L$
    $\hat{P}_{k+1} = R_{k+1} + P_k \gamma_k^{-1} \rho_k^{-1} \rho_{k+1}$
    $[\gamma_{k+1}, P_{k+1}] = \mathrm{ORTH}(\hat{P}_{k+1})$
  end

Functions ORTH and GIVENS1 are computational primitives. The application of ORTH on a block of $s$ column $n$-vectors $\hat{Y}$ generates $Y$ and $\gamma$ such that $Y = \hat{Y} \gamma$ and $Y^T Y = I_s$ [5]. We observe that such orthogonalization does not ensure the column rank of $Y$ to be equal to the column rank of $\hat{Y}$. Therefore the process is not equivalent to Gram-Schmidt orthogonalization with respect to the usual inner product. Function GIVENS1 applies Givens rotations as described in Appendix A.2.

An important feature of the algorithm above is that matrices $\delta_k$ and $\eta_k$ are diagonal, so that the block quasi-minimization part of the computation is decoupled among the right-hand sides. As presented, Algorithm 2.1 does not compute the true residual $R^{\mathrm{true}}_{k+1}$ at each step. This can be obtained by using the update

(6) $R^{\mathrm{true}}_{k+1} = R^{\mathrm{true}}_k g_{3,k}^H g_{3,k} + R_{k+1} \omega_{k+1}^{-1} g_{4,k}^H \tau_{k+1}$,

where $G_k = [g_{1,k}, g_{2,k}; g_{3,k}, g_{4,k}]$, with $g_{j,k} \in \mathbb{C}^{s \times s}$. An approximation for the norm of each residual $r^{(i)}_{k+1}$ is available from $\|r^{(i)}_{k+1}\| \le s \sqrt{k+2}\, |t^{(k+1)}_{i,i}|$ with $\tau_k = (t^{(k)}_{i,j})$ [10].

From $\langle AP_k, P_j \rangle_L = 0_s$ for $k \neq j$ and $\langle AP_k, P_k \rangle_L \neq 0_s$ it follows that $\langle A\mathcal{P}_k, \mathcal{P}_k \rangle_L = \mathrm{diag}(\mu_0, \ldots, \mu_k)$, where $\mathrm{diag}(\mu_0, \ldots, \mu_k)$ is a block diagonal matrix, with $\mu_i$ ($0 \le i \le k$) not necessarily diagonal. Finite precision arithmetic causes loss of orthogonality in Algorithm 2.1, and matrices $\mu_k$, $\rho_k$ become very ill-conditioned (see Section 5). The experiments were conducted using $L = I$, that is, with the quasi-minimal residual version of block CG. One could also choose $L = A$ and thus obtain the quasi-minimal residual version of the block minimum residual method (adapted to complex symmetric matrices).
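To make the role of ORTH concrete, here is one possible way to realize its contract ($Y = \hat{Y}\gamma$ with $Y^T Y = I_s$) in Python. It is only a sketch under stated assumptions: it factors the complex symmetric Gram matrix $\hat{Y}^T \hat{Y} = C C^T$ with an unconjugated Cholesky-like recurrence, which is not necessarily the procedure of [5], and the name `orth_bilinear` is hypothetical.

```python
import numpy as np

def orth_bilinear(Y_hat):
    """Return (gamma, Y) with Y = Y_hat @ gamma and Y.T @ Y = I (no conjugation).

    Assumption: uses an unconjugated Cholesky-like factorization M = C C^T
    of the Gram matrix M = Y_hat^T Y_hat; it fails on a zero pivot.
    """
    Y_hat = np.asarray(Y_hat, dtype=complex)
    s = Y_hat.shape[1]
    M = Y_hat.T @ Y_hat                        # complex symmetric, s x s
    C = np.zeros((s, s), dtype=complex)        # lower triangular factor
    for j in range(s):
        d = M[j, j] - np.sum(C[j, :j] ** 2)
        if abs(d) == 0.0:
            raise ZeroDivisionError("breakdown: zero pivot (quasi-null direction)")
        C[j, j] = np.sqrt(d)                   # complex square root
        for i in range(j + 1, s):
            C[i, j] = (M[i, j] - np.sum(C[i, :j] * C[j, :j])) / C[j, j]
    gamma = np.linalg.inv(C).T                 # then Y^T Y = C^{-1} M C^{-T} = I
    return gamma, Y_hat @ gamma

rng = np.random.default_rng(0)
Y_hat = rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3))
gamma, Y = orth_bilinear(Y_hat)
```

The columns of `Y` are orthonormal only in the bilinear sense; `Y.conj().T @ Y` is in general not the identity, which illustrates why the process differs from Gram-Schmidt with the usual inner product.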

2.2. Block QMR Lanczos method. In this section we present a block quasi-minimal residual method based on the Lanczos recurrence for complex symmetric matrices [10]. For this algorithm, the quantities in (3) are $\mathcal{P}_k = [Q_0, \ldots, Q_k]$ and $\mathcal{V}_k = \mathcal{P}_k$. Matrix $T_k$ is block tridiagonal with columns

$\begin{pmatrix} \beta_k^T \\ \alpha_k \\ \beta_{k+1} \end{pmatrix}$, $\beta_k, \alpha_k, \beta_{k+1} \in \mathbb{C}^{s \times s}$,

leading to a block three-term recurrence for the $Q_i$'s. The weight matrix $W_k$ is defined as in BQMRCG. The algorithm is shown below; its leading floating-point operation counts sum to $16ns^2 + 7ns + 14s^3$ per iteration. Function ORTH is as described in the context of BQMRCG. Function GIVENS carries out the incremental block QR decomposition of $T_k$ and is described in Appendix A.2. Matrices $G_{k-2}$, $G_{k-1}$ contain the rotation coefficients of the two preceding steps.

Algorithm 2.2: BQMRLAN$(A, B, X_0)$
  $\hat{Q}_0 = B - AX_0$; $Q_{-1} = S_{-1} = S_{-2} = 0_{n,s}$
  $G_{-2} = G_{-1} = I_{2s}$
  $[\beta_0^{-1}, Q_0] = \mathrm{ORTH}(\hat{Q}_0)$
  $\omega_0 = \mathrm{diag}(\|q_0^{(1)}\|, \ldots, \|q_0^{(s)}\|)$
  $\tau_0 = \omega_0 \beta_0$; $\omega_{-1} = 0_s$
  for $k = 0, 1, \ldots$
    $\hat{Q}_{k+1} = A Q_k - Q_{k-1} \beta_k^T$
    $\alpha_k = Q_k^T \hat{Q}_{k+1}$
    $\hat{Q}_{k+1} = \hat{Q}_{k+1} - Q_k \alpha_k$
    $[\beta_{k+1}^{-1}, Q_{k+1}] = \mathrm{ORTH}(\hat{Q}_{k+1})$
    $\omega_{k+1} = \mathrm{diag}(\|q_{k+1}^{(1)}\|, \ldots, \|q_{k+1}^{(s)}\|)$
    $\theta_k = 0_s$; $\varsigma_k = \omega_{k-1} \beta_k^T$
    $\delta_k = \omega_k \alpha_k$; $\eta_k = \omega_{k+1} \beta_{k+1}$
    $[\theta_k; \varsigma_k]^T = G_{k-2} [\theta_k; \varsigma_k]^T$
    $[\varsigma_k; \delta_k]^T = G_{k-1} [\varsigma_k; \delta_k]^T$
    $[G_k, \delta_k, \tau_{k+1}, \tau_k] = \mathrm{GIVENS}(\delta_k, \eta_k, \tau_k)$
    $S_{k+1} = (Q_k - S_k \varsigma_k - S_{k-1} \theta_k) \delta_k^{-1}$
    $X_{k+1} = X_k + S_{k+1} \tau_k$
  end

The true residual can be computed from

(7) $R^{\mathrm{true}}_{k+1} = R^{\mathrm{true}}_k \tau_k^{-1} g_{3,k}^H \tau_{k+1} + Q_{k+1} \omega_{k+1}^{-1} g_{4,k}^H \tau_{k+1}$,

and $\|r^{(i)}_{k+1}\| \le s \sqrt{k+2}\, \|t^{(k+1)}_{:,i}\|$ [10]. We note however that, unlike the case of BQMRCG, the block quasi-minimization procedure in Algorithm 2.2 cannot be decoupled into its $s$ components.

In exact arithmetic, BQMRLAN generates the set $[Q_0, \ldots, Q_k]$ of $(k+1)s$ vectors that are orthogonal with respect to $\langle \cdot, \cdot \rangle$. A three-term recurrence leads to orthogonality between $Q_i$ and $Q_j$ ($i \neq j$), whereas orthogonality within the columns of $Q_i$ is forced by ORTH; unfortunately finite precision arithmetic causes early loss of orthogonality.

2.3. Stability and error bounds. While it is known that Lanczos and CG-like algorithms are sensitive to roundoff, problems seem to be amplified in the block versions of these algorithms. As we observed, the quasi-minimization computations in BQMRCG are completely decoupled between right-hand sides. Hence any deterioration in accuracy is primarily due to the underlying CG procedure. In particular, ill-conditioning of $\mu_k$ and $\rho_k$ can cause stagnation (see Section 5). In order to ameliorate the situation we will be using a technique first proposed in [1] for block CG and Hermitian $A$, which uses orthogonalization of the columns of $R_k$. In that implementation each block $P_k$ is orthogonalized with respect to $AP_k$. In Sections 4 and 5 we will be using this modification, referring to the resulting algorithm as MBCG.

The problem of loss of orthogonality for block Lanczos was discussed in [21]. Errors are amplified both because of computations with full matrices of order $s$ and because of three-term recurrences in $Q_k$ and $S_k$. Lewis showed that

(8) $\epsilon_k \le \left( \|\alpha_k\| \, \|I_{k+1} - Q_k^T Q_k\| + \|Q_k^T A Q_k - \alpha_k\| + \epsilon_{k-1} \|\beta_k\| \right) \|\beta_{k+1}^{-1}\|$,

where $\epsilon_k = \|Q_k^T Q_{k+1}\|$. Expanding (8) we see that $\epsilon_k$ can grow as the product of $\mathrm{cond}(\beta_j)$ for $j \le k$. Furthermore, if the norms of the differences on the right of (8) are not zero, there will be further deterioration. One cure for avoiding numerical loss of rank of $Q_k$ is to deflate linearly dependent columns of the matrix but continue updating the corresponding approximate solutions [26]. Deflation can also be useful for eliminating converged systems. In our experiments, however, failure of BQMRLAN was not caused by loss of rank within the block.

To avoid error accumulation from the three-term recurrence we convert to a coupled two-term one, as was proposed in [12, 13] for Lanczos QMR, and call the algorithm BQMRLAN2; this is described in Appendix A.1. This helps preserve orthogonality between different $Q_k$'s and mitigates the growth of $\epsilon_k$ at the cost of some additional work. Up to scaling, the recurrence matrices are the same as those computed by CG, so that in exact arithmetic BQMRLAN2 and BQMRCG generate the same solution iterates [12], though in practice the two methods give different results; see Section 5.

We next present residual error bounds for Algorithms 2.1 and 2.2 using tools that we developed to analyze block GMRES [37]. From (3) and setting $\mathcal{V}_k = \mathcal{P}_k L_k$ and $H_k = T_k L_{k-1}$ we have the relation

(9) $A \mathcal{V}_{k-1} = \mathcal{V}_k H_k$,

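where $H_k$ is block tridiagonal. Let $\Psi_i \in \mathbb{P}_{i,s}$ be such that

$V_i = \Psi_i(A) \circ R_0 \equiv \sum_{j=0}^{i} A^j R_0 \xi_j^{(i)}$,

where $\xi_j^{(i)} \in \mathbb{C}^{s \times s}$ ($0 \le i \le k$). Then we can write the residuals from Algorithms 2.1 and 2.2 as

(10) $R_k = \Phi_k(A) \circ R_0$

with $\Phi_k(\lambda) = f_k(\lambda) z_k \in \bar{\mathbb{P}}_{k,s}$, where $z_k = W_k (E_{k+1} - H_k y)$ and $f_k(\lambda) = [\Psi_0(\lambda), \ldots, \Psi_k(\lambda)]$. As discussed earlier in this section, the $W_k$'s are chosen to make the columns of $\mathcal{V}_k$ of unit length, hence $\|V_k\| = \sqrt{s}$. Thus we have

$\|f_k(\lambda)\|^2 = \sum_{j=0}^{k} \|\Psi_j(\lambda)\|^2 = s^2 (k+1)$

and it follows that

(11) $\|\Phi_k(\lambda)\| \le s \sqrt{k+1}\, \|\Phi^B_k(\lambda)\|$,

where $\Phi^B_k$ is the minimum residual matrix-valued polynomial that solves

$\min_{\Phi_k \in \bar{\mathbb{P}}_{k,s}} \|\Phi_k(A) \circ R_0\|$.

Let $\Lambda_\varepsilon(A) := \{\lambda \in \mathbb{C} : \|(\lambda I - A)^{-1}\| \ge \varepsilon^{-1}\}$ and define $\|\Phi_k\|_{\Lambda_\varepsilon} := \sup_{\lambda \in \Lambda_\varepsilon} \|\Phi_k(\lambda)\|_F$. The following result can be proved:

THEOREM 2.1 ([37]). Let $R_k = \Phi_k(A) \circ R_0$ with $R_0, R_k \in \mathbb{R}^{n \times s}$, $A \in \mathbb{R}^{n \times n}$ and $\Phi_k \in \bar{\mathbb{P}}_{k,s}$. Then

(12) $\|R_k\| \le \frac{L_\varepsilon}{2\pi\varepsilon} \|\Phi_k\|_{\Lambda_\varepsilon} \|R_0\|$.

When $A$ is diagonalizable, by combining with (11) one can obtain a tighter bound [37]. From (11) and (12) it follows that

$\|R_k\| \le \frac{L_\varepsilon}{2\pi\varepsilon}\, s \sqrt{k+1}\, \|R_0\| \min_{\Phi_k \in \bar{\mathbb{P}}_{k,s}} \sup_{\lambda \in \Lambda_\varepsilon} \|\Phi_k(\lambda)\|$,

which differs by a factor of $s\sqrt{k+1}$ from the bound obtained using complete minimization; see [9] for a similar result for the nonblock case.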

3. A nonblock polynomial algorithm. In contrast to block methods, which in their course generate and share information among all right-hand sides, we study the viability of a scheme which selects one system at a time, extracts information from it, and uses that information to partially solve any remaining systems. Similar principles have guided the creation of solvers for real symmetric [27, 28, 29, 39] and nonsymmetric systems [36].

3.1. Algorithm description. We proceed as follows. Among the systems (1), a "seed" system $A x^{(\nu)} = b^{(\nu)}$, $\nu \in \{1, \ldots, s\}$, is selected. Then an approximation $x_k^{(\nu)}$ to the solution is computed by means of a base algorithm, which we choose to be the complex symmetric version of the minimum residual algorithm. We emphasize here that the complex symmetric version of the algorithm does not produce minimum residuals.

After $k$ iterations, the base algorithm produces $r_k^{(\nu)} = \varphi_k(A) r_0^{(\nu)}$, $\varphi_k \in \bar{\mathbb{P}}_k$. Meanwhile the non-seed solutions $x_k^{(i)}$, $1 \le i \le s$, and residuals $r_k^{(i)}$, $1 \le i \le s$, are updated as well during the iteration. Once the seed system has converged, a new seed system is selected and the process is repeated. The algorithm is shown below; its leading computational costs sum to $8ns + 8n$ per iteration. The total computational cost and comparisons with other methods are presented in Section 4.

Algorithm 3: MULTI.MR$(A, B, X_0, \varepsilon)$
  $E = \{1, \ldots, s\}$
  $r_0^{(i)} = b^{(i)} - A x_0^{(i)}$; $p_0^{(i)} = r_0^{(i)}$, $i \in E$
  $[r_0^{(\nu)}, \nu] = \max_{i \in E} (\|r_0^{(i)}\|)$
  for $k = 0, 1, \ldots$
    while ($\|r_k^{(\nu)}\| / \|r_0^{(\nu)}\| > \varepsilon$)
      $\alpha_k = (r_k^{(\nu)}, A p_k^{(\nu)}) / (A p_k^{(\nu)}, A p_k^{(\nu)})$
      $x_{k+1}^{(i)} = x_k^{(i)} + \alpha_k p_k^{(i)}$, $i \in E$
      $r_{k+1}^{(i)} = r_k^{(i)} - \alpha_k A p_k^{(i)}$, $i \in E$
      $\beta_k = -(A r_{k+1}^{(\nu)}, A p_k^{(\nu)}) / (A p_k^{(\nu)}, A p_k^{(\nu)})$
      $p_{k+1}^{(i)} = r_{k+1}^{(i)} + \beta_k p_k^{(i)}$, $i \in E$
      $I_k := \{ i : \|r_{k+1}^{(i)}\| / \|r_0^{(i)}\| \le \varepsilon,\ i \in E \}$
      $E = E \setminus I_k$
    end
    $\nu = \arg\max_{i \in E} (\|r_{k+1}^{(i)}\| / \|r_0^{(i)}\|)$
    if ($\|r_{k+1}^{(\nu)}\| / \|r_0^{(\nu)}\|) \le \varepsilon$ then stop
    $p_{k+1}^{(i)} = r_{k+1}^{(i)}$, $i \in E$
  end
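The seed strategy of Algorithm 3 can be sketched in a few lines of Python. This is an illustration under several assumptions rather than the authors' implementation: the function name `multi_mr` is hypothetical, the starting guess is fixed to zero, the seed is always picked by largest relative residual, the unconjugated product `u @ v` stands in for the inner products of the listing, and no breakdown handling or safeguard against growing non-seed residuals is included.

```python
import numpy as np

def multi_mr(A, B, tol=1e-8, max_iter=500):
    """Apply the seed system's residual polynomial to all unconverged systems."""
    A = np.asarray(A, dtype=complex)
    B = np.asarray(B, dtype=complex)
    X = np.zeros_like(B)                       # starting guess X0 = 0
    R = B.copy()                               # residuals for X0 = 0
    r0 = np.linalg.norm(R, axis=0)
    rel = lambda i: np.linalg.norm(R[:, i]) / r0[i]
    active = [i for i in range(B.shape[1]) if r0[i] > 0]
    iters = 0
    while active and iters < max_iter:
        seed = max(active, key=rel)            # largest relative residual
        P = R.copy()                           # restart: p = r for every system
        AP = A @ P
        while rel(seed) > tol and iters < max_iter:
            ap = AP[:, seed]
            denom = ap @ ap                    # bilinear form, no conjugation
            alpha = (R[:, seed] @ ap) / denom
            X[:, active] += alpha * P[:, active]
            R[:, active] -= alpha * AP[:, active]
            AR = A @ R                         # computed for all columns for brevity
            beta = -(AR[:, seed] @ ap) / denom
            P[:, active] = R[:, active] + beta * P[:, active]
            AP[:, active] = AR[:, active] + beta * AP[:, active]
            iters += 1
            # drop non-seed systems that converged along the way
            active = [i for i in active if i == seed or rel(i) > tol]
        active = [i for i in active if rel(i) > tol]
    return X, iters
```

Only the seed column enters the two scalar coefficients per step, so the number of inner products does not grow with $s$; every other operation is a vector update shared by all active systems.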

Set $E$ consists of indices of systems that have not yet converged. The seed system is chosen to be the one with largest residual norm; see [36] for some heuristics. The residual polynomial is applied implicitly at each iteration, thus preventing the introduction of instabilities from any explicit generation of polynomial coefficients. The method inherits the properties of the underlying single right-hand side solver. Incidentally, there is flexibility in the choice of the polynomial-building algorithm; for instance, one can use the polynomial generated by quasi-minimization applied on the complex symmetric version of CG or Lanczos algorithms. A computational advantage of MULTI.MR is that the number of inner products necessary to solve (1) can be significantly smaller than when applying the base algorithm on each right-hand side. In a parallel implementation, this also implies a reduction in the amount of synchronization and communication.

3.2. The behavior of the residual polynomial. We next use the properties of the residual polynomial generated by the base algorithm to motivate the design of MULTI.MR. We will be assuming throughout $A$ to be diagonalizable with definite Hermitian part, and will be using notation and results from [10]. For this subsection in particular we will be discussing the behavior of the residual polynomial for a single right-hand side, which permits us to omit the superscript right-hand side identifier. To illustrate the theory, we will be using the order 100 complex symmetric matrix

(13) $A = T + i \Sigma$.

Matrix $T$ corresponds to the scaled 2-D Poisson operator on $[0, 1] \times [0, 1]$, and matrix $\Sigma$ is diagonal, consisting of random elements. Matrix $A$ has eigenvalues in $(0, 8] \times [0, 1]$. All experiments in this section were done using Sparse Matlab v. 4.0 on a Sun Sparcstation.

The residual of the minimum residual algorithm can be written as $r_k = \varphi_k(A) r_0$, where $\varphi_k \in \bar{\mathbb{P}}_k$. When $A$ is Hermitian positive definite, $\varphi_k$ is a normalized kernel polynomial satisfying

(14) $\varphi_k(\lambda) = \arg\left( \min_{\varphi \in \bar{\mathbb{P}}_k} \|\varphi\|_\Lambda \right)$, with $\|\varphi\|_\Lambda := \sup_{\lambda \in \Lambda} |\varphi(\lambda)|$.

For $A$ complex symmetric, $\varphi_k$ is not a kernel polynomial and the method does not minimize the residual at each step. In exact arithmetic, the recurrence of the base algorithm generates a Krylov subspace $K_m = \mathrm{span}\{r_0, Ar_0, \ldots, A^{m-1} r_0\}$ with $m \le m_*$, where $m_*$ is the lowest integer for which $K_{m_*}$ becomes an invariant subspace of $A$, so that $A^{-1} b \in x_0 + K_{m_*}$. Since $A$ is diagonalizable, the starting residual can be written as $r_0 = \sum_{j=1}^{m_*} c_j v_j$, with $c_j \neq 0$, and $V_{m_*} = [v_1, \ldots, v_{m_*}]$ is the matrix of $m_*$ eigenvectors of $A$, one for each distinct eigenvalue, such that $V_{m_*}^T V_{m_*}$ is diagonal [10]. Let $\Lambda_{m_*} := \{\lambda_1, \ldots, \lambda_{m_*}\}$ with $\lambda_i \neq \lambda_j$. Since the subspace $K_{m_*}$ is spanned by $\{v_1, \ldots, v_{m_*}\}$, we have $r_{m_*} = \varphi_{m_*}(A) r_0 = 0$, and $\varphi_{m_*}$ is the minimal polynomial for $r_0$. Consider the discrete functional analog of the bilinear form $\langle x, y \rangle$, that is

$\langle f, g \rangle := \frac{1}{m_*} \sum_{i=1}^{m_*} f(\lambda_i) g(\lambda_i) \varpi(\lambda_i)$,

with $\varpi(\lambda)$ some weight function. The $\varphi_k$'s are orthogonal with respect to $\varpi(\lambda) = \lambda\, \omega(\lambda)$, with $\omega(\lambda)$ the Jacobi weight, that is, for $k \neq j$,

$\langle \varphi_k, \varphi_j \rangle = \frac{1}{m_*} \sum_{i=1}^{m_*} \varphi_k(\lambda_i) \varphi_j(\lambda_i) \varpi(\lambda_i) = 0$.

Thus, the residual polynomial satisfies some properties of kernel polynomials, in particular orthogonality with respect to $\langle f, g \rangle$ and a three-term recurrence. Our point is that in practice $\varphi_k$ demonstrates a behavior that can be exploited in order to build iterative methods for complex symmetric systems. To see this we first plot in Figure 1 the zeros of $\varphi_k$ for matrix (13), starting from a vector of real random elements, together with the zeros of the kernel minimal residual polynomial of GMRES. We see that the zero patterns are very similar in the vicinity of the eigenvalues of $A$ with smallest and largest real parts. Then in Figure 2(a) we plot $|\varphi_k(\lambda)|$ as $\lambda$ takes values in the domain $(0, 8] \times [0, 1]$ that contains the spectrum of $A$, and observe that $|\varphi_k(\lambda)|$ is small on $\Lambda_{m_*}$ even though $\varphi_k$ does not satisfy (14).

FIG. 1. Zeros of the GMRES kernel polynomial of degree $k = 19$ ('o') and zeros of the residual polynomial for the complex symmetric version of the minimum residual algorithm of degree $k = 19$ ('+') for $A$ in (13) and random starting vector.

FIG. 2. Plot of $|\varphi_k(\lambda)|$ for $\lambda \in (0, 8] \times [0, 1]$; (a) $k = 21$ ($\|r_{21}\| / \|r_0\| < 10^{-5}$), $r_0 = \mathrm{RAND}(n, 1)$. (b) $k = 5$ ($\|r_5\| / \|r_0\| < 10^{-14}$), $r_0 = \sum_{j=1}^{5} v_{i_j}$.

We next show that, in spite of the fact that condition (14) does not hold for the complex symmetric polynomial method, $|\varphi_k(\lambda)|$ on $\Lambda_{m_*}$ is bounded above by some factor times $\|r_k\|$. Furthermore, this factor depends only on the eigencomponents of the starting residual. We write the "economy form" of the singular value decomposition of $V_{m_*}$ as $V_{m_*} = \tilde{U} \Sigma_+ \tilde{V}$, with $\tilde{U} \in \mathbb{C}^{n \times m_*}$, $\tilde{V} \in \mathbb{C}^{m_* \times m_*}$, $\tilde{U}^H \tilde{U} = I_{m_*}$ and $\tilde{V}$ unitary. $\Sigma_+$ is a diagonal matrix of order $m_*$ and contains the $m_*$ singular values of $V_{m_*}$. It follows that

(15) $V_{m_*} = M U$, $M = \tilde{U} \Sigma_+ \tilde{U}^H$, $U = \tilde{U} \tilde{V}$.

$U = [u_1, \ldots, u_{m_*}]$ has orthonormal columns, (15) being a generalized polar decomposition. Hence

$r_k = \sum_{j=1}^{m_*} c_j \varphi_k(\lambda_j) v_j = \tilde{U} \Sigma_+ \tilde{U}^H \sum_{j=1}^{m_*} c_j \varphi_k(\lambda_j) u_j = \tilde{U} \Sigma_+ \tilde{U}^H U z$,

with $z \in \mathbb{C}^{m_*}$ the column vector with elements $c_j \varphi_k(\lambda_j)$ ($1 \le j \le m_*$). Hence

$\|r_k\| = \|\tilde{U} \Sigma_+ \tilde{U}^H U z\| \ge \sigma_{\min}(V_{m_*}) \|Uz\| = \sigma_{\min}(V_{m_*}) \|z\|$.
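The generalized polar decomposition (15) and the resulting lower bound are easy to check numerically. The snippet below is a small sketch with a random matrix standing in for the eigenvector matrix; the sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3                                    # illustrative sizes
V = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

# Economy SVD: V = U_t diag(sig) Vh, with Vh unitary.
U_t, sig, Vh = np.linalg.svd(V, full_matrices=False)
M = U_t @ np.diag(sig) @ U_t.conj().T          # Hermitian positive semidefinite
U = U_t @ Vh                                   # orthonormal columns

# Lower bound used in the text: ||V z|| >= sigma_min(V) ||z||.
z = rng.standard_normal(m) + 1j * rng.standard_normal(m)
lhs = np.linalg.norm(V @ z)
rhs = sig[-1] * np.linalg.norm(z)
```

Here `M @ U` reproduces `V`, and `lhs >= rhs` holds because the smallest singular value bounds the contraction of any vector under `V`.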

We have thus proved the following result.

THEOREM 3.1. Let $A$ be diagonalizable with eigenvalues $\Lambda_{m_*} = \{\lambda_1, \ldots, \lambda_{m_*}\}$ and also $r_0 = \sum_{j=1}^{m_*} c_j v_j$ with none of the vectors $v_j$ ($1 \le j \le m_*$) quasi-null. For functions $f, g$ define $(f, g)_\omega := \frac{1}{m_*} \sum_{j=1}^{m_*} f(\lambda_j) \overline{g(\lambda_j)}\, \omega(\lambda_j)$, with $\omega(\lambda_j) = |c_j|^2$, and $\|f\|_\omega = (f, f)_\omega^{1/2}$. Then

$\|\varphi_k\|_\omega \le m_*^{-1/2} (\sigma_{\min}(V_{m_*}))^{-1} \|r_k\|$.

The inner product $(f, g)_\omega$ introduced in Theorem 3.1 defines a norm. Furthermore $\sigma_{m_*}(M) = \sigma_{\min}(V_{m_*})$, so the above theorem could be stated in terms of the $m_*$th singular value of $M$. This will be useful in the next theorem. Roughly speaking, the degree of success of Algorithm 3 depends on the amount by which the polynomials $\varphi_k$ reduce the norm of residuals $r_k^{(i)}$ other than the seed. This issue is discussed next.

3.3. Using a single residual polynomial on multiple right-hand sides. Let $A$ be diagonalizable and let $m_*^{(i)}$ be the dimension of the smallest invariant subspace of $A$ generated by $r_0^{(i)}$. Define $\Lambda^{(i)} := \Lambda_{m_*^{(i)}}$ to be the part of the spectrum of $A$ whose eigenvectors are the components of $r_0^{(i)}$ and let $V_{m_*^{(i)}}$ contain the corresponding eigenvectors. After $k$ iterations of MULTI.MR on $r_0^{(i)} = \sum_{j=1}^{m_*^{(i)}} c_j^{(i)} v_j$, $i = 1, \ldots, s$, we have $r_k^{(i)} = \varphi_k(A) r_0^{(i)}$, that is

$r_k^{(i)} = \sum_{j=1}^{m_*^{(i)}} c_j^{(i)} \varphi_k(\lambda_j) v_j$.

From Theorem 3.1 it follows that $|\varphi_k(\lambda_j)|$ can be small for $\lambda_j \in \Lambda^{(i)} \cap \Lambda^{(\nu)}$. Thus, whenever $\Lambda^{(i)} \cap \Lambda^{(\nu)} \neq \emptyset$, that is, whenever $r_0^{(i)}$ has projection on the spectral components of $r_0^{(\nu)}$, the application of $\varphi_k$ on non-seed systems can reduce $\|r_k^{(i)}\|$. Figure 2(b) shows $|\varphi_k(\lambda)|$ for matrix $A$ in (13) using $b$ with eigencomponents corresponding to some of the largest and smallest eigenvalues of $A$. In accordance with Theorem 3.1, $|\varphi_k(\lambda)|$ is small in the neighborhood of eigenvalues with eigenvectors participating in the initial residuals. Furthermore $|\varphi_k(\lambda)|$ is also small in the vicinity of each $\lambda_i \in \Lambda^{(\nu)}$. Thus, even for $\lambda_j \notin \Lambda^{(i)} \cap \Lambda^{(\nu)}$, $|\varphi_k(\lambda_j)|$ can decrease with $\|r_k^{(\nu)}\|$, as long as the distance $\mathrm{dist}(\lambda_j, \Lambda^{(\nu)})$ remains small. Using the above notation, we derive the following theorem.

THEOREM 3.2. Let $A$ be diagonalizable with $\Lambda^{(i)}$ the set of eigenvalues of $A$ corresponding to the eigencomponents of $r_0^{(i)}$. For each $\lambda_l \in \Lambda^{(i)}$, let $d_{j_l} := |\lambda_l - \lambda_{j_l}| = \min_{\lambda_j \in \Lambda^{(\nu)}} |\lambda_l - \lambda_j|$. Define $D(d_{j_l})$ such that $\varphi_k(\lambda_l) = \varphi_k(\lambda_{j_l}) + D(d_{j_l})$ with $r_k^{(\nu)} = \varphi_k(A) r_0^{(\nu)}$. Let also $M_i$ be Hermitian positive semidefinite satisfying $V_{m_*^{(i)}} = M_i U_{m_*^{(i)}}$ with $U_{m_*^{(i)}}$ having orthonormal columns. Then, for each $i$, $1 \le i \le s$, there exists a constant $L_{i,\nu}$ such that

$\|r_k^{(i)}\| \le \left[ L_{i,\nu}\, m_*^{(i)} \|r_k^{(\nu)}\|^2 + \sigma_1^2(M_i) \sum_{\lambda_l \in \Lambda^{(i)} \setminus \Lambda^{(\nu)}} |c_l^{(i)}|^2 |D(d_{j_l})|^2 \right]^{1/2}$, $\lim_{\lambda_l \to \lambda_{j_l}} |D(d_{j_l})| = 0$.

( ) ( ) = 0 and the residual rk(i) can be written as X (i) X (i) X (i) cl D(djl )vl: cl k (jl )vl + cl k (l)vl = rk(i) =

Proof. From the definition of D djl , D djj

l 2(i) n()

l 2(i)

l2(i)

The columns of Um are orthonormal and kMi k2

= 1 (Mi), hence

2 X (i) 2 krk(i)k  1 (Mi) 4 jcl j jk (jl )j2 + l 2(i)

Moreover,

X l2(i)

X l 2(i) n()

3 jc(l i)j2jD(djl )j25 :

0 1 (i)j2 j c l A X jc()j2jk (j )j2 jc(l i)j2jk (jl )j2  @max l jl ( l;jl jcj)j2 l 2 i l ( )

11

1 2

TABLE 1 Leading complex floating-point operations for some basic linear algebra operations

operation additions multiplications

Y ; T Y T Y ?1 ?1  Y + Y Y  ns2 s3 ns2 4=3s3 s3 =2 ns 2 3 2 3 3 2 ns s ns 4=3s s =2 s ns

diag(kyi k)

ns ns

0 1 ( i ) 2 jcl j A m(i) X jc()j2j ( )j2:  @max  k j j l;jl jc()j2 jl j 2  ( )



where the summation is now performed over all eigenvalues in (). From Theorem 3.1,

X

j 2 

( )

Letting Li;

=

jc(j)j2jk (j )j2  (m (M ))?2 krk()k2 :

jc(i) j2 maxl;jl (l) 2 jcjl j

( )

!

1 (Mi ) m() (M )

!2

the result follows.  From the discussion after Theorem 3.1 it follows that Li; depends on the ratio of the largest singular value of the matrix Vm(i) of eigenvectors belonging to the non-seed residual  over the smallest singular value of the matrix of seed eigenvectors Vm() . Theorem 3.2 shows  that the performance of MULTI.MR is affected by both the dimension of (i) \ (), i 6=  and the clustering of the spectrum of A. For l not in the neighborhood of eigenvalues participating in the seed residual, D(djl ) can be large. In those cases, the application of the polynomial will lead to a deterioration of the residual. This problem can be handled by monitoring and stopping the application of the current polynomial on those residuals that become very large. This point will be discussed further in Section 5. 4. Cost analysis. The aim of this section is to evaluate the cost per iteration of all methods presented so far. Let Y 2 Cns ; ; ; ;  2 Css , with  triangular and  diagonal matrices. The basic costs for kernel matrix operations in terms of complex floating-point operations are listed in Table 1. Table 2 shows the leading computational costs per iteration per method, excluding the cost for calculating the residual norms and for matrix-vector products with matrix A. For comparison purposes we have also included the costs for the single right-hand side method TFQMR and the complex symmetric MBCG. Since the quasi-minimization phase of block methods requires operations with full matrices of order s, both BQMRLAN and BQMRLAN2 have large coefficients for terms ns2 and s3 . The coefficient for the ns term in BQMRCG is due to computations with diagonal matrices. Under favorable circumstances, block methods require fewer iterations as s increases, but at greater cost per iteration. In particular, the cost increase is polynomial in s; thus a large reduction in the number of iterations will be needed for them to be competitive. 
The density of matrix A provides an indication for the expected performance of block methods in comparison to other nonblock solvers. Let "nnz" denote the total number of nonzero elements in A. Then nnz/n is the average number of nonzero elements per row. The cost of forming $AX$ with $X \in \mathbb{C}^{n \times s}$ is approximately $2s\,\mathrm{nnz}$ complex operations. (To keep a uniform treatment we lump complex additions and multiplications together; however, the cost of the former is smaller than the cost of the latter.)

TABLE 2. Memory requirements and major computational cost. All methods require s matrix-vector products with matrix A per iteration.

method

matrices

ns

matrices

ss

3 2 BQMRLAN 4 BQMRLAN2 3 MULTI.MR 2 QMRs 4 y see Appendix A.2.

4 3 8 7

BQMRCG MBCG

matrices diag(s) 5

s matrix-vector products with

rotation coefs.y

vectors

s; s2

n

2

1 1

4 2

7

2

1

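Computational cost (leading terms per iteration):

| method | computational cost |
|---|---|
| BQMRCG | $12ns^2 + 9ns + \frac{25}{3}s^3$ |
| MBCG | $14ns^2 + 3ns + 6s^3$ |
| BQMRLAN | $16ns^2 + 7ns + 14s^3$ |
| BQMRLAN2 | $14ns^2 + 6ns + \frac{26}{3}s^3$ |
| MULTI.MR | $8ns + 8n$ |
| QMR$_s$ | $18ns$ |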

From Table 2 we see that the leading cost of block methods is a multiple of $ns^2$, whereas the costs for the nonblock method and the single right-hand side solver are multiples of $ns$. Then if we denote by $i_B$ and $i_S$ the number of iterations for a block and nonblock method to converge and compare BQMRCG with MULTI.MR, we see that their costs will be similar when $(12ns^2 + 2s\,\mathrm{nnz})\, i_B \approx (8ns + 2s\,\mathrm{nnz})\, i_S$. This means that for the block method to be successful, at first approximation we should have

$$ i_B \le \frac{8ns + 2s\,\mathrm{nnz}}{12ns^2 + 2s\,\mathrm{nnz}}\; i_S = \left( 1 - \frac{6s - 4}{6s + \mathrm{nnz}/n} \right) i_S. \tag{16} $$
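As a quick numerical illustration of (16), the following Python sketch evaluates the bound on $i_B / i_S$ in both of its forms. The function names and the sample values of s are ours and purely illustrative; the matrix orders and nonzero counts are those listed in Table 3, and only the leading cost terms of BQMRCG and MULTI.MR are used, as in the text.

```python
def block_iteration_ratio(s: int, n: int, nnz: int) -> float:
    """Bound on i_B / i_S from (16): (8ns + 2s*nnz) / (12ns^2 + 2s*nnz)."""
    block_cost = 12 * n * s**2 + 2 * s * nnz   # BQMRCG, leading terms per iteration
    nonblock_cost = 8 * n * s + 2 * s * nnz    # MULTI.MR, leading terms per iteration
    return nonblock_cost / block_cost


def block_iteration_ratio_density(s: int, density: float) -> float:
    """Equivalent closed form 1 - (6s - 4) / (6s + nnz/n)."""
    return 1.0 - (6 * s - 4) / (6 * s + density)


if __name__ == "__main__":
    # (name, n, nnz) as listed in Table 3
    matrices = [("RWG", 635, 7881), ("RXS", 10180, 154102), ("ELQ", 1197, 61373)]
    for name, n, nnz in matrices:
        for s in (4, 12):
            ratio = block_iteration_ratio(s, n, nnz)
            print(f"{name:4s} s={s:2d}  i_B/i_S <= {ratio:.3f}")
```

The denser the matrix, the closer the ratio is to 1, so a block method on a dense-row matrix such as ELQ can afford proportionally more iterations than on RWG before losing to the nonblock method.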

It follows that for nnz/n small and s large, the block method achieves better total complexity only for very small $i_B$; on the other hand, a large density per row allows the block method a number of iterations which could be sufficient to attain convergence at lower total cost than the nonblock method.

Table 2 shows the leading memory requirements, excluding storage for common matrices A, X and B. Depending on implementation, additional work vectors may be needed. We refer to [1, 10] for the meaning of the variables of MBCG and QMR respectively. The column "rotation coefs." shows the number of vectors used to store the Givens rotation coefficients; see Appendix A.2 for details.

5. Experiments. We present experiments to illustrate the performance of all methods described earlier, including MBCG. We also report the performance of QMR applied to each right-hand side, for the sake of comparison with a state-of-the-art solver for complex symmetric systems. The matrices in Examples 1, 2, 4 originate in electromagnetic scattering and computational chemistry. In both cases, the boundary conditions are selected to lead to a complex symmetric matrix. The matrix characteristics are shown in Table 3. As right-hand sides we used random values $B_1 = \mathrm{RAND}(n, s)$ and sections of the identity matrix $B_2 = I_{n,s}$ for several values of s. Random values were computed using the Alliant intrinsic function drand. Observe that both $B_1$ and $B_2$ are real. In all cases the starting guess was chosen to be $X_0 = 0_{n,s}$.

5.1. Implementation. All experiments were conducted on an Alliant FX/2800 shared memory multiprocessor running Concentrix 3.0, using at most 12 processors. Each processor has a peak rate of 40 Mflops for real double precision arithmetic. Processors are based on the RISC architecture and offer extensive pipelining, which is referred to as vectorization. The system command execute -cp was used to force the system to use a specific number (p) of processors. Computations were performed in double precision using complex arithmetic. Matrix A is stored in compressed row format. All algorithms require multiplications of dense $n \times s$ columns by A. This includes QMR (the single right-hand side solver of choice), which is applied concurrently to s right-hand sides, and hence uses the same matrix by matrix multiplication computational kernel as the remaining algorithms. All kernels listed in Table 1 involve operations with dense blocks of order s. Parallelization was achieved by distributing the s columns of the right-hand operand. Whenever possible we used the Alliant scientific library intrinsic functions matmul, dotproduct and transpose.

TABLE 3. Characteristics of matrices used in the experiments.

| Example | matrix name | matrix order | nnz: # nonzero elements | avg. density per row | origin |
|---|---|---|---|---|---|
| 1 | RWG | 635 | 7,881 | 12.4 | em-scattering |
| 2 | RXS | 10,180 | 154,102 | 15.1 | em-scattering |
| 3 | YOUNG1C | 841 | 4,089 | 4.8 | Harwell-Boeing set |
| 4 | ELQ | 1,197 | 61,373 | 51.2 | magnetic resonance |
| 5 | diagonal matrix | 1,000 | 1,000 | 1 | artificial |

The presence of multiple right-hand sides improves the data locality of all methods. In particular, computing $AX$, where $A \in \mathbb{C}^{n \times n}$ and $X \in \mathbb{C}^{n \times s}$, requires a total of $s(\mathrm{nnz} - n)$ complex additions and $s\,\mathrm{nnz}$ complex multiplications on a total of $\mathrm{nnz} + ns$ elements. Counting one complex addition as two real additions and one complex multiplication as four real multiplications and two real additions, there is an average of

$$ \mathrm{loc}(s) = \frac{s(8\,\mathrm{nnz} - 2n)}{2\,\mathrm{nnz} + sn} $$

real floating-point operations per data element. It follows that $\mathrm{loc}(s)$ increases with s.

The convergence criterion used for all methods was

$$ \max_{i=1,\ldots,s} \frac{\| b^{(i)} - A x_k^{(i)} \|}{\| b^{(i)} - A x_0^{(i)} \|} \le 10^{-7}. \tag{17} $$
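The stopping test (17) can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the norm is taken to be the Euclidean norm, dense arrays stand in for the compressed row storage used in the experiments, and the function names are hypothetical.

```python
import numpy as np


def relative_residuals(A, X, B, X0):
    """Column-wise ratios ||b_i - A x_i|| / ||b_i - A x0_i|| (Euclidean norms)."""
    R = B - A @ X      # current residuals, one column per right-hand side
    R0 = B - A @ X0    # starting residuals
    return np.linalg.norm(R, axis=0) / np.linalg.norm(R0, axis=0)


def converged_mask(A, X, B, X0, tol=1e-7):
    """Boolean mask of the right-hand sides that already satisfy the tolerance."""
    return relative_residuals(A, X, B, X0) <= tol


def all_converged(A, X, B, X0, tol=1e-7):
    """Criterion (17): the worst relative residual over all columns is at most tol."""
    return bool(np.max(relative_residuals(A, X, B, X0)) <= tol)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, s = 6, 3
    M = rng.random((n, n)) + 1j * rng.random((n, n))
    A = M + M.T + 20 * np.eye(n)          # complex symmetric (A^T = A), not Hermitian
    B = rng.random((n, s))                # real right-hand sides, as in the experiments
    X0 = np.zeros((n, s), dtype=complex)  # zero starting guess
    X = np.linalg.solve(A, B)
    print(all_converged(A, X, B, X0), all_converged(A, X0, B, X0))
```

The per-column mask is what allows a solver to stop iterating on individual right-hand sides once they have converged.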

Occasionally the number of iterations for a method is suffixed by an asterisk ("*"); this means that even though the method has converged, very ill-conditioned systems of order s had to be solved (condition number estimate larger than $10^{13}$). All methods stopped iterating for those right-hand sides for which convergence was detected. For each method we obtained the number of iterations required and the total elapsed time, in seconds, as measured by the intrinsic function dtime; however, we report average times per right-hand side in order to highlight the performance of each method as the number of right-hand sides increases.

As discussed after Theorem 3.2, a residual monitoring procedure is essential for MULTI.MR to succeed. Our implementation stops updating systems for which the magnitude of the residual norm becomes larger than $1 \times 10^{7}$, until a new seed system is generated. For stability reasons, the iteration coefficients in algorithm BQMRLAN were computed explicitly, instead of carrying over the values from the orthogonalization of the previous iteration.
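One possible shape for such a monitoring step is sketched below. It is not the authors' implementation: the helper names are hypothetical, the threshold is applied to the absolute Euclidean norm of each candidate residual, and a rejected update is assumed to leave the previous residual in place until the next seed is chosen.

```python
import numpy as np


def guarded_update(R, R_candidate, frozen, threshold=1e7):
    """Accept candidate residual columns unless their norm exceeds `threshold`.

    R, R_candidate : (n, s) arrays of current and proposed non-seed residuals.
    frozen         : boolean array of length s; True means updates are suspended.
    Returns the new residual block and the updated `frozen` mask.
    """
    cand_norms = np.linalg.norm(R_candidate, axis=0)
    frozen = frozen | (cand_norms > threshold)
    # Keep the old column where the system is frozen, take the candidate elsewhere.
    R_new = np.where(frozen[np.newaxis, :], R, R_candidate)
    return R_new, frozen


def new_seed(frozen):
    """Reactivate every system when a new seed system is generated."""
    return np.zeros_like(frozen)
```

Freezing a column costs nothing beyond a norm that is computed anyway for the convergence test, which is why the check can be applied at every inner step.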

TABLE 4. Iterations to convergence for matrix RWG; $n = 635$; $B_1 = \mathrm{RAND}(n, s)$, $B_2 = I_{n,s}$. Block-method entries follow the column order shown; methods that did not converge have no entry. * ill-conditioned.

| s | $B_1$: BQMRCG, MBCG, BQMRLAN, BQMRLAN2 | $B_1$: MULTI.MR | $B_1$: QMR | $B_2$: BQMRCG, MBCG, BQMRLAN, BQMRLAN2 | $B_2$: MULTI.MR | $B_2$: QMR |
|---|---|---|---|---|---|---|
| 1 | 493, 498, 492, 497 | 496 | 499 | 464, 468, 465, 465 | 462 | 467 |
| 4 | 158, 147, 148, 147 | 508 | 501 | 162, 145, 147, 146 | 476 | 472 |
| 8 | 92, 86, 82, 83 | 508 | 506 | 86, 83, 83 | 493 | 478 |
| 12 | 62, 59, 57 | 502 | 509 | 62, 60, 60 | 493 | 482 |
| 16 | 47, 44, 43, 44 | 501 | 509 | 48, 45, 46 | 493 | 488 |
| 20 | 51*, 35, 35, 35 | 501 | 512 | 40, 38, 38 | 494 | 488 |
| 24 | 30*, 29, 28, 29 | 501 | 512 | 30, 29, 29 | 923 | 488 |
| 28 | 32*, 24, 24, 24 | 501 | 512 | 25, 24, 24 | 938 | 488 |
| 32 | 22*, 21, 21, 21 | 510 | 513 | 21, 21, 21 | 964 | 488 |
| 36 | 21*, 19, 18, 19 | 510 | 513 | 19, 18, 19 | 964 | 488 |
| 40 | 18*, 17, 17, 17 | 510 | 513 | 17, 17, 17 | 965 | 499 |

5.2. Test problems. The first two problems arise in the solution of the three-dimensional Maxwell equations using finite elements in a square plate shaped scatterer in the x-y plane of side 1 m, centered at the origin.

Example 1. The matrix name is RWG and was created using 697 tetrahedral edge-elements, for a total of 1069 edges, and 434 edges on a perfect electric conductor (PEC) [22]. The number of right-hand sides ranged from 1 to 40. Results are reported in Table 4 and Figures 3(a) and 3(b). Most methods converge for both types of right-hand sides; however, BQMRCG failed for $B_2$ and $s > 6$. This is due to ill-conditioning of the order s matrices involved in the computation; see Section 2.3. This can also be seen from results for $B_1$ and $s \ge 20$. The results for MBCG imply that it is worthwhile to orthogonalize the residual matrices. Furthermore, they show that the loss of rank involved in BQMRCG is due to roundoff and would not have occurred in exact arithmetic; see Section 2.3. Round-off error also affects the convergence of BQMRLAN. We also note that the number of iterations for MULTI.MR increases sharply for $s > 20$. These were cases where the monitoring procedure detected deterioration of some residual norms and frequently had to stop the application of the seed polynomial on these residuals. From the figures we see that the performance of MULTI.MR stabilizes for large values of s. This is expected, since a fixed number of iterations is needed for the seed system to converge. Overall, MULTI.MR has the best performance and all methods perform better than QMR, implying that information sharing is very effective in this problem.

Example 2. The next matrix, named RXS, is obtained using 7,916 tetrahedral edge-elements having 10,268 edges in total, out of which 88 are PEC edges [22]. Its diagonal elements have positive real part; exceptionally for this example we used this fact to precondition A from the left and right by a diagonal matrix with elements equal to the positive square roots of the diagonal elements of the real part of A. Table 5 shows the number of iterations for $s \le 10$. The numerical effectiveness of block methods is demonstrated by the reduction in the number of iterations as s increases. Unfortunately, Figure 4 shows that their overall performance suffers; they are also slower than QMR. Note that the high cost of MBCG is due to the double orthogonalization required at each iteration. The fastest method seems to

FIG. 3. Time per right-hand side, in seconds, using matrix RWG. (a) $B = \mathrm{RAND}(n, s)$, (b) $B = I_{n,s}$. Symbols: '.-.' BQMRCG, '–' MBCG, 'o' BQMRLAN, '- -' BQMRLAN2, '...' MULTI.MR, '..x..' QMR. [Two plots; horizontal axis: number of right-hand sides; vertical axis: seconds per right-hand side.]

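TABLE 5. Iterations to convergence for matrix RXS; $n = 10180$; $B_1 = \mathrm{RAND}(n, s)$, $B_2 = I_{n,s}$. Block-method entries follow the column order shown; methods that did not converge have no entry.

| s | $B_1$: BQMRCG, MBCG, BQMRLAN, BQMRLAN2 | $B_1$: MULTI.MR | $B_1$: QMR | $B_2$: BQMRCG, MBCG, BQMRLAN, BQMRLAN2 | $B_2$: MULTI.MR | $B_2$: QMR |
|---|---|---|---|---|---|---|
| 1 | 1188, 1085, 1190, 1049 | 1129 | 1052 | 1083, 1085, 1086, 938 | 1035 | 940 |
| 2 | 1042, 1039, 1039, 959 | 1187 | 1063 | 986, 997, 978, 905 | 1075 | 1014 |
| 4 | 945, 959, 918, 848 | 1180 | 1090 | 901, 884, 857, 769 | 1067 | 1014 |
| 6 | 869, 881, 850, 782 | 1149 | 1123 | 851, 849, 814, 742 | 1114 | 1014 |
| 8 | 833, 832, 722 | 1149 | 1123 | 790, 797, 739, 688 | 1114 | 1014 |
| 10 | 763, 768, 687 | 1146 | 1123 | 757, 753, 715, 653 | 1115 | 1014 |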

be MULTI.MR, indicating that it achieves the best trade-off between information sharing and computational work. The inferior performance of the block methods can be explained by means of Table 2 and the discussion leading to formula (16). In particular, for a matrix of this size and density, the reduction in the number of iterations achieved by the block methods is not sufficient for them to equal the performance of MULTI.MR. This experiment supports our finding that for very large matrices, block methods cannot be effective unless the computation of $AX$ (the predominant kernel in nonblock methods) becomes very expensive compared to inner products and other computations with dense n by s blocks, as would be the case, for example, when A is dense [19, 21].

Example 3. The matrix in this example is YOUNG1C and was extracted from the class of complex symmetric matrices of the Harwell-Boeing collection [7]. All elements with nonzero imaginary part lie on the main diagonal. As before, MULTI.MR has the best performance. Table 6 shows that block methods BQMRCG and BQMRLAN are sensitive to the choice of right-hand sides, and fail in several

FIG. 4. Time per right-hand side, in seconds, with matrix RXS. (a) $B = \mathrm{RAND}(n, s)$, (b) $B = I_{n,s}$. Symbols: '.-.' BQMRCG, '–' MBCG, 'o' BQMRLAN, '- -' BQMRLAN2, '...' MULTI.MR, '..x..' QMR. [Two plots; horizontal axis: number of right-hand sides; vertical axis: seconds per right-hand side.]

TABLE 6. Iterations to convergence for matrix YOUNG1C; $n = 841$; $B_1 = \mathrm{RAND}(n, s)$, $B_2 = I_{n,s}$. Block-method entries follow the column order shown; methods that did not converge have no entry. * ill-conditioned.

| s | $B_1$: BQMRCG, MBCG, BQMRLAN, BQMRLAN2 | $B_1$: MULTI.MR | $B_1$: QMR | $B_2$: BQMRCG, MBCG, BQMRLAN, BQMRLAN2 | $B_2$: MULTI.MR | $B_2$: QMR |
|---|---|---|---|---|---|---|
| 1 | 458, 464, 464, 458 | 451 | 461 | 396, 400, 402, 397 | 388 | 397 |
| 4 | 178, 168, 168 | 456 | 492 | 169, 170, 166 | 400 | 406 |
| 8 | 98, 99, 93 | 456 | 496 | 106, 98, 99 | 417 | 429 |
| 12 | 70, 69, 69 | 456 | 496 | 89, 84, 85 | 456 | 429 |
| 16 | 55, 55, 53 | 456 | 498 | 89, 75, 80 | 440 | 429 |
| 20 | 45, 44, 44 | 456 | 498 | 74, 63, 66 | 452 | 429 |
| 24 | 37, 37, 37 | 469 | 498 | 70, 60, 59 | 449 | 429 |
| 28 | 33, 32, 32 | 472 | 498 | 65, 55, 55 | 451 | 429 |
| 32 | 28*, 27, 27, 27 | 472 | 498 | - | 451 | 429 |
| 36 | 24*, 24, 24 | 470 | 498 | - | 451 | 429 |
| 40 | 22*, 22, 22 | 470 | 501 | - | 451 | 429 |

cases. More reliable results are obtained from BQMRLAN2 and MBCG; even so, for $B_2$ and $s \ge 32$ all block methods fail. On the other hand, whenever block methods converge, their performance improves with s and approaches that of MULTI.MR. We note that the failure of BQMRLAN was apparently due to loss of orthogonality. Comparing with the performance of QMR we conclude that once again, some form of information sharing is effective.

Example 4. The matrix in this example is called ELQ and originates from the discretization of the stochastic Liouville equation associated with slow-motional magnetic resonance lineshape simulations. The discretization is performed using a direct product basis set, consisting of a complete set of orthogonal idempotent spin operators and a set of orthogonal functions [34]. In order to measure the parallelism of the codes, a different type of experiment was performed. We fixed the size of the problem, holding the number of right-hand sides to be equal to the maximum number of processors used, so $s = 12$, and let the number of processors vary from 1 to 12. For each method, Figure 6(a) shows the time per right-hand side as p varies. Figure 6(b) shows the corresponding speedup $T(1)/T(p)$, where $T(p)$ is the

FIG. 5. Time per right-hand side, in seconds, with matrix YOUNG1C. (a) $B = \mathrm{RAND}(n, s)$. (b) $B = I_{n,s}$. Symbols: '.-.' BQMRCG, '–' MBCG, 'o' BQMRLAN, '- -' BQMRLAN2, '...' MULTI.MR, '..x..' QMR. [Two plots; horizontal axis: number of right-hand sides; vertical axis: seconds per right-hand side.]

FIG. 6. Parallel performance of methods for matrix ELQ, $B = \mathrm{RAND}(n, 12)$. Symbols: '.-.' BQMRCG, '–' MBCG, 'o' BQMRLAN, '- -' BQMRLAN2, '...' MULTI.MR, '..x..' QMR. (a) Time in seconds. (b) Speedup $T(1)/T(p)$. [Two plots; horizontal axis: number of processors; vertical axes: (a) seconds per right-hand side, (b) speed-up.]

time per right-hand side when p processors are used. Overall, block methods demonstrate similar performance with a maximum speed-up of 4. Methods MULTI.MR and QMR seem to profit more from additional processors, reaching speed-ups of 7.5 and 6.8 respectively. Table 3 shows that ELQ has the highest density per row of all matrices in our set. According to our previous findings, this favors block methods. Indeed, Figure 6(a) shows that when $p = 1$ the block methods are much faster than MULTI.MR.

Example 5. The performance of block methods is strongly affected by the content of the starting residuals (B for our case of zero starting vectors) in eigenvectors of A [17, 26, 37]. In Theorem 3.2 we have shown that MULTI.MR too is sensitive to the eigenvector components found in B. This experiment is designed to illustrate these points. We use the diagonal matrix $D = \mathrm{diag}(\mathrm{RAND}(n, 1) + i\,\mathrm{RAND}(n, 1))$, with $n = 1000$. Clearly all eigenvalues lie in the right half-plane and a set of eigenvectors is $[e_1, \ldots, e_n]$. In order to achieve a better approximation of the invariant subspaces spanned by the right-hand sides, we tightened the convergence tolerance in (17) to $10^{-12}$. Four sets of right-hand sides are used. These are labeled $B_a$, $B_b$, $B_c$ and $B_d$; each set consists of $s = 4$ right-hand sides. These sets were

TABLE 7. Number of iterations and, in parentheses, time per right-hand side for the diagonal matrix of Example 5. $n = 1000$, $s = 4$.

| method | $B_a$ | $B_b$ | $B_c$ | $B_d$ |
|---|---|---|---|---|
| BQMRCG | 48 (2.92) | 75 (4.53) | 83 (4.96) | 95 (5.69) |
| MBCG | 48 (3.24) | 75 (5.03) | 83 (5.57) | 95 (6.36) |
| BQMRLAN | 48 (3.63) | 75 (5.65) | 83 (6.23) | 96 (7.20) |
| BQMRLAN2 | 48 (2.76) | 74 (4.24) | 83 (4.74) | 95 (5.45) |
| MULTI.MR | 90† (0.67) | 124 (0.96) | 147 (1.12) | 169 (1.25) |

† all systems converge when seed system is solved.

generated as follows:

Ba : b1 = Bb : b1 = b4 = Bc : b1 = Bd : b1 =

200 X

i=1

200 X

i=1

i e i ; b 2 = i e i ; b 2 =

150 X

i=101 200 X

i=1

200 X

i=1

iei +

350 X

200 X

i=1

300 X

i=100

i=301

i e i ; b 2 = i e i ; b 2 =

i e i ; b 3 =

i=101 400 X

i=201

i=1

iei ; b3 =

iei + 300 X

200 X

600 X

i=501

i e i ; b 4 =

50 X

i=1

iei +

250 X

200 X

i=1

i=201

i e i

i e i +

400 X

i=301

i e i ;

i e i

iei ; b3 = iei ; b3 =

600 X

i=401 600 X

i=401

i ei ; b4 = i ei ; b4 =

700 X

i=501 800 X

i=601

i e i i e i

The scalars multiplying the basis vectors $e_i$ are real random values. Each right-hand side in $B_a$ has nonzero projection to every basis vector in $\{e_1, \ldots, e_{200}\}$. Therefore the smallest invariant subspace for each of the columns in $B_a$ is the span of $\{e_1, \ldots, e_{200}\}$. Sets $B_b$, $B_c$ represent cases where the right-hand sides (hence their smallest invariant subspaces) have fewer common components. Finally, the smallest invariant subspaces for each of the right-hand sides in $B_d$ are disjoint.

Table 7 shows the number of iterations and, in parentheses, time per right-hand side. With $B_a$, all block methods reach their best performance of $\lceil n/s \rceil$ iterations to converge. As expected, MULTI.MR demonstrates excellent performance; all systems were well approximated using the polynomial from the first seed. Block methods converge in the largest number of iterations with set $B_d$; however, the number of iterations remains smaller than the dimension (200) of the invariant subspace of each right-hand side. It is interesting to see MULTI.MR achieve good performance despite the structure of $B_d$. This was due to the fact that the residual polynomial is small in $[0, 1] \times [0, 1]$, causing a decrease of the norm of non-seed residuals; see Theorem 3.2. It is worth noting that the superior performance of MULTI.MR versus the block methods is explained in light of the discussion in Section 4: indeed, diagonal matrices have the maximum sparsity allowed in a nonsingular matrix.
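The two extreme cases of this experiment, $B_a$ and $B_d$, can be set up as in the following sketch. It is an illustration only: NumPy's generator stands in for the Alliant drand routine, the seed is arbitrary, the helper names are ours, and the index ranges used for $B_d$ (four consecutive blocks of 200 basis vectors) are our reading of the description above rather than a verified copy of the definitions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n, s = 1000, 4

# Diagonal of D = diag(RAND(n,1) + i RAND(n,1)); eigenvalues lie in [0,1] x [0,1].
d = rng.random(n) + 1j * rng.random(n)


def rhs_from_ranges(ranges, n, rng):
    """One column per (lo, hi) pair (1-based, inclusive): real weights on e_lo..e_hi."""
    B = np.zeros((n, len(ranges)))
    for j, (lo, hi) in enumerate(ranges):
        # 1 - U with U in [0, 1) lies in (0, 1], so every weight is nonzero.
        B[lo - 1:hi, j] = 1.0 - rng.random(hi - lo + 1)
    return B


def support(b):
    """Indices (0-based) of the eigenvectors e_i present in the vector b."""
    return set(np.flatnonzero(b).tolist())


# B_a: every column has components along all of e_1, ..., e_200.
B_a = rhs_from_ranges([(1, 200)] * s, n, rng)
# B_d: columns with pairwise disjoint eigenvector components.
B_d = rhs_from_ranges([(1, 200), (201, 400), (401, 600), (601, 800)], n, rng)

# For a diagonal matrix, D @ b is just d * b, which leaves the support unchanged.
assert support(d * B_d[:, 1]) == support(B_d[:, 1])
```

Because D is diagonal, the support of a right-hand side identifies its smallest invariant subspace directly, which is what makes this artificial example convenient for separating the effect of shared eigenvector components from other factors.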

5.3. Conclusions from the experiments. In several cases, our methods seem to be faster than if a single right-hand side solver were to be applied independently to each right-hand side. For some application problems, block methods were found to perform well. However, algorithms BQMRCG and BQMRLAN stagnated in several cases; it is then worthwhile to use the modified algorithms MBCG and BQMRLAN2. For MBCG, the improved numerical properties were achieved at a substantial additional cost per iteration. Our analysis and experiments support the conclusion that block methods become appealing when applied to systems which are not very sparse, in particular to cases where the cost of sums and products of dense block matrices becomes comparable to that of multiplication with the coefficient matrix. The eigenvector composition of the starting vectors influences the performance of all methods. With a good monitoring strategy to avoid breakdown, MULTI.MR performs well even for right-hand sides consisting of disjoint eigenvector components, provided distinct eigenvalues are clustered.

Acknowledgments. D. Schneider provided us with much valuable information, references, and the matrix in Example 4. B. Philippe and R. Freund have made several useful suggestions. R. Mittra and K. Mahadevan provided us with the matrices in Examples 1 and 2. A. Chronopoulos drew our attention to an important reference. M-C. Brunet made comments on an earlier version of the manuscript. M. Levy provided editorial assistance. They all have our thanks. This research was supported by the National Science Foundation under grant NSF CCR-9120105 and by ARPA under a subcontract from the University of Minnesota grant DARPA/NIST-60NANB2D1272. Preliminary results from this paper were first presented at the 1993 Householder Symposium. The authors can be reached at [email protected] and [email protected].

A. Appendix.

A.1. Algorithm BQMRLAN2. Algorithm BQMRLAN2 implements the block version of Algorithm 8.1 from [12]. Function ORTH($\hat{Y}$) orthogonalizes matrix $\hat{Y}$ with respect to $\langle x, y \rangle$, and function GIVENS applies Givens rotations as described in Appendix A.2.

Algorithm : BQMRLAN2(A; B; X0) V^0 = B ? AX0; P0 = S0 = 0n;s; G?1 = I2s [0?1 ; V0] = ORTH(V^0) !0 = diag(kv0(1)k; : : :; kv0(s)k)  0 = ! 0 0 ;  0 = I s for k = 0; 1; : : : Pk+1 = Vk ? Pk k?1 Tk k+1 = PkT+1 APk+1 V^k+1 = APk+1 ? Vk k+1 [?k+1 1 ; Vk+1] = ORTH(V^k+1 ) !k+1 = diag(kvk(1+)1 k : : :; kvk(s+)1k) &k = 0s ; k = !k k+1 ; k = !k+1 k+1 [&k ; k ]T = Gk?1 [&k ; k]T [Gk ; k; k+1 ; k] = GIVENS(k ; k; k) Sk+1 = (Pk+1 ? Sk &k )k?1 Xk+1 = Xk + Sk+1 k end

2ns2

+ ns + 8=3s3 2ns2 2ns2

4ns2

2ns2

+ ns + s3

2ns 2s2 4s3

+ ns + s3 2ns2 + ns

Matrices &k ; k , and k are either full or triangular, so that the BQMR procedure is slightly more expensive than for BQMRCG-like algorithms.


A.2. Function GIVENS. The block form of the function GIVENS was described in [41] and is repeated below. It carries out the operation

" #

 ?! 

" ~#  0s

; ~;  2 Css ; ~; 

GIVENS

was described in [41]

upper triangular:

At the same time, other iterates are updated, and the rotation coefficients stored.

function # (; ;  ) " #[G; ~; ~ ; "~] = GIVENS E =  T = 0 ; G = I2s s for i = 1 : s for j = i + s : ?1 : i + 1 [C ; S ] = ROT(Ej?1;i; Ej;i) y = C Ej?1;i:s + S Ej;i:s Ej;i:s = ?S Ej?1;i:s + C Ej;i:s Ej?1;i:s = y y = C Tj?1;1:s + S Tj;1:s Tj;1:s = ?S  Tj?1;1:s + C Tj;1:s Tj?1;1:s = y y = C Gj?1;1:2s + S Gj;1:2s Gj;1:2s = ?S  Gj?1;1:2s + C Gj;1:2s Gj?1;1:2s = y end

"end #

~ = T; ~ = E 1:s;1:s ~

Function ROT($x, y$) computes the scalar Givens rotation coefficients to annihilate $y$. A total of $s^2$ Givens parameters are calculated. When both matrices entering the rotation are diagonal, the routine can be significantly simplified, and only s Givens parameters are computed.
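For concreteness, a scalar rotation of this kind can be sketched as follows. This is the standard unitary Givens rotation with a real cosine, written in Python under our own conventions; the function names `rot` and `apply_rot` are illustrative, and the exact conjugation and sign conventions of the routine in [41] are not reproduced here.

```python
import numpy as np


def rot(x: complex, y: complex):
    """Return (c, s), c real, such that [[c, s], [-conj(s), c]] maps (x, y) to (r, 0)."""
    norm = np.hypot(abs(x), abs(y))
    if norm == 0.0:
        return 1.0, 0.0 + 0.0j        # nothing to annihilate: identity rotation
    if x == 0:
        return 0.0, 1.0 + 0.0j        # swap the two entries
    c = abs(x) / norm
    s = (x / abs(x)) * np.conj(y) / norm
    return c, s


def apply_rot(c, s, top, bottom):
    """Apply the rotation to a pair of rows (scalars or NumPy arrays)."""
    new_top = c * top + s * bottom
    new_bottom = -np.conj(s) * top + c * bottom
    return new_top, new_bottom


if __name__ == "__main__":
    x, y = 3 + 4j, 1 - 2j
    c, s = rot(x, y)
    r, zero = apply_rot(c, s, x, y)
    print(abs(r), abs(zero))   # |r| equals sqrt(|x|^2 + |y|^2); the second entry vanishes
```

In the block routine, one such pair (c, s) is computed per annihilated entry and then applied to the remaining columns of the two affected rows, which is the role played by `apply_rot` here.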

function [G; ~; ~ ; ~] = GIVENS1(; ;  ) for i = 1 : s [C ; S ] = ROT(i;i; i;i) i;i = C i;i + S i;i ~i;i = ?S  i;i; i;i = C i;i Gi;i = C ; Gi+1;i+1 = C ; Gi+1;i = ?S ; Gi;i+1 = S end ~ = ; ~ =  REFERENCES

[1] M. Arioli, I. Duff, D. Ruiz, and M. Sadkane. Techniques for accelerating the block Cimmino method. In J. J. Dongarra, et al., editors, Proc. Fifth SIAM Conf. Parallel Processing for Scientific Computing, pages 98–104. SIAM, Philadelphia, 1992.
[2] D. L. Boley and G. H. Golub. The Lanczos-Arnoldi algorithm and controllability. Systems & Control Letters, 4:317–324, 1984.
[3] F. X. Canning. Physical and mathematical structure determine convergence rate of iterative techniques. IEEE Trans. Magn., 25(4):2825–2827, 1989.
[4] D. C. Chatfield, M. S. Reeves, D. G. Truhlar, C. Duneczky, and D. W. Schwenke. Complex generalized minimal residual algorithm for iterative solution of quantum mechanical reactive scattering equations. J. Chem. Phys., 97(11):8322–8333, 1992.
[5] D. Choudhury and R. A. Horn. An analog of the Gram-Schmidt algorithm for complex bilinear forms and diagonalization of complex symmetric matrices. Technical Report 454, The Johns Hopkins University, Jan. 1986.
[6] B. D. Craven. Complex symmetric matrices. J. Austral. Math. Soc., 10:351–354, 1969.
[7] I. S. Duff, R. G. Grimes, and J. G. Lewis. User's guide for the Harwell-Boeing sparse matrix collection (release I). Technical Report TR/PA/92/86, CERFACS, Toulouse Cedex, France, Oct. 1992.
[8] V. Faber and T. Manteuffel. Necessary and sufficient conditions for the existence of a conjugate gradient method. SIAM J. Numer. Anal., 21(2):352–362, 1984.
[9] R. W. Freund. Quasi-kernel polynomials and convergence results for quasi-minimal residual iterations. In D. Braess and L. L. Schumaker, editors, Numerical Methods in Approximation Theory, Vol. 9, pages 77–95. Birkhäuser Verlag, Basel, 1992.
[10] R. W. Freund. Conjugate gradient-type methods for linear systems with complex symmetric coefficient matrices. SIAM J. Sci. Stat. Comput., 13(1):425–448, 1992.
[11] R. W. Freund, G. H. Golub, and N. M. Nachtigal. Iterative solution of linear systems. Acta Numerica, 1:57–100, 1992.
[12] R. W. Freund and N. M. Nachtigal. An implementation of the QMR method based on coupled two-term recurrences. SIAM J. Sci. Comput., 15(2):313–337, 1994.
[13] R. W. Freund and N. M. Nachtigal. Implementation details of the coupled QMR algorithm. Technical Report 92-19, RIACS, NASA Ames Research Center, Oct. 1992.
[14] G. Golub and R. Underwood. The block Lanczos method for computing eigenvalues. In J. R. Rice, editor, Mathematical Software III, pages 364–377. Academic Press, New York, 1977.
[15] M. Hochbruck and G. Starke. Preconditioned Krylov subspace methods for Lyapunov matrix equations. Technical report, Eidgenössische Technische Hochschule, Zürich, Oct. 1992. IPS Research Report 92-17.
[16] D. A. H. Jacobs. A generalization of the conjugate-gradient method to solve complex systems. IMA J. Numer. Anal., 6:447–452, 1986.
[17] Z. Jia. Generalized block Lanczos methods for large unsymmetric eigenproblems. Submitted for publication, 1993.
[18] P. Joly and G. Meurant. Complex conjugate gradient methods. Numer. Alg., 4:379–406, 1993.
[19] S. Kharchenko, P. Kolesnikov, A. Nikishin, A. Yeremin, M. Heroux, and Q. Seikh. Iterative solution methods on the Cray YMP/C90. Part II: Dense linear systems. Presented at 1993 Simulation Conference: High Performance Computing Symposium, Washington D.C.
[20] A. B. Kucherov. A fast iterative method for solving in real arithmetic a system of linear equations with a complex symmetric matrix. Soviet Math. Dokl., 43(2):377–379, 1991.
[21] J. G. Lewis. Algorithms for sparse matrix eigenvalue problems. Technical Report STAN-CS-77-595, Dept. Comput. Sci., Stanford Univ., Palo Alto, CA, 1977.
[22] K. Mahadevan and R. Mittra. Private communication. Electromagnetic Communication Lab., University of Illinois at Urbana-Champaign, 1992.
[23] G. Markham. Conjugate gradient methods for indefinite, asymmetric, and complex systems. IMA J. Numer. Anal., 10:155–170, 1990.
[24] E. K. Miller. Assessing the impact of large-scale computing on the size and complexity of first-principles electromagnetic models. In H. L. Bertoni and L. B. Felsen, editors, Directions in Electromagnetic Wave Modeling, pages 185–196, New York, 1991. Plenum Press.
[25] A. A. Nikishin and A. Y. Yeremin. Variable block CG algorithms for solving large sparse symmetric positive definite systems on parallel computers, I: General iterative scheme. Technical Report EE-RR 1/92, Elegant Mathematics, Inc., 1992, Rev. Jan. 1993.
[26] D. P. O'Leary. The block conjugate gradient algorithm and related methods. Lin. Alg. Appl., 29:293–322, 1980.
[27] M. Papadrakakis and S. Smerou. A new implementation of the Lanczos method in linear problems. Int'l. J. Numer. Meth. Engng., 29:141–159, 1990.
[28] B. N. Parlett. A new look at the Lanczos algorithm for solving symmetric systems of linear equations. Lin. Alg. Appl., 29:323–346, 1980.
[29] Y. Saad. On the Lanczos method for solving symmetric systems with several right hand sides. Math. Comp., 48:651–662, 1987.
[30] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7(3):856–869, 1986.
[31] M. Sadkane. A block Arnoldi-Chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices. Numer. Math., 64:181–193, 1993.
[32] T. K. Sarkar. On the application of the generalized biconjugate gradient method. J. Electromagnetic Waves and Applications, 1(3):223–242, 1987.
[33] T. K. Sarkar, X. Yang, and E. Arvas. A limited survey of various conjugate gradient methods for solving complex matrix equations arising in electromagnetic wave interactions. Wave Motion, 10:527–546, 1988.
[34] D. J. Schneider and J. H. Freed. Calculating slow-motional magnetic resonance spectra: A user's guide. In J. J. Berliner and J. Reuben, editors, Biological Magnetic Resonance, volume 8, pages 1–76. Plenum Press, 1989.
[35] H. D. Simon and A. Yeremin. A new approach to construction of efficient iterative schemes for massively parallel algorithms: Variable block CG and BiCG methods and variable block Arnoldi procedure. In R. F. Sincovec, et al., editors, Proc. Sixth SIAM Conf. Parallel Processing for Scientific Computing, pages 57–60, SIAM, Philadelphia, 1993.
[36] V. Simoncini and E. Gallopoulos. An iterative method for nonsymmetric systems with multiple right-hand sides. Technical Report 1242, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, July 1992.
[37] V. Simoncini and E. Gallopoulos. Convergence properties of block GMRES for solving systems with multiple right-hand sides. Technical Report 1316, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Oct. 1993.
[38] C. F. Smith, A. F. Peterson, and R. Mittra. The biconjugate gradient method for electromagnetic scattering. IEEE Trans. Ant. Prop., 38(6):938–940, 1990.
[39] C. F. Smith, A. F. Peterson, and R. Mittra. A conjugate gradient algorithm for the treatment of multiple incident electromagnetic fields. IEEE Trans. Ant. Prop., 37(11):1490–1493, 1989.
[40] H. A. van der Vorst and J. B. M. Melissen. A Petrov-Galerkin type method for solving Ax = b where A is symmetric complex. IEEE Trans. Magn., 26(2):706–708, 1990.
[41] B. Vital. Etude de quelques méthodes de résolution de problèmes linéaires de grande taille sur multiprocesseur. PhD thesis, Université de Rennes I, Rennes, Nov. 1990.
[42] V. V. Voevodin. On methods of conjugate direction. USSR Comput. Maths. Math. Phys., 19:228–233, 1980.
[43] W. Yang and W. H. Miller. Block Lanczos approach combined with matrix continued fraction for the S-matrix Kohn variational principle in quantum scattering. J. Chem. Phys., 6:3504–3508, Sep. 1989.
