Technical Report TR/PA/12/45
Publications of the Parallel Algorithms Team
http://www.cerfacs.fr/algor/publications/
A deflated minimal block residual method for the solution of non-hermitian linear systems with multiple right-hand sides Henri Calandra∗
Serge Gratton†
Rafael Lago‡
Xavier Vasseur§
Luiz Mariano Carvalho¶

July 6th, 2012
Abstract

In this paper we address the solution of linear systems of equations with multiple right-hand sides given at once. When the dimension of the problem is known to be large, preconditioned block Krylov subspace methods are usually considered as the method of choice. Nevertheless, to be effective in terms of computational operations, it is known that these methods must incorporate a strategy for detecting when a linear combination of the systems has approximately converged. This strategy usually leads to an explicit block size reduction, often called deflation. While initial deflation or deflation at the beginning of a cycle are nowadays popular, block Krylov subspace methods incorporating deflation at each iteration are quite rare. The purpose of this paper is thus to extend the block flexible restarted GMRES method to variants that allow the use of deflation at each iteration when solving multiple right-hand side problems given at once. The main goal of deflation is then to reduce the cost of each iteration of the block Krylov subspace method by judiciously choosing which information to use and which information to postpone. For the purpose of analysis we introduce a new minimal block residual method that incorporates block size reduction at each iteration, named the Deflated Minimal Block Residual method. First we study its main mathematical properties; we notably show that the Frobenius norm of the block residual is always nonincreasing. Second we justify the choice of the deflation strategy based on the nonincreasing behaviour of the singular values of the scaled block true residual. Third we propose a variant of the deflated minimal block residual method that includes truncation at each iteration. Finally we discuss the computational cost of the algorithm and the possibility of breakdown in exact arithmetic. Numerical experiments on two different problems issued from wave propagation applications requiring the solution of multiple right-hand side problems are then discussed. On these test cases the new block flexible method including deflation at each iteration has proven to be more efficient, in terms of both preconditioner applications and computational operations, than recent block flexible Krylov subspace methods performing deflation at restart only.

∗ TOTAL, Centre Scientifique et Technique Jean Féger, avenue de Larribau, F-64000 Pau, France
† INPT-IRIT, University of Toulouse and ENSEEIHT, 2 Rue Camichel, BP 7122, F-31071 Toulouse Cedex 7, France
‡ CERFACS, 42 Avenue Gaspard Coriolis, F-31057 Toulouse Cedex 1, France
§ CERFACS and HiePACS project, joint INRIA-CERFACS Laboratory, 42 Avenue Gaspard Coriolis, F-31057 Toulouse Cedex 1, France
¶ CNPq fellowship, Brazil. Applied Mathematics Department, IME-UERJ, R. S. F. Xavier, 524, 629D, 20559-900, Rio de Janeiro, RJ, Brazil
Key words. Block Krylov space method; Block size reduction; Deflation at each iteration; Flexible preconditioning; Multiple right-hand sides
1 Introduction
We consider block Krylov space methods for the solution of linear systems of equations with $p$ right-hand sides given at once, of the form $AX = B$, where $A \in \mathbb{C}^{n\times n}$ is supposed to be a nonsingular non-Hermitian matrix, $B \in \mathbb{C}^{n\times p}$ is supposed to be of full rank and $X \in \mathbb{C}^{n\times p}$. Although the number of right-hand sides $p$ might be relatively large, we suppose here that the dimension of the problem $n$ is always much larger. Later we denote by $X_0 \in \mathbb{C}^{n\times p}$ the initial block iterate and by $R_0 = B - AX_0$ the initial block residual. As stated in [23, 24], a block Krylov space method for solving the $p$ systems is an iterative method that generates approximations $X_m \in \mathbb{C}^{n\times p}$ with $m \in \mathbb{N}$ such that $X_m - X_0 \in \mathcal{K}_m(A, R_0)$, where the block Krylov space $\mathcal{K}_m(A, R_0)$ (in the unpreconditioned case) is defined as
$$\mathcal{K}_m(A, R_0) = \left\{ \sum_{k=0}^{m-1} A^k R_0\, \gamma_k \;:\; \gamma_k \in \mathbb{C}^{p\times p},\ 0 \le k \le m-1 \right\} \subset \mathbb{C}^{n\times p}.$$
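For concreteness, the following NumPy sketch builds an orthonormal basis of $\mathcal{K}_m(A, R_0)$ by accumulating the block powers $A^k R_0$ and orthonormalizing their concatenation. This illustrates the definition only: the function name `block_krylov_basis` is ours, and a practical method would use the block Arnoldi procedure of Section 2.2 rather than an explicit QR of the generating matrix.

```python
import numpy as np

def block_krylov_basis(A, R0, m):
    """Orthonormal basis of the block Krylov space K_m(A, R0).

    K_m(A, R0) is spanned by the columns of [R0, A R0, ..., A^{m-1} R0];
    a rank-revealing QR would detect (and deflate) dependent columns.
    """
    blocks, S = [], R0
    for _ in range(m):
        blocks.append(S)
        S = A @ S                     # next block power A^k R0
    K = np.hstack(blocks)             # n x (m*p) generating matrix
    Q, _ = np.linalg.qr(K)            # orthonormal basis (full rank assumed)
    return Q

# small illustration
rng = np.random.default_rng(0)
n, p, m = 50, 3, 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
R0 = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
Q = block_krylov_basis(A, R0, m)
assert Q.shape == (n, m * p)
```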
We refer the reader to [23] for a recent detailed overview of block Krylov subspace methods and note that most of the standard Krylov subspace methods have a block counterpart (see, e.g., block GMRES [45], block BiCGStab [22], block IDR(s) [14] and block QMR [20]). In this paper we mainly focus on restarted block Krylov subspace methods that satisfy a minimum norm property as introduced in [38, Section 6.12]. Block Krylov subspace methods are increasingly popular in many application areas in computational science and engineering (e.g. electromagnetic scattering (monostatic radar cross section analysis) [9, 29, 40], lattice quantum chromodynamics [39], model reduction in circuit simulation [19], stochastic finite elements with uncertainty restricted to the right-hand side [16], and sensitivity analysis of mechanical systems [5], to name a few). To be effective in terms of computational operations it is recognized that these methods must incorporate a strategy for detecting when a linear combination of the systems has approximately converged [23]. This explicit block size reduction is called deflation, as discussed in [23]. First, a simple strategy to remove useless information from a block Krylov subspace - called initial deflation - consists in detecting possible linear dependency in the block right-hand side $B$ or in the initial block residual $R_0$ ([23, Section 12] and [29, Section 3.7.2]). When a restarted block Krylov subspace method is used,
this block size reduction can also be performed at each initial computation of the block residual, i.e., at the beginning of each cycle [23, Section 14]. In addition, Arnoldi deflation [23] may also be considered; it aims at detecting a near rank deficiency occurring in the block Arnoldi procedure in order to later reduce the current block size. These three strategies, based on rank-revealing QR factorizations [10] or singular value decompositions [21], have notably been proposed both in the Hermitian [32, 37] and non-Hermitian [1, 3, 12, 20, 31, 33] cases for block Lanczos methods. They have been shown to be effective with respect to standard block Krylov subspace methods. While initial deflation or deflation at the beginning of a cycle are nowadays popular, block Krylov subspace methods incorporating deflation at each iteration have been rarely studied. In [36] Robbé and Sadkane have introduced the notion of inexact breakdown to study block size reduction techniques in block GMRES. Two criteria have been proposed, based either on the numerical rank of the generated block Krylov basis (W-criterion) or on the numerical rank of the block residual (R-criterion). Numerical experiments on academic problems of small dimension with a reduced number of right-hand sides illustrated the advantages and drawbacks of each variant versus standard block GMRES. Further numerical experiments can be found in [28]. Another method relying on such a strategy is the Dynamic BGMRES (DBGMRES) [13], which is an extension of block Loose GMRES [4]. Nevertheless, we are not aware of any paper showing the interest of block Krylov subspace methods with deflation at each iteration on a concrete real-life application. Moreover, the combination of block Krylov subspace methods performing deflation at each iteration and variable preconditioning has been rarely addressed in the literature. Variable preconditioning is often required when solving large linear systems of equations. This is notably the case when inexact solutions of the preconditioning system using, e.g., nonlinear smoothers in multigrid [34] or approximate interior solvers in domain decomposition methods [43, Section 4.3] are considered. Thus the main purpose of the paper is to derive a class of flexible minimal block residual methods that incorporate block size reduction at each iteration. We will introduce a new generic method belonging to this class and compare it to recently proposed flexible block Krylov subspace methods using deflation at the beginning of the cycle only [11]. The effectiveness of the new method will be shown on two applications arising from the discretization of partial differential equations occurring in wave propagation problems. The paper is organized as follows. First we introduce in Section 2 a general framework for block Krylov subspace methods aiming at reducing the cost of each iteration by judiciously choosing which information to use and which information to postpone. A new method named the Deflated Minimal Block Residual method will notably be analyzed in this setting. We will show that the Frobenius norm of the block residual is always nonincreasing and that the singular values of the scaled block residual are always nonincreasing. Based on this property we will specify a possible block size reduction strategy and discuss a variant based on truncation. We will then emphasize in Section 3 connections with existing algorithms that incorporate no deflation, deflation at restart only, or deflation at each iteration.
In Section 4 we analyze the convergence properties of the Deflated Minimal Block Residual method and the possibility of a breakdown (in exact arithmetic only). Then in Section 5 we demonstrate the effectiveness of the proposed algorithm on two applications. Finally we draw some conclusions in Section 6.
2 A deflated minimal block residual method
In this section we present a method which aims at minimizing the Frobenius norm of the block true residual while performing possible block size reduction at each iteration and allowing variable preconditioning. We first introduce the block orthogonalization procedure, then describe in detail the main mathematical properties of the new method with a specific focus on the block size reduction strategy. We conclude this section by analyzing the computational cost and memory requirements of the new method.
2.1 Notation
Throughout this paper we denote by $\|\cdot\|_2$ the Euclidean norm, $\|\cdot\|_F$ the Frobenius norm, $I_k \in \mathbb{C}^{k\times k}$ the identity matrix of dimension $k$ and $0_{i\times j} \in \mathbb{C}^{i\times j}$ the zero rectangular matrix with $i$ rows and $j$ columns. The superscript $^H$ denotes the transpose conjugate operation. Given a vector $d \in \mathbb{C}^k$ with components $d_i$, $D = \mathrm{diag}(d_1, \ldots, d_k)$ is the diagonal matrix $D \in \mathbb{C}^{k\times k}$ such that $D_{ii} = d_i$. If $C \in \mathbb{C}^{k\times l}$ we denote the singular values of $C$ by $\sigma_1(C) \ge \cdots \ge \sigma_{\min(k,l)}(C) \ge 0$, and by $\mathrm{nul}(C)$ the rank deficiency of $C$. Finally $e_m \in \mathbb{C}^n$ denotes the $m$-th canonical vector of $\mathbb{C}^n$. Regarding the algorithmic part (Algorithms 1-4), we adopt notation similar to that of MATLAB. For instance, $U(i, j)$ denotes the $U_{ij}$ entry of matrix $U$, $U(1:m, 1:j)$ refers to the submatrix made of the first $m$ rows and first $j$ columns of $U$, and $U(:, j)$ corresponds to its $j$-th column.
2.2 Deflated block Arnoldi
To compute the solution of linear systems with several right-hand sides given at once, minimum residual norm based methods such as BGMRES usually rely on the block Arnoldi orthonormalization procedure [38, Chapter 6] for building a basis of the block Krylov subspace. We denote by $p_{j-1}$ the number of linear systems considered at the $j$-th iteration of a given cycle of a block Krylov subspace method ($1 \le p_{j-1} \le p$; a strict inequality $p_{j-1} < p$ may happen in case of a rank degeneracy, as discussed later in Remark 1 and in Section 4). If $K \in \mathbb{C}^{n\times p_{j-1}}$ denotes an orthonormal matrix containing all the $p_{j-1}$ Krylov directions at iteration $j-1$, the most expensive part of the algorithm at the $j$-th iteration - when $n$ is large - lies in the $p_{j-1}$ applications of the (possibly variable) preconditioner and the subsequent $p_{j-1}$ matrix-vector products. Nevertheless, as discussed later in Section 2.4, subspaces of $\mathrm{range}(K)$ may no longer be needed for ensuring convergence along the iterative procedure. In such a situation it would be desirable to handle this information separately to reduce the computational costs associated with block orthonormalization, preconditioning and matrix-vector products. This explicit block size reduction is often called deflation in the literature; see, e.g., [23, 29]. Thus we introduce a modified version of the block Arnoldi algorithm, called deflated block Arnoldi, in which $\mathrm{range}(K)$ has been judiciously decomposed into:
$$\mathrm{range}(K) = \mathrm{range}(V_j) \oplus \mathrm{range}(P_{j-1}), \quad \text{with} \quad [V_j\ P_{j-1}]^H [V_j\ P_{j-1}] = I_{p_{j-1}}, \qquad (1)$$
where $V_j \in \mathbb{C}^{n\times k_j}$ and $P_{j-1} \in \mathbb{C}^{n\times d_j}$ with $k_j + d_j = p_{j-1}$. In other words, $k_j$ Krylov directions are effectively considered at iteration $j$, while $d_j$ directions are left aside (or deflated) at the same iteration. We note that we literally choose the "best" subspace of $\mathrm{range}(K)$ of dimension $k_j$ (not only $k_j$ columns of $K$) to define $V_j$, leaving the remaining subspace in $\mathrm{range}(P_{j-1})$ (i.e. the deflated subspace at iteration $j$ is $\mathrm{range}(P_{j-1})$). We address possible criteria for such a decomposition (1) later in Section 2.4. Based on this decomposition, the deflated orthonormalization procedure applies variable preconditioning and matrix-vector products only over the $k_j$ chosen directions of $V_j$. We show the $j$-th iteration of the deflated block Arnoldi method using a variable preconditioner in Algorithm 1.

Algorithm 1 $j$-th iteration of flexible deflated block Arnoldi with block modified Gram-Schmidt: computation of $\hat{\mathcal{V}}_{j+1}$, $\mathcal{Z}_j$, $n_j \in \mathbb{N}$, $p_j \in \mathbb{N}$ and $s_j \in \mathbb{N}$, with $V_i \in \mathbb{C}^{n\times k_i}$ such that $V_i^H V_i = I_{k_i}$ ($1 \le i \le j$), $p_{j-1} = k_j + d_j$, $P_{j-1} \in \mathbb{C}^{n\times d_j}$ and $[V_1, \ldots, V_j, P_{j-1}]$ orthonormal.
1: Define $s_{j-1} = \sum_{l=1}^{j-1} k_l$ ($s_0 = 0$)
2: # Choose the preconditioning operator $M_j^{-1}$
3: $Z_j = M_j^{-1} V_j$
4: $S = A Z_j$
5: # Orthogonalization of $S$ with respect to $[V_1, \ldots, V_j, P_{j-1}]$
6: for $i = 1, \ldots, j$ do
7:    $H_{i,j} = V_i^H S$
8:    $S = S - V_i H_{i,j}$
9: end for
10: $H_p = P_{j-1}^H S$
11: $S = S - P_{j-1} H_p$
12: Define $H_j \in \mathbb{C}^{(s_{j-1}+p_{j-1})\times k_j}$ as $H_j^T = [H_{1,j}^T, \ldots, H_{j,j}^T, H_p^T]$
13: Compute the QR decomposition of $S$ as $S = QT$ with $n_j = \mathrm{nul}(S)$, $Q \in \mathbb{C}^{n\times(k_j-n_j)}$ and $T \in \mathbb{C}^{(k_j-n_j)\times k_j}$
14: Set $\hat{V}_{j+1} = Q$, $H_{j+1,j} = T$
15: Define $p_j = p_{j-1} - n_j$ and $s_j = s_{j-1} + k_j$
16: Define $\mathcal{Z}_j \in \mathbb{C}^{n\times s_j}$ as $\mathcal{Z}_j = [Z_1, \ldots, Z_j]$, $\mathcal{V}_j \in \mathbb{C}^{n\times s_j}$ as $\mathcal{V}_j = [V_1, \ldots, V_j]$ and $\hat{\mathcal{V}}_{j+1} \in \mathbb{C}^{n\times(s_j+p_j)}$ as $\hat{\mathcal{V}}_{j+1} = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}]$, such that $A Z_j = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}] \begin{bmatrix} H_j \\ H_{j+1,j} \end{bmatrix}$

As in standard block Arnoldi, Algorithm 1 proceeds by orthonormalizing $A Z_j$ against all the previous preconditioned Krylov directions; additionally, the orthonormalization against $P_{j-1}$ is performed (lines 10 and 11 of Algorithm 1).
The block modified Gram-Schmidt version is presented in Algorithm 1, but a version of block Arnoldi due to Ruhe [37] or block Householder orthonormalization [2, 42] could be used as well.

Remark 1. We consider the possibility of a rank deficiency in $S$ (line 13 of Algorithm 1), where a rank-revealing QR (RRQR) algorithm [8] would be used to determine both the deficiency $n_j$ and the decomposition $S\Pi_c = QT$ (with $\Pi_c$ denoting a column permutation matrix). As discussed later, a deficiency of $S$ characterizes a breakdown in the deflated block Arnoldi procedure. We will show in Section 4 that this behaviour is rare in practice, because it means that partial convergences (see Definition 1) have happened. Thus it is more realistic to consider that the relations $n_j = 0$ and $p_j = p$ hold at iteration $j$. Consequently a standard QR decomposition based on modified Gram-Schmidt is then used instead.

In Proposition 1 we analyze the flexible Arnoldi relation that is obtained when using the deflated block Arnoldi procedure shown in Algorithm 1.

Proposition 1. With the notation of Algorithm 1, given $[\mathcal{V}_j\ P_{j-1}] \in \mathbb{C}^{n\times(s_j+d_j)}$ orthonormal, assume that the following flexible block Arnoldi relation holds at the beginning of the $j$-th iteration of the flexible deflated block Arnoldi procedure ($j > 1$):
$$A\mathcal{Z}_{j-1} = [\mathcal{V}_j\ P_{j-1}]\,\mathcal{H}_{j-1}, \qquad (2)$$
with $\mathcal{Z}_{j-1} \in \mathbb{C}^{n\times s_{j-1}}$, $\mathcal{V}_j \in \mathbb{C}^{n\times s_j}$, $P_{j-1} \in \mathbb{C}^{n\times d_j}$, $[\mathcal{V}_j\ P_{j-1}]$ orthonormal and $\mathcal{H}_{j-1} \in \mathbb{C}^{(s_{j-1}+p_{j-1})\times s_{j-1}}$. The $j$-th iteration of the flexible deflated block Arnoldi shown in Algorithm 1 produces matrices $\mathcal{Z}_j \in \mathbb{C}^{n\times s_j}$, $\hat{\mathcal{V}}_{j+1} \in \mathbb{C}^{n\times(s_j+p_j)}$ and $\hat{\mathcal{H}}_j \in \mathbb{C}^{(s_j+p_j)\times s_j}$ which satisfy:
$$A\mathcal{Z}_j = \hat{\mathcal{V}}_{j+1}\,\hat{\mathcal{H}}_j, \qquad (3)$$
where $\hat{\mathcal{V}}_{j+1}$ is orthonormal and $\hat{\mathcal{H}}_j$ is given by:
$$\hat{\mathcal{H}}_j = \begin{bmatrix} \mathcal{H}_{j-1} & H_j \\ 0_{(k_j-n_j)\times s_{j-1}} & H_{j+1,j} \end{bmatrix}.$$
Proof. The $j$-th iteration of the flexible deflated block Arnoldi procedure with block modified Gram-Schmidt leads to:
$$A Z_j = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}] \begin{bmatrix} H_j \\ H_{j+1,j} \end{bmatrix}.$$
Thus, due to relation (2), we deduce:
$$[A\mathcal{Z}_{j-1}\ \ A Z_j] = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}] \begin{bmatrix} \mathcal{H}_{j-1} & H_j \\ 0_{(k_j-n_j)\times s_{j-1}} & H_{j+1,j} \end{bmatrix},$$
that is, $A\mathcal{Z}_j = \hat{\mathcal{V}}_{j+1}\hat{\mathcal{H}}_j$. The orthonormality of $\hat{\mathcal{V}}_{j+1}$ comes from the fact that we have considered $[\mathcal{V}_j\ P_{j-1}]$ as orthonormal and that $\hat{V}_{j+1}$ has been orthonormalized against it using a block Arnoldi procedure with block modified Gram-Schmidt.
2.3 Algorithm of the deflated minimal block residual method
We next present the Deflated Minimal Block Residual method, which aims at minimizing the Frobenius norm of the block true residual while performing possible block size reduction (deflation) at each iteration. The restarted variant of the Deflated Minimal Block Residual method, called DMBR(m) and given in Algorithm 2, uses the flexible deflated block Arnoldi procedure with block modified Gram-Schmidt presented in Algorithm 1.
2.3.1 Flexible Arnoldi relation
It remains to be shown how to obtain $[\mathcal{V}_{j+1}\ P_j]$ and $\mathcal{H}_j$ from $\hat{\mathcal{V}}_{j+1}$ and $\hat{\mathcal{H}}_j$ respectively, which is intimately related to the subspace decomposition we mentioned earlier in Section 2.2 (see relation (1)). With the notation of Algorithm 2, the decomposition at the end of the $j$-th iteration is then obtained as:
$$[\mathcal{V}_{j+1}\ P_j] = \hat{\mathcal{V}}_{j+1}\, F_{j+1}, \qquad (4)$$
with $\mathcal{V}_{j+1} \in \mathbb{C}^{n\times s_{j+1}}$, $P_j \in \mathbb{C}^{n\times d_{j+1}}$ and $F_{j+1} \in \mathbb{C}^{(s_j+p_j)\times(s_j+p_j)}$. In principle we leave the choice of $F_{j+1}$ open, but we provide a criterion for defining such a matrix later in Section 2.4. For the moment it suffices to consider $F_{j+1}$ as a unitary matrix of order $s_j + p_j$. Using relation (4) and the definition of $\mathcal{H}_j$ (line 23 of Algorithm 2), the flexible Arnoldi relation (3) simply becomes:
$$A\mathcal{Z}_j = \hat{\mathcal{V}}_{j+1} F_{j+1} F_{j+1}^H \hat{\mathcal{H}}_j = [\mathcal{V}_{j+1}\ P_j]\,\mathcal{H}_j, \qquad (5)$$
which is precisely the flexible Arnoldi relation required at the beginning of the $(j+1)$-th iteration of Algorithm 1 (see Proposition 1, relation (2)). Also, since $F_{j+1}$ is unitary, we guarantee that no information is discarded and that the transformed basis $[\mathcal{V}_{j+1}\ P_j] = \hat{\mathcal{V}}_{j+1} F_{j+1}$ is still orthonormal.
Algorithm 2 DMBR(m)
1: Choose a convergence threshold $tol$, a deflation threshold $\varepsilon_d$, the size of the restart $m$ and the maximum number of cycles $cy_{max}$
2: Choose an initial guess $X_0 \in \mathbb{C}^{n\times p}$
3: Compute the initial block residual $R_0 = B - AX_0$
4: Define the diagonal matrix $D \in \mathbb{C}^{p\times p}$ as $D = \mathrm{diag}(b_1, \ldots, b_p)$ with $b_l = \|B(:, l)\|_2$ for $l$ such that $1 \le l \le p$
5: Set $s_0 = 0$
6: for $cycle = 1, \ldots, cy_{max}$ do
7:    Compute the QR decomposition of $R_0 D^{-1}$ as $R_0 D^{-1} = \hat{V}_1 \hat{\Lambda}_0$ and determine $p_0 = \mathrm{rank}(R_0 D^{-1})$, with $\hat{V}_1 \in \mathbb{C}^{n\times p_0}$ and $\hat{\Lambda}_0 \in \mathbb{C}^{p_0\times p}$
8:    Determine the deflation unitary matrix $F_1 \in \mathbb{C}^{p_0\times p_0}$ and $k_1$, $d_1$ such that $k_1 + d_1 = p_0$ (see Algorithm 3 or Algorithm 4)
9:    Set $s_1 = k_1$
10:   Define $[V_1\ P_0] = \hat{V}_1 F_1$, with $V_1 \in \mathbb{C}^{n\times s_1}$ ($P_0 \in \mathbb{C}^{n\times d_1}$) as the first $s_1$ (last $d_1$) columns of $\hat{V}_1 F_1$
11:   Define $\mathcal{V}_1 = V_1$ and $\Lambda_1 = F_1^H \hat{\Lambda}_0$, with $\Lambda_1 \in \mathbb{C}^{p_0\times p}$
12:   for $j = 1, \ldots, m$ do
13:      Completion of $\hat{\mathcal{V}}_{j+1}$, $\mathcal{Z}_j$ and $\hat{\mathcal{H}}_j$: apply Algorithm 1 to obtain $\mathcal{Z}_j \in \mathbb{C}^{n\times s_j}$, $\hat{\mathcal{V}}_{j+1} \in \mathbb{C}^{n\times(s_j+p_j)}$ and $\hat{\mathcal{H}}_j \in \mathbb{C}^{(s_j+p_j)\times s_j}$ such that $A\mathcal{Z}_j = \hat{\mathcal{V}}_{j+1}\hat{\mathcal{H}}_j$ with $\hat{\mathcal{V}}_{j+1} = [V_1, V_2, \ldots, V_j, P_{j-1}, \hat{V}_{j+1}]$, as well as $p_j$ and $n_j$
14:      Set $\hat{\Lambda}_j \in \mathbb{C}^{(s_j+p_j)\times p}$ as $\hat{\Lambda}_j = \begin{bmatrix} \Lambda_j \\ 0_{(k_j-n_j)\times p} \end{bmatrix}$
15:      Solve the minimization problem $Y_j = \mathrm{argmin}_{Y \in \mathbb{C}^{s_j\times p}} \|\hat{\Lambda}_j - \hat{\mathcal{H}}_j Y\|_F$
16:      Compute $\hat{R}_j = \hat{\Lambda}_j - \hat{\mathcal{H}}_j Y_j$
17:      if $\|\hat{R}_j(:, l)\|_2 \le tol$, $\forall\, l \,|\, 1 \le l \le p$ then
18:         Compute $X_j = X_0 + \mathcal{Z}_j Y_j D$; stop;
19:      end if
20:      Determine the deflation unitary matrix $F_{j+1} \in \mathbb{C}^{(s_j+p_j)\times(s_j+p_j)}$ and $k_{j+1}$, $d_{j+1}$ such that $k_{j+1} + d_{j+1} = p_j$ (see Algorithm 3 or Algorithm 4)
21:      Set $s_{j+1} = s_j + k_{j+1}$
22:      Define $[\mathcal{V}_{j+1}\ P_j] = \hat{\mathcal{V}}_{j+1} F_{j+1}$, with $\mathcal{V}_{j+1} \in \mathbb{C}^{n\times s_{j+1}}$ ($P_j \in \mathbb{C}^{n\times d_{j+1}}$) as the first $s_{j+1}$ (last $d_{j+1}$) columns of $\hat{\mathcal{V}}_{j+1} F_{j+1}$
23:      Define $\Lambda_{j+1} = F_{j+1}^H \hat{\Lambda}_j$ and $\mathcal{H}_j = F_{j+1}^H \hat{\mathcal{H}}_j$, with $\Lambda_{j+1} \in \mathbb{C}^{(s_j+p_j)\times p}$ and $\mathcal{H}_j \in \mathbb{C}^{(s_j+p_j)\times s_j}$
24:   end for
25:   $X_m = X_0 + \mathcal{Z}_m Y_m D$
26:   $R_m = B - AX_m$
27:   Set $R_0 = R_m$ and $X_0 = X_m$
28: end for
2.3.2 Convergence properties
We denote by $X_0 \in \mathbb{C}^{n\times p}$ the current approximation of the solution and by $R_0 \in \mathbb{C}^{n\times p}$ the corresponding true block residual, both obtained at the beginning of a given cycle, and by $D \in \mathbb{C}^{p\times p}$ a diagonal scaling matrix (defined in line 4 of Algorithm 2). In the next lemma we derive two possible representations of the scaled block residual $R_0 D^{-1}$.

Lemma 1. In the deflated minimal block residual method (DMBR(m), Algorithm 2), $\Lambda_{j+1} \in \mathbb{C}^{(s_j+p_j)\times p}$ represents the scaled block residual $R_0 D^{-1} = (B - AX_0) D^{-1}$ in the $[\mathcal{V}_{j+1}\ P_j]$ basis:
$$R_0 D^{-1} = [\mathcal{V}_{j+1}\ P_j]\,\Lambda_{j+1}. \qquad (6)$$

Proof. At the end of the $j$-th iteration of a given cycle in DMBR(m) ($1 \le j \le m$) the following relations hold, due to relation (4), the definition of $\Lambda_{j+1}$ and the unitarity of $F_{j+1}$:
$$[\mathcal{V}_{j+1}\ P_j]\,\Lambda_{j+1} = \hat{\mathcal{V}}_{j+1} F_{j+1} F_{j+1}^H \hat{\Lambda}_j = \hat{\mathcal{V}}_{j+1} \hat{\Lambda}_j = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}] \begin{bmatrix} \Lambda_j \\ 0_{(k_j-n_j)\times p} \end{bmatrix} = [\mathcal{V}_j\ P_{j-1}]\,\Lambda_j.$$
Since by construction $R_0 D^{-1} = [V_1\ P_0]\,\Lambda_1$ (or equivalently $R_0 D^{-1} = \hat{V}_1 \hat{\Lambda}_0$), the proof is then complete.

Similarly we also deduce that $\hat{\Lambda}_j \in \mathbb{C}^{(s_j+p_j)\times p}$ represents the scaled block residual $R_0 D^{-1}$ in the $\hat{\mathcal{V}}_{j+1}$ basis (i.e. $R_0 D^{-1} = \hat{\mathcal{V}}_{j+1} \hat{\Lambda}_j$). We denote by $Y_j \in \mathbb{C}^{s_j\times p}$ the solution of the reduced minimization problem $P_r$:
$$P_r:\qquad Y_j = \mathrm{argmin}_{Y \in \mathbb{C}^{s_j\times p}} \|\hat{\Lambda}_j - \hat{\mathcal{H}}_j Y\|_F, \qquad (7)$$
and by $\hat{R}_j \in \mathbb{C}^{(s_j+p_j)\times p}$ the block quasi-residual $\hat{R}_j = \hat{\Lambda}_j - \hat{\mathcal{H}}_j Y_j$. We analyze in Proposition 2 the norm minimization property occurring in DMBR(m).

Proposition 2. In the deflated minimal block residual method (DMBR(m), Algorithm 2), solving the reduced minimization problem $P_r$ (7) amounts to minimizing the Frobenius norm of the block true residual $\|B - AX\|_F$ over the space $X_0 + \mathrm{range}(\mathcal{Z}_j Y D)$ at iteration $j$ ($1 \le j \le m$) of a given cycle, i.e.,
$$\mathrm{argmin}_{Y \in \mathbb{C}^{s_j\times p}} \|\hat{\Lambda}_j - \hat{\mathcal{H}}_j Y\|_F = \mathrm{argmin}_{Y \in \mathbb{C}^{s_j\times p}} \|B - A(X_0 + \mathcal{Z}_j Y D)\|_F \qquad (8)$$
$$\phantom{\mathrm{argmin}_{Y \in \mathbb{C}^{s_j\times p}} \|\hat{\Lambda}_j - \hat{\mathcal{H}}_j Y\|_F} = \mathrm{argmin}_{Y \in \mathbb{C}^{s_j\times p}} \|R_0 D^{-1} - A\mathcal{Z}_j Y\|_F. \qquad (9)$$
Proof. At the end of the $j$-th iteration of a given cycle of DMBR(m), due to Lemma 1 and to the flexible deflated Arnoldi relation (5), $\|R_0 D^{-1} - A\mathcal{Z}_j Y\|_F$ can be written as:
$$\|R_0 D^{-1} - A\mathcal{Z}_j Y\|_F = \left\| [\mathcal{V}_{j+1}\ P_j]\,\Lambda_{j+1} - [\mathcal{V}_{j+1}\ P_j]\,\mathcal{H}_j Y \right\|_F.$$
Since the Frobenius norm is unitarily invariant, the last equality becomes:
$$\|R_0 D^{-1} - A\mathcal{Z}_j Y\|_F = \|F_{j+1}^H \hat{\Lambda}_j - F_{j+1}^H \hat{\mathcal{H}}_j Y\|_F,$$
due to the definition of both $\Lambda_{j+1}$ and $\mathcal{H}_j$ (line 23 of Algorithm 2). This finally leads to:
$$\|R_0 D^{-1} - A\mathcal{Z}_j Y\|_F = \|\hat{\Lambda}_j - \hat{\mathcal{H}}_j Y\|_F, \qquad (10)$$
since $F_{j+1}$ is supposed to be unitary.

From Proposition 2 we deduce that the current approximate solution $X_j$ and the corresponding scaled block residual $R_j D^{-1} = (B - AX_j) D^{-1}$ at the end of the $j$-th iteration can be written respectively as:
$$X_j = X_0 + \mathcal{Z}_j Y_j D, \qquad R_j D^{-1} = \hat{\mathcal{V}}_{j+1} \hat{R}_j. \qquad (11)$$
The detection of convergence is explained next in Corollary 1.

Corollary 1. In the deflated minimal block residual method (DMBR(m), Algorithm 2), detecting the convergence on the block true residual is equivalent to detecting the convergence on the block quasi-residual in exact arithmetic:
$$\frac{\|B(:,l) - AX_j(:,l)\|_2}{\|B(:,l)\|_2} \le tol,\ \forall\, l \,|\, 1 \le l \le p \quad\Longleftrightarrow\quad \|\hat{R}_j(:,l)\|_2 \le tol,\ \forall\, l \,|\, 1 \le l \le p.$$

Proof. This is a direct consequence of Proposition 2 (relation (11)) and elementary properties of the Frobenius norm.

The detection of convergence is thus easy and cheap in terms of computational operations. Proposition 2 also implies the nonincreasing behaviour of the scaled block residual in the Frobenius norm in DMBR(m). The convergence of the scaled block residual norm is then monotone in the Frobenius norm, as stated in Proposition 3.

Proposition 3. Let $R_j \in \mathbb{C}^{n\times p}$ be the block true residual at the end of the $j$-th iteration in the deflated minimal block residual method (DMBR(m), Algorithm 2) ($1 \le j \le m$), and suppose that $\mathcal{Z}_j$ is of full column rank. Then the singular values of $R_j D^{-1}$ satisfy the inequality:
$$\sigma_i(R_j D^{-1}) \le \sigma_i(R_{j-1} D^{-1}), \qquad 1 \le i \le p. \qquad (12)$$
Proof. From Proposition 2 we deduce that the $l$-th column $R_j D^{-1}(:, l)$ of the current scaled block residual at iteration $j$ in a given cycle of DMBR(m) is obtained as the orthogonal projection of $R_0 D^{-1}(:, l)$ onto $(\mathrm{range}(A\mathcal{Z}_j))^{\perp}$. Thus $R_j D^{-1} = \Pi_j R_0 D^{-1}$, where $\Pi_j$ is the orthogonal projector onto $(\mathrm{range}(A\mathcal{Z}_j))^{\perp}$. If $W_j$ denotes an orthonormal basis of $\mathrm{range}(A\mathcal{Z}_j)$ we obtain:
$$R_j D^{-1} = (I_n - W_j W_j^H)\, R_0 D^{-1}.$$
Using the decomposition $W_j = [W_{j-1}\ \tilde{W}_j]$ (for $j > 1$) and the orthogonality $\tilde{W}_j^H W_{j-1} = 0$, the projector factorizes as
$$\Pi_j = I_n - W_{j-1} W_{j-1}^H - \tilde{W}_j \tilde{W}_j^H = (I_n - \tilde{W}_j \tilde{W}_j^H)(I_n - W_{j-1} W_{j-1}^H),$$
so that
$$R_j D^{-1} = (I_n - \tilde{W}_j \tilde{W}_j^H)\, R_{j-1} D^{-1}.$$
From [27, Theorem 3.3.16] we conclude that the singular values of the scaled block true residual are monotonically decreasing, i.e.,
$$\sigma_i(R_j D^{-1}) \le \sigma_i(R_{j-1} D^{-1}), \qquad \forall\, i \,|\, 1 \le i \le p.$$
Since $R_j D^{-1} = \hat{\mathcal{V}}_{j+1} \hat{R}_j$, we also deduce that the singular values of the block quasi-residual are monotonically decreasing, i.e.,
$$\sigma_i(\hat{R}_j) \le \sigma_i(\hat{R}_{j-1}), \qquad \forall\, i \,|\, 1 \le i \le p.$$

To the best of our knowledge, DMBR(m) is the first block flexible Krylov subspace method with deflation at each iteration ensuring a monotonically decreasing behaviour of the singular values of the block residual. This property will appear as particularly important when determining the block size reduction strategy, as discussed next in Section 2.4. We also note that the Euclidean norm of the true block residual is monotonically decreasing along convergence. This property can even be extended to any unitarily invariant norm; see [11, Section 3.3] for a similar proof related to block flexible Krylov subspace methods with deflation performed at restart only.
2.4 Subspace decomposition based on a singular value decomposition
We next address the main question occurring in the DMBR(m) method, i.e., the subspace decomposition: given $p_j$ and $\hat{\mathcal{V}}_{j+1} = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}]$ obtained after the $j$-th iteration of the deflated block Arnoldi method with block modified Gram-Schmidt (Algorithm 1) in a given cycle, we want to determine $k_{j+1}$, $d_{j+1}$ and the unitary matrix $F_{j+1} \in \mathbb{C}^{(s_j+p_j)\times(s_j+p_j)}$ such that the following decomposition holds:
$$[\mathcal{V}_{j+1}\ P_j] = \hat{\mathcal{V}}_{j+1}\, F_{j+1}. \qquad (13)$$
To limit the computational cost related to the construction of $\mathcal{V}_{j+1}$, we consider the following splitting $\mathcal{V}_{j+1} = [\mathcal{V}_j\ V_{j+1}]$, with $\mathcal{V}_j \in \mathbb{C}^{n\times s_j}$ obtained at the previous iteration and $V_{j+1} \in \mathbb{C}^{n\times k_{j+1}}$ to be determined. Thus the decomposition (13) can be written as:
$$[\mathcal{V}_j\ V_{j+1}\ P_j] = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}]\, F_{j+1}, \qquad (14)$$
with $P_j \in \mathbb{C}^{n\times d_{j+1}}$ and $k_{j+1} + d_{j+1} = p_j$. Given the block form for $F_{j+1}$
$$F_{j+1} = \begin{bmatrix} F_{11} & F_{12} \\ F_{21} & F_{22} \end{bmatrix},$$
where $F_{11} \in \mathbb{C}^{s_j\times s_j}$, $F_{12} \in \mathbb{C}^{s_j\times p_j}$, $F_{21} \in \mathbb{C}^{p_j\times s_j}$ and $F_{22} \in \mathbb{C}^{p_j\times p_j}$, the relation (14) becomes
$$[\mathcal{V}_j\ V_{j+1}\ P_j] = \left[\, \mathcal{V}_j F_{11} + [P_{j-1}\ \hat{V}_{j+1}] F_{21} \quad \mathcal{V}_j F_{12} + [P_{j-1}\ \hat{V}_{j+1}] F_{22} \,\right].$$
Since $\mathcal{V}_j^H [P_{j-1}\ \hat{V}_{j+1}] = 0_{s_j\times p_j}$, we deduce the following matrix structure:
$$F_{j+1} = \begin{bmatrix} I_{s_j} & 0_{s_j\times p_j} \\ 0_{p_j\times s_j} & F_j \end{bmatrix}, \qquad (15)$$
where the unitary matrix $F_j \in \mathbb{C}^{p_j\times p_j}$ remains to be determined. Assuming $p_j = p$, we next present two simple criteria to deduce $F_j$, $k_{j+1}$ and $d_{j+1}$, in Sections 2.4.1 and 2.4.2 respectively.

2.4.1 Determination of $F_{j+1}$
If we decompose the block quasi-residual $\hat{R}_j \in \mathbb{C}^{(s_j+p_j)\times p}$ as
$$\hat{R}_j = \begin{bmatrix} \hat{R}_{s_j} \\ \hat{R}_{p_j} \end{bmatrix},$$
with $\hat{R}_{s_j} \in \mathbb{C}^{s_j\times p}$ and $\hat{R}_{p_j} \in \mathbb{C}^{p_j\times p}$, the scaled block residual $R_j D^{-1}$ can be written as:
$$R_j D^{-1} = \mathcal{V}_j\, \hat{R}_{s_j} + [P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_{p_j}.$$
As information to postpone, the proposed strategy aims at determining a possible linear combination of the columns of $R_j D^{-1}$ that are almost dependent. This precisely corresponds to the set of directions that we do not want to consider when determining $V_{j+1}$ in $\mathcal{V}_{j+1} = [\mathcal{V}_j\ V_{j+1}]$. This leads us to compute the near nullspace $Z_{near} \in \mathbb{C}^{p\times l}$ of $R_j D^{-1}$ such that
$$\|R_j D^{-1} Z_{near}\|_2 \le \varepsilon_d\, tol, \quad \text{or equivalently} \quad \|\hat{\mathcal{V}}_{j+1} \hat{R}_j Z_{near}\|_2 \le \varepsilon_d\, tol,$$
where $\varepsilon_d$ is a real positive parameter less than or equal to one. To do so we consider the singular value decomposition of the block quasi-residual $\hat{R}_j$ as $\hat{R}_j = U \Sigma W^H$. We note that the thin singular value decomposition of $\hat{R}_j$ is rather inexpensive, since $\hat{R}_j$ does not depend on the problem size $n$. In practice we determine a subset of the singular values of $\hat{R}_j$ according to the following condition:
$$\sigma_l(\hat{R}_j) > \varepsilon_d\, tol \qquad \forall\, l \text{ such that } 1 \le l \le p_d. \qquad (16)$$
This leads to the following decomposition of the matrix $\Sigma \in \mathbb{C}^{p\times p}$:
$$\Sigma = \begin{bmatrix} \Sigma_+ & 0_{p_d\times(p-p_d)} \\ 0_{(p-p_d)\times p_d} & \Sigma_- \end{bmatrix}$$
with $\Sigma_+ \in \mathbb{C}^{p_d\times p_d}$ defined as $\Sigma_+ = \Sigma(1:p_d, 1:p_d)$ and $\Sigma_- \in \mathbb{C}^{(p-p_d)\times(p-p_d)}$ as $\Sigma_- = \Sigma(p_d+1:p, p_d+1:p)$. Due to the approximate deflation condition (16), we note that $\|\Sigma_+\|_2 > \varepsilon_d\, tol$ and $\|\Sigma_-\|_2 \le \varepsilon_d\, tol$. With $k_{j+1} = p_d$ and $d_{j+1} = p_j - k_{j+1}$, the block size reduction strategy then leads to the following decomposition of the block quasi-residual $\hat{R}_j$ at iteration $j$:
$$\hat{R}_j = [U_+\ U_-] \begin{bmatrix} \Sigma_+ & 0 \\ 0 & \Sigma_- \end{bmatrix} [W_+\ W_-]^H = U_+ \Sigma_+ W_+^H + U_- \Sigma_- W_-^H \qquad (17)$$
with $U_+ \in \mathbb{C}^{(s_j+p_j)\times k_{j+1}}$, $U_- \in \mathbb{C}^{(s_j+p_j)\times d_{j+1}}$, $\Sigma_+ \in \mathbb{C}^{k_{j+1}\times k_{j+1}}$, $\Sigma_- \in \mathbb{C}^{d_{j+1}\times d_{j+1}}$, $W_+ \in \mathbb{C}^{p\times k_{j+1}}$ and $W_- \in \mathbb{C}^{p\times d_{j+1}}$, where $p_j = k_{j+1} + d_{j+1}$. $U_+$, $W_+$ and $\Sigma_+$ denote the quantities effectively considered as important for the convergence at iteration $j$ of a given cycle of Algorithm 2, while $U_-$, $W_-$ and $\Sigma_-$ are deflated (postponed). Indeed, since $W = [W_+, W_-]$ is unitary, it is straightforward to see from (17) that $\|R_j D^{-1} W_-\|_2 \le \varepsilon_d\, tol$. Thus $W_-$ corresponds to the near nullspace of $R_j D^{-1}$. If deflation is active, only $k_{j+1}$ Krylov directions will be considered in the deflated block Arnoldi procedure at the next iteration, which may yield a significant reduction in terms of computational operations, since $d_{j+1}$ directions are postponed (i.e. kept and reintroduced later in subsequent iterations if necessary). An important consequence of Proposition 3 is that the nonincreasing behaviour of the singular values of the block quasi-residual implies a nonincreasing behaviour of $k_{j+1}$ along convergence, due to the approximate deflation condition (16). We obtain the set of active directions as
$$R_j D^{-1} W_+ = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_j W_+, \qquad (18)$$
and similarly the set of directions that are postponed as
$$R_j D^{-1} W_- = [\mathcal{V}_j\ P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_j W_-.$$
Since
$$[V_{j+1}\ P_j] = [P_{j-1}\ \hat{V}_{j+1}]\, F_j, \qquad (19)$$
$R_j D^{-1} W$ can be expressed as
$$R_j D^{-1} W = \mathcal{V}_j\, \hat{R}_{s_j} W + [P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_{p_j} W = \mathcal{V}_j\, \hat{R}_{s_j} W + [V_{j+1}\ P_j]\, F_j^H \hat{R}_{p_j} W.$$
Thus the unitary matrix $F_j \in \mathbb{C}^{p_j\times p_j}$ must satisfy the equality:
$$[V_{j+1}\ P_j]\, F_j^H \hat{R}_{p_j} W = [P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_{p_j} W.$$
We then perform the QR factorization of $\hat{R}_{p_j} W$ and choose $F_j$ as the corresponding orthogonal factor. This also implies that
$$\mathrm{range}(V_{j+1}) = \mathrm{range}([P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_{p_j} W_+), \qquad \mathrm{range}(P_j) = \mathrm{range}([P_{j-1}\ \hat{V}_{j+1}]\, \hat{R}_{p_j} W_-),$$
that is, the $k_{j+1}$ directions associated with $(I_n - \mathcal{V}_j \mathcal{V}_j^H) R_j D^{-1} W_+$ (the kept ones) lie in $V_{j+1}$, while the $d_{j+1}$ directions associated with $(I_n - \mathcal{V}_j \mathcal{V}_j^H) R_j D^{-1} W_-$ (the deflated ones) lie in $P_j$. Algorithm 3 details this strategy.

Algorithm 3 Determination of $k_{j+1}$, $d_{j+1}$ and of $F_{j+1}$
1: Choose a deflation threshold $\varepsilon_d$
2: Compute the SVD of $\hat{R}_j$ as $\hat{R}_j = U \Sigma W^H$, with $U \in \mathbb{C}^{(s_j+p_j)\times p}$, $\Sigma \in \mathbb{C}^{p\times p}$ and $W \in \mathbb{C}^{p\times p}$
3: Select the $p_d$ singular values of $\hat{R}_j$ such that $\sigma_l(\hat{R}_j) > \varepsilon_d\, tol$ for all $l$ such that $1 \le l \le p_d$
4: Set $k_{j+1} = p_d$ and $d_{j+1} = p_j - k_{j+1}$
5: Define $\hat{R}_{p_j} \in \mathbb{C}^{p_j\times p}$ as $\hat{R}_{p_j} = \hat{R}_j(s_j+1 : s_j+p_j,\ 1 : p)$
6: Compute the QR decomposition of $\hat{R}_{p_j} W$ as $\hat{R}_{p_j} W = F_j T_j$, with $F_j \in \mathbb{C}^{p_j\times p_j}$, $F_j^H F_j = I_{p_j}$
7: Define $F_{j+1} \in \mathbb{C}^{(s_j+p_j)\times(s_j+p_j)}$ as $F_{j+1} = \begin{bmatrix} I_{s_j} & 0_{s_j\times p_j} \\ 0_{p_j\times s_j} & F_j \end{bmatrix}$

Remark 2. When no deflation occurs (i.e. $k_{j+1} = p_j$) at iteration $j$, a variant of Algorithm 3 consists in choosing $F_{j+1}$ equal to $I_{s_j+p_j}$. This simply aims at avoiding the operation $\hat{\mathcal{V}}_{j+1} F_{j+1}$, which is expensive when $n$ is large.

We conclude this section by showing a proposition later used in Section 4.

Proposition 4. At the end of the $j$-th iteration of the deflated minimal block residual method (DMBR(m), Algorithm 2) in a given cycle, the following relations hold:
$$\mathrm{rank}([V_{j+1}\ P_j]) = \mathrm{rank}([P_{j-1}\ \hat{V}_{j+1}]) = p_j, \qquad \mathrm{rank}(\hat{\mathcal{V}}_{j+1}) = \mathrm{rank}([\mathcal{V}_{j+1}\ P_j]) = s_j + p_j.$$
If additionally we suppose that each preconditioner $M_i^{-1}$ ($1 \le i \le j$) has been chosen such that $\mathrm{rank}(\mathcal{Z}_j) = \mathrm{rank}(\mathcal{V}_j)$, then
$$\mathrm{rank}(\hat{\mathcal{H}}_j) = \mathrm{rank}(\mathcal{H}_j) = s_j.$$
Proof. Each $\hat{V}_i$ is of full rank and orthogonal to $P_{i-2}$ by construction ($1 < i \le m$). To show that each $P_i$ is of full rank, we use a simple induction, noting that $P_0$ is of full rank by construction. If $P_{i-1}$ is of full rank, then $[P_{i-1}\ \hat{V}_{i+1}]$ is of full rank, and since $F_i$ is unitary, $[P_{i-1}\ \hat{V}_{i+1}] F_i$ will be of full rank, which completes the induction due to relation (19). Since $\mathrm{rank}(\mathcal{V}_j) = \sum_{l=1}^{j} k_l = s_j$, we deduce that $\mathrm{rank}(\hat{\mathcal{V}}_{j+1}) = s_j + p_j$. Supposing that each variable preconditioner $M_i^{-1}$ ($1 \le i \le j$) was chosen such that $\mathrm{rank}(\mathcal{Z}_j) = \mathrm{rank}(\mathcal{V}_j)$, we find that $\mathrm{rank}(\hat{\mathcal{V}}_{j+1}\hat{\mathcal{H}}_j) = \mathrm{rank}(A\mathcal{Z}_j) = s_j$. Since $\hat{\mathcal{V}}_{j+1}$ is of full rank, $\mathrm{rank}(\hat{\mathcal{H}}_j) = s_j$, which concludes the proof.
We note that although $\mathcal{V}_j$ is unconditionally of full rank, we cannot guarantee that the blocks $Z_i$ ($1 \le i \le j$) will be linearly independent of each other (although the regularity of each $M_i$ ensures that each individual $Z_i$ is of full rank). A simple way to guarantee this property is to use a constant preconditioner $M$. In this case $\mathcal{Z}_j = M^{-1}\mathcal{V}_j$, and the regularity of $M$ guarantees that $\mathcal{Z}_j$ is of full column rank.
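A compact NumPy sketch of the selection performed by Algorithm 3 (and by Algorithm 4 of the next section when a finite p_f is supplied) is given below; it assumes p_j = p and no breakdown, and all identifiers are our own.

```python
import numpy as np

def deflation_matrices(R_hat, sj, pj, eps_d, tol, pf=None):
    """Algorithms 3-4: determine k_{j+1}, d_{j+1} and the unitary F_{j+1}."""
    U, sigma, Wh = np.linalg.svd(R_hat)          # R_hat = U diag(sigma) W^H
    pd = int(np.sum(sigma > eps_d * tol))        # condition (16)
    k = pd if pf is None else min(pd, pf)        # truncated variant: Algorithm 4
    d = pj - k
    Rp = R_hat[sj:sj + pj, :]                    # trailing p_j rows of R_hat
    Fj, _ = np.linalg.qr(Rp @ Wh.conj().T)       # QR of R_hat_{p_j} W -> unitary F_j
    F = np.eye(sj + pj, dtype=complex)           # F_{j+1} = blkdiag(I_{s_j}, F_j)
    F[sj:, sj:] = Fj
    return k, d, F

# toy usage: s_j = 4, p_j = p = 3, two singular values above eps_d * tol
rng = np.random.default_rng(0)
R_hat = rng.standard_normal((7, 3)) * np.array([1.0, 1e-2, 1e-9])
k, d, F = deflation_matrices(R_hat, sj=4, pj=3, eps_d=1.0, tol=1e-6)
```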
2.4.2 Variant based on truncation
A variant of DMBR(m) that simultaneously combines deflation and truncation at each iteration is proposed next. Truncation here consists in deciding once and for all the maximal number of directions to be effectively considered along convergence, a parameter called $p_f$ ($1 \le p_f \le p$). Thus when using truncation the inequality $k_{j+1} \le p_f$ is imposed at iteration $j$. This mainly aims at reducing the computational cost of the given iteration, since at most $p_f$ matrix-vector products and preconditioner applications will be performed. Of course the non-truncated variant of DMBR(m) is simply recovered if $p_f = p$. Consequently truncation just implies a modified selection of $k_{j+1}$ and $d_{j+1}$, as described next in Algorithm 4.
Algorithm 4 Determination of $k_{j+1}$, $d_{j+1}$ and of $F_{j+1}$ (variant based on truncation)
1: Choose a deflation threshold $\varepsilon_d$ and $p_f$ ($1 \le p_f \le p$)
2: Compute the SVD of $\hat{R}_j$ as $\hat{R}_j = U \Sigma W^H$, with $U \in \mathbb{C}^{(s_j+p_j)\times p}$, $\Sigma \in \mathbb{C}^{p\times p}$ and $W \in \mathbb{C}^{p\times p}$
3: Select the $p_d$ singular values of $\hat{R}_j$ such that $\sigma_l(\hat{R}_j) > \varepsilon_d\, tol$ for all $l$ such that $1 \le l \le p_d$
4: Set $k_{j+1} = \min(p_d, p_f)$ and $d_{j+1} = p_j - k_{j+1}$
5: Define $\hat{R}_{p_j} \in \mathbb{C}^{p_j\times p}$ as $\hat{R}_{p_j} = \hat{R}_j(s_j+1 : s_j+p_j,\ 1 : p)$
6: Compute the QR decomposition of $\hat{R}_{p_j} W$ as $\hat{R}_{p_j} W = F_j T_j$, with $F_j \in \mathbb{C}^{p_j\times p_j}$, $F_j^H F_j = I_{p_j}$
7: Define $F_{j+1} \in \mathbb{C}^{(s_j+p_j)\times(s_j+p_j)}$ as $F_{j+1} = \begin{bmatrix} I_{s_j} & 0_{s_j\times p_j} \\ 0_{p_j\times s_j} & F_j \end{bmatrix}$

With the notation of Algorithm 4, the block size reduction strategy in the truncated case leads to the following decomposition of $\Sigma$ at iteration $j$ when $p_d > p_f$:
$$\Sigma = \begin{bmatrix} \Sigma_{+,p_f} & 0_{p_f\times(p_d-p_f)} & 0_{p_f\times(p-p_d)} \\ 0_{(p_d-p_f)\times p_f} & \Sigma_{+,p_d-p_f} & 0_{(p_d-p_f)\times(p-p_d)} \\ 0_{(p-p_d)\times p_f} & 0_{(p-p_d)\times(p_d-p_f)} & \Sigma_{-,p-p_d} \end{bmatrix}$$
with $\Sigma_{+,p_f} \in \mathbb{C}^{p_f\times p_f}$, $\Sigma_{+,p_d-p_f} \in \mathbb{C}^{(p_d-p_f)\times(p_d-p_f)}$ and $\Sigma_{-,p-p_d} \in \mathbb{C}^{(p-p_d)\times(p-p_d)}$, where $p_j = k_{j+1} + d_{j+1}$. We note that due to truncation the inequality $\|\Sigma_-\|_2 \le \varepsilon_d\, tol$ does not hold when $p_d > p_f$: combinations of residuals that have not approximately converged are indeed deflated. Nevertheless no information is discarded; this is the major difference with BFGMREST(m), a flexible variant of BFGMRES(m) based on deflation and truncation performed at the restart only [11]. Thus, due to truncation, the deflated minimal block residual method may require more preconditioner applications to converge than its non-truncated version. However this drawback has to be balanced with the reduced computational cost of the iterations corresponding to the situation $p_d > p_f$.
2.5 Computational cost and memory requirements
As shown in Algorithms 1-4, deflation at each iteration induces additional operations with respect to methods incorporating deflation at restart only. Hence the question of the total computational cost of the new method has to be addressed. For that purpose we summarize in Table 1 the costs occurring during a given cycle of DMBR(m) (considering Algorithms 1, 2, 3 or 4), excluding matrix-vector products and preconditioning operations, which are problem dependent. We have included the costs proportional to both the size of the original problem $n$ and the maximal number of right-hand sides $p$, assuming a QR factorization based on modified Gram-Schmidt and a Golub-Reinsch SVD (the Golub-Reinsch SVD decomposition $R = U \Sigma V^H$ with $R \in \mathbb{C}^{m\times n}$ requires $4mn^2 + 8n^3$ operations when only $\Sigma$ and $V$ have to be computed); see, e.g., [21, Section 5.4.5] and [26, Appendix C] for further details on operation counts. The total cost of a given cycle is then found to grow as $C_1 np^2 + C_2 p^3 + C_3 np$, and we note that this cost is always nonincreasing along convergence due to block size reduction. Compared to methods including deflation at restart only, the additional operations are related to the computations of $F_1$, $\Lambda_1$, $F_{j+1}$, $[\mathcal{V}_{j+1}\ P_j]$, $\Lambda_{j+1}$ and $\mathcal{H}_j$, operations that behave as $p^3$ and $np^2$ respectively. The computation of $[\mathcal{V}_{j+1}\ P_j]$ is in practice the most expensive one in a given iteration of DMBR(m). Concerning the truncated variant, the computational cost of a cycle will be reduced only when $p_d > p_f$, since the upper bound on $k_{j+1}$ will then be active. This situation occurs at the beginning of the convergence due to the nonincreasing behaviour of the singular values of the block quasi-residual.
Step                                      | Computational cost
Computation of $R_0 D^{-1}$               | $np$
QR factorization of $R_0 D^{-1}$          | $2np^2 + np$
Computation of $F_1$                      | $4p_0 p^2 + 8p^3 + 2p_0^3$
Computation of $[V_1\ P_0]$               | $2np_0^2$
Computation of $\Lambda_1$                | $2p_0^2 p$
Block Arnoldi procedure (1)               | $C_j$
Computation of $Y_j$                      | $2(s_j+p_j)s_j^2 + ps_j^2$
Computation of $\hat{R}_j$                | $(s_j+p_j)p + 2(s_j+p_j)s_j p$
Computation of $F_{j+1}$                  | $4(s_j+p_j)p^2 + 8p^3 + 2p_j^3$
Computation of $[\mathcal{V}_{j+1}\ P_j]$ | $2np_j^2$
Computation of $\Lambda_{j+1}$            | $2p_j^2 p$
Computation of $\mathcal{H}_j$            | $2p_j^2 p$
Computation of $X_m$                      | $np + 2ns_m p + s_m p$

Table 1: Computational cost of a cycle of DMBR(m) (Algorithm 2). This excludes the cost of matrix-vector operations and preconditioning operations.

(1) The block Arnoldi method based on modified Gram-Schmidt requires $\sum_{j=1}^m \sum_{i=1}^j (4nk_ik_j + nk_j + 4nd_jk_j)$ operations (lines 6 to 11) plus $\sum_{j=1}^m 2nk_j^2$ operations for the QR decomposition of $S$ (line 13); thus, at iteration $j$, $C_j = \sum_{i=1}^j (4nk_ik_j + nk_j + 4nd_jk_j) + 2nk_j^2$.

Concerning storage proportional to the problem size $n$, DMBR(m) requires $R_m$, $X_0$, $X_m$, $\mathcal{V}_{m+1}$ and $\mathcal{Z}_m$ respectively, leading to a memory requirement of $2ns_m + np_m + 3np$ at the end of a given cycle. Since $s_m$ varies from cycle to cycle, an upper bound of the memory requirement can be given as $n(2m+1)p_0 + 3np$ when $p_0$ linear systems have to be considered at the beginning of a given cycle. We note that the storage is monotonically decreasing along convergence, a feature that can for instance be exploited if dynamic memory allocation is used.
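To illustrate how the entries of Table 1 combine over a cycle, here is a rough bookkeeping function; it is our own simplification keeping only the dominant n-proportional and p^3 terms, not the authors' cost model.

```python
def dmbr_cycle_cost(n, p, k):
    """Crude flop model of one DMBR(m) cycle, following Table 1 but keeping
    only dominant terms (our simplification). k[j] is the number of Krylov
    directions kept at iteration j (k[j] = p means no deflation); matrix-vector
    products and preconditioning are excluded, exactly as in the table."""
    cost = 2 * n * p**2 + 2 * n * p + 10 * p**3       # R0 D^{-1}, its QR, F_1
    s, pj = 0, p                                      # no breakdown: p_j = p
    for kj in k:
        cost += 4 * n * (s + pj) * kj + 2 * n * kj**2  # Arnoldi sweep + QR of S
        s += kj
        cost += 2 * (s + pj) * s**2 + p * s**2         # reduced least-squares Y_j
        cost += 4 * (s + pj) * p**2 + 10 * p**3        # F_{j+1} (SVD + QR)
        cost += 2 * n * pj**2 + 4 * pj**2 * p          # new basis, Lambda_{j+1}, H_j
    return cost + 2 * n * s * p                        # final update X_m

# deflating from p = 8 down to 2 active directions cheapens the cycle markedly
print(dmbr_cycle_cost(10**6, 8, [8, 8, 6, 4, 2]) < dmbr_cycle_cost(10**6, 8, [8] * 5))
```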
3 Connections with existing methods
We briefly discuss possible connections between different GMRES-based block Krylov subspace methods including deflation and truncation at the beginning of the cycle or at each iteration. A common framework to introduce these methods is sketched as:
$$[V_1\ P_0] = \hat{V}_1\, T_0, \qquad [V_{j+1}\ P_j] = [P_{j-1}\ \hat{V}_{j+1}]\, T_j,$$
with $P_{j-1} \in \mathbb{C}^{n\times d_j}$ and $[P_{j-1}\ \hat{V}_{j+1}] \in \mathbb{C}^{n\times p_j}$, where the transformation matrices $T_0$ and $T_j$ will be specified later for each strategy. $T_0$ is related to the decomposition at the beginning of a cycle, whereas $T_j$ is related to the $j$-th iteration of a given cycle. Table 2 summarizes the main properties of block GMRES-based methods. With the notation of Section 2 we specify $k_j$ (the number of directions effectively considered), $d_j$ (the number of directions that are postponed), the transformation matrices $T_0$ and $T_j$, and the dimension of the minimization problem to be solved at the $j$-th iteration of each strategy.

Method     | Reference | $k_j$                  | $d_j$ | $p_{j-1}$   | $T_0$              | $T_j$     | min $\|\cdot\|_F$
BGMRES     | [45]      | $p$                    | $0$   | $p$         | $I_p$              | $I_p$     | $jp \times p$
BlMResDefl | [23]      | $p_0$                  | $0$   | $p_0$       | $I_{p_0}$          | $I_{p_0}$ | $jp_0 \times p_0$
BFGMREST   | [11]      | $k_1 = \min(p_d, p_f)$ | $0$   | $k_1$       | $F_0(1:p_0, 1:k_1)$ | $I_{k_1}$ | $jk_1 \times k_1$
BGMRES-R   | [36]      | $k_j$                  | $d_j$ | $k_j + d_j$ | $I_p$              | $F_j$     | $s_j \times p$

Table 2: Comparison between standard restarted block GMRES-based methods (BGMRES(m)), methods including deflation at restart only (BlMResDefl(m), BFGMREST(m)) or at each iteration (BGMRES-R(m)).
3.1 Connections with methods including deflation or truncation at restart
The method named BlMResDefl in [23] uses a rank-revealing QR factorization based on a threshold $\varepsilon_{qr}$ to perform initial residual deflation and to determine the rank of the block true residual ($p_0$) at the beginning of the cycle. Block size reduction will then be active only when $p_0 < p$. BlMResDefl incorporates Arnoldi deflation [23] as well, relying again on a RRQR. Nevertheless it is reported that in practice such a deflation never becomes active [23, §15]. Table 2 reveals that BFGMREST(m), which combines residual deflation and truncation at restart, discards information when $\min(p_d, p_f) < p$, leading to a non-square $T_0$ transformation matrix. In such a case $T_0$ is directly obtained as the $p_0 \times k_1$ principal block $F_0(1:p_0, 1:k_1)$ once the QR factorization of the scaled block residual $R_0 D^{-1}$ is performed. Hence the minimization problem in BFGMREST(m) at the $j$-th iteration of a given cycle has a reduced size. We may thus discard part of the information that could be considered as useful for convergence. Moreover, DMBR(m) selects the most important Krylov directions at each iteration, while BFGMREST chooses such directions at the beginning of the cycle only. Since DMBR(m) does not discard any information, the behaviour of the truncated variants of DMBR(m) and BFGMREST(m) is thus expected to be different. This will be confirmed in Section 5.2, devoted to numerical experiments.
3.2 Connections with methods including deflation or truncation at each iteration
We now consider methods including deflation at each iteration. As far as we know the first method incorporating such a strategy is due to Robbé and Sadkane [36] (R-criterion, Algorithm 2), named later BGMRES-R(m). As shown in Table 2, BGMRES-R(m) has considerable algebraic similarities with DMBR(m), although the initial formulations of both methods are significantly different. One noticeable difference between both algorithms lies in the fact that BGMRES-R(m) does not perform any deflation at the beginning of the cycle ($T_0$), nor any reordering on $\Lambda_j$. Interestingly enough, algebraically speaking, BFGMREST(m) and BGMRES-R(m) employ a very similar technique to deflate, although they are opposite in the sense that the former deflates only at the beginning of the cycle (BFGMREST(m)) while the latter deflates only inside the cycle (BGMRES-R(m)). Note also that because $k_1 = p_0$ in BGMRES-R(m), we have $\Lambda_{j+1} = F_{j+1}^H \hat{\Lambda}_j$, and it is thus not necessary for BGMRES-R(m) to reorder any information in $\Lambda_j$. Also, in [36] BGMRES-R(m) is said to satisfy an inexact Arnoldi relation
$$A\mathcal{Z}_j = \mathcal{V}_{j+1} L_j + \tilde{Q}_j$$
(therein established without preconditioner, thus $\mathcal{Z}_j = \mathcal{V}_j$), where $\tilde{Q}_j = P_j G_j$, with $P_j \in \mathbb{C}^{n\times d_{j+1}}$ and $G_j \in \mathbb{C}^{d_{j+1}\times s_j}$ of full rank. We obtain such a relation by simply defining $L_j \in \mathbb{C}^{(s_j+k_{j+1})\times s_j}$ as the first $s_j + k_{j+1}$ rows of $\mathcal{H}_j$ and $G_j \in \mathbb{C}^{d_{j+1}\times s_j}$ as the last $d_{j+1}$ rows of $\mathcal{H}_j$. The flexible block Arnoldi relation (3) then becomes:
$$A\mathcal{Z}_j = [\mathcal{V}_{j+1}\ P_j] \begin{bmatrix} L_j \\ G_j \end{bmatrix}.$$
It is beyond the scope of this manuscript, though, to show a formal proof of the algebraic equivalence of BGMRES-R(m) and DMBR(m) for this special choice of $F_{j+1}$. To conclude, we would like to emphasize that the formulations of BFGMREST(m), BGMRES-R(m) and DMBR(m) are quite independent. We have just shown that, in the framework of the deflated minimal block residual method, we can clearly identify the common points between those methods. Indeed by definition DMBR(m) combines the best of both strategies and can be considered as a generalization of both methods. Moreover a flexible variant of DMBR(m) is proposed, together with a variant based on truncation.
Furthermore, DMBR(m) is flexible enough to accommodate alternative deflation criteria while maintaining several of its convergence properties shown in Section 2.3.2. In this manuscript we have proposed a deflation based on singular values, but DMBR(m) is definitely not restricted to this particular choice. We plan to also investigate a decomposition based on angles between subspaces, as proposed in [15, 41].
4 Breakdown and convergence in exact arithmetic
In this section only we consider exact arithmetic and focus on a phenomenon that is related to the convergence of DMBR(m) - the breakdown - a notion introduced in Definition 2. This study is primarily of theoretical interest, since breakdowns are associated with the computation of the exact solution (or a linear combination of exact solutions), and the method tends to exhibit a full convergence before the occurrence of any partial breakdown in most practical cases. First we define the notions of partial (full) convergence and partial (full) breakdown respectively.

Definition 1 (Partial and full convergence). If there exists an orthonormal matrix $W \in \mathbb{C}^{p\times l}$ such that $\|R_j D^{-1} W\|_F \le \varepsilon$, where $\varepsilon$ is a given convergence threshold, we say that $l$ partial convergences have been detected. If $l = p$ we say that a full convergence has been detected instead.

Definition 2 (Partial and full breakdown). We say that $p - p_j$ partial breakdowns have been detected whenever $p_j < p$ (line 7 of Algorithm 2 or line 15 of Algorithm 1). If $p_j = 0$ we say that a full breakdown has been detected.

We note that the occurrence of a breakdown is always associated with "a missing column in $\hat{V}_{j+1}$" (also at line 7 of Algorithm 2, where the initial residual rank deficiency "shrinks" the number of columns in $\hat{\mathcal{V}}_1 = \hat{V}_1$). This does not necessarily characterize a breakdown in the sense that the algorithm cannot proceed with its execution, since the thin QR decomposition always exists; but, as shown in Proposition 5, the situation is quite similar to the one known in the single right-hand side case [38]. Later we consider that normally $p_j = p$ holds at each iteration, since we are referring to exact rank deficiency.

Proposition 5. In the deflated minimal block residual method (DMBR(m), Algorithm 2), if each preconditioner $M_i$ ($1 \le i \le j$) has been chosen such that $\mathrm{rank}(\mathcal{Z}_j) = \mathrm{rank}(\mathcal{V}_j)$, the exact solution of at least $p - p_j$ linear combinations of the systems is already known at the end of the $j$-th iteration.

Proof. Due to the assumption of rank preservation of $\mathcal{Z}_j$ we know from Proposition 4 that $\mathrm{rank}(\hat{\mathcal{H}}_j) = s_j$ and that $\hat{\mathcal{V}}_{j+1}$ is of full rank. Thus we obtain the relation:
$$\mathrm{rank}(R_j D^{-1}) = \mathrm{rank}\big((I_{s_j+p_j} - \hat{\mathcal{H}}_j \hat{\mathcal{H}}_j^{\dagger})\, \hat{\Lambda}_j\big),$$
where $\hat{\mathcal{H}}_j^{\dagger}$ denotes the pseudo-inverse of $\hat{\mathcal{H}}_j$. $I_{s_j+p_j} - \hat{\mathcal{H}}_j \hat{\mathcal{H}}_j^{\dagger}$ is the orthogonal projector onto the orthogonal complement of $\mathrm{range}(\hat{\mathcal{H}}_j)$, whose dimension is $p_j$. In such a case the projection of the $p$ columns of $\hat{\Lambda}_j$ onto such a subspace cannot span a subspace of dimension larger than $p_j$, and we obtain the inequality:
$$\mathrm{rank}\big((I_{s_j+p_j} - \hat{\mathcal{H}}_j \hat{\mathcal{H}}_j^{\dagger})\, \hat{\Lambda}_j\big) \le p_j,$$
proving that if $p_j < p$, the scaled block true residual is rank deficient. Thus there exists an orthonormal matrix $W$ such that $R_j D^{-1} W = B D^{-1} W - A X_j D^{-1} W = 0$, where $W \in \mathbb{C}^{p\times l}$ with $l = \mathrm{nul}(R_j D^{-1}) \ge p - p_j$, proving the proposition.

Corollary 2. At the $j$-th iteration of DMBR(m) the inequality $\mathrm{rank}(R_j D^{-1}) \le p_j$ is satisfied.

Corollary 3. If a full breakdown has occurred at the $j$-th iteration of DMBR(m), then there exists $X_s \in \mathbb{C}^{n\times p}$ such that $B - AX_s = 0$.

Remark 3. We would like to emphasize that the occurrence of partial breakdowns does not stop the execution of Algorithm 2. In fact, once a partial breakdown has been detected, the subsequent iterations will detect a partial breakdown as well, since $p_j$ is defined as $p_j = p_{j-1} - n_j$. Only the full breakdown situation forces Algorithm 2 to stop, since $k_{j+1}$ can not be determined due to the violation of the condition $1 \le k_{j+1} \le p_j = 0$ (see line 8 or line 20 of Algorithm 2).

Proposition 6. We consider a fixed preconditioner $M$ and the case of convergence toward the exact solution $X_\star$ ($\varepsilon = 0$). We define $\lambda = \sum_{k=1}^{p} \lambda_k$, where $\lambda_k$ is the degree of the minimal polynomial of $AM^{-1}$ with respect to $R_0 D^{-1} e_k$. The deflated minimal block residual method (DMBR(m), Algorithm 2) finds $X_s \in \mathbb{C}^{n\times p}$ such that $B - AX_s = 0$ in $s$ iterations, with $s \le \min(n, \lambda)$.

Proof. When a constant preconditioning matrix $M$ is used, the deflated minimal block residual method builds a subspace of the block Krylov subspace $\mathcal{K}_j(AM^{-1}, R_0 D^{-1})$ after $j$ iterations. We note that $\dim(\mathcal{K}_i(AM^{-1}, R_0 D^{-1})) \le \lambda$ for any $i$ (see [24] for instance). Because $\hat{\mathcal{V}}_{j+1}$ is always of full rank (Proposition 4), if $\lambda < n$ we know that there exists $s \le \lambda$ such that $\mathrm{range}(\hat{\mathcal{V}}_s) = \mathcal{K}(AM^{-1}, R_0 D^{-1})$, where $\mathcal{K}(AM^{-1}, R_0 D^{-1})$ is the entire block Krylov subspace of $A$ with respect to $R_0 D^{-1}$ and contains the exact solution $X_\star$ (see [24, Theorem 9]). Otherwise, if $n \le \lambda$, Proposition 4 indicates that $\mathcal{V}_s$ spans the entire space $\mathbb{C}^n$, and therefore $X_s$ is contained in $X_0 + \mathrm{range}(\mathcal{Z}_s)$.
Both Propositions 5 and 6 are important to justify that DMBR(m) will always converge to the exact solution in at most $n$ iterations if the restart parameter $m$ is large enough. However, in practice we are not looking for the exact solution (nor can we afford a restart parameter as large as $n$ or $\lambda$ in most cases) but for an approximate solution based on the convergence criterion $\|R_j D^{-1}\|_F \le \varepsilon$. Unless we consider a convergence threshold $\varepsilon$ close to the machine precision, the partial breakdown phenomenon is rather uncommon, since the method tends to find an approximate solution satisfying $\|R_j D^{-1}\|_F \le \varepsilon$ before the block true residual shows an exact rank deficiency.
5 Numerical experiments
In this section we investigate the numerical behaviour of block flexible Krylov subspace methods including deflation at each iteration on two different problems related to wave propagation phenomena, where the multiple right-hand side situation frequently occurs. The first illustration focuses on an academic example, while the second is related to a challenging realistic application in geophysics. The source terms correspond to Dirac sources in both examples. Thus the block right-hand side $B \in \mathbb{C}^{n\times p}$ is extremely sparse (only one nonzero element per column) and the initial block residual corresponds to a full rank matrix. We compare both BFGMRES-R(m) and DMBR(m) with various preconditioned iterative methods based on flexible (block) GMRES(m) for the solution of these two problems, with a zero initial guess $X_0$ and a moderate value of the restart parameter $m$. The iterative procedures are stopped when the following condition is satisfied:
$$\frac{\|B(:,l) - AX(:,l)\|_2}{\|B(:,l)\|_2} \le tol, \qquad \forall\, l = 1, \ldots, p. \qquad (20)$$
Our goal is to analyze the performance of block Krylov subspace methods including deflation at each iteration versus the classical methods discussed in Section 3. A primary concern will be to evaluate whether DMBR(m) can be efficient when solving problems with multiple right-hand sides, both in terms of preconditioner applications and total computational cost. Finally, we fix the deflation parameter $\varepsilon_d$ of Algorithms 3 and 4 to 1 in these experiments.
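The stopping test (20) amounts to p scaled column norms; a short NumPy helper (our own, for illustration) makes this explicit.

```python
import numpy as np

def converged(A, X, B, tol):
    """Stopping criterion (20): every column must satisfy
    ||B(:,l) - A X(:,l)||_2 <= tol * ||B(:,l)||_2."""
    R = B - A @ X
    return bool(np.all(np.linalg.norm(R, axis=0) <= tol * np.linalg.norm(B, axis=0)))
```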
5.1 Complex-valued advection diffusion reaction problem
We first consider a complex-valued partial differential equation of advection diffusion reaction type in two dimensions, defined on a square domain $\Omega = [0, 1]^2$. Recently Haber and MacLachlan [25] have proposed a continuous transformation of the acoustic wave equation based on the Rytov decomposition that requires the solution of a partial differential equation that is more amenable to efficient numerical methods than the original indefinite Helmholtz equation. This equation with Dirichlet boundary conditions reads as follows:
$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} - 2i\omega c \left( \alpha_x \frac{\partial u}{\partial x} + \alpha_y \frac{\partial u}{\partial y} \right) + \omega^2 (c^2 - \kappa^2)\, u = g_s(x), \qquad x = (x, y) \in \Omega, \qquad (21)$$
$$u = 0 \quad \text{on } \partial\Omega, \qquad (22)$$
where $\omega$, $c$, $\alpha_x$, $\alpha_y$, $\kappa$ are real-valued coefficients. The source term $g_s(x) = \delta(x - x_s)\, e^{-ic(\alpha_x x_s + \alpha_y y_s)}$ represents a harmonic point source located at $(x_s, y_s)$ in $\Omega$. We consider the pure advection diffusion case ($c = \kappa = 1$) and set the following values for the parameters: $\omega = \pi$, $\alpha_x = 1/\sqrt{2}$, $\alpha_y = 1/\sqrt{2}$. The discrete problem is obtained after second-order finite difference discretization of (21), with a second-order upwind scheme for the treatment of the advection terms as in [25]. The inner preconditioner is based on one cycle of unpreconditioned GMRES(m), corresponding to $pm$ additional matrix-vector products when considering a linear system with $p$ right-hand sides. Thus flexible outer Krylov subspace methods are required, since the preconditioner is then variable. With this simple preconditioner we note that we can derive a purely matrix-free implementation. The tolerance is set to $tol = 10^{-6}$ in these numerical experiments, performed in MATLAB.
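For readers who wish to set up a similar test case, the following SciPy sketch assembles a close cousin of the discrete operator: it discretizes (21) on a uniform grid with homogeneous Dirichlet conditions, but uses centered differences for the advection terms where the paper uses a second-order upwind scheme, so the matrix is an approximation of the actual test operator.

```python
import numpy as np
import scipy.sparse as sp

def advection_diffusion_matrix(N, omega=np.pi, c=1.0, kappa=1.0,
                               ax=1 / np.sqrt(2), ay=1 / np.sqrt(2)):
    """Complex-valued 2D advection-diffusion-reaction operator of (21) on an
    N x N interior grid (h = 1/(N+1)), homogeneous Dirichlet boundary
    conditions; centered differences replace the upwind scheme of [25]."""
    h = 1.0 / (N + 1)
    I = sp.identity(N, format="csr")
    D2 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N)) / h**2   # approximates -d2/dx2
    D1 = sp.diags([-1, 1], [-1, 1], shape=(N, N)) / (2 * h)        # approximates d/dx
    Lap = sp.kron(I, D2) + sp.kron(D2, I)
    Gx, Gy = sp.kron(I, D1), sp.kron(D1, I)
    react = omega**2 * (c**2 - kappa**2) * sp.identity(N * N)
    return (Lap - 2j * omega * c * (ax * Gx + ay * Gy) + react).tocsr()

A = advection_diffusion_matrix(127)   # interior grid for h = 1/128, as in Table 3
```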
Table 3 collects the number of outer iterations ($It$) and the number of preconditioner applications on a single vector ($Pr$) required for various restarted block flexible Krylov subspace methods performing no deflation (BFGMRES(m)), deflation at the beginning of the cycle only (BFGMRESD(m)) and deflation at each iteration (BFGMRES-R(m) and DMBR(m)), for two different values of the restart parameter ($m = 5$ and $m = 10$) respectively. We have also included results related to restarted flexible GMRES when solving the $p$ linear systems independently in sequence. We note that all selected methods solve the minimization problem over a subspace of similar maximal dimension ($mp$). In addition we determine the computational complexity of all algorithms, including costs related to QR factorizations, singular value decompositions, orthonormalization (as listed in Table 1), matrix-vector products and preconditioning, and define a measure of efficiency $\tau$ as
$$\tau(\text{method}) = \frac{flops(\mathrm{FGMRES}(mp))}{flops(\text{method})}.$$
Thus a value of $\tau$ greater than one indicates that the given block subspace method leads to a computational improvement with respect to flexible GMRES applied to the given sequence of linear systems. Table 3 reveals that block Krylov subspace methods including deflation, either at restart only or at each iteration, are usually to be preferred. Indeed those methods always lead to efficiencies $\tau$ greater than one. On this application the standard block Krylov subspace method (BFGMRES) is not efficient with respect to FGMRES(mp). This highlights the fact that, to be effective, block subspace methods must incorporate block size reduction or deflation [23]. On the two sets of numerical experiments, whatever the value of the restart parameter $m$, DMBR(m) always leads to the minimal number of preconditioner applications.
Complex-valued advection diffusion problem - Grid: 128 × 128

m = 5
Method        |  p = 4          |  p = 8          |  p = 16         |  p = 32
              | It   Pr    τ    | It   Pr    τ    | It   Pr    τ    | It   Pr    τ
FGMRES(5p)    | 75   75   1.00  | 155  155  1.00  | 315  315  1.00  | 635  635  1.00
BFGMRES(5)    | 43   172  0.41  | 39   312  0.37  | 26   416  0.40  | 17   544  0.39
BFGMRESD(5)   | 50   80   0.95  | 45   100  1.33  | 49   150  1.27  | 45   235  0.97
BFGMRES-R(5)  | 39   80   0.90  | 39   127  0.96  | 39   214  0.85  | 40   386  0.62
DMBR(5)       | 44   67   1.06  | 43   87   1.38  | 45   121  1.42  | 43   181  1.14

m = 10
Method        |  p = 4          |  p = 8          |  p = 16         |  p = 32
              | It   Pr    τ    | It   Pr    τ    | It   Pr    τ    | It   Pr    τ
FGMRES(10p)   | 75   75   1.00  | 155  155  1.00  | 315  315  1.00  | 635  635  1.00
BFGMRES(10)   | 22   88   0.57  | 20   160  0.57  | 17   272  0.47  | 16   512  0.31
BFGMRESD(10)  | 22   63   0.96  | 24   104  0.96  | 26   186  0.70  | 30   350  0.43
BFGMRES-R(10) | 23   51   1.43  | 23   78   1.43  | 23   126  1.32  | 22   216  1.00
DMBR(10)      | 23   46   1.69  | 23   65   1.69  | 23   98   1.63  | 22   156  1.29

Table 3: Two-dimensional complex-valued advection diffusion problem. Case of $h = 1/128$, $\omega = \pi$, $\alpha_x = 1/\sqrt{2}$, $\alpha_y = 1/\sqrt{2}$, with a number of right-hand sides given at once ranging from $p = 4$ to $p = 32$, for two different values of the restart parameter, $m = 5$ (upper part) and $m = 10$ (lower part). $It$ denotes the number of iterations, $Pr$ the number of preconditioner applications on a single vector and $\tau$ a scaled measure of efficiency in terms of computational operations.
Similarly, DMBR(m) also delivers the best efficiency (see Table 3). Finally we note that BFGMRES-R(m) is penalized when the number of outer cycles is large, since this method does not include initial deflation at the beginning of the cycle. On this academic problem we have shown the interest of using block Krylov subspace methods that include deflation at each iteration. DMBR(m) is indeed found to be competitive with respect to Krylov subspace methods including deflation at restart only.
5.2 Acoustic full waveform inversion
We focus on a specific application in geophysics related to the simulation of wave propagation phenomena in the Earth [44]. Given a three-dimensional physical domain $\Omega_p$, the propagation of a wave field in a heterogeneous medium can be modeled by the Helmholtz equation written in the frequency domain:
$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} - \frac{\partial^2 u}{\partial z^2} - \frac{(2\pi f)^2}{c^2(x,y,z)}\, u = g_s(x), \qquad x = (x, y, z) \in \Omega_p. \qquad (23)$$
Here u represents the pressure field in the frequency domain, c the variable acoustic-wave velocity in m s−1 and f the frequency in Hertz. The source term gs(x) = δ(x − xs) represents a harmonic point source located at (xs, ys, zs). A popular approach, the Perfectly Matched Layer (PML) formulation [6, 7], has been used in order to obtain a satisfactory near-boundary solution without many artificial reflections. As in [11] we consider a second-order finite difference discretization of the Helmholtz equation (23) on a uniform equidistant Cartesian grid of size nx × ny × nz. The same stability condition (12 points per wavelength), relating the frequency f, the mesh grid size h and the heterogeneous velocity field c(x, y, z), has been considered:
\[
f = \frac{\min_{(x,y,z)\in\Omega_h} c(x,y,z)}{12\, h}.
\]
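As a concrete check of this rule for the Overthrust experiment reported below, the following sketch back-computes a minimum velocity from the reported frequency and grid size; the value of c_min is thus an assumption derived from f = 3.64 Hz and h = 50 m, not a quoted property of the velocity model.

```python
# Stability rule (12 points per wavelength): f = min c / (12 h).
c_min = 2184.0          # m/s, minimum velocity assumed here (back-computed)
h = 50.0                # m, mesh grid size
f = c_min / (12.0 * h)  # Hz
print(f)                # 3.64, the frequency used in Section 5.2
```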
In consequence A is a sparse complex matrix which is non-Hermitian and nonsymmetric, due to the PML formulation that leads to complex-valued variable coefficients in the partial differential equation [34, Appendix A]. Due also to their indefiniteness, the resulting linear systems are known to be challenging for iterative methods [17, 18]. We consider the same approximate geometric two-level preconditioner presented in [11], which has been shown to be relatively efficient for the solution of three-dimensional heterogeneous Helmholtz problems in geophysics. We refer the reader to [11, Algorithm 5] for a complete description of the geometric preconditioner and to [34] for additional theoretical properties in relation with Krylov subspace methods. In this section we consider this variable two-grid preconditioner in the multiple right-hand side case and investigate the performance of the block flexible Krylov methods presented in Section 2 on this challenging real-life application. The tolerance is set to tol = 10−5 in the numerical experiments. The numerical results have been obtained on Babel, a Blue Gene/P computer located at IDRIS (PowerPC 450 cores at 850 MHz with 512 MB of memory per core), using a Fortran 90 implementation with MPI in single precision arithmetic. The code was compiled with the IBM compiler suite using standard compiling options and linked with the vendor BLAS and LAPACK subroutines.
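To fix ideas on the structure of A, here is a minimal sketch (not the production Fortran 90 code) of a second-order finite-difference Helmholtz discretization on a small uniform grid, assuming a constant velocity and homogeneous Dirichlet boundaries; the actual operator additionally carries the complex-valued PML coefficients and the heterogeneous velocity field, which make it non-Hermitian.

```python
# Sketch: second-order FD discretization of -Laplacian(u) - (2*pi*f)^2/c^2 u
# on a small uniform n x n x n grid with Dirichlet boundaries. Constant
# velocity and no PML are assumed here, so this toy matrix is symmetric;
# the PML of the actual solver makes A complex-valued and nonsymmetric.
import numpy as np
import scipy.sparse as sp

def helmholtz_3d(n, h, f, c):
    lap_1d = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
    eye = sp.identity(n)
    lap = (sp.kron(sp.kron(lap_1d, eye), eye)     # d^2/dx^2 term
           + sp.kron(sp.kron(eye, lap_1d), eye)   # d^2/dy^2 term
           + sp.kron(sp.kron(eye, eye), lap_1d))  # d^2/dz^2 term
    k2 = (2.0 * np.pi * f / c) ** 2               # squared wavenumber
    # Indefiniteness comes from subtracting the zeroth-order term.
    return (lap - k2 * sp.identity(n**3)).tocsr().astype(np.complex64)

A = helmholtz_3d(n=16, h=50.0, f=3.64, c=2184.0)  # illustrative values
print(A.shape, A.nnz)
```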
Acoustic full waveform inversion - Grid: 433 × 433 × 126

                      p = 4             p = 8             p = 16
Method             It   Pr    T     It   Pr    T     It   Pr    T
FGMRES(5p)         56   56   624   112  112   629   224  224   665
BFGMRES(5)         14   56   622    14  112   631    14  224   668
BFGMRESD(5)        14   43   489    15   70   401    15  120   371
BFGMRES-R(5)       16   44   503    16   74   431    16  134   417
DMBR(5)            16   39   452    16   57   339    18  102   328
BFGMREST(5,p/2)    24   48   542    23   80   447    20  140   410
DMBR(5,p/2)        16   40   459    15   68   392    17  124   384
Combined(5,p/2)    15   41   471    15   62   359    15  103   323
Combined(5,p/4)    18   41   474    15   59   346    15  102   320

                      p = 32             p = 64             p = 128
Method             It   Pr    T     It    Pr    T     It    Pr     T
FGMRES(5p)        434  434   670  1152  1152   925  2531  2531  1187
BFGMRES(5)         14  448   713    18  1152   962    19  2432  1187
BFGMRESD(5)        15  225   371    20   490   422    25  1015   509
BFGMRES-R(5)       18  283   466    25   618   537    28  1489   762
DMBR(5)            19  181   316    25   413   375    28   915   497
BFGMREST(5,p/2)    20  255   396    25   550   444    28  1125   524
DMBR(5,p/2)        16  189   310    24   444   396    29   976   523
Combined(5,p/2)    15  184   305    20   409   348    25   899   442
Combined(5,p/4)    20  191   320    20   398   342    25   898   448

Table 4: Acoustic full waveform inversion (SEG/EAGE Overthrust model). Case of f = 3.64 Hz (h = 50 m), with p = 4 to p = 128 right-hand sides given at once. It denotes the number of iterations, Pr the number of preconditioner applications on a single vector and T the total computational time in seconds.

As in [11] we consider the velocity field issued from the public domain SEG/EAGE Overthrust model and analyze the performance of the numerical methods at a given frequency f = 3.64 Hz. Both the problem dimension (about 23 million unknowns) and the maximal number of right-hand sides considered (128) correspond to a task that geophysicists typically face on a daily basis; efficient numerical methods must thus be developed for that purpose. In [11] we considered block flexible Krylov subspace methods including deflation at restart only on this application, for a reduced number of right-hand sides (from 4 to 16). We continue this detailed analysis here and investigate the performance of both DMBR(m) and BFGMRES-R(m) with a larger number of right-hand sides. We also consider the variants with truncation in memory (BFGMREST(m, pf)) and with truncation in operations (DMBR(m, pf)), with pf set to p/2 in both cases.
The number of cores ranges from 32 for p = 4 to 1024 for p = 128: since doubling the number of right-hand sides nearly doubles the memory requirement of the block methods, we also double the number of cores whenever the number of right-hand sides doubles. This aims at imposing the same memory constraint on each core for all numerical experiments, as in [11]. The maximal memory requested is about 488 GB for p = 128.

Table 4 collects, in addition to the outer iterations (It) and preconditioner applications on a single vector (Pr), the computational times in seconds (T). Among the different strategies DMBR(5) most often delivers the minimal number of preconditioner applications and the minimal computational times (see italic and bold values respectively in Table 4). This clearly highlights the interest of performing deflation at each iteration, both in terms of preconditioner applications and computational operations, on this application. The latter result is especially important since deflating at each iteration induces an additional cost, as shown in Table 1. DMBR(5) is thus found to be competitive with respect to methods incorporating deflation at restart only (a gain of up to 15% in terms of computational time is obtained for instance for p = 8), and so is DMBR(5,p/2) (a gain of 27% for p = 32 when compared to BFGMREST(5,p/2)). This is a satisfactory improvement since methods including deflation at restart only are already quite efficient on this application, as shown in [11]. We also note that the improvement over the classical block flexible GMRES method is quite large, as expected (a gain of up to 61% is obtained for p = 64).

Figure 1 shows the evolution of kj, the number of Krylov directions effectively considered at iteration j, along convergence for the various block subspace methods in the case p = 32. For BFGMRESD(5) and BFGMREST(5,p/2), deflation is performed only at the beginning of each cycle, so kj is constant within a given cycle; variations at each iteration can only happen in BFGMRES-R(5) or DMBR(5). As expected DMBR(5) enjoys a nonincreasing behaviour of kj along convergence, while peaks occur for BFGMRES-R(5) at the beginning of each cycle. On this example the use of truncation within DMBR(5) tends to delay the start of the decreasing behaviour of kj; after a certain phase deflation is nevertheless active and proves to be useful.

We also remark that the use of truncation techniques in DMBR(m) leads to an efficient method. In certain cases DMBR(5, p/2) is as efficient as DMBR(5) in terms of computational times (see, e.g., the case p = 32 in Table 4). This feature is really important in this application due to the large size of the linear systems. Furthermore DMBR(5, p/2) usually requires fewer preconditioner applications than BFGMREST(5, p/2). This satisfactory behaviour has a clear explanation: due to Theorem 2, the truncated variant of DMBR(m) is guaranteed to minimize the entire residual at each iteration (regardless of the value of pj), whereas BFGMREST(m) chooses just a subset of the residual to be minimized at each cycle (as discussed in Section 3.2). We consider this to be a critical feature of the truncated variant of DMBR(m).

Finally we investigate the convergence properties of a combination of DMBR(m) and BFGMRESD(m). At the beginning of the convergence history this method (named Combined(m, ps) in Table 4) is fully equivalent to DMBR(m). Then, as soon as the number of Krylov directions effectively considered at iteration j (kj) reaches a given prescribed value (ps), the method switches to BFGMRESD(m) at the next restart.
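To make the mechanism that drives kj concrete, the following schematic sketch (not the paper's exact algorithm) illustrates how an SVD-based deflation step can select the number of directions to keep at a given iteration: singular values of the scaled block residual that have fallen below the convergence threshold correspond to combinations of the systems that need no further Krylov directions.

```python
# Schematic sketch of SVD-based block size reduction (deflation) at one
# iteration: keep only the singular directions of the scaled block residual
# whose singular values still exceed the tolerance. This mimics the idea of
# choosing k_j in DMBR(m); the optional cap mimics truncation, DMBR(m, p_f).
import numpy as np

def deflate(R_scaled, tol, p_f=None):
    """R_scaled: n x p scaled block residual. Returns retained directions."""
    U, s, Vh = np.linalg.svd(R_scaled, full_matrices=False)
    k = int(np.sum(s > tol))     # directions still above the threshold
    if p_f is not None:
        k = min(k, p_f)          # truncation in operations (assumed form)
    return U[:, :k] * s[:k], k   # k directions used to expand the space

rng = np.random.default_rng(0)
# Synthetic block residual whose columns have rapidly decaying scales.
R = rng.standard_normal((1000, 8)) @ np.diag(10.0 ** -np.arange(8))
W, k = deflate(R, tol=1e-5)
print(k)  # only the directions with singular values above 1e-5 are kept
```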
[Four panels: kj versus the iteration index for p = 32.]

Figure 1: Acoustic full waveform inversion (SEG/EAGE Overthrust model). Case of p = 32. Evolution of kj versus iterations in BFGMRES(5) and BFGMRESD(5) (top left), BFGMRES-R(5) (top right), DMBR(5) (bottom left) and the truncated variants BFGMREST(5,p/2) and DMBR(5,p/2) (bottom right).
This switch mainly aims at reducing the computational cost of the subsequent cycles by performing deflation only at restart instead of at each iteration. As shown in Table 4 this combination leads to further reductions in computational times and is especially appropriate when the number of right-hand sides becomes large on this application.
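A minimal sketch of this switching rule is given below; the cycle routines are placeholders for one restart cycle of each method, and all names are illustrative, not the authors' code.

```python
# Sketch of the Combined(m, p_s) strategy: run DMBR(m) (deflation at each
# iteration) until the number of retained directions k_j falls to p_s,
# then switch to BFGMRESD(m) (deflation at restart only) at the next
# restart. dmbr_cycle and bfgmresd_cycle are hypothetical callables.
import numpy as np

def combined(A, B, m, p_s, max_cycles, dmbr_cycle, bfgmresd_cycle):
    X = np.zeros_like(B)                  # initial block guess
    deflate_each_iteration = True
    for _ in range(max_cycles):
        if deflate_each_iteration:
            X, k_j, done = dmbr_cycle(A, B, X, m)
            if k_j <= p_s:                # few directions left:
                deflate_each_iteration = False  # cheaper cycles from now on
        else:
            X, done = bfgmresd_cycle(A, B, X, m)
        if done:
            break
    return X
```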
6 Conclusion
We have extended the block restarted flexible GMRES method to a variant that allows the use of deflation (i.e. block size reduction) at each iteration when solving multiple right-hand side problems given at once. The method aims at reducing the cost of each iteration of the block Krylov subspace method by judiciously choosing which information to use and which information to postpone. We have studied its convergence properties and have shown that the Frobenius norm of the block residual is always nonincreasing. Furthermore we have justified the choice of the deflation strategy, which is based on a nonincreasing behaviour of the singular values of the scaled block residual. We have also proposed a variant of the deflated block residual method to be used in a constrained memory environment.

Numerical experiments have shown the efficiency of the new method on two different problems issued from wave propagation situations requiring the solution of multiple right-hand side problems. The block flexible method including deflation at each iteration has proven to be efficient in terms of both preconditioner applications and computational operations, and has been found superior to recent block flexible methods including deflation at restart only. This satisfactory behaviour has been observed on an industrial simulation arising in geophysics, where large indefinite linear systems with multiple right-hand sides have been successfully solved in a parallel distributed memory environment. Furthermore, reductions in terms of computational times have been obtained by combining a method including deflation at each iteration with a method including deflation at restart only in a second phase. To the best of our knowledge these results constitute one of the first illustrations of the usefulness of block Krylov subspace methods including deflation at each iteration on a realistic three-dimensional application in a parallel distributed memory environment.

To conclude, it is worthwhile to note that the theoretical properties of the deflated minimal block residual method hold for any unitary matrix Fj+1. We thus plan to investigate in the near future other possible subspace decompositions that may lead to further improvements. Finally we note that the analysis proposed in this paper can be extended as well to other block Krylov subspace methods such as block FOM [35], block GCRO [46] and block simpler GMRES [30].
Acknowledgments
The authors would like to acknowledge GENCI (Grand Equipement National de Calcul Intensif) for the allocation of computing hours on the IBM Blue Gene/P computer at IDRIS, France. This work was granted access to the HPC resources of IDRIS under allocations 2011065068 and 2012065068 made by GENCI.
References

[1] J. I. Aliaga, D. L. Boley, R. W. Freund, and V. Hernández. A Lanczos-type method for multiple starting vectors. Mathematics of Computation, 69:1577–1601, 2000.
[2] J. Baglama. Augmented block Householder Arnoldi method. Linear Algebra and its Applications, 429(10):2315–2334, 2008.
[3] Z. Bai, D. Day, and Q. Ye. ABLE: an adaptive block Lanczos method for non-Hermitian eigenvalue problems. SIAM J. Matrix Analysis and Applications, 20(4):1060–1082, 1999.
[4] A. H. Baker, J. M. Dennis, and E. R. Jessup. An efficient block variant of GMRES. SIAM J. Scientific Computing, 27:1608–1626, 2006.
[5] G. Barbella, F. Perotti, and V. Simoncini. Block Krylov subspace methods for the computation of structural response to turbulent wind. Comput. Meth. Applied Mech. Eng., 200(23-24):2067–2082, 2011.
[6] J.-P. Berenger. A perfectly matched layer for absorption of electromagnetic waves. J. Comp. Phys., 114:185–200, 1994.
[7] J.-P. Berenger. Three-dimensional perfectly matched layer for absorption of electromagnetic waves. J. Comp. Phys., 127:363–379, 1996.
[8] Å. Björck. Numerical Methods for Least Squares Problems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1996.
[9] W. Boyse and A. Seidl. A block QMR method for computing multiple simultaneous solutions to complex symmetric systems. SIAM J. Scientific Computing, 17(1):263–274, 1996.
[10] P. A. Businger and G. Golub. Linear least squares solutions by Householder transformations. Numerische Mathematik, 7:269–276, 1965.
[11] H. Calandra, S. Gratton, J. Langou, X. Pinel, and X. Vasseur. Flexible variants of block restarted GMRES methods with application to geophysics. SIAM Journal on Scientific Computing, 34(2):A714–A736, 2012.
[12] J. Cullum and T. Zhang. Two-sided Arnoldi and non-symmetric Lanczos algorithms. SIAM J. Matrix Analysis and Applications, 24:303–319, 2002.
[13] R. D. da Cunha and D. Becker. Dynamic block GMRES: an iterative method for block linear systems. Adv. Comput. Math., 27:423–448, 2007.
[14] L. Du, T. Sogabe, B. Yu, Y. Yamamoto, and S.-L. Zhang. A block IDR(s) method for nonsymmetric linear systems with multiple right-hand sides. J. Comput. Appl. Math., 235:4095–4106, 2011.
[15] M. Eiermann, O. Ernst, and O. Schneider. Analysis of acceleration strategies for restarted minimal residual methods. J. Comput. Appl. Math., 123:261–292, 2000.
[16] H. Elman, O. Ernst, D. O'Leary, and M. Stewart. Efficient iterative algorithms for the stochastic finite element method with application to acoustic scattering. Comput. Methods Appl. Mech. Engrg., 194(1):1037–1055, 2005.
[17] Y. A. Erlangga. Advances in iterative methods and preconditioners for the Helmholtz equation. Archives of Computational Methods in Engineering, 15:37–66, 2008.
[18] O. Ernst and M. J. Gander. Why it is difficult to solve Helmholtz problems with classical iterative methods. In I. Graham, T. Hou, O. Lakkis, and R. Scheichl, editors, Numerical Analysis of Multiscale Problems. Springer, 2011.
[19] R. W. Freund. Krylov-subspace methods for reduced-order modeling in circuit simulation. Journal of Computational and Applied Mathematics, 123(1-2):395–421, 2000.
[20] R. W. Freund and M. Malhotra. A block QMR algorithm for non-Hermitian linear systems with multiple right-hand sides. Linear Algebra and its Applications, 254:119–157, 1997.
[21] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, third edition, 1996.
[22] A. El Guennouni, K. Jbilou, and H. Sadok. A block version of BICGSTAB for linear systems with multiple right-hand sides. Electronic Transactions on Numerical Analysis, 16:129–142, 2003.
[23] M. H. Gutknecht. Block Krylov space methods for linear systems with multiple right-hand sides: an introduction. In A. H. Siddiqi, I. S. Duff, and O. Christensen, editors, Modern Mathematical Models, Methods and Algorithms for Real World Systems, pages 420–447, New Delhi, India, 2006. Anamaya Publishers.
[24] M. H. Gutknecht and T. Schmelzer. The block grade of a block Krylov space. Linear Algebra and its Applications, 430(1):174–185, 2009.
[25] E. Haber and S. MacLachlan. A fast method for the Helmholtz equation. J. Comp. Phys., 230:4403–4418, 2011.
[26] N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.
[27] R. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, UK, 1991.
[28] A. Khabou. Solveur itératif haute performance pour les systèmes linéaires avec seconds membres multiples. Master's thesis, University of Bordeaux I, 2009.
[29] J. Langou. Iterative methods for solving linear systems with multiple right-hand sides. PhD thesis, CERFACS, 2003. TH/PA/03/24.
[30] H. Liu and B. Zhong. A simpler block GMRES for nonsymmetric systems with multiple right-hand sides. Electronic Transactions on Numerical Analysis, 30:1–9, 2008.
[31] D. Loher. Reliable Nonsymmetric Block Lanczos Algorithms. PhD thesis, Swiss Federal Institute of Technology Zurich (ETHZ), Switzerland, 2006. Number 16337.
[32] A. A. Nikishin and A. Yu. Yeremin. Variable block CG algorithms for solving large sparse symmetric positive definite linear systems on parallel computers, I: General iterative scheme. SIAM J. Matrix Analysis and Applications, 16(4):1135–1153, 1995.
[33] D. P. O'Leary. The block conjugate gradient algorithm and related methods. Linear Algebra and its Applications, 29:293–322, 1980.
[34] X. Pinel. A Perturbed Two-Level Preconditioner for the Solution of Three-Dimensional Heterogeneous Helmholtz Problems with Applications to Geophysics. PhD thesis, CERFACS, 2010. TH/PA/10/55.
[35] M. Robbé and M. Sadkane. Exact and inexact breakdowns in block versions of FOM and GMRES methods. Technical report, Université de Bretagne Occidentale, Département de Mathématiques, 2004. Available at http://www.math.univ-brest.fr/archives/recherche/prepub/Archives/2005/breakdowns.pdf.
[36] M. Robbé and M. Sadkane. Exact and inexact breakdowns in the block GMRES method. Linear Algebra and its Applications, 419:265–285, 2006.
[37] A. Ruhe. Implementation aspects of band Lanczos algorithms for computation of eigenvalues of large sparse symmetric matrices. Mathematics of Computation, 33(146):680–687, 1979.
[38] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2003.
[39] T. Sakurai, H. Tadano, and Y. Kuramashi. Application of block Krylov subspace algorithms to the Wilson-Dirac equation with multiple right-hand sides in lattice QCD. Computer Physics Communications, 181(1):113–117, 2010.
[40] P. Soudais. Iterative solution methods of a 3-D scattering problem from arbitrary shaped multidielectric and multiconducting bodies. IEEE Trans. on Antennas and Propagation, 42(7):954–959, 1994.
[41] E. de Sturler. Truncation strategies for optimal Krylov subspace methods. SIAM J. Numerical Analysis, 36(3):864–889, 1999.
[42] X. Sun and C. Bischof. A basis-kernel representation of orthogonal matrices. SIAM J. Matrix Analysis and Applications, 16(4):1184–1196, 1995.
[43] A. Toselli and O. Widlund. Domain Decomposition Methods - Algorithms and Theory, volume 34 of Springer Series on Computational Mathematics. Springer, New York, 2004.
[44] J. Virieux and S. Operto. An overview of full waveform inversion in exploration geophysics. Geophysics, 74(6):WCC127–WCC152, 2009.
[45] B. Vital. Étude de quelques méthodes de résolution de problèmes linéaires de grande taille sur multiprocesseur. PhD thesis, Université de Rennes, 1990.
[46] R. Yu, E. de Sturler, and D. D. Johnson. A block iterative solver for complex non-Hermitian systems applied to large-scale electronic-structure calculations. Technical Report UIUCDCS-R-2002-2299, University of Illinois at Urbana-Champaign, Department of Computer Science, 2002.