computation and the memory requirements to increase beyond the capacity of computers to solve the problem. Iterative ...... g3(u; n; p) = ?r pew?urw + k2 = 0:.
CS{1993{04
Convergence of Nested Iterative Methods for Linear Systems Paul Joshua Lanzkron
Department of Computer Science Duke University Durham, North Carolina 27708-0129 1989
Convergence of Nested Iterative Methods for Linear Systems Paul Joshua Lanzkron
1989
Supervised by Donald J. Rose Dissertation submitted in partial ful llment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University
This document is a reformatted version of the disseration, and equivalent in content.
Copyright c 1992 by Paul Joshua Lanzkron All rights reserved
Abstract We consider the solution of algebraic linear systems of the form Ax = b, where A is a matrix and x and b are vectors. Relaxation methods for solving this problem are de ned by splitting the matrix A as A = M ? N , and solving the system Mxk+1 = b + Nxk at each step k, from some initial guess x0 . Generally M is choosen to be easily invertible. We will consider the situation where M is not necessarily easy to invert. This situation might arise in block Gauss-Seidel where the diagonal blocks become very large. In this case we will consider solving the Mv = g system by an iterative method. We call this a nested iterative method. De ne the outer splitting as A = (D ? L) ? U and the inner splitting as D = F ? G. If the outer splitting is a convergent regular composite splitting (CRCS) and the inner splitting is weak regular then the overall iteration is convergent. Under those same conditions increasing the number of inner iterations does not decrease the convergence rate of the iterative method. The iteration matrices for the inner iteration with i and j inner iterations are dierent. It is well known that compositions of convergent matrices need not be convergent. We show that if the outer splitting is a CRCS and the inner splitting is regular then the number of inner iterations can be changed at every outer iteration while preserving convergence. It is well known that relaxation methods can be computed using the residual. This recurrence is necessary in practice since the residual is used to determine convergence of the solution, and explicit computation of the residual is an expensive process. We show how to extend the recurrence to nested iterative methods. We give some heuristic methods for choosing the number of inner iterations. We show that iterative methods are useful for certain problems in device simulation and circuit simulation. We also consider issues of parallel implementation of our nested solver. We have implemented our nested solver on a 42 node BBN Butter y, obtaining near linear speedup. We attained good speedup over the best sequential algorithm.
i
ii
Contents Abstract 1 Introduction 2 Nested Iteration Theory 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12
Introduction : : : : : : : : : : : : : : : : : : : : Preliminaries : : : : : : : : : : : : : : : : : : : : Comparison Theorem : : : : : : : : : : : : : : : Iteration Induced Splittings : : : : : : : : : : : : Monotonicity : : : : : : : : : : : : : : : : : : : : Varying the Number of Inner Iterations : : : : : Nested Block Iterative Methods : : : : : : : : : : Use of Unique Splitting Lemma : : : : : : : : : : Derivation of SOR within SOR Iteration Matrix SOR Straight Analysis - Kahan : : : : : : : : : : Ostrowski-Reich Analysis : : : : : : : : : : : : : SOR - Kulisch Analysis : : : : : : : : : : : : : :
3 Computational Aspects of Two-Level Iteration
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
: : : : : : : : :
4.1 Motivation : : : : : : : : : : : : : : : : 4.2 Problems : : : : : : : : : : : : : : : : : 4.2.1 Semicondutor Device Simulation 4.2.2 Circuit Simulation : : : : : : : : 4.3 Optimal Number of Inner Iterations : : 4.3.1 Circuit Simulation Data : : : : : 4.4 Approximation Algorithms : : : : : : : 4.5 Conclusions : : : : : : : : : : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
: : : : : : : :
3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
Background : : : : : : : : : : : : : : : The Iteration : : : : : : : : : : : : : : The Residual Algorithm : : : : : : : : When to Update the Actual Residual : Optimal p : : : : : : : : : : : : : : : : Simple Algorithm : : : : : : : : : : : : Residual Algorithm : : : : : : : : : : : Adaptive Residual Algorithm : : : : : Estimate Spectral Radius of Tp : : : :
: : : : : : : : : : : :
4 Sequential Experimental Results
iii
i 1 5
5 7 9 10 14 16 20 22 24 25 26 28
31
31 33 33 36 39 40 40 41 41
45
45 48 48 49 50 55 57 61
CONTENTS
iv
5 Parallel Block Methods
5.1 Background : : : : : : : : : : : : : : 5.1.1 Chaotic Relaxation Analysis : 5.2 Architecture of the BBN Butter y : 5.3 Parallel Implementation Issues : : : 5.3.1 How to Block : : : : : : : : : 5.3.2 How to Split the Problem : : 5.4 Results : : : : : : : : : : : : : : : : : 5.4.1 Iterative Block Jacobi : : : : 5.4.2 Outer SOR : : : : : : : : : : 5.4.3 SOR version 2 : : : : : : : : 5.5 Conclusions : : : : : : : : : : : : : :
6 Conclusions
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
: : : : : : : : : : :
63
63 65 67 68 68 69 70 71 77 81 83
89
List of Tables 4.1 Optimal numbers of inners and corresponding numbers of multiplications for the MOSFET semiconductor : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.2 Numbers of multiplications for solving the inner iteration to various levels. From the MOSFET semiconductor. is the percentage of the outer residual to which the inner is computed. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.3 Matrix 3 from MOSFET set. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.4 Matrix 6 from MOSFET set. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.5 Matrix 9 from MOSFET set. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.6 Matrix 1 from Hall Sensor set. Converges for all !O = 0:5 and !I 1:15. % means that for those parameters the iteration diverged for every number of inner iterations 4.7 Matrix 5 from Hall Sensor set. Converges for all !O = 0:1 and !I 1:15 : : : : : : : 4.8 Matrix 10 from Hall Sensor set. Converges for all !O = 0:1 and !I 1:15 : : : : : : 4.9 Matrix 1 of Square Root set, blocked into 2 blocks : : : : : : : : : : : : : : : : : : : 4.10 Matrix 4 of Square Root set, blocked into 2 blocks : : : : : : : : : : : : : : : : : : : 4.11 Matrix 1 of Square Root set, blocked into 3 blocks : : : : : : : : : : : : : : : : : : : 4.12 Matrix 4 of Square Root set, blocked into 3 blocks : : : : : : : : : : : : : : : : : : : 4.13 Matrix 1 of Square Root set, blocked into 6 blocks : : : : : : : : : : : : : : : : : : : 4.14 Matrix 4 of Square Root set, blocked into 6 blocks : : : : : : : : : : : : : : : : : : : 4.15 Simple p selection algorithm applied to Matrix 1 of MOSFET set : : : : : : : : : : : 4.16 Random algorithm applied to Matrix 1 of MOSFET set : : : : : : : : : : : : : : : : 4.17 Optimal results for Laplacian : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.18 Random Algorithm with Upper bound 25 for Matrix 3 from MOSFET set. Compare with table 4.3. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.19 Random Algorithm with Upper bound 25 for Matrix 6 from MOSFET set. Compare with table 4.3. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4.20 Random Algorithm with Upper bound 25 for Matrix 9 from MOSFET set. Compare with table 4.3. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 5.1 Times for best number of inner iterations on a single processor for MOSFET matrix 1 (in millions of clock ticks, 1 clock tick = 62.5 microseconds : : : : : : : : : : : : : 5.2 Mean, Standard of Deviation, and optimal number of inner iterations over 4 runs for MOSFET 6 matrix with inner SOR factor 1.15 and outer SOR factor 1.3 : : : : : : 5.3 speedup for MOSFET 1 compared to serial nested iterative code : : : : : : : : : : : 5.4 eciency for MOSFET 1 compared to serial nested iterative code : : : : : : : : : : : 5.5 speedup for MOSFET 1 compared to Block iterative code (direct solve on the diagonal 5.6 eciency for MOSFET 1 compared to Block iterative code : : : : : : : : : : : : : : 5.7 speedup for MOSFET 1 compared to serial nested iterative code for !I = 1:3 : : : : 5.8 eciency for MOSFET 1 compared to serial nested iterative code for !I = 1:3 : : : : v
52 53 54 54 54 54 54 55 55 55 56 56 56 56 57 58 60 61 61 61 70 71 72 72 73 74 74 75
5.9 5.10 5.11 5.12 5.13 5.14 5.15 5.16 5.17 5.18 5.19 5.20 5.21 5.22 5.23 5.24 5.25 5.26 5.27 5.28
speedup for MOSFET 1 with !I = 1:3 compared to Block iterative code : : : : : : : eciency for MOSFET 1 compared to Block iterative code : : : : : : : : : : : : : : speedup for three other MOSFET matrices compared to nested Block iterative code eciency for three other MOSFET matrices compared to nested Block iterative code speedup for MOSFET 1 matrix compared to nested Block iterative code using SOR code : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : eciency for MOSFET 1 matrix compared to nested Block iterative code using SOR code : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : speedup for MOSFET 6 matrix compared to nested Block iterative code using SOR code : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : eciency for MOSFET 6 matrix compared to nested Block iterative code using SOR code with !O = 1:3 and !I = 1:3 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : speedup for MOSFET 1 matrix compared to Block iterative code using SOR code : eciency for MOSFET 1 matrix compared to Block iterative code using SOR code with !O = 1:3 and !I = 1:3 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : speedup for Hall 5 matrix compared to nested iterative code using SOR code : : : : eciency for Hall 5 matrix compared to nested iterative code using SOR code with !O = 0:1 and !I = 1:0 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : speedup for Hall 5 matrix compared to block iterative code using SOR code : : : : : eciency for Hall 5 matrix compared to block iterative code using SOR code : : : : speedup for Hall 5 matrix compared to block iterative code using SOR version 2 code eciency for Hall 5 matrix compared to block iterative code using SOR version 2 code speedup for MOSFET 6 matrix compared to block iterative code using SOR version 2 code : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : eciency for MOSFET 6 matrix compared to block iterative code using SOR version 2 code : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : speedup for all MOSFET matrices compared to block SOR code : : : : : : : : : : : speedup for all Hall matrices compared to block SOR code : : : : : : : : : : : : : : :
75 76 76 77 78 78 79 79 80 80 81 82 82 83 84 84 85 85 86 86
Chapter 1
Introduction We consider the solution of the algebraic linear system of equations Ax = b; (1.1) where x and b are vectors and A is a square matrix. The solution methods for such linear systems have been classi ed as either direct or iterative. An iterative method is one in which approximate solutions, given by xk , converge to the actual solution x in some unknown number of steps. Direct methods are characterized by following some pattern that gives the solution to the linear system in some xed number of steps. Examples of iterative methods are Succesive Over-Relaxation (SOR) and conjugate direction methods (see [49, 22] and the references therein). An example of a direct method is Gaussian elimination (see [44, 22]). The problem with direct methods for sparse systems (i.e., systems with few nonzeros) is that during the computation many zero entries become nonzero. This can cause both the amount of computation and the memory requirements to increase beyond the capacity of computers to solve the problem. Iterative methods use much less memory, but in many cases there is diculty in forcing convergence. This work considers iterative solutions of the linear system (1.1). As an illustration of the iterative method that we will analyze consider the linear system
2 66 TT TT A = 66 .. .. . . 4 . . . Tk Tk 11
12
21
22
1
2
3
T1k T2k 777 .. 7 . 5
(1.2)
Tkk
where the Tii are square nonsingular matrices. The block Gauss-Seidel algorithm is given by solving at step l + 1 the linear system
Tii xi(l+1) = bi ?
X ji
Tij xj(l)
(1.3)
for each i in succession. Here xi and bi are vectors of a size compatible with the diagonal block Tii . It is well known that for certain classes of matrices (M-matrices) the larger the blocks Tii are taken the fewer the number of iterations required for convergence (see [49]). Of course, when larger blocks are taken the number of multiplications required to get the solution of the linear system (1.3) is increased nonlinearly because of ll. For some applications the blocking of the system is de ned by the application itself. For instance, in a coupled dierential equation the diagonal blocks might be the discretized systems 1
CHAPTER 1. INTRODUCTION
2
associated with the solution of the various dierential equations. In this example the size of the diagonal blocks could be very large, particularly for 3-D problems. We would like to be able to take advantage of the faster convergence associated with the block Gauss-Seidel algorithm, but the diagonal blocks might be too expensive for direct solution. The approach we consider is to solve the diagonal blocks themselves by an iterative method. We will call this method nested iteration. A detailed history of nested iterative methods may be found in the introduction of chapter 2 of this dissertation. One may rewrite the iteration of (1.3) in matrix form as
Dx(l+1) = b + Lx(l+1) + Ux(l) where
2 66 T0 T0 00 D = 66 .. .. . . .. 4 . . . . 0 0 Tkk 11
22
and
2 3 0 66 ?T0 77 0 77 ; L = 66 .. .. 4 . . 5 ?Tk ?Tk
2 0 ?T 66 0 0 6 U = 66 ... ... 64 0 0 0
21
1
0
12
?T ?T
13 23
... 0
2
?T k ?T k 1
2
.. .
.. .
0
0
(1.4)
?Tk? ;k 1
0 0 .. .
0 0 .. ... . ?Tk;k?1 0
3 77 77 5
3 77 77 77 : 5
Collecting terms we have (D ? L)x(l+1) = b + Ux(l):
(1.5)
We say that the splitting of A associated with this iterative technique is A = (D ? L) ? U (see [49]). In general, if one can show that the splitting is regular, and the matrix A?1 has no negative entries then the iteration is convergent. For this splitting to be a regular splitting neither (D ? L)?1 nor U can have negative entries. Chapter 2 of this work is dedicated to the analysis of nested iterative methods. We will give conditions for the overall iteration to converge for any number of inner iterations. We will show that under these same conditions increasing the number of inner iterations decreases (actually does not increase) the number of steps to convergence. This later result is necessary for algorithms trying to nd the optimal number of inner iterations. There are many situations when one might like to dynamically change the number of inner iterations from one outer iteration to the next. We will show that under slightly more restrictive conditions changing the number of inner iterations performed at every outer iteration will converge faster (not converge slower) than if one inner iteration had been done at each outer iteration. In the iteration de ned by (1.3) there are k systems of equations to solve. We will show that each of those linear system can be solved using dierent numbers of inner iterations. Chapter 2 is completed with some analysis of nested SOR. Chapter 3 is dedicated to implementation issues of the nested iterative method. It begins with a practical guide to the implementation of the nested iterative method. We show how the residual (rl = b ? Ax(l)) may be computed using a recurrence relation. Since the norm of the residual is frequently used as a measure of convergence, computing the residual without explicitly computing Ax is useful. We then turn our attention to the problem of nding the optimal number
3 of inner iterations. We will present several heuristic algorithms for nding the best number of inner iterations. In the discussion above we said we would like to be able to take advantage of the faster overall convergence of block Gauss-Seidel, but would not like to have to do the very expensive operation of doing a solve at each inner iteration. We decided to solve the linear systems that arise inexactly. We then showed that this method would give overall convergence. Chapter 4 shows that nested iteration with many inner iterations can be useful on real problems. We then show how the approximation algorithms derived in the previous chapter perform on these problems. In chapter 5 we consider implementing the problem on a parallel machine. We have solved the problems described in chapter 4 on a 42 processor BBN Butter y. The measure of success on a parallel machine is speedup, which is de ned as the time for the problem to be solved on one processor divided by the time to solve the problem on p processors. Linear speedup is optimal. We have attained near linear speedups for the problems examined in chapter 4. We also show that good speedups can be attained relative to the best (that we have found) sequential program.
4
CHAPTER 1. INTRODUCTION
Chapter 2
Nested Iteration Theory 2.1 Introduction We consider here iterative methods for the solution of the algebraic linear system of equations
Ax = b;
(2.1)
where x and b are vectors, and A is a square nonsingular matrix. It is customary to view iterative methods as the repeated solution of
Mxk+1 = b + Nxk
(2.2)
where A = M ? N is called a splitting of the matrix A, M is nonsingular, and x0 is given. Varga [48, 49] pioneered the study of such methods; see also Young [53] and Ortega and Rheinboldt [42]. The method (2.2) is a natural formulation for systems arising from discretizations of dierential equations and it is generally assumed that the system
Mv = g
(2.3)
can be solved with considerably less computational eort than (2.1). Here we consider the iterative solution of the system (2.3), called the inner iteration, at each iteration of (2.2), the outer iteration. Again, the customary way of looking at the inner iteration is by the splitting M = B ? C and the repeated solution of
Bvj+1 = g + Cvj :
(2.4)
These are often called two-stage iterative methods [23, 39]. There is a wide range of application for these methods, and, in particular, our theory applies to block iterative methods [49]. In fact, our theory applies to iterative methods which can be described by an error propagation equation of the form ek+1 = Tek , where ek is the error at step k of the iteration and T , the iteration matrix, is convergent; see section 2.2. Two-stage methods, also called inner/outer methods, have been applied to domain decomposition methods; see [20, 47] and the references given therein. Golub and Overton [23, 24] have considered two-stage methods when the outer iteration is the Chebyshev or the Richardson method. For nonlinear systems of equations, methods with outer nonlinear iteration and linear inner iteration have been extensively studied and have been applied to dierent areas of science and engineering; see, e.g., Bank and Rose [2], Dembo, Eisenstat and Steihaug [11] or Daz, Jines and Steihaug [13]. 5
6
CHAPTER 2. NESTED ITERATION THEORY
Nichols [39] studied two-stage methods for the solution of (2.1) with general inner iteration methods. She showed that if the outer and the inner iterations are convergent, then, for a large enough number of inner iterations, p, the two-stage method is convergent (see also [50]). In Lemma 8 we note that a very general class of iterative methods can be represented by corresponding (unique) splittings and thus, without loss of generality, both the outer and the inner iterations are represented by splittings. In Theorem 3 we set conditions on the splittings so that convergence is guaranteed for any number of inner iterations. Moreover, under these conditions, we show that the spectral radii of the global iteration matrices decrease when the number of inner iterations increases. This last intuitive result is an initial step for a strategy to nd an \optimal" number of inner iterations, but may not hold if the conditions of our theorem are violated; see Section 2.5. The conditions we set relate to regular and weak regular splittings. These arise naturally in many applications and have been extensively studied [6, 41, 49, 53]. Our proofs are based on the theory of nonnegative matrices, the Perron-Frobenius theorem, and also on comparison theorems; see the mentioned references and also [10]. In section 2.3 we prove a powerful comparison theorem which we use to develop our theory of convergence for the iterative methods studied here. Consider now the Block Gauss-Seidel method [6, 49, 53]. To that end, consider A = D ? L ? U , where D is block diagonal, L is strictly block lower triangular and U is strictly block upper triangular. The block Gauss-Seidel method is (D ? L)xk+1 = b + Uxk ;
(2.5)
where it is understood that the solution at each step proceeds one block at a time. Comparing (2.5) with (2.2) and applying the philosophy of two-stage methods, one would have to solve, at each step of (2.5), a linear system by splitting (D ? L) = V ? W , say. This is not how blocks methods are solved and, in addition, it would be an expensive iteration. The usual approach, instead, is to use an iterative method for the solution of the systems corresponding to each diagonal block. To study the usual block Gauss-Seidel method, we write (2.5) as
Dxk+1 = b + Uxk + Lxk+1 ;
(2.6)
and we think of (2.6) as an implicit method, where at each step we solve a system of the form
Dx = r(xk ; xk+1; b);
(2.7)
in which r is a function of the known iterate xk , the iterate to be determined xk+1 , and b, a given vector. Of course, due to the block triangular structure of D ? L, the system (2.7) is not truly implicit, since the needed components of xk+1 are always available. The solution of (2.7) by an iterative method can be represented by a splitting of the block diagonal matrix D = B ? C . This representation does not t the description of the two-stage methods described earlier. In Section 2.4 we develop the notion of composite splittings, which enables us to study methods such as block Gauss-Seidel. We also study nested iterative methods where, e.g., the solution of the system (2.4) is itself solved by another iteration, and this recursive idea is repeated for a certain number of levels of nesting. For these methods and for iterative block Gauss-Seidel our global convergence results also apply; see Section 2.4. Furthermore, the monotonicity result also holds; namely, under certain conditions, if two nested methods have inner iteration matrices whose spectral radii compare in one direction, say (R1) < (R2), then the global iteration matrices compare in the same direction; see Section 2.5. In the two-stage method described earlier, the same number of inner iterations, p, is performed on (2.4) at each outer step. In other words, p is independent of k, the index of the outer iteration
2.2. PRELIMINARIES
7
in (2.2). In dynamic nested iterative methods the number of inner iterations may change at each outer iteration. This amounts to concatenating dierent iterative methods and, since the product of two matrices with spectral radius less than unity may have spectral radius greater than one, the resulting method might not be convergent. In Section 2.6, we address this problem and show that, under a slightly more restrictive hypothesis, dynamic nested iterative methods are convergent. Finally, in Section 2.7 we study convergence and monotonicity results of block methods, in which dierent numbers of inner iterations are performed on the systems corresponding to dierent diagonal blocks. The method is algorithmically signi cant since any block iterative Gauss-Seidel type method would likely be implemented this way. In special cases this method reduces to the dynamic nested iteration or to the two-stage method. The theory presented herein is a rst step toward the development of parallel block chaotic relaxation methods [5, 9, 34, 46] in which the linear systems corresponding to diagonal blocks are solved by iterative methods.
2.2 Preliminaries In this section, we give some notation, present some basic results and review some de nitions (see [6, 41, 42, 45, 49, 53]). We denote the spectral radius of the matrix B by (B ). We say that a matrix B is convergent if (B) < 1. We say that a vector x is nonnegative, denoted x 0, if all its entries are nonnegative. De ne x > 0 as x 0 with each component xi 6= 0 for all i. Similarly, a matrix B is said to be nonnegative, denoted B 0, if all its entries are nonnegative or, equivalently, if it leaves invariant the set of all vectors with nonnegative entries. We compare two matrices A B , when A ? B 0. By Im we denote the m m identity matrix and when the order of the identity matrix is clear from the context, we simply denote it by I . A de nition we shall use implicitly throughout this chapter is De nition 1 A nonnegative matrix A is said to be Reducible if there is a permutation matrix P such that " # B 0 T P AP =
C D where B and D are square matrices. Otherwise A is Irreducible.
The critical theorem involving nonnegative matrices is the Perron-Frobenius Theorem. The theorem was originally given in terms of nonnegative irreducible matrices. We give the slightly weakened, but more useful, version for reducible matrices.
Theorem 1 ([21] Chapter 13 Theorem 3) A nonnegative matrix A always has a nonnegative
eigenvalue r such that the moduli of all the eigenvalues of A does not exceed r. To this maximal eigenvalue r there corresponds a nonnegative eigenvector, y (which we shall refer to as the Frobenius vector), such that Ay = ry where y 0, and y 6= 0.
We de ne A = M ? N as a splitting of A when M is nonsingular. We say that the splitting is convergent if (M ?1N ) < 1; regular if M ?1 0 and N 0; and weak regular if M ?1 0, M ?1 N 0 and NM ?1 0. Obviously, a regular splitting is a weak regular splitting but the converse is not always true.
CHAPTER 2. NESTED ITERATION THEORY
8
Lemma 1 Let T 0. Then (T ) < 1 if and only if (I ? T )? exists and (I ? T )? 0. 1
1
Proof. See [6, Lemma 6.2.1]. 2 Lemma 2 Let T 0. If there exists x 0, x 6= 0 and a scalar > 0 such that x Tx, then (T ). Proof. See [45, Theorem 12]. 2 Lemma 3 Let T 0. If there exists x > 0 and a scalar > 0 such that Tx x, then (T ) . Proof. See [45, Theorem 4]. 2 The following example shows that the condition of strict positivity of the vector x is essential.
Example 1
"
#
" #
T = 11 00 ; x = 01 ; (T ) = 1 and Tx x; for all > 0: Lemma 4 Let A = M ? N be a convergent weak regular splitting. Then A is nonsingular and A?1 0. Proof. Let S = M ?1 N . The lemma follows from A?1 = M ?1 (I ? S )?1 and Lemma 1. 2 Lemma 5 Let A = M ? N and M = F ? G be two weak regular splittings. Let S = M ?1 N , S = NM ?1 , R = F ?1 G and R = GF ?1 . Then for any analytic function f , i.e., for any function f that has a power series expansion, Mf (R)M ?1 = f (R ) and Mf (S )M ?1 = f (S ).
Proof. The lemma follows from the following identities; MS pM ?1 = M (M ?1 N )pM ?1 = (NM ?1)p MRpM ?1 = F (I ? R)Rp (I ? R)?1 F ?1 = FRp F ?1 = (GF ?1 )p ; where we have used the fact that two polynomials on a matrix commute. 2 Lemma 6 For any S such that (I ? S )?1 exists, then (I ? S )?1S = (I ? S )?1 ? I . Proof. The lemma follows by multiplying both sides by I ? S . 2 Lemma 7 If A; B; C; D are nonnegative, with A B and C D, then AC BD. Proof. From A B we have that AC BC and from C D that BC BD. 2 The following result, although straightforward, plays an important role in our analysis. Lemma 8 Given a nonsingular matrix A and T such that (I ? T )?1 exists, there exists a unique pair of matrices M , N , such that T = M ?1 N and A = M ? N , where M nonsingular. Proof. Consider M = A(I ? T )?1 and N = M ? A. Then M ?1 N = M ?1 (M ? A) = I ? (I ? T )A?1 A = T . For the uniqueness, let A = M ? N = M~ ? N~ be two splittings of A such that T = M ?1 N = M~ ?1 N~ , then 0 = M ? N ? (M~ ? N~ ) and multiplying on the left succesively by M ?1 and M~ ?1 gives 0 = I ? T ? M ?1 A; 0 = I ? T ? M~ ?1 A: Thus, M ?1 A = M~ ?1 A, which implies that M = M~ . 2 In the context of Lemma 8 we will say that T induces the unique splitting A = M ? N .
2.3. COMPARISON THEOREM
9
Lemma 9 Let A? 0 and let A = M ? N = M~ ? N~ be convergent weak regular splittings. If N~ N , then (M~ ? N~ ) (M ? N ). 1
1
1
Proof. See [17].
2.3 Comparison Theorem The main result of this section is a comparison theorem between weak regular splittings of the same matrix. It strengthens some comparison theorems in Varga [49], Csordas and Varga [10] and Elsner [17]; see also Miller and Neumann [37]. The theorem has recently been extended to arbitrary cones by Marek and Szyld [30]. Theorem 2 (Comparison Theorem) Let A = M ? N = M~ ? N~ be convergent weak regular splittings such that M~ ?1 M ?1 ; (2:8) and let x and z be the nonnegative Frobenius eigenvectors of T = M ?1 N and T~ = M~ ?1 N~ , respec~ 0 or if Nx 0 with x > 0, then tively. If Nz (T~) (T ): (2:9) Proof. If (T~) = 0, then the theorem is trivially true. We shall therefore only consider (T~) 6= 0. ~ 0. We have that Tz ~ = M~ ?1 Nz ~ = (T~)z , which implies Assume that Nz ~ = 1 Nz: ~ Mz (T~) Hence ~ ~ Az = M~ (I ? T~)z = 1 ? ~(T ) Nz 0: (T ) From (2.8) it follows that (I ? T~)z = M~ ?1 Az M ?1 Az = (I ? T )z:
~ = (T~)z Tz and by Lemma 2, (T~) (T ): The proof for the case Nx 0 with Therefore Tz x > 0 is analogous, using Lemma 3. 2 The proof of Theorem 2 is similar to that in [38, Lemma 2.2], where the comparison is done between splittings of two dierent matrices A1 and A2. Varga [49] showed that if A = M ? N = M~ ? N~ are regular splittings, N~ N (2:10) implies the result (2.9). Csordas and Varga [10] report that Woznicki proved the same result with the weaker hypothesis (2.8). Some still weaker conditions were set in [10, 37] always requiring the splittings to be regular. Elsner [17] proved that for weak regular splittings condition (2.10) was enough to show the inequality (2.9) (Lemma 9) and that (2.8) with either N 0 or N~ 0 were sucient too. Here we have shown that even if the matrices N or N~ do not map the entire set of nonnegative vectors into itself, the result (2.9) holds if N or N~ map particular nonnegative vectors into that set. Moreover, all splittings we consider in this paper satisfy the conditions of Theorem 2, namely that if v is the Frobenius eigenvalue of T = M ?1 N , then Nv 0; see Theorem 5.
CHAPTER 2. NESTED ITERATION THEORY
10
2.4 Iteration Induced Splittings
Consider a two-stage method for the solution of Ax = b with A = M ? N and M = B ? C . If p inner iterations are performed at each outer iteration, the method can be described by the following algorithm:
Algorithm 1 for k = 0; 1; : : :
y0 = x k ) for j = 0 to p ? 1 Byj+1 = b + Nxk + Cyj xk+1 = yp
(2.11)
To analyze the convergence properties of this method, we begin by replacing the loop over j (equation (2.11)) with the equation
yp =
pX ?1 i=0
(B ?1 C )iB ?1 (b + Nxk ) + (B ?1 C )py0
= (I ? H p )(I ? H )?1B ?1 (b + Nxk ) + H py0 ; where H = B ?1 C is the iteration of one step of the inner iteration and we have used the P ?matrix 1 identity (I ? H p)(I ? H )?1 = pi=0 (B ?1 C )i. We may therefore rewrite (2.11) with the equation:
B(I ? H )(I ? H p)?1 yp = b + Nxk + B(I ? H )(I ? H p )?1 H py0 : If we call F = B (I ? H )(I ? H p )?1 and G = B (I ? H )(I ? H p)?1 H p, we can represent (2.11) as Fyp = b + Nxk + Gy0 ; i.e.; Fxk+1 = b + Nxk + Gxk : (2.12) In the context of Lemma 8, F and G are the unique matrices induced by the iteration matrix R = H p = F ?1 G on the matrix M = (B ? C ) = F ? G. The iterative method (2.12) corresponding to the splitting M = F ? G is not generally used in actual computations. This is a convenient device to study two-stage methods. Furthermore, the fact that we consider an inner iteration with iteration matrix of the speci c form R = (B ?1 C )p is not used in our analysis. In the remainder of the paper we often use the concept of a splitting induced by a given iteration matrix R. In a way analogous to (2.7) we consider iterative methods for the solution of (2.1) where we solve at each step a system of the form
Mx = r(xk ; xk+1; b); (2.13) in which r is a function of the known iterate xk , the iterate to be determined xk+1 , and b, a given vector. When
r(xk ; xk+1; b) = b + Nxk ;
(2.14)
it corresponds to the explicit method (2.12) induced by a given matrix R. One could introduce
r(xk ; xk+1 ; b) = b + Nxk+1
(2.15)
2.4. ITERATION INDUCED SPLITTINGS
11
as an alternative to equation (2.14). This yields the implicit method (cf. (2.12) and also (2.6))
Fxk+1 = b + Nxk+1 + Gxk : (2.16) This method would occur naturally if A were lower block triangular. In order to work with a uni ed
theory for the iterative methods studied here, we use
r(xk ; xk+1 ; b) = b + N1xk + N2xk+1 ;
(2.17)
which yields the implicit method
Fxk+1 = b + N1xk + N2xk+1 + Gxk : (2.18) In this context, we have the composite splitting A = M ? N1 ? N2 , M nonsingular, and M = F ? G, the (unique) splitting induced by the (inner) iteration matrix R; see Lemma 8. Although the method (2.18) is implicit, the iteration matrix is clearly
T = (F ? N2)?1 (G + N1) and the (unique) matrices induced by T on A are MT = F ? N2 = M (I ? R)?1 ? N2; NT = G + N1 = M (I ? R)?1 R + N1:
(2.19) (2.20) (2.21)
The following recasts Algorithm 1 in terms of (2.17).
Algorithm 2 (Two-stage Method) for k = 0; 1; : : :
y0 = x k for j = 0 to p ? 1 Byj+1 = b + N1xk + N2xk+1 + Cyj xk+1 = yp
We de ne the Iterative Block Gauss-Seidel Method as the solution of (2.6) by an iterative method represented by the splitting D = B ? C , i.e., by Algorithm 2 with N2 = L block lower triangular and N1 = U block upper triangular. Algorithm 1 can obviously be recovered by setting N2 = 0. Therefore, the iteration matrix (2.19), the induced splitting, Algorithm 2, and the forthcoming analysis, all apply to that method and to iterative block Gauss-Seidel. The case N2 = 0 was studied by Nichols [39], and (2.19) becomes
T = F ?1 N1 + R = R + S ? RS = I ? (I ? R)(I ? S ); (2.22) where S = M ?1 N1, M = F ? G and R = F ?1 G. It should be noted, however, that the twostage method (Algorithm 2) encompasses a larger class of methods. In particular, iterative block Gauss-Seidel methods cannot be represented by an iteration matrix of the form (2.22). The following de nition is used in the theorem that gives general conditions for convergence of the two-stage method. De nition 2 We say that A = M ? N1 ? N2 is a convergent regular composite splitting if both M1 := M ? N2 and A = M1 ? N1 are convergent regular splittings.
CHAPTER 2. NESTED ITERATION THEORY
12
Theorem 3 Let A = M ? N1 ? N2 be a convergent regular composite splitting and let R 0, (R) < 1. If the (unique) splitting M = F ? G such that R = F ?1 G is weak regular, then the
iterative method de ned by A = MT ? NT ; MT = M (I ? R)?1 ? N2; and NT = M (I ? R)?1 R + N1 ; (2.23) is convergent. Moreover, A = MT ? NT is a weak regular splitting. Since the use of the induced splitting M = F ? G is a technical device, conditions on it are not expected to be directly veri ed. We will see later that certain iterative methods, including the two-stage method, induce splittings satisfying the hypothesis of the theorem. Proof of Theorem 3. Since F ?1 0, N2 0 and M ?1 N2 is convergent, the inequality M ?1 N2 = (I ? R)?1 F ?1 N2 F ?1 N2 0 implies that F ?1 N2 is a convergent nonnegative matrix. Therefore, since F ?1 0 and N1 0, T = (F ? N2)?1 (G + N1) = (I ? F ?1 N2)?1 (R + F ?1 N1) 0: Let S = (M ? N2)?1 N1, then
(I ? T )?1 = A?1 MT = (M ? N1 ? N2 )?1 M (I ? R)?1 ? N2 = (I ? S )?1(M ? N2)?1 M (I ? R)?1 ? M ?1 N2 = (I ? S )?1(I ? M ?1 N2)?1 (I ? R)?1 R + I ? M ?1 N2 = (I ? S )?1 (I ? M ?1 N2)?1 (I ? R)?1 R + I 0; where we have used Lemma 6 twice. Therefore, by Lemma 1, T is convergent. Furthermore, MT?1 = (I ? F ?1 N2)?1F ?1 0. Since N2F ?1 is similar to F ?1 N2, they share eigenvalues and thus (N2F ?1 ) = (F ?1 N2 ) < 1, and since GF ?1 0, NT MT?1 = (M (I ? R)?1 R + N1)(I ? (I ? R)M ?1 N2)?1 (I ? R)M ?1 = M (I ? R)?1R(I ? F ?1 N2)?1 F ?1 + N1(I ? F ?1 N2)?1 F ?1 = G(I ? F ?1 N2)?1 F ?1 + N1(I ? F ?1 N2)?1 F ?1 = GF ?1 (I ? N2F ?1 )?1 + N1(I ? F ?1 N2)?1 F ?1 0; 2 Note that the splitting A = MT ? NT is not necessarily regular since NT may have some of the negative entries that may have been in G. In the remainder of this section we will apply Theorem 3 to practical iterative methods. In the following corollary we present conditions for convergence of two-stage methods independent of the number of inner iterations, p. Corollary 1 Let A = M ? N1 ? N2 be a convergent regular composite splitting and let p be a nonnegative integer. If M = B ? C is a weak regular splitting, then the two-stage iterative method is convergent, and its induced splitting is weak regular. Proof. To apply Theorem 3 it suces to show that M = F ? G is a weak regular splitting with F = M (I ? (B ?1 C )p)?1 and G = F ? M ; see Lemma 8. First note that F ?1 = (I ? (B ?1 C )p)M ?1 = (I ? (B ?1 C )p)(I ? (B ?1 C ))?1 B ?1 =
pX ?1 i=0
(B ?1 C )i B ?1 0:
2.4. ITERATION INDUCED SPLITTINGS
13
Since M = B ? C is a weak regular splitting B ?1 C 0 and CB ?1 0. Thus F ?1 G = (B ?1 C )p 0, and using Lemma 5
GF ?1 = M (I ? (B ?1 C )p)?1 (B?1 C )p (I ? (B ?1 C )p)M ?1 = M (B ?1 C )pM ?1 = (CB ?1 )p 0: 2 The following example shows that the hypotheses of Corollary 1 (and by extension those of Theorem 3) can not be weakened. Let Tp be the iteration matrix for a two-stage method with p inner iterations.
Example 2
3 3 2 2 1 ?0:5 0 1:25 ?1 0:25 A = 64 ?1 1:5 ?1 75 ; M = 64 ?0:5 1 ?0:5 75 ; B = I; N = 0; 0 ?0:5 1 0:25 ?1 1:25 3 2 3 2 ? 0:25 1 ?0:25 0 0:5 0 M ? N = R = C = 64 0:5 0 0:5 75 ; T = F ? N + R = 64 1 ?0:5 1 75 : ?0:25 1 ?0:25 0 0:5 0 Thus, A = M ? N is a weak regular splitting and M = B ? C is a regular splitting. In this 2
1
1
1
1
1
example (M ?1N1) = (R) = 0:71 and thus the inner and outer iterations are both convergent; but (T1) = 1:9 and the two-stage iteration with 1 inner iteration is not convergent. Thus in this example a convergent iteration within a convergent iteration is divergent. Rodrigue [47, Theo. 4.1] showed that when N2 = 0, p = 1 and both the outer and the inner iterations correspond to regular splittings, then the overall method is convergent and, in particular, induces a regular splitting. For the case p > 1 and the same hypothesis, as the following example shows, the global induced splitting may not be regular.
Example 3 A=M = H = B ?1 C =
" "
#
2 ?1 ; B = ?1 2
#
"
#
2 0 ; 0 2
0 0:5 ; N = M (I ? H 2)?1 H 2 T2 0:5 0
"
C= =
"
#
0 1 ; 1 0 2 3 1 3
?
?
1 3 2 3
#
:
The fact that NT2 has negative entries also implies that the conditions in the comparison theorem by Elsner [17] are not satis ed either; see Section 2.3. The theory we are developing can be extended to the case of recursive inner iterations. We call these Nested Iterative Methods. Consider the solution of the system Ax = b by a two-stage method with the outer iteration de ned by the composite splitting A = M ? N1 ? N2. Then, instead of solving the system Mx = r(xk ; xk+1; b) by an iterative method, it is solved by a two-stage method, and so on. This implies the speci cation of a new composite splitting, with iteration matrix R, R = F ?1 G, M = F ? G, and the number of iterations (p) at each level. One could interpret M = F ? G as (2.20-2.21). Of course, at the last (innermost) level, the system is solved by an iterative method, say with M = B ? C . After the formal recursive de nition below, we show that under conditions similar to those in Corollary 1, nested iterations are convergent independent of the number of iterations at each level.
CHAPTER 2. NESTED ITERATION THEORY
14
De nition 3 Consider the system Ax = b. Let A = M ? N ? N be a composite splitting and p 1
2
be a nonnegative integer. In a Nested Iteration the equation Mx = b + N1xk + N2xk+1 is solved by either, 1. p steps of an iterative method (note that the eect of this is to solve Ax = b by a two-stage iteration), or, 2. p iterations of the Nested Iteration with iteration matrix R, (replace A by M and b by (b + N1xk + N2xk+1 ) and apply the de nition again).
Corollary 2 At each level let the rede ned A = M ? N1 ? N2 (i.e. in part 2 of the de nition) be a convergent regular composite splitting and at the innermost level let M = B ? C be a weak regular splitting (i.e., in part 1 of the de nition). Then the corresponding nested iterative method is convergent. Proof. By induction on the number of levels of nesting. Base: This is the two-stage method, and convergence was proved in Corollary 1. It was also shown that the induced splitting is weak regular. Inductive Step: Assume that the induced splitting of R on M = F ? G is weak regular. Then, using arguments similar to those in Corollary 1, F^ = M (I ? Rp)?1 and G^ = M (I ? Rp )?1 Rp form a weak regular splitting. Therefore, by Theorem 3 the method is convergent and if T is the iteration matrix, A = MT ? NT is a weak regular splitting, where MT = A(I ? T )?1 and NT = MT ? A. 2 Consider the particular case of nested iteration, where only one step of the rst inner iteration is performed. We show in the following theorem that such a method can be represented by a two-stage method corresponding to a particular composite splitting. Moreover, the result can be extended, by a simple induction argument, to show that if there are multiple levels of nested iteration with one inner iteration at every level (except the last) then the whole iteration reduces to a two-stage method. This is a natural situation, and the theorem shows that no matter how many levels of one iteration are used, the nested method can be viewed simply as a two-stage method.
Theorem 4 Consider the solution of Ax = b by a nested iteration with A = M ? N ? N . Assume 1
2
that the (outer) system Mx = b + N1 xk + N2 xk+1 is solved by only one step of the (inner) two-stage iteration corresponding to the composite splitting M = M^ ? N^1 ? N^2 . Assume further that the corresponding (inner) system is solved with p iterations of the method corresponding to the splitting M^ = F ? G. Then the resulting iterative method is given by the composite splitting A = M ? N~1 ? N~2 where N~1 = N1 + N^1 and N~2 = N2 + N^2 .
Proof. Let R = F ?1 G. From equation (2.19) we have that the inner iteration matrix is T1 = (F1 ? N^2)?1 (G1 + N^1), where F1?1 G1 = Rp = (F ?1 G)p. Thus MT1 = F1 ? N^2 and NT1 = G1 + N^1. The iteration matrix for the global (outer) iteration is T2 = (F2 ? N2 )?1 (G2 + N1), where M = F2 ? G2 is the splitting corresponding to the inner iteration, and F2 = MT1 and G2 = NT1 . Therefore T2 = (F1 ? N^2 ? N2)?1(G1 + N^1 + N1). The theorem follows by substituting N~1 = N1 + N^1 and N~2 = N^2 + N2. 2
2.5 Monotonicity In the previous section we gave sucient conditions for the convergence of nested iterative methods. In this section we show that, under the same conditions, if the spectral radius of the inner iteration
2.5. MONOTONICITY
15
matrix decreases (e.g., increasing the number of inner iterations) then, so does the spectral radius of the global iteration matrix. For example, in the case of two-stage methods, the spectral radius of the global iteration matrix, (Tp), is a monotonically decreasing function of p. This result is intuitive but, as we see later, if the conditions shown are not satis ed, the result may not hold. The main tool in our proofs is our comparison theorem, Theorem 2. Theorem 5 Let A = M ? N1 ? N2 be a convergent regular composite splitting. Let M = F^ ? G^ = F~ ? G~ be weak regular splittings and let R^ = F^ ?1 G^ , and R~ = F~ ?1 G~. Consider, as in Theorem 3, the iterative method de ned by (2.23) with corresponding global iteration matrices T^ and T~. If (R^) (R~ ) and F^?1 F~?1 then (T^) (T~). Proof. Let x be the Frobenius vector for a global iteration matrix T and let = (T ), ^ = (T^), and ~ = (T~). Since T = (I ? F ?1 N2)?1 (R + F ?1 N1 ) R then, (R) . Consider rst the case ^ = (R^). Then, ^ = (R^) (R~) ~ and the theorem is proved. If, on the contrary, (R^ ) < ^, or in our generic notation (R) < , then, (I ? R)?1 exists and is nonnegative. Thus,
x MT x ? 1 (M (I ? R) ? N2)x (I ? R)?1 x (I ? R)?1 (I ? R)x (I ? R)?1 x
= = = = = =
MT?1 NT x NT x M (I ? R)?1 Rx + N1x (I ? R)?1 Rx + M ?1 (N1 + N2)x M ?1 (N1 + N2)x (I ? R)?1M ?1 (N1 + N2)x:
We may now replace the (I ? R)?1x factor in NT x as follows:
NT x = M (I ? R)?1Rx + N1x = MR(I ? R)?1 M ?1 (N1 + N2)x + N1 x = R (I ? R )?1 (N1 + N2)x + N1x 0
where R = GF ?1 0 and the last equality follows from Lemma 5. Finally, since F^ ?1 F~ ?1 implies, by Lemma 7, that M^T ?1 = (I ? F^?1 N2)?1 F^ ?1 (I ? F~ ?1 N2 )?1 F~ ?1 = M~T ?1 , we conclude, by Theorem 2, that ^ ~. 2 The following corollary shows that (Tp) is a monotonically decreasing function of p. Although we prove it for a restricted class of nested iterative methods, the corollary applies, in particular, to two-stage methods, to iterative block Gauss-Seidel and, because of Theorem 4, to the special case of one iteration per level, except the last.
Corollary 3 Consider the solution of Ax = b by a nested iteration (De nition 3). Let at each level A = M ? N ? N be a convergent regular composite splitting and at the innermost level let M = B ? C be a weak regular splitting. Let q and p be nonnegative integers. Consider two nested 1
2
iterations, diering only in the number of inner iterations at the outer iteration, p in one case and q in the other. If q p, then (Tq) (Tp).
Proof. If there are only two levels of nesting, i.e., the iteration is a two-stage method, with inner splitting M = B ? C , then de ne R = B ?1 C . Otherwise, de ne R as the iteration matrix for the inner nested iteration. Recall from Corollary 2 that R is convergent. Clearly (Rq ) (Rp).
CHAPTER 2. NESTED ITERATION THEORY
16
Consider the splittings induced on M by Rp = Fp?1 Gp and Rq = Fq?1 Gq ; see Lemma 8. Then,
Fq?1 = (M (I ? Rq )?1 )?1 = (I ? Rq )(I ? R)?1 F ?1 =
pX ?1 i=0
qX ?1 i=0
Ri F ?1
RiF ?1 = (M (I ? Rp )?1 )?1 = Fp?1 :
Therefore by Theorem 5 (Tp) (Tq ). 2 As we pointed out, this corollary applies in particular to two-stage methods. Intuition indicates that if more inner iterations are performed, i.e., if we have a better approximation to the exact solution of (2.2) at each outer step, then the method should converge faster. A closer look at Example 2 reveals that 3 2 3 2 ? 0:125 0:75 ?0:125 0:25 0:25 0:25 T2 = 64 0:25 0:50 0:25 75 T3 = 64 0:75 ?0:25 0:75 75 ?0:125 0:75 ?0:125 0:25 0:25 0:25 and (T2) = 0:85, (T3) = 1:3. Thus if the conditions of Theorems 3 and 5 are violated, not only might there not be convergence for a small number of inner iterations, but the spectral radii of the iteration matrices may not be monotonically decreasing. In the nal result of this section we extend the point Stein-Rosenberg theorem [49] to the methods studied here. Theorem 6 (Nested Stein-Rosenberg) Let A = M ? N1 ? N2 = M ? N^1 ? N^2 be convergent regular composite splittings. Consider the solution of Ax = b by a nested iteration. Assume that the inner systems Mx = r(xk ; xk+1 ; b) are solved by a nested iteration with iteration matrix R and induced splitting M = F ? G. Let the corresponding iteration matrices be T = (F ? N2)?1 (G + N1) and T^ = (F ? N^2)?1 (G + N^1). If N2 N^2 then (T ) (T^). Proof. The theorem follows from G + N1 G + N^1 and Lemma 9. 2 The name of the theorem is inspired by the following observation. If N^2 = 0, we have an iterative (block) Jacobi type method, while the case N2 6= 0 yields an iterative block Gauss-Seidel type method.
2.6 Varying the Number of Inner Iterations In this section we study dynamic nested iterative methods, including two-stage and iterative block Gauss-Seidel methods, where the number of inner iterations, p, may vary at each outer iteration. In other words, pk inner iterations are performed at (outer) step k, i.e., the following algorithm, where A = M ? N1 ? N2, M = F ? G. Note that pk refers only to the outer iteration, the nested iteration below is not dynamic.
Algorithm 3 (Dynamic Nested Iteration) for k = 0; 1; : : :
y0 = x k for j = 0 to pk ? 1 Fyj+1 = b + N1 xk + N2 xk+1 + Gyj xk+1 = yp
2.6. VARYING THE NUMBER OF INNER ITERATIONS
17
The main result of the section is Theorem 8, where we show that under certain conditions, Algorithm 3 is convergent. Dynamic nested iteration can be viewed as the concatenation of dierent nested methods, each with dierent iteration matrix Tpk . Thus, for r outer iterations, the global iteration matrix for the dynamic nested iteration is the product Tpr Tpr?1 Tp1 . The diculty in analyzing this method lies in the fact that the product of convergent matrices may not be convergent. The following series of de nitions and results provide the background material for Theorem 8 and extend some results in [46]. The proof of Theorem 8 is fairly technical, since it introduces a special set of matrices (De nition 6) and a series of scaling matrices that allow certain inequalities to hold. These techniques might prove useful in other contexts as well. P Consider blockings of A = (Aij ), i; j = 1; : : :; q , where the size of block Aij is si sj , qi=1 si = n, the order of the matrix A. Similarly, a vector can be partitioned into v T = (v1 ; v2; : : :; vq )T , where vi has dimension si . It is well known, see for example [6, 21, 29, 49], that there exists a permutation matrix P such that L = PAP T is block lower triangular, i.e. Lij = 0 if j > i, with Lii either irreducible or a 1 1 null matrix. This form is called a normal form of A and it is essentially unique, i.e., unique up to permutations within the diagonal blocks and of the ordering of certain diagonal blocks; see the mentioned references. This implies that the permutation P is not unique. De nition 4 If PAP T is a normal form of A, then P is called a Canonical Permutation Matrix for A. Lemma 10 Let P be a canonical permutation matrix for the nonnegative matrix A, with L = PAP T and let b = (Lbb ), b = 1; 2; : : :; q. Then (A) = maxb b. Proof. See, for example, [29, p.23]. Lemma 11 Let P be a canonical permutation matrix for the nonnegative matrix A, with L = PAP T and let b = (Lbb), b = 1; 2; : : :; q. If for exactly one block, say a, (A) = a = maxb b , then the Frobenius vector for L has the form w = (0; 0; : : :; 0; va; wa+1; wa+2; : : :; wq )T , where va is the Frobenius vector for Laa. Proof. Clearly (Lw)b = (aw)b for all b a. For b > a assume inductively that (Lw)c = (aw)c for all c < b. SincePa > b , it follows thatP(aI ? Lbb )?1 exists and is nonnegative. Therefore ?1 L w is such that b L w = w . 2 wb = (aI ? Lbb )?1 bi=1 bi i a b i=1 bi i De nition 5 Let P be a canonical permutation matrix for A. We say that B is Triangular Conformable with A if LB = PBP T is such that (LB )ij = 0 for j > i. This de nition implies that B has a triangular form embedded in the normal form of A. The diagonal blocks of LB may be reducible. Note that the permutation P does not necessarily de ne the normal form of B . De nition 6 Let P be a canonical permutation matrix for T . We de ne A(T ) := fAjA is triangular conformable with T and if (PTP T )ii = 0 then (PAP T )ii = 0g. Note that A(T ) is closed under addition and multiplication. Lemma 12 Let fTig be a collection of matrices in A(T ), with L = PTP T and Li = PTiPQT . Then, for any xed set fi1; : : :; ir g, the spectral radius of r = Ti1 Ti2 Tir is (r ) = maxa ( rj=1 Liaaj ). Proof. The lemma follows from
0
1
0
1
Yr Yr ij A T T @ A @ (r) = (P r P ) = PTij P = max Laa : a j =1 j =1
2
CHAPTER 2. NESTED ITERATION THEORY
18
Theorem 7 (Robert, Charnay, Musy) Let fTig be a collection of convergent nonnegative matrices. If there exists a real 0 < 1 and a strictly positive vector v > 0 such that Ti v v for all i, then, for any xed set fi ; : : :; irg, the spectral radius of r = Ti1 Ti2 Tir is bounded by r < 1. 1
Proof. see [46, Theorems 2 and 3] and Lemma 3. If the matrices Ti in Theorem 7 are reducible, it may be dicult to nd the strictly positive vector v > 0 required in the hypothesis. In the following corollary we extend this theorem to reducible matrices that are triangular conformable with a given matrix. Note that this requirement is much less restrictive than having the same normal form.
Corollary 4 Let fTig be a collection of convergent nonnegative matrices in A(T ), with L = PTP T
and Li = PTi P T . Let va be the sa -dimensional Frobenius vector of Laa , with a = (Laa) < 1, for every a. If for every j , Ljaa va Laava = ava then, for any xed set fi1; : : :; irg, (r ) = (Ti1 Ti2 Tir ) maxa ra < 1.
Q
Proof. Consider the product r = Ti1 Ti2 Tir . By Lemma 12, (r ) = maxa ( rj=1 Liaaj ). There Q are two cases to be considered. If Laa = 0 then rj=1 Liaaj = 0. If Laa 6= 0 then Laa is irreducible Q and therefore va > 0. This allows us to apply theorem 7, to conclude that ( rj=1 Liaaj ) ra < 1 for every a and that therefore (r ) maxa ra < 1. 2
Theorem 8 If A = M ? N1 ? N2 is a convergent regular composite splitting and M = F ? G is a regular splitting, then the Dynamic Nested Iterative Method with r iterations is convergent for every r. Moreover, the method is no slower than the case when pk = 1; k = 1; : : :; r. In other words, (r ) r1 < 1, where r = Tpr Tpr?1 Tp1 , Tpk is the iteration matrix (2.19) corresponding to pk , and 1 = (T1). Proof. We have proved in Corollary 3 that (T1) (Tp) for all p. Let P be a canonical permutation matrix for T1, L = PT1 P T and Li = PTi P T . The proof proceeds in two parts. First we show that Tp 2 A(T1) for all p. Then we show that if a; va is the Frobenius eigenpair corresponding Laa , i.e., ava = Laava , then for every i, Liaava ava and by Corollary 4 that (r) = (Tpr Tpr?1 Tp1 ) maxa ra = r1 < 1. Part 1: First consider T1. Let R = F ?1 G,
T1 = (I ? F ?1 N2)?1 (F ?1 N1 + R) 1 X = F ?1 N1 + R + (F ?1 N2)i (F ?1 N1 + R)
(2.24)
i=1
Therefore, since F ?1 N1 , R, and F ?1 N2 are nonnegative, no cancellation can occur. It follows then that
R; F ?1 N1; (F ?1 N2)i F ?1 N1; (F ?1 N2)i R 2 A(T1 ):
(2.25)
Now consider Tp. Let
Qp
= (I ? Rp)M ?1 = (I ? Rp)(I ? R)?1F ?1
=
pX ?1 i=0
Ri F ?1 ;
(2.26)
2.6. VARYING THE NUMBER OF INNER ITERATIONS
19
then
Tp = (I ? Qp N2)?1 (QpN1 + Rp )
0 p? 1? 0p? 1 X X = @I ? Ri F ? N A @ Ri F ? N + RpA 0 1 pi ? 1 0pi? 1 X X X = @ ( Ri F ? N )j A @ Ri F ? N + Rp A 1
1
1
1
1
2
=0
1
=0
1
1
1
1
2
j =0 i=0 (F ?1 N2 )j is
i=0
1
and since every instance of followed by either F ?1 N1 or Ri (i; j > 0) it follows from (2.25) and the fact that A(T1) is closed under addition and multiplication that Tp 2 A(T1). Part 2: Consider the splitting induced by Tp on A = M (p) ? N (p), with M (p) = Q?p 1 ? N2 and N (p) = N1 + Q?p 1 Rp ; see Lemma 8. Note that
M (p)?1 = (I ? Qp N2)?1 Qp
0 p? 1? p? X X i ? = @I ? Ri F ? N A RF 1
1
1
1
2
1
i=0 i=0 ? 1 ? 1 ?1 (I ? F N ) F = M (1)?1:
(2.27) There are two cases to consider. If Laa = 0, for some a, then by part 1, Lpaa = 0 for all p. Thus, by Lemma 12, (P P T )aa = 0 and therefore does not enter into calculations of (). We shall therefore assume that Laa 6= 0 for all a. Let a; va be the Frobenius eigenpair of the submatrix Laa, i.e., Laava = ava. We have proved in Section 2.4 that T1 is convergent, thus a < 1 for all a. De ne q diagonal matrices E (a); a = 1; 2; : : :; q , called scaling matrices, where Eb(a) is the sb sb matrix e(a; b)Isb , b = 1; 2; : : :; q , e(a; b) 2 IR, e(a; b) = 1; if b < a or b = a 2
= a +
a b
? a 2
; otherwise:
Note that 1 e(a; b) > a for all a; b. Consider E (a)PT1P T , E (a) has the eect of scaling the bth block row of PT1P T by e(a; b) 1. Every diagonal block with a spectral radius less than a is left unchanged. Every diagonal block, b, with a spectral radius greater than a is scaled by e(a; b) giving a new spectral radius e(a; b)b = (b + 1) 2a < a. Thus (E (a)PT1P T ) = maxb ((E (a)PT1P T )bb) = a. Let wa be the Frobenius eigenvector of E (a)PT1P T , i.e., corresponding to a . Since all diagonal blocks except the ath block have spectral radius less than a, then, by Lemma 11, waT = (0; 0; : : :; 0; va; (wa)a+1 ; : : :; (wa)q )T . Note that, since M = F ? G is a regular splitting, N (1) = N1 + G 0 and thus E (a)PT1P T wa = awa E (a)PM (1)?1N (1)P T wa = awa N (1)P T wa = aM (1)P T E (a)?1wa PM (1)P T E (a)?1wa = 1 PN (1)P T wa 0 a
We wish to show PAP T E (a)?1wa 0. PAP T E (a)?1wa = (PM (1)P T E (a)?1 ? PN (1)P T E (a)?1)wa
CHAPTER 2. NESTED ITERATION THEORY
20
= ( 1 PN (1)P T ? PN (1)P T E (a)?1)wa a = PN (1)P T ( 1 I ? E (a)?1)w 0
a a where the last inequality follows from 1 e(a; b) > (a) for all a; b. From (2.27) it follows that PM (p)?1 P T PM (1)?1P T , thus
PM (p)?1 P T (PAP T E (a)?1wa) PM (p)?1 AP T E (a)?1wa P (I ? Tp)P T E (a)?1wa PT1P T E (a)?1wa
PM (1)?1 P T (PAP T E (a)?1wa) PM (1)?1 AP T E (a)?1wa P (I ? T1)P T E (a)?1wa PTp P T E (a)?1wa:
Since (PTpP T E (a)?1)aa = Lpaa, and since wa;j = 0, for j < a, (PTp P T E (a)?1wa)a = Lpaava it follows that a va = Laa va Lpaa va for all a. Therefore, by Corollary 4, (r ) = (Tpr Tpr?1 Tp1 ) maxa ra = r1 < 1. 2
2.7 Nested Block Iterative Methods In this section we consider nested block iterative methods. We focus on the case when a dierent number of inner iterations is performed in each block. One motivation is the situation in which the solution of some systems corresponding to diagonal blocks do not need as many inner iterations to achieve the same accuracy as those of other blocks. This applies, in particular, to the solution of block methods where the stopping criterion is related to the overall accuracy required in the norm of the residual. Consider the solution of Ax = b. Let A = M ? N1 ? N2 be a composite splitting with M = F ? G, F , and G block diagonal. Consider the matrices here partitioned into q q submatrices consistent with a blocking of A = (Aij ), i; j = 1; : : :; q . We assume here that the partition is such that the diagonal blocks Aii are nonsingular. Note that A is not necessarily in normal form and in particular it is not necessarily block triangular. In the block Gauss-Seidel method we usually have N2 strictly block lower triangular and N1 strictly block upper triangular. In what follows we do not make this assumption, although the general case may rarely occur in practice. In the following algorithm, subscripts are used to indicate subvectors or diagonal blocks of the matrix, while superscripts of vectors refer to the iteration number.
Algorithm 4 (Nested Block Iteration ) for k = 0; 1; : : :
y (0) = y = (y1 ; y2; : : :; yq)T = xk for i = 1 to q for j = 0 to pik ? 1 Fiyi(j+1) = (b + N1xk + N2xk+1 )i + Gi yi(j) (xk+1 )i = yi(pik )
In the remainder of this section we will analyze this method in order to present conditions for convergence and for a monotonicity rule for the simpler case pik = pi , for all k. Later we show convergence for the general case. Note that if for every i; pik = pk , Algorithm 4 reduces to
2.7. NESTED BLOCK ITERATIVE METHODS
21
Algorithm 3 and that if for every i; k; pik = p, it reduces to Algorithm 2. Thus the proofs in this section closely resemble the proofs of similar results in the previous sections. Since F; G are block diagonal, Ri = Fi?1 Gi , where as before R = F ?1 G. The equation to update a given component, i.e., the loop over j , may be represented as (xk+1 )i = (Isi ? Rpi ik )Mi?1 (b + N1xk + N2xk+1 )i + Rpi ik (xk )i : (2.28) Let Pk = fp1k ; p2k ; : : :; pqk g and de ne 3 2 (Is1 ? R1pi1 )M1?1 0 0 77 66 0 0 (Is2 ? R2pi2 )M2?1 77 ; (2.29) 6 Q(Pk ) = 6 .. .. .. ... 5 4 . . .p iq ? 1 0 0 (Isq ? Rq )Mq 3 2 p1k 66 R01 R0p22k 00 77 (2.30) R(Pk) = 66 .. .. . . . .. 77 ; . 5 . 4 . 0 0 Rpq qk from where it follows that Q(Pk ) = (I ?R(Pk ))M ?1 . With these de nitions it easy to see that the iteration matrix for the nested block iteration is given by: H (Pk ) := (I ? Q(Pk )N2)?1 (Q(Pk )N1 + R(Pk )): (2.31) Note that if for every i; k; pik = p, Pk = P , and Q(P ) reduces to Qp as in (2.26), R(P ) = Rp, and H (Pk ) = Tp . Of course we may apply Lemma 8 to show that MH (Pk ) = Q(Pk )?1 ? N2; NH (Pk ) = Q(Pk )?1 R(Pk ) + N1 de ne the unique splitting satisfying A = MH (Pk ) ? NH (Pk ), and H (Pk ) = MH (Pk)?1 NH (Pk ). In the following two theorems we assume pik = pi, for all i; k. Theorem 9 If M = F ? G is a weak regular splitting, and A = M ? N1 ? N2 is a convergent regular composite splitting, then (H (P )) < 1, where H (Pk ) is de ned in equation (2.31). Proof. Notice that MH (P ) and NH (P ) have the same form as MT and NT (equations (2.20) and (2.21)), respectively, with Q(P )?1 replacing F and Q(P )?1R(P ) replacing G. We will show that M = Q(P )?1 ?Q(P )?1R(P ) is a weak regular splitting and the theorem will follow from Theorem 3. Clearly Q(P ), and R(P ) are nonnegative, so it remains to show that Q(P )?1R(P )Q(P ) 0. Q(P )?1R(P )Q(P ) = M (I ? R(P ))?1R(P )(I ? R(P ))M ?1 ?1 ?1 = M 2 R(P )?M1 p1 = F R(P )F 3 0 0 (G1F1 ) 66 77 0 (G2F2?1 )p2 0 6 77 = R (P ) 0: 2 = 6 .. .. .. ... 4 . . . 5 ? 1 pq 0 0 (GqFq ) Theorem 10 Let P = fp1; p2; : : :; pqg and P^ = fp^1; p^2; : : :; p^qg where pi p^i for all i. If M = F ? G is a weak regular splitting, and A = M ? N1 ? N2 is a convergent regular composite splitting then (H (P^ )) (H (P )).
CHAPTER 2. NESTED ITERATION THEORY
22
Proof. Note that Q(P^ ) Q(P ), and that (R(P^ )) (R(P )). Thus by Theorem 5, (H (P^ )) (H (P )). 2 We consider now the convergence of the nested block iterative method in the general case.
Theorem 11 Let M = F ? G be a regular splitting and A = M ? N ? N be a conver1
2
gent regular composite splitting. Algorithm 4 is convergent and no slower than the case when pik = 1; i = 1; : : :; q; k = 1; : : :; r. In other words, (r ) r1 < 1, for all r, where r = H (Pr )H (Pr?1) H (P1), and 1 = (T1).
Proof. The proof is analogous to the proof of Theorem 8. Let E = f1; 1; : : :; 1g. Note that Q(E ) Q(Pk ) and by Theorem 10 (H (E )) (H (Pk)), for all k. Let P be the canonical permutation matrix for H (E ). Let L = PH (E )P T and LPk = PH (Pk )P T . Part 1: Note that H (E ) = T1 as in (2.24), with MH (E ) = M (1) and NH (E ) = N (1). Therefore, as in (2.25),
R(E ); Q(E )N ; (Q(E )N )iQ(E )N ; (Q(E )N )iR(E ) 2 A(H (E )): (2.32) Furthermore since R(E ) 2 A(H (E )) it follows from (2.30) that R(Pk ) 2 A(H (E )). Consequently the graph of (I ?R(Pk ))M ? has no more connected components than the graph of (I ?R(E ))M ? 1
2
1
2
1
1
and therefore we may modify (2.32) to be
R(Pk); Q(Pk)N ; (Q(Pk)N )iQ(Pk )N ; (Q(Pk)N )iR(Pk) 2 A(H (E )): Thus H (Pk ) 2 A(H (E )) for all Pk . 1
2
1
2
Part 2: The second part of the proof is the same as in the proof of Theorem 8 with the substitution of MH (P ) for M (p) and H (Pk ) for Tp , respectively. 2
2.8 Use of Unique Splitting Lemma We feel that the unique splitting lemma, Lemma 8, may have a very large range of uses. Indeed we have used the lemma many times throughout this chapter. In this section we give an example of another use of the unique splitting lemma. We would once again like to solve the system of linear equations Ax = b. Consider the semiiterative method examined in [15]
xk+1 = b + 0 Txk + 1xk + 2xk?1 + : : : + p xk?p+1 where T is an n n matrix and the i are scalars such that concern ourselves here only with p = 2, i.e.,
P = 1 (for consistency). We will i
xk+1 = b + 0 Txk + 1xk + 2 xk?1 : It is easily seen that the iteration may be rewritten as
"
xk+1 xk
#
=
"
0 T + 1 I 2 I I 0
#"
(2.33)
(2.34)
# " #
xk + b 0 xk?1
(2.35)
or
yk+1 = Jyk + b0:
(2.36)
2.8. USE OF UNIQUE SPLITTING LEMMA
23
The xed point of this iteration is the solution to the system of equations
A0 y =
"
A 0 0 A
#"
# " #
x = b : b x
(2.37)
Thus for the semi-iterative method to converge it is necessary that (J ) < 1. We assume that J 0, in particular that T 0 and i 0 for all i. By the unique splitting lemma there is a splitting of A0 = M ? N such that J = M ?1 N . As was shown in the unique splitting lemma M = A0(I ? J )?1 and N = A0 (I ? J )?1 J . Let S = [(1 ? 1 ? 2 )I ? 0 T ]?1 : One can show that " #" # S 0 I 2 ? 1 (I ? J ) = 0 S (2.38) I (1 ? 1 )I ? 0 T Thus
"
M = A0 A0 and
"
#"
N = MJ = A0 A0
S 0 0 S
#"
#"
S 0 0 S
#
I 2 I (1 ? 1 )I ? 0 T ;
#"
(2.39)
#
0 T + 1 I + 2 I 2 I ; I 2 I
(2.40)
It is unclear whether M ?1 and N are nonnegative, and we can not therefore know whether the matrices above constitute a weak regular splitting of A0 . However, if 1?10?2 (T ) then (1 ? 1 ? 2 )I ?0 T is an M-Matrix and hence inverse nonnegative. We therefore conclude that (A0 )?1 N 0. To show convergence under these conditions we follow the proof of Varga [49] section 3.6. J = M ?1 N = (A0 + N )?1 N = (I + G)?1G (2.41) where G = (A0)?1 N . Consider an eigenvector, v , of G with corresponding eigenvalue then
Jv = (I + G)?1Gv = 1 + v = v
(2.42)
Gw = 1 ? w:
(2.44)
from which we conclude that v is also an eigenvector of J . Conversely if w is any eigenvector of (I + G)?1G with (I + G)?1Gw = w then Gw = (I + G)w (2.43) and thus 6= 1 which implies Because G 0 it follows that the largest eigenvalue of G is nonnegative, i.e., 0. By the same token the largest eigenvalue of J must be nonnegative. But 1? < 0 for all > 1 thus (J ) < 1. From which we conclude the following theorem.
Theorem 12 The iteration?given by equation (2.34) is convergent to the solution of Ax = b if T 0, ; ; 0 and 10?2 > (T ). 0
1
2
1
CHAPTER 2. NESTED ITERATION THEORY
24
2.9 Derivation of SOR within SOR Iteration Matrix In this section we derive the iteration matrix associated with doing Successive Over Relaxation on the inner and on the outer iteration. An SOR step for the k +1 iterate of the ith component of the solution is given by rst computing
x~ = b ?
X ji
aij x(jk)
(2.45)
and then the new value is given by
x(ik+1) = !x~ + (1 ? !)x(ik)
(2.46)
where ! is the relaxation factor. The iteration matrix for this method, as derived for instance in [49] section 3.1, is L! = (D ? !L)?1(!U + (1 ? !)D) (2.47) where A = D ? L ? U , D the (invertible) diagonal of A, L strictly lower triangular and U strictly upper triangular. Recall that the iteration matrix for nested Block Gauss Seidel is Tp = (Q?p 1 ? L)?1 (U + Q?p 1 Rp ) (2.48) p ? 1 Qp = (I ? R )D (2.49) where D, L and U are as de ned previously. The inner iteration is de ned by iteration matrix R. No restrictions are put on R so in particular we can consider the inner iteration matrix to be the iteration matrix associated with solving D by SOR. Let us de ne R!I as that iteration matrix and let the induced splitting be D = F ? G, F ?1 G = R!I . Thus the nested SOR procedure is given by y0 = xk (2.50) Fyj+1 = (b + Lxk+1 + Uxk ) + Gyj (2.51) x~k+1 = yp : (2.52) The inner iteration may be rewritten as x~k+1 = (I ? Rp )D?1 (b + Lxk+1 + Uxk ) + Rpxk (2.53) where R = F ?1 G. Finally putting together equations (2.46) and (2.53) we get xk+1 = !O ((I ? Rp)D?1 (b + Lxk+1 + Uxk ) + Rp xk ) + (1 ? !O )xk : (2.54) Collecting terms (Q?p 1 ? !O L)xk+1 = !O b + (!O (U + Q?p 1 Rp) + (1 ? !O )Q?p 1)xk (2.55) Qp = (I ? Rp )D?1 (2.56) Therefore the iteration matrix for outer SOR is Tp;!O = (Q?p 1 ? !O L)?1(!O (U + Q?p 1 Rp) + (1 ? !O )Q?p 1): (2.57) There are a couple of things to note about equation (2.57). The rst is that if p ! 1 then, since Q?11 = D, Tp;!O reduces to T1!O = (D ? !O L)?1 (!O U + (1 ? !O )D) (2.58)
2.10. SOR STRAIGHT ANALYSIS - KAHAN
25
the iteration matrix of block SOR, where a direct solve is done on the blocks. The second thing to note is the case p = 1 with inner iteration point SOR with relaxation parameter !I = !O = ! . Let D = E ? F ? G, and R! = ( !1 E ? F )?1 ( 1?!! E + G) where we have pulled out an ! because we want to be solving Dx = y as opposed to !Dx = !y . Note that in this case Q?1 1 = !1 E ? F . The iteration matrix is (2.59) T = ( 1 E ? F ? !L)?1 (!U + 1 ? ! E ? !D) !
!
1
!
Notice that (2.59) reduces to point GS if ! = 1, but, counter intuitively, does not reduce to point SOR if ! 6= 1. Outer SOR with an arbitrary inner iteration matrix is a very frustratring problem. In the following sections we give various bounds on the outer relaxation factor.
2.10 SOR Straight Analysis - Kahan This section is based on a derivation attributed to Kahan in [49]. In point or block SOR one can get bounds for ! (outside of which the iteration will diverge) in a very simple way. We rewrite equation (2.47) the iteration matrix for block SOR as
L! = (I ? !D? L)? (!D? U + (1 ? !)I )) 1
1
1
(2.60)
where we assume that D is block diagonal, L is strictly block lower triangular and U is strictly block upper triangular. Under these conditions D?1 L is strictly lower triangular, and D?1 U is strictly upper triangular. Let us recall some properties of determinants 1. d(AB ) = d(A)d(B ) 2. d(A) = n d(A) Q 3. d(A) = i i , i the eigenvalues of A.
Q
4. (A) = j((A))nj n i jij n = jd(A)j n . Consider d(L! ), since I ? !D?1 L is lower triangular with 1's on the diagonal the eigenvalues are all 1 and we conclude that d(I ? !D?1 L) = 1. Thus 1
1
1
d(L! ) = d(I ? !D?1 L)d(L!) = d(!D?1U + (1 ? !)I ): (2.61) By the same token !D?1 U + (1 ? ! )I is an upper diagonal matrix with (1 ? ! )'s on the diagonal and therefore all the eigenvalues of this matrix are 1 ? ! thus d(L! ) = (1 ? !)n : (2.62) From which we conclude that (L! ) j1 ? ! j or that convergence can only occur for 0 < ! < 2. Note ! in this interval does not guarantee convergence. If we try to apply this same analysis to Tp;!O we derive the following d(Tp;!O ) = d(!(QpU + Rp) + (1 ? !)I ) (2.63) Y p = (1 ? ! + !i (R)) (2.64) i
CHAPTER 2. NESTED ITERATION THEORY
26
It is clear that if R is convergent then for (Tp;!O ) to be convergent ! must be greater than 0. When ! is large enough so that j1 ? ! + !pi(R)j > 1 for all i we can conclude that the (Tp;!O ) > 1. Since ! > !pi(R) (since R is convergent) we want to know when 1 ? ! + !pi(R) < ?1
(2.65)
for all i. Clearly the right hand side of (2.65) is minimized when ?jpi (R)j ?p (R). From which we conclude that the nested iterative method will de nitely diverge for ! 1 ? 2p(R) (2.66) This result seems extremely counter intuitive, that is, how can solving the inner iteration inexactly actually allow for the use of a relaxation factor greater than that allowed by solving the inner exactly. We can however give such an example. Consider
"
#
A = ?21 20 " # 2 0 D = 0 2 " # 0 0 L = 1 0 U =0 " # " # 2 : 5 0 : 5 0 F = G = 0 :5 0 2:5
(2.67) (2.68) (2.69) (2.70)
Clearly A = D ? L ? U and D = F ? G. The inner iteration has iteration matrix
R =
"
#
:2 0 : 0 :2
(2.71)
The inner iteration matrix has spectral radius 0:2 suggesting, by equation (2.66), that ! could be as large as 2:5. It is easy to see that the iteration matrix for one inner iteration with ! as the relaxation factor is " # 2:5?2! 0 T1;! = (2.72) ! (22::55 ? 2! ) 2:5?2! : 2:5 (2:5)2 And it is clear that (T1;!) < 1 for all 0 < ! < 2:5.
2.11 Ostrowski-Reich Analysis In this section we go through the analysis of the Ostrowski-Reich Theorem for outer SOR. The objective of the original analysis was to show that if A and D were positive de nite then (L! ) < 1 for 0 < ! < 2. Let
m+1 = Tp;!O m = (Q?p 1 ? !L)?1(!(U + Q?p 1 Rp ) + (1 ? !)Q?p 1 )m: Let m = m ? m+1 . From equation (2.73) we get (Q?p 1 ? !L)m+1 ? Q?p 1 m + !Lm = ! (L + U + Q?p 1 Rp ? Q?p 1 )m :
(2.73) (2.74)
2.11. OSTROWSKI-REICH ANALYSIS
27
Note that
Q?p 1 Rp ? Q?p 1 = ?(D(I ? Rp )?1 ? D(I ? Rp )?1 Rp) = ?D(I ? Rp)?1 (I ? Rp ) = ?D;
(2.75)
thus (Q?p 1 ? !L)m = !Am :
(2.76)
Also beginning from equation (2.73) (Q?p 1 ? !L)m+1 ? Q?p 1 Rpm+1 ? !Um+1 = (! (U + Q?p 1 Rp ) + (1 ? ! )Q?p 1 )m ?(!(U + Q?p 1Rp) + (1 ? !)Q?p 1)m+1 :
(2.77) (2.78) (2.79)
Once again using equation (2.75) we have
!Am+1 = (!(U + Q?p 1 Rp ) + (1 ? !)Q?p 1 )m:
(2.80)
Combining equations (2.75) and (2.73) gives (Q?p 1 ? !L)m+1 = (Q?p 1 ? !D + !U )m :
(2.81)
Rearranging terms, multiplying on the right by m+1 and subtracting from both sides m+1 !Dm+1 yields
m+1 !Dm = m+1 !Lm+1 ? m+1 Q?p 1 m+1 + m+1 !Um + m+1 Q?p 1 m ? m+1 !Dm+1
(2.82)
De ne Am = m Am and Am+1 = m+1 Am+1 . From equation (2.76)
m (Q?p 1 ? !L)m = !mAm :
(2.83)
From equation (2.80) and equation (2.75)
!m+1 Am+1 = m+1 (!(U ? D) + Q?p 1 )m:
(2.84)
Combining equations (2.83) and (2.84) gives
!(Am ? Am+1 ) = m (Q?p 1 ? !L)m ? m+1 (!(U ? D) + Q?p 1 )m = m Q?p 1 m ? m !Lm ? m+1 ! (U ? D)m
(2.85)
Starting from equation (2.81) and redistributing yields
Q?p 1 m = ?!Lm+1 + !Dm ? !Um : Multiplying on the left by m
m Q?p 1 m = ?!m Lm+1 + !m Dm ? !m Um ;
(2.86)
CHAPTER 2. NESTED ITERATION THEORY
28 and taking the hermitian of both sides yields
m Q? p m = ?!m+1 Lm + !m Dm ? !m Um ;
(2.87)
where by Q? we mean inverse hermitian. Finally combining equations (2.85) and (2.87) gives !(Am ? Am+1 ) = m Q?p 1 m + m Q? p m ? !m Dm
(2.88)
The nal step in the Ostrowski-Reich analysis is to let 0 be an eigenvector of Tp;!O . Then 1 = 0 and 0 = (1 ? )0. Thus from equation (2.88) we have (1 ? jj2) A = 1 j1 ? j2 (Q?1 + Q? ? !D) 0
0
!
0
p
p
0
= !1 j1 ? j2 0 (Q?p 1Rp + (Q?p 1Rp ))0 + (2 ? ! )0 D0 :
(2.89)
Since A is positive de nite the left hand side is positive for jj < 1. We want to know what the conditions are that make the right hand side be positive. Certainly if 0 < ! < 2 and Q?p 1 Rp + (Q?p 1 Rp) positive semi-de nite then the equation is satis ed. Unfortunately there is great diculty in determining whether this equation is positive semi-de nite. Even in the case that R and Qp are symmetric and positive de nite there is diculty, see for instance [40]. We will leave this problem in this somewhat unsatisfactory state and merely point out that if p ! 1 then equation (2.89) becomes (1 ? jj2) A = 1 j1 ? j2 (2 ? ! )D ; 0
0
!
0
0
the result of the original Ostrowski-Reich analysis, from which one concludes that 0 < ! < 2 implies convergence.
2.12 SOR - Kulisch Analysis This analysis is derived from [26]. Recall
Tp;!O = (I ? !QpL)?1 ((1 ? !)I + !Rp + !QpU ): Consider
T~p;!O = (I ? !jQpLj)?1 (j(1 ? !)jI + !jRpj + !jQpU j); where
A~! = I ? (jQpLj + jRpj + jQpU j);
(2.90)
and = 1?j1!?!j . The induced splitting is
M~ ! = !1 (I ? !jQpLj) N~! = j(1 ?! !)j I + jRp j + jQpU j
(2.91) (2.92)
2.12. SOR - KULISCH ANALYSIS
29
where A~! = M~ ! ? N~! and T~p;!O = M~ !?1 N~! . We assume that Qp is derived from a consistent splitting of D (i.e., no o diagonal entries are introduced) and is therefore block diagonal. We further assume that L is strictly lower triangular and U is strictly upper triangular. Let
B = Q p L + R p + Qp U (note that B is the iteration matrix for iterative block Jacobi) and from the assumptions we know that
jBj = jQpLj + jRpj + jQpU j: We now prove a series of Lemmas. Lemma 13 A~! = M~ ! ? N~! is a regular splitting and T~p;!O jTp;!O j. Proof - Since jQp Lj is strictly upper triangular (jQpLj) = 0 and therefore M~ ! = !1 I ? jQp Lj is an M-Matrix and hence inverse nonnegative. Clearly N~! 0. Thus M~ ! ? N~! is a regular splitting of A~! . It remains to show that T~p;!O > jTp;!O j. Consider
jM!?1j =
1 X ? 1 j(I ? !QpL) j = j (!QpL)ij i=0 1 X j(!QpL)ij = (I ? !jQpLj)?1 = M~ !?1 : i=0
Clearly jN! j = j(1 ? ! )I + !Rp + !Qp U j N~! thus
jTp;!O j = jM!? N! j jM!? jjN!j M~ !? N~! = T~p;!O : 2 Lemma 14 If (jBj) < 1 then (T~p;!O ) < 1 for 0 < ! < jBj . 1
1
1
2 1+ (
)
Proof - Remembering that any weak regular splitting of a nonsingular M-matrix is convergent to show the lemma we must show that A~! = I ? jB j is a non-singular M-matrix, or
= 1 ? j!1 ? !j > (jBj):
For 0 < ! 1 we have = 1 > (jB j) by hypotheses of lemma. For ! > 1 we have = 2?!! which implies > (jB j) for ! < 1+2(jB j) . 2 The critical corollary is as follows
Corollary 5 If R is derived from a consistent weak regular splitting of D, D? , L, and U 0, 1
then Iterative Block Jacobi convergent, i.e.,
(B) = (QpL + Rp + Qp U ) < 1 and (Tp;!O ) < 1 for 0 < ! < 1+2(B )
The corollary follows from the fact that, under the hypotheses of the corollary, we showed in previous sections that A = Q?p 1 ? (L + U + Q?p 1 Rp ) is a weak regular splitting.
30
CHAPTER 2. NESTED ITERATION THEORY
Chapter 3
Computational Aspects of Two-Level Iteration In this chapter we give algorithmic aspects for the solution of linear systems via the methods introduced in the previous chapter. The critical question raised in the previous chapter was how accurately should the inner iteration be solved? In this chapter we present various heuristic methods for answering that question. These methods can be broken down into two categories. The rst is to get control of the inner iteration by trying various numbers of inner iterations and deciding, somehow, on the best one to use. The second is to decide that the inner iteration needs to be solved to some accuracy relative to the outer iteration and then proceeding accordingly. For heuristic methods in which the inner iteration is controlled in relation to the outer iteration it is important to be able to compute the residual of both the inner and outer iteration. We show how the inner and the outer iterations can be computed using the residual. Because residual methods are not dependent on the input data at every iteration, the solution begins to lose accuracy due to roundo. We also include upper bounds on the error due to roundo as the iteration proceeds. The chapter is divided into four parts. In the rst part we present some background that will be useful in the derivation of the heuristic iterative methods. The second part is an examination of how the iteration should be calculated, that is, using residuals to compute the iteration. The third part is an analysis of how the solution varies from the actual solution due to roundo. The nal part is a series of heuristic methods.
3.1 Background Our adaptive methods are dependent on some theory developed in [49] which we summarize here. Consider the iterative method of the last chapter
xk+1 = Tpxk + Mp?1 b for the solution of the linear system of equations Ax = b, where we have expressed the iteration in terms of the induced splitting (see the unique splitting lemma, Lemma 8). That is, A = Mp ? Np and Tp = Mp?1 Np. Let em = xm ? x , where x is the exact solution of Ax = b, be the error vector of the iteration. Note that the iteration matrix, Tp, is the matrix that propagates ek to ek+1 , i.e., xk+1 ? x = Tpxk + Mp?1 b ? x ek+1 = Tpxk + Mp?1 (Ax ? Mp x ) = Tpxk + Mp?1 (?Npx ) 31
32
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION = Tp(xk ? x) = Tpek :
Thus,
em = Tpem?1 = = Tpme0 which upon taking norms yields
k em kk Tpm kk e k m 0: 0
This leads us to the following de nition. De nition 7 [49][Thm 3.2] Let T and B be two n n complex matrices. If for some positive integer m, k T m k< 1 then
R(T m) = ? log (k T m k) m1 (3.1) is the average rate of convergence for m iterations of the matrix T . If R(T m ) < R(B m ) then B is iteratively faster for m iterations than T . Computationally it is not possible to know R(T m ). We therefore introduce the notation, as in
Varga [49] section 3.2,
k em k m1 = ke k
(3.2)
0
which is the average reduction factor per iteration of the error. From the de nition we have
(k T m k) m1 = b?R(T m)
(3.3)
where b is the base of the log in (3.1). Finally Varga de nes Nm = (R(T m ))?1. This quantity is proportional to the number of iterations required to reduce the norm of the error by a factor of b. This comes from Nm b, which is itself a consequence of (3.3). The key theorem is Theorem 13 (Varga) Let T be a convergent n n complex matrix. For all m suciently large, the average rate of convergence for m iterations R(T m ) satis es mlim !1 R(T
m ) = ? log (T ) R
1 (T )
(3.4)
Unfortunately, while is the measure that we need, it is no more computable, in general, than R(T m). We now derive an estimate for , based on residuals. Let rk b ? Axk be the residual at the kth step of the computation. Note that rk = Ax ? Axk = ?Aek . Thus k ek k kkrAkkk , and k ek kk A?1 kk rk k. This yields the bound on , ^, given by
k r k m1 1 k rm k m1 k em k m1 (3.5) ^ = (A) k r k k e k = (A) k rm k where (A) is the condition number of A, i.e., k A kk A? k. Since (A) is a constant, if enough iterations are done, i.e., m large enough, ^ ! . Thus k r k m1 log ( ( A )) (3.6) ? log ^ = m ? log k rm k ; 0
0
0
1
0
3.2. THE ITERATION
33
We will estimate this number by assuming that log(m(A)) is zero. This estimate is a computable and, possibly, good estimate for the rate of convergence of the iterative method. Throughout the rest of this chapter we refer to this estimate as , given explicitly by
k rm k m1 = k r k ; 0
(3.7)
It is, obviously, critical that we have a cheap method for computing the residual. In the following sections we give algorithms for computing the residual with little extra cost. While these algorithms seem to be well known, they do not seem to be documented anywhere, so we include them here for completeness.
3.2 The Iteration The basic algorithm we shall be examining in this chapter is:
Algorithm 5 (Iterative Block Gauss-Seidel) for k = 0; 1; : : :
y (0) = y = (y1; y2 ; : : :; yq )T = xk for i = 1 to q for j = 0 to pik ? 1 Fi yi(j+1) = (b + Uxk + Lxk+1 )i + Giyi(j) (xk+1 )i = yi(pik )
where A = D ? L ? U and Di = Fi ? Gi . We will of course allow relaxation parameters on both the inner and outer iterations. The cost for one iteration (k ? 1 to k) of this method is
NZU + NZL +
q X i=1
pik NZDi
(3.8)
where NZU , NZL, and NZDi are the number of non-zeros in U , L and the diagonal blocks Di respectively.
3.3 The Residual Algorithm In this section we modify the IBGS algorithm so that with little extra computation we get an expression for the residual. We begin by considering the simpler iteration Mxk = b + Nxk?1 . Recall that the residual at the kth step is
rk = b ? Axk = b ? Mxk + Nxk ; substituting Mxk = b + Nxk?1 (the iteration) we may rewrite the residual as rk = N (xk ? xk?1 ):
(3.9) (3.10)
34
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION
De ne zk = xk ? xk?1 . Note that instead of substituting for Mxk in (3.9) we could have substituted for b + Nxk yielding Mzk+1 = rk : (3.11) Now suppose we knew rk and xk then we could generate zk+1 , and hence xk+1 , by solving equation (3.11). We could then generate rk+1 by computing Nzk+1 , as in equation (3.10). We have therefore derived an algorithm for propagating the residual and the solution from step k to step k + 1. If we set x0 = 0, which implies r0 = b we have a complete algorithm.
Algorithm 6 (Residual Algorithm for Mxk = b + Nxk ) +1
x0 = 0, r0 = b for k = 0; 1; : : : Solve Mzk+1 = rk xk+1 = xk + zk+1 rk+1 = Nzk+1
(3.12) (3.13) (3.14)
While not immediately pertinent to the IBGS algorithm, it is convenient, at this juncture, to show how to modify this algorithm to take care of the SOR iteration. In the SOR iteration the system !Ax = !b is solved by the iteration (D ? !L)xk+1 = !b + (1 ? ! )Dxk + !Uxk (3.15) where A = D ? L ? U . It is clear that we could set M = D ? !L, N = (1 ? ! )D + !U and r0 = !b and use algorithm 6. Unfortunately this would require computing Dzk at each iteration k and this could be an expensive computation. The way to avoid this computation is very simple. The algorithm proceeds as before by rst solving Mzk+1 = rk . In this case M = D ? !L so we set vk+1 = !rk + !Lzk+1 (3.16) for reasons that will become appearent mommentarily. Now we solve Dzk+1 = vk+1 : (3.17) xk+1 is updated in the same way as before, i.e., using equation (3.13). We now need to update the residual by computing Nzk+1 . But notice that vk+1 = Dzk+1 so we avoid this computation by setting !rk+1 = (1 ? !)vk+1 + !Uzk+1 : (3.18) The beauty of all this is that the vector v is only a device to show how the computation proceeds-in fact the values may be directly written on r. We have written all of the above in terms of !r. Obviously the user should not go around dividing everything by ! to get the exact r; instead the user should just consider the vector as !r. We give the algorithm in full in Algorithm 7. Note that in this algorithm we assume that D is block diagonal and L is strictly lower triangular (obviously in general U is strictly upper triangular but that is not necessary). In the algorithm by Li , Di and Ui we mean the ith row of L, D and U respectively, and by vki we mean the ith subvector of vector vk . Note that in the algorithm while we have written rk+1 and rk they are actually stored in identical memory locations.
3.3. THE RESIDUAL ALGORITHM
35
Algorithm 7 (SOR Residual Algortihm) x0 = 0, !r0 = !b for k = 0; 1; : : : for i = 1; 2; : : :; number of blocks !rki = !rki + !Li zk+1 Solve Di zki +1 = !rki xik+1 = xik + zki +1 !rki +1 = (1 ? !)!rki + !Uizk We now return to the problem of IBGS. The problem is that instead of solving Mxk+1 = b + Nxk exactly, we solve it only inexactly. In particular if we de ne the residual of the inner iteration as sk then sk+1 = b + Nxk ? Mxk+1 . In rederiving equation (3.10) we get
rk = sk + Nzk :
(3.19)
rk ? sk+1 = Mzk+1 :
(3.20)
Similarly for equation (3.11) we get So if we have the residual of the inner iteration we can get the correct residual for the outer iteration by simply adding the residual of the inner iteration to the residual computed in Algorithm 6 (see Algorithm 8). Notice that it is unimportant how we compute the inner iteration, any inner iteration which yields the inner residual, even conjugate gradient type algorithms, can be used to solve equation (3.21).
Algorithm 8 (Residual Algorithm for IBGS) x0 = 0, r0 = b for k = 0; 1; : : : Solve Mzk+1 = rk by iteration Let sk+1 = rk ? Mzk+1 xk+1 = xk + zk+1 rk+1 = sk+1 + Nzk+1
(3.21) (3.22) (3.23) (3.24)
In Algorithm 8 if equation (3.21) is solved using the residual algorithm (Algorithm 6) then sk+1 is given with no extra computation. We note that starting with the inner residual equal to rk (and therefore the initial guess for zk+1 = 0) is exactly the same as is done in IBGS, since the initial guess for xk+1 in IBGS is xk . The nal part of this section is to show how to do the residual algorithm for SOR when the inner iteration is inexact. Because of the inexact inner iteration sk+1 = vk+1 ? Dzk+1 , the correct residual update should be
!rk+1 = (1 ? !)(vk+1 ? sk+1 ) + !Uzk+1
(3.25)
36
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION
We summarize the results of this section by giving the complete algorithm for residual SOR within SOR. Let A = D ? L ? U and D = E ? F ? G. Let the inner relaxation factor be !I and the outer relaxation factor be !O . The complete algorithm is given in Algorithm 9.
Algorithm 9 (Residual Algorithm for SOR within SOR) x0 = 0, !O r0 = !O b for k = 0; 1; : : : vk+1 = !O rk + !O Lzk+1 !I s0 = !I vk+1 , z = 0 for i = 0; 1; : : :; p ? 1 solve ui+1 = Ey = !I si + !I F z = z+y !I si+1 = (1 ? !I )ui+1 + !I Gy xk+1 = xk + zk+1 !O rk+1 = (1 ? !O )(vk+1 ? sk+1 ) + !O Uzk+1 We now analyze how computing with the residual causes propagation of the error and bound that error.
3.4 When to Update the Actual Residual In the previous section we gave an algorithm for determining the residual at the (k +1)th step from the residual at the kth step. This method gave us an extra way of controlling the iteration at little extra cost. The problem with the method is that after initially setting the residual to the right hand side of the problem, b, b disappears from the computation. Since in real computations this is likely to cause a loss of accuracy as the computation proceeds a natural question to ask is how often should we recompute the actual residual. Since computing the actual residual is an expensive and useless (as far as advancing the solution goes) operation, it is one that we would like to avoid. In this section we shall give bounds on the size of the error being propagated. The analysis of this section follows the analysis in [52]. q De neqthe spectral norm of A as k A k= (AAT ). De ne the Frobenius norm of A as k A kF = Pi Pj a2ij . Recall that the iteration proceeds by rst solving Mzk+1 = rk and then computing rk+1 = Nzk+1 . Let esk+1 be the error in the solve step which we de ne as
esk+1 = Mzk+1 ? rk :
Let emk+1 be the error in the multiply step which we de ne as
emk+1 = rk+1 ? Nzk+1 : We may now de ne the error in propagating from rk to rk+1 as Ek+1 = rk+1 ? NM ?1rk
3.4. WHEN TO UPDATE THE ACTUAL RESIDUAL
37
that is, the computed residual minus the residual that would have been computed in exact arithmetic. Let T^ = NM ?1 . If we assume there was some error in previously computed residuals then expanding we have ^ k Ek+1 = rk+1 ? T^2 rk?1 ? TE = rk+1 ? T^k+1 b ?
k X ^i i=1
T Ek?i+1 :
Letting rk+1 = T^ k+1 b = T^k+1 r0 , the residual that would have been computed had exact arithmetic been done throughout the computation, we derive
rk+1 = rk+1 +
k X ^i i=0
T Ek?i+1 ;
(3.26)
Of course as k ! 1, k rk k goes to zero. We have shown that the computed residual varies from the residual that would have been computed in exact arithmetic. A consequence of rk 6= rk is that at the next step of the computation the problem being solved is not Ax = b. To derive the actual problem being solved as the algorithm progresses, assume that the residual at step k is correct, i.e. rk = b ? Axk . From equation (3.26) the computed residual at the next step is rk+1 = b ? Axk+1 + Ek+1 . We may therefore think of rk+1 as the residual in the solution of the new system Ax = b + Ek+1 . If we now view rk+1 as the correct residual at step k + 1 the residual at step k + 2 may be viewed as the residual in the solution of the system Ax = b + Ek+1 + Ek+2 . Thus if we consider k steps of the residual algorithm the residual computed may be viewed as the residual of the solution of
Axk = b +
k X i=1
Ei
(3.27)
In other words when we decide that the algorithm has converged, it has actually found the solution of the system of equations given by (3.27). The remainder of this section is devoted to determining how far the actual system solved varies from the system we want to solve. Consider the error at step k + 1 ^k Ek+1 = rk+1 ? Tr Using our de nitions from above we have ^k Ek+1 = Nzk+1 + emk+1 ? Tr ^k = NM ?1 esk+1 + NM ?1 rk + emk+1 ? Tr ^ sk+1 + emk+1 : = Te (3.28) Thus our task remains to bound esk+1 and emk+1 . Our analysis is derived from [52]. De ne jAj as jAj = jaij j. We begin with the solve step. Consider the solution of the equation Mz = r where M is lower triangular. The solution proceeds at each step by computing the value of zi using z1 ; z2; : : :; zn . Therefore
zi = fl ri ? mi;1 z1 ? mi;2mz2 ? : : : ? mi;i?1 zi?1 i;i
!
(3.29)
38
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION
where by fl(x) we mean the oating point truncated computation of x (see [51]). Also in [51] is the result of an inner product which yields
miizi (1 2?t ) + mi;1z1 (1 3 2?t ) + mi;2 z2 (1 4 2?t ) + : : : + mi;i?2 zi?2 (1 i2?t ) + mi;i?1 zi?1 (1 i2?t ) = ri((1 2?t ) (3.30) where each is a dierent number, jj < 1, and t is the number of bits of accuracy of the machine.
Hence the computed solution satis es:
Mz + Mz = r + r For example for n = 5 we have
2 jr j 3 66 jr j 77 ? t jrj 2 666 jr j 777 2?tr 4 jr j 5 1
2
3
4
0
and
2 jm 66 2jm jM j < 2?t 666 3jm 4 3jm 3jm
11
j j j j j
21
31
41
55
jm 3jm 4jm 4jm
22
j j jm j j 4jm j jm j j 5jm j 5jm j jm j
32
33
42
43
55
55
44
55
3 77 77 = 2?tV 75
55
With this in mind we may rewrite our problem as
jMzk ? rk j < 2?t [jrkj + V jzk j] : +1
+1
Taking norms we have
k Mzk ? rk kk jMzk ? rk j k 2?t [k jrkj k + k V kk jzk j k] 2?t [k rk kF + k V kF k zk kF ] +1
+1
+1
+1
(3.31) (3.32)
where we have used the identities
k jvj k=k v kF k A kk A kF Finally if we assume that M has been normalized then an upper bound for k V k can be obtained.
k V k k V kF
< [n32 + (n ? 1)42 + : : : + 2(n + 1)2 + (n + 2)2] 21 where all multiples in the ith column have been replaced by (n + 3 ? i). We conclude that k V k< 0:4n2. Thus a bound for equation (3.32) is
i h k esk k=k Mzk ? rk k 2?t k rk kF +0:4n k zk kF +1
+1
2
+1
(3.33)
3.5. OPTIMAL P
39
To complete our bound for Ek+1 we need to bound emk+1 . We need the error in a matrix vector multiply which is again analyzed in [52]. We want to analyze c = fl(Nz ) where again fl(x) is the
oating point truncated value of x. We may write ci = ni1 (1 + 1 )b1 + ni2 (1 + 2 )b2 + : : : + nin (1 + n )bn with jj < 2?t . Thus, for a matrix of order 4 we have 3 2 4jn11j 4jn12j 3jn13j 2jn14j 7 6 jc ? Nzj < 2?t 664 44jjnn2131jj 44jjnn2232jj 33jjnn2333jj 22jjnn2434jj 775 jzj: 4jn41j 4jn42j 3jn43j 2jn44j Hence jrk+1 ? Nzk+1j n2?tjN jjzk+1j: Taking norms yields k emk+1 k=k rk+1 ? Nzk+1 k k jrk+1 ? Nzk+1j k n2?t k jN j kk zk+1 k 3 = n 2 2?t k N kk zk+1 kF where we have used the identity k jAj k n 21 k A k. Putting all this together we have from equation (3.28) ^ sk+1 + emk+1 k k Ek+1 k = k Te i h k T^ k 2?t k rk kF +0:4n2 k zk+1 kF + n 32 2?t k N kk zk+1 kF
(3.34) (3.35)
If k N k= O(n 21 ) then we can bound the error by i h k Ek+1 k k T^ k 2?t k rk kF +O(n2) k zk+1 kF
(3.36)
Everything in this equation is computable, so we have a gross upper bound on how far the solution is varying from the actual solution. The only problem with equation (3.36) is that the residual is related to the actual error by the condition number of the matrix, and thus may be very large with respect to the error. This might lead to erroneous conclusions from the above formula. The above formula while giving an upper bound on the error is a gross over- estimate. In general the error is not so large. In the problems we have been considering the error is not large enough to be signi cant. As a practical measure we suggest that when the program declares convergence the actual residual should be computed just to determine that no large error has occured. The remainder of this chapter is a collection of algorithms we have used to approximate the optimal number of inner iterations.
3.5 Optimal p In this section we show that there is an optimal number of inner iterations. De ne Work as the number of iterations required to reduce the error by an order of magnitude times the cost per iteration. The cost per iteration is given by C (p) = Cost(outer) + Cost(inner) p = NZL + NZU + p NZD
40
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION
where NZL , NZU , and NZD are the number of nonzeros in L, U and D respectively (see equation (3.8)). Since the average convergence rate is inversely proportional to the number of iterations required to reduce the error by an order of magnitude the work is given by ( 1 W (p) = C (p) ?1log (Tp) ififTTp isisconvergent (3.37) p divergent (see Thoerem 13). Assume that the spectral radii of the inner and outer iterations are nontrivial and that the inner and outer iterations are convergent. Also assume that there is some number of inner iterations, k, such that the overall iteration is convergent Recall that limp!1 Tp = S where S is the outer iteration matrix (Remember taking p to in nity is equivalent to solving the inner iteration exactly.). W (k) is a nite number, since under the assumptions 0 < (Tk ) < 1, further limp!1 W (p) = 1, since 0 < (S ) < 1. Thus there is some nonempty, nite set of p's that minimize the work. In the case that the spectral radius of the outer iteration is zero, we point out that if the inner is solved exactly the outer iteration takes one step to solve the system exactly. Under these conditions the inner iteration controls convergence, so that the number of inner iterations may grow without bound without reaching a work minimum.
3.6 Simple Algorithm In the simple algorithm we assume that the number of inner iterations should be uniform from block to block. We only allow the number of inner iterations to change from outer iteration to outer iteration. To control the number of inner iteration we use an estimate for the amount of work, where work was de ned in equation (3.37). The number of iterations required to reduce the norm of the error by an order of magnitude is inversely proportional to the rate of convergence, for which, as we have shown, an estimate is ? log (equation (3.7)). Substituting the estimate into equation (3.37) gives the estimate for the work U + NZL ) : Wp = C (p NZD?+logNZ (3.38) The algorithm proceeds by increasing the number of inner iterations until some decrease in the work is detected, i.e., Wp > Wp?k2 . The algorithm cannot remain stationary on the best number of inner iterations detected, since is only an estimate. We, therefore force the algorithm to arti cially increase or decrease the number of inner iterations when it becomes stationary. The algorithm has two parameters which the user can set one,k1, gives the number of iterations before work should be computed and the other, k2, gives the increase in number of inner iterations.
3.7 Residual Algorithm The residual control algorithm is based on equation (3.24). One of the contributors to the overall residual is the inner residual directly. If we make the inner residual small then we reduce the contribution of the inner iteration. In this algorithm we arbitrarily reduce the norm of the inner residual by some factor newp then p = p + k2 , last = max(p; last) else if newp > k2 then p = newp ? k2 else p = newp
This algorithm has the signi cant advantage over the Simple p Algorithm in that the number of iterations can vary from block to block. Thus when some block is converging very quickly the algorithm will stop computing on it and go on to the next block.
3.8 Adaptive Residual Algorithm The problem with the residual algorithm is that it does not dynamically choose , the degree to which the inner iteration should be solved. In the next algorithm we take the adaptive ideas of Algorithm 10 and combine them with the residual algorithm to de ne an algorithm which adaptively chooses the degree of accuracy to which to solve the inner iteration. In the adaptive residual algorithm we again use as an estimate for , but in this case it is not immediately apparent what the amount of work is. That is, equation (3.8) no longer represents the cost per iteration. We can very simply deal with that by keeping track of the actual amount of work that was done between the time the last update to was done and the current update. The cost per iteration we can then estimate as the total work done over a period of, say, k2 iterations divided by k2 .
3.9 Estimate Spectral Radius of Tp The nal algorithm we consider attempts to estimate the spectral radius of the iteration matrix for dierent numbers of inner iterations. The problem with the adaptive algorithms that we have
42
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION
Algorithm 11 (Residual Algorithm) for k = 0; 1; : : : r = b ? Axk
y(0) = y = (y1 ; y2; : : :; yq )T = xk for i = 1 to q b0i = (b + N1xk + N2xk+1 )i Let j = 0 s = b0i ? Di yi(0) while j < p and k s k> krqk Fi yi(j+1) = b0i + Gi yi(j) s = b0i ? Di yi(j+1) ,j = j + 1 (xk+1 )i = yi(pik )
presented in the previous three sections is that they scale up to the optimal number of inner iterations very slowly. What we would like to have is some way of getting close to the optimal number of inner iterations fast, and then get closer using the adaptive algorithms we have presented. To do this we will need to be able to get estimates quickly. We can get a reasonable estimate for the spectral radius of the inner iteration by doing one step of the iteration with a large number of inner iterations. We can also get an estimate for the spectral radius of T1 (i.e., IBGS with one inner iteration/ outer iteration) by doing, say, 20 iterations. Note that if the algorithm converges before these estimates are obtained then, probably, it was a waste of time to be searching for the optimal p. With estimates only for k R k and k T1 k it is dicult to derive good estimates for k Tp k. Recall the iteration matrix for the IBGS algorithm
Tp = (Q?p 1 ? L)?1 (Q?p 1 Rp + U ) Qp = (I ? Rp )D?1 : We derive an estimator as follows. Consider
I ? Tp = (Q?p 1 ? L)?1 (Q?p 1 ? L ? Q?p 1 Rp ? U ) = (Q?p 1 ? L)?1 (Q?p 1 (I ? Rp) ? L ? U ) = (Q?p 1 ? L)?1 (D ? L ? U ) = (Q?p 1 ? L)?1 A = (Q?p 1 ? L)?1 (Q?1 1 ? L)(I ? T1) = (I ? Qp L)?1Qp Q?1 1 (I ? Q1 L)(I ? T1) = (I ? Qp L)?1(I ? Rp )(I ? R)?1 (I ? Q1L)(I ? T1) Taking norms we have
k I ? Tp k k (I ? QpL)? kk (I ? Rp) kk (I ? R)? kk (I ? Q L) kk (I ? T ) k : 1
1
1
1
We assume that L is strictly block lower triangular and D is block diagonal, thus Qp L is strictly block lower triangular and therefore (I ? Qp L) = 1. While in general k B k may be much larger
3.9. ESTIMATE SPECTRAL RADIUS OF TP
43
Algorithm 12 (Adaptive Residual Algorithm) = 0
Do a few steps of Algorithm 5 Do k1 steps of Algorithm 5 Compute work and save as W = =2 Do until convergence Do k1 steps of Algorithm 5 Compute work and save as W Find minimum W If = = =2 else = +2 If j ? j < 0:1 then If next smallest W has > then = =2 else if < 0:5 then = 2 else =
than (B ) there is a norm for which they are arbitrarily close. Of course the norm we use will not be that norm but we assume that they will be close. This assumption yields
k I ? Tp k k I ? Rp kk (I ? R)? kk I ? T k 1
1
(3.39)
where by we mean that it is approximately less or equal although we have no guarantee. The following identities may be found for instance in [52]. If k T k< 1 then k I + T k 1? k T k k (I + T )?1 k 1? k1 T k :
Applying these identities to equation (3.39) we have
I ? T1 k 1? k Tp k k I ? Rp k k1? kRk
Redistributing
k Tp k 1? k I ? Rp k k1?I ?k TR kk : 1
Unfortunately we do not have estimates for k I ? Rp k or k I ? T1 k so we approximate k I ? T1 k by 1? k T1 k and k I ? Rp k by 1? k R kp yielding (3.40) k Tp k p = 1 ? (1? k R kp) 11?? kk TR1 kk ;
44
CHAPTER 3. COMPUTATIONAL ASPECTS OF TWO-LEVEL ITERATION
where by we mean approximately equal. The rst thing to notice about this estimator is that if T1 R, as is true in the theorems of the previous chapter, we can fairly assume that k T1 k>k R k (intuitively this says that the outer iteration with one inner per outer is converging slower than the inner iteration, something which in general must be true). Therefore our estimator is always less than one, a necessary condition for the estimator to have any meaning. The second thing to notice is that this could be a very good estimator. When p = 1, the estimator reduces to k T1 k. When p ! 1 the estimator reduces to k T1 k 1 ? 11?? kk TR1 kk : (3.41) The reason this is interesting is evident if one considers iterative block Jacobi. In IBJ L = 0 and U need not be strictly upper triangular. The iteration matrix reduces to Tp = I ? (I ? Rp )(I ? S ) where S is the outer iteration matrix (i.e., solving the inner exactly). Notice that T1 = I ? (I ? R)(I ? S ), plugging into equation (3.41) we get I ? S) k 1 ? 1? k I ?1(?I ?k RR)(k I ? S ) k 1 ? k (I ?1?Rk)(R k (I ? S ) k 1 ? k (I ? 1R?) kk kRk 1? k (I ? S ) kk S k Thus if the estimations are at all tight, then as p goes to in nity the estimator goes to the norm of the outer iteration matrix as we would hope. To use this estimator we enter a loop which computes the minimum value for various p of equation (3.38), where instead of ? log we use ? log p the estimator in equation (3.40). The algorithm may be given by:
Algorithm 13 (Complex p Selection) Do a few steps of Algorithm 5 with p = 1 Estimate 1 =k T1 k Do inner iterations until an estimate for =k R k is stabilized U +NZL p = 2, W1 = NZD +?NZ log1 NZU +NZL Compute W2 = 2NZD?+log 2 While Wp < Wp?1 p = p+1 NZU +NZL Compute Wp = pNZD?+log p p= p?1 Do Simple p selection algorithm with p as a base instead of 1
Chapter 4
Sequential Experimental Results In this chapter we present experimental results of nested iteration obtained in a sequential environment. The chapter is divided into three parts. In the rst part we give a motivation for the problem. We will show the kind of matrices for which nested iterations should be ecient. We then proceed to the semiconductor device simulation problem which is the main practical problem that we consider. In the second part we will consider the optimal number of inner iterations. We will consider two methods for obtaining the optimal number of inner iterations. The rst, method does the same number of inner iterations on each block. The second solves the inner iteration to some percentage of the outer iteration. The nal section deals with the approximation algorithms derived in the previous chapter. We will give results of these algorithms that show that some of these algorithms are not very successful and show how they may be improved.
4.1 Motivation We can nd matrices for which convergence does not occur until the number of inner iterations gets high enough. We can also nd matrices for which increasing the number of inner iterations provides better convergence. In this section we give an example of the kind of matrices for which increasing the number of inner iterations decreases work. Consider the contrived matrix
2 10:00 66 0 66 0 66 ?4:94 6 0 A = 66 ?0:03 66 ?0:03 66 ?0:03 64 ?0:03
0 10:00 0 0 ?4:94 ?0:03 ?0:03 ?0:03 ?0:03 ?0:03 ?0:03
0 0 10:00 0 0 ?4:96 ?0:03 ?0:03 ?0:03 ?0:03
?4:94 0 0 10:00 0 ?0:03 ?4:96 ?0:03 ?0:03 ?0:03
0 ?4:94 0 0 10:00 ?0:03 ?0:03 ?4:96 ?0:03 ?0:03
?0:03 ?0:03 ?4:96 ?0:03 ?0:03 10:00 0 0 ?4:94 0
45
?0:03 ?0:03 ?0:03 ?4:96 ?0:03 0 10:00 0 0 ?4:94
?0:03 ?0:03 ?0:03 ?0:03 ?4:96 0 0 10:00 0 0
?0:03 ?0:03 ?0:03 ?0:03 ?0:03 ?4:94 0 0 10:00 0
?0:03 ?0:03 ?0:03 ?0:03 ?0:03 0 ?4:94 0 0 10:00
3 77 77 77 77 77 77 77 5
46
CHAPTER 4. SEQUENTIAL EXPERIMENTAL RESULTS
with the splitting de ned by 2 10:00 0 3 0 ?4:94 0 0 0 0 0 0 66 0 10:00 0 0 ?4:94 0 0 0 0 0 77 66 0 0 10:00 0 0 ?4:94 0 0 0 0 77 66 ?4:94 0 0 10:00 0 0 ?4:94 0 0 0 77 6 0 ? 4:94 0 0 10:00 0 0 ? 4:94 0 0 77 M = 66 0 0 ? 4:94 0 0 10:00 0 0 ? 4:94 0 77 66 0 77 0 0 ? 4:94 0 0 10:00 0 0 ? 4:94 66 0 77 0 0 0 ? 4:94 0 0 10:00 0 0 64 0 0 0 0 0 ?4:94 0 0 10:00 0 5 0 0 0 0 0 0 ?4:94 0 0 10:00 and N = M ? A. The inner iteration method we use will be Jacobi. Recall that the iteration matrix for nested iteration is
Tp = (Q?p 1 ? L)?1 (U + Q?P 1 Rp ) Qp = (I ? Rp )D?1 here U = N , D = M and L = 0 thus
Tp = Qp N + Rp = (I ? Rp)M ?1 N + Rp = I ? (I ? Rp)(I ? S ) S = M ?1 N where S is the outer iteration matrix, and R is the inner iteration matrix. The spectral radius of S , (S ), is 0:048 and (R) = 0:80. The spectral radii for the rst ten nested iteration matrices are (T1) (T2) (T3) (T4) (T5) (T6) (T7) (T8) (T9) (T10) . 0:804 0:647 0:522 0:422 0:343 0:279 0:229 0:189 0:157 0:133 Thus the spectral radius of the overall matrix is dominated by the spectral radius of the inner iteration. We also note that the cost of doing an outer iteration is very large ( fty multiplications) compared to the cost of a single inner iteration. De ne cost as the number of multiplications. The following graph shows cost as the number of inner iterations increases. Note that the numbers next to the 3's in the graph are the number of outer iterations to convergence (obviously decreasing as the number of inner iterations increases).
4.1. MOTIVATION
47 10x10 matrix
2600
2400 p p
p p
p
p p
p p p p
p p
p
p p
p
36
330
2200
p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p
2000 Cost 1800
1600
p p
p
p p p p
p p
p
p p
p
p p p
p
36
p p
p p
p
p p
p p
p p
p p
p
p p
p p
p
p p p
p
36
p p
p p
p
p p
p p
p
p p p
p
p p
p p
p
p p
p p
36
316 p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p p p p p p p p p p p p
p
p
p
p
p
p
p
p
p
p
p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p
73
123 310 p
p p p
p p p
p p p
p p p
p p p
p p p p
p p
p
p p
p p
p
p p p
p
p p
p p
p
p p
p p
36
73
1400
1200
p p
p p
p p
p
p p
p p
p
p p
36
83 37 p
p
p
p
p p p p p p p p p p p p
p p p
p p p
6 8 10 12 14 Number of Inners Note that eventually the inner iteration is solved virtually exactly. More inner iterations does not decrease the number of outer iterations, thus the costs increases linearly with the equation 0
2
p
4
C (p) = Cost(outer) + Cost(inner) p: In summary we are looking for matrices where the inner iteration controls the overall convergence. In addition the computation of the outer iteration should also be expensive. One place where we would expect to nd matrices of this type is in coupled systems of partial dierential equations. The diagonal blocks of the system would be particular dierential equations, the solutions of which would yield an expensive inner iteration. The couplings would re ect the outer iteration. If the couplings involved derivatives then the o diagonal blocks would have a large number of nonzeros, and would therefore be expensive computations in the outer iteration. In the next section we will give a problem of this sort, plus another problem for which the nested iteration
48
CHAPTER 4. SEQUENTIAL EXPERIMENTAL RESULTS
results later in this chapter proved useful.
4.2 Problems These nested techniques have a wide range of uses. We will concentrate on examples from two areas, the rst is semiconductor device simulation and the second circuit simulation. While semiconductor device simulation would seem to be the more appropriate area for these methods, we did obtain some good results for circuit simulation.
4.2.1 Semicondutor Device Simulation
The following derivation is from [3]. The simulation problem is described by a system of dierential equations given by
g1 (u; n; p) = ?4u + n ? p ? k1 = 0 g2 (u; n; p) = ?r jn + k2 = 0 g3 (u; n; p) = ?r jp + k2 = 0 where g2 and g3 are the continuity equations and u, n, and p represent the potential eld, number of electrons, and number of holes respectively. g2 and g3 and are further speci ed by giving the current densities in the drift-diusion form:
jn = ?n nru + Dnrn jp = ?p pru + Dp rp where i and Di are the mobility and diusivity coecients respectively. Each of these may be eld dependent (i.e., dependent on u). The doping pro le k1 is usually only space dependent, and the recombination term k2 is usually of the form k2 = k2(u; n; p). When the Einstein relation is valid, it takes the form i = Di and we can write
n = eu?v p = ew?u : With this change of variables we may rewrite the system of equations as
g1 (u; v; w) = ?4u + eu?v ? ew?u ? k1 = 0; g2(u; n; p) = rn eu?v rv + k2 = 0; and
g3(u; n; p) = ?rpew?u rw + k2 = 0: The solution of these equations may be accomplished in one of two ways. In the rst, the so called Gummel or Plug-in method, the dierential equations are solved in turn with the other unknown components held constant. One may write this iteration as in Algorithm 14.
4.2. PROBLEMS
49
Algorithm 14 (Gummel iteration) k=0
Do until convergence Solve g1(uk+1 ; vk ; wk ) = 0 for uk+1 Solve g2(uk+1 ; vk+1 ; wk ) = 0 for vk+1 Solve g3(uk+1 ; vk+1 ; wk+1) = 0 for wk+1
(4.1) (4.2) (4.3) (4.4)
A second method for nding the solution of these equations is by applying Newton's method to the whole system. That is, instead of discretizing rst and then taking the Jacobian, one rst computes the symbolic Jacobian and then discretizes that. i.e., one computes the symbolic matrix
2 J = 64
@g1 @u @g 2 @u @g 3 @u
@g1 @v @g 2 @v @g 3 @v
@g1 @w @g 2 @w @g 3 @w
3 75
Explicitly we may write this as
3 2 u?v w?u ?4 +(eu?v + ew?u ) ? e ? e 75 ?rn eu r(?e?v ) 0 g0 = 64 r(n eu?v rv) r(p ew?u rw) 0 ?rpe?u r(ew )
where * is a placeholder function.
Algorithm 15 (Newton Method) k=0
Do until convergence Solve gk0 x = ?gk Compute zk+1 = zk + tk x
The Newton iteration is described in Algorithm 15 where gk = (g1; g2; g3)T and x = (@u; @v; @w)T corresponding to z = (u; v; w)T . In the matrix vector product the in each matrix indicates where to put the appropriate component of x. We do not present the details of computing the discretization of the above system as it exceeds the scope of this brief summary. For further details the reader is once again referred to [3]. We examined the Jacobians from two dierent problems. The nested iterative method used was block SOR with point SOR solving the diagonal blocks.
4.2.2 Circuit Simulation
The circuit simulation problem did not seem, initially, to be a good problem for nested iteration. There is no inherent blocking of the system due to, say, coupled equations. While, algebraically, there is no direct way to break down the circuit problem, in practice engineers build circuits using
subcircuits which are connected together to build the whole circuit. These subcircuits can be viewed as circuits unto themselves, and the nested solver would then solve these subsystems in succession. The details of this problem may be found, for instance, in [18]; we merely summarize the problem here. The circuit problem may be expressed as a nonlinear differential algebraic system of the form

d/dt q(u(t), z(t)) + f(u(t), z(t)) = h(u(t), z(t)) = 0
u(t_0) = u_0                                             (4.5)
where q, f, u ∈ ℝ^N, z ∈ ℝ^M, t ∈ [t_0, t_r], N is the number of unknowns and M is the number of knowns (i.e., internal voltages and external (applied) voltages respectively). As before, we solve this system with a Newton iteration. The equation to be solved at each iteration is

h'(u)x = d[Qx]/dt + Fx                                   (4.6)
where Q, F are the Jacobians of q and f respectively. Thus at each stage of the Newton method the following equation must be solved:

h'(u)x = d[Qx]/dt + Fx = -(dq(u)/dt + f(u)) = -r         (4.7)
x(0) = x_0.                                              (4.8)
It is from this equation that the matrices from the circuit simulation have been derived.
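As a rough sketch only (the text does not specify the time discretization, so the backward-Euler replacement of d[Qx]/dt by Qx/Δt below, and every name in the snippet, are assumptions made for illustration): with a time step dt each Newton stage amounts to a linear solve with the matrix Q/dt + F, and these are the kinds of matrices handed to the nested solver.

import numpy as np

def circuit_newton_step(Q, F, r, dt):
    # Assuming backward Euler, d[Qx]/dt is approximated by (Q x) / dt, so the
    # Newton correction x solves (Q/dt + F) x = -r.  A dense solve stands in
    # for the nested iterative solver used in the experiments.
    return np.linalg.solve(Q / dt + F, -r)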
4.3 Optimal Number of Inner Iterations

In this section we show the optimal number of inner iterations with respect to the cost for convergence. We shall consider three sets of matrices, two from the semiconductor device simulation problem and one from a 26-bit square root circuit. We will also examine the convergence of the iteration with respect to the inner and outer relaxation factors. For all these systems the code was modified to output the discretized Jacobian of the system. The first problem was from a 0.5 micron MOSFET with gate above threshold; a more thorough description of this problem may be found in [19]. The number of grid points for this problem was 2085, making the total number of unknowns 6255. The second semiconductor problem was from a Hall Sensor with no magnetic field. Further information on this can be found in [8, 1]. The grid size for this system had 1173 points, making a total number of unknowns of 3519. Finally, the circuit simulation was for a circuit that takes square roots of 26-bit integers. There were six subcircuits of the square root circuit, thus six blocks. We will also examine the effects of coalescing blocks, creating larger blocks. In these discussions we once again let ω_I and ω_O denote the inner and outer relaxation factors for SOR respectively. We begin by giving a graph of the number of inner iterations versus the amount of work for Gauss-Seidel within Gauss-Seidel (i.e., ω_I = ω_O = 1) for the first matrix from the MOSFET set.
[Figure: number of multiplications (in millions) versus the number of inner iterations per outer iteration, for the first MOSFET matrix with ω_O = ω_I = 1.]

Notice in this graph that the minimum is attained at 38 inner iterations per outer iteration, and the amount of work performed is approximately half of that performed for point Gauss-Seidel (one inner iteration per outer iteration). This gives us a real example in which nested iteration improves overall work. Also note that after 40 inners per outer the amount of work increases linearly. The reason for this is that the inner iteration is at that point essentially solved exactly, so that the number of outer iterations remains constant while the amount of work increases linearly in the number of inner iterations. In the next graph we see that, for this example, the number of outer iterations is decreasing monotonically with the number of inner iterations.
ω_O \ ω_I        1.0    1.1    1.2    1.3
1.0   mults      64.3   54.3   47.7   41.7
      inners     38     32     28     25
1.1   mults      53.6   46.1   40.8   35.5
      inners     38     34     30     26
1.2   mults      46.2   39.5   35.7   39.8
      inners     39     35     30     25
1.3   mults      41.6   42.3   40.8   38.3
      inners     41     34     30     27

Table 4.1: Optimal numbers of inners and corresponding numbers of multiplications (in millions) for the MOSFET semiconductor.
[Figure: number of outer iterations versus the number of inner iterations per outer iteration, for the first MOSFET matrix with ω_O = ω_I = 1.]

Allowing relaxation on the inner and outer iterations does not cause a serious change in the appearance of the above graphs; rather, the number of inner iterations needed to achieve optimal work drops. This, of course, causes the total amount of work to go down. In the tables that follow we show how varying the inner and outer relaxation factors affects the amount of work required for optimal convergence (see Table 4.1). We may also determine the optimal number of inners by trying to solve the inner iteration to various levels of precision with respect to the outer residual (see Algorithm 2 from the previous chapter). The advantage of solving the inner iteration to some fraction of the outer residual is that the number of inner iterations per block need not be uniform. Improvement in work will be attained if there are blocks with widely differing convergence rates. The disadvantage is that there is a significant amount of extra computation required, since the norm of the residual must be computed at each inner iteration. While this may not sound like a significant contribution, recall that the matrices under consideration are very sparse.
                            ε = 0.05  0.15  0.25  0.35  0.45  0.55  0.65  0.75
ω_O = 1.0, ω_I = 1.0   30      66.9  61.5  61.9  61.8  63.2  63.2  63.9  65.3
                       45      64.7  58.1  57.8  57.3  60.1  62.0  63.8  65.9
                       60      74.5  66.9  65.2  65.0  65.3  66.4  68.6  69.9
ω_O = 1.0, ω_I = 1.1   30      52.7  48.7  48.6  49.3  51.2  52.8  53.4  54.5
                       45      59.9  50.2  49.7  50.0  50.0  53.2  54.4  56.6
                       60      67.6  58.2  56.5  55.8  56.5  56.3  59.0  59.9
ω_O = 1.0, ω_I = 1.2   30      47.1  42.8  42.5  42.3  43.0  44.6  44.3  45.2
                       45      60.9  51.1  49.0  48.5  49.4  46.3  47.5  48.6
                       60      63.8  56.9  53.7  47.4  47.6  46.8  46.6  47.4
ω_O = 1.1, ω_I = 1.2   30      39.3  36.1  35.9  37.0  38.2  39.1  41.5  41.0
                       45      42.6  39.5  39.7  39.1  40.8  42.2  43.4  43.3
                       60      43.7  47.4  42.6  42.9  42.3  43.0  42.6  41.8
ω_O = 1.2, ω_I = 1.1   30      41.4  39.3  39.3  40.5  41.7  42.2  43.1  43.2
                       45      37.5  34.1  34.9  38.8  39.2  42.6  43.1  45.2
                       60      42.7  39.0  37.5  39.1  41.5  43.0  45.3  47.2

Table 4.2: Numbers of multiplications (in millions) for solving the inner iteration to various levels, from the MOSFET semiconductor. ε is the percentage of the outer residual to which the inner is computed; 30, 45, 60 is the bound on the number of inner iterations allowed per block.
The diagonal blocks have less than ten nonzeros per row, so this extra computation adds about ten percent to each inner iteration. Thus, without widely differing convergence rates there will be no savings in work. For the semiconductor device simulation problem the first block is a Laplacian with nonnegative elements added to the diagonal. This causes the iteration matrix corresponding to the first block to have small spectral radius; it therefore requires very few inner iterations to converge. The other blocks are only diagonally semidefinite, and therefore the iteration converges slowly. Intuitively this would seem to be an ideal problem for solving the inner iteration to a percentage of the outer residual. For this method there are two parameters: the first is the percentage of the outer residual to which the inner iteration should be solved; the second is a bound on the number of iterations allowed for each block. The purpose of this second parameter is to stop inner iterations that would take a tremendous amount of time to converge for no good reason. Results for the same semiconductor device simulation problem as before are given in Table 4.2. For this matrix little more can be saved by solving each block only as accurately as needed.

We now give tables for the other matrices from the MOSFET set and the Hall Sensor set. The point here is that it is important to see that the nested iteration is useful for all the Newton steps; there would be virtually no point in studying this problem if the nested iteration only paid for itself on certain Newton iterations. The following runs were done only for p = 5k + 1, and were solved somewhat less accurately than the previous results, due to time considerations. The first thing to notice about all the data is that as the inner relaxation factor increases, the optimal number of inner iterations decreases. The reason for this is that it takes fewer inner iterations to solve the inner system exactly. A second thing to note is that while there are some very good numbers associated with increasing the inner and outer relaxation factors, the danger of divergence is correspondingly increased. For instance, for Hall Sensor matrix 5 with ω_O = 0.2, the matrices show their best convergence for the numbers given, but increasing the number of inners causes the system to diverge.
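A minimal sketch of the inner stopping rule used for Table 4.2 follows: solve each block until its residual is below a fraction ε of the outer residual, subject to a bound on the number of inner sweeps. The dense storage, the parameter names eps and max_inner, and the point SOR sweep are illustrative assumptions, not the experimental code.

import numpy as np

def inner_solve_to_fraction(A_bb, rhs, x0, omega, eps, outer_res_norm, max_inner):
    # Point SOR on one diagonal block, stopped when the block residual drops
    # below eps times the norm of the current outer residual, or when the
    # bound max_inner on inner sweeps is reached.
    x = x0.copy()
    n = len(rhs)
    for _ in range(max_inner):
        for i in range(n):
            sigma = A_bb[i, :] @ x - A_bb[i, i] * x[i]
            x[i] += omega * ((rhs[i] - sigma) / A_bb[i, i] - x[i])
        # This extra residual norm is the roughly ten percent overhead per
        # inner iteration discussed above.
        if np.linalg.norm(rhs - A_bb @ x) <= eps * outer_res_norm:
            break
    return x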
ω_O     1.0   1.0   1.0   1.15  1.15  1.15  1.30  1.30  1.30
ω_I     1.0   1.15  1.30  1.0   1.15  1.30  1.0   1.15  1.30
Mults   42.0  34.0  27.6  33.4  26.6  21.6  28.5  24.6  22.2
Inners  21    26    21    31    26    21    36    31    26

Table 4.3: Matrix 3 from MOSFET set.
ω_O     1.0   1.0   1.0   1.15  1.15  1.15  1.30  1.30  1.30
ω_I     1.0   1.15  1.30  1.0   1.15  1.30  1.0   1.15  1.30
Mults   26.4  21.3  16.8  22.2  18.0  13.8  19.2  15.6  12.0
Inners  21    16    11    26    21    16    26    21    16

Table 4.4: Matrix 6 from MOSFET set.
ω_O     1.0   1.0   1.0   1.15  1.15  1.15  1.30  1.30  1.30
ω_I     1.0   1.15  1.30  1.0   1.15  1.30  1.0   1.15  1.30
Mults   25.1  20.1  15.7  21.6  16.6  12.9  17.7  14.4  11.1
Inners  16    11    16    21    16    16    26    21    16

Table 4.5: Matrix 9 from MOSFET set.
ω_O     0.5   0.5   0.5   0.5   0.6   0.6   0.6   0.6   0.7   0.7   0.7   0.7
ω_I     1.0   1.3   1.6   1.9   1.0   1.3   1.6   1.9   1.0   1.3   1.6   1.9
Mults   24.5  13.8  6.8   2.5   20.4  11.8  6.3   %     18.4  14.9  %     %
Inners  26    16    11    1     16    6     6     %     6     1     %     %

Table 4.6: Matrix 1 from Hall Sensor set. Converges for all ω_O ≤ 0.5 and ω_I ≤ 1.15. % means that for those parameters the iteration diverged for every number of inner iterations.
ω_O     0.1    0.1   0.1   0.1   0.2   0.2   0.2   0.2
ω_I     1.0    1.3   1.6   1.9   1.0   1.3   1.6   1.9
Mults   123.6  69.0  34.5  12.8  61.7  34.3  17.2  %
Inners  16     11    6     1     21    11    6     1

Table 4.7: Matrix 5 from Hall Sensor set. Converges for all ω_O ≤ 0.1 and ω_I ≤ 1.15.
ω_O     0.1    0.1   0.1   0.1   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3
ω_I     1.0    1.3   1.6   1.9   1.0   1.3   1.6   1.9   1.0   1.3   1.6   1.9
Mults   120.0  67.2  33.7  12.7  59.7  33.6  16.7  6.3   41.4  33.4  %     %
Inners  16     11    6     1     16    11    6     1     6     1     %     %

Table 4.8: Matrix 10 from Hall Sensor set. Converges for all ω_O ≤ 0.1 and ω_I ≤ 1.15.
         ω_I = 1.0      ω_I = 1.2      ω_I = 1.4      ω_I = 1.6      ω_I = 1.8
ω_O      Mults    p     Mults    p     Mults    p     Mults    p     Mults    p
1.0      2.26     5     1.55     3     0.98     3     0.53     3     0.52     5
1.2      1.78     7     1.19     5     0.77     3     0.46     3     0.98     9
1.4      1.31    15     0.88    11     0.61     5     0.84     5     1.85    11
1.6      1.23     7     0.90     3     1.01     3     1.72     7     3.83    13
1.8      1.25     3     1.66     3     2.70     5     4.76     9     10.69   19

Table 4.9: Matrix 1 of Square Root set, blocked into 2 blocks.
4.3.1 Circuit Simulation Data
To determine the optimal number of inner iterations for the square root simulation matrices we must first determine the number of blocks to use. The system was naturally broken into six blocks; by coalescing blocks we can get systems with fewer blocks. We will present results for blocking the system into 2, 3, and 6 blocks. Using larger blocks means slower inner convergence, but faster outer convergence. The data exhibits how much the outer convergence rate contributes to the optimal amount of computation: the outer iteration can be more expensive (more multiplications), but it still might pay if the convergence rate is high enough. The data shows that although the larger blocks are better when no over-relaxation is done, as the outer over-relaxation increases using the smaller blocks begins to pay, since the outer convergence rate is fast enough to make up for the extra work (see Tables 4.9, 4.10, 4.11, 4.12, 4.13, 4.14). Note that all of these tables show the same effect as the previous sets, that is, a decrease in the number of inner iterations as the inner relaxation factor increases. Here, however, the inner and outer relaxation factors conspire to cause the optimal number of inner iterations to increase if both
         ω_I = 1.0      ω_I = 1.2      ω_I = 1.4      ω_I = 1.6      ω_I = 1.8
ω_O      Mults    p     Mults    p     Mults    p     Mults    p     Mults    p
1.0      2.14    21     1.42    19     0.89    15     0.48    11     0.31    13
1.2      1.62    23     1.05    19     0.59    15     0.34     7     0.65     9
1.4      1.34    17     0.92    11     0.57     7     0.54     5     1.19    11
1.6      1.16     9     0.88     5     0.66     3     1.15     5     2.43    13
1.8      1.29     3     1.13     3     1.86     5     3.09     7     6.78    17

Table 4.10: Matrix 4 of Square Root set, blocked into 2 blocks.
         ω_I = 1.0      ω_I = 1.2      ω_I = 1.4      ω_I = 1.6      ω_I = 1.8
ω_O      Mults    p     Mults    p     Mults    p     Mults    p     Mults    p
1.0      2.03    13     1.38     7     0.91     5     0.51     3     0.51     7
1.2      1.54    21     1.05    15     0.68     7     0.45     5     0.96     9
1.4      1.38    11     0.93     7     0.64     3     0.83     5     1.82    11
1.6      1.30     5     0.90     3     1.01     3     1.75     5     3.71    13
1.8      1.60     1     1.75     3     2.85     5     4.85     7     10.54   17

Table 4.11: Matrix 1 of Square Root set, blocked into 3 blocks.
         ω_I = 1.0      ω_I = 1.2      ω_I = 1.4      ω_I = 1.6      ω_I = 1.8
ω_O      Mults    p     Mults    p     Mults    p     Mults    p     Mults    p
1.0      2.10    21     1.39    19     0.88    15     0.47    15     0.30    13
1.2      1.59    23     1.03    19     0.58    15     0.34     7     0.64     9
1.4      1.32    17     0.91    11     0.56     7     0.54     5     1.17    11
1.6      1.14     9     0.87     5     0.65     3     1.13     5     2.39    13
1.8      1.27     3     1.11     3     1.84     5     3.04     7     6.65    17

Table 4.12: Matrix 4 of Square Root set, blocked into 3 blocks.
         ω_I = 1.0      ω_I = 1.2      ω_I = 1.4      ω_I = 1.6      ω_I = 1.8
ω_O      Mults    p     Mults    p     Mults    p     Mults    p     Mults    p
1.0      2.21     5     1.51     5     0.98     3     0.54     3     0.49     7
1.2      1.73     7     1.17     5     0.76     3     0.45     3     0.94     9
1.4      1.30    11     0.90     7     0.60     3     0.81     5     1.76    11
1.6      1.19     5     0.83     3     0.99     3     1.70     7     3.60    13
1.8      1.56     1     1.72     3     2.78     5     4.72     7     10.22   17

Table 4.13: Matrix 1 of Square Root set, blocked into 6 blocks.
         ω_I = 1.0      ω_I = 1.2      ω_I = 1.4      ω_I = 1.6      ω_I = 1.8
ω_O      Mults    p     Mults    p     Mults    p     Mults    p     Mults    p
1.0      2.25     5     1.55     5     1.00     3     0.55     3     0.33     3
1.2      1.76    11     1.21     5     0.79     5     0.41     3     0.59     9
1.4      1.28    17     0.84    11     0.55     7     0.52     5     1.09    11
1.6      1.18     7     0.79     5     0.67     3     1.07     7     2.27    13
1.8      0.82     3     1.21     3     1.73     5     2.88     7     6.32    17

Table 4.14: Matrix 4 of Square Root set, blocked into 6 blocks.
the inner and outer relaxation factors are increased. While this section has demonstrated that there are good possibilities for using nested iteration, there is a high degree of dependence on the accuracy used to solve the inner iteration. In the next section we will show the efficacy of the approximation techniques that we developed in the previous chapter.

Alg. 10       ω_I = 1.0   1.1    1.2
ω_O = 1.0       80.4      69.5   60.7
ω_O = 1.1       52.9      72.5   63.0
ω_O = 1.2       54.5      47.8   66.3
ω_O = 1.3       56.0      50.0   44.4

Table 4.15: Simple p selection algorithm applied to Matrix 1 of MOSFET set.
4.4 Approximation Algorithms

Knowing that there is a number of inner iterations that minimizes the work, and that it can halve the amount of work done, is of little use by itself since, in general, we cannot know what that number is. We would like to estimate the optimal number using approximation algorithms. In this section we examine the approximation algorithms of the previous chapter. The results of this section show the difficulty of the problem: the approximation algorithms do not perform very well, approximations to the spectral radius of the iteration matrix prove elusive, and some of the better results we obtained are due to external factors which we will explore in this section.

We begin with the Simple p selection algorithm (Algorithm 10). Recall that this algorithm monitors the work ratio and chooses the next number of inner iterations accordingly. The results obtained are encouraging, as is evidenced in Table 4.15. These numbers appear to be within acceptable levels of optimal, but one would certainly like to improve upon them. Upon closer observation one sees that the increase in the number of inner iterations selected was due less to an improvement in the work as the number of inner iterations increased than to an increase in the ratio r_m / r_{m-k} (residual at step m divided by residual at step m-k, where k is the number of steps until stability in the estimate has occurred). That is, our estimate of the spectral radius is changing as the iteration progresses, making earlier numbers of inner iterations appear more successful than they actually were.

This observation leads to a new algorithm. The algorithm puts a time stamp on work estimates for numbers of inner iterations; that is, after a certain number of work estimates for various numbers of inner iterations have been made, the older estimates expire and are no longer considered. Running this algorithm improves things to a certain extent, but it does not really change the effect seen with the original algorithm. There is a suggestion that something very different is going on.

Let T_i and T_j be the iteration matrices for nested iteration with i and j inner iterations done respectively. Assume |(T_k)_1| ≥ |(T_k)_2| ≥ ... ≥ |(T_k)_n|, where (T_k)_s is an eigenvalue of T_k, k = i, j. In general T_i and T_j do not share eigenvectors, but for illustrative purposes suppose they did. Further, suppose that the eigenvector corresponding to the eigenvalue (T_i)_s of T_i also corresponds to the eigenvalue (T_j)_{n-s+1} of T_j. Now consider an iterative procedure using T_i. After a few steps the error vector begins to point in the direction associated with the eigenvector corresponding to the largest eigenvalue.
              ω_I = 1.0              ω_I = 1.1              ω_I = 1.2
ω_O      Ub = 30   45    60     Ub = 30   45    60     Ub = 30   45    60
1.0        76.3   69.8  73.3      66.4   63.6  64.7      59.2   52.8  62.2
1.1        68.8   55.4  59.6      57.9   54.2  61.4      45.5   51.4  68.7
1.2        61.7   61.4  66.3      48.9   49.6  61.3      45.4   43.0  45.2
1.3        54.1   57.9  55.5      48.8   60.0  48.4      40.5   81.7  36.3

Table 4.16: Random algorithm applied to Matrix 1 of MOSFET set.
Continued iterations with the same iteration matrix will now only reduce the size of the error by a ratio corresponding to the largest eigenvalue. Algebraically, after p steps the error vector approaches (T_i)_1^p v_1 (where v_1 is the eigenvector associated with the largest eigenvalue of T_i). Recalling that the iteration matrix is the matrix that propagates the error from one step to the next, we see that one more iteration with i inner iterations yields error vector (T_i)_1^{p+1} v_1, while one step of the iteration with j inner iterations yields (T_j)_n (T_i)_1^p v_1, which is clearly a tremendous improvement.

The example just given was obviously not realistic, but it does present an interesting problem: how do we know what the approximation algorithms are doing? There are two possibilities. The first is that the approximations are helping in the choice of the next number of inner iterations to use. The second is that the approximations are not helping, and it is just the fact that we change the number of inner iterations that makes a difference. This suggests an algorithm in which we change the number of inner iterations completely randomly. That is, after every, say, five iterations we call a random number generator to determine the number of inner iterations to use in the next five iterations (see Algorithm 16).
Algorithm 16 (Random Algorithm)
  flag = k1
  Do until convergence
    If flag = k1 then
      inners = random integer in the range 1 ... Ub
      flag = 1
    else
      flag = flag + 1
    Do Algorithm 9 with inners inner iterations
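A small Python sketch of this strategy follows. The callbacks do_outer_iteration and residual_norm are placeholders for one outer sweep of the nested solver (the role Algorithm 9 plays above) and for the current residual norm; k1 and the stopping test are illustrative assumptions.

import random

def random_inner_selection(do_outer_iteration, residual_norm, Ub, k1=5,
                           tol=1e-8, max_outer=10000):
    # Every k1 outer iterations draw a fresh random number of inner iterations
    # in the range 1..Ub, as in Algorithm 16.
    for k in range(max_outer):
        if k % k1 == 0:
            inners = random.randint(1, Ub)
        do_outer_iteration(inners)
        if residual_norm() < tol:
            return k + 1
    return max_outer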
The results of this algorithm are very impressive; Table 4.16 gives the results. The random generator generates numbers between 1 and some upper bound Ub; we tried Ub = 30, 45, 60 for this demonstration. As further evidence, consider running the random algorithm with both a lower and an upper bound. We ran the random algorithm with ω_O = 1.0 and ω_I = 1.0, with random numbers of inner iterations chosen between 32 and 42 (the optimum was 38). The number of multiplications came to 62.9 million, as opposed to the "optimal" number for 38 inner iterations of 64.3 million. With the impressive results of the random algorithm, we need only choose an upper bound. This brings us to the predictor algorithm (Algorithm 13). In the original formulation of this
algorithm we predicted the optimal number of inner iterations using approximations to the norms of T_1 (the iteration with one inner) and R (the inner iteration matrix), and then switched to Algorithm 10. In view of the previous results of this section, we instead switch to the random algorithm, using the predictor as an upper bound on the number of inners to use. The results of this approach are very encouraging. A problem is that the predictions may compute ||T_1||