EFFICIENT SOLUTION OF LARGE SPARSE ... - CiteSeerX

2 downloads 284327 Views 1MB Size Report
B.S., Massachusetts Institute of Technology, 1980. M.S., University of Illinois, ... for the degree of Doctor of Philosophy in Computer Science ...... Our rst program 34] used the Chebyshev accelerated subspace iteration subrou- tine RITZIT of ...
EFFICIENT SOLUTION OF LARGE SPARSE EIGENVALUE PROBLEMS IN MICROELECTRONIC SIMULATION

BY ALBERT THOMAS GALICK B.S., Massachusetts Institute of Technology, 1980 M.S., University of Illinois, 1984

THESIS Submitted in partial ful llment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1993

Urbana, Illinois

EFFICIENT SOLUTION OF LARGE SPARSE EIGENVALUE PROBLEMS IN MICROELECTRONIC SIMULATION Albert Thomas Galick, Ph.D. Department of Computer Science University of Illinois at Urbana-Champaign, 1993 Thomas Kerkhoven, Advisor We present a new Chebyshev{Arnoldi algorithm for nding the lowest energy eigenfunctions of an elliptic operator. The algorithm, which is essentially the same for symmetric, nonsymmetric, and complex nonhermitian matrices, is adapted to a speci c problem by two subroutines which encapsulate the problem{speci c de nition of energy, plus the discretization and matrix{vector multiply routines. We adapt the algorithm to two important problems, the self{consistent Schrodinger{Poisson model of quantum{e ect devices, and the vector Helmholtz equation for a dielectric waveguide, addressing other important physical, numerical and computational issues as they arise. An asymptotic convergence estimate is derived which shows the Chebyshev{Arnoldi algorithm to be superior to Chebyshev{preconditioned subspace iteration. We also examine Newton methods for general large sparse eigenvalue problems satisfying the overdamping condition and show how to use sparse iterative solvers more e ectively in them.

iii

TABLE OF CONTENTS CHAPTER PAGE 1 INTRODUCTION : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 2 EFFICIENT NUMERICAL SIMULATION OF ELECTRON STATES IN QUANTUM WIRES : : : : : : : : : : : : : : : : : : : : : : : : : : : : 6 Introduction : : : : : : : : : : : : : : : : : : : : : : The equations : : : : : : : : : : : : : : : : : : : : : Reformulation as a xed point problem : : : : : : : The Eigenvalue Problem for Schrodinger's Equation The Solution of the Nonlinear Poisson Equation : : Stabilization and Acceleration : : : : : : : : : : : : 2.6.1 Stabilization by Adaptive Underrelaxation : 2.6.2 Acceleration by Newton's Method : : : : : : 2.6.3 Nonlinear versions of the GMRES algorithm 2.7 Numerical Results : : : : : : : : : : : : : : : : : : : 2.8 Conclusion : : : : : : : : : : : : : : : : : : : : : : :

2.1 2.2 2.3 2.4 2.5 2.6

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

: : : : : : : : : : :

6 8 13 16 18 21 22 23 24 28 40

3 FINITE-DIFFERENCE ANALYSIS OF REAL AND COMPLEX DIELECTRIC WAVEGUIDES : : : : : : : : : : : : : : : : : : : : : : : : : : 41 3.1 3.2 3.3 3.4 3.5 3.6 3.7

Introduction : : : : : : : : : : : : : Maxwell equations : : : : : : : : : Interfaces between Di erent Media Discretization Approach : : : : : : Solution of the Eigenvalue Problem Numerical Results : : : : : : : : : : Conclusion : : : : : : : : : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

41 43 49 51 57 62 73

4 ITERATIVE CHEBYSHEV{ARNOLDI METHODS FOR EIGENVALUE PROBLEMS : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 74 4.1 4.2 4.3 4.4

Introduction : : : : : : : : : : : : : : : : : Banded Arnoldi Process : : : : : : : : : : The Selection Iteration : : : : : : : : : : : Convergence of Krylov Subspace Methods 4.4.1 Heuristic Convergence Analysis : : 4.4.2 Predictable Convergence : : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

: : : : : :

74 75 77 82 83 84

5 NEWTON METHODS FOR EIGENVALUE PROBLEMS : : : : : : 92 5.1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 92 iv

5.2 A Note on Complex Scaling : : : : : : : : : : : : : : : : : : : : : 5.3 Direct Newton Methods : : : : : : : : : : : : : : : : : : : : : : : 5.4 Indirect Newton Methods : : : : : : : : : : : : : : : : : : : : : : 5.4.1 Example: The Hermitian Generalized Eigenvalue Problem 5.5 Convergence of the Newton Iteration : : : : : : : : : : : : : : : : 5.6 General Eigenvalue Problems : : : : : : : : : : : : : : : : : : : : 5.6.1 General Block Rayleigh Quotient : : : : : : : : : : : : : : 5.7 In ated Inverse Iteration : : : : : : : : : : : : : : : : : : : : : : : 5.7.1 Defective or Ill{conditioned Eigenvalues : : : : : : : : : :

: : : : : : : : :

: : : : : : : : :

: : : : : : : : :

: : : : : : : : :

93 96 98 99 101 103 104 106 110

LIST OF REFERENCES : : : : : : : : : : : : : : : : : : : : : : : : : : : 112 APPENDIX A PDE CONSISTENCY CHECK : : : : : : : : : : : : : : B CHEBYSHEV POLYNOMIALS : : : : : : : : : : : : : : B.1 Chebyshev Polynomials Adapted to an Ellipse : VITA : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

v

: : : : : : : : : 118 : : : : : : : : : 123 : : : : : : : : : 124 : : : : : : : : : 126

LIST OF TABLES Table I II

Page

Occupancy levels of the rst four wavefunctions for applied potentials V1 = V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 32 Occupancy levels of the rst 12 wavefunctions for applied potentials V1 = 0:7V and V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included. 39

vi

LIST OF FIGURES Figure

Page

1  ::::::: 1 + exp Ek?BEF Schematic diagram of the simulated heterojunction structure. : : : : : : Convergence to self{consistency for applied potentials V1 = V2 = 0:2V . Flattening of the conduction band potential energy V corresponding to a rapid exponential decay in the ionized acceptor density NA? . : : : : : : : Closeup of the conduction band potential energy V , the quantum electron density nQW , and the rst two wavefunctions for applied potentials V1 = V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : Convergence to self{consistency for applied potentials V1 = 0:7V and V2 = 0:2V . : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : Closeup of the conduction band potential energy V , the quantum electron density nQW , and the rst two wavefunctions for applied potentials V1 = 0:7V and V2 = 0:2V , with the exchange{correlation potential Vxc(nQW ) included. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : Closeup of wavefunctions 3{6 corresponding to the potential of Fig. 2.7 . Closeup of wavefunctions 7{10 corresponding to the potential of Fig. 2.7 . Interface Between Di erent Media. : : : : : : : : : : : : : : : : : : : : : Modi ed Box{Integration Scheme. : : : : : : : : : : : : : : : : : : : : : : Preconditioning and Selection of Complex Modes. : : : : : : : : : : : : : The Chebyshev Polynomial pn () adapted to the ellipse of Fig. 3.3 for n = 16, plotted vs. real . : : : : : : : : : : : : : : : : : : : : : : : : : : Square waveguide dispersion. : : : : : : : : : : : : : : : : : : : : : : : : : H21x mode (Hx component only) for a square waveguide as in Fig. 3.5 . : H21x mode (Hy component only) for a square waveguide as in Fig. 3.5 . : H2x;1+2 mode (Hx component only) for a square waveguide as in Fig. 3.5 . H23x mode (Hx component only) for a square waveguide as in Fig. 3.5 . : Channel waveguide dispersion. : : : : : : : : : : : : : : : : : : : : : : : : H32x mode (Hx component only) for a channel waveguide as in Fig. 3.10 . Strip{slab waveguide dispersion. : : : : : : : : : : : : : : : : : : : : : : : Complete spectrum (using EISPACK) of the square waveguide of Fig. 3.5 discretized on a 13  13 grid, together with a priori preconditioning and cuto curves. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

2.1 Fermi{Dirac distribution function f (E ) =

14

2.2 2.3 2.4

29 30

2.5 2.6 2.7 2.8 2.9 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 4.1

vii

32 33 34 36 37 38 50 53 59 59 63 65 66 67 68 70 71 72 80

CHAPTER 1 INTRODUCTION The self{consistent Schrodinger{Poisson model of quantum{e ect devices and the 2D vector Helmholtz equation for dielectric waveguides require resolution of the lowest energy eigenfunctions of an elliptic operator. For these and similar problems, we propose a uni ed approach in which the problem{speci c de nition of energy is encapsulated in a few subroutines. The only restriction on this de nition of energy is that it must decrease as one moves outward in the discrete spectrum from some ellipse in the complex plane. One problem{speci c task is to determine this ellipse, and another is to determine cuto curves for the desired part of the low energy spectrum outside this ellipse. There are also problem{speci c subroutines for the discretization and the corresponding matrix{vector multiply. Our approach is validated by the disparity of the problems we successfully solve: Schrodinger's equation leads to a real symmetric eigenvalue problem with a built{in de nition for the energy of a wavefunction corresponding exactly to our de nition, while Helmholtz' equation can lead to complex nonhermitian eigenvalue problems and has no natural de nition for the energy of an optical mode. The de nition of energy we propose assigns the lowest energy to the lowest order modes. 1

When this type of eigenvalue problem is discretized over a spatial domain, the result is a large, sparse, generalized eigenvalue problem Ku = Mu. These eigenvalue problems are nondefective, at least in the desired invariant subspaces, and M is hermitian positive de nite, always. In fact, M is a natural inner product for the problem, arising directly from the discretization. In our problems, M is diagonal, and we easily recast the problem as Av = v, where A = M ?1=2KM ?1=2 and v = M 1=2u. The discretization resolves modes less well the higher their order, and above the rst few modes the resolution is so poor that they no longer bear any relevance to the PDE. An ecient solver must pick the low order modes from the vast majority that are worthless. Furthermore, of the modes that are well{resolved by the discretization, only a few may be physically interesting, for example, the occupied wavefunctions of Schrodinger's equation, or the con ned, propagating modes of a waveguide. Standard dense eigenvalue solvers are unaware of all this. For realistic grids, their use leads to unmanageable CPU time and memory requirements, since the solution for all the possible eigenvalues is attempted. The implicit assumption behind such packages as EISPACK and LAPACK is that the matrix is the eigenvalue problem, but the matrix does not contain any information about the appropriate de nition of energy. Even useful information that is contained in the matrix, namely the special matrix structure that comes from the discretization, cannot be used by dense methods. Under these circumstances, it is inappropriate to use dense matrix eigenvalue solvers. In fact, it is inappropriate to think of these as matrix eigenvalue problems at all. We view them as operator eigenvalue problems which have been projected onto a vector space of 2

functionals by the discretization. However, this vector space is much larger than what we really want. Projection methods [61] project the problem onto a smaller subspace but are inadequate without some kind of acceleration to the desired invariant subspace. Appropriate algorithms for determining a subset of eigenvalues/eigenvectors of large, sparse nonsymmetric matrices include bi{iteration [14], lopsided iteration [78], simultaneous iteration [76, 77], nonsymmetric Lanczos algorithm [53, 39], and variations on the method of Arnoldi [60]. These are all projection methods, but they all fail to address the issue of ecient acceleration to the desired subspace in a particular problem. Shift{ and{invert as a preconditioner for Arnoldi's method [48] is not e ective for large, sparse problems. Even the Tchebychev{Arnoldi algorithm rst proposed by Saad [62] and implemented for the \optimal ellipse" in [27] does not use the Chebyshev polynomial to precondition the problem, but only the starting vector. Furthermore, it doesn't \know" anything about the particular problem being solved. RITZIT [58, 59] comes close to addressing the problem for symmetric matrices, but adapting it to a particular problem is awkward because the concept of energy is missing. RITZIT relies on the user to shift the problem so that the low{energy eigenvalues have the largest absolute value, and uses a cuto count rather than a cuto energy to decide whether a mode is desired. Our approach has been to attack two particular eigenvalue problems using any physical information that helps, and determine what common algorithmic approach we can devise. We chose the method of Arnoldi as the basic algorithm for several reasons. First, Krylov subspace methods, such as the Lanczos and Arnoldi algorithms, have superior convergence properties for extremal eigenvalues [61] and can be e ectively accelerated 3

by preconditioning [1]. Also, numerical stability is improved by the modi ed Gram{ Schmidt process [4] in the Arnoldi iteration. Finally, the Arnoldi method can easily be generalized to nd eigenvalues of any given multiplicity. For the Chebyshev acceleration, the approach has been to write skeleton code with robust general{purpose subroutines, and modular problem{speci c subroutines to specialize the algorithm to the particular problem. While the main focus has been eigenvalue problems, it was neither desirable nor possible to ignore other physical, numerical, and computational issues that came up. In Chap. 2, in order to be faithful to the physics, we couple a highly nonlinear Poisson equation to the Schrodinger equation. To more accurately model the nonlinear terms, we use a discretization scheme that combines information about the iterates on the grid with material parameters de ned on the dual grid. Our code allows boundary conditions, material interfaces, or changes from classical to quantum model to be speci ed on any points/lines of the grid. We adapt the solution method to the convergence of the outer iteration, rst using an adaptive underrelaxation strategy, then switching to a Jacobian{ free Newton acceleration near the xed point. Finally, our symmetric version of the Chebyshev{Arnoldi algorithm is much faster than a suitably adapted and optimized version of RITZIT. In Chap. 3 we present a new discretization of the 2D vector Helmholtz equations for the modes of a dielectric waveguide. In App. A, we show it is rst{order consistent, and second{order consistent on regular grids. We present an ad hoc calculation of the ellipse

4

to which we adapt the Chebyshev polynomial acceleration, and we present numerical results verifying the e ectiveness of this approach. In Chap. 4, the Chebyshev{Arnoldi algorithm, which was introduced informally in Chap. 3, is discussed in depth. The algorithm is essentially the same for symmetric, nonsymmetric, and complex nonhermitian problems. In the symmetric case, the ellipse is just an interval on the real axis. The asymptotic convergence estimates in Sec. 4.4.2 are also universally applicable, and show the method to be much faster than subspace iteration. The asymptotic estimates come from examining the trans nite diameter of a set, which is yet another application of Chebyshev polynomials. We review the essential facts about Chebyshev polynomials in App. B. In Chap. 5, we examine Newton{like algorithms and show how to use sparse iterative solvers and inverse iteration e ectively for general large sparse nonlinear eigenvalue problems satisfying the overdamping condition. In our linear eigenvalue problems, M is diagonal and reduction to a standard eigenvalue problem is trivial, but in general, Cholesky reduction is very expensive because of ll{in. For linear pencils in general, the mass matrix should be lumped before Cholesky reduction, and the solution from Chebyshev{Arnoldi should be used as an initial guess for Newton's method, where the full mass matrix may be used.

5

CHAPTER 2 EFFICIENT NUMERICAL SIMULATION OF ELECTRON STATES IN QUANTUM WIRES 2.1 Introduction Recent advances in the fabrication technology of semiconductor nanostructures have made possible the realization of systems with extremely small sizes, which behave as quasi{one{dimensional conductors or quantum wires. In most cases, quantum{mechanical methods are necessary to describe such systems, to take into account e ects like size quantization [43, 44, 42] and quantum interference [29]. Quantum wire devices, whose operation is based on quantum interference e ects, have been proposed [72, 73, 17] and preliminary experimental veri cation has been presented [47]. In this chapter we present a new algorithm for the numerical computation of electron states in the cross{section of quantum wires. The two{dimensional physical model consists of Poisson's equation for the electrostatic potential , coupled with Schrodinger's equation for the electron wavefunctions f lg:

?r  [r] = ;

?r 

h h2

2m r l

i

+ [V ? El] l = 0: 6

These equations and and their coupling are discussed in detail in Sec. 2.2. Brie y, the electrostatic potential  a ects the conduction band potential V , and the occupied wavefunctions

l

determine the electron density nQW in the quantum well, which contributes

to the total charge density . Earlier computations on this set of equations were performed by Laux and Stern [43], Laux and Warren [44], and Laux [42]. Our approach to the solution of this coupled system di ers from the methods presented in [43, 44, 42] in four ways. First, our outer iteration is formulated in terms of the quantum electron density nQW , rather than the electrostatic potential . We proceed by successive solution of Poisson's equation and the eigenvalue problem for Schrodinger's equation, substituting the solution of one equation into the coupling term of the next. Such a process can be represented through iteration of a mapping T , de ned in Alg. 2.1 in Sec. 2.3. Clearly, a xed point of

T corresponds to a self{consistent solution. We believe this is a cleaner formulation than the iteration in  employed in [43, 44, 42], for which the exchange{correlation potential

Vxc (nQW ) (See Sec. 2.2) is a \lagging" term in the Schrodinger equation. Second, we solve the eigenvalue problem for Schrodinger's equation by subspace iteration [59] or Chebyshev{Arnoldi (See Chap. 4) rather than by the Lanczos algorithm of Cullum and Willoughby [16]. We discuss our approach to this eigenvalue problem in Sec. 2.4. Third, as discussed in Sec. 2.5, the Newton direction for solving nonlinear Poisson's equation is found using a version of the Conjugate Gradient (CG) method which involves a red{black reordering and block reduction to halve the matrix size while doubling the 7

number of o {diagonal bands in the matrix. The reduction in size corresponds to a reduction in matrix{vector multiplies required to converge, while the increased matrix band width increases potential parallelism in those multiplies. Fourth, we adapt our algorithm depending on how close we are to the solution. T is initially stabilized through adaptive underrelaxation as discussed in Sec. 2.6.1. Close to the solution we accelerate the speed of convergence of the iteration with T by applying Newton's method to the corresponding root nding problem  ? T ( ) = 0. Our version of Newton's method is Jacobian{free and based on the Generalized Minimum Residual method (GMRES) by Saad and Schultz [64], as described in Secs. 2.6.2 and 2.6.3. In Sec. 2.7, we present some numerical results on our approach. The combination of two di erent solution algorithms for the outer iteration in T with Chebyshev{Arnoldi and reduced system conjugate gradient for implementing the mapping T performs remarkably well. The computer programs have been run on a variety of computers, mostly super{ scalar or multi{vector machines, and can take advantage of parallel architectures in many of the most compute{intensive subroutines.

2.2 The equations We repeat Poisson's and Schrodinger's equations:

?r  [r] = ;

?r 

h h2

i

2m r l + [V ? El ] l = 0:

8

(2.1) (2.2)

Here m is the e ective mass of electrons,  is the dielectric constant,  is the charge density, and V is the conduction band potential energy. The charge density  in Eq. (2.1) is expressed in terms of the magnitude of the elementary charge e, the electron density n, the hole density p, and the total density of ionized dopants ND+ ? NA? as

 = e[p ? n + ND+ ? NA?]:

(2:3)

The electron density n = nQW in the quantum well is obtained from the eigenpairs

fEl ; lg of Schrodinger's Eq. (2.2) through nQW =

1 X l=1

Nl l2:

(2:4)

The number of occupied levels Nl of the one{dimensional electron gas associated with the

lth two{dimensional eigenstate is expressed in terms of the Fermi level EF , the e ective mass mz along the wire axis z, the valley degeneracy gv (which is 1 for GaAs and 2 for Si), and the temperature , through the Fermi{Dirac integral of order ?1=2 as

Nl = gv 2mz k2B  h

!1=2

 EF ? E l  F?1=2 k  : B

(2:5)

The rest of the terms in the charge density Eq. (2.3) are modeled classically [80], as is the electron density n = nCL in the bulk beyond the quantum well. The densities of

9

ionized dopants are given by

2 ND+ = ND 41 ?

3 1 ND 5   = 1 + g1D exp EDkB?EF 1 + gD exp EFkB?ED NA? =

NA : 1 + gA exp EAkB?EF

(2:6) (2:7)

Here, ND and NA are the concentrations of donor and acceptor impurities. The donor and acceptor energy levels ED and EA are determined by the conduction{band energy

EC ? EF  V , the energy bandgap Eg , and the ionization energies Ed and Ea for donors and acceptors, respectively, by

ED = EC ? Ed EA = EC ? Eg + Ea The ground{state degeneracies of the donor and acceptor levels are, respectively, gD = 2 and gA = 4 for Ge, Si, and GaAs. The hole density p near the top of the valence band is given by

 EV ? EF  2 p = NV p F1=2 k  : B

Here, NV is the e ective density of states for a three{dimensional hole gas, given by

NV = 2 mdhkB2 2h

10

!3=2

;

where the density{of{state e ective mass of the valence band is given in terms of the light and heavy hole masses by

mdh = (m3lh=2 + m3hh=2)2=3: Similarly, the electron density nCL near the bottom of the conduction band is given by

 EF ? EC  2 nCL = NC p F1=2 k  : B Here, NC is the e ective density of states for a three{dimensional electron gas, given by

NC = 2 mdekB2 2h

!3=2

MC ;

where mde is the density{of{state e ective mass for electrons and MC is the number of equivalent minima in the conduction band. To summarize, the nonlinear form of Poisson's Eq. (2.1) in terms of the dimensionless potential  = ke is B

  e (; n ) 0 = ?r   r ? k  QW 0 B 0

(2:8)

where

   ( ) V V 2 2 h ? Eg h (; nQW ) = e NV p F1=2 ? + k  ? nQW or NC p F1=2  ? k  B B "

11

3 N N A D ? 5 :   + 1 + gD exp  + ?VkhB+Ed 1 + gA exp ? + Vh?kEBg+Ea We obtain boundary conditions for Poisson's Eq. (2.8) at the backgate by setting (; 0) = 0 for charge neutrality and solving for  using an interval bisection method. The potential V in Schrodinger's Eq. (2.2) is given by

V = ?e + Vh + Vxc (nQW ); where Vh takes into account heterojunction discontinuities and Vxc is the local exchange{ correlation potential [75] given by

h i 2  R; Vxc = ? 1 + 0:7734x ln(1 + x?1)  r s

 1=3 where = 4 , x = rs =21, 9  4nQW ?1=3, ; a rs = 3 2

h ; a = 4me 2 and the e ective Rydberg is given by

e2 : R = 8a 

12

In the computer program, we use the form

" #) ( Vxc = ? 0  4nQW 1=3 + 0:7734 ln 1 + 21a  4nQW 1=3 2 aR kB   3 21a 3 

(2:9)

where the dimensionless Rydberg 2 e R = 8 ak  0 B

and the Bohr radius

2

0h a = 4m e2 0

don't depend on variations in  or m. The image potential has been neglected. In terms of the dimensionless potential , the conduction band dimensionless potential is

V = ? + Vh + Vxc kB  kB  kB  and the Schrodinger equation is

    ? r  mm0 r l + 2mh0k2B  k V ? kEl l = 0: B B

(2:10)

2.3 Reformulation as a xed point problem We found the nearly step{like behavior of the Fermi{Dirac distribution function f (E ) at very low temperatures (Fig. 2.1) to be the most important consideration for conver13

1

f (E )

 = 4:2K  = 77K

0.5

0

EF E

Figure 2.1 Fermi{Dirac distribution function f (E ) =

1  1 + exp Ek?BEF

gence. Successive solution of Eqs. (2.1,2.2) applied to a quantum electron density nQW not near the xed point causes energy levels near the Fermi level to become completely occupied or empty. To smooth this behavior, we de ne our mapping T in terms of  , which corresponds approximately to ?f ?1 (nQW ). The precise de nition of the mapping

T is given below. Note that each step of an iterative solver may be considered part of the xed{point iteration, so that exact solutions of the Schrodinger and Poisson equations are not always necessary.

Algorithm 2.1 (The Fixed Point Mapping T :  7! ) To smooth the behavior of

nQW under successive solution of Poisson and Schrodinger equations: ( ) n max + 2 nQW = max 0; e?= + 1 ?  Perform one or more Newton steps (with or without an inexact linesearch) for nonlinear Poisson's Eq. (2.8) for an intermediate estimate to the dimensionless potential (nQW ). 14

Obtain the conduction band dimensionless potential V=kB  = ? + Vh =kB  + Vxc (nQW )=kB  from  and nQW . Given V=kB , perform several iterations of Chebyshev{Arnoldi or subspace iteration on Schrodinger's Eq. (2.10) for the occupied levels fEl=kB ; lg. Compute the electron density in the quantum well 1 X n QW = Nl l2: l=1

! n + 2  max  = ? ln n +  ? 1 : QW

The constant  is added to avoid singularity and to somewhat desensitize the iteration to small variations in very small n QW , which the logarithm alone would greatly amplify. We set

nmax = pmax gv 2mz k2B  h

!1=2

F?1=2(100)

and

 = 0:01nmax=pmax where pmax represents the maximum number of computed wavefunctions. This xed point mapping, essentially a nonlinear block Gauss{Seidel algorithm, may or may not converge to a solution, but it can be stabilized and accelerated by an appropriate underrelaxation strategy. The xed point iteration and approximate Newton's method both bene t from the additional smoothness introduced by our transformation. We discuss stabilization and acceleration procedures in detail in Section 2.6.

15

2.4 The Eigenvalue Problem for Schrodinger's Equation Schrodinger's Eq. (2.10) is discretized on a nonuniform rectangular grid using a box{ integration scheme, as described for instance in [82], except that the variable coecient (in this case m0=m) is evaluated in the middle of mesh regions instead of on mesh segments. Thus, the ux term

?

I m0 r  n ds; @ m

obtained from integrating Eq. (2.10) over a box of boundary @ with outward normal

n, has two components for each side of a box, e.g. the North face:  w m0 e m0  ? 2 m +2m NW NE

N

? n

P

(See Fig. 3.2 for some explanation of symbols). This is better for modeling abrupt changes in material, since it places all such interfaces on grid lines. The wavefunctions are assumed to vanish at the boundaries. This discretization yields a symmetric generalized eigenvalue problem, K + (V ? El)M = 0, where K is a symmetric matrix with only 5 nonzero diagonals and M is a positive diagonal matrix. This generalized eigenvalue problem is recast as a symmetric eigenvalue problem, Ax = Elx, where A = M ?1=2KM ?1=2 + V and

x = M 1=2 . We rely upon the quantum electron density Eq. (2.4) to reduce to a sum over a few occupied wavefunctions. The occupancy of an energy state is given by the Fermi{ 16

Dirac distribution function (See Fig. 2.1), which decreases rapidly as the eigenvalue El increases. Thus, only the lowest energy wavefunctions (i.e., leftmost eigenvalues) of Schrodinger's equation are needed, and it is not necessary to resort to inverse iteration. A subspace iteration or Krylov subspace method can nd extremal eigenpairs using only matrix{vector multiplications. Such multiplications can be implemented very eciently for discretizations on rectangular grids. Our rst program [34] used the Chebyshev accelerated subspace iteration subroutine RITZIT of Rutishauser [58, 59, 52]. This algorithm does not require sophisticated software to recognize and reject spurious eigenvalues or replicated eigenvectors [16] as employed in [43, 44, 42]. As RITZIT delivers the dominant eigenvalues, we only needed to shift the spectrum to the left so that the leftmost eigenvalues became dominant. Also, Rutishauser's RITZIT routine was modi ed to use column operations, which improve data locality and vectorize well, at the expense of a larger workspace. With such minor adaptations, RITZIT was a robust and reasonably fast code which freed us to concentrate on the other aspects of the problem. Currently, we use a much faster and more accurate preconditioned Krylov subspace method, Chebyshev{Arnoldi, which enhances the rapid convergence of Krylov subspace methods and economizes on the cost of orthogonalization without undermining the algorithm (see Chap. 4). Since our goal was robust software, we avoided Lanczos' method and its associated problems, and that strategy has resulted in a faster solver as well. In the computations by Laux and Warren [44], the Lanczos solver accounts for more than

17

95% of the total computation time. The Chebyshev{Arnoldi solver for Schrodinger's equation is only about 15{20% of the total cost.

2.5 The Solution of the Nonlinear Poisson Equation Poisson's Eq. (2.8) is discretized using the same box{integration scheme as for Schrodinger's equation. We impose Dirichlet boundary conditions for device contacts and zero{valued Neumann boundary conditions for other domain boundaries. This yields a 5{banded symmetric stress matrix K = kij , similar to that for Schrodinger's equation. The mass matrix

M = miq , however, is novel in that it couples the grid to the dual grid. The coupling is nonzero for indices q = fNW; SW; SE; NE g, referring to the four quadrants adjacent to the mesh point with index i. The motivation for this is that  is very sensitive to material parameters which can change across mesh lines, such as Vh, Eg , ND , NA, mde , and mdh, and  also depends on  and nQW , which are de ned on mesh points. Our approach is to take an area{weighted average of  over the four regions in the box{integration scheme, always using  and nQW from the center point. This somewhat complicates the Newton solver, but it improves the accuracy over regions of changing material. We use Newton's method with an inexact linesearch to minimize the right hand side (RHS) of Eq. (2.8). The Newton direction is given by

  2 miq " e V ? E ?(3 = 2) 2 h g i = ? kij + k  NV p ?(1=2) F?1=2 ?i + k  ( B 0 )B   Vh ?(3=2) F ? 0 or NC p2 ?(1 ?1=2 i ? =2) k  B

18

(2.11)

 3 1?1   ?Vh+Ed  V ? E + E g a h NAgA exp ?i + kB  ND gD exp i + kB  75 ij CC +h  ?Vh+Ed i2 + h  i A 2 1 + gD exp i + kB  1 + gA exp ?i + Vh?kEBg+Ea q   2 miq " e kij j ? k  NV p2 F1=2 ?i + Vhk?Eg B ( B 0  ) 2 V h ? nQW i or NC p F1=2 i ? k  B 31 ND ?Vh +Ed  ?  NA Vh?Eg +Ea  5 CA ; + 1 + gD exp i + kB  1 + gA exp ?i + kB  q where we have used the identity [5]

j + 1) F : Fj0 = ?(?( j ) j?1 If no initial guess  is available, we create one by solving a linear Poisson equation with

p = n = 0 and all dopants completely ionized,

  2 ?r   r = k e [ND ? NA ]: 0 B 0 Since there is a nonempty Dirichlet boundary for Poisson's equation, the matrix K is positive de nite [28]. Each nonlinear term results in a positive addition to the diagonal, so the Jacobian J () is positive de nite. Since the Jacobian is di erentiable for  > 0, it is Lipschitz continuous near a root, although clearly less so at low temperatures. Thus we expect quadratic convergence in some neighborhood of the root, by the standard local convergence theorem for Newton's method (See e.g. page 90 of [19]), but perhaps in a very small neighborhood. The CG algorithm is well suited for solving Eq. (2.11). 19

As a further improvement, we implemented a red{black reordering and used CG on the reduced system [31]. The red{black reordering consists of splitting the mesh into two sub{meshes in a checkerboard pattern. Since the Laplacian is discretized on a 5 point cross, with the current point in the center of the cross, every black mesh point is coupled only to red ones, and vice versa. The discretization of the Jacobian J () is scaled and put into red{black order by a permutation P . This results in a linear system

PD?1=2 JD?1=2P T (PD1=2x) = PD?1=2f; or

10 0 BB I G CC BB xred CA B@ B@ T xblack G I

1 1 0 CC BB fred CC CA ; CA = B@ fblack

which is still symmetric and positive de nite, where I denotes an identity matrix. A single step of block Gaussian elimination yields the decoupled system

10 0 B BB I xred G C B C B C B@ A@ xblack 0 I ? GT G

1 0 CC BB CA = B@

fred fblack ? GT fred

1 CC CA :

The second equation is symmetric positive de nite: Otherwise, if yT (I ? GT G)y  0 with

y 6= 0, then

0 B ?Gy B B @ y

1T 0 10 CC BB I G CC BB ?Gy CA B@ CA B@ T y G I 20

1 CC CA  0;

which contradicts the positive de niteness of the original system. In particular, the diagonal of I ? GT G is positive. Therefore, we may scale the second equation and solve it using CG. Without the scaling, this CG requires exactly half the number of iterations as CG on the scaled original system [10]. By scaling the second equation, the number of iterations of CG is further reduced. The solution of the original system is reconstructed from that of the reduced system.

2.6 Stabilization and Acceleration In the computations by Laux, et. al., a xed point iteration in the electrostatic potential  is employed. In [43], this outer iteration is accelerated by employing a classical approximation to the full Jacobian of the coupled Poisson's Eq. (2.8). In [44] Aitken's acceleration procedure is employed as well, and in [42] a generalized extrapolation algorithm of Anderson [2]. It is recognized in [43, 44] that the lack of an implementation of Newton's method in which the Jacobian is approximated accurately is an obstacle to rapid convergence of the outer iteration, due to the nonlinearity of the coupled system. In our algorithm for the solution of the coupled Eqs. (2.1,2.2), we accelerate and stabilize the xed point iteration T de ned in Alg. 2.1. Initially, iteration with T is stabilized through underrelaxation. A properly underrelaxed iteration with the mapping

T is expected to constitute an e ective algorithm far away from the solution [38]. Close to the solution, the equation T ( ) =  is solved through an inexact version of Newton's method which is locally quadratically convergent. The linear systems which have to be solved for Newton's method are dense, but this is no obstacle because they are solved in 21

a Jacobian{free manner by the iterative method GMRES which requires only Jacobian{ vector products. The latter can be approximated by nite di erences.

2.6.1 Stabilization by Adaptive Underrelaxation Our de nition of T in Alg. 2.1 is inspired by a linearization of the Fermi{Dirac distribution function, and our strategy is to underrelax as if the map T were indeed a linear function of one variable, for which the optimal underrelaxation parameter ! is exactly known in terms of the perceived slope alone. On oscillations, we get a nonlocal estimate of this slope, so we encourage them by increasing ! when the convergence seems to be stalling. We start with a factor of ! = 0:5, i.e., a bisection search:

Algorithm 2.2 (Underrelaxed Nonlinear Gauss{Seidel) To stabilize T : Set ! = 0:5, choose 1, and set r0 = k1k for i = 1; : : : "nlPoiss = 0:7"nlPoiss, "Schrod = 0:7"Schrod, ri = kT (i) ? i k. if i > 1 then i?1 = kT (i) ? T (i?1)k. if i = 1 then [avoid losing a good initial guess] if ri < 2"nlGS then !0 = ! = 1=(pmax + 1) elseif ri > ri?1 then [damp the oscillation]

!ri?1 i?1 !0 = !r , ! := i?1 i?1 + !ri?1 elseif ri > "nlGS and ri < 10 then [encourage mild oscillation] if rri > rri?1 and i > 2 then [ease out of stall] !0 = ! := min(1:1!; 0:5) i?1 i?2 else [relax after an oscillation] ! = !0 i+1 = i + ! [T (i) ? i] until ri < "nlGS and ri?1 < "nlGS [the oscillations have died out]

22

It must be admitted that some of this strategy is based on trial and error on a small set of test cases. However, the essential features have clear heuristic motivation as noted in each case we treat. In contrast to earlier underrelaxation strategies [38, 34], we increase as well as decrease ! in response to the recent iteration history. This is important to avoid stalling the iteration unnecessarily with ever shrinking !. Our numerical results indicate that this underrelaxed xed point iteration with T does indeed damp the oscillations without killing them and converges reliably, especially for smaller electron densities. The early oscillations far from the xed point, which are not encouraged, are so extreme that it is pointless to solve Schrodinger's equation and the nonlinear Poisson equation to high accuracy, so our algorithm progressively shrinks the tolerances for these equations.

2.6.2 Acceleration by Newton's Method Close to the solution we reliably accelerate the speed of convergence by applying Newton's method to the xed point problem for T . To this end, we rewrite the xed point iteration

i+1 = T (i)

(2:12)

 ? T ( ) = 0:

(2:13)

as the nonlinear root nding problem

23

To this root nding problem we apply Newton's method which is locally quadratically convergent. Newton's method requires at every step the solution of the linear system [I ? T 0(i)]d = ?[i ? T (i)];

(2:14)

where T 0(i ) is the Jacobian matrix of the mapping T at the point i. This means that the typically dense linear system I ? T 0(i) has to be inverted. However, we can solve this linear system without ever generating the Jacobian by employing a nonlinear version of the iterative solution method GMRES [64] described next.

2.6.3 Nonlinear versions of the GMRES algorithm In an analysis and set of numerical experiments in [37], it was discussed why a Jacobian{free approach based on GMRES is highly suitable for accelerating the convergence of a nonlinear xed point mapping T de ned through successive solution of coupled elliptic problems. This implementation of Newton's method is stabilized through a \damping" strategy [23] so as to make it more reliable further away from the solution. In [37] it was shown that, for such a mapping T , the eigenvalue spectrum of I ? T 0(i ) clusters at 1 only. We can take advantage of this feature of the spectrum of I ? T 0(i ) in the following approach. Solution of Newton's Eqs. (2.14) is equivalent to minimization over d of

k[I ? T ](i) + [I ? T 0(i )]d k2:

24

(2:15)

where k  k2 is the Euclidean norm in RN . Suppose that i is the current approximation to the solution   of  ? T ( ) = 0 and that we wish to nd a new approximation of the form i+1 = i + d . In the nonlinear version of GMRES [84], to be called henceforth NLGMR, the vector d is written in the form

d =

m X j =1

j vj ;

(2:16)

where the vj0 s are m orthonormal vectors that form a basis of the Krylov subspace

Km = span fv1; [I ? T 0(i)]v1; :::; [I ? T 0(i)]m?1v1g. These vectors are easily determined by an Arnoldi process, provided the operation v 7! w = [I ? T 0(i)]v is available. The coecients j are unknowns to be determined. Minimization of (2.15) can be achieved by applying the GMRES algorithm [64] to the linear system [I ? T 0(i)]d = ?[i ? T (i)];

(2:17)

starting with the initial solution d (0) = 0. Notice that solving this equation exactly will yield the Newton direction [I ? T 0(i )]?1[i ? T (i)], so this procedure is nothing but an inexact Newton method. One of the most important aspects of the procedure based on this approach is that the Jacobian matrix T 0(i ) is never needed explicitly. The only operations on the Jacobian matrix T 0(i) that are required for GMRES are matrix{vector multiplications

25

w = T 0(i )v, which can be approximated by T 0(i )v  T (i + hvh) ? T (i) ;

(2:18)

where i is the point where the Jacobian is being evaluated and h is some carefully chosen small scalar. This approximation to the product of a vector by a Jacobian was successfully used in the context of ordinary di erential equations [13, 22, 9, 8] and is quite common in nonlinear equation solution methods and optimization methods (see for example [23, 15, 84, 19]). Because the mapping T is now ultimately being used to compute a search direction, we set the tolerances on the coupled equations to machine precision, and rely on the solvers to compute the best possible answer if this precision is unattainable. As discussed in [37], the accuracy to which Newton's Eqs. (2.14) are solved is adjusted adaptively. For every Newton step, the maximum number m of steps in the GMRES algorithm is varied according to the level of nonlinearity as determined from the residual in Newton's equations. Let 0 be an approximate solution to  ? T ( ) = 0, and let m be the improved approximate solution after m steps of GMRES. If the nonlinearity is only mild the nonlinear residual resnl = m ? T (m) should be approximated well by the GMRES approximation reslin = 0 ? T (0) + [I ? T 0(0)](m ? 0). The Euclidean norms of the two expressions should therefore be approximately equal. If they are not, the linearized model employed in the NLGMR algorithm is clearly not very good and accurate solution of Newton's method is probably wasteful. In this case, m should be 26

kept small. In case of good agreement, m should be increased. This reasoning is re ected in our implementation of the NLGMR algorithm which employs an adaptively expanding subspace over which Newton's equations are solved. We start out with the number of GMRES iterations m = 2. We double the size of the subspace up to a maximum of 25 whenever we nd that the residual for the linearized equations is within a factor of 1.5 of the nonlinear residual. The size of the subspace is kept unchanged if the nonlinear residual was in between 1.5 and 5 times the linear residual. Otherwise the size of the subspace is halved. The algorithm is summarized below.

Algorithm 2.3 (Nonlinear GMRES) To accelerate convergence near the xed point: Form an initial guess 0 for the electron density.

repeat

Employ m steps of GMRES to solve the system [I ? T 0(i )]d = ?[i ? T (i)] for Newton's direction dn without generating the Jacobian I ? T 0(i ). if 2=3  kresnl k=kreslin k  3=2 then m := m  2 else if 3=2  kresnlk=kreslin k  5 then m := m else m := m=2. Perform an inexact linesearch for the stepsize  along the direction dn to guarantee a decrease of the nonlinear residual (i + d ) ? T (i + d ). until convergence. The ultimate rate of convergence per outer iteration will be quadratic for this algorithm because it is a form of Newton's method. 27

2.7 Numerical Results The geometry of the test structure is depicted in Fig. 2.2. With reference to Fig. 2.2, we assumed in our model calculations a Schottky barrier of 0:99 eV, and a uniform concentration of Al equal to 30% throughout the AlGaAs layer. This corresponds to a conduction band discontinuity of approximately 0:25 eV. The top AlGaAs n{layer (I) is uniformly doped with ND = 6  1017cm?3, while the AlGaAs spacer layer (II) and the GaAs substrate (III) are considered to be slightly p{doped with NA = 1014cm?3. The thickness of top and spacer layers are assumed to be 30 nm and 10 nm, respectively. The domain extends 300 nm into the air above the surface, and metal contacts extend 10 nm above the surface. The temperature was set at  = 4:2K. The nonlinear Poisson equation is discretized on a nonuniform rectangular grid of size 53  78, and the Schrodinger equation is discretized on a 53  35 subgrid covering the top two layers and the quantum wells. The x{grid spacing grows exponentially as the x{component of the electric eld Ex and nQW both decay from the centers of the quantum wells to zero at the sides. The y{grid spacing is uniformly ne in each of the top two layers, where Ey can be large, and grows exponentially as Ey decays into the air above the contacts. The y{grid spacing grows exponentially as Ey00, Ex, and nQW all decay from the heterojunction into the bulk, the potential quickly assuming a parabolic form. Thus, a uniformly coarse y{grid spacing through the remainder of the bulk makes the discretization second{order and incurs negligible error. All material interfaces fall on grid lines. The computations were performed on an HP 9000/730. 28

x

y

10nm

100nm

60nm

70nm

60nm

100nm

30nm 10nm V2 ~6um

V1

V2

I II III

Figure 2.2 Schematic diagram of the simulated heterojunction structure.

29

res

10000 1000 100 10 1 0.1 0.01 0.001 0.0001 1e-05 1e-06 1e-07

br br

br rb b b rr bb rr bb rr bb b r r rb b b b r b bb rrrr r b b b rr rr rb bb b rr b bb b r rrrr r r b br r b b bb r bb bb b rrr rr bbbb b r rr r rr bbbb b b b rr bbb rr rr bb r rr bb b r rr r r bbb bb rb r bb rb b bb rrr rr rr rr b b rr

NLGS

NLGMR

0

100

200

300 400 500 computation time (s)

600

700

800

Figure 2.3 Convergence to self{consistency for applied potentials V1 = V2 = 0:2V . The residual norm k ? T ( )k2 is shown vs. CPU time (HP 9000/730), with (solid dots) and without (open dots) the exchange{correlation potential Vxc (nQW ). The step from one dot to the next represents one application of T as in Alg. 2.1. In Fig. 2.3 the convergence of the underrelaxed xed point iteration NLGS (Alg. 2.2) is seen to be satisfactory for this low quantum electron density, whether or not it is followed by the NLGMR accelerated iteration (Alg. 2.3), although NLGMR certainly converges faster. We plot the convergence of these algorithms both with and without the density dependent exchange{correlation potential Vxc (nQW ). Evidently, the nonlinear complication introduced does not injure the convergence of our algorithms. At the start, the nonlinear Poisson solver solves to medium residual tolerance, using the solution of the linear Poisson equation as initial guess, so the rst application of T is expensive. Subsequent nonlinear Poisson solves have reasonable initial guesses (because oscillations are not encouraged until they are sure to be mild), and the tolerance is only gradually reduced, so the time for each application of T diminishes accordingly. Upon 30

switching to the NLGMR acceleration, the nonlinear Poisson equation is solved to high precision, but by this point there is a very good initial guess for each solve and few steps are needed, so each application of T is reasonably inexpensive. When the NLGMR algorithm is solving Newton's equations, the residual  ? T ( ) is linearized and the linearized residual decreases monotonically. After the approximate solution of Newton's equations over a subspace of small dimension has been obtained, the algorithm computes the actual nonlinear residual which is expected to be larger depending on the nonlinearity of the problem. The computation of the actual nonlinear residuals appear in the convergence graph as upward surges which result in a \jagged" appearance at the right end of the gure. The potential energy V and ionized acceptor density NA? are given in Fig. 2.4. The classical electron and hole densities are negligible in this example, and all donors are completely ionized, ND+ = ND . In this example, we apply voltages of V1 = V2 = 0:2V to the electrodes. Between the electrodes, our idealized surface model [i.e., Eq. (2.8)] results in parabolic dips in the potential, which extend through the nearby heterojunction to form the quantum wells. We include a very thick GaAs substrate of 5:96 m to model the conduction band past the point of attening. The potential energy V , the quantum electron density nQW , and the rst two wavefunctions obtained are shown in Fig. 2.5 on a part of the y{grid including only small portions of the air and bulk. The corresponding occupancy levels are shown in Table I. The occupancy levels of the successive wavefunctions decrease rapidly to zero due to the rapid decay of the Fermi{ Dirac distribution function (Fig. 2.1). 31

Figure 2.4 Flattening of the conduction band potential energy V corresponding to a ? ?

rapid exponential decay in the ionized acceptor density NA . V and ln(NA + 10?8 ) are plotted over the entire domain for applied potentials V1 = V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included.

State i 1 2 3 4

Ei (eV ) 10?3 -3.48 -3.48 4.18 4.18

Ni (m?1) 106 57.98 57.98 .00 .00

Table I Occupancy levels of the rst four wavefunctions for applied potentials

V1 = V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included. 32

Figure 2.5 Closeup of the conduction band potential energy V , the quantum electron density nQW , and the rst two wavefunctions for applied potentials V1 = V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included. 33

res

10000 1000 100 10 1 0.1 0.01 0.001 0.0001 1e-05 1e-06 1e-07

br rb rb

rb

br rb br rb r bbbrbrrrr bbbb rrrrr bbb rrr bbb rrr bbb rrr r bb bb b b brrrrr b b brbr r r r bb b r r bbbb r rrrr rr bbbb r bbb rrrrr r bbb b bb bb b b b rrrrr r rr b b b b br r r r r bb r b r b bb rr b b r r b r r b bb rrr b bb

NLGS

NLGMR

0

200

400

600 800 1000 computation time (s)

1200

1400

Figure 2.6 Convergence to self{consistency for applied potentials V1 = 0:7V and V2 = 0:2V . The residual norm k ? T ( )k2 is shown vs. CPU time (HP 9000/730), with (solid dots) and without (open dots) the exchange{correlation potential Vxc (nQW ). The step from one dot to the next represents one application of T as in Alg. 2.1. We present no results at  = 77K because there are too many occupied wavefunctions. This is no problem for our solver, but by about the 15th eigenvalue, our Schrodinger grid is hopelessly inadequate to resolve the wavefunctions. Since we do not use a re nement of the Poisson grid for the Schrodinger equation, adding enough lines would make the nonlinear Poisson solver tremendously expensive. Our focus is low{temperature devices where there are fewer occupied wavefunctions to resolve, and the cost of the extra lines in the Poisson grid is moderate. To get some feeling for the e ect of additional nonlinearity on our outer iteration in T , we did an additional model calculation where we pulled down the gate potential energy, thus deepening the quantum well(s). In Fig. 2.6 we plot the convergence behavior. The most striking di erence in convergence is that NLGS (Alg. 2.2) alone is 34

no longer sucient to give rapid convergence for this higher quantum electron density. The additional occupied wavefunctions are clearly an added nonlinearity that our simple underrelaxation strategy is not designed to handle. Another di erence in convergence is that the rst application of T takes about twice as long. This is due to almost three times as many steps to solve the nonlinear Poisson Eq. (2.8) to medium precision. This, in turn, is explained by the fact that the parabolic dips in potential between contacts fall below the Fermi level at the surface, thus activating the ionized donor density ND+ there as additional nonlinearities in the Poisson equation (the other active nonlinearity is at the backgate). Globally, the solution was similar to that of Fig. 2.4, with almost imperceptible dips in ND+ in the top layer between contacts. Locally, signi cant di erences in potential energy V , quantum electron density nQW , and coupling occurred near the gate, as shown in Fig. 2.7. The occupied wavefunctions, all coupled between the wires, are shown in Figs. 2.7{2.9, and the corresponding occupancy levels are shown in Table II. The larger number of occupied wavefunctions in this calculation highlights the di erence between the Chebyshev{Arnoldi and RITZIT eigenvalue solvers. Although identical input parameters need not (and do not) lead to identical iteration histories, the convergence behavior is similar using the two solvers. Predictably (from Sec. 4.4.2), Chebyshev{ Arnoldi economizes on the number of matrix{vector multiples (from three to ve times fewer), while the larger Krylov projection subspace entails larger orthogonalization costs (up to three times as many dot products). We reiterate that these additional dotproducts cannot be done in parallel, while the extra matrix{vector multiplies in RITZIT can be 35

Figure 2.7 Closeup of the conduction band potential energy V , the quantum electron density nQW , and the rst two wavefunctions for applied potentials V1 = 0:7V and V2 = 0:2V , with the exchange{correlation potential Vxc (nQW ) included. 36

Figure 2.8 Closeup of wavefunctions 3{6 corresponding to the potential of Fig. 2.7. 37

Figure 2.9 Closeup of wavefunctions 7{10 corresponding to the potential of Fig. 2.7. 38

State i 1 2 3 4 5 6 7 8 9 10 11 12

Ei (eV ) 10?3 -15.79 -15.70 -13.32 -12.18 -10.28 -8.23 -6.01 -3.63 -1.02 1.58 4.21 7.14

Ni (m?1) 106 110.30 110.30 110.29 110.14 99.26 91.08 77.16 59.17 29.75 .21 .00 .00

Table II Occupancy levels of the rst 12 wavefunctions for applied potentials V1 = 0:7V and V2 = 0:2V , with the exchange{correlation potential Vxc(nQW ) included.

to some extent. Nevertheless, the Chebyshev{Arnoldi solver is to be preferred for several reasons. First, the cost of each matrix{vector multiply could increase tremendously on a non{rectangular grid where the banded form of the matrix and easy vectorizability are lost, while dotproducts always vectorize well. Second, we only need a few occupied wavefunctions for our self{consistent solutions, but other applications could require many more modes, where the di erence in cost would be much more apparent. Third, the "discounting of error quantities" in RITZIT [58, 59] introduces an unknown reduction in the accuracy of the solution, which can destroy convergence of the Newton accelerated outer iteration, as well as damage the xed{point iteration. While Chebyshev{Arnoldi has its own "giving up" mechanism, it is less prone to giving up too soon.

39

2.8 Conclusion We have designed and tested a new algorithm for the computation of electron states in quantum wires. It has been demonstrated here and elsewhere [56, 35, 45, 36] that the algorithm is robust over a practical range of device pro les and input parameters. We have shown that additional nonlinearities make the solution correspondingly more expensive and make Newton acceleration for the outer iteration comparatively more attractive. The Chebyshev{Arnoldi algorithm is faster and more accurate than RITZIT and thus is the preferred Schrodinger solver.

40

CHAPTER 3 FINITE-DIFFERENCE ANALYSIS OF REAL AND COMPLEX DIELECTRIC WAVEGUIDES 3.1 Introduction Dielectric channel waveguides are widely used for integrated circuit applications, from the microwave to the optical frequency range. For an accurate analysis of general waveguide structures of practical interest, numerical solutions are necessary. As pointed out in [3, 69], reliable simulation software should employ formulations which are free of spurious modes, for instance by solving directly the vector Helmholtz equations in terms of the transverse magnetic eld components Hx and Hy . Discretization of such vector equations leads to eigenvalue problems which are nonhermitian and two or more [7] times larger than those that arise from the corresponding scalar equation. The size is particularly problematic for standard dense eigenvalue solvers, which can only handle a relatively small number of mesh points, even on the fastest supercomputers available. In this chapter we solve eigenvalue problems for dielectric waveguides with an ecient implementation of the iterative Chebyshev{Arnoldi algorithm. In contrast to standard eigenvalue solvers, this approach greatly reduces memory requirements because it allows the matrix of the discretized problem to be stored in sparse form. The algorithm knows 41

about the eigenvalue problem only through a matrix{multiply routine, which may exploit the particular matrix structure arising from the discretization. In our discretization, rectangular meshes lead to an easily vectorizable matrix{multiply. Furthermore, through the use of a three{term recurrence for the Chebyshev preconditioning, the algorithm has very good data locality independent of the matrix structure. Most importantly, the computational complexity is minimized by computing only those eigenvalues and eigenvectors which correspond to propagating modes properly con ned to the waveguide. A region in the complex plane containing these modes is determined a priori. The modes are enhanced according to their degree of con nement by suitable adaptation of the Chebyshev acceleration. The poorly con ned modes and the rapidly attenuating modes are removed in an outer iteration. This dual strategy of preconditioning and selection rapidly isolates those few modes (typically less than 20) which are relevant from an engineering point of view. This allows us to employ extensive nonuniform meshes with many grid points on which the discretization error can be controlled. We also develop a novel discretization scheme for the vector Helmholtz problem. We obtain systems of equations for the two components which are equivalent in the sense that a rotation over 90 corresponds to a suitable permutation of indices.

42

3.2 Maxwell equations The macroscopic Maxwell equations (in SI units) are

r  E = ?@tB r  H = J + @t D rB = 0 rD =  where E and B are the macroscopic (averaged) electric and magnetic elds, D and H are the corresponding derived elds, and  and J are the macroscopic free charge and current densities [71]. Here and throughout the discussion we use the shorthand notation @ and @ = @ 2 . @t = @t tt @t2 For a linear isotropic dielectric of permeability  and real permittivity Re, the derived elds satisfy D = ReE and B = H. We assume that  and Re are not time-dependent. We restrict the problem to a single angular frequency ! 6= 0 by assuming a time dependence of ei!t, thus introducing complex-valued elds in exchange for eliminating time. We assume that any chemical processes a ecting the charge density  are too slow to have any component at frequency !, and the equations simplify to

r  E = ?i!H r  H = J + i!ReE 43

r  (H) = 0

(3.1)

r  (ReE) = 0

(3.2)

At this point, the last two equations follow from taking the divergence of the rst two and are extraneous (but useful). We now eliminate the current density J. In the simplest situation, we attribute any conductivity of the medium (e.g., from free carriers) to a complex permittivity. Using Ohm's law, J = E, where  is the conductivity (at frequency !), we take the imaginary part of the dielectric permittivity to be ?i ! . We set  = Re ? i ! , and the resulting equations,

r  E = ?i!H

(3.3)

r  H = i!E;

(3.4)

imply that H is determined by E and vice versa. In active devices there are other contributions to the imaginary part of the dielectric permittivity, such as a subtraction for optical absorption and an addition for optical gain. These must be formulated as contributions to J proportional to E, as above. Since  is essentially constant in non-ferromagnetic materials, Eq. (3.1) implies that

r  H = 0:

44

(3:5)

Using Eq. (3.5) and the vector identity r (r H) = r(r H) ?r2H, and substituting as needed from Eqs. (3.3) and (3.4), we have

?r2H = r  (r  H) = i!r  (E) = i!(r  E + r  E) " # r  H = i! r  i! ? i!H = r  (r  H) + !2H; which we write in the form

? r2H ? r(ln )  (r  H) ? !2H = 0:

(3:6)

Using determinant notation, we have

i j k ?r2H ? @x(ln ) @y (ln ) @z (ln ) @y Hz ? @z Hy @z Hx ? @xHz @xHy ? @y Hx

2 ? ! H = 0;

where i, j, and k are unit basis vectors along the x, y, and z axes, respectively. Thus the vector Helmholtz equation (3.6) may be written in terms of the individual components

45

of H as the system

?r2Hx ? @y (ln )(@xHy ? @y Hx) + @z (ln )(@z Hx ? @xHz ) ? !2Hx = 0 ?r2Hy ? @z (ln )(@y Hz ? @z Hy ) + @x(ln )(@xHy ? @y Hx) ? !2Hy = 0

(3.7)

?r2Hz ? @x(ln )(@z Hx ? @xHz ) + @y (ln )(@y Hz ? @z Hy ) ? !2Hz = 0: This 3D problem can be reduced to 2D for waveguides. We assume a spatial variation of ei z , where 6= 0 is constant and z is the displacement along the axis of the waveguide. Note that we have taken the sign of z to be positive, in contrast to standard textbook convention, because we do not imagine ourselves to be riding the crest of an electromagnetic wave, but instead take a xed position on the waveguide to analyze the optical gain or loss as a function of z. With this assumed variation, Eq. (3.5) becomes

@xHx + @y Hy + i Hz = 0;

(3:8)

and the Helmholtz system of equations (3.7) becomes

? @xxHx ? @yy Hx ? @y (ln )(@xHy ? @y Hx) ? !2Hx = ? 2Hx

(3:9)

? @xxHy ? @yy Hy + @x(ln )(@xHy ? @y Hx ) ? !2Hy = ? 2Hy

(3:10)

46

? @xxHz ? @yy Hz ? @x(ln )(i Hx ? @xHz ) + @y (ln )(@y Hz ? i Hy ) ? !2Hz = ? 2Hz : (3:11) Using Eq. (3.8) to eliminate Hz in Eq. (3.11), we get 1 [@ (@ H + @ H ) + @ (@ H + @ H )] yy x x y y i xx x x y y

# # " 1 1 ?@x(ln ) i Hx + i @x(@xHx + @y Hy ) + @y (ln ) ? i @y (@xHx + @y Hy ) ? i Hy "

2 2 + !i  (@xHx + @y Hy ) = i (@xHx + @y Hy );

which simpli es to

@x(@xxHx + @yy Hx ) + @y (@xxHy + @yy Hy )

h

i

h

?@x(ln ) @x(@xHx + @y Hy ) ? 2Hx ? @y (ln ) @y (@xHx + @y Hy ) ? 2Hy +!2(@xHx + @y Hy ) = 2(@xHx + @y Hy ): Substituting Eqs. (3.9) and (3.10) into this, we get

@x[ 2Hx ? @y (ln )(@xHy ? @y Hx) ? !2Hx] +@y [ 2Hy + @x(ln )(@xHy ? @y Hx ) ? !2Hy ]

?@x(ln )[@x(@xHx + @y Hy ) ? 2Hx] ? @y (ln )[@y (@xHx + @y Hy ) ? 2Hy ] +!2(@xHx + @y Hy ) = 2(@xHx + @y Hy ); 47

i

which evaluates to

?@xy (ln )(@xHy ? @y Hx ) ? @y (ln )(@xxHy ? @xy Hx) ? !2@x(ln )Hx +@yx (ln )(@xHy ? @y Hx ) + @x(ln )(@yxHy ? @yy Hx ) ? !2@y (ln )Hy

?@x(ln )[@x(@xHx + @y Hy ) ? 2Hx] ? @y (ln )[@y (@xHx + @y Hy ) ? 2Hy ] = 0; which simpli es to

i h @x(ln ) @yxHy ? @yy Hx ? !2Hx ? @x(@xHx + @y Hy ) + 2Hx h

?@y (ln ) @xxHy ? @xy Hx + !2Hy + @y (@xHx + @y Hy ) ? 2Hy

i

h i = @x(ln ) ?@yy Hx ? !2Hx ? @xxHx + 2Hx i

h

?@y (ln ) @xxHy + !2Hy + @yy Hy ? 2Hy = 0: Now we substitute Eqs. (3.9) and (3.10) one more time to arrive at the trivial identity

@x(ln )@y (ln )(@xHy ? @y Hx ) ? @y (ln )@x(ln )(@xHy ? @y Hx) = 0: Thus we have successfully eliminated Eq. (3.11) for Hz . Determining Hz from Hx and Hy in this way guarantees that spurious modes will not be present in the solution [3, 25]. This avoids the problem that some variational formulations have of generating spurious, nonphysical solutions [70]. 48

3.3 Interfaces between Di erent Media We take the same approach as in the introduction of [30]. At an interface between di erent media, the material is no longer isotropic. Furthermore, because the elds may be discontinuous at the interface, Stokes' theorem and the divergence theorem do not apply, so we must use the integral form of the Maxwell equations,

I I I I

@V @V

Z

E  dl = ? S @tB   dA @S

H  dl = @S

Z

S

[J + @tD]   dA

B  n dA = 0 D  n dA =

Z

V

 dv:

Here, n is the unit outward normal from the boundary surface @ V of an arbitrary volume

V , and  is the unit normal satisfying the right{hand rule on the boundary curve @S of an arbitrary surface S . We take S to be a long exible rectangular surface at the interface with boundary @S located so that its long edges closely parallel the interface on opposite sides (See Fig. 3.1). We let the width of S go to zero while the length is xed. We then divide by the length and take the limit as the length goes to zero. If one of the media is a perfect conductor and the other is a perfect insulator, then, in the limit, the enclosed current density J becomes a surface current density @ J. Similarly, we take V to be a exible at box at the interface with boundary @V located so that its at surfaces closely parallel the interface 49

Region 1

S



Region 2

V

n

Figure 3.1 Interface Between Di erent Media. Vectors n and  are depicted for the

limiting argument.

on opposite sides (See Fig. 3.1). We let the height of V go to zero while the base area is xed. We then divide by the area and take the limit as the area goes to zero. If one of the media is a perfect conductor and the other is a perfect insulator, then, in the limit, the enclosed charge density  becomes a surface charge density @. The result is (E2 ? E1)  (  n) = 0 (H2 ? H1)  (  n) = @ J   (B2 ? B1)  n = 0 (D2 ? D1)  n = @: 50

Assuming the interface is between two isotropic media, both of which are poor conductors at the frequencies in question, the result may be stated in terms of E, H, and the (real) material parameters near the boundary in each region as (E2 ? E1)  (  n) = 0

(3.12)

(H2 ? H1)  (  n) = 0 ( 1 H2 ? 1 H1)  n = 0 2 1 (2E2 ? 1E1)  n = 0

(3.13) (3.14) (3.15)

Equations (3.12) and (3.13) say that the components of E and H tangential to the interface are continuous across the interface. Equations (3.14) and (3.15) say that the components of E and H normal to the interface may be discontinuous across the interface but are related by a simple proportionality.

3.4 Discretization Approach Since  is essentially constant in non-ferromagnetic materials, Eqs. (3.13) and (3.14) imply that the magnetic eld is always continuous. The electric eld, on the other hand, is discontinuous across dielectric interfaces, which makes evaluating it there problematical. Thus, we eliminate the electric eld as in section 3.2. In order to put the reduced 2D vector Helmholtz formulation (3.9) and (3.10) for continuously varying  into divergence form, we take  to be piecewise constant. This would seem to eliminate all hope of a

51

discretization consistent with the partial di erential equations (PDEs) (3.9) and (3.10) for continuously varying , but that is exactly what we will obtain. Over each homogeneous region in the xy plane (i.e., each region where  and  are constant), the problem reduces to identical uncoupled PDEs in any component H of the magnetic eld, which may be written as

@xxH + @yy H + (!2 ? 2)H = 0: We integrate over a homogeneous region to get

Z

Z @xxH + @yy H dA + (!2 ? 2) H dA = 0:

By the divergence theorem in two dimensions,

I @

(@xH; @y H )  n

ds + (!2 ? 2)

Z

H dA = 0;

(3:16)

where @ is the boundary of and n is the unit outward normal of @ . The rst term represents the ux through the boundary of the homogeneous region . We discretize the two transverse magnetic eld components only, the axial component being determined by Eq. (3.8). We lay down a (nonuniform) rectangular grid and assume that the material interfaces occur only on grid lines. Since the dielectric permittivity is not assumed to be continuous on the four quadrants around a grid point, we use a

52

modi ed box{integration method, applying Eq. (3.16) to each transverse component in each box i around each grid point, as shown in Fig. 3.2. N

n

4

1

P

W

2

E

3

s

w

S

e

Figure 3.2 Modi ed Box{Integration Scheme. The cardinal points North (N ), West

(W ), South (S ), and East (E ) are used as references around the center point (P ). The mesh widths connecting P to the neighboring points are indicated by n, w, s, and e. The boxes i are delimited by the mesh lines n, w, s, and e and by their perpendicular bisectors. Over each homogeneous region i, the transverse magnetic eld components are uncoupled and the axial magnetic eld is determined by them. We now use the interface matching conditions of section 3.3 to couple them. The axial components of both electric and magnetic elds are tangential to interfaces and therefore continuous across them, by 53

Eqs. (3.12) and (3.13), so they must be explicitly matched. We use @xHy ? @y Hx = i!Ez and @xHx + @y Hy = ?i Hz from Eqs. (3.4) and (3.8), respectively, to restate axial eld continuity in terms of the transverse magnetic eld components as (@xHy ? @y Hx)i=i = (@xHy ? @y Hx )j =j

(3.17)

(@xHx + @y Hy )i = (@xHx + @y Hy )j .

(3.18)

Since the interfaces occur on a rectangular grid, only the term of @xHx + @y Hy that represents a cross-interface derivative can vary across the interface. The along-the-interface derivative will already be matched due to the continuity of H. The box{integration equations (3.16) for each i and the interface matching equations (3.17) and (3.18) for Hx at P are assembled into the matrix equation

2 66 1 ?1 0 0 0 0 66 66 0 0 ?1 ?1 0 0 66 66 66 0 0 0 0 ?1 1 66 60 0 0 0 00 0 = 6666 66 66 0 66 66 66 66 64 0

32 Z 0 0 77 66 @y Hx dx 77 66 Z 1N 0 0 777 666 @xHx dy 77 66 Z1W 0 0 777 666 @xHx dy 77 66 Z2W 1 1 77 66 @y Hx dx 77 66 Z2S 77 66 @ H dx 77 66 3S y x 77 66 Z 77 66 @xHx dy 77 66 Z3E 77 66 @xHx dy 77 66 4E 54 Z @y Hx dx 4N

54

3 3 2 0 77 77 66 77 77 66 77 77 66 0 77 77 66 77 77 66 0 77 77 66 77 77 66 77 77 66 0 7+ 77 + 66  77 66 1 ? 1  Z @ H dx 777 77 66 1 2 1S x y 77 77 77 66 1 1  Z 77 66 4 ? 3 @ x Hy dx 7 77 3N 77 66 77 77 66 77 0 77 66 75 5 4 0

2 3 2 32 Z 3 Z 2 2 66 (! 1 ? ) 1Hx da 77 66 0 0 0 1 ?1 0 0 0 77 66 @xHx dy 77 77 66 77 66 Z 2E 66 7 Z 66 (!22 ? 2) Hx da 77 66 1 0 0 0 0 1 0 0 77 66 @xHx dy 777 77 66 77 66 Z3W 66 77 Z2 7 7 66 2 6 6 7 66 (! 3 ? 2) 3Hx da 777 666 0 ?1 0 0 0 0 1 0 777 666 4W @xHx dy 777 7 6 76 Z 66 7 Z 66 (!24 ? 2) Hx da 777 666 0 0 ?1 0 0 0 0 ?1 777 666 @xHx dy 777 4 66 77 + 66 77 66 Z1E 77 66 7 7 6 6 77 ?1 1 0 0 7 6 77 66 0 @ H dx y x 66 7 6 77 1 2 77 66 77 66 Z 1S 0 66 77 7 7 6 6 1 ? 1 66 7 7 6 6 0 0 0 3 4 7 6 @y Hx dx 77 77 66 66 77 66 Z2N 77 77 66 66 7 6 77 66 1 ?1 0 0 77 66 @y Hx dx 777 0 66 75 64 75 64 Z3N 75 0 4 @y Hx dx 0 0 0 ?1 1 4S

(3:19)

with 4 terms involving exterior uxes, coupling terms, box integrals, and interior uxes. Each ux term represents the portion of the total ux through a speci c part of a boundZ @y Hx dx represents the portion of the ux through the northern boundary, e.g. 1N

ary of the northeast box 1. If the 8  8 matrix in the last term is singular, i.e., if

?1 1?3 1 ? ?2 1?4 1 = 0, then all interior uxes are eliminated by taking an appropriate linear combination of the rows in Eq. (3.19). When the interior uxes cannot all be eliminated, they are approximated by one{sided di erences. The exterior uxes of the union of the i are approximated by centered di erences as in the standard box{integration scheme. The choice of coecients

s s s ! s !! s s s s 2 ; 1 ; 4 ; 3 ; ?p  ; ?p  ; ? 1 1 + 4 ; ? 1 2 + 3 1 2 3 4 1 2 3 4 2 2 3 2 1 4 (3:20) for the rows of Eq. (3.19) combines elimination of all interior uxes whenever possible with invariance under a rotation of 90 of the discretized equations for Hx and Hy , which 55

is necessary because there is nothing inherent in the model to distinguish the transverse directions. The resulting discretization for Hx at point P is

s s ! s !    4 HxS ? HxP 2 3 HxN ? HxP 1 + w + e 0 = w  +e  2n  3 2s s1 s4 s s ! 2 + n 2 + n 3 + s 1 + s 4 HxW4?w HxP s 1 s 4 s 2 s 3! + n 2 + n 3 + s 1 + s 4 HxE ? HxP 1 4 2 3 4e 2 + !4 (n + s) (wp12 + ep34) HxP s s s ! s !    3 HyE ? HyP 2 1 HyW ? HyP 4 +  ?  + ? 2 3  2 1 s 2 s s s !4 2 ? 4 wn 2 + ws 1 + en 3 + es 4 HxP . 1 2 4 3 s

(3.21)

The equations, their coecients, and the resulting discretized equation for Hy are obtained from those for Hx by rotating the labels in (3.19), (3.20), and (3.21) counterclockwise as follows:

x 7! y 7! ?x Hx 7! Hy 7! ?Hx n 7! w 7! s 7! e 7! n N 7! W 7! S 7! E 7! N 1 7! 2 7! 3 7! 4 7! 1

56

Although the derivation assumes piecewise constant , symbolic manipulations using Maple [11] show that this discretization is consistent to rst order with the Eqs. (3.9) and (3.10) for smoothly varying . Into the discretization scheme we substitute truncated Taylor series centered at P for the values of Hx and Hy at the points N , S , E , and W , and for the values i, taken to be (x; y) at the outside corner of the box i (Fig. 3.2). We use the PDEs (3.9) and (3.10) at P to simplify the resulting huge expressions and expand them in truncated Taylor series in n, w, s, and e to see the lowest order terms. The Maple session is displayed in Appendix A. Taking  to be complex, our discretization yields a complex nonhermitian generalized eigenvalue problem, Ku = ? 2Mu. Here u is formed by concatenating Hx and Hy in the standard grid ordering. It can be seen from Eq. (3.21) that K has two diagonal blocks of ve bands each for Hx and Hy , and two o {diagonal blocks of three bands each for the coupling terms, while M is a positive diagonal matrix. This generalized eigenvalue problem is recast as a standard eigenvalue problem, Av = ? 2v, where

A = M ?1=2KM ?1=2 and v = M 1=2u.

3.5 Solution of the Eigenvalue Problem As in [33], we identify a priori a region in the complex plane where eigenvalues of any acceptable modes must lie. First and foremost, we require that the modes decay naturally to zero at the boundary, i.e., they must not \feel" the Dirichlet boundary. Modes that are not naturally con ned are non{physical because there is nothing to clamp the elds to zero in the cladding. In complex problems, we also require that the decay in the axial 57

direction not be so large that the mode is extinguished over the length of the device. If the problem parameters, the dielectric pro le, or the discretization are not carefully chosen, there may not be any acceptable modes. In this case we try to nd the best con ned mode. To characterize propagation of a mode in terms of the eigenvalue ? 2, we write

? 2 =  + i = ?( + i=L)2, so that the axial dependence ei z has real part e?z=L , indicating a characteristic propagation length L. To extract L from the eigenvalue ? 2, note that  = 1=L2 ? 2 and  = ?2=L. Eliminating , we have 4=L4 ? 4=L2 ?  2 = 0, and solving for the positive root, we see that the propagating modes are those satisfying

p

1 =  + 2 +  2 < 1 ; L2 2 L20

(3:22)

where L0 is the minimum acceptable propagation length. The level curves of L are left{opening parabolas with focus at the origin and focal length L?2 . The level curve

L = L0 = 10m is shown in Fig. 3.3. To characterize the con ned modes, we focus our attention on the cladding, where we may use the scalar Helmholtz equation for Hz , 1 @ (@ H ) + 1 @ H + (!2 ? 2)H = 0: z    z 2  z

(3:23)

The switch to the scalar equation in polar coordinates is to clarify the discussion|this is a quick study rather than a rigorous proof. In this spirit, we assume a solution in the cladding with radial dependence e?(1= +i), i.e., a characteristic transverse con ne58

L = L0 = 10m  = 0 = 2m reference ellipse Modes, Imf2g = 0 Modes, Imf2g = :10

Imf? 2g

r b

?!21 ? 0?2 Ref? 2g

U r

bb

Figure 3.3 Preconditioning and Selection of Complex Modes. The a priori con nement

(dotted) and propagation (solid) cuto parabolas and the reference ellipse (bold) are shown with the calculated con ned, propagating modes of a channel waveguide as in Fig. 3.10, with ! = 1015sec?1, corresponding to a normalized frequency V = (2h=)(2 ? 1)1=2 = 1:1. Each visible plotted point is actually two closely spaced eigenvalues. The axes here are scaled: In reality, the ellipse is extremely eccentric.



1



6

?!21 ? 0?2

Figure 3.4 The Chebyshev Polynomial pn () adapted to the ellipse of Fig. 3.3 for n = 16, plotted vs. real .

59

ment radius  . Then Eq. (3.23) becomes (1= + i)2Hz ? (1= + i)Hz = + @ Hz =2 + (!2 ? 2) Hz = 0. Since we are in the cladding, we may let  ! 1, where Eq. (3.23) becomes (1= + i)2 + !2 ? 2 = 0, or, using ? 2 =  + i to break this into real and imaginary parts, 2= +  = 0 and (1= )2 ? 2 + !2 +  = 0. Eliminating , we have 4= 4 +4(!2 + )= 2 ?  2 = 0, and solving for the positive root, we see that the con ned modes are those satisfying

q 1 = ?!2clad ?  + (!2clad + )2 +  2 > 1 ; 2 2 02

(3:24)

where 0 is the maximum acceptable transverse con nement radius and clad is the maximum permittivity attained on a signi cant section of cladding. The level curves of  are right{opening parabolas with focus at ?!2clad and focal length  ?2. The curve

 = 0 = 2m is shown in Fig. 3.3. We use an iterative Chebyshev{preconditioned Krylov subspace method to converge speci cally to these con ned modes. We precondition with a Chebyshev polynomial pn () adapted to a reference ellipse (See Appendix B) with the same focus ?!2clad and focal length 0?2 as the con nement cuto parabola, as shown in Fig. 3.3. The right focus is chosen to be an upper bound for the spectrum, as determined by Gerschgorin disks [24]. A typical example of pn () is plotted for real  in Fig. 3.4. The Chebyshev{Arnoldi algorithm generates an orthonormal basis Vm for the Krylov subspace Km [pn (A); v0] = spanfv0; [pn(A)]v0; [pn (A)]2v0; : : :; [pn (A)]m?1v0g. The Chebyshev polynomial preconditioning greatly magni es the separation of the eigenvalues sat60

isfying Ineq. (3.24) relative to the rest of the spectrum, while leaving the eigenvectors unchanged. As we will see in section 4.4.2, this accelerates convergence towards the con ned modes. After nding the orthonormal basis Vm , we take a Rayleigh quotient

Cm = Vm AVm of the operator A over the Krylov subspace. The spectral decomposition Cm = YmmYm?1 is found using standard EISPACK routines. The approximate eigenvalues and eigenvectors of A are then m and Vm Ym. The best con ned of these (i.e, those with smallest  ) satisfying both the propagation and con nement conditions Ineq. (3.22) and (3.24) are combined into a new starting vector v0, and the whole process is repeated with stronger preconditioning, i.e., polynomials of higher degree n, until the residuals are within a prescribed tolerance. Although small residuals alone are not enough to guarantee small errors in any eigenvalue problem, we have found that the outer iteration e ectively removes unwanted modes by the time the residuals are small. Problems where double eigenvalues are expected because of symmetry must use two independent starting vectors, v0 and w0, and a double{vector Arnoldi method so that the projection subspace is Km=2[pn(A); v0]  Km=2[pn(A); w0]. To obtain satisfactory convergence behavior in this case, one should also increase m. Our algorithm is more e ective for nding several eigenvectors than the Arnoldi{ Tchebychev method [62, 27]. This earlier approach preconditions the starting vector instead of the matrix operator. In section 4.3 we explain why this makes an ine ective outer iteration. Preconditioning the operator allows dicult problems to be solved with a Krylov subspace of xed size m. Limiting the size of the Krylov subspace also limits the computational complexity of the modi ed Gram{Schmidt orthogonalization in the 61

Arnoldi algorithm. The storage requirement (in double precision real numbers) of the Chebyshev{Arnoldi algorithm is approximately (2m + ibl + 3)N + 2m2 + 3m, where N is the total number of unknowns and ibl is the maximum expected eigenvalue multiplicity. For our discretization, N = 2(nx ? 2)(ny ? 2), where nx and ny are the number of grid lines in the x and y directions, and the banded matrix A requires storage of 8N . Also, extra storage (about 13:5N ) is used once to simplify matrix generation, and then may be reclaimed.

3.6 Numerical Results We present numerical results for a number of representative dielectric waveguides. The examples have been selected to facilitate comparison with previously published numerical approaches, as detailed below. For all computations we use a uniform grid in the core region. The mesh spacing grows exponentially as the lines extend far out into the cladding. Such grids both resolve rapid oscillations of higher modes in the core region and extend suciently far to allow the modes to decay exponentially into the cladding. With two magnetic eld components at each grid point (and up to about a dozen complex unknowns per node in nite element discretizations of some similar problems [79]), realistic grids lead to very large scale eigenvalue problems. Our rst example is a square waveguide. A double{vector Chebyshev{Arnoldi method with a Krylov subspace of dimension m = 24 was used. The dispersion plots in Fig. 3.5 generally agree with earlier results [3, 46, 70]. For each mode, at low values of the normalized propagation constant B , we present results only for frequencies where there 62

1.0

y

x ; H11 H11

H21

XX X

x

XX XX

H2x;1+2 H12

0.8

XX XX

XX

X XX X

XX X

H22 ; H22 x

X X XX X

XX XX X XX XX X XX XX XX X X XX XX X XX X X X XX X X X X X X X XX X X XX XX X X XX X X X XX X

X X

X XX XX XX X X XX y X X X XX X XX X X

H1 2+2 x ;

X XX X

X X XX

x

XX X XX X

0.6

B 0.4

y

x H31 ; H13 y

x H13 ; H31 x H32 x H23

H2x;3+2 H3x;2+2

0 1

0.2

a

a

0.0

0

5

10

15

20

25

30

35

V

Figure2 3.5 Square waveguide dispersion. Normalized propagation constant B = ( =k0) ? 0 vs. normalized frequency V = k a( ?  )1=2, with k the free{space 0 1 0 0 1 ? 0 wavenumber. The computational domain is 5m  5m, with a = 1m, 1 = 13:10. The discretization is on a 65  65 nonuniform rectangular grid. Solid lines indicate single

modes and dotted lines double modes.

63

is sucient decay within the discretized domain. Insisting on sucient decay, we do not nd the in ection points in the propagation curves for B near 0 that appear in [3]. It is likely that the in ection points in these propagation curves are due to the use of meshes with too few grid lines and over a spatial domain too small to allow natural decay of modes near cuto . The modes of the square waveguide re ect its invariance under a rotation of 90 . Double eigenvalues (dotted lines in Fig. 3.5) correspond to degenerate modes which are rotations of each other. Single eigenvalues (solid lines in Fig. 3.5) correspond to nondegenerate modes which are rotations of themselves. For example, the third mode can either be labeled H21x for its Hx component (Fig. 3.6) or H12y for its Hy component (Fig. 3.7). We could not label all modes unambiguously employing simple double integer indices. For example, because the mode in Figs. 3.6 and 3.7 should obviously be labeled H21x , and the mode in Fig. 3.9 should be labeled H23x , the novel label H2x;1+2 was de ned for the intermediate mode with Hx component shown in Fig. 3.8.

In this labeling

scheme, the additional indices correspond to \kinks" at interfaces, which are allowed by the interface matching conditions of Eqs. (3.17) and (3.18). The size of these kinks decreases with increasing frequency !, until a mode with kinks becomes indistinguishable from the corresponding mode without them. In other words, the matrix A(!) becomes more nearly defective as ! ! 1. None of our other examples have this problem. It is the noncommutative symmetry of the square waveguide that allows the occurrence of self{rotational pairs of nearly defective eigenvectors [6]. Initial guesses, computed on

64

:   X XX z X

y `` `` ` 1  `  

x

   y X  X XX  

a

  XX  )  X XX

X

 X  X z X    

a

Figure 3.6 H21x mode (Hx component only) for a square waveguide as in Fig. 3.5. The normalized frequency is V = 11:0. The plot is shown on a 55  55 centered subgrid.

65

:   X XX z X

y `` `` ` 1  `  

x

   y X  X XX  

a

  XX  )  X XX

X

 X  X z X    

a

Figure 3.7 H21x mode (Hy component only) for a square waveguide as in Fig. 3.5. The normalized frequency is V = 11:0. The plot is shown on a 55  55 centered subgrid.

66

:   X XX z X

y x

   y X  X XX  

a

a

`` `` ` 1  `  

  XX  )  X XX

X

 X  X z X    

Figure 3.8 H2x;1+2 mode (Hx component only) for a square waveguide as in Fig. 3.5. The normalized frequency is V = 11:0. The plot is shown on a 55  55 centered subgrid.

67

:   X XX z X

y `` `` ` 1  `  

x

   y X  X XX  

a

X

 XX  z X    

  X  ) X XX

a

Figure 3.9 H23x mode (Hx component only) for a square waveguide as in Fig. 3.5. The normalized frequency is V = 11:0. The plot is shown on a 57  57 centered subgrid.

68

a slightly (1%) non{square domain, were used to obtain satisfactory, canonical{looking eigenfunction plots for the multiple eigenvalues. Our second example is a channel waveguide. The dispersion plot is given in Fig. 3.10. A single{vector Chebyshev{Arnoldi method with a Krylov subspace of dimension m = 20 was used both for this example and the next one. There is good qualitative agreement with earlier results [3, 46, 55]. Demonstrating the need for large computational grids, we show in Fig. 3.11 the Hx component of the H32x mode for the channel waveguide of Fig. 3.10 at a normalized frequency of V = 2:10, at which this mode has a normalized propagation constant of B = :005. For clarity, the plot is presented on a 55  33 subgrid in which it decays to 1% of its maximum over the total 69  38 domain. Such a large domain is necessary to properly capture the decay in the cladding region of any mode near cuto . The strip{slab waveguide is our third example. The dispersion plot of the H11y mode is shown in Fig. 3.12 for four di erent dielectric con gurations as given in [3]. At low values of the normalized frequency V , the eigenfunctions expand considerably into the region of the cladding where the dielectric permittivity is highest. This e ect is especially pronounced in case 4), where 4 = 2:5750. All of these computations were performed on a Sun 4/490 workstation with an intermediate version of the code speci cally designed for real matrices [21]. At most frequencies, the execution time to nd sixteen modes of the square waveguide was approximately 20 minutes. For the channel waveguide, the execution times were considerably smaller, since a single{vector method was used. Here, the modes occurred in nearly degenerate 69

1.0

y

H11 x

XX X XX X

y

X XX XX X

H11 H21

0.8

x

XX X XX X

y

X XX XX X

H21 H31 H31 x

y

H12

0.6

x H12 y

H22

B

x H22 y

0.4

H41 x H41

XX X X

XX

XX

XX XX

XX

XX X XX

XX X X XX X

XX X

X

X X

X XX X

XX XX X X XX X X XX X XX XX X X X XX X X X XX X X X X X XX X XX X X XX XX X X X XX X X XX XX X X XX XX X XX X XX X X X X X X XX XX X XX X XX X X X

X X X X

X X

X XX X

XX

H32

X

x H32

X

H51

X

x H51

XX X X

XX

y

X

y

0 2

0.2

h

1

0.0

2h

0

1

2

3

4

5

V

Figure2 3.10 Channel waveguide dispersion. Normalized propagation constant B = ( =k0) ? 1 vs. normalized frequency V = (2h=)( ?  )1=2, with k the free{space 2 1 0 2 ? 1 wavenumber. The computational domain is 60m  32m, with h = 3m, 1 = 2:130 and 2 = 2:250. The discretization is on a 69  38 nonuniform rectangular grid.

70

CCW

y

h OCC

   

-

x

2h

-



Figure 3.11 H32x mode (Hx component only) for a channel waveguide as in Fig. 3.10. The normalized frequency is V = 2:10. The plot is shown on a 55  33 subgrid.

71

1.615 1.610 1) 4 = 2:5000

1.605

B

2) 4 = 2:5250 3) 4 = 2:5500

1.600

4) 4 = 2:5750

4)

1.595

0

3)

1.590

1

2)

1.585

4

0:6a

2

1)

4

a

3

1.580

a

0

5

10

15

20

25

30

35

40

V Figure 3.12 Strip{slab waveguide dispersion. Normalized propagation constant B = =k0 vs. normalized frequency V = ak0, with k0 the free{space wavenumber. The computational domain is 30m  30m, with a = 3m, 1 = 2:550, 2 = 2:620 , and 3 = 2:500. Dispersion plots for the fundamental mode H11y are shown, using four di erent values for 4. The discretization is on a 69  67 nonuniform rectangular grid.

72

pairs where the Hx component of one mode was almost identical to the Hy component of the other. The eigenvalues were not quite equal, though, and were easily resolved by a single{vector Chebyshev{Arnoldi method. The solution for sixteen modes at V = 2:76, for example, was obtained in only 5 minutes. In all examples, we obtained residuals of less than 0:5  10?7 , and for the lower modes we typically obtained residuals of 10?12 . The starting vector was random except for the strip{slab waveguide, where the eigenvector from the rst frequency was used as an initial guess for all other frequencies. Typically, the user would use the solution from one frequency as an initial guess for the next frequency in a continuation method.

3.7 Conclusion Realistic modeling of microelectronic lasers requires the solution of complex nonhermitian eigenvalue problems that are very large and sparse. The dense solvers used in [3, 69, 25, 79] nd the complete spectral decomposition at the expense of greatly increased computational complexity and memory requirements, to the extent of prohibiting the solution on realistic grids. Since only the eigenvectors for the con ned, propagating modes are desired, projection methods are very suitable for these applications. The Chebyshev{preconditioned Arnoldi algorithm allowed us to simulate a wide range of problems employing variable meshes with large numbers of unknowns. The code we developed is robust and easy to use, with automatic selection of most parameters.

73

CHAPTER 4 ITERATIVE CHEBYSHEV{ARNOLDI METHODS FOR EIGENVALUE PROBLEMS 4.1 Introduction Large, sparse nonsymmetric and nonhermitian eigenvalue problems present the algorithm developer with no easy options. For these problems, unlike Hermitian ones, orthogonal projection does not approximate the preferred projection direction along the undesired invariant subspace. Instead, the unknown angle between the undesired and desired invariant subspaces may approach zero for extremely poorly conditioned problems. Furthermore, the only known optimality property available among orthogonal projection methods is that the characteristic polynomial of the approximate problem VmH AVm , where

Vm is an orthogonal basis of the Krylov subspace Km (A; v) = spanfv; Av; : : :; Amvg, minimizes the norm kp(A)vk2 over all monic polynomials p of degree m [61, 63]. This is the motivation for Krylov subspace methods. However, for large sparse problems, building orthogonal Krylov subspaces using A itself exhausts memory and CPU resources long before convergence. Thus one must apply preconditioning to A or use an iterative approach. We will explain how to use both techniques in a complementary way.

74

In Chapter 3 we outlined how to apply these techniques to dielectric waveguides. We use this example here as well, but we cast the discussion in general terms.

4.2 Banded Arnoldi Process The rationale for an iterative approach is that many eigenvectors with well{separated eigenvalues can be represented by one vector containing their sum, the invariant subspace being reconstituted by the Arnoldi process which generates a Krylov subspace from this one vector. We extend this capability to eigenvectors of multiple nondefective eigenvalues by using a banded Arnoldi process. The idea is to generalize the Arnoldi process to span direct sums of Krylov spaces

Libl K (A; v ), where ibl is the maximum multiplicity of the desired eigenvalues, which i i=1 m is determined by geometric symmetry considerations [6]. Any set of eigenvectors can be represented by ibl sum vectors if those with equal eigenvalues are accumulated into separate sum vectors. As in the Arnoldi process, we use Modi ed Gram{Schmidt to orthogonalize the current vector to the previous ones at each step. However, because there are ibl starting vectors, the vector that A is applied to is ibl columns behind the Krylov vector that is being constructed, so that the matrix of inner products V AV has ibl bands below the diagonal. The subspace is built orthogonal to the converged eigenvectors, which are not included in the starting vector. The result is essentially the Band Lanczos Algorithm of Ruhe [57] with complete orthogonalization:

75

Algorithm 4.1 (Banded Arnoldi) To build a k{dimensional Krylov subspace. Given

orthogonal invariant basis vectors q1; : : :; qnconv and ibl starting vectors qnconv+1 ; : : : ; qnconv+ibl . For i = nconv + 1; : : : ; nconv + ibl Iteratively orthogonalize qi w.r.t. q1; : : :; qi?1 Normalize qi abl = ibl m = nconv 20 m := m + 1 m0 = m + abl qm0 = Aqm Iteratively orthogonalize qm0 w.r.t. q1; : : :; qm0?1 If kqm0 k < dtol() then abl := abl ? 1 If abl = 0 RETURN else Normalize qm0 If m + abl < k go to 20 RETURN The tolerance dtol(), computed at run{time, estimates the smallest oating point number of full precision. It is calculated as the ratio of the smallest representable positive number to the relative precision of oating point arithmetic. If one of the residual vectors becomes smaller than dtol(), we stop the Arnoldi process on it and continue on the remaining Krylov vectors. The reason we do not acknowledge convergence until such a catastrophe occurs is that we usually compute Krylov subspaces for a preconditioned matrix, for which it is not clear now to adjust a residual tolerance. The algorithm is stopped instead by 76

tight storage constraints to keep down the cost of orthogonalization. We do not make any other compromises to reduce this cost, because the orthogonality of the subspace basis vectors is fundamental to the projection method and the associated optimality property. Thus, we use strict Modi ed Gram{Schmidt orthogonalization, which unfortunately is inherently sequential. Also, we \iteratively orthogonalize", i.e., we immediately repeat the orthogonalization of a vector if it is reduced by 50%, as Rutishauser does in RITZIT [59] (but only for vectors reduced by 99%). Note that authors who do not precondition the matrix invariably use kqm0 k as the residual of the eigenvalue problem, even though it is only as reliable as the orthogonality of the Krylov vectors. There should, as a matter of principle, always be an independent check on the validity of the Arnoldi process.

4.3 The Selection Iteration For a nondefective matrix A, the Krylov subspace Km (A; v) in terms of the spectral N X decomposition v = j uj is j =1

9 8N N N = 0;

94

(5:2)

which we satisfy by adjusting n in Eq. (5.2). We always need to do this scaling of complex eigenfunctions, so we initially thought we should incorporate it into the Newton iteration for M ()u = (B ? A)u = 0, which needs an additional (complex) equation anyway for the solution to be locally unique. Without going into details, the equation that incorporates this natural quadratic scaling condition is

1 (uH Bu ? uH B u ? uT Bu ? 1) = 0 2

and the corresponding Jacobian for Newton's method is

2 66 64

3 (B ? A) Bu 77 75 , 1 H T H   u B + 2 (u B ? u B) 0

so this strategy entails considerable complication. In fact, the scaling of u in Newton's method is arbitrary, so there is no reason to use anything more complicated than linear scaling. We also note here that it seems fundamentally misguided to devote an entire paper [68] to general scalings. Also, linear scaling allows us to equate the method with other apparently distinct methods.

95

5.3 Direct Newton Methods The usual Newton iteration, formulated in [54] for matrices over the real number eld, nds a root of

3 2 6 M ()u 77 75 = 0. F (u; ) = 664 H v u?1

(5:3)

The second row of Eq. (5.3) is the linear scaling condition. Note that nothing prevents us from taking v = uH B , thus changing the scaling condition at each iteration. If we normalize each iterate so that vH u = 1, the Newton step is

3 3 2 32 2 0 66 M () M ()u 77 66 ?u 77 66 M ()u 77 75 ; 75 = 64 75 64 64 H 0 ? v 0

(5:4)

M ()(u + u) = ?M 0()u

(5.5)

which we write as

vH (u + u) = 1, to see that the rst equation is just inverse iteration with a scaling factor of ?. Another way to derive Eq. (5.4) is to set the rst order Taylor polynomial about the current iterate (u; ) to zero:

M ()u + M ()u + M 0()u = 0

96

In the case M () = B ? A, we may use the exact expansion (B ? A)u + (B ? A)u + Bu + B u = 0 and add the scaling condition

vH (u + u) ? 1 = 0 to get

3 3 2 3 2 32 2 66 B ? A Bu 77 66 ?u 77 66 (B ? A)u 77 66 B u 77 75 : 75 + 64 75 = 64 75 64 64 H H 0 v u?1 ? v 0

(5:6)

We may then use a relaxation method to solve exactly for u and . Another method similar to Eq. (5.4) is trace minimization with Ritz shifts [67],

32 2 66 (j B ? A) BU 77 66 ?uj 75 64 64 H ?l U B 0

3 2 77 66 (j B ? A)uj 75 = 64 0

3 77 75 .

It is applicable only when (j B ? A) is hermitian and B is hermitian positive de nite, and only for nding the smallest eigenvalues. However, in this case it has convergence which is ultimately cubic [67]. It is exactly identical to Newton's method, Eq. (5.4), only when nding the single smallest eigenvalue.

97

5.4 Indirect Newton Methods Another way to use Newton's method, given in [50] and apparently never cited by anyone else, is to notice that the scaling factor  in Eq. (5.5) approaches zero as M () approaches singularity. Given any matrix function M () which is singular at the desired eigenvalues, a scaling condition vH () = , and a xed right hand side x, the inverse iteration process de nes a scaling function (). We write

M ()u() = ()x;

(5.7)

vH u() = ;

(5.8)

and nd a root of the scaling function () using Newton's method. We di erentiate Eqs. (5.7) and (5.8) and eliminate terms to get

vH M ?1 [M 0u + Mu0 = 0x] vH u0 = 0; so that

and the Newton step is

H vH M ?1 M 0u = 0 v u

H  = ? 0 = H ?v?1 u 0 ; v M Mu

98

which also follows directly from Eq. (5.5), assuming that each iterate there is normalized so that vH u = 1. Thus this is a di erent viewpoint but not really a di erent algorithm. We may identify this algorithm with other Newton{like methods if we change the nominally \ xed" parameters x, v, and  at each iteration. For example, we may set

x = M 0 ((i))u(i) after the rst iteration to get the usual inverse iteration algorithm. In Rayleigh quotient iteration (RQI) for the generalized eigenvalue problem, the choices are

M () = B ? A,  = 1, v = ((i)B ? A)H u(i+1), and x = Bu(i) after the rst iteration. Two solves are required unless v is of the form M H t. Unfortunately, with v of this form, Osborne's third order convergence theorem for isolated simple eigenvalues of M () is inapplicable (Remark (i), p. 39 of [50]|Osborne's s is our v). However, for isolated simple eigenvalues, RQI has even better second order convergence per solve by viewing the process as Newton's method for Eq. (5.3). Furthermore, for hermitian de nite pencils, Parlett [52] uses geometrical arguments to prove that RQI has third order convergence, whether the eigenvalue is simple or not.

5.4.1 Example: The Hermitian Generalized Eigenvalue Problem As an application of the indirect point of view, we give a simple algorithm for nding a small group of solutions of (B ? A)u = 0 with A and B large, sparse, and hermitian and B positive de nite. With standard inverse iteration, the conjugate gradient solver is not robust because B ? A is inde nite for interior eigenvalues. To remedy this, we choose M () = (B ? A)2 as our singular matrix function, with v = Bu(i+1),  = 1, and

x = Bu(i) after the rst iteration. If  is real, we can use the complex conjugate gradient 99

method to solve the hermitian positive de nite system (B ? A)2z = Bx and get the inverse iteration e ect inexpensively for large, sparse matrices.

Algorithm 5.1 (Complex conjugate gradients) To solve Mz = x for z, where M is hermitian positive de nite, to a given tolerance ": z(0) initial guess r(0) = x ? Mz(0) p(0) = r(0) For i = 0; : : : (i)r(i) (i) = p(ri)Mp (i) z(i+1) = z(i) + (i)p(i) r(i+1) = r(i) ? (i)Mp(i) if r(i+1)r(i+1) < "2, RETURN. (i+1) (i+1) p(i+1) = r(i+1) + r r(i)rr(i) p(i) No Cholesky decomposition of B is required.

Algorithm 5.2 (Generalized inverse iteration squared) To nd a small group of solutions of (B ? A)u = 0 with A and B large, sparse, and hermitian and B positive

de nite: U (0) initial guess consisting of p B {orthogonal approximate desired eigenvectors For i = 0; : : : Find spectral decomposition U (i)AU (i) =: C (i+1)C , where C C = I xj = U (i)cj for j = 1; : : : ; p resj = k(ji+1)Bxj ? Axj k for j = 1; : : : ; p if resj < " for j = 1; : : : ; p, RETURN. zj = ((ji+1)B ? A)?2Bxj , j = 1; : : : ; p. Find UR decomposition Z =: U (i+1)R, where U (i+1)BU (i+1) = I .

100

In practice, the residual will be too pessimistic a stopping criterion for the conjugate gradient method, or any other iterative method, applied to this problem. Suppose we have computed an approximate solution z to the exact equation M ()(z + z) = M 0()x. Since the solution z + z is so large, the relative error z=kz + zk can be very small while z and the residual r = M 0x ? Mz = Mz are not small at all. This diculty is due to the near{singularity of M and the deliberate placement of the solution along the direction of the near{singularity rather than to any accident of scaling or choice of basis, so standard preconditioning techniques will not help. We will deal with this situation in Section 5.7.

5.5 Convergence of the Newton Iteration We'd like to apply standard convergence theorems to the Newton iteration Eq. (5.4) and thus infer second{order convergence for all these Newton{like methods. Unfortunately, if (u; ) is a root of F as in Eq. (5.3) and  is not simple, then F 0(u; ) is singular, and linear convergence of the Newton iteration is the best that can be expected.

Theorem 5.1 At a nontrivial root (u; ) of F as in Eq. (5.3), where M 0() is invertible, the derivative map F 0 is singular if and only if  is a multiple eigenvalue.

Proof: The proof is a direct generalization of a discussion in [54]. Note: 1. Since (u; ) satis es Eq. (5.3), vH u = 1. 2. We may use the Jordan canonical form of M 0()?1 M () to naturally extend the notion of generalized eigenvector to this problem. 101

If  is not simple, then there exists y (not a multiple of u) such that M ()y = kM 0()u. The case k = 0 corresponds to multiple linear elementary divisors. Let y~ = y ? u(vH y)=(vH u). Then

32 3 2 66 M () M 0()u 77 66 y~ 77 75 64 75 = 0. 64 H k v 0

(5:9)

Conversely, suppose that (~y; k) satis es Eq. (5.9). If y~ = u, then the second row of Eq. (5.9) becomes (vH u) = 0, so = 0, y~ = 0 and k = 0. On the other hand, if y~ is

2

not a multiple of u, then the rst row of Eq. (5.9) means that  is not simple.

For second{order convergence, we also need to check that the derivative map F 0 is Lipschitz continuous near a root. For M () = B ? A, the veri cation is straightforward:

2 3 2

6 M (2 ) M 0(2)u2 7 6 M (1 ) M 0(1 )u1 77 ? 66 kF 0(u2; 2) ? F 0(u1; 1)k2 =

664 5 4 H

vH v 0 0

2 3

6 (2 ? 1)B B (u2 ? u1) 7

77

=

664 5

0 0

2 = kB (2 ? 1; u2 ? u1)k2

3

77

75

2

 kB k2 k(u2; 2) ? (u1; 1)k2 : Thus we have quadratic convergence in some neighborhood of a simple eigenvalue, by the standard local convergence theorem for Newton's method (See e.g. p.90 of [19]). For multiple or ill{conditioned eigenvalue problems, some modi cation may be necessary to improve the convergence of the Newton iteration. One possibility is to modify 102

F so that a root is an entire invariant subspace of dimension p. Unfortunately, a Newton method for any nonlinear function involving p vectors simultaneously will involve solving equations roughly p times larger than the single{vector method and will be prohibitively expensive. A further diculty with this approach is that p must usually be adjusted by sophisticated software to be the smallest possible dimension of a well{separated invariant subspace. The best simultaneous approach [12] manages to decouple the Jacobian into

p separate solves, which, however, must be done sequentially.

5.6 General Eigenvalue Problems We cannot a ord to couple basis vectors of an invariant subspace together in one root of an expanded F as a way to meet the hypotheses of the standard theorems for second{order convergence. However, we can couple eigenvectors of multiple or nearly equal eigenvalues together in a block Rayleigh quotient after individual Newton/inverse iterations and take advantage of \constructive interference" between nearby approximate eigenvalues. By treating the interaction between eigenvectors in the projection step, the Newton iterations for individual eigenvectors may be carried out in parallel. Also, in Sec. 5.7.1, we will suggest simple modi cations to apply this method to defective or ill{conditioned eigenvalues as well as multiple ones.

Algorithm 5.3 (Multiple inverse iteration) To nd a small group of solutions of (B ? A)u = 0 with A and B large and sparse, and B hermitian positive de nite: U (0) initial guess consisting of p B {orthogonal approximate desired eigenvectors For i = 0; : : : 103

Find spectral decomposition U (i)AU (i) =: C (i+1)C ?1, where cj cj = 1 for j = 1; : : : ; p. xj = U (i)cj for j = 1; : : : ; p resj = k(ji+1)Bxj ? Axj k for j = 1; : : : ; p if resj < " for j = 1; : : : ; p, RETURN. zj = ((ji+1)B ? A)?1Bxj , j = 1; : : : ; p. Find UR decomposition Z =: U (i+1)R, where U (i+1)BU (i+1) = I . Several steps of inverse iteration for several di erent eigenvectors can be done in parallel until it is necessary to take a block Rayleigh quotient. This may be done in a hierarchy, on pieces of the approximate invariant subspace, with an overall block Rayleigh quotient done once at the end to eliminate any duplicates. The blocking criteria may be modi ed to improve parallelism, with signi cant loss of convergence speed whenever closely spaced eigenvalues are computed in separate blocks. This is due to the loss of constructive interference, in which the largest error terms, namely those from nearby eigenvalues, are absorbed into their corresponding eigenvectors in the same block.

5.6.1 General Block Rayleigh Quotient The standard block Rayleigh quotient can only be done for M 0() a constant hermitian positive de nite matrix. For a more general case,

M () = Bll +    + B22 + B1 ? A,

104

where M 0() is hermitian positive de nite (the \overdamping condition"), Lancaster [41] recasts the problem as

8 2 > > 66 0 > > 66 > > 66 > > 6 > < 66  66 > 66 > > 66 > > . .. 66 > > 64 > > : Bl Bl?1

Bl

. . . Bl?1 . .. . . . ... ... . ..

      B1

3 2 77 66 0 77 66 77 66 77 66 77 66 . .. 77 ? 66 77 66 77 66 Bl Bl?1 77 66 5 4 0 0

Bl

. .. Bl?1 . .. ...

   B2  0

39 2 > 66 ul?1 0 77> > > 77> 66 > 6 l?2 0 777> > 66 u 77> = 66 . 7 0 77> 666 .. 77> 66 > > 7 > 0 77> 666 u 75> 64 > > A ; u

3 77 77 77 77 77 77 = 0, 77 77 77 5 (5:10)

which can be viewed as a generalization (B ? A)u = 0 of the generalized eigenvalue problem (B ? A)u = 0. This is how Lancaster arrives at a natural generalization of the Rayleigh quotient. A natural inner product can also be de ned by the B matrix in Eq. (5.10), but it must take into account the eigenvalue associated with each eigenvector:

"

huj ; j ; uk ; k iB = uj lj?1 uj lj?2

2 66 0 6 # 666    uj 6666 . .. 66 64 Bl Bl?1

Bl

. .. Bl?1 . .. ...

   B1

32 3 l ? 1 77 66 uk k 77 77 66 7 77 66 uk l?2 777 77 66 k 77 . 77 66 .. 77 77 66 . 77 75 64 75 uk (5:11)

The numbers huj ; j ; uk ; k iA are de ned similarly. Note that hu; ; u; iB = uM 0()u > 0. The usual projection step (orthogonalization of a subspace basis followed by spectral decomposition of the projected eigenvalue problem) becomes a xed point iteration in the eigenpairs to nd a spectral decomposition of the matrix with entries huj ; j ; uk ; k iA , 105

for which the eigenpairs are orthogonal in terms of the inner product Eq. (5.11), i.e.,

huj ; j ; uk ; k iB = jk . This is clearly a lot of work to preserve the constructive interference of a block algorithm, but it is possibly worthwhile for groups of closely spaced eigenvalues, for which constructive interference is large and for which the xed point iteration will converge quickly. Unfortunately, our problems provide no opportunities for testing these nonlinear generalizations.

5.7 In ated Inverse Iteration We need a sparse solver for M ()z = M 0()x, with M () nonhermitian, large, sparse and nearly (possibly multiply) singular. As in Algorithm 5.2, the residual stopping criterion for any iterative method will be too pessimistic. If M is exactly singular, we may use Kramarz' approach [40] and solve the in ated equation (M + Y W )Z = Y ,

(5:12)

where W and Y are arbitrary except that they must not span a vector orthogonal to the nullvectors of M and M , respectively. This means they must have full rank equal to that of the nullspace. With such W and Y , the solution of Eq. (5.12) is a basis for the nullspace of M . We are interested in the case where M is only nearly singular, where the result of solving Eq. (5.12) is, in general, inferior to inverse iteration [40]. We will choose speci c Y and W to remedy this.

106

We choose Y = M 0U as a generalization of the Schur{Wielandt de ation, where the columns of U are the orthogonal (or M 0{orthogonal, if M 0 is hermitian positive de nite) nullvectors of M (j ). To further the analogy with Schur{Wielandt de ation, we take

W = (V ) with V U = I . (We take V = M 0U if M 0 is hermitian positive de nite, or V = U otherwise.) With these choices of Y and W , Eq. (5.12) becomes (M + M 0UV )Z = M 0U:

(5:13)

Note that  = 0 corresponds to inverse iteration. If M ()U = 0, then the solution of Eq. (5.13) is Z = U=. If (U; ) is a good approximate eigenpair, Z = U= will still satisfy Eq. (5.13) approximately, by continuity of inversion for nonsingular matrices (assuming

M and M 0 are continuous at ). One advantage of taking  6= 0 is to have Z = U= as a good initial guess for an iterative solver. More importantly, since the solution does not \blow up," the residuals get small along with the relative error and are a good stopping criterion. Therefore, we take  = 1. Having seen the advantages, we need to establish what, if anything, we lose by taking this approach instead of ordinary inverse iteration. Kramarz [40] points out that, in general, iterating with Eq. (5.12) does not converge as well as inverse iteration. But we have speci cally chosen Y = M 0U and W = V with V U = I , just as in Schur{ Wielandt de ation. The choice of V makes Z = U a good initial guess, and the choice of Y gives us exactly the same subspace iterate as inverse iteration, courtesy of the

107

Sherman{Morrison{Woodbury formula [49]:

U (i+1) = (M + M 0U (i)V )?1M 0U (i) = [I ? M ?1M 0U (i)(I + V M ?1M 0U (i))?1V ]M ?1M 0U (i) = M ?1M 0U (i)(I + V M ?1M 0U (i))?1(I + V M ?1M 0U (i))

? M ?1M 0U (i)(I + V M ?1M 0U (i))?1V M ?1 M 0U (i) = M ?1M 0U (i)(I + V M ?1M 0U (i))?1: In fact, the formula holds as well for rank one modi cations, so we have

u(i+1) = (M + M 0u(i)v)?1M 0u(i) = M ?1 M 0u(i)=(1 + vM ?1 M 0u(i)), and the equation we solve using GMRES [65] is (M + M 0u(i)v)u(i+1) = M 0u(i):

(5:14)

This rank one formulation eliminates the necessity of deciding in software whether eigenvectors belonging to nearly equal (nondefective) eigenvalues need to be grouped together. As with the full rank update, if M ((i))u(i) = 0, then u(i+1) = u(i) is a solution of Eq. (5.14). Even though the matrix in Eq. (5.14) is singular for (geometrically) multiple eigenvalues, the solution calculated by GMRES with an initial guess of u(i) (and initial

108

residual ?Mu(i)) does not \blow up": It minimizes

  

?Mu(i) ? M + M 0u(i)v u(i+1) ? u(i)

  over all u(i+1) ? u(i) 2 Km (M + M 0u(i) v; ?Mu(i)) = spanfQmg. This means it minimizes

or equivalently

(i)   kMu kq1 ? M + M 0u(i)v Qmy (i)   kMu ke1 ? Qm+1 M + M 0u(i)v Qmy

over all y. This reduced problem is always nonsingular, since the construction of the Krylov subspace stops at dimension m if ever Qm+1 spans even an approximate nullvector of the operator, i.e., a vector x for which k(M + M 0u(i)v)xk is within the residual tolerance. The least squares solution y of the reduced problem is no bigger than kMu(i)k, which is small, so the GMRES solution u(i+1) = u(i) + Qmy is almost exactly the same as

u(i). Since the solution does not \blow up", the residual ?Mu(i+1) gets small along with the relative error and is a good stopping criterion. This residual is the same as that for the eigenvalue problem, so it cannot get small unless (u(i); (i)) are very good approximations. Nevertheless, we recommend that the residual stopping criterion for GMRES be the same as that for the eigenvalue problem, with the early GMRES solves stopped instead by fairly tight storage constraints, in order to control the cost of orthogonalization in GMRES.

109

In ation is not necessary for these early iterations where storage is the stopping criterion (i.e., we may take  = 0).

5.7.1 Defective or Ill{conditioned Eigenvalues By our convention, if M ((ji))u(ji) = 0 and u(ji) corresponds to a nonlinear divisor of degree r, then any solutions of

M ((ji))u(ki) = M 0((ji))u(ki?) 1, k = j + 1; : : : ; j + r ? 1

(5:15)

are a basis for the corresponding invariant subspace, i.e., they are the generalized eigenvectors. None of the generalized eigenvectors are in the nullspace of M ((ji)). Therefore, the solution of Eq. (5.15) calculated by GMRES with an initial guess of u(ki?1) does not \blow up", and the residual kM 0u(ki?) 1 ? Mu(ki)k gets small along with the relative error and is a good stopping criterion. Thus the in ated inverse iteration is not necessary, and in fact, our choice of Y = M 0U does not meet Kramarz' criteria in the defective case:

M ()u1 = 0, M ()u2 = M 0()u1, vM () = 0 =) vM 0()u1 = vM ()u2 = 0; contradicting the requirement that Y = M 0U span no vector orthogonal to the nullspace of M . The equivalence to inverse iteration helps us to explain, in a less rigorous, more intuitive way than Varah [81], what he (and other authors e.g. [54]) have noted about 110

the behavior of inverse iteration \with a very good eigenvalue" for defective or very ill{ conditioned eigenvalues, namely, that the residual increases after the rst iteration. The explanation is that the appropriate equation to solve switches from Eq. (5.14) in the rst iteration to Eq. (5.15) in subsequent iterations, up to the degree of the nonlinear divisor. Since this would be a nightmare to treat in software, we recommend solving instead the corresponding nondefective eigenvalue problem [M ()]r u = 0,

(5:16)

where the degree r is determined by the noncommutative symmetry of the underlying di erential equation [6]. The nearly defective cases must correspond to near symmetries in the problem. Thus we may apparently eliminate this bothersome case by an otherwise irrelevant perturbation of the problem, such as changing the gridding or triangulation to break the symmetry.

111

LIST OF REFERENCES [1] V. I. Agoshkov and Ju. A. Kuznetsov. Lanczos method for the eigenvalue problem. In G. I. Marchuk, editor, Comp. Meth. Linear Algebra, pages 145{164, 1972. (in Russian). [2] D.G. Anderson. . J. Assoc. Comp. Mach., 12:547, 1965. [3] K. Bierwirth, N. Schulz, and F. Arndt. Finite-di erence analysis of rectangular dielectric waveguide structures. IEEE Trans. MTT, 34(11):1104{1114, 1986. [4]  A. Bjorck. Solving linear least squares problems by Gram{Schmidt orthogonalization. BIT, 7:1{21, 1967. [5] J. S. Blakemore. Approximations for Fermi{Dirac integrals, especially the function F1=2() used to describe electron density in a semiconductor. Solid State Electronics, 25(11-A):1067{1076, 1982. [6] A. Bossavit. Symmetry, groups, and boundary value problems. A progressive introduction to noncommutative harmonic analysis of partial di erential equations in domains with geometrical symmetry. Comput. Methods Appl. Mech. Engrg., 56:167{ 215, 1986. [7] A. Bossavit. Simplicial nite elements for scattering problems in electromagnetism. Comput. Methods Appl. Mech. Engrg., 76:299{316, 1989. [8] P. Brown and A. C. Hindmarsh. Matrix-free methods for sti systems of ODEs. SIAM J. Numer. Anal., 23:610{638, 1986. [9] Tony F. Chan and K.R. Jackson. Nonlinearly preconditioned Krylov subspace methods for discrete Newton algorithms. SIAM J. Sci. Stat. Comp., 7:533{542, 1984. [10] R. Chandra. Conjugate Gradient Methods for Partial Di erential Equations. PhD thesis, Yale University, New Haven, 1978. Research report # 129. [11] B. W. Char, G. J. Fee, K. O. Geddes, G. H. Gonnet, and M. B. Monagan. A tutorial introduction to Maple. J. Symbolic Comp., 2(2):179{200, 1986. [12] F. Chatelin. Simultaneous Newton's iteration for the eigenproblem. Computing, 5:67{74, 1984. 112

[13] I. L. Chern and W. L. Miranker. Dichotomy and conjugate gradients in the stiff initial value problem. Technical Report 8032-34917, IBM, Yorktown Heights, 1980.
[14] M. Clint and A. Jennings. A simultaneous iteration method for the unsymmetric eigenvalue problem. Journal of the Institute for Mathematics and Applications, 8:111-121, 1971.
[15] T. F. Coleman, B. S. Garbow, and J. J. Moré. Software for estimating sparse Jacobian matrices. Technical Report ANL-MCS-TM-14, Argonne National Laboratory, 1983.
[16] J. K. Cullum and R. A. Willoughby. Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. I, Theory. Birkhäuser, Boston, 1985.
[17] S. Datta. Quantum devices. Superlattices and Microstructures, 6:83, 1989.
[18] P. J. Davis. Interpolation and Approximation. Dover Publications, Inc., 1975.
[19] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, New Jersey, 1983.
[20] B. Fischer and R. Freund. Chebyshev polynomials are not always optimal. Technical Report 89.17, Research Institute for Advanced Computer Science, 1989.
[21] A. Galick, T. Kerkhoven, and U. Ravaioli. Iterative solution of the eigenvalue problem for a dielectric waveguide. IEEE Trans. MTT, 40(4):699-705, April 1992.
[22] C. W. Gear and Youcef Saad. Iterative solution of linear equations in ODE codes. SIAM J. Sci. Stat. Comp., 4:583-601, 1983.
[23] Philip E. Gill, Walter Murray, and Margaret H. Wright. Practical Optimization. Academic Press, 1981.
[24] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 2nd edition, 1989.
[25] K. Hayata, M. Koshiba, M. Eguchi, and M. Suzuki. Vectorial finite-element method without any spurious solutions for dielectric waveguiding problems using transverse magnetic-field component. IEEE Trans. MTT, 34(11):1120-1124, 1986.
[26] E. Hille. Analytic Function Theory, volume II, pages 264-274. Ginn and Co., 2nd edition, 1962.
[27] D. Ho, F. Chatelin, and M. Bennani. Arnoldi-Tchebychev procedure for large scale nonsymmetric matrices. Math. Model. Numer. Anal., 24(1):53-65, 1990.
[28] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[29] IBM J. Res. Develop., May 1988. Contains an extensive review.

[30] J. D. Jackson. Classical Electrodynamics, pages 17-22. John Wiley and Sons, 2nd edition, 1974.
[31] W. Jalby, U. Meier, and A. Sameh. The behavior of conjugate gradient algorithms on a multivector processor with a hierarchical memory. Technical Report CSRD 758, UIUC, 1988.
[32] T. Kerkhoven. Private communication, 1992.
[33] T. Kerkhoven, A. Galick, and U. Ravaioli. Efficient numerical solution of large sparse eigenvalue problems in microelectronic laser design. In Proceedings of the International Workshop on Computational Electronics, pages 87-90. Printex, 1992.
[34] T. Kerkhoven, A. T. Galick, U. Ravaioli, J. H. Arends, and Y. Saad. Efficient numerical simulation of electron states in quantum wires. J. Appl. Physics, 68(7):3461-3469, 1 October 1990.
[35] T. Kerkhoven, M. Raschke, and U. Ravaioli. Self-consistent simulation of corrugated layered structures. Superlattices and Microstructures, 12(4):505-508, 1992.
[36] T. Kerkhoven, M. W. Raschke, and U. Ravaioli. Self-consistent simulation of quantum wires in periodic heterojunction structures. (Submitted to Journal of Applied Physics.)
[37] T. Kerkhoven and Y. Saad. On acceleration methods for systems of coupled nonlinear partial differential equations. Technical Report UIUCDCS-R-1363, University of Illinois, February 1989.
[38] Thomas Kerkhoven. On the effectiveness of Gummel's method. SIAM J. on Sci. & Stat. Comp., 9:48-60, January 1988.
[39] M. Khelifi. Lanczos maximal algorithm for unsymmetric eigenvalue problems. Appl. Numer. Math., 7:179-193, 1991.
[40] L. Kramarz. Algebraic perturbation methods for the solution of singular linear systems. Lin. Alg. Appl., 36:79-88, 1981.
[41] P. Lancaster. Lambda-Matrices and Vibrating Systems. Pergamon, Oxford, 1966.
[42] S. E. Laux. Numerical methods for calculating self-consistent solutions of electron states in narrow channels. In J. J. H. Miller, editor, Proceedings of the Fifth International Conference on the Numerical Analysis of Semiconductor Devices and Integrated Circuits, pages 270-275, Dublin, Ireland, 1987. NASECODE V, Boole Press Limited.
[43] S. E. Laux and F. Stern. Electron states in narrow gate-induced channels in Si. Appl. Phys. Lett., 49(2):91-93, 1986.

[44] S. E. Laux and A. C. Warren. Self-consistent calculation of electron states in narrow channels. Technical report, IBM, T. J. Watson Research Center, 1986.
[45] M. Macucci, U. Ravaioli, and T. Kerkhoven. Analysis of electron transfer between parallel quantum wires. Superlattices and Microstructures, 12(4):509-512, 1992.
[46] E. A. J. Marcatili. Dielectric rectangular waveguide and directional coupler for integrated optics. Bell Syst. Tech. J., 48:2071-2102, 1969.
[47] D. C. Miller, R. K. Lake, S. Datta, M. S. Lundstrom, M. R. Melloch, and R. Reifenberger. Modulation of the Conductance of T-Shaped Electron Waveguide Structures with a Remote Gate, pages 165-174. Academic Press, 1989.
[48] R. Natarajan. An Arnoldi-based iterative scheme for nonsymmetric matrix pencils arising in finite element stability problems. Technical Report RC 16327 (#69303), IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 1990.
[49] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York, 1970.
[50] M. R. Osborne. Inverse iteration, Newton's method, and non-linear eigenvalue problems. In The Contribution of Dr. J. H. Wilkinson to Numerical Analysis, number 19 in Symposium Proceedings Series, pages 21-53. Institute of Mathematics and its Applications, 1978.
[51] B. N. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, New Jersey, 1980.
[52] B. N. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall Series in Computational Mathematics, 1980.
[53] B. N. Parlett, D. R. Taylor, and Z. A. Liu. A look-ahead Lanczos algorithm for unsymmetric matrices. Math. Comp., 44:105-124, 1985.
[54] G. Peters and J. H. Wilkinson. Inverse iteration, ill-conditioned equations and Newton's method. SIAM Review, 21(3):339-360, 1979.
[55] B. M. A. Rahman and J. B. Davies. Penalty function improvement of waveguide solution by finite elements. IEEE Trans. MTT, 32:922-928, 1984.
[56] U. Ravaioli, T. Kerkhoven, M. Raschke, and A. T. Galick. Numerical simulation of electron confinement in contiguous quantum wires. Superlattices and Microstructures, 11(3):343-345, 1992.
[57] A. Ruhe. Implementation aspects of band Lanczos algorithms for computation of eigenvalues of large sparse symmetric matrices. Math. Comp., 33(146):680-687, 1979.

[58] H. Rutishauser. Computational aspects of F. L. Bauer's simultaneous iteration method. Numerische Mathematik, 13:4-13, 1969.
[59] H. Rutishauser. Simultaneous iteration method for symmetric matrices. Numerische Mathematik, 16:205-223, 1970. Also in [85, pp. 284-301].
[60] Y. Saad. Variations of Arnoldi's method for computing eigenelements of large unsymmetric matrices. Linear Algebra and Its Applications, 34:269-295, 1980.
[61] Y. Saad. Projection methods for solving large sparse eigenvalue problems. In Matrix Pencils, Proceedings, volume 973 of Lecture Notes in Math., pages 121-144. Springer-Verlag, 1982.
[62] Y. Saad. Chebyshev acceleration techniques for solving nonsymmetric eigenvalue problems. Math. Comp., 42(166):567-588, 1984.
[63] Y. Saad. Numerical methods for large matrix eigenvalue problems. Unpublished class notes, 1987.
[64] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual method for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856-869, 1986.
[65] Youcef Saad and Martin H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856-869, 1986.
[66] M. Sadkane. A block Arnoldi-Chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices. Numer. Math., 64:181-193, 1993.
[67] A. Sameh and J. A. Wisniewski. A trace minimization algorithm for the generalized eigenvalue problem. SIAM J. Numer. Anal., 19(6):1243-1259, 1982.
[68] M. C. Santos. A note on the Newton iteration for the algebraic eigenvalue problem. SIAM J. Matrix Anal., 9(4):561-569, 1988.
[69] N. Schulz, K. Bierwirth, F. Arndt, and U. Koster. Finite-difference method without spurious solutions for the hybrid-mode analysis of diffused channel waveguides. IEEE Trans. MTT, 38(6):722-729, 1990.
[70] E. Schweig and W. B. Bridges. Computer analysis of dielectric waveguides: A finite-difference method. IEEE Trans. MTT, 32(5):531-541, 1984.
[71] L. C. Shen and J. A. Kong. Applied Electromagnetism. PWS Publishers, 2nd edition, 1987.
[72] F. Sols, M. Macucci, U. Ravaioli, and K. Hess. On the possibility of transistor action based on quantum interference. Appl. Phys. Lett., 54:350-352, 1989.

[73] F. Sols, M. Macucci, U. Ravaioli, and K. Hess. Theory for a quantum modulated transistor. J. Appl. Phys., 66(8):3892-3906, 1989.
[74] D. C. Sorensen. Implicit application of polynomial filters in a k-step Arnoldi method. SIAM J. Matrix Anal., 13(1):357-385, 1992.
[75] Frank Stern and Sankar Das Sarma. Electron levels in GaAs-Ga$_{1-x}$Al$_x$As heterojunctions. Physical Review B, 30(2):840-848, 1984.
[76] G. W. Stewart. Simultaneous iteration for computing invariant subspaces of non-Hermitian matrices. Numer. Math., 25:123-136, 1976.
[77] G. W. Stewart. SRRIT - a FORTRAN subroutine to calculate the dominant invariant subspaces of a real matrix. Technical Report TRR-514, University of Maryland, Department of Computer Science, 1978.
[78] W. J. Stewart and A. Jennings. A simultaneous iteration algorithm for real matrices. ACM Trans. Math. Software, 7(2):184-198, 1981.
[79] J. Svedin. A modified finite-element method for dielectric waveguides using an asymptotically correct approximation on infinite elements. IEEE Trans. MTT, 39(2):258-266, 1991.
[80] S. M. Sze. Physics of Semiconductor Devices. Wiley-Interscience, 2nd edition, 1981.
[81] J. M. Varah. Computing invariant subspaces of a general matrix when the eigensystem is poorly determined. Math. Comp., 24:137-149, 1970.
[82] R. S. Varga. Matrix Iterative Analysis, page 184. Prentice-Hall, Englewood Cliffs, 1962.
[83] P. Wayner. Genetic algorithms. Byte, 16(1):361-368, 1991.
[84] L. B. Wigton, N. J. Yu, and D. P. Young. GMRES acceleration of computational fluid dynamics codes. In AIAA 7th Computational Fluid Dynamics Conference, 1985. Report #AIAA-85-1494.
[85] J. H. Wilkinson and C. Reinsch. Handbook for Automatic Computation, Vol. II: Linear Algebra. Springer, New York, 1971.


APPENDIX A

PDE CONSISTENCY CHECK

    |\^/|     MAPLE V
._|\|   |/|_. Copyright (c) 1981-1990 by the University of Waterloo.
 \  MAPLE  /  All rights reserved. MAPLE is a registered trademark of
              Waterloo Maple Software.
      |       Type ? for help.

#
# DISCRETIZATION ERROR OF Hx.
#
> readlib(mtaylor);
proc() ... end
> printlevel := -1;
#
# Define the discretization on a rectangular grid, with Hx and Hy on mesh
# points expanded in third degree Taylor polynomials, and eps on dual mesh
# points expanded in second degree Taylor polynomials about the center
# point (0,0). The unknowns n,s,e,w refer to the distance to the next
# mesh point to the North, South, East, and West, respectively. We
# apologize for the complexity, which is the minimum necessary.
#
> d1 := \
> (w*sqrt(mtaylor(eps(-w/2,-s/2),[w,s],3)/mtaylor(eps(-w/2,n/2),[w,n],3)) +\
> e*sqrt(mtaylor(eps(e/2,-s/2),[e,s],3)/mtaylor(eps(e/2,n/2),[e,n],3)))*\
> (mtaylor(Hx(0,n),[n],4)-Hx(0,0))/(2*n) +\
> (w*sqrt(mtaylor(eps(-w/2,n/2),[w,n],3)/mtaylor(eps(-w/2,-s/2),[w,s],3)) +\
> e*sqrt(mtaylor(eps(e/2,n/2),[e,n],3)/mtaylor(eps(e/2,-s/2),[e,s],3)))*\
> (mtaylor(Hx(0,-s),[s],4)-Hx(0,0))/(2*s) +\
> (n*(sqrt(mtaylor(eps(-w/2,-s/2),[w,s],3)/mtaylor(eps(-w/2,n/2),[w,n],3)) +\
> sqrt(mtaylor(eps(e/2,-s/2),[e,s],3)/mtaylor(eps(e/2,n/2),[e,n],3))) +\
> s*(sqrt(mtaylor(eps(-w/2,n/2),[w,n],3)/mtaylor(eps(-w/2,-s/2),[w,s],3)) +\
> sqrt(mtaylor(eps(e/2,n/2),[e,n],3)/mtaylor(eps(e/2,-s/2),[e,s],3))))*\
> (mtaylor(Hx(-w,0),[w],4)-Hx(0,0))/(4*w) +\
> (n*(sqrt(mtaylor(eps(-w/2,-s/2),[w,s],3)/mtaylor(eps(-w/2,n/2),[w,n],3)) +\
> sqrt(mtaylor(eps(e/2,-s/2),[e,s],3)/mtaylor(eps(e/2,n/2),[e,n],3))) +\
> s*(sqrt(mtaylor(eps(-w/2,n/2),[w,n],3)/mtaylor(eps(-w/2,-s/2),[w,s],3)) +\
> sqrt(mtaylor(eps(e/2,n/2),[e,n],3)/mtaylor(eps(e/2,-s/2),[e,s],3))))*\
> (mtaylor(Hx(e,0),[e],4)-Hx(0,0))/(4*e) +\
> o^2*u*(n+s)*\
> (w*sqrt(mtaylor(eps(-w/2,n/2),[w,n],3)*mtaylor(eps(-w/2,-s/2),[w,s],3)) +\
> e*sqrt(mtaylor(eps(e/2,n/2),[e,n],3)*mtaylor(eps(e/2,-s/2),[e,s],3)))*\
> Hx(0,0)/4 - b^2*\
> (w*(n*sqrt(mtaylor(eps(-w/2,-s/2),[w,s],3)/mtaylor(eps(-w/2,n/2),[w,n],3)) +\
> s*sqrt(mtaylor(eps(-w/2,n/2),[w,n],3)/mtaylor(eps(-w/2,-s/2),[w,s],3))) +\
> e*(n*sqrt(mtaylor(eps(e/2,-s/2),[e,s],3)/mtaylor(eps(e/2,n/2),[e,n],3)) +\
> s*sqrt(mtaylor(eps(e/2,n/2),[e,n],3)/mtaylor(eps(e/2,-s/2),[e,s],3))))*\
> Hx(0,0)/4 +\
> (sqrt(mtaylor(eps(-w/2,-s/2),[w,s],3)/mtaylor(eps(-w/2,n/2),[w,n],3)) -\
> sqrt(mtaylor(eps(-w/2,n/2),[w,n],3)/mtaylor(eps(-w/2,-s/2),[w,s],3)))*\
> (mtaylor(Hy(-w,0),[w],4)-Hy(0,0))/2 +\
> (sqrt(mtaylor(eps(e/2,n/2),[e,n],3)/mtaylor(eps(e/2,-s/2),[e,s],3)) -\
> sqrt(mtaylor(eps(e/2,-s/2),[e,s],3)/mtaylor(eps(e/2,n/2),[e,n],3)))*\
> (mtaylor(Hy(e,0),[e],4)-Hy(0,0))/2;
bytes used=1000372, alloc=851812, time=2.633
#
# Define the relations (i.e., the differential equations)
#
> siderels :=\
> {(D[1,1] + D[2,2])(Hx)(0,0)*eps(0,0) + o^2*u*(eps(0,0))^2*Hx(0,0) -\
> D[2](eps)(0,0)*D[2](Hx)(0,0) +\
> D[2](eps)(0,0)*D[1](Hy)(0,0)=b^2*Hx(0,0)*eps(0,0),\
> (D[1,1] + D[2,2])(Hy)(0,0)*eps(0,0) + o^2*u*(eps(0,0))^2*Hy(0,0) +\
> D[1](eps)(0,0)*D[2](Hx)(0,0) -\
> D[1](eps)(0,0)*D[1](Hy)(0,0)=b^2*Hy(0,0)*eps(0,0)};
#
# Display the relations.
#
> latex(siderels);
bytes used=2000496, alloc=1113908, time=5.800

$$\begin{aligned}
\{\,& (D_{1,1}(H_x)(0,0) + D_{2,2}(H_x)(0,0))\,\epsilon(0,0) + o^2 u\,\epsilon(0,0)^2 H_x(0,0) \\
&\quad - D_2(\epsilon)(0,0)\,D_2(H_x)(0,0) + D_2(\epsilon)(0,0)\,D_1(H_y)(0,0) = b^2 H_x(0,0)\,\epsilon(0,0), \\
& (D_{1,1}(H_y)(0,0) + D_{2,2}(H_y)(0,0))\,\epsilon(0,0) + o^2 u\,\epsilon(0,0)^2 H_y(0,0) \\
&\quad + D_1(\epsilon)(0,0)\,D_2(H_x)(0,0) - D_1(\epsilon)(0,0)\,D_1(H_y)(0,0) = b^2 H_y(0,0)\,\epsilon(0,0) \,\}
\end{aligned}$$

#
# Define the variables to be those indeterminates in siderels


# which are of Maple type `function'.
#
> vars := indets(siderels,'function');
#
# Display the variables.
#
> latex(vars);

$$\{\, H_x(0,0),\ H_y(0,0),\ \epsilon(0,0),\ D_1(\epsilon)(0,0),\ D_2(\epsilon)(0,0),\ D_1(H_y)(0,0),$$
$$D_{1,1}(H_y)(0,0),\ D_2(H_x)(0,0),\ D_{2,2}(H_x)(0,0),\ D_{1,1}(H_x)(0,0),\ D_{2,2}(H_y)(0,0) \,\}$$

#
# Simplify the discretization with respect to the relations,
# using the `total degree' ordering in the variables.
#
> derr1 := simplify(d1,siderels,vars);
bytes used=3000972, alloc=1113908, time=9.066
bytes used=4007768, alloc=1572576, time=12.750
#
# Re-order the terms according to the degree of n,s,e,w
# up to third degree terms.
#
> derr1 := mtaylor(derr1,[n,s,e,w],4);
bytes used=5007896, alloc=1769148, time=15.966
bytes used=6008024, alloc=1769148, time=20.383
bytes used=7008168, alloc=2227816, time=24.533
bytes used=8008544, alloc=2948580, time=28.750
bytes used=9008732, alloc=3669344, time=33.666
#
# Display the result. Since this discretization remainder
# is third order per mesh point, the discretization is first
# order consistent with the 2D PDE for Hx. An analogous
# result holds for Hy, by rotation of indices.
#
> latex(derr1);
bytes used=10008872, alloc=4259060, time=36.633
bytes used=11009136, alloc=4259060, time=39.083

[Output of latex(derr1): a lengthy remainder expression in which every term carries a factor of total degree three in the mesh spacings ($n^2e$, $wn^2$, $ne^2$, $se^2$, $ws^2$, $w^2s$, ...), i.e., the third-order remainder discussed in the comments above.]

#
# Let's see what happens on a regular grid.
#
> newrels := {n=s,e=w};
> vars := indets(newrels);
> derrnew := simplify(derr1,newrels,vars);
bytes used=12009452, alloc=4259060, time=42.466
#
# Display the result. The discretization is second
# order consistent on regular meshes.
#
> print(derrnew);
0
> quit
bytes used=12237928, alloc=4259060, time=43.050
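For readers without access to Maple V, a much smaller analogue of this consistency check can be scripted today. The sketch below is our addition, not part of the original session: it uses Python's sympy to Taylor-expand the nonuniform three-point second difference in one dimension, with hypothetical spacings h_e and h_w playing the role of e and w above. The same conclusion emerges: first-order consistency on a nonuniform grid, second order when the grid is regular.

```python
# A minimal 1-D analogue (assumed example) of the Maple consistency check:
# Taylor-expand u about x=0 and substitute into the nonuniform three-point
# second-difference stencil for u''(0).
import sympy as sp

x, h_e, h_w = sp.symbols('x h_e h_w', positive=True)
u = sp.Function('u')

def taylor(h, deg=4):
    """Fourth-degree Taylor polynomial of u about 0 (remainder dropped)."""
    terms = [u(x).subs(x, 0)]
    terms += [sp.diff(u(x), x, k).subs(x, 0) * h**k / sp.factorial(k)
              for k in range(1, deg + 1)]
    return sp.Add(*terms)

# Nonuniform second-difference approximation to u''(0).
stencil = (2 / (h_e + h_w)) * ((taylor(h_e) - taylor(0)) / h_e
                               - (taylor(0) - taylor(-h_w)) / h_w)

err = sp.expand(stencil - sp.diff(u(x), x, 2).subs(x, 0))
print(err)                              # leading term: (h_e - h_w)/3 * u'''(0)
print(sp.simplify(err.subs(h_w, h_e)))  # regular grid: only the O(h^2) term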


APPENDIX B

CHEBYSHEV POLYNOMIALS

Our interest in Chebyshev polynomials is motivated by two things. First, we wish to control the size of the polynomial on an ellipse. Second, we require that the polynomials be an orthogonal family, so that we may use a three-term recursion to efficiently apply them. We summarize here how Chebyshev polynomials meet these requirements.

By Tonelli's theorem [18], the unique minimal monic polynomial of degree $n$ on a compact set $E$ containing at least $n+1$ points reaches its maximum modulus at least $n+1$ times in $E$. The Chebyshev polynomials are defined (up to scaling) by this property on $[-1,1]$ as

$$t_n(z) \equiv \cos(n \arccos z). \eqno(B.1)$$

If we let $\theta = \arccos z$, then $\theta$ is ambiguous (we may substitute $\pm\theta + 2m\pi$) but there is no ambiguity in $t_n(z) = \cos n(\pm\theta + 2m\pi)$. Then, from the identities

$$\cos 0\theta = 1, \quad \cos 1\theta = z, \quad \cos(n+1)\theta + \cos(n-1)\theta = 2\cos\theta\,\cos n\theta,$$

we have the three-term recursion

$$t_0(z) = 1, \quad t_1(z) = z, \; \ldots, \; t_{n+1}(z) = 2z\,t_n(z) - t_{n-1}(z), \eqno(B.2)$$

which verifies that $t_n(z)$ as defined in Eq. (B.1) is a polynomial with leading coefficient $2^{n-1}$. We reserve the capitalized label

$$T_n(z) = 2^{1-n}\,t_n(z) \eqno(B.3)$$

for the unique monic minimax polynomial.

If we set $w = e^{i\theta}$, the ambiguity of $\theta$ leads to two possible values, specifically $e^{i(\pm\theta + 2m\pi)} = w^{\pm 1}$. However, it is always true that $z = \cos\theta = \frac{e^{i\theta} + e^{-i\theta}}{2} = \frac{w + w^{-1}}{2}$. Since

$$t_n(z) = \cos n\theta = \frac{e^{in\theta} + e^{-in\theta}}{2} = \frac{w^n + w^{-n}}{2},$$

the asymptotic behavior of $t_n(z)$ as $n \to \infty$ is governed by the root $w$ of largest modulus of $\frac{w + w^{-1}}{2} = z$, or equivalently $w^2 - 2zw + 1 = 0$. Thus, the "control quantity" is

$$w = z \pm \sqrt{z^2 - 1}, \eqno(B.4)$$

whichever has largest modulus, and the asymptotic behavior is

$$t_n(z) \to \frac{w^n}{2} \quad \text{as } n \to \infty. \eqno(B.5)$$

This has the same modulus for all points of the form $z(\phi) = \frac{re^{i\phi} + r^{-1}e^{-i\phi}}{2}$, where $r = |w|$, which we verify is an ellipse with foci $\pm 1$ and major axis $r + r^{-1}$:

$$\begin{aligned}
|z(\phi) - 1| + |z(\phi) + 1|
&= \left|\frac{re^{i\phi} + r^{-1}e^{-i\phi}}{2} - 1\right| + \left|\frac{re^{i\phi} + r^{-1}e^{-i\phi}}{2} + 1\right| \\
&= \frac{1}{2}\left|\sqrt{r}\,e^{i\phi/2} - \frac{1}{\sqrt{r}}\,e^{-i\phi/2}\right|^2 + \frac{1}{2}\left|\sqrt{r}\,e^{i\phi/2} + \frac{1}{\sqrt{r}}\,e^{-i\phi/2}\right|^2 \\
&= \frac{1}{2}\left(\sqrt{r}\,e^{i\phi/2} - \frac{1}{\sqrt{r}}\,e^{-i\phi/2}\right)\left(\sqrt{r}\,e^{-i\phi/2} - \frac{1}{\sqrt{r}}\,e^{i\phi/2}\right) \\
&\qquad + \frac{1}{2}\left(\sqrt{r}\,e^{i\phi/2} + \frac{1}{\sqrt{r}}\,e^{-i\phi/2}\right)\left(\sqrt{r}\,e^{-i\phi/2} + \frac{1}{\sqrt{r}}\,e^{i\phi/2}\right) \\
&= \frac{1}{2}\left(r + r^{-1} - e^{i\phi} - e^{-i\phi}\right) + \frac{1}{2}\left(r + r^{-1} + e^{i\phi} + e^{-i\phi}\right) \\
&= r + r^{-1}.
\end{aligned}$$

Thus, the Chebyshev polynomials $t_n(z)$ asymptotically satisfy the Tonelli criterion on ellipses with foci $\pm 1$, and, by the maximum principle, also on the interiors of this confocal family. Therefore, although they may not be the minimal polynomials on these ellipses [20], they are asymptotically minimal (up to scaling).
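As a quick numerical illustration (our addition; the test point z and the degree n are arbitrary choices), the recursion (B.2), the control quantity (B.4), and the asymptotic estimate (B.5) can be checked directly:

```python
# A quick check (assumed example) of the recursion (B.2), the control
# quantity (B.4), and the asymptotic estimate (B.5) at a complex point z.
import numpy as np

def t(n, z):
    """Chebyshev t_n(z) via the three-term recursion (B.2)."""
    t_prev, t_cur = 1.0 + 0.0j, z
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_cur = t_cur, 2 * z * t_cur - t_prev
    return t_cur

z, n = 1.3 + 0.4j, 30
# Root of w^2 - 2zw + 1 = 0 of largest modulus, Eq. (B.4).
w = max(z + np.sqrt(z * z - 1), z - np.sqrt(z * z - 1), key=abs)

exact = np.cos(n * np.arccos(z))         # definition (B.1), complex arccos
print(abs(t(n, z) - exact) / abs(exact))  # relative error: rounding level
print(abs(t(n, z)) / (abs(w) ** n / 2))   # ratio -> 1 with n, Eq. (B.5)
```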

B.1 Chebyshev Polynomials Adapted to an Ellipse

We have seen that the Chebyshev polynomials $t_n(z)$ are naturally adapted to the family of ellipses with foci $\pm 1$. We now adapt them to arbitrary ellipses in the complex plane of center $d$ and foci $d \pm c$ by taking $z = \frac{\lambda - d}{c}$. We also scale them to unity at a reference point $\lambda_{\mathrm{ref}}$. The result is

$$p_n(\lambda) = t_n\!\left(\frac{\lambda - d}{c}\right) \Big/ \; t_n\!\left(\frac{\lambda_{\mathrm{ref}} - d}{c}\right). \eqno(B.6)$$

The control quantity (B.4) in terms of $\lambda$ is one of

$$w(\lambda) = \frac{\lambda - d}{c} \pm \sqrt{\left(\frac{\lambda - d}{c}\right)^2 - 1}, \eqno(B.7)$$

whichever has largest modulus, and the asymptotic behavior is $p_n(\lambda) \to [\kappa(\lambda)]^n$ as $n \to \infty$, where

$$\kappa(\lambda) = \frac{w(\lambda)}{w(\lambda_{\mathrm{ref}})}. \eqno(B.8)$$

Let $\rho_n = t_n\!\left(\frac{\lambda_{\mathrm{ref}} - d}{c}\right)$ and $\sigma_n = \frac{\rho_{n-1}}{\rho_n}$. The three-term recursion (B.2) implies that

$$\rho_0 = 1, \quad \sigma_1 = \frac{1}{\rho_1} = \frac{c}{\lambda_{\mathrm{ref}} - d}, \; \ldots, \; \rho_{n+1} = 2\left(\frac{\lambda_{\mathrm{ref}} - d}{c}\right)\rho_n - \rho_{n-1},$$

so

$$\frac{\rho_{n+1}}{\rho_n} = \frac{1}{\sigma_{n+1}} = \frac{2}{\sigma_1} - \sigma_n \quad \text{and} \quad \sigma_{n+1} = \frac{1}{2/\sigma_1 - \sigma_n},$$

and also that

$$\rho_0\,p_0(\lambda) = 1, \quad \rho_1\,p_1(\lambda) = \frac{\lambda - d}{c} \;\text{ so }\; p_1(\lambda) = \sigma_1\,\frac{\lambda - d}{c}, \; \ldots,$$
$$\rho_{n+1}\,p_{n+1}(\lambda) = 2\left(\frac{\lambda - d}{c}\right)\rho_n\,p_n(\lambda) - \rho_{n-1}\,p_{n-1}(\lambda)$$
$$\text{so } \quad p_{n+1}(\lambda) = 2\left(\frac{\lambda - d}{c}\right)\sigma_{n+1}\,p_n(\lambda) - \sigma_n\sigma_{n+1}\,p_{n-1}(\lambda).$$

Thus, the matrix-vector product $v_n = [p_n(A)]v_0$ may be computed recursively [62, 27] using

$$v_1 = \frac{\sigma_1}{c}(A - dI)v_0, \; \ldots, \; v_{n+1} = 2\,\frac{\sigma_{n+1}}{c}(A - dI)v_n - \sigma_n\sigma_{n+1}\,v_{n-1}.$$

The three vectors $\{v_{n-1}, v_n, v_{n+1}\}$ are taken to be contiguous in the storage reserved for the next three Krylov vectors, thus reducing the need for access to slow memory, a typical bottleneck in supercomputing applications.
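The recursion translates directly into code. The sketch below is our illustration: the test matrix A and the ellipse parameters d, c, and lam_ref are arbitrary stand-ins for values an eigensolver would supply from its spectral estimates.

```python
# Sketch (assumed example) of the scaled Chebyshev recursion above:
# v_n = p_n(A) v_0 with sigma_{n+1} = 1/(2/sigma_1 - sigma_n).
import numpy as np

def cheb_apply(A, v0, d, c, lam_ref, n):
    """Return v_n = p_n(A) v_0 for the ellipse-adapted polynomials (B.6)."""
    sigma1 = c / (lam_ref - d)              # sigma_1
    if n == 0:
        return v0
    sigma = sigma1
    v_prev, v_cur = v0, (sigma1 / c) * (A @ v0 - d * v0)   # v_1
    for _ in range(n - 1):
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)
        v_next = (2.0 * sigma_next / c) * (A @ v_cur - d * v_cur) \
                 - sigma * sigma_next * v_prev
        v_prev, v_cur = v_cur, v_next
        sigma = sigma_next
    return v_cur

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100)) / 10.0
v = cheb_apply(A, rng.standard_normal(100), d=0.0, c=0.5, lam_ref=2.0, n=20)
print(np.linalg.norm(v))
```

In the thesis code the three working vectors are kept contiguous in the Krylov storage, as described above; here the rebinding of v_prev and v_cur merely stands in for that memory layout.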


VITA

Albert T. Galick was born in New Brunswick, New Jersey, on July 23, 1958. He received the B.S. degree in mathematics early in 1980 from the Massachusetts Institute of Technology. Also in 1980, he completed a three-month COBOL training program to become a Member of the Programming Staff at AT&T General Departments. That fall, he entered the Ph.D. program in mathematics as a Teaching Assistant at the University of Illinois at Urbana-Champaign (UIUC), where he received the M.S. degree in 1984. From 1984 to 1986, he programmed in Shell and C for shop-floor tracking systems at AT&T Engineering Research Center, and became a Member of the Technical Staff just before returning to UIUC to pursue a Ph.D. in computer science. From 1986 to 1992, he was a Research Assistant at UIUC's Center for Supercomputing Research and Development, developing FORTRAN programs for semiconductor device simulation. He transferred to the National Center for Computational Electronics at UIUC's Beckman Institute in 1992.

PUBLICATIONS

"Efficient Numerical Simulation of Electron States in Quantum Wires" by Thomas Kerkhoven, Albert T. Galick, Umberto Ravaioli, John H. Arends, and Youcef Saad, in Journal of Applied Physics, 68(7), 1 October 1990, pp. 3461-3469.

"Iterative Solution of the Eigenvalue Problem for a Dielectric Waveguide" by Albert T. Galick, Thomas Kerkhoven, and Umberto Ravaioli, in IEEE Trans. Microwave Theory and Techniques, 40(4), April 1992, pp. 699-705.

