Waveform Iterative Methods for Parallel Solution of Initial Value Problems

Andrew Lumsdaine†   Jeffrey M. Squyres‡   Mark W. Reichelt§

Abstract

The traditional approach for computing the solution to large systems of ordinary differential or differential-algebraic equations typically includes discretization in time with an implicit integration formula. The primary opportunity for parallelization is therefore limited to the linear system solution that is performed at each timestep. Waveform techniques, on the other hand, decompose the problem at the equation level and solve for different components of the system independently, using previous iterates from other processors as inputs. This approach is particularly well suited to message-passing computing environments, especially those with high communication latency, because synchronization and communication take place infrequently and communication consists of large packets of information. In this paper, we present an MPI-based implementation of a waveform relaxation-based semiconductor device simulation program and provide experimental results using this program to solve the time dependent semiconductor drift-diffusion equations on a cluster of workstations.

Presented at the Scalable Parallel Libraries Conference, Mississippi State, MS, October 1994.
† Dept. of Comp. Sci. and Eng., University of Notre Dame, Notre Dame, IN 46556 ([email protected]).
‡ Dept. of Comp. Sci. and Eng., University of Notre Dame, Notre Dame, IN 46556 ([email protected]).
§ The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760 ([email protected]).

1 Introduction

The traditional approach for computing the solution to large systems of ordinary differential or differential-algebraic equations typically includes discretization in time with an implicit integration formula. The primary opportunity for parallelization is therefore in the linear system solution that is performed at each timestep. However, this may result in poor parallel performance on machines with high synchronization and communication costs because the processors must communicate and synchronize (possibly several times) at each timestep. Waveform methods provide an attractive alternative to standard methods. With the waveform approach, the system of equations is decomposed and distributed among the processors of the parallel machine before being discretized in time. Each processor then solves its own subsystem
over the temporal interval of interest using previous iterates from other processors as inputs. Synchronization and communication take place infrequently, and communication consists of large packets of information – entire waveforms. Because of this computation and communication structure, waveform methods are especially well suited to implementation on loosely-coupled parallel machines (such as workstation clusters) using message-passing techniques. One criticism of classical waveform relaxation is that it is not practical because of its typically slow rate of convergence. However, acceleration techniques such as convolution SOR and waveform GMRES can dramatically improve the convergence rate of waveform relaxation for large classes of problems. In fact, accelerated waveform methods can be competitive with standard methods in serial implementations, yet they possess the potential for much better scalability in parallel implementations. In this paper, we briefly review accelerated waveform methods and discuss relevant issues in our MPI implementation of the pWORDS parallel semiconductor device simulation program. Experimental results are presented to show the performance of several different numerical techniques for solving the time dependent semiconductor drift-diffusion equations. These results demonstrate the robustness of waveform techniques in the “communicationhostile” parallel environment found in a typical workstation cluster.
2 Waveform Methods

In this section, we develop a brief description of waveform methods using a model problem. For the model problem, we seek to compute the transient (temporal) solution to a system of N simultaneous ordinary differential equations (ODEs), subject to an initial condition. This type of problem, usually called an initial value problem (IVP), is expressed as:
$$\frac{d}{dt}x(t) + Ax(t) = f(t), \qquad x(0) = x_0, \qquad (1)$$

where $x(t) \in \mathbb{R}^N$, $f(t) \in \mathbb{R}^N$, and $A \in \mathbb{R}^{N \times N}$. The vector $f(t)$ is understood to be the input and $x(t)$ is understood to be the unknown which is to be computed over a time interval $t \in [0,T]$.

The traditional approach for numerically solving the IVP begins by discretizing (1) in time with an implicit integration rule (since large dynamical systems are
typically stiff) and then solving the resulting matrix problem at each time step [1, 2]. This pointwise approach can be disadvantageous for a parallel implementation, especially for distributed memory parallel computers having a high communication latency, since the processors will have to synchronize repeatedly for each timestep.

A more suitable approach to solving the IVP with a parallel computer is to decompose the problem at the differential equation level. That is, the large system is decomposed into smaller subsystems, each of which is assigned to a single processor. The IVP is solved iteratively by solving the smaller IVPs for each subsystem, using fixed values from previous iterations for the variables from other subsystems. This dynamic iteration process is variously known as waveform relaxation (WR), dynamic iteration, or the Picard-Lindelöf iteration [3, 4].

Example 1. Consider an equation-by-equation decomposition of (1). A waveform relaxation algorithm for this process is described by:

Algorithm 1 (Jacobi Waveform Relaxation).

1. Start: Select initial value $x^0(t)$ for $t \in [0,T]$.

2. Iterate: For each waveform iteration $k = 1, 2, \ldots$, until satisfied do: For each equation $i = 1, 2, \ldots, N$ solve the IVP:
$$\frac{d}{dt}x_i^{k+1}(t) + a_{ii}x_i^{k+1}(t) = f_i(t) - \sum_{j \neq i} a_{ij}x_j^k(t), \qquad x_i(0) = x_i^0.$$

Here, at every waveform iteration $k$, each equation in the system is solved for its corresponding component of $x$, using previous values of the other components as input. Note that the left hand side uses only the diagonal element of $A$ (i.e., $a_{ii}$) for the computation of $x_i(t)$. This computational structure is analogous to that of Gauss-Jacobi relaxation for solving linear systems of equations [5]; hence, this particular decomposition and solution process is usually called Jacobi waveform relaxation. Alternative decompositions can be constructed (e.g., Gauss-Seidel), based on alternative splittings of the matrix $A$.
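To make the structure of Algorithm 1 concrete, the following minimal sketch (ours, not the pWORDS implementation) runs Jacobi waveform relaxation for (1) in Python, discretizing each scalar IVP with backward Euler; the matrix, input, and fixed iteration count are illustrative assumptions.

```python
import numpy as np

def jacobi_wr(A, f, x0, T, steps, iters):
    """Jacobi waveform relaxation for dx/dt + A x = f(t), x(0) = x0.

    Each sweep solves every scalar IVP over the whole interval [0, T],
    using the previous iterate's waveforms for the off-diagonal terms.
    A sketch only: backward Euler, fixed timestep, fixed sweep count.
    """
    N = len(x0)
    h = T / steps
    t = np.linspace(0.0, T, steps + 1)
    # x[i] holds the entire waveform of component i over [0, T].
    x = np.tile(np.asarray(x0, float)[:, None], steps + 1)  # flat initial guess
    for _ in range(iters):
        x_old = x.copy()
        for i in range(N):
            for m in range(steps):
                # backward Euler: (x_i[m+1] - x_i[m])/h + a_ii x_i[m+1] = rhs
                rhs = f(t[m + 1])[i] - (A[i] @ x_old[:, m + 1]
                                        - A[i, i] * x_old[i, m + 1])
                x[i, m + 1] = (x[i, m] + h * rhs) / (1.0 + h * A[i, i])
    return t, x

# Example: a small stiff system with a constant input.
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
t, x = jacobi_wr(A, lambda s: np.array([1.0, 0.0]), [0.0, 0.0],
                 T=1.0, steps=100, iters=20)
```

In a parallel setting each processor would own a subset of the components and exchange entire waveforms once per sweep, rather than exchanging values at every timestep.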
Since the time that the WR algorithm was first introduced as an efficient technique for solving the large sparsely-coupled differential equation systems generated by simulation of integrated circuits [6], its properties have been under substantial theoretical and practical investigation. The precise nature of the loose coupling in integrated circuits, which was responsible for the rapid convergence of WR for those examples, was first made clear in [7]. The more formal theory for WR applied to linear time-invariant systems in normal form is described in [3], and theoretical aspects which arise when WR is applied to the more general form $(\frac{d}{dt}C + A)x(t) = f(t)$ are examined in [8], and for the differential-algebraic case in particular in [9]. Since the WR method decomposes the problem before time discretization, it has been used as a tool for examining the stability properties of multirate integration methods [10]. Though the major practical success of WR has been in accelerating the simulation of integrated circuits [11, 12, 13, 14], it has been examined for other applications. For example, the method has been successfully applied to semiconductor device simulation [15] and to chemical engineering problems [16].

As the above body of work makes clear, for WR to be a computational competitor to pointwise methods, its convergence must be accelerated. Approaches to accelerating the convergence of WR include multigrid [17, 18], SOR [3], convolution SOR [19], Krylov-subspace methods [20], adaptive window size selection [21, 22], and the use of shifted iterations [23]. In the following sections, we describe the Krylov-subspace and convolution SOR acceleration techniques, concentrating primarily on their practical aspects.

2.1 Operator Equation Formulation

Many of the techniques used to accelerate WR are waveform extensions of well-known acceleration methods from linear algebra. Hence, to describe these methods, it is useful to put WR into a form that is analogous to the canonical $Ax = b$ formulation of linear algebra problems. In (1), let $A = M - N$ be a splitting of $A$. The waveform relaxation algorithm based on this splitting is expressed in matrix form as
Algorithm 2 (WR for Linear Systems).

1. Initialize: Pick $x^0$.

2. Iterate: For waveform iteration $k = 0, 1, \ldots$ solve
$$\frac{d}{dt}x^{k+1}(t) + Mx^{k+1}(t) = Nx^k(t) + f(t), \qquad x(0) = x_0$$
for $x^{k+1}(t)$ on $[0,T]$.
We can solve for $x^{k+1}(t)$ explicitly [24], that is,
$$x^{k+1}(t) = e^{-Mt}x(0) + \int_0^t e^{-M(t-s)}\left(Nx^k(s) + f(s)\right)ds. \qquad (2)$$

Instead of using this formulation, it is useful to abstract (2) and consider $x$ as an element of a function space (of $N$-dimensional functions) and the integral as an operator on $N$-dimensional functions. Using operator notation, we can write (2) as
$$x^{k+1} = \mathcal{K}x^k + \psi. \qquad (3)$$
Here the variables are defined on the space of $N$-dimensional square integrable functions, which we will denote as $H = L^2([0,T]; \mathbb{R}^N)$. The operator $\mathcal{K}: H \to H$ is defined by
$$(\mathcal{K}x)(t) = \int_0^t e^{-M(t-s)}Nx(s)\,ds,$$
and $\psi \in H$ is given by
$$\psi(t) = e^{-Mt}x(0) + \int_0^t e^{-M(t-s)}f(s)\,ds.$$
More intuitively, application of the operator $\mathcal{K}$ can be roughly interpreted as: "take one step of waveform relaxation." Now, we also know (based on the splitting) that the solution $x$ to (1) will satisfy
$$\frac{d}{dt}x(t) + Mx(t) = Nx(t) + f(t), \qquad x(0) = x_0.$$
Or, using operator notation, we can see that $x$ will satisfy
$$(I - \mathcal{K})x = \psi \qquad (4)$$
where the operator $I$ is the identity operator.

Example 2. Let $M(t)$ be the diagonal part of $A(t)$. Then Algorithm 2 becomes the Jacobi WR algorithm (Algorithm 1).

It can be shown that on finite time intervals, $\mathcal{K}$ has zero spectral radius [25, 26], so that the method defined in (3) converges. A more detailed analysis of convergence can be derived by considering cases for which $\mathcal{K}$ is defined as $T \to \infty$, in which case $\mathcal{K}$ has nonzero spectral radius [3].
2.2 Krylov Subspace Algorithms

In general, the operator $\mathcal{K}$ is not self-adjoint, i.e., $\mathcal{K} \neq \mathcal{K}^*$, so of the various Krylov-subspace methods, only those that are appropriate for non-self-adjoint operators are appropriate for accelerating WR [20]. The waveform GMRES algorithm (WGMRES) — the extension of the generalized minimum residual algorithm (GMRES) [27] to the function space $H$ — is given below. Some theoretical convergence results for WGMRES are given in [20].

Algorithm 3 (Waveform GMRES).

1. Start: Set $r^0 = \psi - (I - \mathcal{K})x^0$, $v^1 = r^0/\|r^0\|$, $\beta = \|r^0\|$.

2. Iterate: For $k = 1, 2, \ldots$, until satisfied do:
$$h_{j,k} = \langle (I - \mathcal{K})v^k, v^j \rangle, \quad j = 1, 2, \ldots, k$$
$$\hat{v}^{k+1} = (I - \mathcal{K})v^k - \sum_{j=1}^{k} h_{j,k}v^j$$
$$h_{k+1,k} = \|\hat{v}^{k+1}\|, \qquad v^{k+1} = \hat{v}^{k+1}/h_{k+1,k}$$

3. Form approximate solution: $x^k = x^0 + V^k y^k$, where $y^k$ minimizes $\|\beta e_1 - \bar{H}^k y^k\|$.

The two fundamental operations in Algorithm 3 are the operator-function product, $(I - \mathcal{K})p$, and the inner product, $\langle \cdot, \cdot \rangle$. When solving (4) in the function space $H$, these operations are as follows:
Operator-Function Product: To calculate the waveform $w = (I - \mathcal{K})p$:

1. Solve the IVP
$$\left(\frac{d}{dt} + M(t)\right)y(t) = N(t)p(t), \qquad y(0) = p_0 = 0$$
for $y(t)$, $t \in [0,T]$; this gives us $y = \mathcal{K}p$.

2. Set $w = p - y$.

Inner Product: The inner product $\langle x, y \rangle$ is given by
$$\langle x, y \rangle = \sum_{i=1}^{N} \int_0^T x_i(t)y_i(t)\,dt.$$
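As a concrete illustration, the sketch below (our own, with mpi4py standing in for the paper's MPI code) computes the inner product of two distributed waveforms: each processor integrates the pointwise products of the components it owns with the trapezoidal rule, and a single global sum combines the results. The array shapes and component distribution are assumptions for the example.

```python
import numpy as np
from mpi4py import MPI

def waveform_inner_product(x_local, y_local, t, comm):
    """<x, y> = sum_i integral_0^T x_i(t) y_i(t) dt for sampled waveforms.

    x_local, y_local: (n_local, n_timepoints) arrays holding this
    processor's components of the waveforms, sampled at times t.
    The local integrals need no communication; one Allreduce
    (a global sum) finishes the inner product.
    """
    dt = np.diff(t)
    local = 0.0
    for i in range(x_local.shape[0]):
        prod = x_local[i] * y_local[i]
        # trapezoidal rule over [0, T] for component i
        local += float(np.sum(0.5 * (prod[1:] + prod[:-1]) * dt))
    return comm.allreduce(local, op=MPI.SUM)

comm = MPI.COMM_WORLD
t = np.linspace(0.0, 1.0, 257)        # 256 timesteps, as in Section 5
x = np.random.rand(4, t.size)         # this rank's 4 components (illustrative)
ip = waveform_inner_product(x, x, t, comm)
```

Note that the processors synchronize once per inner product, rather than once (or more) per timestep as in a pointwise method.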
Recall that the application of $\mathcal{K}$ is basically equivalent to the application of one step of WR. The first part of Step 1 of the operator-function product is therefore equivalent to one step of the standard WR iteration (with zero input), hence WGMRES can be considered as a scheme for accelerating the convergence of WR. This also implies that computing the operator-function product in the Krylov-subspace based methods is as amenable to parallel implementation as WR. Moreover, the inner products required by the WGMRES algorithm can be computed by $N$ separate integrations of the pointwise product $x_i(t)y_i(t)$, which can be performed in parallel, followed by a global sum of the results. Finally, it should be noted that although the initial residual is given by $r^0 = \psi - (I - \mathcal{K})x^0$, it is computed in practice according to
$$r^0 = (\mathcal{K}x^0 + \psi) - x^0.$$
The latter formulation avoids explicit computation of $\psi$, since the expression in parentheses is merely the result obtained by performing a single step of WR.
2.3 Hybrid Methods for Nonlinear Systems

Many interesting applications are nonlinear and cannot be described as linear time invariant systems (like our model problem). We will use the following as a model nonlinear problem:
$$\frac{d}{dt}x(t) + F(x(t), t) = 0, \qquad x(0) = x_0. \qquad (5)$$

In order to use the previously developed methods, which only apply to linear systems, we must first linearize (5). To linearize (5), we apply Newton's method directly to the nonlinear ODE system (in a process sometimes referred to as the waveform Newton method (WN) [28]) to obtain the following iteration:
$$\left(\frac{d}{dt} + J_F(x^m)\right)x^{m+1} = J_F(x^m)x^m - F(x^m), \qquad x^{m+1}(0) = x_0. \qquad (6)$$

Here, $J_F$ is the Jacobian of $F$. We note that (6) is a linear time-varying IVP to be solved for $x^{m+1}$, which can be accomplished with a waveform Krylov-subspace method. Note that the previous development of waveform Krylov-subspace methods extends trivially to the linear time-varying case [20]. The resulting operator Newton/Krylov-subspace algorithm, a member of the class of hybrid Krylov methods [29], is shown below.
Algorithm 4 (Waveform Newton/WGMRES).

1. Initialize: Pick $x^0$.

2. Iterate: For $m = 0, 1, \ldots$ until converged
   Linearize (5) to form (6)
   Solve (6) with WGMRES
   Update $x^{m+1}$

For the WGMRES algorithm applied to solving (6), the required operator-function product can be computed using the formulas in Section 2.2, with the splitting
$$J_F(x^m(t)) = M(t) - N(t).$$

It is also possible to use a "Jacobian-free" approach [30], but the nature of the linearization in the operator-Newton algorithm makes that approach somewhat unreliable [20].

Within each nonlinear (operator-Newton) iteration, the initial residual for the WGMRES algorithm must be computed. Denote the initial guess for $x^{m+1}$ in the WGMRES part of the hybrid algorithm as $x^{m+1,0}$ and the initial residual by $r^{m+1,0}$. If $x^{m+1,0} = x^m$, then the initial residual for the WGMRES algorithm can be computed using a two-step approach as follows:

1. Solve the IVP
$$\frac{d}{dt}y(t) + M(t)y(t) = M(t)x^m(t) - F(x^m(t)), \qquad y(0) = x_0$$
for $y(t)$, $t \in [0,T]$.

2. Set $r^{m+1,0} = y - x^m$.

This approach is similar to that used by WGMRES for linear systems. Methods for approximating $r^{m+1,0}$ so that $M(t)x^m(t)$ does not need to be explicitly calculated can be found in [20].
2.4 Convolution SOR

Successive overrelaxation (SOR) type acceleration of WR was studied in great detail in [3], with somewhat discouraging results. However, convolution SOR (CSOR), a novel type of SOR acceleration, was developed in [19] and circumvents the limitations of waveform SOR as described in [3]. To abbreviate the description of the CSOR algorithm, we will consider the problem of numerically solving the linear initial-value problem (1). A waveform relaxation algorithm using CSOR for solving (1) is shown in Algorithm 5. We take an ordinary Gauss-Seidel WR step to obtain a value for the intermediate variable $\hat{x}_i^{k+1}$. The iterate $x_i^{k+1}$ is obtained by moving $\hat{x}_i^{k+1}$ slightly farther in the iteration direction by convolution with a CSOR parameter function $\omega(t)$. With the convolution, the CSOR method correctly accounts for the temporal frequency-dependence of the spectrum of the Gauss-Jacobi WR operator (e.g., Gauss-Jacobi WR smoothes high frequency components of the error waveform more rapidly than low frequency components) by, in effect, using a different SOR parameter for each frequency.

Algorithm 5 (Gauss-Seidel WR with CSOR Acceleration).

1. Initialize: Pick vector waveform $x^0(t) \in ([0,T]; \mathbb{R}^n)$ with $x^0(0) = x_0$.

2. Iterate: For $k = 0, 1, \ldots$ until converged

   Solve for scalar waveform $\hat{x}_i^{k+1}(t) \in ([0,T]; \mathbb{R})$ with $\hat{x}_i^{k+1}(0) = x_i^0$,
   $$\left(\frac{d}{dt} + a_{ii}\right)\hat{x}_i^{k+1}(t) = f_i(t) - \sum_{j=1}^{i-1} a_{ij}x_j^{k+1}(t) - \sum_{j=i+1}^{n} a_{ij}x_j^k(t).$$

   Overrelax to generate $x_i^{k+1}(t) \in ([0,T]; \mathbb{R})$,
   $$x_i^{k+1}(t) = x_i^k(t) + \int_0^t \omega(\tau)\left[\hat{x}_i^{k+1}(t-\tau) - x_i^k(t-\tau)\right]d\tau. \qquad (7)$$
In a practical implementation, the CSOR method is used to solve a problem that has been discretized in time with a multistep integration method. The overrelaxation convolution integral (7) is replaced with a convolution sum,
$$x_i^{k+1}[m] = x_i^k[m] + \sum_{\ell=0}^{m} \omega[\ell]\left(\hat{x}_i^{k+1}[m-\ell] - x_i^k[m-\ell]\right).$$
Here, $m$ denotes the timestep, $k$ denotes the discretized waveform iteration, and $i$ denotes the component of $x$. Like the standard algebraic SOR method [5, 31], the practical difficulty is in determining an appropriate overrelaxation parameter, in this case the sequence $\omega[m]$. One successful approach for estimating the optimal SOR parameter has been to consider the spectrum of the SOR operator as a function of frequency and to use a power method to estimate an optimal $\omega_{opt}[m]$ [19].

There are a variety of alternative approaches to extending the CSOR algorithm to problems with nonlinearities. We used a waveform extension of relaxation-Newton methods (WRN) for solving nonlinear algebraic problems [28, 32]. For the nonlinear problem of the form of (5), the iteration update equation for the $i$th component of $x$ in a CSOR-Newton algorithm is given by
$$\frac{d}{dt}\hat{x}_i^{k+1}(t) + \frac{\partial F_i(x^k(t))}{\partial x_i}\left(\hat{x}_i^{k+1}(t) - x_i^k(t)\right) + F_i(x^k(t), t) = 0,$$
followed by
$$x_i^{k+1}(t) = x_i^k(t) + \int_0^t \omega(\tau)\left[\hat{x}_i^{k+1}(t-\tau) - x_i^k(t-\tau)\right]d\tau.$$
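The discrete overrelaxation step is simply a running convolution of the Gauss-Seidel correction with the sequence $\omega[m]$. A minimal sketch follows, with an assumed precomputed $\omega$ (the power-method estimation of $\omega_{opt}$ from [19] is not reproduced here):

```python
import numpy as np

def csor_update(x_old, x_hat, omega):
    """Convolution SOR update for one component's discretized waveform.

    x_old[m]: previous iterate x_i^k at timestep m
    x_hat[m]: Gauss-Seidel intermediate at timestep m
    omega[l]: overrelaxation sequence (assumed given)
    Returns x_old[m] + sum_{l=0}^{m} omega[l] * (x_hat[m-l] - x_old[m-l]).
    """
    d = x_hat - x_old                        # correction waveform
    # full convolution, truncated to the first len(d) timesteps
    conv = np.convolve(omega, d)[: len(d)]
    return x_old + conv

# Classical algebraic SOR is recovered when omega is an impulse:
x_new = csor_update(np.zeros(8), np.ones(8), omega=np.array([1.5]))
```

With an impulse sequence the update reduces to the familiar $x + \omega(\hat{x} - x)$, which is one way to sanity-check an implementation.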
3 Semiconductor Device Simulation

3.1 The Drift-Diffusion Equations

Charge transport within a semiconductor device is assumed to be governed by the Poisson equation and by the electron and hole continuity equations:
$$\frac{kT}{q}\nabla\cdot(\epsilon\nabla u) + q(p - n + N_D - N_A) = 0 \qquad (8)$$
$$\nabla\cdot J_n - q\frac{\partial n}{\partial t} + R = 0 \qquad (9)$$
$$\nabla\cdot J_p + q\frac{\partial p}{\partial t} + R = 0 \qquad (10)$$
Here, $u$ is the normalized electrostatic potential in thermal volts, $n$ and $p$ are the electron and hole concentrations, $J_n$ and $J_p$ are the electron and hole current densities, $N_D$ and $N_A$ are the donor and acceptor concentrations, $R$ is the net generation and recombination rate, $q$ is the magnitude of electronic charge, $k$ is Boltzmann's constant, $T$ is temperature, and $\epsilon$ is the spatially-dependent dielectric permittivity [33, 34].

The current densities $J_n$ and $J_p$ are given by the drift-diffusion approximations:
$$J_n = -q\mu_n n\nabla\left(\frac{kT}{q}u\right) + qD_n\nabla n = -kT\mu_n n\nabla u + qD_n\nabla n \qquad (11)$$
$$J_p = -q\mu_p p\nabla\left(\frac{kT}{q}u\right) - qD_p\nabla p = -kT\mu_p p\nabla u - qD_p\nabla p \qquad (12)$$
where $\mu_n$ and $\mu_p$ are the electron and hole mobilities, and $D_n$ and $D_p$ are the diffusion coefficients. The mobilities $\mu_n$ and $\mu_p$ may be computed as nonlinear functions of the electric field $E$, i.e.,
$$\mu_n = \mu_{n_0}\left[1 + \left(\frac{\mu_{n_0}E}{v_{sat}}\right)^{\beta}\right]^{-(1/\beta)},$$
where $v_{sat}$ and $\beta$ are constants and $\mu_{n_0}$ is a doping-dependent mobility [35]. The diffusion constants $D_n$ and $D_p$ are related to the mobilities by $kT/q$ (the thermal voltage) in a pair of equations known as the Einstein relations [36]:
$$D_n = \frac{kT}{q}\mu_n \qquad \text{and} \qquad D_p = \frac{kT}{q}\mu_p.$$

The drift-diffusion approximations (11) and (12) are typically used to eliminate the current densities $J_n$ and $J_p$ from the continuity equations (9) and (10), leaving a differential-algebraic system of three equations in three unknowns, $u$, $n$, and $p$.
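For illustration, the field-dependent mobility model and the Einstein relations translate directly into code; the sketch below is ours, and the numerical constants are placeholders rather than values from pWORDS.

```python
import numpy as np

K_BOLTZMANN = 1.380649e-23   # J/K
Q_ELECTRON = 1.602177e-19    # C

def mobility(E, mu0, v_sat, beta):
    """Field-dependent mobility: mu = mu0 * [1 + (mu0*E/v_sat)**beta]**(-1/beta)."""
    return mu0 * (1.0 + (mu0 * np.abs(E) / v_sat) ** beta) ** (-1.0 / beta)

def einstein_diffusion(mu, T):
    """Einstein relation: D = (kT/q) * mu."""
    return (K_BOLTZMANN * T / Q_ELECTRON) * mu

# Placeholder silicon-like electron parameters (illustrative only):
# mobility in m^2/(V s), field in V/m, saturation velocity in m/s.
mu_n = mobility(E=1.0e5, mu0=0.14, v_sat=1.0e5, beta=2.0)
D_n = einstein_diffusion(mu_n, T=300.0)
```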
3.2 MOSFET Simulation

A key component of modern VLSI circuits is a semiconductor device known as a MOSFET (Metal-Oxide Semiconductor Field Effect Transistor). Although a MOSFET is a three-dimensional structure consisting of several different regions of silicon, oxide and metal, a MOSFET may be modeled by a two-dimensional slice of the device, as shown in Fig. 1. In the figure, thick lines represent metal contacts to the drain, source, substrate and gate oxide regions, to which external voltage boundary conditions are applied.

Figure 1: A two-dimensional slice of a MOSFET device.

Given a rectangular mesh covering a two-dimensional slice of a MOSFET, a common approach to spatially discretizing the device equation system is to use a finite-difference formula to discretize the Poisson equation, and an exponentially-fit finite-difference formula to discretize the continuity equations (this process is known as the Scharfetter-Gummel method [37]). On an $N$-node mesh, this spatial discretization yields a sparsely-coupled
differential-algebraic initial value problem (IVP) consisting of $3N$ equations in $3N$ unknowns, denoted by
$$F_1(u(t), n(t), p(t)) = 0$$
$$\frac{d}{dt}n(t) + F_2(u(t), n(t), p(t)) = 0$$
$$\frac{d}{dt}p(t) + F_3(u(t), n(t), p(t)) = 0$$
subject to initial conditions
$$u(0) = u_0, \qquad n(0) = n_0, \qquad p(0) = p_0,$$
where $t \in [0,T]$, and $u(t), n(t), p(t) \in \mathbb{R}^N$ are vectors of normalized potential, electron concentration, and hole concentration. The initial conditions are assumed to be consistent [38]. Here, $F_1, F_2, F_3: \mathbb{R}^{3N} \to \mathbb{R}^N$ are specified component-wise as
$$F_{1_i}(u_i, n_i, p_i, u_j) = \frac{kT}{q}\sum_j \frac{d_{ij}\epsilon_{ij}}{L_{ij}}(u_i - u_j) - qA_i\left(p_i - n_i + N_{D_i} - N_{A_i}\right)$$
$$F_{2_i}(u_i, n_i, u_j, n_j) = \frac{kT}{qA_i}\sum_j \frac{d_{ij}\mu_{n_{ij}}}{L_{ij}}\left[n_i B(u_i - u_j) - n_j B(u_j - u_i)\right] + R_i$$
$$F_{3_i}(u_i, p_i, u_j, p_j) = \frac{kT}{qA_i}\sum_j \frac{d_{ij}\mu_{p_{ij}}}{L_{ij}}\left[p_i B(u_j - u_i) - p_j B(u_i - u_j)\right] + R_i$$
The summations are taken over the silicon nodes $j$ adjacent to node $i$. As shown in Fig. 2, for each node $j$ adjacent to node $i$, $L_{ij}$ is the distance from node $i$ to node $j$, $d_{ij}$ is the length of the side of the Voronoi box that encloses node $i$ and bisects the edge between nodes $i$ and $j$, and $A_i$ is the area of the Voronoi box. Similarly, the quantities $\epsilon_{ij}$, $\mu_{n_{ij}}$ and $\mu_{p_{ij}}$ are the dielectric permittivity, electron and hole mobility, respectively, on the edge between nodes $i$ and $j$. The Bernoulli function, $B(x) = x/(e^x - 1)$, is used to exponentially fit potential variation to electron and hole concentration variations, and effectively upwinds the current equations.

Figure 2: Illustration of a mesh node $i$, the area $A_i$ of its Voronoi box, and the lengths $d_{ij}$ and $L_{ij}$.
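Evaluating the Bernoulli function naively loses precision near $x = 0$, where both numerator and denominator vanish; a guarded evaluation such as the following sketch (ours, with an assumed series-switch threshold) is the usual remedy.

```python
import numpy as np

def bernoulli(x):
    """Scharfetter-Gummel Bernoulli function B(x) = x / (exp(x) - 1).

    Near x = 0 the direct formula is 0/0; switch to the Taylor
    expansion B(x) ~ 1 - x/2 + x^2/12 there (threshold is a guess).
    """
    x = np.asarray(x, dtype=float)
    small = np.abs(x) < 1.0e-4
    safe = np.where(small, 1.0, x)        # dummy value avoids 0-division
    direct = safe / np.expm1(safe)        # expm1 is accurate for modest x
    series = 1.0 - x / 2.0 + x * x / 12.0
    return np.where(small, series, direct)

# B(0) = 1, and B(-x) = B(x) + x is a useful identity for checking.
print(bernoulli(0.0), bernoulli(2.0), bernoulli(-2.0))
```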
4 Implementation

As reported in [39], the node-by-node Gauss-Jacobi WR algorithm will typically require many hundreds (or even
thousands) of iterations to converge, severely limiting the efficiency of WR-based device simulation. Moreover, assigning each node to a separate processor in a parallel implementation would require on the order of a thousand processors or more (the number of nodes typically necessary for accurate device simulation). Since the WR algorithm is better suited to a coarse-grained MIMD type of architecture, such a fine-grained division of the problem is not necessary or desirable.

Instead of the node-by-node approach, pWORDS collects groups of mesh nodes into blocks and solves the nodes in each block simultaneously. In particular, the nodes in each vertical line of the discretization mesh are grouped together into blocks — this has been shown to be a particularly effective blocking strategy for MOSFET simulation [39]. The main WR routine in the pWORDS program uses a red/black block Gauss-Seidel scheme in which the system of equations governing the nodes in each vertical mesh line is solved using a backward-difference integration formula. The implicit algebraic systems generated by the backward-difference formula are solved with Newton's method, and the linear equation systems generated by Newton's method are solved with sparse Gaussian elimination.
4.1 The pWORDS Program

The pWORDS program uses a Manager/Worker approach in which the supervisory Manager program initiates the actual parallel simulation process by invoking $W$ copies of a Worker program on $W$ client machines. The Manager program reads in the device input file that specifies the geometry and the voltage boundary conditions imposed upon the device, as well as the 2D spatial discretization mesh. Given a rectangular mesh with $C$ vertical lines, the Manager splits the mesh into $W$ non-overlapping blocks consisting of $C/W$ adjacent vertical lines and sends each block to a Worker for solution. In addition to the block of lines that each Worker solves, each Worker also contains storage for the two vertical lines on either side of the contiguous block. These "pseudo-lines" are used to store the solutions generated by the Workers controlling those adjacent vertical lines.
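The line-to-Worker assignment is simple block partitioning; a sketch (our reconstruction with hypothetical names, also handling line counts that W does not divide evenly):

```python
def partition_lines(C, W):
    """Split C vertical mesh lines into W contiguous blocks.

    Returns a list of (first, last) line ranges, one per Worker;
    the first C % W Workers absorb one extra line each. Each Worker
    additionally stores pseudo-lines first-1 and last+1 (when they
    exist) to hold its neighbors' boundary solutions.
    """
    base, extra = divmod(C, W)
    ranges, start = [], 0
    for w in range(W):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(partition_lines(C=10, W=4))   # -> [(0, 2), (3, 5), (6, 7), (8, 9)]
```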
4.2 Parallel Waveform Relaxation

As mentioned earlier, computing the operator-waveform product used by the waveform Krylov-subspace methods requires performing one step of traditional waveform relaxation. Therefore, the operator-waveform product can be accomplished with a function call to the WR routine already implemented within pWORDS — the vertical mesh line preconditioning scheme inherent to the WR routine will automatically be used by the Krylov-subspace method as well.

To describe the algorithm running on each Worker machine, we alternately assign a "color," either red or black, to the vertical mesh lines. The Worker WR iteration, shown in Figure 3 and sketched in code after the figure below, is as follows:

1. Send the black line solutions needed for pseudo-lines on other Workers and compute the solution for those red lines that do not depend on black pseudo-line solutions.
2. Receive black pseudo-line solutions.

3. Compute solutions for the remaining red lines.

4. Send the red line solutions needed for pseudo-lines on other Workers and compute the solution for those black lines that do not depend on red pseudo-line solutions.

5. Receive red pseudo-line solutions.

6. Compute solutions for the remaining black lines.

Communication and computation are overlapped on machines that support asynchronous (non-blocking) communication.
Figure 3: The parallelized waveform relaxation step: (1) receive black from left; (2) solve red; (3) send red to left; (4) receive red from right; (5) solve black; (6) send black to right.
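A minimal sketch of the red/black exchange follows, using mpi4py to stand in for the MPI calls (pWORDS itself is not written in Python; the function names, the solve_lines callback, and the one-dimensional neighbor layout are our assumptions):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = rank - 1, rank + 1                  # neighbor Workers, if any

def wr_sweep(black_bdry, red_bdry, solve_lines):
    """One red/black Worker sweep (steps 1-6); payloads are whole waveforms."""
    reqs = []
    if right < size:                              # (1) ship boundary blacks...
        reqs.append(comm.Isend(black_bdry, dest=right, tag=0))
    solve_lines("red", interior_only=True)        # ...while independent reds solve
    pseudo_black = np.empty_like(black_bdry)
    if left >= 0:                                 # (2) receive black pseudo-lines
        comm.Recv(pseudo_black, source=left, tag=0)
    solve_lines("red", interior_only=False)       # (3) remaining red lines
    if left >= 0:                                 # (4) ship boundary reds
        reqs.append(comm.Isend(red_bdry, dest=left, tag=1))
    solve_lines("black", interior_only=True)
    pseudo_red = np.empty_like(red_bdry)
    if right < size:                              # (5) receive red pseudo-lines
        comm.Recv(pseudo_red, source=right, tag=1)
    solve_lines("black", interior_only=False)     # (6) remaining black lines
    MPI.Request.Waitall(reqs)
    return pseudo_black, pseudo_red
```

Because each message carries entire waveforms for the boundary lines, a sweep costs a handful of large messages rather than one small exchange per timestep.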
4.3 Parallelizing the Pointwise Approach

In our experience, the most efficient serial algorithm for device transient simulation was the pointwise Newton-GMRES algorithm. In this algorithm, block-Jacobi preconditioned GMRES [27] is used to solve the linear systems arising at each Newton iteration of each timestep of an implicit integration formula applied to the device equation system of Section 3.2. The pointwise Newton-GMRES method in pWORDS uses the same vertical-line blocks as the waveform methods. The Jacobian matrix is stored in a block-row-wise fashion, as shown in Figure 4. Although many of these communication operations can potentially be overlapped with local computation, the communication latency may be so large that it cannot be hidden by the (relatively) small amount of computation done at a single timestep. Furthermore, a significant amount of synchronization results from the inner-product computations within each GMRES iteration. An approach such as that given in [40] might prove to be helpful, however.

Figure 4: Partitioning of the system Jacobian matrix for the pointwise approach.
5 Experimental Results

Fig. 5 illustrates the MOSFET simulation used for the numerical experiments in this section. The figure shows a two-dimensional slice of the silicon device along with the external boundary conditions on the potentials $u$ at the terminals. Here, the potentials at the source and substrate terminals are held at 0, the potential $u$ at the gate terminal is held at 5V, and there is a short pulse at the drain terminal. The concentrations $n$ and $p$ at each terminal are held constant at an equilibrium value determined by the background doping concentration at that terminal.

Figure 5: Illustration of the example problem (a 2.2 micron device; drain pulse over the 0 to 512 psec interval) with Dirichlet boundary conditions on the terminal potentials $u$.

The experiments compared parallelized pointwise Newton/GMRES, WRN [28], WN/WGMRES, and WRN with CSOR acceleration. To obtain the CSOR results, the "optimal" CSOR parameter was determined by linearizing the device problem about the solution at time $t = 0$, and fitting $\omega_{opt}(z)$ (as a function of frequency) with a rational function as described in [19]. Also, to diminish the effect of the nonlinearity, the overrelaxation convolution was applied only to the potential variables $u$.
    Method               # Procs   Time
    Pointwise (GMRES)       1      912.04
    Pointwise (GMRES)       2     1220.13
    WRN                     1     2730.96
    WRN                     2     1503.71
    WRN                     4     1230.66
    WRN                     8      861.48
    WN/WGMRES               1      934.16
    WN/WGMRES               2      504.21
    WN/WGMRES               4      349.67
    WN/WGMRES               8      303.14
    WRN with CSOR           1      566.71
    WRN with CSOR           2      308.83
    WRN with CSOR           4      222.65
    WRN with CSOR           8      158.47
Table 1: Execution times (in wall clock seconds) for transient simulation of the example problem on an MPI workstation cluster.

The backward Euler method with 256 fixed timesteps was used for all experiments, on a simulation interval of 512 picoseconds. Although the use of global uniform timesteps precludes multirate integration (one of the primary computational advantages of WR on a sequential machine), it also simplifies the problem of load-balancing. The convergence criterion for all experiments was that the maximum relative error of any terminal current over the simulation interval be less than $10^{-4}$. The initial guess for WRN and for the accelerated waveform methods was produced by performing 8 WR iterations beginning with flat waveforms extended from the initial conditions.

Table 1 shows a comparison of the execution times (in wall clock seconds) required to complete a transient simulation of the example problem using pointwise Newton-GMRES, WRN, WN/WGMRES, and CSOR on an Ethernet-connected MPI [41] workstation cluster consisting of Sun SPARC 5 and SPARC 10 workstations. Despite some small differences in compute node processing power, the mesh was divided as evenly as possible among the nodes — no load balancing was attempted. Note that the execution time for pointwise Newton-GMRES increased when parallelized on two processors.

The qualitative communication behavior of the different types of methods can be illustrated very clearly through the use of plots obtained by using the MPE libraries. Figure 6 shows an Upshot plot of the communication operations required for only a fraction of a timestep of pointwise Newton-GMRES. Note the large number of individual communication operations, as well as the enormous number of synchronizations, that are required. In contrast, Figures 7 and 8 show four iterations of the waveform relaxation
and WGMRES methods, respectively (the CSOR method will exhibit behavior similar to that of WRN). Note that the communication operations take place infrequently and that, although WGMRES does require synchronization, the number of synchronizations is small in comparison to the pointwise method — which is why the waveform methods are able to exhibit a speedup in a parallel implementation on a workstation cluster and the pointwise method is not.

Figure 6: Upshot plot of a fraction of one timestep of pointwise Newton-GMRES.

Figure 7: Upshot plot of four iterations of pointwise waveform relaxation.

Figure 8: Upshot plot of four iterations of WGMRES.
6 Conclusion The experimental results presented here showed that waveform methods are relatively insensitive to the underlying communication network of the parallel environment where they are running. As a result, waveform methods are well suited to environments having very high communication latency, such as workstation clusters. As MIMD computers continue to become more popular, and as workstation clusters continue to become a legitimate parallel computing resource, waveform methods will grow in importance.
Acknowledgments The authors would like to thank Jacob White, Ken Jackson, and Stefan Vandewalle for many helpful discussions.
References

[1] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations. Automatic Computation, Englewood Cliffs, New Jersey: Prentice-Hall, 1971.

[2] E. Hairer, S. P. Norsett, and G. Wanner, Solving Ordinary Differential Equations, vol. 1 and 2. New York: Springer-Verlag, 1987.

[3] U. Miekkala and O. Nevanlinna, "Convergence of dynamic iteration methods for initial value problems," SIAM J. Sci. Stat. Comp., vol. 8, pp. 459–467, 1987.

[4] J. K. White and A. Sangiovanni-Vincentelli, Relaxation Techniques for the Simulation of VLSI Circuits. Engineering and Computer Science Series, Norwell, Massachusetts: Kluwer Academic Publishers, 1986.

[5] R. S. Varga, Matrix Iterative Analysis. Automatic Computation Series, Englewood Cliffs, New Jersey: Prentice-Hall Inc, 1962.

[6] E. Lelarasmee, A. E. Ruehli, and A. L. Sangiovanni-Vincentelli, "The waveform relaxation method for time domain analysis of large scale integrated circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 1, pp. 131–145, July 1982.

[7] F. Odeh, A. Ruehli, and C. Carlin, "Robustness aspects of an adaptive wave-form relaxation scheme," in Proceedings of the IEEE Int. Conf. on Circuits and Comp. Design, (Rye, N.Y.), pp. 396–440, October 1983.
[8] O. Nevanlinna and F. Odeh, “Remarks on the convergence of the waveform relaxation method,” Numerical Functional Anal. Optimization, vol. 9, pp. 435–445, 1987. [9] U. Miekkala, “Dynamic iteration methods applied to linear DAE systems,” J. Comput. Appl. Math., vol. 25, pp. 133– 151, 1989. [10] J. White and F. Odeh, “A connection between the convergence properties of waveform relaxation and the A-stability of multirate integration methods,” in Proceedings of the NASECODE VII Conference, (Copper Mountain, Colorado), 1991.
[24] T. Kailath, Linear Systems. Englewood Cliffs: Prentice-Hall, 1980. [25] R. Kress, Linear Integral Equations. New York: SpringerVerlag, 1989. [26] J. B. Conway, A Course in Functional Analysis, Second Edition. New York: Springer-Verlag, 1990. [27] Y. Saad and M. Schultz, “GMRES: A generalized minimum residual algorithm for solving nonsymmetric linear systems,” SIAM J. Sci. Statist. Comput., vol. 7, pp. 856–869, July 1986.
[11] D. Dumlugol, The Segmented Waveform Relaxation Method for Mixed-Mode Simulation of Digital MOS Circuits. PhD thesis, Katholieke Universiteit Leuven, October 1986.
[28] R. Saleh and J. White, “Accelerating relaxation algorithms for circuit simulation using waveform-Newton and step-size refinement,” IEEE Trans. CAD, vol. 9, no. 9, pp. 951–958, 1990.
[12] S. Mattison, CONCISE: A concurrentcircuit simulation program. PhD thesis, Lund Institute of Technology, Lund, Sweden, 1986.
[29] P. Brown and Y. Saad, “Hybrid Krylov methods for nonlinear systems of equations,” SIAM J. Sci. Statist. Comput., vol. 11, pp. 450–481, May 1990.
[13] F. Odeh, A. Ruehli, and P. Debefve, “Waveform techniques,” in Circuit Analysis,Simulation and Design,Part 2 (A.Ruehli, ed.), pp. 41–127, North-Holland, 1987.
[30] P. N. Brown and A. C. Hindmarsh, “Matrix-free methods for stiff systems of ODE’s,” SIAM J. Numer. Anal., vol. 23, pp. 610–638, June 1986.
[14] J. White, F. Odeh, A. Vincentelli, and A. Ruehli, “Waveform relaxation: Theory and practice,” Trans. of the Society for Computer Simulation, vol. 2, pp. 95–133, June 1985.
[31] D. M. Young, Iterative Solution of Large Linear Systems. Orlando, FL: Academic Press, 1971.
[15] M. Reichelt, J. White, J. Allen, and F. Odeh, “Waveform relaxation applied to transient device simulation,” in Proceedings of the IEEE Int. Conf. on Circuits and Systems, (Espoo, Finland), pp. 396–440, October 88. [16] A. Skjellum, Concurrent dynamic simulation: Multicomputers algorithms research applied to ordinary differentialalgebraic process systems in chemical engineering. PhD thesis, California Institute of Technology, May 1990. [17] C.Lubich and A. Osterman, “Multigrid dynamic iteration for parabolic problems,” BIT, vol. 27, pp. 216–234, 1987. [18] S. Vandewalle and R. Piessens, “Efficient parallel algorithms for solving initial-boundary value and time-periodic parabolic partial differential equations,” SIAM J. Sci. Statist. Comput., vol. 13, pp. 1330–1346, November 1992. [19] M. Reichelt, Accelerated Waveform Relaxation Techniques for the Parallel Transient Simulation of Semiconductor Devices. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1993. [20] A. Lumsdaine, Theoretical and Practical Aspects of Parallel Numerical Algorithms for Initial Value Problems, with Applications. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1992. [21] B. Leimkuhler, “Estimating waveform relaxation convergence,” SIAM J. Sci. Comput., vol. 14, no. 4, pp. 872–889, 1993. [22] B. Leimkuhler and A. Ruehli, “Rapid convergence of waveform relaxation,” Applied Numerical Mathematics, vol. 11, pp. 221–224, 1993. [23] R. D. Skeel, “Waveform iteration and the shifted Picard splitting,” SIAM J. Sci. Statist. Comput., vol. 10, no. 4, pp. 756– 776, 1989.
[32] J. M. Ortega and W. C. Rheinbolt, Iterative Solution of Nonlinear Equations in Several Variables. Computer Science and Applied Mathematics, New York: Academic Press, 1970. [33] R. Bank, W. Coughran, Jr., W. Fichtner, E. Grosse, D. Rose, and R. Smith, “Transient simulation of silicon devices and circuits,” IEEE Trans. CAD, vol. 4, pp. 436–451, October 1985. [34] S. Selberherr, Analysis and Simulation of Semiconductor Devices. New York: Springer-Verlag, 1984. [35] R. S. Muller and T. I. Kamins, Device Electronics for Integrated Circuits. New York: John Wiley and Sons, 1986. [36] P. E. Gray and C. L. Searle, Electronic Principles: Physics, models and circuits. New York: Wiley, 1969. [37] D. Scharfetter and H. Gummel, “Large-signal analysis of a silicon read diode oscillator,” IEEE Transactions on Electron Devices, vol. ED-16, pp. 64–77, January 1969. [38] K. E. Brenan, S. L. Campbell, and L. R. Petzold, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations. New York: North Holland, 1989. [39] M. Reichelt, J. White, and J. Allen, “Waveform relaxation for transient two-dimensional simulation of MOS devices,” in International Conference on Computer Aided-Design, (Santa Clara, California), pp. 412–415, November 1989. [40] E. D. Sturler, “A parallel restructured version of GMRES(m),” in Proceedings of the Copper Mountain Conference on Iterative Methods, (Copper Mountain, Colorado), 1992. [41] M. P. I. Forum, “MPI: A Message Passing Interface,” in Proc. of Supercomputing ’93, pp. 878–883, IEEE Computer Society Press, November 1993.