An Incomplete Domain Decomposition Preconditioning Method for Nonlinear Nodal Kinetics Calculations

Han Gyu Joo Purdue University School of Nuclear Engineering West Lafayette, Indiana 47907-1290 [email protected] and Thomas J. Downar Purdue University School of Nuclear Engineering West Lafayette, Indiana 47907-1290 [email protected]

Mailing Address: Prof. T. Downar 1290 Nuclear Engineering Purdue University W. Lafayette, IN 47907-1290

Number of Pages:
Title and Abstract: 2
Text: 26
Figures: 3
Tables: 3

Abstract - Methods are proposed for the efficient parallel solution of the nonlinear nodal kinetics equations. Because the "two-node" calculation in the nonlinear nodal method is naturally parallelizable, the majority of the effort here was devoted to the development of parallel methods for solving the coarse mesh finite difference (CMFD) problem. A preconditioned Krylov subspace method (Bi-Conjugate Gradient Stabilized) was chosen as the iterative algorithm for the CMFD problem, and an efficient parallel preconditioning scheme was developed based on domain decomposition techniques. An incomplete LU factorization method is first formulated for the coefficient matrices representing each three-dimensional subdomain, and the coupling between subdomains is then approximated by incorporating only the effect of the non-leakage terms of the neighboring subdomains. The methods are applied to fixed source problems created from the IAEA three-dimensional benchmark problem. The effectiveness of the incomplete domain decomposition preconditioning on a multiprocessor is evidenced by the small increase in the number of iterations as the number of subdomains increases. Through application to both CMFD-only and nodal calculations, it is demonstrated that speedups as large as 49 with 96 processors are attainable in nonlinear nodal kinetics calculations.

I. INTRODUCTION

The prospect of three-dimensional reactor core simulation with system codes such as RELAP, RETRAN, and TRAC has motivated interest in high performance computing. The calculation of the multi-dimensional flux distribution during a transient simulation contributes a substantial fraction of the overall computational burden. The motivation for the work here is to investigate the use of parallel processing to achieve a significant reduction in the execution time of the reactor kinetics portion of the system simulation calculation.

Discretization in time of the time-dependent neutron diffusion equation yields a fixed source type problem at every time point. The flux distribution can be determined using one of several advanced nodal methods. With the exception of the work of Kirk and Azmy1, however, none of these methods has been adapted to parallel computing. Kirk and Azmy developed the "nodal integral method" so that it was naturally adaptable to parallel computing and achieved excellent performance for a two-dimensional problem on a shared memory machine. However, their method performed poorly on distributed memory architectures because of the need for frequent data transfers to a single processor. Among the other nodal methods, the nonlinear iteration nodal method developed by Smith2 appears best suited for parallel computation because it is based on a naturally parallel "two-node" calculation to incorporate higher order local coupling effects. The principal obstacle to parallelization of this method then becomes the other major calculation in the method, the coarse mesh finite difference (CMFD) calculation. The CMFD problem is elliptic in nature since every node in the problem domain is coupled to all the other nodes, and it is thus very difficult to parallelize efficiently. In the work here, we concentrate primarily on a parallel solution method for the CMFD problem.

The use of a coarse mesh in the CMFD formulation is an important consideration in the choice of the basic iteration algorithm of the parallel solution method. This is because the number of unknowns in a coarse mesh formulation is much smaller than in the corresponding fine mesh formulation. The granularity,

defined as the ratio of the average computation time to the communication time, is much smaller in a coarse mesh formulation, which in turn reduces the parallel efficiency. Thus SOR-type methods such as the Red-Black Line SOR or the Cyclic Chebyshev Semi-Iterative (CCSI) method, which involve only a small amount of computation per data transfer, would not be appropriate candidates for a parallel coarse mesh calculation. Some researchers3-6 have attempted to parallelize the solution of the neutron diffusion equation for a fine mesh finite difference formulation based on SOR-type methods. However, none of these methods would achieve a good parallel efficiency for the CMFD problem, particularly when the machine floating point operation speed is high. Because Krylov subspace methods can provide much higher granularity, they are clearly superior for CMFD problems and were used in the work here.


In the application of a Krylov subspace method, proper preconditioning of the linear system becomes very important because the convergence rate, as well as the operation count, is strongly dependent on the quality of the preconditioner. If a proper preconditioning scheme is used, these methods can perform better than the conventional SOR-type methods, even in normal single processor applications. Yang et al. examined various Krylov subspace algorithms and preconditioning schemes in the solution of two-group, two-dimensional, fine mesh reactor kinetics problems.7 Their results indicate that the Biconjugate Gradient Stabilized (BiCGSTAB; see the Appendix) method8 with a Blockwise Incomplete LU (BILU) preconditioner9 performs better than the other methods considered, including the SOR method. From the standpoint of parallel computation, however, the BILU preconditioner does not provide much parallelism, since the solution of the preconditioner equation in a block LU form involves a forward and a backward substitution which are completely serial in nature. To resolve this problem, Meurant introduced a domain decomposition preconditioning method which limits the BILU factorization to the coefficient matrices for each subdomain problem and thus enables parallel solution of the preconditioner equation in each subdomain.10 In his method, the coupling between the subdomains is incorporated in terms of the solutions at the interfacial regions. However, because the subdomain problems need to be solved twice - once to provide the boundary conditions for the interfacial region problems and once more with the resulting interface solutions - the parallel efficiency of

Meurant's method is limited to 50%, even without any communication overhead. In order to improve the parallel efficiency, the subdomain coupling term should be estimated in a more efficient way.

Based on the above considerations, the research reported here focused on developing an efficient preconditioning scheme for the CMFD problem in which the preconditioned BiCGSTAB method is employed as the basic iteration algorithm and the three-dimensional problem domain is decomposed into several subdomains. In the following, the linear system is first formulated from the time dependent neutron diffusion equation. Then a domain decomposition preconditioning scheme is described, characterized by a BILU factorization method for three-dimensional subdomain problems and by an efficient method of incorporating the subdomain coupling. The parallel solution method is first examined for the CMFD problem only, to determine the effectiveness of the preconditioning scheme. The parallel application is then extended to actual nonlinear nodal calculations.

II. FORMULATION OF THE LINEAR SYSTEM

The time dependent behavior of the two-group neutron fluxes is governed by the following balance equations, given here in terms of standard notations and with the subscripts p and d denoting the prompt and delayed neutrons, respectively:

(1) and

(2)
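For reference, two-group balance equations with delayed neutron precursors are typically written in a form such as the following; the notation here is a generic sketch and not necessarily the exact form of Eqs. (1) and (2):

\[
\frac{1}{v_g}\frac{\partial \phi_g}{\partial t}
= \nabla \cdot D_g \nabla \phi_g - \Sigma_{rg}\phi_g
+ \sum_{g' \neq g} \Sigma_{s,g' \to g}\,\phi_{g'}
+ \chi_{pg}(1-\beta)\sum_{g'} \nu\Sigma_{fg'}\,\phi_{g'}
+ \chi_{dg}\sum_{k} \lambda_k c_k ,
\]
\[
\frac{\partial c_k}{\partial t} = \beta_k \sum_{g'} \nu\Sigma_{fg'}\,\phi_{g'} - \lambda_k c_k ,
\qquad k = 1,\dots,6 .
\]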

When we discretize Eq. (1) in space using the nonlinear nodal method, the net neutron current on the interface is represented by the following equation in terms of the node fluxes of the left and right nodes:


(3) where DgF is the coupling coefficient determined by the finite difference approximation of -Dg∇φg and DgN is the correction term to the coupling coefficient which is determined by the two-node nodal solutions and is updated during the nonlinear iteration.11 Based on this approximation, the discretized equations can be written using matrix operators as:

(4)

where the migration and loss operator Mg is a standard seven-stripe matrix and all the other matrix operators (Fpg, Fdkg, and T) are diagonal. The vectors φg and ck are column vectors of dimension N, where N is the number of nodes. By applying a time differencing technique such as the implicit method, and after eliminating ck by solving the precursor equation analytically assuming a time dependence of the group fluxes,11 it is possible to reduce Eq. (4) to the following fixed source type problem:

(5)

where the quantities with a tilde indicate appropriately modified matrix operators. Eq. (5) is the linear system to be solved at every time point. Because the coefficient matrix of Eq. (5) is highly nonsymmetric, it is not convenient for constructing a good preconditioner. We can rearrange the linear system to be more symmetric by placing the fast and thermal group fluxes of each node consecutively in the solution vector. The reordered system is

represented as follows:

(6) where

(7)

and where the submatrix A(d), representing the coefficient matrix in dimension d, is defined as:

(8)

with n being the number of planes (K), rows (J), or columns (I) depending on the value of d. Note that the matrix A(d) is a block tridiagonal matrix while L(d) and U(d) are simple diagonal matrices consisting of the coupling coefficients, lgd and ugd, respectively. At the lowest dimension (d=1), the smallest submatrix is defined as follows, with m being a general index representing the spatial coordinate:

(9)

where

(10)

In the following sections, we will describe a method for solving the CMFD fixed source problem represented by Eqs. (6)-(10), assuming that the nodal correction terms to the coupling coefficients, DgN, are known.
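The nested structure implied by Eqs. (6)-(10) can be summarized as follows. This is a sketch of the block tridiagonal form described in the text (index placement assumed); the entries of the lowest-level 2x2 node blocks are those of Eqs. (9) and (10):

\[
A^{(d)} =
\begin{bmatrix}
A_1^{(d-1)} & U_1^{(d)} & & \\
L_2^{(d)} & A_2^{(d-1)} & U_2^{(d)} & \\
& \ddots & \ddots & \ddots \\
& & L_n^{(d)} & A_n^{(d-1)}
\end{bmatrix},
\qquad n = K, J, I \ \text{for}\ d = 3, 2, 1,
\]

with each L_m^{(d)} and U_m^{(d)} a diagonal matrix of coupling coefficients and each A_m^{(0)} a 2x2 block coupling the fast and thermal fluxes of a single node.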

III. PRECONDITIONING SCHEME

In preconditioned Krylov subspace methods, the linear system to be repetitively solved involves a preconditioner (Pz=r, where P is the preconditioner). The preconditioner is an approximation of the original matrix and should be constructed such that solving the preconditioner equation, Pz=r, is much easier than solving the original linear system, Ax=b. One of the methods for constructing an easy-to-solve preconditioner is incomplete LU factorization. The blockwise incomplete LU (BILU) preconditioner has been widely used in two-dimensional problems. We will first describe the extension of the BILU factorization to three-dimensional coarse mesh neutron diffusion problems. The BILU preconditioner will be applied to each subdomain problem. We will then describe a domain decomposition method to efficiently incorporate the subdomain coupling effect in the global preconditioner.

III.A. BILU3D Preconditioner

Consider a block tridiagonal matrix A(d) (the superscript d omitted in the following for brevity) as defined in Eq. (8). We can factorize the matrix completely by employing the following recursion relation:

(11)

to obtain the LU factor:


(12) where L and U are the strictly lower and upper triangular parts of A, and ∆ is a block diagonal matrix consisting of ∆m’s. The problem associated with the complete factorization, however, is that it is difficult to find the inverse of ∆m since ∆m becomes a full matrix. In the following we consider two approximations to avoid this problem.
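For reference, the standard block LU recursion for a block tridiagonal matrix, which Eqs. (11) and (12) presumably follow (index conventions assumed), reads

\[
\Delta_1 = A_1, \qquad
\Delta_m = A_m - L_m \Delta_{m-1}^{-1} U_{m-1}, \quad m = 2,\dots,n,
\]
\[
A = (\Delta + L)\,\Delta^{-1}\,(\Delta + U),
\]

where L_m and U_{m-1} are the sub- and super-diagonal blocks coupling block row m to block row m-1.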

III.A.1. Symmetric Gauss-Seidel Factorization

The crudest way to simplify the factorization process is to neglect completely the L_m ∆_{m-1}^{-1} U_{m-1} term in Eq. (11), so that the matrix ∆ becomes the block diagonal matrix D which consists of the submatrices Am's. In that case, the LU factor, denoted by P, becomes a symmetric Gauss-Seidel (SGS) LU factor and can be represented by the sum of the original matrix A and a remainder matrix R as follows:

(13)

This factorization can be a good approximation only when the norm of the remainder matrix LD^{-1}U is small compared to that of the original matrix A. In a coarse mesh formulation of a typical reactor problem, this condition is satisfied for A=A(3), in which case the axial coupling effect represented by L and U is weaker than the radial coupling effect represented by D. This follows from the fact that the axial mesh size is normally chosen to be larger than the radial mesh size because the reactor is less heterogeneous in the axial than in the radial direction. The larger axial mesh makes L and U small compared to D, and thus LD^{-1}U becomes much smaller than A. Therefore we can take the symmetric Gauss-Seidel LU factor, P, as the preconditioner which approximates A(3). Rewriting Eq. (13) in terms of the submatrices, we obtain the following LU factor as the global preconditioner:


(14)

where

(15)

and

(16)

The preconditioner equation involving these LU factors can be solved by forward and backward substitution. In the substitution processes it is necessary to solve smaller linear systems involving Ak(2)’s which are the coefficient matrices for the planes. Because these matrices are block penta-diagonal matrices which are not easy to solve and because the preconditioner equation is to be solved repetitively, it is more appropriate to factorize the Ak(2) matrices.
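As an illustration of the plane-by-plane substitution just described, the following Python sketch applies a preconditioner of the form P = (D+L)D^{-1}(D+U), with the diagonal blocks taken as the plane matrices Ak(2) and the off-diagonal blocks as the diagonal axial coupling matrices. It is a minimal sketch rather than the authors' implementation; plane_solve stands in for whatever inner plane solver is used (e.g., the BILU factorization of Section III.A.2).

def sgs_apply(r, plane_solve, L, U):
    """Solve P z = r with P = (D + L) D^{-1} (D + U) by block forward/backward
    substitution over the K axial planes.

    r           : list of K right-hand-side vectors, one per plane
    plane_solve : plane_solve(k, rhs) returns an (approximate) solution of A_k^(2) x = rhs
    L, U        : lists of diagonal axial coupling coefficients (1D arrays);
                  L[0] and U[K-1] are unused (no planes beyond the boundaries)
    """
    K = len(r)
    # Forward sweep: (D + L) w = r, i.e. A_k w_k = r_k - L_k w_{k-1}
    w = [None] * K
    w[0] = plane_solve(0, r[0])
    for k in range(1, K):
        w[k] = plane_solve(k, r[k] - L[k] * w[k - 1])
    # Backward sweep: (D + U) z = D w, i.e. z_k = w_k - A_k^{-1} (U_k z_{k+1})
    z = [None] * K
    z[K - 1] = w[K - 1]
    for k in range(K - 2, -1, -1):
        z[k] = w[k] - plane_solve(k, U[k] * z[k + 1])
    return z

In a serial prototype, plane_solve could simply be a direct solve of the plane matrix; in the method described here it is the forward/backward substitution with the BILU factor Pk(2).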

III.A.2. Blockwise Incomplete LU Factorization

The factorization of A(2) (plane index k omitted) by the recurrence relation (11) involves submatrices

which have a dimension of two times the number of nodes in the x-direction. To prevent these submatrices from becoming full, we can neglect some of the elements of ∆_{j-1}^{-1} (the general index m is replaced by the row index j) such that ∆_j has the same sparse structure as A_j. Denoting the matrix obtained by neglecting some elements of ∆_{j-1}^{-1} by Ω(∆_{j-1}^{-1}), the recursion relation is modified as:

(17)

The concern now is how to obtain Ω(∆_{j-1}^{-1}). This requires finding the elements of the inverse at the locations where A_j has nonzero elements. Noting that A_j is a smaller block tridiagonal matrix for a one-dimensional row, we need to find the blocks on the three main block diagonals of ∆_{j-1}^{-1}. To find these blocks efficiently without calculating the other, unnecessary elements of the inverse, we use the idea of the so-called "approximate block inverse" method12, which utilizes the LU factor of a matrix to find specific entries of the inverse of that matrix. The LU factorization of ∆_{j-1} can be done completely since it only involves inverses of 2x2 matrices, which are easily obtainable. Suppose that the LU factor is available for ∆_{j-1} in the following form, in which Γ is a block diagonal matrix consisting of 2x2 matrices, Γi's, and H and F are the strictly lower and upper triangular parts of ∆_{j-1}, respectively:

(18)

By manipulating Eq. (18), we can obtain the following relations for ∆_{j-1}^{-1}, which is denoted by E for brevity:

(19)

The two terms in the first relation in Eq. (19) appear in terms of submatrices as the following equations:


(20)

and

(21)

Note that the first term as shown in Eq. (20) contains no upper triangular blocks. Using Eqs. (19) through (21), we can derive the following recursion relation for the diagonal and upper diagonal blocks of E:

(22)

The lower diagonal blocks E_{i,i-1} appearing in the above recursion relation can be similarly obtained as follows by using the second relation for E in Eq. (19):

(23)

By using the submatrices generated by these recursion relations, Ω(∆_{j-1}^{-1}) is now formed as a block tridiagonal matrix, and the blockwise incomplete factorization of Ak(2) can be completed. Ak(2) is now approximated by its BILU factor Pk(2) as:

(24)

By replacing Ak(2) with Pk(2) from Eq. (24) in Eqs. (15) and (16), a preconditioner applicable to a three-dimensional problem is obtained. We will refer to it as the BILU3D preconditioner. This preconditioner can be applied to any subdomain problem as well as to a single entire domain problem. The linear system involving the BILU3D preconditioner is solved by three levels of forward and backward substitutions, one for each dimension. In previous work,13,14 it was shown that the execution time of the preconditioned BiCGSTAB algorithm using the BILU3D preconditioner was comparable to that of conventional methods such as Line SOR and CCSI in the solution of CMFD fixed source problems on a single processor.
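The following Python fragment sketches the idea behind Eq. (17) for a single plane matrix; it is an illustration rather than the authors' implementation. The block tridiagonal recursion is carried out with the approximate inverse Ω(∆_{j-1}^{-1}) restricted to its tridiagonal 2x2 blocks. For clarity the sketch forms the full inverse and then masks it, whereas the method described above computes only the needed blocks through the recursions of Eqs. (18)-(23).

import numpy as np

def tridiag_block_mask(M, b=2):
    """Return M with all b x b blocks outside the block tridiagonal pattern zeroed."""
    n = M.shape[0] // b
    out = np.zeros_like(M)
    for i in range(n):
        for k in (i - 1, i, i + 1):
            if 0 <= k < n:
                out[i*b:(i+1)*b, k*b:(k+1)*b] = M[i*b:(i+1)*b, k*b:(k+1)*b]
    return out

def bilu_plane_factor(A_rows, L_rows, U_rows, b=2):
    """Blockwise incomplete factorization of a plane matrix in the spirit of Eq. (17).

    A_rows[j] : coefficient matrix of row j (block tridiagonal with b x b blocks)
    L_rows[j] : coupling of row j to row j-1 (diagonal matrix); L_rows[0] unused
    U_rows[j] : coupling of row j to row j+1 (diagonal matrix); U_rows[-1] unused
    Returns the list of Delta_j blocks of the incomplete factorization."""
    Delta = [A_rows[0].copy()]
    for j in range(1, len(A_rows)):
        # Omega(Delta_{j-1}^{-1}): keep only the tridiagonal blocks of the inverse
        E = tridiag_block_mask(np.linalg.inv(Delta[j - 1]), b)
        Delta.append(A_rows[j] - L_rows[j] @ E @ U_rows[j - 1])
    return Delta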

III.B. Domain Decomposition Preconditioning

The major concern in domain decomposition preconditioning is how to incorporate the coupling between subdomains. In the original domain decomposition preconditioning method,10 the subdomain coupling is provided via interfacial regions introduced between the subdomains. Here, all subdomains are solved first without any subdomain coupling effect to provide boundary conditions for the subsequent interfacial region calculations. The subdomain problems are then solved once more with the boundary conditions available from the interface solutions. This method ensures good subdomain coupling and is recommended for strongly coupled problems. However, the parallel efficiency of this method is limited to 50% even without the communication overhead because the subdomain problems must be solved twice, which is not necessary for the single domain case. Therefore we consider a new approach to incorporating the subdomain coupling to improve the parallel efficiency.

Consider a two-subdomain case in which the problem domain is divided into top and bottom subdomains. In this case the coefficient matrix can be partitioned as follows, with p=K/2, q=p+1:

(25)

Given this partition, a linear system Ax=b can be replaced by two smaller linear systems:

(26)

where

(27)

Here the coupling between the two subdomains is represented by the second terms (via Up x21 and Lq x1p) on the right hand sides. Because of these terms, the two equations cannot be solved independently unless some sort of approximation is introduced. One possibility would be to completely neglect these terms when solving the preconditioner equation. In this case, the preconditioner consists of only the two diagonal blocks and is referred to as a block diagonal (BD) preconditioner. With the block diagonal preconditioner, however, the penalty for neglecting the subdomain coupling would appear as an increase in the number of iterations of the BiCGSTAB algorithm, which reduces the parallel efficiency.

One approach to estimating x21 without completely solving the two coupled equations is to approximate B2 with only its diagonal blocks by neglecting the off-diagonal coupling terms. In this case x21 can be obtained by solving an equation involving only the first diagonal block of B2, which is Aq(2). If we further approximate Aq(2) by Dq(2), taking only its basic 2x2 diagonal blocks, x21 can then be easily obtained. A similar approach can also be applied to find an approximate solution of x1p.

Once the estimates of the true solutions x21 and x1p are obtained by solving the following equations,

(28)

the two equations in Eq. (26) can be solved independently by the two processors assigned to the subdomains. For the calculation in each subdomain, we replace the matrix Bk with its BILU3D factor. Thus the actual equations to be solved in parallel become:

(29)

where δi′ is obtained by replacing xij in Eq. (27) with xij′. Eqs. (28) and (29) complete the incomplete domain decomposition preconditioning. The preconditioner matrix that represents the preconditioning scheme can be explicitly expressed as:

(30)

The incomplete domain decomposition preconditioning scheme does not involve much computational overhead because solving Eqs. (28) is trivial. However, it does involve some communication overhead to transfer the approximate solutions to the neighboring subdomains. Approximating the subdomain coupling with the solution of Eqs. (28) can be interpreted as a first order approximation that neglects the leakage effect, since it accounts only for the local cross section effect. If the leakage effect neglected here is significant, the number of BiCGSTAB iterations would increase significantly. In the next section, we will see that neglecting the leakage effect at the subdomain

interfaces is a reasonable approximation for the coarse mesh formulation, in which the coupling between nodes is weaker than in the corresponding fine mesh formulation. The incomplete domain decomposition preconditioning scheme presented for the two-subdomain case can be easily generalized to any number of subdomains. In the following section, the application of the preconditioning scheme to various types of domain decomposition is presented.
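To make the two-subdomain procedure of Eqs. (28)-(30) concrete, the following Python sketch estimates the interface unknowns from the local 2x2 diagonal blocks, exchanges them, and then performs the two subdomain solves independently. The data layout and the helper bilu3d_solve are assumptions made for this sketch; they are not taken from the paper.

import numpy as np

def solve_2x2_blocks(blocks, rhs):
    """Solve a block diagonal system with 2x2 (fast/thermal) blocks.
    blocks has shape (N, 2, 2); rhs has shape (2N,)."""
    x = np.empty_like(rhs)
    for n, B in enumerate(blocks):
        x[2*n:2*n+2] = np.linalg.solve(B, rhs[2*n:2*n+2])
    return x

def idd_apply(r1, r2, bilu3d_solve, Dp_blocks, Dq_blocks, Up, Lq):
    """One application of the incomplete domain decomposition preconditioner for a
    bottom/top split (in the spirit of Eqs. (28)-(29)).

    r1, r2       : lists of per-plane right-hand-side vectors for subdomains 1 and 2
    bilu3d_solve : bilu3d_solve(i, rhs) applies the BILU3D factor of subdomain i
    Dp_blocks    : 2x2 diagonal blocks of plane p (last plane of subdomain 1)
    Dq_blocks    : 2x2 diagonal blocks of plane q (first plane of subdomain 2)
    Up, Lq       : diagonal interface coupling coefficients (1D arrays)
    """
    # Eq. (28): cheap interface estimates using only the local 2x2 diagonal blocks
    # (leakage terms neglected); in the parallel code these vectors are exchanged
    # between the processors owning the two subdomains.
    x1p = solve_2x2_blocks(Dp_blocks, r1[-1])
    x21 = solve_2x2_blocks(Dq_blocks, r2[0])
    # Eq. (29): independent subdomain solves with the coupling moved to the RHS
    rhs1 = [v.copy() for v in r1]
    rhs2 = [v.copy() for v in r2]
    rhs1[-1] = rhs1[-1] - Up * x21
    rhs2[0] = rhs2[0] - Lq * x1p
    return bilu3d_solve(1, rhs1), bilu3d_solve(2, rhs2)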

IV. PARALLEL CMFD CALCULATIONS

In order to examine the effectiveness of the incomplete domain decomposition preconditioning scheme, we first solved a CMFD-only problem which does not include the coupling coefficient correction term, DgN of Eq. (3). For this we created a full core fixed source problem based on the geometry and cross sections available from the IAEA 3D PWR benchmark problem.15 Parallel computations were performed on an Intel Paragon XP/S-10 computer16 with double precision arithmetic.

IV.A. Test Problem

The IAEA 3D benchmark problem is originally an eigenvalue problem for a two-zone core consisting of 177 fuel assemblies surrounded by a 20 cm reflector. There are nine fully inserted and four partially inserted control rods. The assembly pitch is 20 cm and the active core height is 340 cm. Choosing 10 cm as the mesh size, except for the axial mesh size of the reflector which is taken as 20 cm, yields a 34x34x36 problem for the full core representation. The fixed source problem was constructed such that it could simulate the conditions encountered when

solving a transient problem. First, a subcritical state in which keff was 0.9924 was created by uniformly increasing the thermal absorption cross sections. This keff corresponds to a state that would be critical if the delayed neutrons were included. The normalized fission source distribution of the eigenvalue problem was taken as the distributed source for the fixed source problems (FSP). A base case FSP was solved to provide the initial flux guess for the subsequent perturbed case, for which the parallel calculations were performed. The perturbation was introduced by withdrawing a partially inserted control rod by 10 cm in one quadrant, so that it resulted in an asymmetric problem. The average fission source of the perturbed case was increased by 0.67% from the base case. The maximum local change in the fission source was 44.8%. A reference solution was generated for the perturbed case with a tight local relative fission source convergence criterion (1.0x10^-6). The convergence of the parallel cases was checked by comparing the local fission source with the reference values to ensure the same accuracy in all cases. The convergence criteria are 0.005% and 0.01% for the global and local fission source errors, respectively.
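As a rough consistency check (an interpretation based on the stated keff, with a typical delayed neutron fraction assumed rather than quoted from the paper), the subcritical state corresponds to a reactivity of

\[ \rho = \frac{k_{\mathrm{eff}}-1}{k_{\mathrm{eff}}} = \frac{0.9924-1}{0.9924} \approx -0.77\% , \]

which is comparable in magnitude to a typical total delayed neutron fraction of 0.65-0.75%, consistent with the statement that the state would be critical with delayed neutrons included.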

IV.B. Paragon XP/S-10

The Purdue INTEL Paragon is a MIMD distributed memory parallel computer, configured with 140 compute nodes each having 31 MB of memory. Each node is powered by an Intel i860XP microprocessor providing a theoretical peak double precision performance of 75 Mflops and therefore a peak machine performance of about 10 Gflops. Each node contains another i860XP processor which is totally dedicated to handling message passing operations. The message processors are connected to a 2D mesh interconnection network on the backplane. The bandwidth and latency regularly achieved by the packet switched message passing are 90 MB/sec and 45 microseconds, respectively.

IV.C. Domain Decomposition Results

The parallel calculations were performed with various types of domain decomposition. Axial decomposition cases, in which the problem domain is divided into several axial layers, required less communication overhead than general three-dimensional decomposition cases because data transfers were required in only one direction for the axial cases.

Table I shows the parallel computation results for both axial (1x1 radial decomposition) and three-dimensional decomposition cases. The numbers in parentheses are the number of radial nodes in one direction of the subdomains. For each case, the effectiveness of incomplete domain decomposition (IDD) preconditioning was first determined by comparing the number of iterations with those of the block diagonal preconditioning cases in which the subdomain coupling terms were neglected. The reduction in the number of iterations achievable by IDD preconditioning compared to the corresponding BD preconditioning case is given as ∆NBD in the table. The reduction increases as the number of subdomains increases because more subdomain coupling terms are neglected in the BD cases. In the 36 PE case of the axial decomposition, the improvement by IDD preconditioning over BD preconditioning is dramatic. This is because in this case only one plane is assigned to a processor and the axial coupling effect is totally neglected in the BD global preconditioner, while the IDD preconditioner retains the axial coupling through the Lk(3) and Uk(3) terms in the symmetric Gauss-Seidel LU factor (see Eqs. 15 and 16). The results for ∆NBD show that IDD is consistently more efficient than BD in all cases and that the improvement is more substantial for the three-dimensional decomposition cases than for the axial decomposition cases. For instance, in the 4x4 radial decomposition, ∆NBD is 6 for the 64 PE case. Since the number of iterations is 13 for the single processor case, the reduction of 6 iterations by IDD corresponds to about a 46% reduction in the floating point operations.

Even though IDD requires fewer iterations than BD, the number of iterations (Nitr) generally increases as more PEs are used. Deviations from the general increasing trend of Nitr are observed for some cases with 4 axial subdomains because in these cases the subdomain boundary coincides with the depth of the control rod insertion. Especially in the axial decomposition cases, the increase in the number of iterations from

the single processor value (13) to the multi-processor cases is very large (greater than 6) when more than 12 PEs are used. The problem with more than 12 PEs in the axial decomposition is that the subdomains become too thin; the leakage effect then becomes more significant and the IDD preconditioning becomes less effective. For a given number of PEs, the three-dimensional decomposition allows smaller surface-to-volume ratios of a subdomain, thus making the leakage effect less important. For all the three-dimensional cases, the number of iterations is much smaller than in the corresponding axial decomposition cases. For instance, for the 36 PE cases, the numbers of iterations are 25 and 18 for the axial and three-dimensional (2x2 radial) cases, respectively, which results in an increase in the efficiency from 50.9% to 59.1% with the three-dimensional decomposition, despite the additional communication overhead required for three-dimensional decomposition.

It is shown in Table I that in all decomposition cases the efficiency generally decreases as the number of PEs is increased. In addition to the increase in the number of iterations, there were two additional causes for the decrease in efficiency: the communication overhead, which increases with the number of PEs, and the load imbalance. Because of the unequal radial domain sizes which are unavoidable in the 4x4 and 5x5 radial decomposition cases, the maximum efficiency without any overhead would be only 89% for the 4x4 cases. As indicated in the table, the maximum speedup and efficiency were 45.2 and 47.1%, respectively, for the case of the 4x4 radial decomposition with 96 PEs.

V. NONLINEAR NODAL CALCULATIONS

The CMFD calculation discussed in the previous section cannot provide sufficient accuracy in the solution of the fixed source problem because of limitations in the linear spatial flux approximation in each node. In practical calculations, a higher order spatial approximation is used in the so-called advanced nodal methods. In this section, the parallel solution of the fixed source problem employing the nonlinear nodal

method is described. The CMFD problems involved in the nonlinear iteration are solved by the preconditioned BiCGSTAB algorithm, and the two-node nodal calculation is performed using the method described in Reference 11, which employs the nodal expansion method (NEM) for solving each of the two-node problems. Since the work here is the first application of the preconditioned BiCGSTAB algorithm within the framework of the nonlinear iteration, we first discuss an iteration strategy for normal single processor applications. We then describe the parallel solution method and the results of the nonlinear nodal calculations.

V.A. Nonlinear Iteration Strategy

In the application of the nonlinear nodal method, a series of linear systems must be solved since the coupling coefficient correction terms, DgN of Eq. (3), are updated after every two-node nodal calculation. It is not necessary to fully converge the solution of each linear system since the overall problem is nonlinear and the coupling coefficients are not correct until the overall problem - the iteration between the two-node nodal calculation and the CMFD calculation - is converged. A partially converged solution of the linear CMFD problem is sufficient for the subsequent two-node nodal calculation. However, the extent of partial convergence affects the overall convergence history and consequently the total computation time. In NESTLE11, which employs the standard outer/inner iteration method to solve the CMFD problem, partial convergence is achieved by fixing the number of outer iterations.

Another approach to achieving partial convergence is to use an error reduction factor. If there is a measure of the error of the current iterate of the flux, it is possible to terminate the iteration when the error is reduced by a certain factor. In the BiCGSTAB algorithm, a residual vector defined as

(31)

is generated during the iteration, and it provides a convenient measure of the error. If the residual is a null vector for a given vector φ, then φ is the solution of Eq. (6). In the work here, we used an error reduction factor based on the 2-norm and defined as:

(32)

where r0 is the initial residual vector corresponding to the initial flux guess taken from the previous CMFD calculation. Compared to the approach of fixing the number of iterations, this method is more general since the number of iterations needed to achieve a certain convergence can change during the nonlinear iteration as well as between different problems (e.g., different numbers of PEs).

Table II shows the dependence of the global iteration characteristics on the error reduction factor. The calculations listed here were performed for the perturbed case of the test problem. The initial nodal coupling coefficients as well as the initial flux guess for the perturbed case were taken from the base case calculation. The global iteration was terminated when the global and local fission source errors obtained from the first BiCGSTAB iteration after a coupling coefficient update satisfied the convergence criteria. This additional BiCGSTAB iteration was used to ensure convergence of the global iteration. The reference solution used in the convergence check was generated using a tighter convergence criterion. Since the linear systems change during the nonlinear iteration, the BILU3D preconditioner was recalculated after each coupling coefficient update. The CMFD time listed in the table includes the time for the preconditioner update.

As indicated in Table II, the number of NEM coupling coefficient updates decreases as the error reduction factor decreases from 0.1 to 0.01. This implies that a more accurate CMFD solution results in better coupling coefficients in the subsequent NEM calculation. However, this is no longer true if the error reduction factor is sufficiently small. As shown in the last three cases, for which the error reduction factor is less than or equal to 0.01, tighter convergence of the CMFD solution beyond a certain value is only a

waste of operations. Based on the results shown in Table II, an optimum error reduction factor of 0.01 was used in the work here. In the optimal case, the CMFD and NEM calculations take about 55% and 40% of the total computation time, respectively.
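The iteration strategy described above can be summarized by the following Python sketch. It is a structural illustration only: the helper callables (cmfd_solve, nem_update, fission_source) and the particular convergence test are assumptions made for this sketch, not details taken from the paper. Each CMFD solve is a preconditioned BiCGSTAB run that is stopped as soon as the residual 2-norm has been reduced by the error reduction factor, after which the two-node NEM calculation updates the coupling coefficients.

import numpy as np

def nonlinear_iteration(cmfd_solve, nem_update, fission_source, phi, coeffs,
                        eps_r=0.01, tol_global=5e-5, tol_local=1e-4):
    """Nonlinear iteration between partially converged CMFD solves and two-node
    NEM coupling coefficient updates (hypothetical interfaces).

    cmfd_solve(phi, coeffs, eps_r): BiCGSTAB run stopped when ||r||/||r0|| <= eps_r
    nem_update(phi, coeffs)       : two-node NEM solves returning updated coupling coefficients
    fission_source(phi)           : nodewise fission source used for the convergence test
    """
    psi_old = fission_source(phi)
    while True:
        phi = cmfd_solve(phi, coeffs, eps_r)    # partial convergence of the CMFD problem
        coeffs = nem_update(phi, coeffs)        # update the nodal correction terms
        psi = fission_source(phi)
        glob_err = abs(psi.sum() - psi_old.sum()) / psi.sum()
        loc_err = np.max(np.abs(psi - psi_old) / np.maximum(psi, 1e-30))
        if glob_err < tol_global and loc_err < tol_local:
            return phi, coeffs
        psi_old = psi

The default tolerances above correspond to the 0.005% and 0.01% global and local fission source criteria quoted in Section IV.A.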

V.B. Parallel Two-Node NEM Solution Method

Although the two-node NEM calculations are naturally parallelizable by domain decomposition, they require a nontrivial amount of data transfer. This is primarily the result of the quadratic approximation of the transverse leakage in the Nodal Expansion Method. In principle, it is possible to proceed with the two-node NEM calculation by having a single data transfer after the CMFD calculation: only the node average fluxes of the relevant regions of the neighboring subdomains would be transferred to determine the quadratic polynomial coefficients describing the spatial variation of the transverse leakages. If the two-node calculations are based on flux transfers only, however, the data communication scheme becomes very complicated. This is because fluxes from all the neighboring subdomains surrounding a subdomain, including those located diagonally to the subdomain, must be transferred to calculate the transverse leakages required for the two-node calculations of the subdomain.

In order to avoid the diagonal data transfer, which results in a very complicated communication scheme and significant communication overhead, an additional data transfer involving the transverse leakages is introduced. The calculation of the transverse leakages of a subdomain requires the transfer of the fluxes from the neighboring subdomains contacting the outer surfaces of the subdomain, not including those located diagonally to the subdomain. The calculated transverse leakages are then transferred back to the neighboring subdomains for the two-node calculation, and thus the diagonal transfer of fluxes is not necessary. After the transfer of the transverse leakages, the two-node calculations are performed on each PE and the resulting nodal coupling coefficients are transferred to the neighboring subdomains for the next CMFD calculation. The parallel two-node calculation procedure outlined above is shown in Figure 1 along

with the nonlinear iteration strategy.
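The communication sequence just described can be sketched as follows in Python; the subdomain object and the exchange helper are hypothetical stand-ins introduced only for illustration, since the paper does not specify the data structures or the message passing interface.

def parallel_two_node_pass(subdomain, exchange):
    """One parallel two-node NEM pass for a subdomain (structural sketch with
    hypothetical interfaces; 'exchange' sends data to the face-adjacent neighbor
    subdomains and returns what they sent back)."""
    # 1. Transfer node-average fluxes on the outer surfaces to the face-adjacent
    #    neighbors only (no diagonal neighbors are involved).
    neighbor_fluxes = exchange("flux", subdomain.boundary_fluxes())
    # 2. Compute the transverse leakages locally from the received fluxes, then
    #    transfer the boundary leakages back to the same neighbors; this second
    #    transfer replaces the diagonal flux transfers a flux-only scheme needs.
    subdomain.compute_transverse_leakages(neighbor_fluxes)
    neighbor_leakages = exchange("leakage", subdomain.boundary_leakages())
    # 3. Perform the two-node NEM calculations on this PE and transfer the
    #    resulting coupling coefficient corrections for the next CMFD calculation.
    subdomain.solve_two_node_problems(neighbor_leakages)
    exchange("coupling", subdomain.boundary_coupling_coefficients())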

V.C. Results of Parallel Nonlinear Nodal Calculations

The parallel nonlinear nodal calculations were performed according to the procedure shown in Figure 1. The CMFD part of the nodal calculation was performed exactly as in the parallel CMFD-only calculations described in Section IV, except that the BiCGSTAB iterations of each CMFD problem were terminated when the desired error reduction was satisfied. The values of the optimum error reduction factor (εr) were chosen in the range of 0.01 to 0.02. The results of the parallel nonlinear nodal calculations are given in Table III. In all the cases listed in the table, the number of NEM updates was three.

It is seen in Table III that the number of iterations of the nonlinear nodal calculations generally increases with the number of PEs, similar to the CMFD-only cases of Table I. However, the behavior of the increase for the nonlinear problem is somewhat irregular compared to the linear CMFD-only cases. This can be attributed to the impact of the partial convergence of the BiCGSTAB iterations on the nonlinear nodal calculations. Although the BiCGSTAB algorithm is guaranteed to converge in no more than N iterations, where N is the order of the system, it is not formulated in a way that minimizes the residual vector in every iteration, and therefore the norm of the residual vector can fluctuate locally as the iteration proceeds. More fluctuation is noted in the initial stage of the iteration, especially when the initial guess has a large error. Because of the partial convergence of the CMFD solution in the nonlinear nodal calculation and the resulting fluctuation of the residual vector in the initial stages of the BiCGSTAB iteration, we observe some irregularity in the total number of iterations when different preconditioners are used, i.e., when different numbers of PEs are used. However, this irregularity disappears if the BiCGSTAB iteration is allowed to converge fully in every CMFD calculation.

The efficiency of the NEM portion of the nonlinear nodal calculation is much higher than that of the

CMFD portion, as indicated in the table and in Figure 2, which shows the variation of the NEM and CMFD efficiencies for the 4x4 radial decomposition cases. Since there is essentially no computational overhead in the parallel two-node NEM calculations, the reduction of the efficiency of the NEM portion is solely due to the communication overhead, which increases monotonically with the number of PEs. The NEM parallel efficiency is as high as 65% even with as many as 100 PEs. Owing to the high NEM efficiency, the efficiency of the nonlinear nodal calculation becomes higher than that of the corresponding CMFD-only calculation. However, we note that the overall efficiency is determined more by the CMFD efficiency because the CMFD portion of the computation increases with the number of PEs due to the increase in the number of BiCGSTAB iterations. We also note in Table III that in the 4 PE case of the 2x2 radial decomposition, the efficiencies are slightly higher than 100%. This is possible because the use of smaller array sizes in the 2x2 radial decomposition case improves the efficiency of traffic through the local cache memory. As indicated in Figure 3, which shows the overall speedups for the nonlinear nodal calculations, the 4x4 radial decomposition is the best of all the radial decomposition cases. The maximum speedup and efficiency were 48.9 and 51%, respectively, for the case of 96 PEs.

VI. SUMMARY AND CONCLUSIONS

Because the two-node NEM calculation is naturally parallelizable, the nonlinear iteration method was chosen for the parallel solution of the spatial kinetics equations. The CMFD problem, which takes a slightly higher fraction of the overall computation time than the NEM calculation, was solved by the preconditioned BiCGSTAB method to achieve high granularity in the coarse mesh formulation. An incomplete domain decomposition preconditioning scheme was developed by taking advantage of the coarse mesh formulation, in which the coupling between nodes is rather weak. The preconditioning scheme is characterized by the BILU3D preconditioner, which is generated for each subdomain, and by an efficient method of incorporating the subdomain coupling terms. The preconditioning scheme was first examined for a CMFD-only problem to determine its effectiveness. The effectiveness of the preconditioning scheme was evidenced by the small increase in the number of iterations as the number of subdomains increases, particularly for the three-dimensional domain decomposition cases.

For the parallel application of the BiCGSTAB algorithm to the nonlinear nodal problem, an error reduction factor was defined in terms of the norm of the residual vector, which is readily available from the BiCGSTAB algorithm. The optimum error reduction factor was chosen to minimize the execution time on a single processor, and similar values were then used in the parallel calculations. In the parallel two-node calculations, a transfer of transverse leakages was introduced to minimize the communication overhead. The parallel nodal calculations demonstrated that the efficiency of the nonlinear nodal calculation was higher than that of the corresponding CMFD-only calculation, primarily because of the high efficiency of the parallel two-node calculation. The maximum speedup achieved was 48.9 with 96 PEs. The results here indicate that the computation time for nonlinear nodal kinetics calculations can be significantly reduced by the parallel solution method developed in this work.

REFERENCES

1. B. L. Kirk and Y. Y. Azmy, "An Iterative Algorithm for Solving the Multidimensional Neutron Diffusion Nodal Method Equations on Parallel Computers," Nucl. Sci. Eng., 111, 57 (1992).

2. K. S. Smith, "Nodal Method Storage Reduction by Nonlinear Iteration," Trans. Am. Nucl. Soc., 44, p. 265, ANS, Detroit, MI (1983).

3. S. K. Zee, P. J. Turinsky, and Z. Shayer, "Vectorized and Multitasked Solution of the Few-Group Neutron Diffusion Equations," Nucl. Sci. Eng., 101, 205 (1989).

4. C. S. Henkel and P. J. Turinsky, "Solution of the Few-Group Neutron Diffusion Equations on a Distributed Memory Multiprocessor," Proc. Topl. Mtg. Advances in Reactor Physics, Charleston, SC, March 8-11, p. 1-108 (1992).

5. R. M. Al-Chalabi, "Development of Neutronic Core Physics Simulator on Advanced Computer Architecture," M.S. Thesis, North Carolina State University (1991).

6. Y. H. Kim and N. Z. Cho, "Parallel Solution of the Neutron Diffusion Equation with the Domain Decomposition Method on a Transputer Network," Nucl. Sci. Eng., 114, 252 (1993).

7. D. Y. Yang, G. S. Chen, and H. P. Chou, "Application of Preconditioned Conjugate Gradient-Like Methods to Reactor Kinetics," Ann. Nucl. Energy, 20, pp. 9-33 (1993).

8. H. A. van der Vorst, "Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems," SIAM J. Sci. Stat. Comput., 13, pp. 631-644 (1992).

9. P. Concus, G. H. Golub, and G. Meurant, "Block Preconditioning for the Conjugate Gradient Method," SIAM J. Sci. Stat. Comput., 6, pp. 220-252 (1985).

10. G. Meurant, "Domain Decomposition vs. Block Preconditioning," Proceedings of the First International Symposium on Domain Decomposition Methods for Partial Differential Equations, R. Glowinski et al., Eds., pp. 231-249, SIAM (1988).

11. P. J. Turinsky et al., "NESTLE: A Few-Group Neutron Diffusion Equation Solver Utilizing the Nodal Expansion Method for Eigenvalue, Adjoint, Fixed-Source Steady-State and Transient Problems," EGG-NRE-11406 (1994).

12. O. Axelsson, "Incomplete Block Matrix Factorization Preconditioning Method. The Ultimate Answer?," J. Comp. Appl. Math., 12&13, pp. 3-18 (1985).

13. H. G. Joo and T. J. Downar, "Incomplete Domain Decomposition Preconditioning for Coarse Mesh Neutron Diffusion Problems," Proc. Int. Conf. Mathematics, Computation, Reactor Physics, and Environmental Analyses, Portland, Oregon, April 30 - May 4, 1995, pp. 1584-1594 (1995).

14. H. G. Joo and T. J. Downar, "A Comparison of Iterative Methods for Solution of the CMFD Problem in the Nonlinear Nodal Method," Trans. Am. Nucl. Soc., 73, pp. 434-436, ANS, San Francisco, CA (1995).

15. Argonne Code Center, Benchmark Problem Book, ANL-7416 (1976).

16. Paragon™ User's Guide, Order Number 312489-003, Intel Corporation (1994).

APPENDIX

Preconditioned Bi-CGSTAB Algorithm Implemented

r0 = b - Ax0 for an initial guess x0;  ρ0 = α = ω0 = 1;  v0 = p0 = 0
for i = 1, 2, 3, ...
    ρi = (r0, ri-1);  β = (ρi/ρi-1)(α/ωi-1)
    pi = ri-1 + β(pi-1 - ωi-1 vi-1)
    Solve My = pi for y
    vi = Ay;  α = ρi/(r0, vi);  s = ri-1 - α vi
    Solve Mz = s for z
    t = Az;  ωi = (t, s)/(t, t)
    xi = xi-1 + α y + ωi z
    if xi is converged, then quit
    ri = s - ωi t
end
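For convenience, a direct Python transcription of the algorithm above might look like the following. It is a generic sketch rather than the implementation used in this work; apply_A and apply_M stand for the matrix-vector product and the preconditioner solve (e.g., the IDD/BILU3D application of Section III), and the convergence test is an assumed relative-residual criterion.

import numpy as np

def bicgstab(apply_A, apply_M, b, x0, tol=1e-6, max_it=1000):
    """Preconditioned Bi-CGSTAB following the algorithm listed above.
    apply_A(v) returns A v; apply_M(v) solves M y = v and returns y."""
    x = x0.copy()
    r = b - apply_A(x)
    r0 = r.copy()                       # fixed shadow residual
    rho_old = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    bnorm = np.linalg.norm(b)
    bnorm = bnorm if bnorm > 0.0 else 1.0
    for _ in range(max_it):
        rho = np.dot(r0, r)
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        y = apply_M(p)                  # solve M y = p
        v = apply_A(y)
        alpha = rho / np.dot(r0, v)
        s = r - alpha * v
        z = apply_M(s)                  # solve M z = s
        t = apply_A(z)
        omega = np.dot(t, s) / np.dot(t, t)
        x = x + alpha * y + omega * z
        r = s - omega * t
        if np.linalg.norm(r) / bnorm < tol:   # convergence test (criterion assumed)
            break
        rho_old = rho
    return x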