A Highly Parallel Algorithm for the Numerical Simulation of Unsteady Diffusion Processes∗

Yu Zhuang
Dept. of Computer Science, Texas Tech University, Lubbock, Texas 79409
E-mail: [email protected]

Xian-He Sun
Dept. of Computer Science, Illinois Institute of Technology, Chicago, Illinois 60616
E-mail: [email protected]

∗ This research was supported in part by the National Science Foundation under NSF Grant No. 0305393 and NSF Grant No. 0305355, and by NSF cooperative agreement ACI-9619020 through computing resources provided by the National Partnership for Advanced Computational Infrastructure at the San Diego Supercomputer Center.
Abstract

Stabilized explicit-implicit domain decomposition (SEIDD) is a class of globally non-iterative domain decomposition methods for the numerical simulation of unsteady diffusion processes on parallel computers. By adding a communication-cost-free stabilization step to the explicit-implicit domain decomposition (EIDD) methods, the SEIDD methods achieve high stability, but with the restriction that the interface boundaries have no crossovers inside the domain. In this paper, we present a parallelized SEIDD algorithm with parallelism higher than the number of subdomains, eliminating the disadvantage of non-crossing interface boundaries at a slight computation cost.
1 Introduction
Many diffusion-involved processes are studied in science and engineering, e.g. heat-transfer-based engineering [2, 25, 30], pollution modeling and waste treatment [29, 8, 15, 27], and neuron potential propagation [18]. In this paper, we present an algorithm for the numerical simulation of unsteady diffusion processes governed by the two-dimensional equation

    u_t(t, x, y) = (a u_x)_x + (b u_y)_y + f u,   (x, y) ∈ Ω
    u(t, x, y) = u_b(t, x, y),                    (x, y) ∈ ∂Ω        (1)
    u(0, x, y) = u_0(x, y),                       (x, y) ∈ Ω
on parallel computers, where a, b are smooth, positive functions on the rectangular domain Ω, f, a continuous function, is the source/sink term for the diffusion substance, and ∂Ω denotes the boundary of domain Ω. Processes governed by time-dependent equations have both temporal and spatial variations. A numerical simulation hence involves approximations in both time
(temporal discretization and advancing) and space (spatial discretization and solving). Thus, parallelism can be sought at both the temporal and the spatial approximation levels. The algorithm we present in this paper combines a SEIDD method at the temporal level with a parallel solver at the spatial level.

The SEIDD algorithm is an operator-splitting temporal discretization method in which the operator splitting is based on (spatial) domain decomposition. However, the SEIDD method requires that the domain be decomposed in such a way that the interface boundaries do not cross each other inside the domain, e.g. as in Figure 1. This obviously reduces the flexibility in partitioning and distributing data to different processors when each subdomain is assigned a processor. To overcome this domain partition inflexibility, we propose an algorithm that combines the SEIDD method at the temporal approximation level with another parallelizing technique at the spatial level. By allowing the domain partition used to construct the SEIDD method to differ from the domain partition used to distribute data to different processors, we avoid violating SEIDD's restriction of no crossover interface boundaries while maintaining low communication cost for problems run on massively parallel computer systems, resulting in a parallel algorithm of high efficiency and scalability.

[Figure 1: a domain partition whose interface boundaries do not cross inside the domain]

There have been substantial research activities in domain decomposition methods during the past two decades, most of which were directed towards the Schwarz alternating algorithms [5, 6, 13, 14, 16, 17, 28], a class of globally iterative domain decomposition algorithms for elliptic equations. Here the term "globally" refers to the solution process that is carried out over the entire problem domain, as opposed to solution processes for subdomain problems, which could be either iterative or direct. These Schwarz-type elliptic solvers are applicable to diffusion problems when implicit schemes are used for temporal discretization [3, 4]. Since globally iterative methods incur repeated data transmission among processors, it is appealing to keep the global iterations to a small number [19, 24]. Since overlapping also increases computation and communication costs, it is natural to minimize the overlap size together with the number of global iterations. Many non-iterative, non-overlapping domain decomposition methods [1, 7, 9, 10, 11, 12, 19, 20, 21, 22, 23, 26, 34, 33, 35, 36] have been investigated. Among these methods is the group of EIDD methods, which have good accuracy compared with other non-iterative, non-overlapping methods. EIDD methods are algorithmically simple, and computationally and communicationally efficient for each time step. However, they are subject to some level of stability- or consistency-related time step restrictions. To ease the time step size restriction, stabilized EIDD (SEIDD) methods were introduced [33, 36, 35] by adding a stabilization step to EIDD methods, which adds zero communication cost and negligible computation cost to EIDD, but at the price of the aforementioned domain partitioning inflexibility. An investigation has been carried out to overcome this inflexibility for one SEIDD method for heat equations [37], but that technique is not applicable to problems with non-uniform diffusion coefficients. The purpose of this paper is to find a technique that addresses the domain partition inflexibility for general diffusion problems.

The paper is organized as follows. Section 2 presents the SEIDD method and the parallel algorithm based on it. Experimental results are reported in Section 3, and conclusions are given in Section 4.
2 The Proposed Parallel Domain Decomposition Algorithm

To numerically solve problem (1), we choose a discrete spatial grid Ω_h with uniform mesh size h, and discretize equation (1) spatially into

    d/dt u_h(t) = A_h u_h(t),
    u_h(0) = u_0,                                                    (2)

where A_h is the discrete approximation of the spatial elliptic operator of equation (1). The discrete grid domain is divided into p × q subdomains Ω_{i,j} of equal size as in Figure 2. Let Ω_i = ∪_{j=1}^{q} Ω_{i,j} for i = 1, 2, ..., p, and denote by B_i the vertical interface boundary between Ω_i and Ω_{i+1} for i = 1, 2, ..., p−1. We further denote by B the union of all B_i (i.e. B = ∪_{i=1}^{p−1} B_i), and denote by B^c the complement of all vertical interface boundaries in Ω_h. For a rectangular domain partitioned as in Figure 2, we use the partition by the vertical lines to construct the operator splitting of the SEIDD method, while using the decomposition by both the vertical and horizontal lines for data partitioning and distribution to different processors.

[Figure 2: the p × q partition of the rectangular grid domain]
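As a concrete illustration of the semi-discretization (2), the following minimal sketch (in Python with numpy/scipy, used here purely for illustration; the paper prescribes no implementation language) assembles A_h for the constant-coefficient special case a = b = 1, f = 0 on a uniform n × n grid. Variable coefficients only change the stencil weights.

    import numpy as np
    import scipy.sparse as sp

    n = 64                               # grid points per direction in Omega_h
    h = 1.0 / (n + 1)
    # 1D second-difference operator with homogeneous Dirichlet boundaries.
    T = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
    I = sp.identity(n)
    # Grid flattened row by row: x is the fastest-varying index.
    Axh = sp.kron(I, T)                  # discretizes (a u_x)_x for a = 1
    Ayh = sp.kron(T, I)                  # discretizes (b u_y)_y for b = 1
    Ah = Axh + Ayh                       # the 5-point Laplacian A_h

    uh = np.ones(n * n)
    print((Ah @ uh).shape)               # right-hand side of d/dt u_h = A_h u_h

The split Ah = Axh + Ayh constructed here is exactly the operator splitting A_h = A_h^x + A_h^y used in Section 2.1 below.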
2.1 The SEIDD Method
The SEIDD method we use is one of the three methods proposed in [33] (see also [36, 35]). For the reader's convenience, the mathematical description of the SEIDD method used in this paper is provided below.

We first let A_h = A_h^x + A_h^y, where A_h^x and A_h^y are discrete approximations of (a u_x)_x + 0.5 f u and (b u_y)_y + 0.5 f u, respectively. For a subset S ⊂ Ω_h, let χ_S be the operator on the function space L2(Ω_h) given by

    χ_S v(x) = v(x) if x ∈ S,   χ_S v(x) = 0 if x ∉ S,

that is, χ_S is a diagonal matrix with 1 at the positions corresponding to the grid points in the subset S ⊂ Ω_h and 0 elsewhere. We use u^k to denote the numerical solution at the k-th time step. The mathematical description of a SEIDD method (without parallelization) for computing the solution u^{k+1} at the (k+1)-th time step from the current k-th time step is given below.

1. Compute the interface boundary condition on B using the forward Euler scheme:

    χ_B u^{k+1/3} = χ_B (I + Δt A_h) u^k,
    χ_{B^c} u^{k+1/3} = χ_{B^c} u^k.                                 (3)

2. With the exterior boundary conditions and the interface boundary conditions computed at step 1, compute the solution on the subdomains (i.e. on B^c) using the directionally factorized backward Euler scheme:

    χ_B u^{k+2/3} = χ_B u^{k+1/3},
    χ_{B^c} (I − Δt A_h^x)(I − Δt A_h^y) u^{k+2/3} = χ_{B^c} u^{k+1/3}.   (4)

3. Discard the interface boundary condition computed at step 1 and, using the solution data u^{k+2/3} on the nearby subdomains as boundary conditions, re-compute the interface boundary condition on B with the backward Euler scheme:

    χ_B (I − Δt A_h) u^{k+1} = χ_B u^k,
    χ_{B^c} u^{k+1} = χ_{B^c} u^{k+2/3}.                             (5)

Go back to step 1 for the next time step iteration.
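To make the three-step structure concrete, here is a minimal, purely illustrative numpy sketch of the scheme (3)-(5), reduced to the 1D special case u_t = u_xx with two subdomains, so that the interface B is a single grid point. This reduction, the grid sizes, and the time step are our own illustrative assumptions; the paper's actual method is the 2D, directionally factorized version above.

    import numpy as np

    n  = 63                   # interior grid points on (0, 1)
    h  = 1.0 / (n + 1)
    dt = 1.0e-4               # small enough that the explicit prediction is benign
    m  = n // 2               # index of the single interface point B
    r  = dt / h**2

    x = np.linspace(h, 1.0 - h, n)
    u = np.sin(np.pi * x)     # u(0, x); exact solution exp(-pi^2 t) sin(pi x)

    def backward_euler_block(rhs, left_bc, right_bc):
        """Solve (I - dt*A) v = rhs on one subdomain with Dirichlet end data."""
        k = len(rhs)
        A = (np.diag((1 + 2 * r) * np.ones(k))
             + np.diag(-r * np.ones(k - 1), 1)
             + np.diag(-r * np.ones(k - 1), -1))
        b = rhs.copy()
        b[0]  += r * left_bc
        b[-1] += r * right_bc
        return np.linalg.solve(A, b)

    nsteps = 1000
    for _ in range(nsteps):
        # Step 1, eq. (3): forward Euler prediction of the IBC on B.
        ub = u[m] + r * (u[m - 1] - 2 * u[m] + u[m + 1])
        # Step 2, eq. (4): backward Euler on each subdomain with predicted IBC.
        v = np.empty_like(u)
        v[:m]     = backward_euler_block(u[:m],     0.0, ub)
        v[m + 1:] = backward_euler_block(u[m + 1:], ub, 0.0)
        # Step 3, eq. (5): discard ub; re-solve on B implicitly using the
        # fresh subdomain values v[m-1], v[m+1] as boundary data.
        v[m] = (u[m] + r * (v[m - 1] + v[m + 1])) / (1 + 2 * r)
        u = v

    exact = np.exp(-np.pi**2 * nsteps * dt) * np.sin(np.pi * x)
    print("max error at t = 0.1:", np.abs(u - exact).max())

Step 3 is what distinguishes SEIDD from EIDD: the explicitly predicted interface value serves only as Dirichlet data for the subdomain solves and is then replaced by an implicit re-computation.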
2.2 The Parallel Algorithm
We denote by B_{i,j} the part of the interface boundary B_i between subdomains Ω_{i,j} and Ω_{i+1,j}. Then B_i = ∪_{j=1}^{q} B_{i,j} for each i = 1, 2, ..., p−1. Now, given p × q processors labeled P_{i,j} for i = 1, 2, ..., p and j = 1, 2, ..., q, the distribution of data to processors is given as follows (see the sketch after this list):

Data Distribution for the Parallel Algorithm

• Assign subdomain Ω_{i,j} to processor P_{i,j}.
• Assign interface boundary B_{i,j} to processor P_{i,j}.
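One possible realization of this layout, sketched here under the assumption of an MPI environment via mpi4py (the paper does not specify an implementation), maps each rank to a position (i, j) in a p × q Cartesian process grid and locates the neighbors with which steps 1-3 below exchange data:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    p, q = MPI.Compute_dims(comm.Get_size(), 2)   # factor ranks into a p x q grid
    cart = comm.Create_cart([p, q], periods=[False, False], reorder=True)
    i, j = cart.Get_coords(cart.Get_rank())       # this rank plays P_{i+1,j+1}
    # Omega_{i,j} and its interface piece B_{i,j} both live on this rank.
    lo_i, hi_i = cart.Shift(0, 1)   # P_{i-1,j}, P_{i+1,j}: IBC exchanges (steps 1, 3)
    lo_j, hi_j = cart.Shift(1, 1)   # P_{i,j-1}, P_{i,j+1}: endpoint and PDD exchanges

    print(f"rank {cart.Get_rank()} -> P({i+1},{j+1}) of a {p}x{q} grid")

At domain edges, Shift returns MPI.PROC_NULL, so sends and receives to non-existent neighbors degenerate harmlessly.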
The prediction of the interface boundary condition (IBC) at step 1 involves a matrix-vector product, where the matrix is very sparse, with non-zero entries only in rows corresponding to the grid points on the interface boundaries B. The sparsity makes a straightforward parallelization of the explicit prediction possible. From equation (3), we can see that the explicit computation of the IBC on B_{i,j} needs the previous time-step solutions at the x-direction neighboring grid points on the subdomain Ω_{i+1,j}, which is assigned to processor P_{i+1,j}, and also needs one previous time-step solution data item from each of processors P_{i,j−1} and P_{i,j+1} for computing the IBC at the two ends of the interface boundary B_{i,j}. Thus, the parallelization requires processor P_{i+1,j} to send data to P_{i,j}, and processor P_{i,j} to exchange one solution data item with each of its upper and lower neighboring processors P_{i,j−1} and P_{i,j+1}.

Step 2 of the SEIDD method involves solving two sequences of one-dimensional equations on each of the p subdomains Ω_i (i = 1, ..., p), one sequence corresponding to the operator (I − Δt A_h^x) and the other corresponding to (I − Δt A_h^y). With standard finite differences for the operators A_h^x and A_h^y, the two discrete operators (I − Δt A_h^x) and (I − Δt A_h^y) are diagonally dominant tridiagonal matrices. On the subdomain Ω_i (i = 1, ..., p), each tridiagonal system of the sequence associated with (I − Δt A_h^x) represents a difference equation on an x-direction grid line on Ω_i. Since Ω_i is partitioned along the y-direction into q sub-subdomains Ω_{i,1}, Ω_{i,2}, ..., Ω_{i,q}, each x-direction grid line on Ω_i lies entirely on one of the sub-subdomains. Hence each tridiagonal system of the sequence associated with (I − Δt A_h^x) has its data entirely on one processor and can be solved by a single processor, while the tridiagonal systems of this sequence on different processors can be solved mutually independently and hence in parallel. Again due to the y-directional partition of Ω_i into q sub-subdomains, each tridiagonal system of the sequence associated with (I − Δt A_h^y) has its data distributed among the q processors P_{i,1}, P_{i,2}, ..., P_{i,q}. Since these tridiagonal matrices are strictly diagonally dominant, they can be solved by the Parallel Diagonal Dominant (PDD) algorithm for tridiagonal matrices. The PDD algorithm [31, 32] is a highly parallel and communicationally efficient solver with two data transfer operations involving a total of three data items for each processor. The price the PDD algorithm pays for its high parallelism and communication efficiency is a computation cost approximately double that of a sequential tridiagonal solver.

The stabilization (step 3) of SEIDD involves solving p−1 tridiagonal systems, each corresponding to equation (5) on one of the p−1 interface boundaries B_i for i = 1, 2, ..., p−1. Each of the p−1 tridiagonal systems has its data distributed among q processors.
For example, the tridiagonal system corresponding to interface boundary B_i has its data distributed among the q processors P_{i,1}, P_{i,2}, ..., P_{i,q}. Since equation (5) is strictly diagonally dominant, we apply the PDD algorithm [32] for the parallel solution of these tridiagonal systems. Summarizing the solution procedure described above, the parallel algorithm is given below.

1. For i = 1, 2, ..., p−1, processor P_{i,j} sends the previous time-step solution u^k at the upper end of B_{i,j} to processor P_{i,j+1} for j < q, and sends u^k at the lower end of B_{i,j} to processor P_{i,j−1} for j > 1; then processor P_{i,j} explicitly computes the interface boundary condition (IBC) u^{k+1/3} on B_{i,j} using the forward Euler scheme (3). The explicit computation of the IBC on B_{i,j} needs the previous time-step solution on the grid line that is on Ω_{i+1,j} but immediately adjacent to B_{i,j}. However, processor P_{i,j} already received these data from processor P_{i+1,j} at step 3 of the previous time step, and no update of the solution on Ω_{i+1,j} has been carried out before this step, so no data transfer is necessary.

2. For i < p, processor P_{i,j} sends to processor P_{i+1,j} the IBC explicitly computed at step 1, and then each processor assembles the right-hand side of equation (4) using the exterior boundary conditions and the interface boundary conditions computed at step 1. All processors solve the tridiagonal systems associated with (I − Δt A_h^x) in parallel; then, for i = 1, ..., p, processors P_{i,1}, P_{i,2}, ..., P_{i,q} combine to solve each of the tridiagonal systems associated with (I − Δt A_h^y) on Ω_i using the PDD algorithm.

3. For i = 1, ..., p−1, processor P_{i+1,j} sends to processor P_{i,j} the solution on the grid line that is on Ω_{i+1,j} but immediately adjacent to B_{i,j}; then processors P_{i,1}, P_{i,2}, ..., P_{i,q} combine to re-compute the solution u^{k+1} implicitly on B_i using the backward Euler scheme (5). The implicit re-computation is carried out by solving the resulting tridiagonal system in parallel with the PDD algorithm. Go back to step 1 for the next time step iteration.

Assuming that the discrete spatial domain Ω_h has M × N grid points, and that the pq subdomains Ω_{i,j} (i = 1, 2, ..., p and j = 1, 2, ..., q) are of equal size, the upper bounds on the per-processor computation and communication costs of each of the steps above
are given below, for second-order standard finite difference schemes:

1. 18N/(qr) parallel floating point operation time and 2α + 2β communication time, where r is the single-processor floating point operation rate, α denotes the startup time for each send or receive operation, and β is a system-dependent per-word data transfer time.

2. 41(M−p)N/(pqr) floating point operation time and 3α + [3(M/p − 1) + N/q]β communication time, where 17(M−p)N/(pqr) of the floating point operation time comes from solving the y-direction tridiagonal systems, each distributed among q processors, 8(M−p)N/(pqr) comes from solving the x-direction tridiagonal systems, and 16(M−p)N/(pqr) comes from assembling the matrix coefficients for the aforementioned y- and x-direction tridiagonal systems.

3. 33N/(qr) floating point operation time and 3α + (3 + N/q)β communication time, where 17N/q of the floating point operations come from solving a y-direction tridiagonal system (one per interface boundary B_i), each distributed among q processors, and 16N/q from assembling the right-hand side due to the x-direction boundary condition for each grid point on B_i.
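The sequential tridiagonal solves underlying these counts can be carried out with the classical Thomas algorithm, which costs roughly 8N flops for a system of size N, the figure used in the speedup estimate below. A minimal numpy sketch, assuming strict diagonal dominance so that no pivoting is needed:

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a tridiagonal system with sub-diagonal a (n-1 entries),
        diagonal b (n), super-diagonal c (n-1), right-hand side d (n)."""
        n = len(b)
        cp = np.zeros(n - 1)
        dp = np.zeros(n)
        cp[0] = c[0] / b[0]
        dp[0] = d[0] / b[0]
        for k in range(1, n):                   # forward elimination
            den = b[k] - a[k - 1] * cp[k - 1]
            if k < n - 1:
                cp[k] = c[k] / den
            dp[k] = (d[k] - a[k - 1] * dp[k - 1]) / den
        x = np.zeros(n)
        x[-1] = dp[-1]
        for k in range(n - 2, -1, -1):          # back substitution
            x[k] = dp[k] - cp[k] * x[k + 1]
        return x

    # Example: a strictly diagonally dominant system like (I - dt*A_h^x) v = d.
    n = 5
    r = 0.25                                    # plays the role of dt/h^2
    a = -r * np.ones(n - 1)
    b = (1 + 2 * r) * np.ones(n)
    c = -r * np.ones(n - 1)
    d = np.ones(n)
    v = thomas(a, b, c, d)
    A = np.diag(b) + np.diag(a, -1) + np.diag(c, 1)
    print(np.allclose(A @ v, d))                # True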
Summing up all the costs above, the total cost of each time step advancement is approximately 41MN/(pq) floating point operations and 8α + (3M/p + 2N/q + 4)β communication cost.

The non-crossover restriction on the interface boundaries of the SEIDD method allows parallelism only along one direction (the x-direction in our presentation). The new algorithm implementing SEIDD creates parallelism also in the y-direction. The additional computation cost of this new direction of parallelism is about 9MN/(pqr) floating point operation time: 17MN/(pqr) of PDD parallel tridiagonal solution time when the y-direction is parallelized as in the proposed algorithm, minus 8MN/(pqr) of sequential tridiagonal solution time when the y-direction is not parallelized. The major communication overhead of this added direction of parallelism comes from the PDD solver in step 2, with a total of 2α + 3(M/p)β parallel communication time, i.e. 2α + 3β time per y-direction tridiagonal system solved by PDD.

To obtain an estimate of the speedup of this algorithm, note that a sequential implementation of the SEIDD method can use the fastest tridiagonal solver in step 2 to save computation cost. Since a sequential tridiagonal solver takes 8N floating point operations to solve a tridiagonal system of size N, the total computation cost of the sequential solver is 32MN flops: 8MN flops for each of the x- and y-direction tridiagonal solves, and 16MN flops for assembling the matrix coefficients of the x- and y-direction solves. Therefore, the speedup, defined as the single-processor execution time over the parallel processing time, is

    Speedup(pq) = 32MN / [ 41MN/(pq) + 8α + (3M/p + 2N/q + 4)β ]
                = 32 / [ 41/(pq) + 8α/(MN) + (3/(pN) + 2/(qM) + 4/(MN))β ].

And the parallel efficiency, defined as Speedup(pq) over the number of processors pq, is

    Efficiency(pq) = 32 / [ 41 + 8pq α/(MN) + (3q/N + 2p/M + 4pq/(MN))β ].
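As a quick numerical illustration of these formulas, the sketch below evaluates Efficiency(pq) for the 2048 × 2048 grid used in Section 3; the machine parameters alpha and beta are assumed values for illustration only, not measurements from the paper's test system. Note that the model caps efficiency at 32/41 ≈ 78%, since the parallel algorithm performs about 41/32 times the flops of the best sequential solver; the above-100% efficiencies measured in Section 3 come from cache effects the model does not capture.

    def efficiency(p, q, M, N, alpha, beta):
        # Efficiency(pq) = 32 / (41 + 8pq*alpha/(MN) + (3q/N + 2p/M + 4pq/(MN))*beta),
        # with alpha and beta expressed in units of one floating point operation time.
        return 32.0 / (41.0
                       + 8.0 * p * q * alpha / (M * N)
                       + (3.0 * q / N + 2.0 * p / M + 4.0 * p * q / (M * N)) * beta)

    M = N = 2048                  # grid size used in Section 3
    alpha, beta = 2000.0, 8.0     # assumed communication costs (flop-time units)
    for pq in (4, 16, 64, 256):
        p = q = int(pq ** 0.5)
        print(f"{pq:4d} processors: modeled efficiency = "
              f"{efficiency(p, q, M, N, alpha, beta):.2f}")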
3 Experimental Study

We tested the proposed parallel SEIDD algorithm on a National Partnership for Advanced Computational Infrastructure (NPACI) IBM computer with Power4 processors, each running at 1.5 GHz under AIX 5.2. The test problem is

    u_t = [(cos(x)/2 + 1) u_x]_x + u_{yy} + (cos(x) + 3) u

on the spatial domain [0, 100] × [0, 100], with simulation time interval [0, 1]. The problem is solved on a uniform mesh grid of size 2048 × 2048, and the time step size is 1/1000. In the tests, the square domain is divided into √p × √p equal-size square subdomains, with p ranging from 1 to 256, and each subdomain is assigned to a different processor. We measured the communication time, the total execution time, and the maximal error of the numerical solution at time t = 1. These measured data are listed in Table 1, together with the calculated values of Speedup and Efficiency.

The data in Table 1 show that as the number of processors goes from 1 to 256, the parallel efficiency first increases to above 100% and then decreases below 100%. This phenomenon has two reasons behind it. The low communication cost of the algorithm is one reason for the high speedup and efficiency. We believe another important factor in the superlinear speedup is that we fixed the problem size in all tests at 2048 × 2048 spatial grid points and 1000 time steps. When the problem is divided into several smaller problems, each processor gets a smaller amount of data, so the cache hit ratio possibly becomes higher with the smaller data size. Again due to the fixed problem size, when the entire problem is divided into 256 subdomain problems, each processor gets a much smaller amount of data, with 128 × 128 spatial grid points, and the communication cost becomes substantial compared with the computation cost (about 45% of the computation cost), which leads to lower efficiency.

Table 1: Solving u_t = [0.5(cos(x) + 1) u_x]_x + u_{yy} + (cos(x) + 3) u with u = e^t sin(x) sin(y)

    Procs   m × n        T-total    T-comp     T-comm     Speedup   Efficiency   Max-err
    1       2048×2048    1.31e+03   1.31e+03   0.00e+00   1.00      100%         5.10e–03
    4       1024×1024    2.46e+02   2.37e+02   9.48e+00   5.33      133%         5.11e–03
    16      512×512      6.86e+01   6.38e+01   4.74e+00   19.1      119%         5.12e–03
    64      256×256      1.62e+01   1.51e+01   1.07e+00   80.9      126%         5.12e–03
    256     128×128      5.25e+00   3.60e+00   1.65e+00   250       97.5%        5.12e–03

The domain is [0, 100] × [0, 100] with h = 100/2048, and the time interval is [0, 1] with Δt = 1/1000. The domain is divided into √p × √p subdomains, each with m × n grid points, where p is the processor number.

4 Conclusion

Based on a SEIDD-type domain decomposition method, we have proposed a new parallel algorithm in which the data partition and distribution to processors do not follow the boundaries of the domain partition that is used to construct the operator splitting of the SEIDD method. By combining parallelism creation at the temporal level of the SEIDD method with parallelism utilization at the spatial level, we have achieved higher flexibility than allowed by the conventional parallelization technique for domain decomposition methods, while also avoiding violation of a requirement of the SEIDD method.

References

[1] K. Black, “Polynomial collocation using a domain decomposition solution to parabolic PDE's via the penalty method and explicit-implicit time marching”, J. Sci. Comput., 7 (1992), no. 4, pp. 313–338.

[2] T.A. Bogetti and J.A. Gillespie, “Two-dimensional cure simulation of thick thermosetting composites”, J. Composite Materials, Vol. 25 (1991), pp. 239–273.

[3] X.-C. Cai, “Additive Schwarz algorithms for parabolic convection-diffusion equations”, Numer. Math., 60 (1991), pp. 41–61.

[4] X.-C. Cai, “Multiplicative Schwarz methods for parabolic problems”, SIAM J. Sci. Comput., 15 (1994), pp. 587–603.

[5] X.-C. Cai, W. D. Gropp and D. E. Keyes, “A comparison of some domain decomposition and ILU preconditioned iterative methods for nonsymmetric elliptic problems”, J. Numer. Lin. Alg. Appl., 1 (1994), pp. 477–504.

[6] T. F. Chan and T. Mathew, “Domain decomposition algorithms”, Acta Numerica, 1994, pp. 61–143.

[7] H. Chen and R. Lazarov, “Domain splitting algorithm for mixed finite element approximations to parabolic problems”, East-West J. Numer. Math., Vol. 4, No. 2, 1996, pp. 121–135.

[8] C.G. Cogger, L.M. Hajjar, C.L. Moe, and M.D. Sobsey, “Septic system performance on a coastal barrier island”, J. Environmental Quality, Vol. 17 (1988), No. 3, pp. 401–408.

[9] D. S. Daoud, A. Q. Khaliq and B. A. Wade, “A non-overlapping implicit predictor-corrector scheme for parabolic equations”, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'2000), Las Vegas, NV, H.R. Arabnia et al., ed., Vol. I (ISBN 1-892512-22-x), CSREA Press, pp. 15–19, 2000.

[10] C. Dawson and T. Dupont, “Explicit implicit, conservative domain decomposition procedures for
parabolic problems based on block-centered finite difference”, SIAM J. Numer. Anal., 31 (1994), no. 4, pp. 1045–1061.

[11] C. Dawson, Q. Du, and T. Dupont, “A finite difference domain decomposition algorithm for numerical solution of the heat equation”, Math. Comp., 57 (1991), no. 195, pp. 63–71.

[12] M. Dryja, “Substructuring methods for parabolic problems”, Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations (Moscow, 1990), pp. 264–271, SIAM, Philadelphia, PA, 1991.

[13] M. Dryja and O. Widlund, “An additive variant of the Schwarz alternating method for the case of many subregions”, Tech. Rep. 339, Courant Inst., New York Univ., 1987.

[14] W. D. Gropp and D. E. Keyes, “Domain decomposition on parallel computers”, Impact of Computing in Science and Engineering, 1 (1989), pp. 421–439.

[15] C. Hagedorn, D. T. Hansen and G. H. Simonson, “Survival and movement of fecal indicator bacteria in soil under conditions of saturated flow”, J. Environmental Quality, Vol. 7, No. 1, pp. 55–59.

[16] D.E. Keyes, “Domain decomposition: a bridge between nature and parallel computers”, NASA ICASE Technical Report No. 92-44, NASA Langley Research Center, Hampton, VA 23681-0001, 1992.

[17] D.E. Keyes and W.D. Gropp, “A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation”, SIAM J. Sci. Statist. Comput., 8 (1987), pp. 166–202.

[18] C. Koch, Biophysics of Computation: Information Processing in Single Neurons, Oxford University Press, New York, 1999.

[19] Y.A. Kuznetsov, “New algorithms for approximate realization of implicit difference schemes”, Sov. J. Numer. Anal. Math. Modell., 3 (1988), pp. 99–114.

[20] Y.M. Laevsky, “A domain decomposition algorithm without overlapping subdomains for the solution of parabolic equations” (Russian), Zh. Vychisl. Mat. i Mat. Fiz., 32 (1992), no. 11, pp. 1744–1755; translation in Comput. Math. Math. Phys., 32 (1992), no. 11, pp. 1569–1580.
[21] Y.M. Laevsky, “Explicit-implicit domain decomposition method for solving parabolic equations”, Computing Methods and Technology for Solving Problems in Mathematical Physics (Russian), pp. 30–46, Ross. Akad. Nauk Sibirsk. Otdel., Vychisl. Tsentr, Novosibirsk, 1993.

[22] Y.M. Laevsky and S.V. Gololobov, “Explicit-implicit domain decomposition methods for the solution of parabolic equations” (Russian), Sibirsk. Mat. Zh., 36 (1995), no. 3, pp. 590–601; translation in Siberian Math. J., 36 (1995), no. 3, pp. 506–516.

[23] Y.M. Laevsky and O.V. Rudenko, “Splitting methods for parabolic problems in nonrectangular domains”, Appl. Math. Lett., 8 (1995), no. 6, pp. 9–14.

[24] T. Mathew, P. Polyakov, G. Russo and J. Wang, “Domain decomposition operator splittings for the solution of parabolic equations”, SIAM J. Sci. Comput., 19 (1998), no. 3, pp. 912–932.

[25] P.S. Myers, O.A. Uyehara, and G.L. Borman, Fundamentals of Heat Flow in Welding, Welding Research Council Bulletin, Vol. 123 (1967).

[26] H. Qian and J. Zhu, “On an efficient parallel algorithm for solving time dependent partial differential equations”, Proceedings of the 11th International Conference on Parallel and Distributed Processing Techniques and Applications, July 1998, Las Vegas, CSREA Press, Athens, GA, pp. 394–401.

[27] K.A. Rusch, “Development of a marsh-based system to treat domestic wastewater from coastal dwellings”, J. Louisiana Sect. Amer. Soc. Civil Engineers, Vol. 6, Mar. 1998, pp. 4–21.

[28] B.F. Smith, P.E. Bjorstad and W.D. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996.

[29] B. Sportisse and A. Djouad, “Reduction of chemical kinetics in air pollution modeling”, J. Comp. Phys., Vol. 164 (2000), pp. 354–376.

[30] M. Stubblefield, C. Yang, R. Lea, and S. Pang, “Development of heat-activated joining technology for composite-to-composite pipe using prepreg fabric”, Polymer Engineering and Science, Vol. 38, No. 1, pp. 143–149, January 1998.

[31] X.-H. Sun, H. Zhang, and L. Ni, “Efficient Tridiagonal Solvers on Multicomputer”, IEEE Trans.
on Computers, Vol. 41, No. 3, pp. 286–296, March 1992.

[32] X.-H. Sun, “Application and accuracy of the parallel diagonal dominant algorithm”, Parallel Computing (Aug. 1995), pp. 1241–1267.

[33] Y. Zhuang, A Class of Stable, Globally Non-iterative, Non-overlapping Domain Decomposition Algorithms for the Simulation of Parabolic Evolutionary Systems, Ph.D. dissertation, Department of Computer Science, Louisiana State University, Baton Rouge, December 2000.

[34] Y. Zhuang and X.-H. Sun, “A domain decomposition based parallel solver for time dependent differential equations”, Proc. 9th SIAM Conf. on Parallel Processing for Scientific Computing, March 1999, San Antonio, Texas, CD-ROM, SIAM, Philadelphia, 1999.
[35] Y. Zhuang and X.-H. Sun, “Stable, globally non-iterative, non-overlapping domain decomposition parallel solvers for parabolic problems”, Proceedings of High Performance Networking and Computing 2001 (SC2001), Denver, Colorado, November 2001, CD-ROM, IEEE Computer Society and ACM.

[36] Y. Zhuang and X.-H. Sun, “Stabilized explicit implicit domain decomposition methods for the numerical solution of parabolic equations”, SIAM J. Sci. Comput., Vol. 24, No. 1, July 2002, pp. 335–358.

[37] Y. Zhuang and X.-H. Sun, “A domain decomposition algorithm for solving heat transfer problems on massively parallel computers”, submitted.