Fast Elliptic Solver for Incompressible Navier Stokes Flow and Heat Transfer Problems on the Grid∗

M. Garbey† and B. Hadri‡
Department of Computer Science, University of Houston, Houston, TX 77204, USA

W. Shyy§
Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL 32611, USA
The objective of this work is to present a fast parallel elliptic solver that efficiently improves the run time of incompressible Navier-Stokes flow codes and heat transfer codes on a grid of parallel computers. We focus on the design of the elliptic solver because the pressure solver is the most time-consuming and the most communication-intensive part of an incompressible Navier-Stokes code. We describe a domain decomposition method that is numerically efficient, scales well on a parallel computer, and is tolerant of the high latency and low bandwidth of a slow network. The main feature of our method is to keep the framework of the additive Schwarz algorithm, which is easy to code and to parallelize, and to make it numerically efficient with an acceleration procedure. We also discuss a model, based on surface response modeling, to handle the automatic performance tuning of the linear solver for each subdomain, each processor architecture, and each application. Results are shown with performance on a grid of heterogeneous parallel computers.
I. Introduction and Motivation
The goal of this paper is to present a fast parallel solver for elliptic equations of the form

−div(ρ∇p) = f,   ρ ≡ ρ(x) ∈ R^{+,∗},   p ≡ p(x) ∈ R^{+},   x ∈ Ω ⊂ R²,   (1)
complemented by appropriate boundary conditions. The specificity of our solver is that it is designed to combine numerical efficiency and parallel efficiency on a grid of parallel computers. We recall that a grid is a complex system with a large number of distributed hardware and software components and no centralized control.1, 2, 3, 4, 5, 6 Communication latency between some pairs of nodes might be high, variable, and unpredictable. The parallel computers of the grid are heterogeneous and have variable performance. One may also expect an unusually high unreliability of computing nodes. Typical applications of the solver presented in this paper are the solution of the pressure equation in incompressible Navier-Stokes flow, or the steady solution of a heat transfer problem. We assume that the discretization grid is topologically equivalent to a Cartesian grid, or can be embedded in such a spatial discretization grid. When we use an overset or Chimera method, this requirement holds for the subdomains only. We consider second order finite volume or finite difference discretizations of the equation in 2D, which provide a sparse linear operator with a regular bandwidth structure. The structure of the matrix and its storage scheme are therefore fairly simple. After this introduction, Section 2 presents the domain decomposition technique, Section 3 describes the subdomain solver tuning, Section 4 gives benchmark performance on Beowulf clusters with a high latency network, and Section 5 is our conclusion.
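Before proceeding, the following minimal sketch (our own illustration, not code from the paper) shows the kind of discretization just described: it assembles a second order five-point stencil for equation (1) on a Cartesian grid with homogeneous Dirichlet boundary conditions. The grid size, the sampling of ρ at cell centers, and the arithmetic averaging of ρ at cell faces are illustrative assumptions; the resulting sparse matrix exhibits the regular bandwidth structure mentioned above.

import numpy as np
import scipy.sparse as sp

def assemble_variable_poisson(nx, ny, h, rho):
    """Five-point stencil for -div(rho grad p) = f on an nx-by-ny Cartesian grid
    with homogeneous Dirichlet boundary conditions. rho is sampled at cell
    centers; face values use the arithmetic mean (an assumption)."""
    idx = lambda i, j: i + j * nx                   # lexicographic numbering
    A = sp.lil_matrix((nx * ny, nx * ny))
    for j in range(ny):
        for i in range(nx):
            k = idx(i, j)
            # face coefficients (west, east, south, north)
            rw = 0.5 * (rho[i, j] + rho[i - 1, j]) if i > 0 else rho[i, j]
            re = 0.5 * (rho[i, j] + rho[i + 1, j]) if i < nx - 1 else rho[i, j]
            rs = 0.5 * (rho[i, j] + rho[i, j - 1]) if j > 0 else rho[i, j]
            rn = 0.5 * (rho[i, j] + rho[i, j + 1]) if j < ny - 1 else rho[i, j]
            A[k, k] = (rw + re + rs + rn) / h**2
            if i > 0:      A[k, idx(i - 1, j)] = -rw / h**2
            if i < nx - 1: A[k, idx(i + 1, j)] = -re / h**2
            if j > 0:      A[k, idx(i, j - 1)] = -rs / h**2
            if j < ny - 1: A[k, idx(i, j + 1)] = -rn / h**2
    return A.tocsr()

# constant rho recovers the standard 5-point Laplacian; the half bandwidth is nx
nx = ny = 8
A = assemble_variable_poisson(nx, ny, h=1.0 / (nx + 1), rho=np.ones((nx, ny)))
print(A.shape, A.nnz)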
∗This work was supported in part by NSF Grant ACI-0305405.
†Professor, Department of Computer Science, [email protected]
‡PhD candidate, Department of Computer Science, [email protected]
§Professor, Department of Mechanical and Aerospace Engineering, [email protected]
II. Domain Decomposition Technique
The domain is decomposed into overlapping subdomains, and we use the general framework of the additive Schwarz algorithm.7 Recently, we have introduced a new family of domain decomposition (DD) solvers for elliptic problems8 that provides a framework to build high performance algorithms highly tolerant to low bandwidth and high latency. This technique can be combined with standard DD methods that scale well on uniform parallel architectures, where CPU power is well balanced by network performance. Our method post-processes standard block-wise relaxation schemes, such as the Schwarz method,9, 10 with an Aitken-like acceleration of the sequences of interfaces produced by the Schwarz method. We briefly describe the numerical ideas behind the Aitken-Schwarz (AS) method and refer to11 for a detailed description. For simplicity, we illustrate the concept with the discretized Helmholtz operator L[u] = ∆u − λu, λ > 0, a grid that is a tensorial product of one dimensional grids, and a square domain decomposed into strip subdomains. Let us consider the homogeneous Dirichlet problem L[U] = f in Ω = (0, 1), U|∂Ω = 0, in one space dimension. We restrict ourselves to a decomposition of Ω into two overlapping subdomains Ω₁ ∪ Ω₂ and consider the additive Schwarz algorithm (AdS)

L[u₁ⁿ⁺¹] = f in Ω₁, u₁ⁿ⁺¹|Γ₁ = u₂ⁿ|Γ₁;   L[u₂ⁿ⁺¹] = f in Ω₂, u₂ⁿ⁺¹|Γ₂ = u₁ⁿ|Γ₂,   (2)
with given initial conditions u₁⁰|Γ₂, u₂⁰|Γ₁ to start the iterative process. To simplify the presentation, we assume implicitly in our notation that the homogeneous Dirichlet boundary conditions are satisfied by all intermediate subproblems. This algorithm can be executed in parallel on two computers. At the end of each subdomain solve, the artificial interface values u₂ⁿ|Γ₁ and u₁ⁿ|Γ₂ have to be exchanged between the two computers. In order to avoid redundancy in the computation as much as possible, we fix the overlap between subdomains once and for all to the minimum of one mesh cell. This algorithm can be extended to an arbitrary number of subdomains and is nicely memory-scalable if one neglects the numerical efficiency. As a matter of fact, the communications link only neighboring subdomains. The ratio of communications per flops for one Schwarz iterate stays constant as the number of processors grows linearly with the size of the global problem. However, the additive Schwarz algorithm is one of the worst numerical algorithms to solve the problem, because its convergence is extremely slow.12 The standard procedure is to combine multilevel methods with AdS,7 but this procedure does not scale very well13 and is sensitive to high latency and low bandwidth. The multilevel structure has an unbalanced computation per communication ratio as the grid gets coarse. Thereafter, we introduce a modified version of this Schwarz algorithm, the so-called AS algorithm, that transforms this slow iterative solver into a direct fast solver while retaining as much as possible the memory scalability of the Schwarz algorithm. The idea is as follows. We observe that the interface operator T,

(u₁ⁿ|Γ₁ − U|Γ₁, u₂ⁿ|Γ₂ − U|Γ₂)ᵗ → (u₁ⁿ⁺¹|Γ₁ − U|Γ₁, u₂ⁿ⁺¹|Γ₂ − U|Γ₂)ᵗ,   (3)

is linear. Therefore, the sequence (u₁ⁿ|Γ₁, u₂ⁿ|Γ₂) has linear convergence, that is, it satisfies the identities

u₁ⁿ⁺¹|Γ₂ − U|Γ₂ = δ₁ (u₂ⁿ|Γ₁ − U|Γ₁),   u₂ⁿ⁺¹|Γ₁ − U|Γ₁ = δ₂ (u₁ⁿ|Γ₂ − U|Γ₂),   (4)
where δ₁ (resp. δ₂) is the damping factor associated to the operator L in subdomain Ω₁ (resp. Ω₂).14 Consequently,

u₁²|Γ₂ − u₁¹|Γ₂ = δ₁ (u₂¹|Γ₁ − u₂⁰|Γ₁),   u₂²|Γ₁ − u₂¹|Γ₁ = δ₂ (u₁¹|Γ₂ − u₁⁰|Γ₂).   (5)

So, provided the initial boundary conditions do not already match the exact solution U at the interfaces, the damping factors can be computed from the linear system (5). Since δ₁δ₂ ≠ 1, the limit U|Γᵢ, i = 1, 2, is obtained as the solution of the linear system (4). Consequently, this generalized Aitken acceleration15, 16 procedure gives the exact limit of the sequence on the interfaces Γᵢ based on two successive Schwarz iterates uᵢⁿ|Γᵢ, n = 1, 2, and the initial condition uᵢ⁰|Γᵢ. An additional solve of each subproblem (2) with boundary conditions u∞|Γᵢ gives the final solution of the ODE problem. To improve this first algorithm, one can compute δ₁ and δ₂ beforehand, numerically or analytically, as follows. Let (v₁, v₂) be the solution of

L[v₁] = 0 in Ω₁, v₁|Γ₁ = 1;   L[v₂] = 0 in Ω₂, v₂|Γ₂ = 1.   (6)
We then have δ₁ = v₁|Γ₂ and δ₂ = v₂|Γ₁. Once (δ₁, δ₂) is known, we need only one Schwarz iteration to accelerate the interface and an additional solve for each subproblem, that is, a total of two solves per subdomain. A one dimensional sketch of this two-solve variant is given below.
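The following minimal sketch (our own illustration, not the authors' code) reproduces the two-solve variant for the one dimensional model problem above, written here as −u'' + λu = f (the same problem up to a sign convention): the damping factors are obtained from the discrete analogue of (6), one additive Schwarz iterate is performed from zero interface data, the 2×2 system (4) is solved for the exact interface values, and one last solve per subdomain finishes the job. The grid size, the one-cell overlap, and the test right hand side are illustrative assumptions.

import numpy as np

def solve_subdomain(lam, h, f, left_bc, right_bc):
    """Direct solve of -u'' + lam*u = f on one subdomain with a second order
    finite difference scheme and Dirichlet data left_bc, right_bc; f holds the
    right hand side at the interior points."""
    n = len(f)
    A = np.diag((2.0 / h**2 + lam) * np.ones(n)) \
        + np.diag(-np.ones(n - 1) / h**2, 1) + np.diag(-np.ones(n - 1) / h**2, -1)
    rhs = f.copy()
    rhs[0] += left_bc / h**2
    rhs[-1] += right_bc / h**2
    return np.linalg.solve(A, rhs)

# global problem -u'' + lam*u = f on (0, 1), u(0) = u(1) = 0 (illustrative data)
lam, N = 1.0, 199
h = 1.0 / (N + 1)
x = np.linspace(h, 1.0 - h, N)
f = (np.pi**2 + lam) * np.sin(np.pi * x)          # exact solution sin(pi x)

# Omega1 = points 0..b-1 with interface Gamma1 at x[b];
# Omega2 = points a+1..N-1 with interface Gamma2 at x[a]; overlap of one cell
a, b = 99, 100

# damping factors from the homogeneous problems (6): L[v] = 0, v = 1 on the interface
v1 = solve_subdomain(lam, h, np.zeros(b), 0.0, 1.0)
v2 = solve_subdomain(lam, h, np.zeros(N - 1 - a), 1.0, 0.0)
delta1, delta2 = v1[a], v2[b - (a + 1)]

# one additive Schwarz iterate starting from zero interface data
t1 = solve_subdomain(lam, h, f[:b], 0.0, 0.0)[a]                # trace u1|Gamma2
t2 = solve_subdomain(lam, h, f[a + 1:], 0.0, 0.0)[b - (a + 1)]  # trace u2|Gamma1

# Aitken acceleration: solve the 2x2 system (4) for the exact interface values
M = np.array([[1.0, -delta1], [-delta2, 1.0]])
U_G2, U_G1 = np.linalg.solve(M, np.array([t1, t2]))

# one last solve per subdomain with the accelerated interface values
u1 = solve_subdomain(lam, h, f[:b], 0.0, U_G1)
u2 = solve_subdomain(lam, h, f[a + 1:], U_G2, 0.0)

# the result matches the single-domain finite difference solution to round-off
u_ref = solve_subdomain(lam, h, f, 0.0, 0.0)
print(np.max(np.abs(u1 - u_ref[:b])), np.max(np.abs(u2 - u_ref[a + 1:])))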
The Aitken acceleration thus transforms the additive Schwarz procedure into an exact solver, regardless of the speed of convergence of the original Schwarz method and, in particular, with minimum overlap. This Aitken-Schwarz algorithm can be reproduced for multidimensional problems. As a matter of fact, it is shown17 that the coefficient of each wave number of the sine expansion of the trace of the solution generated by the Schwarz algorithm has its own rate of exact linear convergence. We can then generalize the one dimensional algorithm to two space dimensions, with Ω = (0, π)², as follows:

• Step 1: compute, analytically or numerically and in parallel, each damping factor δⱼᵏ for each wave number k from the two-point one dimensional boundary value problems analogous to (6).

• Step 2: apply one additive Schwarz iterate to the Helmholtz problem with the subdomain solver of choice (multigrid,18 fast Fourier transform,19 PDC3D,20, 21 etc.).

• Step 3: compute the sine expansion ûₖⁿ|Γᵢ, n = 0, 1, k = 1..N, of the traces on the artificial interfaces Γᵢ, i = 1, 2, for the initial boundary condition u⁰|Γᵢ and for the solution given by one Schwarz iterate u¹|Γᵢ; apply the generalized Aitken acceleration separately to each wave coefficient in order to get the limits ûₖ^∞|Γᵢ; recompose the traces u^∞|Γᵢ in physical space.

• Step 4: compute in parallel the solution in each subdomain Ωᵢ, i = 1, 2, with the new inner boundary conditions and the subdomain solver of choice.

We have shown that a generalized Aitken acceleration technique can be applied to an arbitrary number q > 2 of subdomains with strip DD. Our main result is that, no matter what the number of subdomains and the size of the positive overlap are, the total number of subdomain solves required to produce the final solution is still two. The arithmetic complexity of this algorithm can be found analytically, provided the arithmetic complexity of the linear solver used in each subdomain is known. We found that our method is about two times slower than a fast Poisson solver; the fast Poisson solver, however, does not scale on a Beowulf cluster and is impractical for grid computing. Further, the generalized Aitken acceleration of the vectorial sequences of interfaces introduces a coupling between all the interfaces. It is essential to note that this is not necessary for all wave components. As a matter of fact, this generalized Aitken acceleration processes each wave coefficient of the sine expansion of the interfaces independently. Second, the higher the frequency k, the smaller the damping factors δⱼᵏ, j = 1..2q. A careful stability analysis of the method shows that:

• for the low frequency components, we should use the generalized Aitken acceleration coupling all the subdomains;

• for the intermediate frequency components, we can neglect this global coupling and implement only the local interaction between subdomains that overlap;

• for the high frequency components, we do not use the Aitken acceleration, because one iteration of the Schwarz algorithm damps the high frequency error enough.

The AS method gives us the unique ability to compute adaptively the best decomposition above as a function of the architecture of the network of our distributed computing hardware. The algorithm then has the same structure as the two-subdomain algorithm presented above. Steps 1 and 4 are fully parallel. Step 2 requires only local communication and scales well with the number of processors.
Step 3 requires global communication of the interfaces in Fourier space for the low wave numbers, and local communications for the intermediate frequencies. In addition, for moderate numbers of subdomains, the arithmetic complexity of Step 3, which is the kernel of the method, is negligible compared to Step 2. In our recent work, we have achieved the following results: we have extended our method to three dimensional problems with multidimensional DD, to grids that are tensorial products of one dimensional grids with arbitrary (irregular) space steps, and to iterative DD methods such as the Dirichlet-Neumann procedure with non-overlapping subdomains or the red/black subdomain procedure.22, 23, 24
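To make Step 3 concrete for two strip subdomains, here is a minimal sketch (our own illustration, not the authors' code) of the per-wavenumber acceleration: the interface traces are expanded in sine series with a DST, each coefficient is accelerated independently by the scalar analogue of (4), and the accelerated trace is recomposed in physical space. The damping factor profiles and the synthetic traces used to exercise the routine are assumptions chosen so that the exact limit is known.

import numpy as np
from scipy.fft import dst, idst       # DST-I: sine expansion with Dirichlet ends

def aitken_accelerate(g1, g2, t1, t2, d1, d2):
    """Step 3 of the AS algorithm for two strip subdomains.
    g1 = u2^0|Gamma1, g2 = u1^0|Gamma2 : initial interface data (physical space)
    t1 = u1^1|Gamma2, t2 = u2^1|Gamma1 : traces after one Schwarz iterate
    d1[k], d2[k]                       : damping factor of wave number k
    Returns the accelerated traces on Gamma2 and Gamma1."""
    G1, G2 = dst(g1, type=1, norm='ortho'), dst(g2, type=1, norm='ortho')
    T1, T2 = dst(t1, type=1, norm='ortho'), dst(t2, type=1, norm='ortho')
    U2, U1 = np.empty_like(T1), np.empty_like(T2)
    for k in range(len(T1)):                      # every mode is independent
        M = np.array([[1.0, -d1[k]], [-d2[k], 1.0]])
        rhs = np.array([T1[k] - d1[k] * G1[k], T2[k] - d2[k] * G2[k]])
        U2[k], U1[k] = np.linalg.solve(M, rhs)
    return idst(U2, type=1, norm='ortho'), idst(U1, type=1, norm='ortho')

# synthetic check: build traces that satisfy the mode-wise recurrence (4) exactly
N = 63
k = np.arange(1, N + 1)
d1, d2 = 0.9 * np.exp(-0.1 * k), 0.8 * np.exp(-0.1 * k)   # damping decays with k
Uhat2, Uhat1 = np.random.rand(N), np.random.rand(N)       # "exact" interface modes
T1hat = Uhat2 + d1 * (0.0 - Uhat1)                # one iterate, zero initial data
T2hat = Uhat1 + d2 * (0.0 - Uhat2)
phys = lambda c: idst(c, type=1, norm='ortho')
u_g2, u_g1 = aitken_accelerate(phys(np.zeros(N)), phys(np.zeros(N)),
                               phys(T1hat), phys(T2hat), d1, d2)
print(np.max(np.abs(u_g2 - phys(Uhat2))), np.max(np.abs(u_g1 - phys(Uhat1))))

In the full algorithm the loop over modes is restricted as described above: global coupling for the low frequencies, local coupling for the intermediate ones, and no acceleration for the high frequencies.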
For non-separable elliptic problems and non-linear elliptic problems, the Aitken acceleration is no longer exact. The so-called Steffensen-Schwarz variant described in17 and11 uses an approximate reconstruction of the matrix of the trace transfer operator and applies the AS method iteratively. We have shown that AS methods are versatile and robust for a large variety of (non-linear) problems, such as the Bratu problem,25 which is a simplified model for solid combustion,26 and the p-Laplacian, which is of interest in CFD.17, 11 We have also recently generalized its application to tensorial products of one dimensional grids with arbitrary irregular space steps.22, 23, 24 We have developed a mathematical framework for the Aitken-like acceleration of the Schwarz method with general discretizations. The key feature of the theory is to show how one can compute, a posteriori, from a sequence of Schwarz iterates, a limited set of main eigenvectors of the trace transfer operator in order to speed up dramatically the additive Schwarz algorithm. The application to finite volume discretizations on general quadrangles has been presented in.27 The use of the method in the context of heterogeneous domain decomposition with overlapping subdomains has been presented in.23, 28 Let us illustrate the performance of our method on a large scale metacomputing architecture with the Poisson problem.29 This problem is simple but provides one of the most difficult situations for metacomputing, because any perturbation at an artificial interface decreases slowly in space. We use a two-level DD algorithm with a first level of decomposition into macro-blocks processed by AS across the slow network, and a fast DD elliptic solver inside each macro-block. This parallel solver, the so-called PDC3D, developed by Rossi and Toivanen20, 21 following the ideas of Kuznetsov30 and Vassilevski,31 is a parallel, fast, direct solution method for linear systems with separable block tridiagonal matrices. The method has arithmetic complexity O(N log₂ N) and is closely related to the cyclic reduction method, but instead of using the matrix polynomial factorization, the so-called partial solution technique is employed. The PDC3D solver is an almost optimal solver with good parallel scalability on parallel computers whose network performance is balanced with processor performance. But the parallel efficiency of this solver degrades dramatically if any one of the links in the network of the parallel computer slows down. The same phenomenon occurs with Fast Fourier Transforms, which are very sensitive to high latencies in communications. We have been using several Crays in Finland, Germany and the US. Some of these machines have Alpha processors with 450 MHz frequencies, some with 350 MHz. We did a priori load balancing on the heterogeneous network of Cray supercomputers. In all experiments, the numerical accuracy of the solution was checked against exact test functions. The key point is that we were running our metacomputing experiments on two or three supercomputers with the existing ordinary Ethernet network. During all our experiments, the bandwidth fluctuated in the range of 1.6 Mb/s to 5 Mb/s and the latency was about 30 ms. We have been able to solve a Poisson problem with 700 million unknowns on 3 Cray T3Es at HLRS (Stuttgart, Germany), the Von Neumann Institute (Julich, Germany), and PSC (Pittsburgh, USA), with 1280 processors, in less than a minute.
We used almost all the memory available on our network of supercomputers and focused on the extensibility properties of our direct linear solver. In accordance with Gustafson's law, a direct measurement of the speedup is not appropriate. Our two main observations in these experiments are as follows:

• We have in our experiments an incompressible overhead, varying from 17 s to 24 s, that depends mostly on the speed of the network that interlinks our supercomputers. We recall that the bandwidth of the network was in the range of 2 to 5 Mb/s. This overhead is quasi-independent of the size of the problems considered here.

• Besides the overhead due to the network between distant sites, we observe an excellent (memory) scalability of our Poisson solver. This result is a combination of two factors: first, the arithmetic complexity of our solver grows almost linearly with the number of macro subdomains; second, the ratio of computation time to communication time is large even with slow networks and fast supercomputers.

We refer to29 for a detailed presentation of this experiment. Further, to our knowledge, there are no other known results of efficient, large scale metacomputing of PDE problems with tightly coupled computation, as is the case for a Poisson solver. To adapt this fast Poisson solver to a heterogeneous network of computers, one needs to optimize the subdomain solver. We present in the next section a methodology for this purpose.
III. Discussion on Subdomain Solver Tuning
We have written an interface software32 to reuse a broad variety of existing linear algebra software packages for each subdomain, such as LU factorization, a large number of Krylov methods with incomplete LU preconditioners, and geometric or algebraic multigrid solvers. Lapack (from http://www.netlib.gov), Sparskit,33 and Hypre34 have been integrated into this interface. These libraries are representative of the state of the art for the solution of a linear system. However, many options are available for each iterative solver, and each of these packages has a different language and/or style of coding. In our approach, the complexity of choosing the solver arguments is hidden by the fact that the software can determine the best method case by case, by exhaustive experiments on the grid. The optimum design of the solver uses two critical components. First, the software automatically runs a large number of tests on the grid to compare the methods and to generate enough data to produce a simplified model of performance. Second, we formalize the optimum construction of the elliptic solver as an optimization problem and proceed with an optimization method based on a surrogate model of performance. We therefore take advantage of the fact that the grid offers more processing resources than necessary to run the code. The same approach applies to load balancing, which is indeed necessary in grid computing. The grid has heterogeneous resources, and one should balance the various performances of the processing units. Keeping the subdomains topologically equivalent to a block of a Cartesian mesh is a tremendous help for the fine tuning of the load balancing, because partitioning the data is easier. Further, we have a clear performance advantage on RISC processors, because the cache can be used efficiently. Besides the automatic tuning of our subdomain solver, an alternative is to give the choice of the options to the user. Since this programming is cumbersome for the user, we are currently developing a web interface that helps the user fill in all the option fields of the subroutine call. This process is completely transparent and does not require coding by the user. We expect that this web interface will also be used later for pedagogic purposes. We envision that the grid can help us make software more intelligent by learning from previous distributed runs of the linear algebra libraries.
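As an illustration of the kind of exhaustive timing experiment involved, the sketch below (our own, not the project's interface software) times a few candidate solvers on representative subdomain operators and records the winner for each subdomain size. The candidate list, the stand-in operator (a constant coefficient five-point Laplacian built with SciPy), and the sizes are assumptions; the actual harness benchmarks Lapack, Sparskit, and Hypre on the grid.

import time
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def poisson2d(nx, ny):
    """Constant coefficient five-point Laplacian used as a stand-in subdomain operator."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(nx, nx))
    S = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(ny, ny))
    return (sp.kron(sp.identity(ny), T) + sp.kron(S, sp.identity(nx))).tocsc()

# candidate subdomain solvers (stand-ins for the Lapack LU and Sparskit Krylov options)
CANDIDATES = {
    "direct_lu": lambda A, b: spla.splu(A).solve(b),
    "bicgstab":  lambda A, b: spla.bicgstab(A, b)[0],
}

def benchmark(sizes, repeats=3):
    """Exhaustive experiment: best elapsed time of each candidate on each size."""
    results = {}
    for nx, ny in sizes:
        A, b = poisson2d(nx, ny), np.ones(nx * ny)
        for name, solve in CANDIDATES.items():
            best = np.inf
            for _ in range(repeats):
                t0 = time.perf_counter()
                solve(A, b)
                best = min(best, time.perf_counter() - t0)
            results[(nx, ny, name)] = best
    return results

timings = benchmark([(57, 80), (153, 13), (114, 160)])
for nx, ny in {(key[0], key[1]) for key in timings}:
    winner = min(CANDIDATES, key=lambda name: timings[(nx, ny, name)])
    print(nx, ny, "->", winner)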
A. Curved pipe flow
Let us illustrate the performance of the subdomain solvers with an incompressible flow in a curved pipe.28 The model uses the Navier-Stokes equations with a no-slip boundary condition on the wall of the pipe and a prescribed flow speed at the inlet and outlets.

Figure 1. Composite mesh for the curved pipe flow problem (heterogeneous mesh in a U-shaped pipe).
For large Reynolds numbers we use an overlapping DD method (see Figure 1) with non-matching grids. Two thin subdomains, denoted BL1 and BL2, fit the wall and have orthogonal meshes to approximate the boundary layer. The domain denoted RD, covering the central part of the pipe, is polygonal and overlaps the boundary subdomains by a few mesh cells. This is basically a Chimera approach, which is convenient for computing fluid-structure interaction.
Figures 2 and 3 show the performance of the different solvers in RD and in the boundary layer subdomains. The optimum choice of the solver for each subdomain, i.e., the solver that processes the subdomain in the shortest elapsed time, depends on

• the type of subdomain,
• whether or not the same preconditioner or decomposition of the operator is reused,
• the architecture of the processor,
• the size of the problem.

Choosing the wrong solver for a specific subdomain can slow down the whole computation.
Figure 2. Comparison of the elapsed time for each subdomain with preconditioning for the curved pipe flow problem (bar charts of the elapsed time of LU, BCGSTAB, and HYPRE in the BL1, BL2, and RD subdomains).
Figure 3. Comparison of the elapsed time for each subdomain with a precomputed preconditioner for the curved pipe flow problem (bar charts of the elapsed time of LU, BCGSTAB, and HYPRE in the BL1, BL2, and RD subdomains).
We have generated,32 for a variety of subdomain sizes and processor architectures, surface responses to monitor the performance of each subdomain solver. One can then detect which solver gives the most reliable prediction with a low order polynomial approximation of the surface response. One can show that LU is the easiest to model with a surface response, while the performance of Hypre is very sensitive to the subdomain size. Generating these surface responses is cumbersome, but it can be done automatically by distributing the corresponding parallel tasks on a grid of computers; a minimal sketch of such a polynomial fit is given below. In the next section we present the limits of this approach to modeling parallel performance.
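The following sketch (our own illustration) shows what a low order polynomial surface response looks like in practice: elapsed times measured for two solvers over a set of subdomain sizes are fitted by least squares, and the fitted surfaces are used to predict the fastest solver for a size that was never measured. The timing data here are synthetic stand-ins, not measurements from the paper.

import numpy as np

def fit_surface_response(sizes, times, degree=2):
    """Least-squares fit of a low order polynomial surface T(nx, ny) to measured
    elapsed times; sizes is an (m, 2) array of subdomain dimensions (nx, ny)."""
    exps = [(i, j) for i in range(degree + 1) for j in range(degree + 1 - i)]
    X = np.column_stack([sizes[:, 0]**i * sizes[:, 1]**j for i, j in exps])
    coef, *_ = np.linalg.lstsq(X, times, rcond=None)
    return lambda nx, ny: sum(c * nx**i * ny**j for c, (i, j) in zip(coef, exps))

# synthetic stand-in for measured timings of two solvers on a set of subdomain
# sizes (in practice these numbers come from the automatic runs on the grid)
rng = np.random.default_rng(0)
sizes = rng.integers(20, 400, size=(30, 2)).astype(float)
t_lu     = 1e-6 * sizes[:, 0] * sizes[:, 1] * (sizes[:, 0] + sizes[:, 1]) \
           * (1 + 0.05 * rng.standard_normal(30))
t_krylov = 4e-6 * sizes[:, 0] * sizes[:, 1] * (1 + 0.05 * rng.standard_normal(30))

models = {"LU": fit_surface_response(sizes, t_lu),
          "BCGSTAB": fit_surface_response(sizes, t_krylov)}
nx, ny = 250, 120   # a subdomain size that was never measured
best = min(models, key=lambda name: models[name](nx, ny))
print("predicted fastest solver for a %dx%d subdomain: %s" % (nx, ny, best))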
B. Limit of parallel subdomain solver performance
Let us now consider, once and for all, a simple Poisson problem and experiment with the scalability of our AS solver. We are going to check the impact of the choice of the subdomain solver on the overall DD performance. The goal here is not to obtain the best subdomain solver among all existing methods, but rather to use the best solver from Sparskit (the Krylov solver toolkit developed by Saad33) or Lapack only. As a matter of fact, among all existing computational algorithms a full multigrid algorithm should be, in theory, the optimum. The performance modeling of the Poisson subdomain solver on a single processor gives Figure 4. We checked that the elapsed time stays consistent over four successive runs. The horizontal and vertical axes of the figure correspond to the dimensions of the subdomain in each space direction. The dark blue region indicates where Sparskit (BCGSTAB) is faster than Lapack (LU decomposition), while the light blue region shows where Lapack is faster than Sparskit. Let us compare this single-processor prediction model with the performance of the AS algorithm for a fixed size of the subdomain per processor, corresponding to the red points in Figure 4. We test the scalability of the AS algorithm, that is: we fix the size of the subdomain per processor once and for all, and let the number of processors grow. The global size of the problem grows linearly with the number of processors. We use a Beowulf cluster with 48 dual-processor AMD Athlon 1800+ nodes equipped with a Gigabit Ethernet switch.
The following table gives the total elapsed time to solve the problem with each of the two options, i.e., Lapack (LU) or Sparskit (Saad); for each configuration, the smaller of the two times identifies the best subdomain solver.
points per processor    2 processors     4 processors     8 processors     16 processors
                        LU      Saad     LU      Saad     LU      Saad     LU      Saad
100 x 160               1.44    2.26     1.58    1.87     1.43    1.43     1.79    1.82
120 x 160               2.08    3.19     2.45    2.54     2.29    1.98     2.75    2.50
140 x 160               3.28    4.17     3.61    3.36     3.54    2.75     3.89    3.45
160 x 160               4.67    5.21     4.81    4.36     4.95    3.49     5.37    4.37
180 x 160               6.46    6.39     6.43    5.63     6.52    4.31     7.72    5.45
200 x 160               8.24    8.18     8.89    6.69     8.27    5.25     9.69    6.54
Figure 4. Comparison between the Krylov solver (BICGSTAB) and the LU decomposition (Lapack). The dark blue shows the region where the Krylov solver is faster than the LU decomposition, and the light blue shows the region where LU is faster.
We can notice that the prediction of the best subdomain solver according to Figure 4 is that one should use Sparskit (respectively Lapack) when the first space dimension is above 160 (respectively below 160). This prediction is correct for the 2-processor computation. However, as the number of processors grows, this prediction becomes incorrect, and one should favor the Krylov solver. Overall, we observe that

• while the communication schedule in the parallel code is extremely regular, the performance of the parallel code is not monotonic with respect to the global size of the problem; the optimum choice of the solver therefore depends on the number of processors used;

• if one selects the best of the two subdomain solvers for each configuration, the scalability of the code is excellent: for the smallest problems, the total elapsed time does not increase significantly as the number of processors grows while the problem size per processor stays constant; for large problems, the elapsed time even decreases.

We conclude that the optimum choice of the subdomain solver should use a surface response model that includes the number of processors as a third dimension, as sketched below. In the next section we report in more detail on the performance of AS on Beowulf clusters.
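A minimal sketch of such a fit follows (our own illustration, not the paper's tuning software): the elapsed times of the table above, for a fixed second dimension of 160 points, are fitted by a low order polynomial in the subdomain size and the number of processors, and the two fitted surfaces are compared to predict the winner for configurations that were not measured. The polynomial degree and the query sizes are assumptions.

import numpy as np

def fit_response(samples, times, degree=2):
    """Least-squares fit of a low order polynomial response T(n, p), where n is
    the first subdomain dimension (the second is fixed at 160, as in the table
    above) and p is the number of processors."""
    exps = [(i, j) for i in range(degree + 1) for j in range(degree + 1 - i)]
    X = np.column_stack([samples[:, 0]**i * samples[:, 1]**j for i, j in exps])
    coef, *_ = np.linalg.lstsq(X, times, rcond=None)
    return lambda n, p: sum(c * n**i * p**j for c, (i, j) in zip(coef, exps))

# measurements taken from the table above: (n, nprocs) -> elapsed time
n_list = [100, 120, 140, 160, 180, 200]
samples = np.array([(n, p) for p in (2, 8, 16) for n in n_list], dtype=float)
t_lu   = np.array([1.44, 2.08, 3.28, 4.67, 6.46, 8.24,
                   1.43, 2.29, 3.54, 4.95, 6.52, 8.27,
                   1.79, 2.75, 3.89, 5.37, 7.72, 9.69])
t_saad = np.array([2.26, 3.19, 4.17, 5.21, 6.39, 8.18,
                   1.43, 1.98, 2.75, 3.49, 4.31, 5.25,
                   1.82, 2.50, 3.45, 4.37, 5.45, 6.54])

lu, saad = fit_response(samples, t_lu), fit_response(samples, t_saad)
for p in (2, 4, 8, 16):          # predicted winner for a 150x160 subdomain per processor
    print(p, "LU" if lu(150, p) < saad(150, p) else "Saad")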
IV. Beowulf cluster with a high latency network
In this section we concentrate on the speedup performance of our DD solver on a Beowulf cluster equipped with a Gigabit Ethernet network. Our measurements give a network bandwidth in the range of 80-120 Mb/s and a latency of the order of 0.05-0.06 ms. We restrict ourselves to the solution of small two dimensional Poisson problems, to emphasize the impact of high latency on modern classical methods.
Let us show the speedup of AS with Lapack (LU decomposition) and Sparskit (GMRES) as subdomain solvers.

Figure 5. Speedup of AS solved with LU for the Poisson problem (speedup versus number of processors for 160x160, 240x240, and 320x320 points, with the ideal speedup for reference).

Figure 6. Speedup of AS solved with GMRES for the Poisson problem (speedup versus number of processors for 160x160, 240x240, and 320x320 points, with the ideal speedup for reference).
We can notice from Figures 5 and 6 that AS performs very well on small problems. Further, the Krylov method seems to be more sensitive to the cache effect, since we obtain a superlinear speedup. In the rest of this section, we use Sparskit only as the subdomain solver, since it gives us the best performance for a large number of processors. Let us compare the performance of our AS code with the Portable, Extensible Toolkit for Scientific Computation (PETSc).
A. Comparison with PETSc
PETSc35 is an excellent general purpose package for comparison purposes. PETSc consists of a variety of libraries which include many linear solvers, such as Lapack, Krylov solvers, and an algebraic multigrid solver.36

Figure 7. Speedup of Aitken solved with GMRES and of the PETSc multigrid (speedup versus number of processors for problem sizes 198x198 and 398x198, with the ideal speedup for reference).

Figure 8. Elapsed time of Aitken solved with GMRES in each subdomain and of the PETSc multigrid (elapsed time versus number of processors for problem sizes 198x198 and 398x198).
In Figure 7 we report the speedup performance of PETSc and AS on the same graph, while in Figure 8 we give the elapsed time. We choose to run PETSc using a V-cycle multigrid. The preconditioner is of Richardson type, to get a traditional (non-Krylov accelerated) multigrid. One has two pre- and two post-smoothing steps of SSOR (running independently on each process) and a direct LU solve on the coarsest grid. To be precise, the options used in the PETSc code are:

-dmmg_nlevels 3 -mg_levels_ksp_type richardson -mg_levels_pc_type sor -mg_levels_pc_sor_lits 2 -mg_levels_pc_sor_local_symmetric
This combination of options seems to give good performance for PETSc. Our implementation of AS does not neglect any wave components of the interface and uses blocking broadcasts and gathers for the acceleration process in Step 3 of the algorithm. This implementation is therefore far from optimal. PETSc, as expected, is faster than AS with two processors, and also with three processors. However, as the number of processors increases, one can observe that the multigrid solver does not speed up well, while AS performs better. Eventually, for more than three processors, AS gives a better elapsed time than the multigrid solver. PETSc does not have a good speedup; this is explained by the fact that the results are obtained with a high latency network, and PETSc requires a better network to get better performance. This is by no means a general conclusion, because this test case is particularly simple. It is rather a demonstration that the AS algorithm is tolerant of a high latency network, while traditional optimal solvers are not. In this specific case we should have used PETSc as the subdomain solver and AS for the domain decomposition method; this is part of our ongoing software development. Let us conclude this section with a grid computing experiment using the same AS software.
B. Grid computing experiment
The interface for the grid computing experiment is PACX-MPI37, 38 Version 5-0.beta with MPICH-1.2.5. The PACX-MPI (PArallel Computer eXtension) library enables scientists and engineers to seamlessly run MPI-conforming parallel applications on a computational grid, such as a cluster of high-performance computers connected through a high-speed network. Our performance evaluation is done with the AS code between two Beowulf clusters: one, called Dragon, is located at the University of Florida (Gainesville, Florida), and the other, called Stokes, is located at the University of Houston (Houston, Texas). The hardware of each machine is as follows:

• Dragon: Beowulf cluster, APPRO System (dragon.mae.ufl.edu); 8 dual nodes, Intel(R) XEON(TM) CPU 1.80 GHz; cache size 512 KB, main memory 2 GB; Fast Ethernet network: latency 0.06 ms, bandwidth 84 Mb/s.

• Stokes: Beowulf cluster, APPRO System (stokes.cs.uh.edu); 48 dual processor nodes, AMD Athlon(tm) MP 1800+ CPUs; cache size 256 KB, main memory 1 GB; Gigabit Ethernet network: latency 0.05 ms, bandwidth 124 Mb/s.

Figure 9. Scalability performance with the two clusters Dragon and Stokes (elapsed time versus the number of processors in the grid).
We do an a priori load balancing between the two machines: we run a subdomain of size 300x51 on each processor of Stokes and a subdomain of size 300x49 on each processor of Dragon (a sketch of this proportional split is given below). We study the scalability performance in Figure 9, using the same number of processors in each cluster. As the number of processors grows, the overall performance improves, and it eventually deteriorates slightly at 24 processors. Our AS code consequently has very good scalability performance. We obtained fairly consistent performance over three successive runs for each test case; the metacomputing environment is therefore quite stable.
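A minimal sketch (our own illustration) of the a priori load balancing idea: the mesh rows of the decomposition direction are split among the processors in proportion to their relative speeds. The speed ratio used below is an assumption chosen only to reproduce the 51/49 split quoted above.

def split_rows(total_rows, speeds):
    """A priori load balancing: distribute total_rows mesh rows among processors
    in proportion to their relative speeds, rounding so that the sum is exact."""
    total = sum(speeds)
    rows = [int(round(total_rows * s / total)) for s in speeds]
    rows[-1] += total_rows - sum(rows)   # absorb the rounding error in the last block
    return rows

# two processors, one on each cluster; an assumed ~4% speed difference between a
# Stokes node and a Dragon node reproduces the 51/49 row split used in the text
print(split_rows(100, [1.04, 1.00]))     # -> [51, 49]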
V. Conclusion
In this paper, we presented a new DD framework for elliptic solvers that is interesting for distributed computing with a high latency network. This method provides good scalability results. More work needs to be done to optimize the AS method for general linear problems for which the approximation of the main eigenvectors of the trace transfer operator is not easy to obtain. We demonstrated the impact of the choice of the subdomain solver and presented a methodology, based on surface responses, that can help to tune the solver in an optimal way. While generating such models is cumbersome and time consuming, there are advantages to generating them in an automatic manner using the resources of the grid. Grid computing is further complicated by the fact that large distributed systems and broad networks are unreliable for long runs. We also showed that, for a slow network and a simple Poisson problem, AS is efficient and eventually outperforms the multigrid solver of PETSc for a large number of processors. One more step toward making grid computing more valuable for engineering applications will be to make the domain decomposition software fault tolerant.39
Acknowledgments

We would like to thank Barry Smith for his help and advice with the PETSc library.
References

1. Foster, I. and Kesselman, C., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1998.
2. Foster, I. and Karonis, N., "A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems," IEEE/ACM SC98 Conference, 1998, p. 46.
3. Foster, I. and Kesselman, C., "The Globus Project: A Status Report," Proceedings of the Heterogeneous Computing Workshop, IEEE-P, 1998, pp. 4-18.
4. Foster, I., "Advance Computing and Analysis Techniques in Physics Research," VII International Workshop, ACAT 2000, 2000, pp. 51-56.
5. Foster, I., Kesselman, C., Nick, J. M., and Tuecke, S., "Grid Services for Distributed System Integration," Computer, Vol. 35, 2002, pp. 37-46.
6. Foster, I., "The Grid: A New Infrastructure for 21st Century Science," Physics Today, Vol. 55, 2002, pp. 42-47.
7. Smith, B., Bjorstad, P., and Gropp, W., Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996.
8. Garbey, M. and Tromeur-Dervout, D., "Operator Splitting and Domain Decomposition for Multiclusters," Proc. Int. Conf. Parallel CFD99, edited by D. Keyes, A. Ecer, N. Satofuka, P. Fox, and J. Periaux, North-Holland, 1999, pp. 27-36.
9. Schwarz, H. A., "Gesammelte Mathematische Abhandlungen," 2nd Ed., Springer, Berlin, Vol. 2, pp. 133-145; Vierteljahresschrift der Naturforschenden Gesellschaft, Zurich, Vol. 15, 1st Edition, 1870, pp. 272-286.
10. Lions, P. L., "On the Schwarz alternating method I," Domain Decomposition Methods for Partial Differential Equations, edited by R. Glowinski, G. H. Golub, G. A. Meurant, and J. Périaux, SIAM, Philadelphia, 1988, pp. 1-42.
11. Garbey, M. and Tromeur-Dervout, D., "On some Aitken like acceleration of the Schwarz Method," Int. J. for Numerical Methods in Fluids, Vol. 40, 2002, pp. 1493-1513.
12. Cai, Z. and McCormick, S., "Computational Complexity of the Schwarz Alternating Procedure," International Journal of High Speed Computing, Vol. 1, 1989, pp. 1-28.
13. Keyes, D. E., "How Scalable is Domain Decomposition in Practice?" Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, 1998, pp. 286-297.
14. Garbey, M. and Kaper, H. G., "Heterogeneous domain decomposition for singularly perturbed boundary problems," SIAM Journal on Numerical Analysis, Vol. 34, 1997, pp. 1513-1544.
15. Henrici, P., Elements of Numerical Analysis, John Wiley and Sons, New York, 1964.
16. Stoer, J. and Bulirsch, R., Introduction to Numerical Analysis, Springer-Verlag, New York, 1980.
17. Garbey, M. and Tromeur-Dervout, D., "Two level domain decomposition for Multiclusters," 12th Int. Conf. on Domain Decomposition Methods DD12, ddm.org, 2001, pp. 325-339.
18. McCormick, S. F., Multigrid Methods, Vol. 3 of Frontiers in Applied Mathematics, SIAM, Philadelphia, 1987.
19. Van Loan, C. F., Computational Frameworks for the Fourier Transform, Vol. 10 of Frontiers in Applied Mathematics, SIAM, 1992.
20. Rossi, T. and Toivanen, J., "A Nonstandard Cyclic Reduction Method, Its Variants and Stability," SIAM Journal on Matrix Analysis and Applications, Vol. 20, 1999, pp. 628-645.
21. Rossi, T. and Toivanen, J., "A Parallel Fast Direct Solver for Block Tridiagonal Systems with Separable Matrices of Arbitrary Dimension," SIAM Journal on Scientific Computing, Vol. 20, 1999, pp. 1778-1793.
22. Baranger, J., Garbey, M., and Oudin, F., "Recent Development on Aitken-Schwarz Method," 13th Int. Conf. on Domain Decomposition Methods DD13, CIMNE, Barcelona, 2002, pp. 289-296.
23. Baranger, J., Garbey, M., and Oudin, F., "On Aitken Like Acceleration of Schwarz Domain Decomposition Method Using Generalized Fourier," Domain Decomposition Methods in Science and Engineering, DD14, edited by I. Herrera, D. E. Keyes, and R. Yates, UNAM, 2003, pp. 341-348.
24. Baranger, J., Garbey, M., and Oudin, F., "Acceleration of the Schwarz method: the Cartesian grid with irregular space step case," 2004.
25. Garbey, M., "Some Remarks on Multilevel Method, Extrapolation and Code Verification," 12th Int. Conf. on Domain Decomposition Methods DD12, 2001.
26. Wieners, C., "A Parallel Newton multigrid method for high order finite elements and its application to numerical existence proofs for elliptic boundary value problems," Z. Angew. Math. Mech., Vol. 76, No. 3, 1996, pp. 175-180.
27. Garbey, M., "Acceleration of the Schwarz method for elliptic problems," to appear in SIAM, 2005.
28. Garbey, M., Picard, C., and Tay, R. S., "Heterogeneous domain decomposition for multi-scale problems," 43rd Aerospace Sciences Meeting and Exhibit Conference, Reno, 2005.
29. Barberou, N., Garbey, M., Hess, M., Resch, M., Rossi, T., Toivanen, J., and Tromeur-Dervout, D., "Efficient metacomputing of elliptic linear and non-linear problems," Journal of Parallel and Distributed Computing, Vol. 63, 2003, pp. 564-577.
30. Kuznetsov, Y., "Numerical methods in subspaces," Vychislitel'nye Processy i Sistemy II, edited by G. I. Marchuk, Nauka, Moscow, 1985, pp. 265-350.
31. Vassilevski, P. S., "Fast algorithm for solving a linear algebraic problem with separable variables," C. R. Acad. Bulg. Sci., Vol. 37, 1984, pp. 305-308.
32. Garbey, M., Shyy, W., Hadri, B., and Rougetet, E., "Efficient Solution Techniques for CFD and Heat Transfer," ASME Heat Transfer/Fluids Engineering Summer Conference, 2004.
33. Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, 2003.
34. Falgout, R. D. and Yang, U. M., "Hypre: A Library of High Performance Preconditioners," Lecture Notes in Computer Science, Vol. 2331, 2002, pp. 632-641.
35. Balay, S., Gropp, W. D., McInnes, L. C., and Smith, B. F., "PETSc Users Manual," Tech. Rep. ANL-95/11 - Revision 2.1.5, Argonne National Laboratory, 2003.
36. Briggs, W. L., Henson, V. E., and McCormick, S. F., A Multigrid Tutorial, SIAM, 2000.
37. Keller, R., Krammer, B., Mueller, M. S., Resch, M., and Gabriel, E., "MPI Development Tools and Applications for the Grid," 2003.
38. Keller, R., Krammer, B., Mueller, M. S., Resch, M., and Gabriel, E., "PACX-MPI," Tech. Rep., HLRS, 2003.
39. Garbey, M. and Ltaief, H., "Fault Tolerant Domain Decomposition for Parabolic Problems," 16th International Conference on Domain Decomposition Methods, New York, 2005.