Solving PDE Problems on Parallel and Distributed Computer Systems Using the NAG Parallel Library

Arnold Krommer, Mishi Derakhshan and Sven Hammarling

The Numerical Algorithms Group Ltd, Wilkinson House, Jordan Hill Road, Oxford, OX2 8DR, UK
E-mail: [email protected]
Abstract. The NAG Parallel Library enables users to take advantage of the increased computing power and memory capacity offered by multiple processors. It provides parallel subroutines in some of the areas covered by traditional numerical libraries, such as dense and sparse linear algebra, optimization, quadrature and random number generation, as well as utility routines for data distribution, input/output and process management. NAG has recently initiated and is currently participating in the HPCN Fourth Framework project on Parallel Industrial NumErical Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the NAG Parallel Library for dealing with computationally intensive industrial applications by appropriately extending the range of library routines. Additionally, several industrial applications are being ported onto parallel computers within the PINEAPL project by replacing sequential code sections with calls to appropriate parallel library routines. Most of the library material being developed is concerned with the solution of PDE problems. This paper outlines the design of the proposed library extensions, discusses implementation issues, provides examples of library use and illustrates performance results.
1 Introduction

The NAG Parallel Library enables users to take advantage of the increased computing power and memory capacity offered by multiple processors. It provides parallel subroutines in some of the areas covered by traditional numerical libraries (in particular, the NAG Fortran 77, Fortran 90 and C libraries), such as dense and sparse linear algebra, optimization, quadrature and random number generation. Additionally, the NAG Parallel Library supplies utility routines for data distribution, input/output and process management purposes. These utility routines shield users from having to deal explicitly with the message-passing system (which may be MPI [12] or PVM [7]) on which the library is based. Targeted primarily at distributed memory computers and networks of workstations, the NAG Parallel Library also performs well on shared memory computers whenever efficient implementations of MPI or PVM are available.
NAG has recently initiated and is currently participating in the HPCN Fourth Framework project on Parallel Industrial NumErical Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the NAG Parallel Library for dealing with a wide range of computationally intensive industrial applications by appropriately extending the range of library routines. In order to demonstrate the power and the efficiency of the resulting Parallel Library, several application codes from the industrial partners in the PINEAPL project (see Table 1) are being ported onto parallel and distributed computer systems by replacing sequential code sections with calls to appropriate parallel library routines.

  Project Coordinator: NAG

  Industrial Partners          Related Partners
  British Aerospace            Manchester University
  Piaggio                      IBM SEMEA, CPS
  Thomson LCR                  CERFACS
  Danish Hydraulic Institute   Math-Tech ApS

Table 1. PINEAPL Consortium

This paper focuses on the solution of PDE problems based on finite difference, finite element or finite volume methods using library routines developed by NAG. Other library material being developed by PINEAPL partners includes optimization, Fast Fourier Transform, and Fast Poisson Solver routines. Section 2 reviews the fundamental design principles of the existing NAG Parallel Library. The scope of support for the computational steps required to solve PDE problems and the basic design of the relevant routines are outlined in Section 3. The application of library routines available in Release 2 of the NAG Parallel Library is illustrated in Section 4 using a parallel diffusion simulation module. Timing results for the parallel diffusion simulation module on a cluster of DEC AlphaServer 4100s demonstrate the performance of these routines.
2 Design of the NAG Parallel Library

All computational routines incorporated into the NAG Parallel Library, irrespective of the particular numerical problems they deal with, have to adhere to a specific set of design principles. These design principles have been established to ensure that the NAG Parallel Library achieves a high standard of quality with respect to the following criteria:
Ease of Use The intricacy of explicitly parallel programming is a major obstacle to the widespread acceptance of parallel computing as a viable computational paradigm. A parallel numerical library, which provides basic building blocks needed for a wide range of applications, is in itself a fundamental step towards facilitating the use of parallel computing resources. Nevertheless, it is important for the library to be as easy to use as possible. To this end, interfaces to parallel library routines are made to resemble their counterparts in the sequential NAG Fortran library as far as possible. This enables the numerous users of the sequential NAG Fortran library to migrate to parallel computing with a minimum amount of porting effort. Furthermore, from the user's point of view, the differences between sequential programs and their parallel counterparts often become minor when the latter are based on the Single Program Multiple Data (SPMD) programming paradigm. The NAG Parallel Library supports the SPMD programming paradigm in two ways: (i) it provides an SPMD interface to all library routines, even to those which internally employ a different parallelization paradigm (for instance, a master-slave paradigm); (ii) it provides process initialization and management routines for SPMD user programs. (A minimal SPMD program skeleton is sketched at the end of this section.)

Portability The NAG Parallel Library achieves a high degree of portability by relying exclusively on components which themselves have a proven record of portability: sequential computations are almost entirely coded in standard Fortran 77; the remaining code is written in ANSI C. Inter-process communication is mostly performed using the Basic Linear Algebra Communication Subprograms (BLACS) [6]; only a small number of routines use explicit MPI or PVM calls. As a result of using the BLACS as an intermediate communication layer, most library routines can easily be made to work on top of either an MPI or a PVM message-passing system by simply using the MPI or the PVM implementation of the BLACS.

Performance Since the main goal of parallel computing is to increase program performance, ensuring the efficiency of the NAG Parallel Library is of foremost importance. (Performance results for selected library routines on the IBM SP-2 can be found in [3].) Therefore, wherever possible, use is made of standardized software components, in particular the Basic Linear Algebra Subprograms (BLAS) [8],[5],[4], for which machine-optimized versions are available on a broad range of computer systems. Several vendors also supply optimized versions of the BLACS. Additionally, the consistent use of Fortran 77 enables the library to take advantage of the advanced optimization capabilities of many Fortran 77 compilers. Unnecessary data redistribution overhead is generally avoided in the NAG Parallel Library because each numerical routine assumes that the required input data are properly distributed (i.e. in place) on entry to the routine. This is the case, in particular, when the input data of a routine are generated in properly distributed form by other library routines. Additionally, auxiliary data (re)distribution routines are provided to assist users in distributing data appropriately.
Flexibility A number of measures have been taken to ensure that the NAG Parallel Library routines can be used as flexibly as possible. To begin with, since Fortran 77 interfaces are provided for all routines, the NAG Parallel Library can easily be incorporated into Fortran 90, C and C++ programs. Furthermore, by consistently using the BLACS context identifier (which is akin to the MPI communicator) as an argument in every library routine, the NAG Parallel Library lends itself to advanced parallelization techniques such as multigridding.^1 Using the BLACS context identifier also allows for a seamless integration of the NAG Parallel Library with other parallel packages/libraries which are based on the BLACS, for instance, the Parallel BLAS (PBLAS) [2] and the SCAlable Linear Algebra PACKage (ScaLAPACK) [1] routines.

^1 Multigridding enables users to create multiple logical processor grids on a given set of processors and to execute parallel (sub)programs on different grids independently.

Reliability All software engineering measures designed to ensure the reliability of sequential NAG library routines (in particular, the development of a stringent test program for each and every routine) are also applied to NAG Parallel Library routines. Standard NAG error checking procedures are performed to guard routines against incorrect input data. In addition, NAG Parallel Library routines perform rigorous checks on the consistency of global input arguments, i.e. arguments which must have the same value on entry to a routine on all participating processors. They also guarantee that all global output arguments have the same value on exit from a routine on all processors. This applies particularly to the global error argument: the processors participating in a computation are guaranteed to have a consistent view of the success or (reason for) failure of a NAG Parallel Library routine.

Release 1 of the NAG Parallel Library is available on a number of parallel computer systems, including the IBM SP-2, the Cray T3D, the SGI Power Challenge, and the Intel Paragon. Release 2 runs on the IBM SP-2, the Cray T3E, the Hitachi SR2201, the Fujitsu AP3000, and the Fujitsu VPP. The NAG Parallel Library is also available for homogeneous workstation networks of all major vendors, including DEC, HP, IBM, SGI and Sun, and even for PCs running the Linux operating system.
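As promised above, the following is a minimal sketch of the SPMD style: every process executes the same program, and a communicator split mimics the multigridding idea of running independent subcomputations on disjoint logical processor grids. It is written directly against plain MPI rather than the NAG Parallel Library's own initialization utilities (whose argument lists are not reproduced here), so it illustrates the paradigm only, not the library's interfaces.

   ! Minimal SPMD sketch (illustrative only, plain MPI rather than
   ! the NAG Parallel Library's own initialization utilities).
   program spmd_sketch
      use mpi
      implicit none
      integer :: ierr, rank, nproc, color, subcomm, subrank

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)

      ! Split the processes into two independent groups ("grids");
      ! parallel (sub)programs can then run on each grid independently,
      ! which is the essence of multigridding.
      color = mod(rank, 2)
      call mpi_comm_split(mpi_comm_world, color, rank, subcomm, ierr)
      call mpi_comm_rank(subcomm, subrank, ierr)

      ! ... each group would perform its own computation here ...

      call mpi_comm_free(subcomm, ierr)
      call mpi_finalize(ierr)
   end program spmd_sketch

In the library itself the same role is played by the BLACS context identifier: passing different contexts to different library calls confines each computation to its own processor grid.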
3 NAG Parallel Library Support for PDE Solvers

This section reviews the computational steps required to solve PDE problems on parallel computers, outlines the scope of support for these steps provided by the NAG Parallel Library, and describes the salient design features of the relevant library routines.
3.1 Computational Steps in PDE Solvers
Figure 1 provides a schematic representation of the sequence of computational steps usually performed when PDE problems are solved on parallel computers using finite difference, finite element or finite volume discretization techniques:
[Figure 1 (flowchart): Problem Specification; Mesh Generation / Parallel Mesh Generation; Mesh Partitioning; PDE Discretization; Sparse System Solution / Computation of Eigenvalues; with update loops through Error Estimation, Mesh Update, Mesh Repartitioning, Update of PDE Discretization, and Update of Numerical Values of Non-zero Entries; results feeding into Visualization / Other Applications.]

Fig. 1. Computational Steps in PDE Solvers on Parallel Computers
- The problem specification stage comprises formulating the (system of) PDEs to be solved, describing the underlying domain, and entering the PDE coefficients (or coefficient functions) as well as the initial and/or boundary conditions.
- Mesh generation is concerned with producing a computational mesh for the given domain and describing this mesh in terms of entities such as nodes, edges, faces, or cells as well as the geometric relationships between these entities.
- PDE computations are generally parallelized by partitioning the computational mesh and mapping the resulting sub-meshes onto different processors.
- An alternative to generating the computational mesh on a single processor and then partitioning the resulting mesh is to generate the computational mesh in distributed form on several processors concurrently, a technique known as parallel mesh generation.
- PDE discretization involves (i) representing continuous functions defined on the domain as discrete functions defined on the computational grid and (ii) mapping the continuous differential operators in the PDEs onto operators defined in terms of the discrete grid functions. (A one-dimensional example is given after this list.)
- In order to obtain approximate PDE solutions, the sparse linear systems of equations resulting from discretizing (and, if necessary, linearizing) the PDEs have to be solved. Additionally, eigenvalue analyses of the associated coefficient matrices may be required.
- When solving non-linear or time-dependent PDEs, it is often necessary to repeatedly generate matrices with identical sparsity patterns which differ in the numerical values assigned to non-zero entries. This is achieved by updating the numerical values of non-zero entries.
- Furthermore, the accuracy of the computed solutions is assessed by performing appropriate error estimation procedures. Based on such error analyses, the computational mesh may have to be updated (refined or coarsened) locally in different regions, and the resulting mesh may have to be repartitioned between the processors. Additionally, an update of the PDE discretization, corresponding to the updated computational mesh, is required.
- Frequently, the computed solutions serve as input data to other applications. Visualization tools, for instance, take the computed solutions and generate corresponding graphical representations. The quantitative and/or qualitative information generated by post-processing the computed solutions is often used as a basis for modifying the original problem specification (changing the computational domain, for instance).
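To make the discretization step referred to in the list above concrete, consider the one-dimensional model operator $d^2u/dx^2$ on a uniform grid $x_i = x_{lo} + ih$ with grid function values $u_i \approx u(x_i)$; the standard second-order central difference replaces the continuous operator by a discrete one:

\[
  \left.\frac{d^2 u}{dx^2}\right|_{x = x_i}
  \;\approx\; \frac{u_{i-1} - 2u_i + u_{i+1}}{h^2}.
\]

Applied at every interior grid point, this turns a linear PDE into a sparse (here tridiagonal) system of equations in the unknowns $u_i$; the seven-point scheme used in Section 4 is the three-dimensional analogue of this stencil.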
3.2 Scope of Library Support

A number of routines have been and are being developed and incorporated into the NAG Parallel Library in order to assist users in performing some of the computational steps described in the foregoing section. The extent of support provided for the different steps is determined by (i) whether or not the computational tasks involved in a step can be implemented in library form, and (ii) whether or not sufficient gain can be expected from parallelizing a step. For instance, no support is currently provided or planned in the NAG Parallel Library for problem specification, mesh generation, and solution visualization because these computational steps can either not be put into library form or (at least initially) do not warrant any parallelization effort.^2 The routines in the NAG Parallel Library belong to one of the following classes:

^2 This does not in any way preclude other NAG products, for instance, the sequential Fortran 77 library, the AXIOM symbolic computation system, or the IRIS Explorer visualization tools, from providing adequate support.
Mesh Partitioning Routines which decompose a given computational mesh into a number of sub-meshes in such a way that certain objective functions (measuring, for instance, the number of edge cuts) are optimized. They also deal with re-partitioning a distributed computational mesh after mesh refinement or coarsening operations. Additional routines are designed to construct regions of overlap between sub-meshes, which can significantly improve the quality of domain decomposition-based preconditioners [11].

Sparse Linear Algebra Routines which are required for solving systems of linear equations and eigenvalue problems resulting from discretizing PDE problems. These routines can be classified as follows:

Iterative Schemes which are the preferred methods for solving large-scale linear problems, possibly comprising several million degrees of freedom. NAG Parallel Library routines are based on Krylov subspace methods [10], including the Conjugate Gradient (Lanczos) method for symmetric positive-definite problems, the Symmetric LQ method for symmetric indefinite problems, as well as the Restarted Generalized Minimum Residual (Arnoldi), the Conjugate Gradient Squared, the Biconjugate Gradient Stabilized, and the Transpose-Free Quasi-Minimal Residual methods for unsymmetric problems.

Preconditioners which are used to accelerate the convergence of the basic iterative schemes. NAG Parallel Library routines employ a range of preconditioners suitable for parallel execution. These include domain decomposition-based preconditioners, specifically additive and multiplicative Schwarz preconditioners [11]. The subsystems of equations arising in these preconditioners are (approximately) solved on each processor sequentially, either directly (based on incomplete LU or Cholesky factorization algorithms) or iteratively (based on one of the aforementioned iterative schemes).^3 Additionally, preconditioners based on classical matrix splittings, specifically Jacobi and Gauss-Seidel splittings, are provided, and multicolor orderings of the unknowns are applied to achieve a satisfactory degree of parallelism. (A red-black Gauss-Seidel sweep illustrating this idea is sketched after this list.) The quality of splitting-based preconditioners can be improved by applying polynomial acceleration methods. The same multicolor orderings employed in splitting-based preconditioners are also used to derive parallel preconditioners based on zero-fill incomplete LU and Cholesky factorizations.

^3 In the latter case, incomplete LU or Cholesky factorizations can be used as preconditioners on each processor.

Basic Linear Algebra Routines which calculate matrix-vector products involving sparse matrices (an operation required by all iterative schemes) and also provide certain Level-1 BLAS operations for vectors distributed conformally to sparse matrices (see Section 3.3).

Black-Box Routines which provide easy-to-use interfaces at the price of reduced flexibility.

Set-Up Routines which perform matrix transformations (including matrix partitioning, reordering of non-zero entries, and re-indexing of unknowns) and generate auxiliary information required to perform parallel sparse matrix operations efficiently.

Parallel I/O Routines which read the non-zero entries of sparse matrices from sequential data files and store them in distributed form on the different processors. They also write the distributed non-zero entries of sparse matrices to sequential data files. Similar functions are provided for vectors which are distributed conformally to sparse matrices (see Section 3.3). These parallel I/O routines come in handy when PDE solvers have to transmit data to external applications, such as visualization tools.

In-Place Generation Routines which generate the (additional) non-zero entries of sparse matrices concurrently: each processor computes those entries which, according to the given data distribution, have to be stored on it. In-place generation routines are also provided for vectors which are distributed conformally to sparse matrices (see Section 3.3).

Distribution Routines which (re)distribute the non-zero entries of sparse matrices or the elements of vectors distributed conformally to sparse matrices to the different processors according to given distribution schemes.
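The sketch promised above illustrates, in serial form and for the 2-D five-point Laplacian, why multicolor orderings expose parallelism in Gauss-Seidel-type preconditioners: under a red-black coloring, all points of one color depend only on points of the other color, so each half-sweep can update its points simultaneously. This is a generic textbook construction, not a reproduction of any NAG routine.

   ! One red-black Gauss-Seidel sweep for -Laplace(u) = f discretized
   ! with the five-point stencil on an n-by-n interior grid (Dirichlet
   ! data held in the ghost ring u(0,:), u(n+1,:), u(:,0), u(:,n+1)).
   ! All points of a given color are mutually independent, so each of
   ! the two half-sweeps parallelizes trivially.  Generic sketch only.
   subroutine rb_gauss_seidel_sweep(n, h, u, f)
      implicit none
      integer, intent(in)             :: n
      double precision, intent(in)    :: h, f(0:n+1, 0:n+1)
      double precision, intent(inout) :: u(0:n+1, 0:n+1)
      integer :: i, j, color

      do color = 0, 1                 ! 0 = red points, 1 = black points
         do j = 1, n
            do i = 1, n
               if (mod(i + j, 2) == color) then
                  u(i, j) = 0.25d0 * (u(i-1, j) + u(i+1, j)  &
                                    + u(i, j-1) + u(i, j+1)  &
                                    + h*h * f(i, j))
               end if
            end do
         end do
      end do
   end subroutine rb_gauss_seidel_sweep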
3.3 Design and Implementation of Library Routines

An in-depth technical discussion of the design and implementation of the routines described in Section 3.2 is beyond the scope of this paper (see [9]), but the salient design and implementation features can be summarized as follows:

Reverse-communication interfaces are provided for all iterative schemes, which ensures that these routines can be used as flexibly as possible. For instance, since the reverse-communication interfaces are matrix-free, the iterative methods provided can be used in conjunction with any particular storage scheme suitable for a given application. (A sketch of this calling pattern is given at the end of this section.)

Storage formats supported in Release 2 of the library are distributed variants of the coordinate storage and compressed row storage formats. Users may provide matrices in the flexible and easy-to-use coordinate storage format. The set-up routines provided in the NAG Parallel Library then partition the matrix, reorder the non-zero entries and perform a re-indexing of the unknowns in order for subsequent parallel sparse matrix operations to attain maximum performance. In particular, the non-zero entries in each resulting matrix block are stored in compressed row storage format. Storage requirements for symmetric matrices are significantly reduced by storing only the upper triangular parts of the local diagonal blocks (which usually contain most non-zero entries).

Row block distributions are used for sparse matrices. Any such distribution corresponds to a domain decomposition of grid points in the physical domain. Conformal distributions are used for vectors, i.e. vectors are aligned with the rows of the corresponding sparse matrices.

A single descriptor array for each sparse matrix contains all the auxiliary information required to perform parallel sparse matrix operations efficiently, which minimizes the number of arguments in routine interfaces. Using the descriptor array also makes it possible to design routine interfaces which do not explicitly contain any distribution parameters. Hence, additional data distributions (2-D block distributions, for instance), which may be supported in the future, can be accommodated without changing the user interface. The intricate internal structure of the descriptor array is mostly transparent to the user.
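The reverse-communication calling pattern referred to above can be sketched as follows. The routine rc_solver_step and its action codes are hypothetical stand-ins (the actual NAG interfaces, e.g. the F11G*FP suite, have different argument lists); the point is that the solver never touches the matrix, so the caller is free to store it in any format, here compressed row storage.

   ! Sketch of a reverse-communication loop (rc_solver_step is a
   ! hypothetical stand-in for an iterative solver): instead of
   ! performing the matrix-vector product itself, the solver returns
   ! an action code and the caller applies the matrix in whatever
   ! storage format it has chosen.
   !
   !   do
   !      call rc_solver_step(x, y, irevcm)     ! hypothetical interface
   !      if (irevcm == 0) exit                 ! 0 = converged
   !      call csr_matvec(n, val, col, rowptr, x, y)  ! 1 = need y = A*x
   !   end do

   subroutine csr_matvec(n, val, col, rowptr, x, y)
      ! y = A*x for a sparse matrix A held in compressed row storage:
      ! val(k) is the k-th stored non-zero, col(k) its column index,
      ! and rowptr(i):rowptr(i+1)-1 indexes the non-zeros of row i.
      implicit none
      integer, intent(in)           :: n, col(*), rowptr(n+1)
      double precision, intent(in)  :: val(*), x(n)
      double precision, intent(out) :: y(n)
      integer :: i, k

      do i = 1, n
         y(i) = 0.0d0
         do k = rowptr(i), rowptr(i+1) - 1
            y(i) = y(i) + val(k) * x(col(k))
         end do
      end do
   end subroutine csr_matvec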
4 Example: Diffusion Simulation

For demonstration purposes, a simple PDE solver module which uses a number of sparse matrix routines available in Release 2 of the NAG Parallel Library has been developed. Only a very coarse description of the module can be given in this paper. More detailed information can be found in the source code of the PDE solver module, which is available from the authors upon request.
4.1 Module Description

The PDE solver module solves the diffusion equation

\[
  \frac{\partial u}{\partial t} = \nabla \cdot (c\,\nabla u) + f,
  \qquad t \in [t_{lo}, t_{hi}],
  \tag{1}
\]

on a three-dimensional rectangular region $\Omega = [x_{lo}, x_{hi}] \times [y_{lo}, y_{hi}] \times [z_{lo}, z_{hi}] \subset \mathbb{R}^3$. The diffusion coefficient $c : \Omega \to \mathbb{R}$ as well as the forcing term $f : \Omega \times [t_{lo}, t_{hi}] \to \mathbb{R}$ may depend on the spatial coordinates $x$, $y$ and $z$; additionally, the forcing term may be time-dependent. The solution $u$ is subjected to the initial condition

\[
  u(x, y, z, t_{lo}) = g(x, y, z), \qquad (x, y, z)^{\top} \in \Omega,
  \tag{2}
\]

and the inhomogeneous Dirichlet boundary condition

\[
  u(x, y, z, t) = h(x, y, z, t), \qquad (x, y, z)^{\top} \in \Gamma,
  \quad t \in [t_{lo}, t_{hi}],
  \tag{3}
\]

where $g$ and $h$ are given functions and $\Gamma$ denotes the boundary of $\Omega$. The diffusion equation is discretized on a Cartesian grid by using a standard seven-point finite difference scheme in the spatial domain and by applying the trapezoidal rule to the resulting system of coupled ordinary differential equations, a method known as the Crank-Nicolson scheme. The user can specify the number of grid points $n_x$, $n_y$ and $n_z$ in each spatial dimension, the number of time-steps $n_t$, and the number of processors $n_p$ to be used in the simulation. The resulting grid is then partitioned into blocks of $N_b := \lceil (n_x n_y n_z)/n_p \rceil$ consecutive grid points (according to the natural ordering) which are mapped onto the different processors.
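Written out, the time discretization sketched above takes the following form (a standard derivation, not reproduced from the paper): with $L$ denoting the matrix of the seven-point spatial discretization of $\nabla \cdot (c\,\nabla u)$, $u^k$ the vector of grid values at time-step $k$, and boundary contributions folded into the vectors $f^k$, the trapezoidal rule gives

\[
  \Bigl(I - \tfrac{\Delta t}{2}\,L\Bigr)\,u^{k+1}
  \;=\;
  \Bigl(I + \tfrac{\Delta t}{2}\,L\Bigr)\,u^{k}
  \;+\; \tfrac{\Delta t}{2}\,\bigl(f^{k} + f^{k+1}\bigr).
\]

Since $L$ is symmetric and (with the usual sign conventions for a diffusion operator) negative-definite, the system matrix $I - (\Delta t/2)L$ is symmetric positive-definite. This is why one such system must be solved at each time-step and why the Conjugate Gradient method is applicable, as described in Section 4.2.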
4.2 Implementation Using the NAG Parallel Library

Discretizing the diffusion equation gives rise to a sparse matrix A which describes the interrelationship between the approximate solution values at different grid points in consecutive time-steps. The PDE solver module uses the NAG Parallel Library routine F01YAFP to generate the matrix A in parallel: each processor independently generates those rows of A which correspond to the grid points mapped to the processor. The routine F11ZAFP is then called to transform the initial matrix representation in such a way that subsequent operations involving A can be performed efficiently.

The Crank-Nicolson scheme requires a symmetric and positive-definite linear system of equations to be solved for each time-step. Matrix-vector products have to be computed both for setting up the right-hand side of the linear system and for solving the linear system using the iterative Conjugate Gradient (CG) method. The NAG Parallel Library routine F11XAFP, which is called in the PDE solver's initialization section, computes an optimized schedule for the communication operations needed to calculate these matrix-vector products. This optimized schedule takes full advantage of the particular sparsity structure of A in order to keep the communication overhead as low as possible.

The convergence of the CG method can be accelerated by applying an additive Schwarz preconditioner based on approximately solving subsystems of equations (which correspond to the subdomains of the partitioning) using incomplete LU factorizations. This preconditioner is set up concurrently using the NAG Parallel Library routine F11DAFP: each processor independently factorizes the subsystem associated with its subdomain. The user can determine whether or not a preconditioner should be applied and, if so, specify the level of fill or the drop tolerance used in the incomplete LU factorizations.

The values of the initial solution at the different grid points are generated in parallel using the routine F01YEFP at the beginning of the simulation run. F01YEFP is also used at each time-step to concurrently generate the contributions from the forcing term and the boundary condition. The CG method is implemented in a suite of three routines, F11GAFP, F11GBFP and F11GCFP, according to a reverse-communication mechanism. The required matrix-vector products are computed by calling F11XBFP, and the preconditioning equations are solved using F11DBFP. Data files containing the values of the approximate solutions at specific time-steps can be generated (for visualization purposes, for instance) by calling the NAG Parallel Library routine X04YAFP. (A structural sketch of this call sequence is given below.)
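The overall structure of the simulation driver can be summarized as follows. This is a structural sketch only: the NAG calls are indicated by name in comments, their actual argument lists (documented in [9]) are omitted, and the loop bound is an example value.

   ! Structural sketch of the diffusion simulator's call sequence;
   ! NAG Parallel Library routine names as given in the text, argument
   ! lists omitted (see the library manual [9]).
   program diffusion_driver_sketch
      implicit none
      integer :: k, nt

      nt = 100                    ! number of time-steps (example value)

      ! -- initialization ---------------------------------------------
      ! CALL F01YAFP(...)  generate the local rows of A in parallel
      ! CALL F11ZAFP(...)  transform the matrix representation
      ! CALL F11XAFP(...)  compute the communication schedule
      ! CALL F11DAFP(...)  set up the Schwarz/ILU preconditioner (optional)
      ! CALL F01YEFP(...)  generate the initial solution values

      do k = 1, nt                ! -- time-stepping loop --------------
         ! CALL F01YEFP(...)  forcing-term and boundary contributions
         ! CALL F11XBFP(...)  matrix-vector product for the right-hand side
         ! CG solve via reverse communication:
         !   CALL F11GAFP/F11GBFP/F11GCFP(...), with
         !   CALL F11XBFP(...) for the matrix-vector products and
         !   CALL F11DBFP(...) for the preconditioner solves
         ! CALL X04YAFP(...)  write solution values to file (optional)
      end do
   end program diffusion_driver_sketch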
4.3 Performance Results

Figure 2 shows speed-up results of performance tests on a shared memory DEC AlphaServer 4100 5/400 with four CPUs.^4 On this system, Digital MPI, which was used as the underlying communication system in the tests, uses an optimized shared memory transport to exchange messages. The problem size in the experiments was chosen as n_x = n_y = n_z = 32. The results clearly demonstrate that the NAG Parallel Library can achieve nearly linear speed-ups even for relatively small problems^5 on shared memory computers.

^4 The results in this section were obtained by Niall Couse, Digital Equipment International, Galway, Ireland.
^5 The calculation of a single time-step took only 0.35s on a single processor and 0.09s on four processors!
[Figure 2: plot of Speed-up (1 to 4) against Number of Processors (1 to 4).]

Fig. 2. Speed-up of Diffusion Simulator on a Single DEC AlphaServer 4100
[Figure 3: plot of Speed-up (1 to 24) against Number of Processors (1 to 24).]

Fig. 3. Speed-up of Diffusion Simulator on a Cluster of Six DEC AlphaServer 4100s
Figure 3 shows similar speed-up results for a cluster of six AlphaServer 4100 5/400 machines, with four processors each, connected by a Memory Channel network. On this system, Digital MPI uses shared memory transport for intra-host communication and the Memory Channel links for inter-host communication. The problem size in the experiments was chosen as n_x = n_y = n_z = 70. The results demonstrate that the NAG Parallel Library can sustain a high degree of efficiency even for larger numbers of processors.
References

1. J. Choi, J. Demmel, I. Dhillon, J. J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. W. Walker, R. C. Whaley, ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance, in "Applied Parallel Computing" (J. J. Dongarra, K. Madsen, J. Wasniewski, Eds.), Springer-Verlag, Berlin, 1995, pp. 95-106.
2. J. Choi, J. J. Dongarra, S. Ostrouchov, A. Petitet, D. W. Walker, R. C. Whaley, A Proposal for a Set of Parallel Basic Linear Algebra Subprograms, LAPACK Working Note No. 100, Technical Report CS-95-292, Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN, 1995.
3. M. Derakhshan, L. Waters, Speed-up Results for NAG Numerical PVM Library Routines on an IBM SP-2, Technical Report TR3/96, The Numerical Algorithms Group Ltd, Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, UK, 1996.
4. J. J. Dongarra, J. Du Croz, I. S. Duff, S. Hammarling, A Proposal for a Set of Level 3 Basic Linear Algebra Subprograms, in "Parallel Processing for Scientific Computing" (G. Rodrigue, Ed.), SIAM, Philadelphia, PA, 1989, pp. 40-44.
5. J. J. Dongarra, J. Du Croz, S. Hammarling, R. J. Hanson, An Extended Set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Software 14 (1988), pp. 1-32, 399.
6. J. J. Dongarra, R. C. Whaley, A Users' Guide to the BLACS v1.0, LAPACK Working Note No. 94, Technical Report CS-95-281, Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN, 1995.
7. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, MA, 1994.
8. C. L. Lawson, R. J. Hanson, D. Kincaid, F. T. Krogh, Basic Linear Algebra Subprograms for FORTRAN Usage, ACM Trans. Math. Software 5 (1979), pp. 308-323.
9. NAG, NAG Parallel Library Manual, Release 2, The Numerical Algorithms Group Ltd, Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, UK, 1997.
10. Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston, 1996.
11. B. Smith, P. Bjørstad, W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, Cambridge, 1996.
12. M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, J. J. Dongarra, MPI: The Complete Reference, MIT Press, Cambridge, MA, 1996.