Oslo Scientific Computing Archive Report

Parallelization of Explicit Finite Difference Schemes via Domain Decomposition

Elizabeth Acklam

Hans Petter Langtangen

Preliminary draft

Aims and scope: Traditionally, scientific documentation of many of the activities in modern scientific computing, such as code design and development, software guides and results of extensive computer experiments, has received minor attention, at least in journals, books and preprint series, although the results of such activities are of fundamental importance for further progress in the field. The Oslo Scientific Computing Archive is a forum for documenting advances in scientific computing, with a particular emphasis on topics that are not yet covered in the established literature. These topics include design of computer codes, utilization of modern programming techniques, like object-oriented and object-based programming, user's guides to software packages, verification and reliability of computer codes, visualization techniques and examples, concurrent computing, technical discussions of computational efficiency, problem solving environments, description of mathematical or numerical methods along with a guide to software implementing the methods, results of extensive computer experiments, and review, comparison and/or evaluation of software tools for scientific computing. The archive may also contain the software along with its documentation. More traditional development and analysis of mathematical models and numerical methods are welcome, and the archive may then act as a preprint series. There is no copyright, and the authors are always free to publish the material elsewhere. All contributions are subject to a quality control.

Oslo Scientific Computing Archive

Revised Preliminary draft

Title

Parallelization of Explicit Finite Difference Schemes via Domain Decomposition

Contributed by

Elizabeth Acklam
Hans Petter Langtangen

Communicated by

Not yet approved

Oslo Scientific Computing Archive is available on the World Wide Web. The format of the contributions is chosen by the authors, but is restricted to PostScript files, generated from LaTeX, and HTML files for documents with movies and text, and compressed tar-files for software. There is a special LaTeX style file and instructions for the authors. There is also a standard for the use of HTML. All documents must easily be printed in their complete form. There is an associated ISSN number such that libraries can locate hard-copies of a document.

Contents

1 Introduction
2 The goals and the relation to other work
3 Problem formulation
   3.1 Example 1: The wave equation in 1D
   3.2 Example 2: The heat equation in 2D
   3.3 Example 3: The shallow water equations in 2D
4 Software components
   4.1 How to create parallel solvers
   4.2 The sequential heat equation solver
   4.3 Diffpack's interface to MPI
   4.4 The parallel solvers and the managers
5 Tools for parallel finite difference methods
   5.1 The grid and process handler: ParallelFD
   5.2 Communication between processes: the Points classes
6 The parallel shallow water equation solvers
7 Increasing the efficiency
8 Summary


Parallelization of Explicit Finite Difference Schemes via Domain Decomposition

Elizabeth Acklam (Numerical Objects A.S. Email: [email protected])

Hans Petter Langtangen (Mechanics Division, Dept. of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway. Email: [email protected])

Abstract

This report describes software tools in Diffpack that make it easy to parallelize an existing sequential solver. The scope is limited to solvers that employ explicit finite difference methods. This class of problems allows parallelization via exact domain decomposition procedures. The main emphasis of the report is on user-friendly abstractions for communicating field values between different processes. Both standard and staggered grids can be handled. The software setup is in principle general and not limited to explicit finite difference schemes.

1 Introduction

In this report we describe software tools that make it easy to take a standard sequential Diffpack simulation code and develop a version of it that can run effectively on parallel computers. The basic idea of this approach is to formulate the original mathematical problem as a set of subproblems over different parts of the domain. This strategy is usually referred to as domain decomposition. All the subproblems can then be solved concurrently. For our purpose it is essential that each subproblem has the same mathematical structure as the original undecomposed problem, such that one can solve a subproblem by re-using a sequential simulator designed, developed and tested for the original problem. Normally, a domain decomposition approach needs some kind of iteration because all the subproblems are coupled. However, if we restrict the attention to partial differential equations discretized in time by explicit schemes, we can formulate a domain decomposition method for the spatial problems at each time level that is exact and that has fully decoupled subproblems. This class of schemes will be the framework for the present development. First we outline some mathematical details of the method and give some specific examples. Thereafter we outline the basic software abstractions, and finally we apply the software to real problems.

2 The goals and the relation to other work

The main goal of the project that led to this report was to push the quality of the software environment for parallel computing as far as possible, while restricting the complexity of the mathematical and numerical problem. As long as a user has an explicit finite difference scheme for a system of partial differential equations in one, two or three dimensions, with or without staggered grids, on hypercube-shaped domains, the software described here should be directly applicable. No detailed knowledge of or programming with MPI is required, but the user must typically have understood the basic mathematical principles of parallel, explicit finite difference methods and the fundamentals of the message passing programming paradigm. First, a sequential simulator for the problem is developed, and then the parallel version is realized as some add-on routines in a separate file. This means that the user can verify the sequential part of the implementation at any time. We believe that this is an important part of any software development process for parallel computers. The add-on routines for concurrent computing contain high-level statements that are close to the mathematical description of the algorithm. With such tools at hand, utilization of parallel computing environments is made much simpler and the power of super-computing is made available to a larger audience.

Explicit finite difference methods appear in a number of fields in mechanics, e.g. heat transfer, non-dispersive wave phenomena, compressible flow, multi-phase porous media flow (except for an implicit pressure equation), Navier-Stokes solvers (except for an implicit pressure equation), and high-speed penetration problems in elasto-plastic solids. However, in many of these applications it is necessary to use staggered grids, which actually means that each primary unknown scalar field in a system of partial differential equations has its own grid. With low-level Fortran 77 and MPI programming this book-keeping of grid points often leads to subtle errors, but with the proper software abstractions, staggered grid implementation is no more complicated than the mathematics behind it (which is indeed simple).

The goals of parallel scientific computing, as outlined above, are obvious. It is, however, not obvious how to achieve the goals technically. Here we will comment on some attempts to build strong support in software for easy utilization of parallel computing.

The present work is fundamentally different from other contributions that we know of, because it avoids working with distributed data structures (such as distributed vectors, grids etc). Instead, it relies on an overall domain decomposition of the problem that allows the global solver to be implemented in terms of sequential solvers over subdomains. This simplifies the software development dramatically and makes it possible to create practical tools in much shorter time, e.g., within a Master's thesis project.

One attractive attempt to make parallel programming simple is High Performance Fortran (HPF), which is an extension of Fortran 77 for concurrent programming. This approach relies on intelligent compiler technology. There is, e.g., no explicit message passing in HPF programs. The fundamental question is therefore the efficiency of the approach. Hopefully, the future of parallel computing will be in the direction of HPF technology, since this will be the easiest parallel programming tool from a user's point of view. The P++ software [4] provides distributed C++ arrays for parallel computing on a wide range of platforms. It should in principle be easy to build an environment on top of P++ for explicit finite difference methods, but the integration with the rest of Diffpack might face difficulties. The Cogito project [6] is an on-going, long term project at the University of Uppsala. The focus is on explicit (and now also implicit) finite difference methods on overlapping composite grids. The implementation is done in Fortran 90, with C++ emerging on the top level for administering the solution of partial differential equations. The POOMA initiative [5] is a large and serious development of object-oriented techniques for parallel computing in general at Los Alamos National Labs. POOMA is not intended to span the entire field of scientific computing; the project is driven by the needs of a finite set of applications. Some of their current projects are the development of a simulation program for multi-material, compressible hydrodynamics and particle-in-cell plasma simulations. Combined with their new giant super-computers and associated financial support, aimed at eliminating the need for nuclear weapon tests in 10 years, the results of the project are likely to change the known parallel computing paradigms of today. PETSc [1], being developed at Argonne National Labs, is perhaps the most advanced software product for parallel computing at present. It is implemented in C, but exhibits a truly object-oriented design. The emphasis in PETSc is on libraries for linear algebra (linear solvers, preconditioners, eigenvalue solvers). PETSc is used in U.S. industry, e.g., for implicit solution of gas dynamics problems. Unfortunately, PETSc lacks support for easy implementation of partial differential equation problems. Integration of PETSc and Diffpack seems difficult.

The current idea of parallel programming in Diffpack relies on domain decomposition and re-use of sequential solvers, which avoids the need for much of the software of the type that is being developed in the projects listed above.


At present there are two prototype Diffpack environments aimed at concurrent computing, one for explicit finite difference schemes and one for elliptic problems [2] (where domain decomposition is used iteratively). While the latter project focuses on the principal possibility of re-using existing elliptic Diffpack solvers, the present one aims at a more polished software environment for a simpler class of numerical problems. Merging the results of these two developments should provide a basis for a general handling of parallel computing in Diffpack that is based on re-using the sequential libraries as they are today. The preliminary results show that utilization of parallel computers is within reach in the very near future for existing Diffpack simulators across a wide range of application areas.

3 Problem formulation

Assume that we are to solve a system of partial differential equations over some spatial domain $\Omega$. Domain decomposition methods consist in dividing the spatial domain into subdomains $\Omega_s$, $s=1,\ldots,D$, and formulating local initial-boundary value problems on each subdomain. The subdomains can be overlapping or non-overlapping, but we must of course require that $\Omega = \cup_{s=1}^{D}\Omega_s$. For concurrent computing the principal goal is to solve all the subproblems independently of each other, i.e. in parallel. In temporal evolution equations the goal is to solve all the spatial problems at each time level in parallel. However, the fundamental difficulty with this approach is that the system of equations restricted to $\Omega_s$ needs boundary conditions at the non-physical internal boundaries of $\Omega_s$. A simple solution to this problem is to use the solutions at neighbouring subdomains as boundary conditions, but then the subproblem over $\Omega_s$ is coupled to all other subproblems, and the subproblems cannot be solved independently. One can, however, introduce an iteration where the subproblem on $\Omega_s$ applies "old" values from the neighbouring subproblems. Whether the iteration converges is an open question, but the strategy is simple and applicable to a wide range of systems of partial differential equations. An attractive class of problems for which it is straightforward to formulate exact domain decomposition methods, with no need for iteration, is temporal evolution equations discretized by explicit numerical schemes in time. This class of problems can be written

$$u_i^{+}(x) = L_i\big(u_1^{-}(x),\ldots,u_m^{-}(x)\big), \quad i=1,\ldots,m, \qquad (1)$$

where $u_i(x)$ is a function of the spatial coordinates $x\in\Omega$, $L_i$ is some spatial differential operator (possibly non-linear), and $m$ is the number of equations and unknown functions in the underlying partial differential equation system. The superscript $+$ denotes an unknown function at the current time level, while the superscript $-$ refers to a function that is known numerically when solving equation no. $i$.

The problems in (1) are discrete in time, but continuous in space. A domain decomposition method is easily formulated by restricting equation (1) to $\Omega_s$:

$$u_i^{s,+}(x) = L_i\big(u_1^{s,-}(x),\ldots,u_m^{s,-}(x); g^{-}\big), \quad i=1,\ldots,m, \quad s=1,\ldots,D. \qquad (2)$$

The new solution over $\Omega_s$ is $u_i^{s,+}(x)$, while $u_i^{s,-}(x)$ are previously computed solutions over $\Omega_s$. Depending on the discretization of $L_i$, previously computed functions $u_i^{q,-}(x)$ from neighbouring subdomains $\Omega_q$, with $q$ in some suitable index set, are also needed when solving the subproblem on $\Omega_s$. This information is represented as the $g^{-}$ quantity in (2). In other words, all the coupling of the subproblems is through previously computed functions when we work with explicit temporal schemes. We can then solve the subproblems over $\Omega_s$, $s=1,\ldots,D$, at a time level in parallel, and the domain decomposition method is exact, with no need for iteration. Thereafter we need to exchange the $g^{-}$ information between the subproblems before we proceed to the next time level. The mathematical and numerical details will become clearer when we study some specific initial-boundary value problems below.

The simple domain decomposition method in (2) involves two basic components: the numerical scheme for the original problem (restricted to a part $\Omega_s$ of the global domain $\Omega$), and some information about already computed fields in neighbouring subdomains.

Therefore, to solve (2) we can re-use sequential software for the global problem and equip it with some additional functionality for exchanging data between the subproblems. When using C++, Diffpack and object-based concepts, we can build the software for a parallel solver in a clean way such that the debugging of, e.g., communication on parallel computers is minimized. First, one develops a standard sequential solver class for the problem over $\Omega$ and tests this solver. Thereafter, one derives a subclass with additional general data structures for boundary information exchange. This subclass contains little code and relies on the numerics of the hopefully well tested base class for the global problem, as well as on a general high-level communication tool for exchange of boundary information. The subclass solver will be used to solve problems on a subdomain $\Omega_s$, for a general $s$. In addition, we need a manager class that holds some information about the relations between the subdomains $\Omega_s$. The principle is that the managers have a global overview of the subproblems, while the subproblem solvers have no explicit information about other subproblem solvers. At any time, one can pull out the original sequential solver to ensure that its numerics work as intended. With this set-up, it is fairly simple to take an existing sequential solver, equip it with two extra small classes and turn it into a simulation code that can take advantage of parallel computers. Hopefully, this strategy may have a substantial impact on the development time of parallel simulation software.


3.1 Example 1: The wave equation in 1D

Consider the one-dimensional initial-boundary value problem for the wave equation:

$$\frac{\partial^2 u}{\partial t^2} = c^2\frac{\partial^2 u}{\partial x^2}, \quad x\in(0,1),\ t>0, \qquad (3)$$
$$u(0,t) = U_L, \qquad (4)$$
$$u(1,t) = U_R, \qquad (5)$$
$$u(x,0) = f(x), \qquad (6)$$
$$\frac{\partial}{\partial t}u(x,0) = 0, \qquad (7)$$

where $c$ denotes the constant wave speed.

Let $u_i^k$ denote the numerical approximation to $u(x,t)$ at the spatial grid point $x_i$ and the temporal grid point $t_k$, where $i$ and $k$ are integers, $i=i_1,\ldots,i_n$ and $k\ge -1$ (the $k=-1$ value is an artificial quantity that simplifies the coding of the scheme). Assume here that $x_i=(i-1)\Delta x$ and $t_k=k\Delta t$. Defining $C=c\Delta t/\Delta x$, a standard finite difference scheme for this equation reads

$$u_i^{0} = f(x_i), \quad i=i_1,\ldots,i_n,$$
$$u_i^{-1} = u_i^{0} + \tfrac{1}{2}C^2\big(u_{i+1}^{0} - 2u_i^{0} + u_{i-1}^{0}\big), \quad i=i_1+1,\ldots,i_n-1,$$
$$u_i^{k+1} = 2u_i^{k} - u_i^{k-1} + C^2\big(u_{i+1}^{k} - 2u_i^{k} + u_{i-1}^{k}\big), \quad i=i_1+1,\ldots,i_n-1,\ k\ge 0,$$
$$u_{i_1}^{k+1} = U_L, \quad k\ge 0,$$
$$u_{i_n}^{k+1} = U_R, \quad k\ge 0.$$

A domain decomposition for the numerical scheme is easily formulated by dividing the domain $\Omega=[0,1]$ into overlapping subdomains $\Omega_s$, $s=1,\ldots,D$, and applying the scheme to the interior points in each subdomain. We define the subdomain grids more precisely as follows:

$$\Omega_s = [x_L^s, x_R^s], \quad x_i^s = x_L^s + (i-i_1^s)\Delta x, \quad i=i_1^s,\ldots,i_n^s,$$
$$x_{i_1^s}^s = x_L^s, \quad x_{i_n^s}^s = x_R^s, \quad x_L^1 = 0, \quad x_R^D = 1.$$

The optimal overlap can be seen to be one grid cell. Hence, we have

$$x_{i_1^s}^s = x_{i_n^{s-1}-1}^{s-1}, \quad x_{i_1^s+1}^s = x_{i_n^{s-1}}^{s-1}, \quad x_{i_n^s-1}^s = x_{i_1^{s+1}}^{s+1}, \quad x_{i_n^s}^s = x_{i_1^{s+1}+1}^{s+1},$$

for $s=2,\ldots,D-1$. In multi-dimensional problems, and especially when using the finite element method for the spatial discretization, a precise mathematical notation for the overlap is probably not feasible. Instead, we only know that there is some kind of overlap, and the principal functionality we seek is evaluating $u$ in a neighbouring subdomain at a specified point. Let $\mathcal{N}_s$ be an index set containing the neighbouring subdomains of $\Omega_s$ (in 1D $\mathcal{N}_s=\{s-1,s+1\}$). We then need the interpolation operator $I(x,t,\mathcal{N}_s)$ that interpolates the solution $u$ at the point $(x,t)$, where $x$ is known to lie in one of the subdomains $\Omega_q$, $q\in\mathcal{N}_s$. Defining $u_i^{s,k}$ as the numerical solution over $\Omega_s$, we can write the numerical scheme for subdomain $\Omega_s$ as follows:

$$u_i^{s,0} = f(x_i^s), \quad i=i_1^s,\ldots,i_n^s,$$
$$u_i^{s,-1} = u_i^{s,0} + \tfrac{1}{2}C^2\big(u_{i+1}^{s,0} - 2u_i^{s,0} + u_{i-1}^{s,0}\big), \quad i=i_1^s+1,\ldots,i_n^s-1,$$
$$u_{i_1^s}^{s,k} = I(x_{i_1^s}^s, t_k, \mathcal{N}_s),$$
$$u_{i_n^s}^{s,k} = I(x_{i_n^s}^s, t_k, \mathcal{N}_s),$$
$$u_i^{s,k+1} = 2u_i^{s,k} - u_i^{s,k-1} + C^2\big(u_{i+1}^{s,k} - 2u_i^{s,k} + u_{i-1}^{s,k}\big), \quad i=i_1^s+1,\ldots,i_n^s-1,\ k\ge 0,$$
$$u_{i_1^s}^{s,k+1} = U_L \quad \text{if } x_{i_1^s}^s = 0,\ k\ge 0,$$
$$u_{i_n^s}^{s,k+1} = U_R \quad \text{if } x_{i_n^s}^s = 1,\ k\ge 0.$$

From this scheme it becomes clear that we can re-use the code for the original global problem, provided that the domain can be an arbitrary interval and that we have an interpolation operator $I(x,t,\mathcal{N}_s)$ at hand.
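As a small concrete illustration (the numbers are chosen here for illustration and are not taken from the report): let $\Delta x = 0.1$ and $D=2$, with $\Omega_1$ holding the points $0, 0.1, \ldots, 0.5$ and $\Omega_2$ the points $0.4, 0.5, \ldots, 1.0$, so the overlap is the single cell $[0.4,0.5]$. At each time level, the boundary value of subdomain 1 at $x=0.5$ is obtained through $I$ from subdomain 2, where $x=0.5$ is an interior point that has just been updated by the scheme, and the boundary value of subdomain 2 at $x=0.4$ is likewise copied from the interior of subdomain 1; the physical conditions $U_L$ and $U_R$ are still imposed at $x=0$ and $x=1$.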

3.2 Example 2: The heat equation in 2D

To illustrate how the domain decomposition method can easily be extended to a two-dimensional problem, we consider

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}, \quad (x,y)\in(0,1)\times(0,1),\ t>0, \qquad (8)$$
$$u(0,y,t) = 0, \quad t>0, \qquad (9)$$
$$u(1,y,t) = 0, \quad t>0, \qquad (10)$$
$$u(x,0,t) = 0, \quad t>0, \qquad (11)$$
$$u(x,1,t) = 0, \quad t>0, \qquad (12)$$
$$u(x,y,0) = f(x,y), \qquad (13)$$

which is the scaled two-dimensional heat conduction problem. In analogy with example 1 we define $u_{i,j}^k$ as the approximation to $u(x,y,t)$ at the spatial grid point $(x_i,y_j)$ and time step $t_k$, where the integers $i$, $j$ and $k$ are given by $i=1,\ldots,n$, $j=1,\ldots,m$ and $k\ge 0$, and assume that $x_i=(i-1)\Delta x$, $y_j=(j-1)\Delta y$ and $t_k=k\Delta t$.


Discretizing the equations (8)-(13) using an explicit finite difference scheme gives

$$u_{i,j}^{0} = f(x_i,y_j), \quad i=i_1,\ldots,i_n,\ j=j_1,\ldots,j_n,$$
$$u_{i,j}^{k+1} = u_{i,j}^{k} + \Delta t\big([\delta_x^2 u]_{i,j}^k + [\delta_y^2 u]_{i,j}^k\big), \quad i=i_1+1,\ldots,i_n-1,\ j=j_1+1,\ldots,j_n-1,\ k\ge 0,$$
$$u_{i_1,j}^{k+1} = u_{i_n,j}^{k+1} = 0, \quad k\ge 0,$$
$$u_{i,j_1}^{k+1} = u_{i,j_n}^{k+1} = 0, \quad k\ge 0,$$

where

$$[\delta_x^2 u]_{i,j}^k = \frac{1}{\Delta x^2}\big(u_{i-1,j}^k - 2u_{i,j}^k + u_{i+1,j}^k\big), \qquad [\delta_y^2 u]_{i,j}^k = \frac{1}{\Delta y^2}\big(u_{i,j-1}^k - 2u_{i,j}^k + u_{i,j+1}^k\big).$$

We wish to split the domain $\Omega$ into $D$ subdomains $\Omega_s$, $s=1,\ldots,D$, and we define a two-dimensional Cartesian partition, where the subdomain number $s$ is given by a bijective map $\pi$ such that $s=\pi(p,q)$ and $(p,q)=\pi^{-1}(s)$, by

$$\Omega_{p,q} = [x_L^p, x_R^p]\times[y_B^q, y_T^q],$$
$$x_i^p = x_L^p + (i-i_1^p)\Delta x, \quad i=i_1^p,\ldots,i_n^p,$$
$$y_j^q = y_B^q + (j-j_1^q)\Delta y, \quad j=j_1^q,\ldots,j_n^q,$$
$$x_{i_1^p}^p = x_L^p, \quad x_{i_n^p}^p = x_R^p, \quad y_{j_1^q}^q = y_B^q, \quad y_{j_n^q}^q = y_T^q,$$
$$x_L^1 = 0, \quad x_R^P = 1, \quad y_B^1 = 0, \quad y_T^Q = 1.$$

Here we have $p=1,\ldots,P$, $q=1,\ldots,Q$ and $D=P\cdot Q$. Our numerical scheme has a five-point computational molecule, i.e., for each point $u_{i,j}$ we need to update, we only need values at points that are one grid cell away; hence we only need one grid cell of overlap between the domains. For $p=2,\ldots,P-1$ and $q=2,\ldots,Q-1$ we then have

$$x_{i_1^p}^p = x_{i_n^{p-1}-1}^{p-1}, \quad x_{i_1^p+1}^p = x_{i_n^{p-1}}^{p-1}, \quad x_{i_n^p-1}^p = x_{i_1^{p+1}}^{p+1}, \quad x_{i_n^p}^p = x_{i_1^{p+1}+1}^{p+1},$$
$$y_{j_1^q}^q = y_{j_n^{q-1}-1}^{q-1}, \quad y_{j_1^q+1}^q = y_{j_n^{q-1}}^{q-1}, \quad y_{j_n^q-1}^q = y_{j_1^{q+1}}^{q+1}, \quad y_{j_n^q}^q = y_{j_1^{q+1}+1}^{q+1}.$$

If either $P=1$ or $Q=1$ we note that we have a one-dimensional domain division, similar to the one presented in example 1. The set of neighbouring subdomains $\mathcal{N}_{p,q}$ is given by

$$\mathcal{N}_{p,q} = \{(p+1,q),\ (p-1,q),\ (p,q+1),\ (p,q-1)\}.$$

In conformity with the one-dimensional case we then denote by $I(x,y,t,\mathcal{N}_{p,q})$ the interpolation operator which interpolates $u$ at $(x,y,t)$, where $(x,y)$ is in one of the neighbouring subdomains $\Omega_{k,l}$, $(k,l)\in\mathcal{N}_{p,q}$.

The numerical equations for subdomain $\Omega_{p,q}$ then read

$$u_{i,j}^{p,q,0} = f(x_i^p, y_j^q), \quad i=i_1^p,\ldots,i_n^p,\ j=j_1^q,\ldots,j_n^q,$$
$$u_{i_1^p,j}^{p,q,k} = I(x_{i_1^p}^p, y_j^q, t_k, \mathcal{N}_{p,q}), \quad j=j_1^q,\ldots,j_n^q,\ k>0,$$
$$u_{i_n^p,j}^{p,q,k} = I(x_{i_n^p}^p, y_j^q, t_k, \mathcal{N}_{p,q}), \quad j=j_1^q,\ldots,j_n^q,\ k>0,$$
$$u_{i,j_1^q}^{p,q,k} = I(x_i^p, y_{j_1^q}^q, t_k, \mathcal{N}_{p,q}), \quad i=i_1^p,\ldots,i_n^p,\ k>0,$$
$$u_{i,j_n^q}^{p,q,k} = I(x_i^p, y_{j_n^q}^q, t_k, \mathcal{N}_{p,q}), \quad i=i_1^p,\ldots,i_n^p,\ k>0,$$
$$u_{i,j}^{p,q,k+1} = u_{i,j}^{p,q,k} + \Delta t\big([\delta_x^2 u]_{i,j}^{p,q,k} + [\delta_y^2 u]_{i,j}^{p,q,k}\big), \quad i=i_1^p+1,\ldots,i_n^p-1,\ j=j_1^q+1,\ldots,j_n^q-1,\ k\ge 0,$$
$$u_{i_1^1,j}^{1,q,k+1} = 0, \quad j=j_1^q,\ldots,j_n^q,\ k\ge 0,$$
$$u_{i_n^P,j}^{P,q,k+1} = 0, \quad j=j_1^q,\ldots,j_n^q,\ k\ge 0,$$
$$u_{i,j_1^1}^{p,1,k+1} = 0, \quad i=i_1^p,\ldots,i_n^p,\ k\ge 0,$$
$$u_{i,j_n^Q}^{p,Q,k+1} = 0, \quad i=i_1^p,\ldots,i_n^p,\ k\ge 0.$$

Again it is evident that a sequential program solving the global problem can be re-used for solving the problem over each subdomain, provided that we add functionality for communicating interpolated values between the subdomain solvers. A code sketch of the global update is given below.
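To make the mapping from the scheme to code concrete, the following is a minimal sketch of one time step of the global heat problem, using plain C++ arrays rather than the Diffpack field classes used later in the report; all names are chosen for this illustration only.

#include <vector>

typedef std::vector<std::vector<double> > Array2D;

// One explicit step of the 2D heat scheme: u holds u^k, u_new receives u^{k+1}.
// Only the inner points are updated here; the homogeneous Dirichlet boundary
// values are set elsewhere (cf. the split into computeInnerPoints and
// computeBoundaryPoints in section 4).
void heatStep (const Array2D& u, Array2D& u_new, double dt, double dx, double dy)
{
  const int n = u.size(), m = u[0].size();
  for (int i = 1; i < n-1; ++i)
    for (int j = 1; j < m-1; ++j)
      u_new[i][j] = u[i][j]
        + dt*((u[i-1][j] - 2*u[i][j] + u[i+1][j])/(dx*dx)
            + (u[i][j-1] - 2*u[i][j] + u[i][j+1])/(dy*dy));
}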

3.3 Example 3: The shallow water equations in 2D

We now consider the two-dimensional linear equations for wave motions in shallow water, which read

$$\frac{\partial\eta}{\partial t} = -\frac{\partial}{\partial x}(Hu) - \frac{\partial}{\partial y}(Hv), \quad (x,y)\in(0,1)\times(0,1),\ t>0, \qquad (14)$$
$$\frac{\partial u}{\partial t} = -g\frac{\partial\eta}{\partial x}, \quad (x,y)\in(0,1)\times(0,1),\ t>0, \qquad (15)$$
$$\frac{\partial v}{\partial t} = -g\frac{\partial\eta}{\partial y}, \quad (x,y)\in(0,1)\times(0,1),\ t>0, \qquad (16)$$
$$u(0,y,t) = 0, \qquad (17)$$
$$u(1,y,t) = 0, \qquad (18)$$
$$v(x,0,t) = 0, \qquad (19)$$
$$v(x,1,t) = 0, \qquad (20)$$
$$\eta(x,y,0) = \eta_0(x,y), \qquad (21)$$
$$u(x,y,0) = 0, \qquad (22)$$
$$v(x,y,0) = 0. \qquad (23)$$


Here $u(x,y,t)$ and $v(x,y,t)$ are the velocity components in the $x$ and $y$ directions respectively, $\eta(x,y,t)$ is the surface elevation, $g$ is the acceleration of gravity and $H=H(x,y)$ is the stillwater depth. Let $\phi_{i,j}^k$ denote the numerical approximation to a function $\phi(x,y,t)$ at the space-time point $(x_i,y_j,t_k)$, where $x_i=(i-1)\Delta x$, $y_j=(j-1)\Delta y$, and $t_k=k\Delta t$, with $k\ge 0$. The primary unknowns $u$, $v$ and $\eta$ are usually defined at different space-time points, i.e., the grid is staggered. A popular scheme for solving the shallow water equations employs the so-called Arakawa C grid. This staggered grid is depicted in figure 1. We denote the primary discrete unknowns in the C grid as $\eta_{i+\frac12,j+\frac12}^{k}$, $u_{i,j+\frac12}^{k+\frac12}$ and $v_{i+\frac12,j}^{k+\frac12}$. We can then see from figure 1 that integer $i$ and $j$ values correspond to the solid lines. Using a C grid and approximating first order derivatives by centered finite differences results in the following discrete initial-boundary value problem:

$$\eta_{i+\frac12,j+\frac12}^{0} = \eta_0(x_{i+\frac12},y_{j+\frac12}), \quad i=i_{e,1},\ldots,i_{e,n},\ j=j_{e,1},\ldots,j_{e,n},$$
$$u_{i,j+\frac12}^{\frac12} = 0, \quad i=i_{u,1},\ldots,i_{u,n},\ j=j_{u,1},\ldots,j_{u,n},$$
$$v_{i+\frac12,j}^{\frac12} = 0, \quad i=i_{v,1},\ldots,i_{v,n},\ j=j_{v,1},\ldots,j_{v,n},$$
$$\eta_{i+\frac12,j+\frac12}^{k+1} = \eta_{i+\frac12,j+\frac12}^{k} - \frac{\Delta t}{\Delta x}\Big(H_{i+1,j+\frac12}\,u_{i+1,j+\frac12}^{k+\frac12} - H_{i,j+\frac12}\,u_{i,j+\frac12}^{k+\frac12}\Big) - \frac{\Delta t}{\Delta y}\Big(H_{i+\frac12,j+1}\,v_{i+\frac12,j+1}^{k+\frac12} - H_{i+\frac12,j}\,v_{i+\frac12,j}^{k+\frac12}\Big),$$
$$\qquad i=i_{e,1},\ldots,i_{e,n},\ j=j_{e,1},\ldots,j_{e,n},\ k\ge 0,$$
$$u_{i,j+\frac12}^{k+\frac32} = u_{i,j+\frac12}^{k+\frac12} - g\frac{\Delta t}{\Delta x}\Big(\eta_{i+\frac12,j+\frac12}^{k+1} - \eta_{i-\frac12,j+\frac12}^{k+1}\Big), \quad i=i_{u,1}+1,\ldots,i_{u,n}-1,\ j=j_{u,1},\ldots,j_{u,n},\ k\ge 0,$$
$$v_{i+\frac12,j}^{k+\frac32} = v_{i+\frac12,j}^{k+\frac12} - g\frac{\Delta t}{\Delta y}\Big(\eta_{i+\frac12,j+\frac12}^{k+1} - \eta_{i+\frac12,j-\frac12}^{k+1}\Big), \quad i=i_{v,1},\ldots,i_{v,n},\ j=j_{v,1}+1,\ldots,j_{v,n}-1,\ k\ge 0,$$
$$u_{i_{u,1},j+\frac12}^{k+\frac32} = u_{i_{u,n},j+\frac12}^{k+\frac32} = 0, \quad j=j_{u,1},\ldots,j_{u,n},\ k\ge 0,$$
$$v_{i+\frac12,j_{v,1}}^{k+\frac32} = v_{i+\frac12,j_{v,n}}^{k+\frac32} = 0, \quad i=i_{v,1},\ldots,i_{v,n},\ k\ge 0,$$

where

$$H_{i,j+\frac12} = (H_{i,j} + H_{i,j+1})/2, \qquad H_{i+\frac12,j} = (H_{i,j} + H_{i+1,j})/2.$$

From figure 1 we observe that the number of $u$, $v$ and $\eta$ points differs, and this is the reason for introducing the many limits (like $i_{u,1}, i_{u,n}, i_{v,1},\ldots$) in the indices.


Figure 1: Staggered C-grid (the markers indicate the locations of the $\eta$, $u$ and $v$ points).

When dividing the staggered grid domain into subdomains, we use the same method and definitions as in example 2. The decomposed problem on a subdomain $\Omega_{p,q}$ reads

$$\eta_{i+\frac12,j+\frac12}^{p,q,0} = \eta_0(x_{i+\frac12}^p, y_{j+\frac12}^q), \quad i=i_{e,1}^p,\ldots,i_{e,n}^p,\ j=j_{e,1}^q,\ldots,j_{e,n}^q,$$
$$u_{i,j+\frac12}^{p,q,\frac12} = 0, \quad i=i_{u,1}^p,\ldots,i_{u,n}^p,\ j=j_{u,1}^q,\ldots,j_{u,n}^q,$$
$$v_{i+\frac12,j}^{p,q,\frac12} = 0, \quad i=i_{v,1}^p,\ldots,i_{v,n}^p,\ j=j_{v,1}^q,\ldots,j_{v,n}^q,$$
$$u_{i_{u,1}^p,j+\frac12}^{p,q,k+\frac12} = I(x_{i_{u,1}^p}^p, y_{j+\frac12}^q, t_k, \mathcal{N}_{p,q}), \quad j=j_{u,1}^q,\ldots,j_{u,n}^q,\ k>0,$$
$$u_{i_{u,n}^p,j+\frac12}^{p,q,k+\frac12} = I(x_{i_{u,n}^p}^p, y_{j+\frac12}^q, t_k, \mathcal{N}_{p,q}), \quad j=j_{u,1}^q,\ldots,j_{u,n}^q,\ k>0,$$
$$v_{i+\frac12,j_{v,1}^q}^{p,q,k+\frac12} = I(x_{i+\frac12}^p, y_{j_{v,1}^q}^q, t_k, \mathcal{N}_{p,q}), \quad i=i_{v,1}^p,\ldots,i_{v,n}^p,\ k>0,$$
$$v_{i+\frac12,j_{v,n}^q}^{p,q,k+\frac12} = I(x_{i+\frac12}^p, y_{j_{v,n}^q}^q, t_k, \mathcal{N}_{p,q}), \quad i=i_{v,1}^p,\ldots,i_{v,n}^p,\ k>0,$$
$$\eta_{i+\frac12,j+\frac12}^{p,q,k+1} = \eta_{i+\frac12,j+\frac12}^{p,q,k} - \frac{\Delta t}{\Delta x}\Big(H_{i+1,j+\frac12}^{p,q}\,u_{i+1,j+\frac12}^{p,q,k+\frac12} - H_{i,j+\frac12}^{p,q}\,u_{i,j+\frac12}^{p,q,k+\frac12}\Big) - \frac{\Delta t}{\Delta y}\Big(H_{i+\frac12,j+1}^{p,q}\,v_{i+\frac12,j+1}^{p,q,k+\frac12} - H_{i+\frac12,j}^{p,q}\,v_{i+\frac12,j}^{p,q,k+\frac12}\Big),$$
$$\qquad i=i_{e,1}^p,\ldots,i_{e,n}^p,\ j=j_{e,1}^q,\ldots,j_{e,n}^q,\ k\ge 0,$$
$$u_{i,j+\frac12}^{p,q,k+\frac32} = u_{i,j+\frac12}^{p,q,k+\frac12} - g\frac{\Delta t}{\Delta x}\Big(\eta_{i+\frac12,j+\frac12}^{p,q,k+1} - \eta_{i-\frac12,j+\frac12}^{p,q,k+1}\Big), \quad i=i_{u,1}^p+1,\ldots,i_{u,n}^p-1,\ j=j_{u,1}^q,\ldots,j_{u,n}^q,\ k\ge 0,$$
$$v_{i+\frac12,j}^{p,q,k+\frac32} = v_{i+\frac12,j}^{p,q,k+\frac12} - g\frac{\Delta t}{\Delta y}\Big(\eta_{i+\frac12,j+\frac12}^{p,q,k+1} - \eta_{i+\frac12,j-\frac12}^{p,q,k+1}\Big), \quad i=i_{v,1}^p,\ldots,i_{v,n}^p,\ j=j_{v,1}^q+1,\ldots,j_{v,n}^q-1,\ k\ge 0,$$
$$u_{i_{u,1}^1,j+\frac12}^{1,q,k+\frac32} = 0, \quad j=j_{u,1}^q,\ldots,j_{u,n}^q,\ k\ge 0,$$
$$u_{i_{u,n}^P,j+\frac12}^{P,q,k+\frac32} = 0, \quad j=j_{u,1}^q,\ldots,j_{u,n}^q,\ k\ge 0,$$
$$v_{i+\frac12,j_{v,1}^1}^{p,1,k+\frac32} = 0, \quad i=i_{v,1}^p,\ldots,i_{v,n}^p,\ k\ge 0,$$
$$v_{i+\frac12,j_{v,n}^Q}^{p,Q,k+\frac32} = 0, \quad i=i_{v,1}^p,\ldots,i_{v,n}^p,\ k\ge 0.$$

The half indices and the unequal number of grid points for each field in a staggered grid introduce some difficulties compared with ordinary grids, especially when dividing the grids into subdomains. However, the tools presented here, together with the Diffpack tools for staggered grids, significantly simplify the index book-keeping. One should also note that the numerical scheme requires boundary conditions on $u$ at $x_{i_{u,1}^p}^p$ and $x_{i_{u,n}^p}^p$ and on $v$ at $y_{j_{v,1}^q}^q$ and $y_{j_{v,n}^q}^q$; hence the $u$-grid must be overlapping in the $x$-direction and the $v$-grid must be overlapping in the $y$-direction, whereas the $\eta$-grid requires no overlap in any direction.
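To make the staggered indexing concrete, here is a minimal sketch of the interior $\eta$ update with plain C++ arrays; the storage convention and all names are assumptions made for this illustration (eta[i][j] holds $\eta_{i+1/2,j+1/2}$, u[i][j] holds $u_{i,j+1/2}$, v[i][j] holds $v_{i+1/2,j}$ and H[i][j] holds $H(x_i,y_j)$), not the Diffpack representation.

#include <vector>

typedef std::vector<std::vector<double> > Array2D;

// Interior eta update on an Arakawa C grid; the depth H is averaged onto the
// u and v points exactly as in the scheme above.
void etaStep (const Array2D& eta, Array2D& eta_new,
              const Array2D& u, const Array2D& v, const Array2D& H,
              int i1, int in, int j1, int jn, double dt, double dx, double dy)
{
  for (int i = i1; i <= in; ++i)
    for (int j = j1; j <= jn; ++j) {
      double Hu_r = 0.5*(H[i+1][j] + H[i+1][j+1])*u[i+1][j]; // H_{i+1,j+1/2} u_{i+1,j+1/2}
      double Hu_l = 0.5*(H[i  ][j] + H[i  ][j+1])*u[i  ][j]; // H_{i,j+1/2}   u_{i,j+1/2}
      double Hv_t = 0.5*(H[i][j+1] + H[i+1][j+1])*v[i][j+1]; // H_{i+1/2,j+1} v_{i+1/2,j+1}
      double Hv_b = 0.5*(H[i][j  ] + H[i+1][j  ])*v[i][j  ]; // H_{i+1/2,j}   v_{i+1/2,j}
      eta_new[i][j] = eta[i][j] - dt/dx*(Hu_r - Hu_l) - dt/dy*(Hv_t - Hv_b);
    }
}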

4 Software components

Developing a parallel program is always more complicated than developing a sequential program. The principal difficulty is of course to derive algorithms that can be run concurrently. However, such algorithms are now known for many classes of partial differential equation problems, e.g., for explicit finite difference schemes as we have described above. Transforming a concurrent algorithm into running code is unfortunately a time-consuming task, mainly because the parallel programming tools are primitive. The standard message passing protocol today is MPI [3], but MPI programming tends to be notoriously error-prone. There are many versions of the same basic MPI functionality, and it is difficult for novice users to pick the correct version.

The argument lists are long, and a complete knowledge of the details on the byte level is required. Finally, debugging parallel programs is much more complicated than debugging sequential codes. Better debuggers will hopefully improve this situation. If we want to increase the human efficiency in developing parallel computer codes, we should develop a software environment where the programmer can concentrate on the principal steps of concurrent algorithms and not on MPI-specific details. It is near optimal if the programmer can start with developing a sequential solver and then in just a few steps transform this solver to a parallel version. Realization of such a software environment is indeed possible and requires a layered design of software abstractions, where all explicit MPI calls are hidden in the most primitive layer, and where the interface to message passing tools is simple and adapted to the programming standard of sequential solvers. We have developed this type of software environment in Diffpack. The basic contents of the environment are described below. We use the heat conduction example to demonstrate how the software components are used to create parallel simulators in a fast and safe way.

4.1 How to create parallel solvers

It is instructive to outline how we build a parallel solver for explicit finite difference schemes in Diffpack. First, one develops a standard sequential solver, using grid and field abstractions (GridLattice, FieldFD etc). The "heart" of such a solver is the updating loops implementing the explicit finite difference scheme. These loops should be divided into loops over the inner points and loops over the boundary points. As we shall see later, this is essential for parallel programming. The solver must also be capable of dealing with an arbitrary rectangular grid. As in Diffpack environments for solving systems of partial differential equations by assembling independent solvers for the different components in the system, it is useful to have a scan function in the solver that can either create the grid internally or use a ready-made external grid. We want the sequential solver to be able to solve the partial differential equations on its own, hence it must be able to create a grid. On the other hand, when it is part of a parallel algorithm, the grid is usually made by some tool that has an overview of the global grid and the global solution process. This flexibility with respect to handling a grid is also desirable for other data members of the solver (e.g. time parameters). Having carefully tested a flexible sequential solver on the global domain, the steps towards a parallel version of the solver are then as follows (see figure 2 for an overview of the class structure).


1. Derive a subclass of the sequential solver that implements details specific to concurrent computing. This subclass acts as the subdomain solver. The code of the subclass is usually very short, since major parts of the sequential solver can be re-used also for parallel computing. Recall that this was our main goal in the mathematical formulation of the initial-boundary value problems over the subdomains.

2. Make a manager class for administering the subdomain solvers. The manager has the global view of the concurrent solution process. It divides the grid into subdomains, calls the initialization processes in the subdomain solver and administers the time loop. The subdomain solver has a pointer to the manager class, which enables insight into neighbouring subdomain solvers for communication etc. The code of the manager class is also short, mainly because it is derived from a general toolbox for parallel programming of explicit finite difference schemes (ParallelFD).

On each process we run a manager and a subdomain solver. The details of these classes and the C++ tools they utilize are explained below.

Figure 2: Sketch of a sequential solver (MyPDE), its subclass (MyPDEs) for solving the problem on a subdomain as part of a concurrent algorithm, and the manager (ManagerMyPDE) that administers the global parallel scheme. Solid arrows indicate class derivation ("is-a" relationship), whereas dashed arrows indicate pointers ("has-a" relationship).
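In code, the pattern of figure 2 amounts to the following skeleton; the class names follow the figure, while the members shown are illustrative assumptions.

class ManagerMyPDE;            // forward declaration

class MyPDE {                  // sequential solver for the global problem
public:
  virtual void timeLoop () {}  // explicit scheme over its own grid
  virtual ~MyPDE () {}
};

class MyPDEs : public MyPDE {  // "is-a" MyPDE: solves the problem on one subdomain
  ManagerMyPDE* boss;          // "has-a" pointer to its manager
public:
  MyPDEs (ManagerMyPDE* boss_) : boss(boss_) {}
};

class ManagerMyPDE {           // global view: partitions the grid, runs the time loop
  MyPDEs* solver;              // "has-a" pointer to the local subdomain solver
public:
  ManagerMyPDE () : solver(0) {}
};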

4.2 The sequential heat equation solver

Following standard Diffpack examples on creating finite difference solvers, the heat equation simulator is typically realized as a C++ class with the following content.

#ifndef Heat2D_h_IS_INCLUDED
#define Heat2D_h_IS_INCLUDED

// Diffpack headers for the grid, field and time classes used below (assumed names)
#include <GridLattice.h>
#include <FieldFD.h>
#include <TimePrm.h>

class Heat2D
{
protected:
  Handle(GridLattice) grid;      // lattice grid, here a 2D grid
  Handle(FieldFD)     u;         // fields to store u on this
  Handle(FieldFD)     u_prev;    // and the previous time step
  Handle(TimePrm)     tip;       // hold time information, like tstop and dt

  CPUclock clock;                // for time taking within the program
  real compute_time;

  int ui0, uin, uj0, ujn;        // start/stop indices: loops over grid points
  real mu, nu;                   // help variables for efficiency

  real initialField(real x, real y);   // initial disturbance

  virtual void setIC ();
  virtual void timeLoop ();
  void computeInnerPoints();
  void computeBoundaryPoints();
  void updateDataStructures();

public:
  Heat2D ();
 ~Heat2D () {}
  void scan (GridLattice* grid_ = NULL, TimePrm* tip_ = NULL);
  void solveProblem ();          // solve the problem
  virtual void resultReport();
};
#endif

The most important point is the two functions computeInnerPoints and computeBoundaryPoints, which separate the updating of inner points from the updating of boundary points.

4.3 Diffpack's interface to MPI

The exchange of data structures between different processes is usually performed using the message passing principle and a special message passing protocol. As mentioned earlier, the current standard message passing protocol is MPI. MPI is a library of functions that supports Fortran or C programming, that is, the MPI functions look like ordinary Fortran or C functions to a programmer. As we have already mentioned, direct MPI programming will in practice often give rise to subtle errors that may be difficult to track down. We have therefore used the strengths of C++ (overloaded functions, classes, dynamic binding) to build a generic, re-usable, simple interface to the most common MPI routines that are needed for solving partial differential equations in parallel. This interface consists of three classes: DistrProcManager, Topology and CharPack. DistrProcManager supports MPI communication, whereas Topology is a class hierarchy that handles the process topology using advanced MPI functionality.


The CharPack class simplifies the packing and unpacking of complicated data structures, byte by byte, into a long char array. This array can then easily be communicated between processes. For example, a class may pack and unpack itself for communication using the CharPack tool.

MPI initialization and clean-up. Before using MPI in a program, the initializing function MPI_Init must be called, followed by calls to the functions MPI_Comm_rank and MPI_Comm_size. An instance of the DistrProcManager class is constructed inside the initDiffpack function, and the MPI initializations are called during construction of the process manager object. The function finishing up MPI in the program (MPI_Finalize) is correspondingly called when the process manager object goes out of scope at the end of the program. This means that during program execution there is one globally located process manager on each processor, created and destroyed without any explicit calls from the user, and all the functions in the DistrProcManager class are available through this object.
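For reference, the plain MPI calls that the process manager object hides correspond roughly to the following minimal program (the Diffpack layer itself is not shown here):

#include <mpi.h>
#include <cstdio>

int main (int argc, char** argv)
{
  MPI_Init (&argc, &argv);                 // done when the process manager is constructed
  int rank, size;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);   // this process' identification number
  MPI_Comm_size (MPI_COMM_WORLD, &size);   // total number of processes
  std::printf ("process %d of %d\n", rank, size);
  MPI_Finalize ();                         // done when the process manager goes out of scope
  return 0;
}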

Simplified interfaces to MPI calls. One attractive feature of C++ is its ability to distinguish between calls to functions with the same name but with different arguments. This has been used in the DistrProcManager class, allowing the user to call for instance the broadcast function with the variable to be broadcast as parameter; C++ then automatically chooses the appropriate broadcast function depending on the variable type. Using MPI's MPI_Bcast function would have required the datatype to be specified explicitly. The ability to use default arguments is another feature that is used frequently in DistrProcManager. Most calls to MPI functions require a communicator object as argument. However, the average user is likely never to use any other communicator than MPI_COMM_WORLD, hence using this as a default argument to all the functions in DistrProcManager allows the users to omit this argument unless they wish to use a different communicator. By taking advantage of overloaded functions and arguments with default values, the interface to common MPI calls is significantly simplified by our DistrProcManager class. In the same manner as the broadcast functions were built into DistrProcManager, other useful functions, e.g. the functions for sending and receiving messages, both blocking (MPI_Send and MPI_Recv) and non-blocking (MPI_Isend and MPI_Irecv), have been included in this class, allowing communication of variables of any kind, for instance integers, reals, strings and vectors. For example, native Diffpack vectors can be sent or received directly, without much MPI competence or need for explicitly extracting the underlying C data structures of the Diffpack vector classes.

Figure 3: Relation between coordinates and rank for four processes in a (2 x 2) process grid: rank 0 has coordinates (0,0), rank 1 has (0,1), rank 2 has (1,0) and rank 3 has (1,1).
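The gain in user code can be indicated as follows; the ProcManager type and its broadcast member are written here as an assumed interface in the spirit of DistrProcManager, not as the exact Diffpack signatures.

#include <mpi.h>

// Assumed interface: overloading picks the MPI datatype from the argument
// type, and the communicator defaults to MPI_COMM_WORLD.
struct ProcManager
{
  void broadcast (double& x, int root = 0, MPI_Comm comm = MPI_COMM_WORLD)
  { MPI_Bcast (&x, 1, MPI_DOUBLE, root, comm); }
  void broadcast (int& x, int root = 0, MPI_Comm comm = MPI_COMM_WORLD)
  { MPI_Bcast (&x, 1, MPI_INT, root, comm); }
};

void readAndDistribute (ProcManager& pm)
{
  double dt = 0.0; int nsteps = 0;
  // with raw MPI one would write MPI_Bcast(&dt, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  pm.broadcast (dt);       // the compiler selects the double version
  pm.broadcast (nsteps);   // the compiler selects the int version
}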

Flexible assignment of process topology. In many applications, the order in which the processes are arranged is not of vital importance. By default, a group of n processes is given numbers from 0 to n-1. However, this linear ranking of the processes does not always reflect the geometry of the underlying numerical problem. If the numerical problem consists of a two- or three-dimensional grid, a corresponding two- or three-dimensional process grid reflects the communication pattern better. Setting up the process topology can be complicated, and the required source code follows the same pattern in all programs. Hence, the topology functions have been built into a Topology class. Topology has two subclasses: GraphTopology, where the communication pattern is represented by a graph (suitable for unstructured grids), and CartTopology, where the processes are organized in a Cartesian structure with a row-major process numbering beginning at 0, as indicated in figure 3. Through these classes, the user can set up the desired topology by supplying the topology information in a user-friendly format such as

CART d=2 [2,4] [ dpFALSE dpFALSE ]

for setting up a two-dimensional Cartesian process grid with (2 x 4) processes and a non-periodic grid in both directions. The CartTopology class also provides functions that return the rank of a process given its process coordinates and vice versa. We refer to the MPI documentation for more information on the topology concept and functionality. It is important to notice the difference between the logical topology (also called virtual topology) and the underlying physical hardware. The machine may exploit the logical process topology when assigning the processes to the physical processors, if it helps to improve the communication performance on the given machine.


If the user does not specify any logical topology, the machine will make a random mapping, which on some machines may lead to unnecessary contention in the interconnection network. Setting up the logical topology may therefore give performance benefits as well as large benefits for program readability. We believe that the Topology hierarchy is one of the most important contributions of the present work for simplifying the development of parallel simulators.
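Behind a menu line such as CART d=2 [2,4] [ dpFALSE dpFALSE ], a CartTopology object essentially performs the following MPI calls (a sketch; the class internals are not shown in the report):

#include <mpi.h>

void makeCartTopology (MPI_Comm& cart_comm, int my_coords[2])
{
  int dims[2]    = {2, 4};   // a 2 x 4 process grid
  int periods[2] = {0, 0};   // non-periodic in both directions (dpFALSE dpFALSE)
  MPI_Cart_create (MPI_COMM_WORLD, 2, dims, periods, /*reorder=*/1, &cart_comm);

  int rank;
  MPI_Comm_rank (cart_comm, &rank);
  MPI_Cart_coords (cart_comm, rank, 2, my_coords);  // map rank -> (p,q) coordinates
}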

4.4 The parallel solvers and the managers

Subdomain partition and the parallel algorithm. In section 3.2 we learned that with a five-point computational molecule the optimal overlap is one grid cell. Figure 4 illustrates the overlapping grids on some subdomains, and we observe that the different subdomain grids have different sizes, as the domains only overlap at boundaries that do not coincide with the global boundary.

Figure 4: Illustration of overlapping grids on some subdomains (six subdomains, numbered 0-5).

We introduce the terms overlapping grid points and genuine grid points: overlapping grid points are points whose values are evaluated in a neighbouring subdomain, while genuine grid points are those for which this subdomain is in charge of updating the value, as illustrated in figure 5. Global boundary conditions are set in genuine grid points, while the boundary values received from other subdomains are placed in the overlapping grid points. When reporting results after the computations are finished, the genuine points are used. When we consider communication and computations, however, we need a different classification of the points, hence we introduce the following.


Figure 5: The shaded points are overlapping points with other subdomains, while the clear points are genuine points on this subdomain.

Grid points at the boundary of a subdomain are said to have offset +1. The points just inside these are said to have offset 0, while the remaining inner points have offset -1 or less, as illustrated in figure 6. The points with offset +1 are either a genuine part of the grid (on global boundaries) or they correspond to a point with offset 0 on a neighbouring subdomain. When applying the finite difference scheme on a subdomain, we need boundary values from the neighbouring subdomains at the points with offset +1. These values are used in the difference equations for the points with offset 0. The main idea in the parallel algorithm is to communicate these boundary values while applying the finite difference scheme to all points with offset -1 or less. When the boundary values have been received, one can apply the scheme to the points with offset 0, and the solution over the subdomain is complete. MPI allows computations and communication to be performed simultaneously. Technically, this is achieved by non-blocking send and receive calls. We can therefore devise the following algorithm for a subdomain solver:

1. Send requested points to all neighbouring subdomains (these points will be used as boundary conditions in the neighbouring subdomain solvers).

2. Update the solution at all pure inner points (those with offset -1 and less).

3. Receive boundary values at points with offset +1 from all neighbouring subdomains.

4. Update the solution at the points with offset 0.

A code sketch of these four steps is given below.
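Only the MPI calls in the sketch are real; the helper functions, buffers and their names are assumptions for this illustration (the actual Diffpack classes hide these details).

#include <mpi.h>
#include <vector>

// Assumed helpers, declared only: they pack/unpack boundary values and apply
// the finite difference scheme to the two groups of points.
void packBoundaryValues (int neighbour, std::vector<double>& buf);
void unpackBoundaryValues (int neighbour, const std::vector<double>& buf);
void computeInnerPoints ();     // offset -1 and less
void computeBoundaryPoints ();  // offset 0

void solveAtThisTimestep (const std::vector<int>& neighbours,
                          std::vector<std::vector<double> >& send_buf,
                          std::vector<std::vector<double> >& recv_buf)
{
  const int nn = neighbours.size();
  std::vector<MPI_Request> reqs (2*nn);

  for (int n = 0; n < nn; ++n) {                       // step 1: initiate the exchange
    packBoundaryValues (neighbours[n], send_buf[n]);
    MPI_Isend (&send_buf[n][0], send_buf[n].size(), MPI_DOUBLE,
               neighbours[n], 0, MPI_COMM_WORLD, &reqs[n]);
    MPI_Irecv (&recv_buf[n][0], recv_buf[n].size(), MPI_DOUBLE,
               neighbours[n], 0, MPI_COMM_WORLD, &reqs[nn+n]);
  }

  computeInnerPoints ();                               // step 2: pure inner points

  MPI_Waitall (2*nn, &reqs[0], MPI_STATUSES_IGNORE);   // step 3: boundary values arrive
  for (int n = 0; n < nn; ++n)
    unpackBoundaryValues (neighbours[n], recv_buf[n]);

  computeBoundaryPoints ();                            // step 4: points with offset 0
}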


Figure 6: The dark shaded grid points have offset +1, the light shaded points have offset 0, and the clear grid points have offset -1 and less.

For small grids we may have to wait for the messages to arrive, because the time spent on communication may exceed the time spent computing inner points. The length of a message depends on the circumference of the grid, while the number of inner values to be computed depends on the area of the grid. Therefore, as the size of the grid grows, the time spent on computing grows faster than the time spent on communication, and at some point the communication time becomes insignificant. The speedup is then (close to) optimal; having n processors reduces the CPU time by a factor of n.
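A rough count makes this concrete (the numbers are illustrative assumptions, not measurements from the report): on an $n \times n$ subdomain grid, one time step updates on the order of $n^2$ inner points, while the messages to the four neighbours together contain on the order of $4n$ values. The ratio of communication to computation therefore behaves like $4n/n^2 = 4/n$ and vanishes as the subdomain grid is refined; for small $n$, the constant costs per message (latency in particular) dominate and the processors end up waiting for data.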

The subclass solver. The basic finite difference schemes for updating the solution at inner and boundary points are provided by the sequential solver, which is class Heat2D in our current example. Communication of boundary values between processors is the only additional functionality we need for turning the sequential solver into parallel code. We add the communication functionality in a subclass Heat2Ds. Its header is like this:

#ifndef Heat2Ds_h_IS_INCLUDED
#define Heat2Ds_h_IS_INCLUDED

// Headers for the base class and the communication tool (assumed names)
#include <Heat2D.h>
#include <PointsFD.h>

class ManagerHeat2D;

class Heat2Ds: public HandleId, public Heat2D
{
  friend class ManagerHeat2D;

  // Parallel stuff
  Ptv(int) coords_of_this_process;  // my processor coordinates
  Ptv(int) num_solvers;             // number of processes in each space dir
  int s;                            // my subdomain no. is s
  VecSimple(int) neighbours;        // identify my neighbours
  ManagerHeat2D* boss;              // my manager keeps track of global info

  PointsFD bc_points;               // boundary points for communication

  void initBC ();
  void setBC1 ();
  void setBC2 ();
  void solveAtThisTimestep();

public:
  Heat2Ds (ManagerHeat2D* boss_);
 ~Heat2Ds () {}
  void scan();
  void resultReport();
};
#endif

Figure 7: Updating the value in subdomain 0 (a point with offset 0) requires values from subdomains 1 and 2. The indicated value in subdomain 3 (a point with offset -1) depends only on local information and can be updated without any values communicated from the neighbours.

Class Heat2Ds inherits the code from class Heat2D, but extends it for parallel computing:

- We have some data members representing the processor coordinates of the current process, identification of neighbours, etc.

- A PointsFD structure holds the boundary points for communication (the points to send and receive and their values).

- initBoundaryCommunication initializes the PointsFD structure.

- sendBoundaryPoints applies the PointsFD structure for sending the boundary values to the neighbours.

- receiveBoundaryPoints applies the PointsFD structure for receiving the boundary values from the neighbours.

We observe that there is a close relation between the program abstractions and the components of the parallel algorithm.

The manager class. The manager class performs no computations; it merely scans and holds global information and issues the basic steps of the parallel algorithm, using functionality in the subdomain solvers. Much of the code in a manager class can be re-used from problem to problem, and it is therefore natural to collect such generic software components in a base class, here called ParallelFD. The particular ManagerHeat2D class looks as follows.

#ifndef ManagerHeat2D_h_IS_INCLUDED
#define ManagerHeat2D_h_IS_INCLUDED

// Headers for the toolbox base class and the subdomain solver (assumed names)
#include <ParallelFD.h>
#include <Heat2Ds.h>

class ManagerHeat2D : public ParallelFD
{
  friend class Heat2Ds;
  Handle(Heat2Ds) solver;

  // hold global grid information:
  Handle(GridLattice) global_grid;

  // global variables for the smaller grid used to report the results
  Handle(GridLattice) global_resgrid;
  Handle(FieldFD)     global_res_u;

  Handle(TimePrm) tip;
  VecSimple(int)  neighbours;

  String time_data;     // hold data for time parameters
  String grid_data;     // hold data for the global grid
  int resgrid_size;     // the size in each direction of the result-report grid

  void scan (MenuSystem&);
  void define (MenuSystem& menu, int level = MAIN);
  void distributeData ();
  void initSystem ();
  void timeLoop ();

public:
  ManagerHeat2D ();
 ~ManagerHeat2D () {}
  virtual void adm (MenuSystem& menu);
  void gatherData (FieldFD& local_res);
  void solveProblem ();
  void resultReport ();
};
#endif

Only one manager actually scans the input data. This manager, often called the master and recognized by having process identification number 0, then distributes the information to the other managers.

5 Tools for parallel finite difference methods

5.1 The grid and process handler: ParallelFD

Figure 8: Sketch of a sequential solver, its subclass, its manager and the toolboxes that these classes utilize (ParallelFD, Points and PointsFD). Solid arrows indicate class derivation ("is-a" relationship), whereas dashed arrows indicate pointers ("has-a" relationship).

Figure 8 displays the simulator from figure 2, but now with the relations to the Points, PointsFD and ParallelFD classes. The present section explains these tools in more detail. The parallel toolbox, ParallelFD, contains functionality generally needed for solving problems with finite difference methods on parallel computers. The ParallelFD class is in touch with the basic parallel class DistrProcManager, which supplies information such as the rank of the process and the total number of processes. It also holds the topology object from which we can obtain information related to the multidimensional topology. One of the main tasks of the manager class is to divide the global grid into subdomains. This general problem is supported by class ParallelFD. The global grid is here divided in a manner such that all the local grids are approximately the same size, ensuring that the load is evenly balanced between the processes. There are two versions of the function initLocalGrid, one for ordinary finite difference grids and one for staggered grids.


The latter version is more advanced and allows a different size of the overlap in the different space directions. When dividing a problem into subproblems, we want each grid point in the local grid to have exactly the same coordinates as the corresponding grid point on the global grid, i.e. the point on the global grid that the local grid point represents. This is useful when setting the initial condition, as it is usually given by a function that takes the coordinates of each point in order to find the correct initial value; hence the local coordinates must be correct in a global view. As we shall see later, this is also important in the initial set-up of boundary points for communication. The index set on each subdomain may be chosen arbitrarily: either one can let each grid point have the same indices as the corresponding global grid point, or one can have a common base index for all the local grids. Either way, the index will only be used locally by each subdomain.

A major problem in parallel computing is the assembly of the subdomain solutions for visualization. Most visualization systems require a global grid with corresponding global fields. We have therefore made some preliminary tools for collecting the subdomain solutions in a global grid. The purpose of going parallel is often to increase the size of the global grid, and in such cases no single processor has enough memory to store the global solution. To solve this problem, we let each subdomain solver prolongate its fine grid solution to a coarser grid. The coarse grid solution is sent to the master manager. The master manager simply applies tools in class ParallelFD for combining coarse grid subdomain solutions into a manageable global, coarse grid solution that is useful for visualization. The functionality related to constructing a coarse grid, global solution includes the functions initResultGrid and gatherFields in class ParallelFD. For the prolongation of a solution from the fine to the coarse grid, one can use standard approaches known from multigrid methods, or one can stick to simpler strategies, like using nested coarse and fine grids and picking out the point values, or performing an interpolation. The prolongation part is somewhat primitive in this respect at the time of this writing.
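As an indication of what the grid division involves, the following sketch computes the index range owned by process p in one space direction, with a one-cell overlap added towards the interior neighbours; this is a simplified illustration, not the actual initLocalGrid code.

// Split n_global points in one direction among P processes (p = 0,...,P-1).
void localRange (int n_global, int P, int p, int& i_first, int& i_last)
{
  int base = n_global / P, rest = n_global % P;   // balance the load
  i_first = p*base + (p < rest ? p : rest);
  i_last  = i_first + base + (p < rest ? 1 : 0) - 1;
  if (p > 0)     i_first -= 1;                    // one-cell overlap to the left
  if (p < P-1)   i_last  += 1;                    // one-cell overlap to the right
}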

5.2 Communication between processes: the Points classes

A central issue in parallel algorithms for solving partial differential equations is to communicate values at the boundaries between subdomains. Some of the required functionality is common to a wide range of problems, and can hence be collected in a general class, here called Points, whereas other types of functionality depend on the particular solution method being used. The special functionality for explicit finite difference methods is collected in class PointsFD, which is a subclass of Points. The Points classes offer functionality for

- setting up data structures for the boundary points and values to be sent and received,

- sending boundary values to neighbours,

- receiving boundary values from neighbours.

Class Points has general functions for sending and receiving boundary values. Subclasses, like PointsFD, are mainly concerned with setting up the proper data structures for the information on boundary points and values. Let us go somewhat more into the details of the general process of communicating boundary values. Assuming overlapping domains of arbitrary shape, we first identify all points with offset +1 in which we need to receive values from other neighbours. When receiving a boundary value, this value should be associated with a grid point in the present subdomain. For fast storage of the value in the internal data structures (a FieldFD object), we should have the corresponding grid indices at hand. In order to send values to the neighbours, we need to hold information about the grid points in which the values are needed. When sending a boundary value to a neighbour, we need in general to interpolate the field value at a grid point specified by our neighbour. Interpolation on uniform finite difference grids is fast, but in finite element methods one should avoid repeating the interpolation process at every time step; the element containing the point and the local element coordinates should be stored for later use. From this discussion we see that we need data structures for

- the boundary values (a real array is sufficient here),

- indices for fast storage of received values in internal field data structures,

- coordinates for the points from which updated values should be extracted,

- possibly auxiliary data for enhancing the efficiency of the interpolation process performed at the request of a neighbour.

The boundary points in a subdomain overlap with several different subdomains. To handle a general case we propose to send the boundary points of a subdomain to all its neighbours and let each one of them determine which of the points belong to that domain. The result of this operation is hence a distribution of the responsibility for providing boundary information. A subdomain solver can then just think of all its neighbours as one surrounding grid. This avoids detailed book-keeping of the "sides" of a subdomain and their neighbours in the subdomain solver. In the PointsFD class we find a function initBcPointsList which creates a list of all the boundary points. For each point, it holds the coordinates of the point and the local indices of the point on this subdomain.


Once the boundary list is made, the initial communication can begin. The initBoundaryCommunication function converts the list to a format that MPI can handle, and the list is sent to all the neighbours of this subdomain. When the lists are received by the neighbours, they are unpacked and sorted. Each subdomain has then received one boundary list from each neighbour, and for each of these neighbours we create an array called send_coords and one called send_indices. The subdomain can then, for each neighbour, run through the received lists of boundary information and extract the points that are inside this subdomain, i.e. points for which this subdomain can provide updated values. When a point in the list is found to be inside this subdomain's grid, the coordinates of that point are placed in the send_coords array. The indices that follow each set of coordinates mean nothing to this subdomain, as the indexing is local on each subdomain; however, their use will become evident shortly. Every time we place a set of coordinates in the send_coords array we place the corresponding indices in the send_indices array. Once all the points have been extracted, we send the send_indices back to the subdomain from which the point list originally came. The subdomains receive the lists and place them in arrays called recv_indices. Each subdomain now has a collection of send_coords arrays, one for each neighbour, containing the coordinates of all the points in which these neighbours request values. They also have a collection of recv_indices arrays, containing the indices of the grid points to which the boundary values received later belong. This allows us, throughout the program, to send the values alone without any kind of identification, because they are packed using the send_coords array and unpacked at the receiving subdomain using the recv_indices array, which is organized in exactly the same way as the send_coords array in the neighbouring subdomain.

There are some things that can be done to increase the efficiency of this communication set-up. When the initial boundary list is sent to the neighbours and a neighbour extracts the boundary points, it must, for each set of coordinates, check whether the point is inside its grid. In general, we do not necessarily know which subdomain should be responsible for updating each boundary point. In finite difference methods, however, it is easy to determine which subdomains should provide values in which boundary points. We could then, when we pack the initial list, also pack the number of the receiving subdomain together with each boundary point. Then, when the neighbours check each boundary point and extract the appropriate points, they can check for a matching subdomain number in order to determine if the point is meant for this subdomain, instead of checking whether each point is inside the grid, which would be a more time-consuming process. The send_coords arrays are used by each subdomain to extract the values to send to each neighbour.

The send_coords arrays are used by each subdomain to extract the values to send to each neighbour. In general, we cannot expect the coordinates, which identify grid points on other subdomains, to exactly match any grid points on this subdomain; hence, in order to find the appropriate values, the subdomain must interpolate between the nearest points on this subdomain. This interpolation is somewhat expensive, so to avoid performing it at every time step, it may be done once, and information that eases the later extraction of values may be stored. In finite difference methods, the grids are uniform, so the coordinates of a boundary point on one grid will always coincide with a grid point on the neighbouring grid. In this case, the interpolation is fast. However, since the grids match exactly, there is no reason why we could not perform the interpolation once, find the appropriate indices for each grid point, and use them when gathering the values for communication later in the program, since using the indices directly is faster than interpolating. The indices of the grid points can be stored in the send_indices arrays, as these arrays are only in use temporarily during the initial set-up procedure. The send_indices array then takes over the job of send_coords: at every time step, one uses the send_indices array to extract the values to be sent to each neighbour, which in turn uses its recv_indices array when placing the received values into its local data structure.
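To make the per-time-step exchange concrete, the following is a minimal sketch in plain MPI, not the report's Diffpack code. The function name exchangeBoundaryValues is hypothetical, the field is simplified to a flat array of values, and the report's own tools hide the MPI calls behind the Points classes; only the role of the send_indices and recv_indices arrays matches the description above.

// Sketch (assumptions noted above): each neighbour k has precomputed index
// arrays send_indices[k] and recv_indices[k] referring to entries of a flat
// local field array, set up once during the initial handshake.
#include <mpi.h>
#include <cstddef>
#include <vector>

void exchangeBoundaryValues(std::vector<double>&                 field,
                            const std::vector<std::vector<int>>& send_indices,
                            const std::vector<std::vector<int>>& recv_indices,
                            const std::vector<int>&              neighbours,
                            MPI_Comm comm)
{
  const int n = static_cast<int>(neighbours.size());
  std::vector<std::vector<double>> send_buf(n), recv_buf(n);
  std::vector<MPI_Request> requests;

  // post the receives first; no identification travels with the values,
  // since recv_indices is ordered exactly like send_indices on the neighbour
  for (int k = 0; k < n; ++k) {
    recv_buf[k].resize(recv_indices[k].size());
    requests.push_back(MPI_REQUEST_NULL);
    MPI_Irecv(recv_buf[k].data(), (int)recv_buf[k].size(), MPI_DOUBLE,
              neighbours[k], 0, comm, &requests.back());
  }

  // pack and send the values requested by each neighbour
  for (int k = 0; k < n; ++k) {
    for (int idx : send_indices[k])
      send_buf[k].push_back(field[idx]);
    requests.push_back(MPI_REQUEST_NULL);
    MPI_Isend(send_buf[k].data(), (int)send_buf[k].size(), MPI_DOUBLE,
              neighbours[k], 0, comm, &requests.back());
  }

  MPI_Waitall((int)requests.size(), requests.data(), MPI_STATUSES_IGNORE);

  // unpack: recv_indices tells us where each received value belongs locally
  for (int k = 0; k < n; ++k)
    for (std::size_t i = 0; i < recv_indices[k].size(); ++i)
      field[recv_indices[k][i]] = recv_buf[k][i];
}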

6 The parallel shallow water equation solvers

In this section we shall see how the tools from sections 4.3-5.2 can be reused and extended to solve the shallow water wave equations from section 3.3. The wave problem is significantly more difficult to handle since it involves staggered grids, which in practice means that we must operate with three different grids in the parallel communication. The subdomain grids do not need to overlap with the neighbouring subdomains in all space directions. A low-level C or Fortran program using MPI needs to face all the details of half indices, different grid sizes for different unknowns, more complicated communication of boundary values for three unknowns, etc. In principle this is straightforward, but in practice small errors in index handling and MPI calls, combined with the lack of suitable debugging tools, tend to make the development of parallel code very time consuming. However, with proper software tools it is easy and safe to develop parallel programs involving staggered grids at the level of the algorithmic abstractions. The parallel wave simulator is based on the sequential solver class Wave2D:

#ifndef Wave2D_h_IS_INCLUDED
#define Wave2D_h_IS_INCLUDED
// the three include file names were lost in the text extraction; they are
// presumably the Diffpack headers declaring GridLatticeC, FieldFD and TimePrm
#include <GridLatticeC.h>
#include <FieldFD.h>
#include <TimePrm.h>

class Wave2D
{
protected:
  Handle(GridLatticeC) grid;        // lattice grid, here 2D grid
  Handle(FieldFD) u;                // fields to store the u, v and eta values
  Handle(FieldFD) u_prev;           // on this and previous timesteps
  Handle(FieldFD) v;
  Handle(FieldFD) v_prev;
  Handle(FieldFD) eta;
  Handle(FieldFD) eta_prev;
  Handle(TimePrm) tip;              // hold time information, like tstop and dt

  real ui0, uin, uj0, ujn;          // start/stop indices: loops over grid points;
  real vi0, vin, vj0, vjn;          // real is used instead of int to match the
  real ei0, ein, ej0, ejn;          // half indices in the numerical schemes
  real mu, nu;                      // help variables for increased efficiency

  real initialField(real x, real y);   // initial surface elevation

  virtual void setIC ();
  virtual void timeLoop ();
  void computeETAinnerPoints();
  void computeUVinnerPoints();
  void computeETAboundaryPoints();
  void computeUVboundaryPoints();
  void updateDataStructuresETA();
  void updateDataStructuresUV();

public:
  Wave2D ();
 ~Wave2D () {}
  virtual void scan(GridLatticeC* grid_ = NULL, TimePrm* tip_ = NULL);
  void solveProblem ();             // solve the problem
  virtual void resultReport();
};
#endif
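For orientation, a minimal driver for this sequential solver could look as sketched below. This is not code from the report: the header name is presumed, and a real Diffpack application would normally also initialize Diffpack's run-time and menu system, which is omitted here.

// Hypothetical driver (a sketch only), using the public interface declared above.
#include <Wave2D.h>   // presumed header name

int main(int /*argc*/, const char** /*argv*/)
{
  Wave2D simulator;
  simulator.scan();          // read grid and time parameters (defaults here)
  simulator.solveProblem();  // run the time loop updating eta, u and v
  simulator.resultReport();  // report the computed fields
  return 0;
}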

The class bears some resemblance to Heat2D, although it is somewhat more extensive than the heat class, since the problem has three unknowns, η, u and v, and a more complicated numerical scheme. From the scheme in example 3 we see that η depends on u and v, and u and v depend on η, but the velocity components do not depend on each other. The order in which the fields are computed is therefore of great importance, and since we wish to separate the calculations of the inner points from those of the boundary points, we must also separate the calculations of η from those of u and v.

Creating a subdomain solver from the sequential base class is more complicated for this problem than it was for the heat conduction problem, to a large extent because of the staggered grid used in the calculations. In order to get the grid sizes right, we use the Diffpack classes for staggered C-grids, FieldFD and GridLatticeC. The GridLatticeC class is a subclass of GridLattice and is used in much the same way. The main feature of this class is that it can dimension the array of a field correctly, based on whether the field is a scalar (η) or a vector component (u or v).

When constructing the local grids on each subdomain, we can reuse the initializing function from the manager in the heat conduction problem with minor modifications. In the heat problem, the grids were divided such that the total number of grid points was almost the same on each subdomain, giving all the processes almost the same amount of work; this is called load balance. This is the goal in the staggered grid computations as well, so we only need to modify the procedure to take the size difference of the staggered grids into account on the last process in each direction of the process grid.

As soon as the subdomain solver has received the local grid information for the three grids from the manager, each grid must be bound to the right field. When a staggered grid is bound to a field, the field will change the dimensions of the grid, making it the right size for the field it represents. However, when working on a parallel computer with global and local grids, only the global staggered grid should be redimensioned in this way. When these grids are divided into subdomain grids, the local grids already have the right size, and we need to use a constructor in FieldFD which allows us to tell the field that the supplied grid already has the right size and that its size should not be recalculated, as would have been appropriate for an ordinary staggered grid.

There is more than one way to divide the staggered grids into subdomain grids. The most intuitive is to have an overlap of one grid cell in each direction, which turns the staggered grid in figure 9 into the grid in figure 10.
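As a small illustration of the load-balancing idea mentioned above (not code from the report; the function name and the flat division strategy are assumptions), the number of grid points assigned to each process along one space direction can be computed as follows:

#include <vector>

// Split n_global grid points as evenly as possible among P processes along
// one space direction; any remainder is spread over the first few processes.
// (A sketch only; the report's manager also adjusts the last process in each
// direction for the size differences of the staggered grids.)
std::vector<int> pointsPerProcess(int n_global, int P)
{
  std::vector<int> n_local(P, n_global / P);
  for (int p = 0; p < n_global % P; ++p)
    n_local[p] += 1;
  return n_local;
}

// Example: 103 points on 4 processes gives 26, 26, 26 and 25 points.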

Figure 9: A staggered C-grid

When studying the algorithm for the subproblem in section 3.3, we see that we need to communicate values of u in the x direction and values of v in the y direction. No η values need to be communicated. We see from figure 10 that all the local grids look like an ordinary staggered grid. However, we should also note that the overlap of one grid cell means that we are updating some of the values in this grid more than once: the η values are duplicated on all boundaries, u is duplicated on all horizontal boundaries, and v is duplicated on all vertical boundaries.

Figure 10: The staggered grid in figure 9 divided onto a 2 × 2 process grid with one grid cell overlap

To avoid the unnecessary computations, we could divide the grid in a more efficient way that still allows the algorithm in section 3.3 to be used. In this division (figure 11), the staggered grid in figure 9 has first been divided without overlap. The overlap has then been added on the appropriate boundaries, i.e. in the x direction for the u grid and in the y direction for the v grid. Now all the appropriate points will be updated, and they will only be updated once. However, the staggered grid makes things somewhat more complicated, because to calculate the leftmost points in the u grid and the lowest points in the v grid we need η values that are located on a different subdomain, i.e. on a different processor. As the η grids should not overlap, we introduce a ghost boundary, which is an extra boundary of points around the grid that is not part of the actual grid. This is a feature that the FieldFD class in Diffpack has been extended to cover.

Figure 11: An alternative division of the staggered grid in figure 9 (the extra η points form the ghost boundary)

The ghost boundary does not affect the grid object bound to the field, but the corresponding array is allocated larger than what the grid object implies (for example, we may have an extra grid point outside all boundaries). With this functionality we can access values at index i_{e,1} - 1 even if the grid tells us that the base index is i_{e,1}. This makes it possible to keep using the computational procedures from the sequential problem, even though we communicate values of the η field as well.

In the first case we communicate u to the left and to the right, while v is communicated up and down, and quite an extensive amount of unnecessary computation is done. In the second case, we need to communicate u to the left, v downwards, and η upwards and to the right, but no extra calculations are done. As the communication in these two cases requires approximately the same amount of work, the program is based on the second version, which is thought to be slightly more efficient than the first.

As in the heat conduction program, we wish to take advantage of the ability to compute a large number of new values while waiting for the boundary values from the other processes. Each point in the numerical computations depends only on values at neighbouring grid points. Since η depends on the velocity components at neighbouring points, we calculate the inner points of η while communicating u and v, before the boundary points of η are calculated.


Equivalently, the inner points of the velocity components are calculated while the η boundary points are communicated between the subdomains, before the velocity boundary points are updated.

Our software tools allow the more complicated communication algorithm needed for the shallow water equations to be expressed directly by high-level statements in the code. In the heat program, all the subdomains communicated with all their neighbours, allowing each subdomain solver to have a single neighbour list containing the identification numbers of all the neighbours it should send values to and receive values from. In order to communicate only in the necessary directions, we must allow each field to have its own lists, one for the neighbours to send to and one for the neighbours to receive from. By indicating in which directions communication should take place, the user receives the correct neighbour lists from the parallel toolbox without any further action, making communication in a limited number of directions no more complicated than communicating with all the neighbours. The resulting subdomain solver class then becomes

#ifndef Wave2Ds_h_IS_INCLUDED
#define Wave2Ds_h_IS_INCLUDED
// the two include file names were lost in the text extraction; they are
// presumably the headers for the Wave2D base class and the PointsFD tools
#include <Wave2D.h>
#include <PointsFD.h>

class ManagerWave2D;

class Wave2Ds: public Wave2D
{
  friend class ManagerWave2D;

  Handle(GridLatticeC) Ugrid;       // lattice grid, here 2D grid
  Handle(GridLatticeC) Vgrid;       // lattice grid, here 2D grid
  Handle(GridLatticeC) Egrid;       // lattice grid, here 2D grid

  // Parallel stuff
  Ptv(int) my_coords;               // my processor coordinates
  Ptv(int) num_solvers;             // number of processes in each space direction
  int s;                            // my subdomain no. is s
  ManagerWave2D* boss;              // my manager keeps track of global info

  VecSimple(int) eta_send_neigh;    // identify my neighbours
  VecSimple(int) eta_recv_neigh;
  VecSimple(int) u_send_neigh;
  VecSimple(int) u_recv_neigh;
  VecSimple(int) v_send_neigh;
  VecSimple(int) v_recv_neigh;

  PointsFD u_bc_points;             // boundary points for communication
  PointsFD v_bc_points;
  PointsFD eta_bc_points;

  void initBC ();
  void setBC1eta ();
  void setBC1uv ();
  void setBC2eta ();
  void setBC2uv ();
  void solveAtThisTimestep ();

public:
  Wave2Ds (ManagerWave2D* boss);
 ~Wave2Ds () {}
  void scan();
  void resultReport();
};
#endif
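The overlap of communication and computation described above can be sketched as follows with non-blocking MPI requests. This is not the report's Diffpack code: the compute* names mirror the Wave2D declarations, while the Exchange type and the startExchange/finishExchange helpers are hypothetical stand-ins for the Points tools, here reduced to stubs so that the ordering of the steps is the only point of the example.

// Sketch of the update ordering in one parallel time step (assumptions above).
#include <mpi.h>
#include <vector>

struct Exchange {                        // handles for one in-flight exchange
  std::vector<MPI_Request> requests;
};

// start sending/receiving boundary values of a field (stub: no neighbours)
Exchange startExchange(MPI_Comm) { return Exchange{}; }

// wait until the exchange started above has completed
void finishExchange(Exchange& e) {
  if (!e.requests.empty())
    MPI_Waitall((int)e.requests.size(), e.requests.data(),
                MPI_STATUSES_IGNORE);
}

// the numerical updates; in the report these are Wave2D member functions
void computeETAinnerPoints()    {}
void computeETAboundaryPoints() {}
void computeUVinnerPoints()     {}
void computeUVboundaryPoints()  {}

void solveAtThisTimestep(MPI_Comm comm)
{
  // 1. exchange u and v overlap values while updating the inner eta points,
  //    then complete the eta boundary points
  Exchange uv = startExchange(comm);
  computeETAinnerPoints();
  finishExchange(uv);
  computeETAboundaryPoints();

  // 2. exchange eta overlap (and ghost boundary) values while updating the
  //    inner velocity points, then complete the velocity boundary points
  Exchange eta = startExchange(comm);
  computeUVinnerPoints();
  finishExchange(eta);
  computeUVboundaryPoints();
}

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  solveAtThisTimestep(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}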

7 Increasing the efficiency

In some simulations, a Cartesian process topology does not reflect the numerical problem appropriately, and a more general topology would be more suitable. MPI allows you to define your own general topology, also referred to as a graph topology. The graph topology allows you to specify which processes should have which other processes as neighbours, opening channels between these neighbours. MPI allows you to send messages between any pair of processes in a group, even if they are not specified as neighbours, but the topology then gives no convenient way of naming the communication pathway, as it does for neighbours. Any topology can be covered by a graph topology, and the Cartesian topology is just a special case of the graph topology in which the graph structure is regular. As the Cartesian topology is quite common in many applications, and setting it up using the graph tools would be unnecessarily inconvenient for the user, this case is addressed directly through the Cartesian topology tools in MPI.

The graph topology would, for instance, sometimes be useful in wave simulations over large ocean areas, because some areas of the grid may represent land, where no calculations should be done. Using a Cartesian topology might then result in a situation where some of the processes hardly do any work, as illustrated in figure 12, which is a very poor utilization of the machine capacity at hand. If we set up a graph topology instead, we might divide the workload equally among the processes, leaving out the areas in which no calculations should be done, as shown in figure 13.

When using a Cartesian topology in the programs earlier in this report, we concluded that in finite difference methods it is simple to determine which boundary points are sent to which neighbour. However, when using a graph topology, it is no longer evident which values go where, even though the numerical method at hand is still a finite difference method. Applying the general approach presented in this report, allowing each subdomain to extract the appropriate grid points based on whether the coordinates are inside its grid, makes setting up the communication system in the program quite simple and requires no extra effort from the programmer.
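For concreteness, the two MPI topology mechanisms mentioned above can be set up as sketched below. The process counts, dimensions and neighbour lists are made-up example values, not taken from the report; only the MPI calls themselves (MPI_Cart_create, MPI_Graph_create and related query functions) are standard.

// Sketch: creating a Cartesian and a graph process topology with MPI.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (size == 4) {   // the example values below assume exactly 4 processes
    // Cartesian topology: a 2 x 2 non-periodic process grid
    int dims[2]    = {2, 2};
    int periods[2] = {0, 0};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /*reorder=*/1, &cart);
    int left, right;
    MPI_Cart_shift(cart, /*direction=*/0, /*disp=*/1, &left, &right);
    std::printf("rank %d: cart neighbours %d and %d\n", rank, left, right);

    // Graph topology: the same 2 x 2 structure written as an explicit graph;
    // an irregular domain (e.g. land areas left out) is specified the same way
    int index[4] = {2, 4, 6, 8};                 // cumulative neighbour counts
    int edges[8] = {1, 2, 0, 3, 0, 3, 1, 2};     // flattened neighbour lists
    MPI_Comm graph;
    MPI_Graph_create(MPI_COMM_WORLD, 4, index, edges, /*reorder=*/1, &graph);

    int grank, nneigh;
    MPI_Comm_rank(graph, &grank);                // rank may change if reordered
    MPI_Graph_neighbors_count(graph, grank, &nneigh);
    std::printf("graph rank %d: %d neighbours\n", grank, nneigh);

    MPI_Comm_free(&graph);
    MPI_Comm_free(&cart);
  }
  MPI_Finalize();
  return 0;
}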


Figure 12: Cartesian process topology applied on a numerical grid

Figure 13: Graph topology applied on a numerical grid

8 Summary

In this report we have seen how one can utilize sequential programs when building a parallel program, allowing the numerical methods to be tested thoroughly on a sequential machine before they are used in parallel, and thereby hopefully reducing the number of errors in the parallel program. The tools presented offer users without much experience in concurrent programming a simple pattern for running their sequential programs on parallel machines without first spending much time learning MPI.
