Domain Decomposition Preconditioning for Parallel PDE ... - CiteSeerX

Domain Decomposition Preconditioning for Parallel PDE Software Peter K. Jimack School of Computing University of Leeds, LS2 9JT, UK.

Abstract Domain decomposition methods have been applied to the solution of engineering problems for many years. Over the past two decades however the growth in the use of parallel computing platforms has ensured that interest in these methods, which offer the possibility of parallelism in a very natural manner, has become greater than ever. This interest has led to research that has yielded significant advances in both the theoretical understanding of the underlying mathematical structure behind domain decomposition methods and in the variety of domain decomposition algorithms that are available for use by the engineering community. In this paper we provide a brief overview of some of the main categories of domain decomposition algorithm and then focus on a particular variant of the overlapping Schwarz algorithm that is based upon the use of a hierarchy of finite element grids. Throughout the paper we consider domain decomposition methods as preconditioners for standard, Krylov subspace, iterative solvers however they may also be used directly as iterative methods in their own right. All of the theoretical results that are described apply equally in both cases. Keywords: partial differential equations, domain decomposition, parallel computing

1 Introduction The majority of this paper considers the parallel finite element (FE) solution of the following two-dimensional variational problem, which is derived from a second order self-adjoint partial differential equation (PDE). All of the results and algorithms described can be generalized to three-dimensional problems however, and in section 4 practical extensions to a class of non-self-adjoint problems are also considered. 1

Problem 1.1 Find u 2 HE1 ( ) such that

A(u; v) = F (v); 8v 2 H01( ) ;

(1)

where 2 0 such that: for all v 2 V there are vi 2 Vi such that v = pi=0 vi and p X

=0

A (v ; v ) C A(v; v) ; i

i

(33)

i

i

then the spectral condition number of M 1 K is given by

(M 1 K ) n C ;

(34)

where n is the minimum number of colours required to colour the subdomains i in such a way that no neighbours are the same colour. Furthermore the following result, which is also proved in [47] and [53], demonstrates that the condition (33) is satisfied for a constant C that is independent of h, H and p provided that the overlap between the subdomains i is always proportional to the size of elements in the coarse triangulation. This is often referred to as a “generous” overlap, and the preconditioner is said to be “optimal” in this case. Theorem 2.2 Provided the overlap between the subdomains is of size O (H ), where H represents the mesh size of T 0 , then there exists C > 0, which P is independent of h, H and p, such that for any v 2 V there are v 2 V such that v = =0 v and i

p

i

p X

=0

i

A (v ; v ) C A(v; v) : i

i

i

i

i

(35)

i

The requirement for a generous overlap between subdomains in order to obtain an optimal preconditioner is extremely demanding. In two dimensions, as T is refined (assuming uniform global refinement for simplicity), the number of elements of T in the overlap regions is O (h 2 ), and in three dimensions it is O (h 3 ). In practice therefore this requirement is usually dropped in one of two different ways. The simplest option is to stick to a fixed number of layers of overlapping elements of the triangulation T as it is refined. This means that the overlap region decreases as h ! 0 and so the condition number of the preconditioned system grows. Fortunately however this growth is typically quite slow and so the additional iterations required by the 12

PCG solver are justified by the reduced cost of applying the preconditioner at each iteration (since there are less unknowns associated with each subdomain problem in comparison to the optimal preconditioner with an O (H ) overlap), [47]. Nevertheless, the preconditioner is no longer optimal. The other common approach taken to avoid the need for a generous overlap is to make use of a sequence of nested grids (as opposed to just a coarse grid and a fine grid). The coarse grid problem in (31) is then replaced by the solution of a problem on a grid which is only one level of refinement less than the fine grid problem. This problem is then solved using a two level approximation based upon the grid that is two levels of refinement less than the fine grid problem. Repetition of this two level approach at repeatedly coarser levels leads to a “multilevel” Schwarz algorithm that is rather more complex than the two level algorithm given by (31) but which yields an optimal preconditioner with only a minimal amount of overlap. Again see, for example, [47] for further details. The main drawback with the multilevel AS algorithm is the need for a series of projections from one grid to the next at each preconditioning step. On a regular sequence of grids obtained through uniform refinement this is relatively straightforward to manage in parallel, however on a sequence of grids generated with local, rather than global, mesh refinement this can become a complex programming task. Throughout the rest of this paper therefore we describe an alternative to the multilevel approach that is a two level AS algorithm based upon the use of a hierarchical sequence of grids. As such, this may be considered to be a cross between the classical two level AS algorithm with a generous overlap and the classical multilevel AS algorithm.

3 A Hierarchical Two Level Schwarz Algorithm This section introduces an alternative two level AS algorithm based upon the use of a “weakly overlapping” hierarchy of nested grids, as described in [5]. An overview of the parallel implementation is also included.

3.1 A Weakly Overlapping Additive Schwarz Preconditioner As before, let T 0 be a coarse triangulation of and suppose that T 0 consists of N0 (0) triangular elements, j , each of size O (H ). Now assume that T 0 is partitioned into p non-overlapping subdomains i such that:

=

i

\

p [

=1

;

(36)

i

i

= (i 6= j ) ; [ (0)

= where I j

i

j

2i

i

j

f1; :::; N0g (I 6= ) :

I

Here the overbar is used to denote the closure of a set of points. 13

i

(37) (38)

Now permit T 0 to be refined several times, to produce a family of triangulations, T 0 ; :::; T J , where each triangulation, T k , consists of Nk elements, j(k) , such that

=

Nk [ (k) j

=1

j

T

and

k

k : = f ( ) g =1 k

j

N

(39)

j

The successive mesh refinements that define this sequence of triangulations need not be global and may be non-conforming, however they must satisfy a number of conditions, as in [8] for example: 1.

2T

k

+1 implies that either

2 T , or (b) has been generated as a refinement of an element of T into four similar (a)

k

k

children,

2. the level of any triangles which share a common point can differ by at most one, 3. only triangles at level k may be refined in the transition from T k to T k+1 . (Here the level of a triangle is defined to be the least value of k for which that triangle is an element of T k .) In addition to the above it is also necessary that: 4. in the final mesh, T J , all pairs of triangles on either side of the boundary of each subdomain i have the same level as each other. Having defined a decomposition of into subdomains and a nested sequence of triangulations of , [5] next defines the restrictions of each of these triangulations onto each subdomain by

i;k = fj(k) : j(k) i g ; (40) and, in order to introduce a certain amount of overlap between neighbouring subdomains,

~ i;k = fj(k) : j(k) has a common point with i g : (41) Following this, finite element spaces associated with these local triangulations are introduced. Let G be some triangulation and denote by S (G) the space of continuous piecewise linear functions on G. Then the following definitions can now be made:

V = S (T ) (42) 0 V0 = S (T ) (43) V = S ( ) (44) V~ = S ( ~ ) (45) ~ ~ V = V 0 + ::: + V : (46) Note that the spaces V and V0 are the same as those defined in the previous section but that the spaces V , for i = 1; :::; p are different (since they now correspond to an J

i;k

i;k

i;k

i;k

i

i;

i

14

i;J

overlap of just one element at each level of the mesh hierarchy). Nevertheless, it is still evident that V = V0 + V1 + ::: + Vp : (47) P

This means that for all v 2 V there are vi 2 Vi such that v = pi=0 vi , and this is the decomposition of V that [5] proposes for the alternative two level AS preconditioner of the form (31). In [5] the following theorem is proved for problems in both two and three dimensions. When combined with Theorem 2.1, this shows that the preconditioner (31), with the rectangular matrices now representing projections from V to the new spaces Vi, is optimal. Theorem 3.1 Let V , V0 and Vi (for i = 1; :::; p) be defined by (42), (43) and (46) respectively. Then there exists C > 0, which isPindependent of h, H and p, such that for any v 2 V there are vi 2 Vi such that v = pi=0 vi and p X

=0

A (v ; v ) C A(v; v) : i

i

i

(48)

i

3.2 Implementation In order to implement the above AS preconditioner in parallel, [5] combines the coarse grid solve associated with K0 1 in (31) with each of the subdomain solves. This is achieved by assigning a copy of the entire coarse mesh, T 0 , to each processor but ~ i;k 1 at step k of the refinement only allowing processor i to refine this mesh inside

k 1 k process (i.e. from T to T ). The continuous piecewise linear FE spaces on the resulting meshes are then given by

V~

i

= V0 [ V

i

(49)

for i = 1; :::; p. Corresponding meshes (with p = 3) are illustrated for a simple 2-d example in Figure 3. Note that in this figure there are a number of “slave” nodes in each processor’s mesh which cause these meshes to be non-conforming. The solution values at these nodes are not free: they are determined by the nodal values at the ends of the edges on which the slave nodes lie. For a practical implementation it turns out to be simpler to allow the solution values at these nodes to be free by performing an interior refinement of those elements on the unrefined sides of the edges that have “hanging” nodes on them. In the 2-d example of Figure 3 this simply involves bisecting all triangular elements containing a hanging node and for 3-d problems a similar intermediate refinement strategy may be used (see [48] for example). Having obtained a mesh on each processor it is possible for processor i to assemble the stiffness matrix, Ki , for its own mesh independently of the other processors. These 15

Figure 3: An example of the meshes produced when mesh containing 24 elements.

J =2 J =3 J =4 J =5 J =6

p = 3 and J = 2 for a coarse

p = 2 p = 4 p = 8 p = 16 6 6 6 6 6

9 8 8 7 7

12 12 12 11 11

14 13 13 12 12

Table 1: The number of PCG iterations required to reduce the residual by a factor of 106 when solving Poisson’s equation on a coarse mesh of 256 elements (taken from [5]). are the matrices that are used in a modified version of the preconditioner (31) which takes into account the merging of V0 with Vi :

M

1=

p X

=1

R~ K 1 R~ : T

i

i

i

(50)

i

~ i is the rectangular matrix that represents the projection from V to V~i . In (50) each R Note that the application of these projections (and the corresponding prolongations R~ iT ) requires interprocessor communications but that the subdomain solves (corresponding to Ki 1 ) may be undertaken independently and concurrently. A number of results are presented in [5] however Table 1 shows a sample of these for solving Poisson’s equation on a square domain in two dimensions on between 2 and 16 subdomains, using between 2 and 6 levels of uniform refinement. As predicted by the theory, the number of iterations taken appears to be bounded independently of h and p for a fixed choice of H .

4 Extensions In this section two possible extensions of the work of [5], outlined in the previous section, are considered. The first of these is based upon an observation made by Cai 16

and Sarkis in [13] concerning traditional AS preconditioners of the form (31). The second extension is to a larger class of problem than those given by Problem 1.1. In addition to the straightforward generalization to three dimensional problems (which is included in [5]), it is possible to apply the preconditioning techniques reviewed in this paper to non-self-adjoint problems. This is described for convection-diffusion equations, based upon the work appearing in [4, 30, 31, 40, 41].

4.1 A Restricted Version of the Preconditioner In [13] an AS preconditioner of the form (31) is considered. It is noted that for each subdomain i, for i = 1; :::; p, the operations Ri and RiT each require a certain amount of computational work in order to restrict vectors between T and Ti , and prolongate vectors from Ti to T , respectively. Furthermore, in a parallel implementation on p processors, each of these operations requires interprocessor communication between neighbouring processors. In [13] therefore, a restricted AS preconditioner is proposed ^ iT say, for i = 1; :::; p. in which RiT is replaced by a different prolongation matrix, R This matrix represents a simpler prolongation to T from those nodes of Ti in the region of i that does not overlap with neighbours, with overlapping nodal values set to zero. Hence no interprocessor communication is required for the prolongation steps and the total communication cost of each preconditioning step is halved. Interestingly, [13] reports that, with this restricted AS preconditioner, iteration counts are actually slightly lower than those obtained using the full AS preconditioner (31). It should be noted however that this new preconditioner is not symmetric positive-definite (SPD) and so the PCG algorithm can no longer be used. In [4] and [41] results are presented (in two and three dimensions respectively) using a GMRES solver (see, for example, [3, 23, 42]) with a preconditioner based upon a restricted version of the new preconditioner (50): X R^~ K 1 R~ : M 1= p

T

=1

i

i

i

(51)

i

~ i and Ki are as in (50) and R^~ iT is a rectangular prolongation matrix that maps Here R 2