Implementing overlapping domain decomposition methods on a virtual parallel machine

David Darjany¹*, Burkhard Englert²**, and Eun Heui Kim¹***

¹ California State University Long Beach, Dept. of Mathematics and Statistics, Long Beach, CA 90840. Email: [email protected], [email protected]
² California State University Long Beach, Dept. of Comp. Engr. & Comp. Science, Long Beach, CA 90840. Email: [email protected]

* Research supported by the Dept. of Energy under grant DE-FG02-03ER25571.
** Research supported by the Dept. of Transportation under METRANS USC-111699.
*** Research supported by the Dept. of Energy under grant DE-FG02-03ER25571.

Abstract. Domain decomposition techniques were developed to solve many types of partial differential equations. Such algorithms are generally very well suited for implementation on a virtual parallel machine, simulated on a distributed system. While such algorithms are readily available and well established in the literature, authors do not usually concern themselves with questions of the practical implementability of their algorithms. In particular, issues such as finding the optimal size of the overlap in domain decompositions, finding the most effective number of subdomains, or deciding whether to use exact or inexact subdomain solvers are beyond the scope of these results. In this paper we address these questions. We first develop suitable domain decomposition algorithms for our virtual parallel machine. Through numerical experiments using our algorithms we then show that smaller linear systems converge well even without any overlap, while larger systems require an overlap of at least 10% of the subdomain size to converge. The data also indicate that the optimal overlap size is between 20% and 35% of the subdomain. We next increase the number of subdomains and analyze its effect on the parallel solver. Our data show that for a sufficiently large linear system the computational speed of convergence improves significantly as the number of subdomains increases. We finally compare the effectiveness of exact and inexact subdomain solvers and show that, with an appropriate choice of the maximum number of iterations in the worker algorithm, the inexact solver is much more efficient than the exact solver.

1 Introduction

Domain decomposition techniques have been developed in recent years for solving partial differential equations of various types including elliptic, parabolic and

mixed type equations. In particular, for elliptic problems various domain decomposition algorithms are well developed; see for example the survey paper by Chan and Mathew [6] and the references therein for details. While these algorithms are readily available, the literature is little concerned with their practical implementability on parallel clusters. In particular, issues such as finding the optimal size of the overlap in domain decompositions, finding the most effective number of subdomains, or deciding whether to use exact or inexact subdomain solvers are not addressed in these papers. In this manuscript we study these more practical aspects of domain decomposition techniques. Evaluating the impact of the underlying hardware, however, is beyond the scope of this paper.

1.1 Domain Decomposition

In general, domain decomposition methods can be classified either as overlapping subdomain algorithms or as nonoverlapping subdomain algorithms, based on a decomposition of the domain into a number of overlapping or nonoverlapping subregions, respectively. A comparison of some of the overlapping and nonoverlapping algorithms can be found in [4]. As discussed in [6], the overlapping method is generally easier to implement (the nonoverlapping method has so-called interface problems), it is easier to achieve an optimal convergence rate, and the method is often more robust. However, extra work is needed in the overlapping regions. Furthermore, the overlapping method is not suitable for discontinuous differential operators because of the discontinuities on the overlapped regions. The results in [2, 5] show that the overlapping and nonoverlapping algorithms are essentially related under a specific interface preconditioner.

In this manuscript we focus on the overlapping method, and we briefly discuss the classical Schwarz alternating algorithms [4, 6–10, 13] based on overlapping subregions for a Poisson equation with a Dirichlet boundary condition:

$$
Lu \equiv -\Delta u = f(x, y) \quad \text{in } \Omega, \qquad u = g \quad \text{on } \partial\Omega. \tag{1}
$$

The method also works for a general elliptic equation with a Neumann or mixed boundary condition. For simplicity we consider Ω covered by Ω_i, i = 1, 2, with Ω₁ ∩ Ω₂ ≠ ∅. With an initial guess u⁰, the Schwarz alternating algorithm constructs a sequence of iterates u^k, k = 1, 2, ..., where we first solve

$$
\begin{cases}
Lu_1^{k+1} = f & \text{in } \Omega_1,\\
u_1^{k+1} = u^k|_{\Gamma_1} & \text{on } \Gamma_1,\\
u_1^{k+1} = 0 & \text{on } \partial\Omega_1 \setminus \Gamma_1,
\end{cases}
\qquad \text{and} \qquad
\begin{cases}
Lu_2^{k+1} = f & \text{in } \Omega_2,\\
u_2^{k+1} = u_1^{k+1}|_{\Gamma_2} & \text{on } \Gamma_2,\\
u_2^{k+1} = 0 & \text{on } \partial\Omega_2 \setminus \Gamma_2,
\end{cases}
$$

where Γ₁ = ∂Ω₁ ∩ Ω and Γ₂ = ∂Ω₂ ∩ Ω. Then u^{k+1} is defined by

$$
u^{k+1}(x, y) =
\begin{cases}
u_2^{k+1}(x, y) & \text{if } (x, y) \in \Omega_2,\\
u_1^{k+1}(x, y) & \text{if } (x, y) \in \Omega \setminus \Omega_2.
\end{cases}
$$

Depending on the choices of Ω_i, i = 1, 2, it can be shown that

$$
\|u - u^k\| \le \rho^k \,\|u - u^0\|, \qquad \rho < 1.
$$
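To make the alternating iteration concrete, here is a minimal 1D Matlab illustration (our own sketch, not the authors' code) for −u″ = f on (0, 1) with u(0) = u(1) = 0 and two subdomains (0, x(ib)) and (x(ia), 1) overlapping on (x(ia), x(ib)):

```matlab
% 1D alternating Schwarz sketch for -u'' = f on (0,1), u(0) = u(1) = 0.
n = 101; x = linspace(0,1,n)'; h = x(2) - x(1);
f = ones(n,1);                   % sample right-hand side
u = zeros(n,1);                  % initial guess u^0
ia = 41; ib = 61;                % overlap occupies nodes ia..ib
for k = 1:25
    % Subdomain 1: interior nodes 2..ib-1; right boundary value u(ib) from u^k
    m = ib - 2;
    A = diag(2*ones(m,1)) - diag(ones(m-1,1),1) - diag(ones(m-1,1),-1);
    rhs = h^2*f(2:ib-1);
    rhs(end) = rhs(end) + u(ib);             % u(1) = 0 contributes nothing
    u(2:ib-1) = A \ rhs;
    % Subdomain 2: interior nodes ia+1..n-1; left boundary from the new u_1
    m = n - ia - 1;
    A = diag(2*ones(m,1)) - diag(ones(m-1,1),1) - diag(ones(m-1,1),-1);
    rhs = h^2*f(ia+1:n-1);
    rhs(1) = rhs(1) + u(ia);                 % u(n) = 0 contributes nothing
    u(ia+1:n-1) = A \ rhs;
end
```

Each solve uses the freshest values on the artificial boundaries Γ₁, Γ₂, exactly as in the update rule above; moving ia and ib closer together (a thinner overlap) visibly slows the decay of the error, consistent with the behavior of ρ discussed next.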

The constant ρ can be close to one if the overlapping region is thin; see for example [11]. Depending on the number of subdomains and the geometry of the domain, there are so-called "coloring methods" for implementing alternating methods, that is, methods that determine which subdomains get computed in what order; see [6] for details.

It is very well known that the Poisson equation arises in the study of stationary diffusions and waves, electric potential, steady fluid flow, Brownian motion, and so on. We discretize equation (1) by finite differences (the same holds for finite elements) to get a large sparse symmetric positive definite linear system:

$$
Au = b. \tag{2}
$$
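On a uniform mesh with h = ∆x = ∆y, each interior mesh point contributes one row of this system through the standard five-point stencil (standard material, included here for reference):

$$
\frac{4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}}{h^2} = f_{i,j},
$$

with the known boundary values of g folded into the right-hand side b.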

There are many iterative methods for solving linear systems, such as Jacobi, Gauss-Seidel, SOR, and so on. In this manuscript we use the Gauss-Seidel iterative method since it is easy to implement and sufficiently efficient. We note that for the Poisson equation (even for certain nonlinear equations) the domain can be scaled to be small. However, since our goal is to implement and test the efficiency of these algorithms for large linear systems (2) over a distributed system, we focus in this paper on solving (2) for a domain without any scaling.
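For reference, a generic Gauss-Seidel sweep for Au = b looks as follows; this is a minimal sketch (function and variable names are ours), not the paper's code:

```matlab
function [u, k] = gauss_seidel(A, b, u, maxit, tol)
% Gauss-Seidel iteration for Au = b with initial guess u.
n = length(b);
for k = 1:maxit
    uold = u;
    for i = 1:n
        % entries u(1:i-1) were already updated in this sweep
        u(i) = (b(i) - A(i,1:i-1)*u(1:i-1) - A(i,i+1:n)*u(i+1:n)) / A(i,i);
    end
    if max(abs(u - uold)) < tol
        return                   % converged to the requested tolerance
    end
end
end
```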

1.2 Motivation

We now discuss some practical issues in implementing the overlapping subdomain method using our new algorithms. One issue is how to choose the size of the overlapping region, that is, how to choose γ = dist(Γ₁, Γ₂)/2 > 0, where dist is the standard Euclidean distance. Here, for simplicity, we assume that the subdomains are evenly overlapped. It is pointed out in [3, 6] that there are convergence theories which show that the larger γ is, the fewer iterations are needed. However, the convergence theories do not answer the question of how to choose γ appropriately in practice. The next practical issue is choosing an optimal number of subdomains. This issue is often governed by the geometry of the domain and, naturally, by the number of clusters available for parallel computing. Also, in the early stages of the iterations, to save computational cost, inexact subdomain solvers can be used instead of finding exact solutions to the subdomain problems. Inexact subdomain solvers can be formulated in two ways. One way, when the differential operator has variable coefficients, is to approximate the subdomain problems by simpler forms, for example with constant (average-valued) coefficients. The other way, in solving the subdomain linear system of (2), is to use only a few Gauss-Seidel (or Jacobi, SOR, etc.) iterations or to replace the exact solve by some other inexact method (ILU, ILUT). In applications, however, this may result in a loss of convergence in the iterations.

In this manuscript we focus on experiments with overlapping subdomain methods, varying the size of γ and the number of subdomains. Furthermore, we compare the efficiency of inexact and exact subdomain solvers by varying the number of Gauss-Seidel iterations in the subdomain solvers. We note that instead of using alternating methods (alternating subdomains), we run the worker machines simultaneously. That is, with u^k, k = 0, 1, ..., on the domain Ω = {0 < x < a, 0 < y < b}, the master machine sends the corresponding boundary data for each and every subdomain, the worker machines then perform their computations simultaneously, and each sends the new data u^{k+1} back to the master machine. We repeat this until u^k converges. This simplifies the process of dividing the subdomains to alternate the computations; that is, we do not need the so-called coloring method to generate a sequential order in which to work on the subdomains one after the other. We discuss the algorithms in more detail in Section 2.

1.3 Our contributions

Our contributions are based on our domain decomposition algorithms for a virtual parallel machine and are threefold:

– Using our algorithms we investigate what happens if we increase the size of the overlap relative to the size of the domain. We show that the results depend on the size of the linear system (the domain). Our data show that smaller linear systems work well even without any overlap, while larger systems require an overlap of at least 10% of the subdomain size to converge. The data indicate that the optimal overlap size is between 20% and 35% of the subdomain. We provide numerical results in Section 3.
– We next increase the number of subdomains and analyze its effect on the parallel solver. Our data show that for a sufficiently large linear system, computational speed improves significantly as the number of subdomains increases.
– We finally compare the effectiveness of exact and inexact subdomain solvers and show that the inexact solver, with an appropriate choice of MS, the maximum number of iterations in the worker algorithm, is much more efficient than the exact solver. For a smaller linear system a relatively small value of MS is sufficient to improve efficiency, whereas for a larger system a bigger value of MS is needed. Our data show that as the size of the matrix increases by about 250% × 250%, MS must increase by about 1,000%.

2 Algorithms

In this section we describe the algorithms that we designed and used for the computations. Our virtual parallel machine is implemented over a distributed system. We designate one machine as the master and the others as workers. We present the algorithm for the master machine in Subsection 2.1 and for the worker machines in Subsection 2.2. Since implementations of virtual parallel machines are

readily available over the internet, many users will not want to rewrite this code. Hence we also decided to perform our experiments using such available code packages. We use Lucio Andrade's parmatlab v1.77 Beta parallel computing toolbox [1] for distributed computing on connected PCs. This Matlab toolbox distributes processes to workers available over the intranet/internet. Each worker must be running a Matlab daemon to be accessed. The toolbox can operate in two possible modes:

[MPMD mode] Multiple Program-Multiple Data parallel model; the user has control to send different Matlab tasks to remote machines simultaneously and retrieve the results later.
[SPMD mode] Single Program-Multiple Data parallel model; parallelization and management of remote workers is done automatically. Input data must be regularly ordered in Matlab hyperblocks.

Our algorithm operates in MPMD mode. Using this readily available toolbox has several advantages [1]:

1. No common file system is needed, since all communication between tasks goes through TCP/IP connections.
2. The parallel virtual machine does not need to know which workers are available; it will listen until workers report ready. New workers can be added even after the process has started.
3. Parallelization can be done over different dimensions (up to 5) at the same time, using contiguous, overlapping, or constant hyperblocks. Indexes can also vary for different input variables; the only restriction is that the total number of parallel elements must be the same.
4. The toolbox uses an improved version of the TCP/IP TOOLBOX 1.2.3 by Peter Rydesater [12]. In particular, serialization of data is achieved with a low-level MEX file, reducing the impact on computational efficiency. Deserialization of data back into Matlab variables is also done with a MEX file.

2.1 Master Algorithm

INPUT: size of the main domain (0 ≤ x ≤ a, 0 ≤ y ≤ b); number of subdomains N; size of the overlap (in # of mesh points); tolerance ε; maximum number of iterations for the master algorithm M; maximum number of iterations for the worker algorithm MS; function f(x, y) (the nonhomogeneous term in the Poisson equation (1)); boundary function g(x, y). In addition, the exact solution for comparison purposes, if known.
OUTPUT: the approximate solution to (1), the total number of master iterations, the execution time, and the error from the exact solution (if supplied); or a message notifying that the maximum number of iterations has been reached without convergence.

Step 1: Determine the sizes of the vectors based on the domain definition (number of mesh points for interior domains, boundary domains, and the full domain).
Step 2: Initialize g and w, the approximate boundary and function, respectively (we used the initialization w = 0).
Step 3: Set up the position vector containing the bottom left corner of each domain (this is sent to the worker machine and then back, so we know which domain we are receiving back).
Step 4: Initialize the parallel session, using Lucio Andrade's parmatlab v1.77 Beta parallel computing toolbox [1].
Step 5: WHILE (the counter < M) do Steps 6–9.
Step 6: Send each domain to a parallel worker machine. This is split into 3 steps: first send the leftmost domain, then the middle domains, and lastly the rightmost domain. The worker machines run the Poisson algorithm described in Subsection 2.2.
Step 7: Receive each domain from the parallel worker machines. They are received in random order and reset to their proper locations in the w-vector based on the returned start position of each particular domain.
Step 8: Check one column of a domain against its matching values from another domain (what is checked is the leftmost column of the overlap of each region). If the difference between the two columns is less than ε (we take ε = 10⁻²) for every overlap region, then mesh the results together, using only one subdomain's data for each overlap, graph the result, close the parallel session, and STOP.
Step 9: Else, replace w by the newly computed solution, increment the counter, and go to Step 6.
Step 10: Close the parallel session. STOP.
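The overlap comparison in Step 8 might look as follows; this is our own reconstruction under assumed names (W{d} for worker d's returned grid, ov for the overlap width in mesh points, eps0 for ε), not the authors' code:

```matlab
% Compare the leftmost overlap column as computed by two adjacent workers.
converged = true;
for d = 1:N-1
    nd = size(W{d}, 2);
    colFromLeft  = W{d}(:, nd - ov + 1);   % leftmost overlap column, worker d
    colFromRight = W{d+1}(:, 1);           % same physical column, worker d+1
    if max(abs(colFromLeft - colFromRight)) >= eps0
        converged = false;
        break                              % keep iterating (Step 9)
    end
end
```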

2.2 Worker Algorithm (Poisson Algorithm)

We implement the Gauss-Seidel iterative method and test both an exact solver, which simply runs the Gauss-Seidel iterations to convergence, and an inexact solver, which must terminate after MS, a relatively small number of iterations, even without convergence. In the algorithm there is no difference between the exact and the inexact solvers except the obvious one: MS is large for the exact solver and small for the inexact one.

INPUT: the most current approximate solution of (1), Wtemp; the start position startPos of where this data sits in the x-y plane (lower left boundary); the nonhomogeneous function f(x, y); the most current boundary data g corresponding to Wtemp; and a vector holding (i) the mesh size ∆x = ∆y, (ii) the number of y steps the data encompasses, (iii) the number of x steps the data encompasses, and (iv) the maximum number of iterations for the procedure to be performed, MS.
OUTPUT: the approximate solution W; startPos, which is the same as the input value; and a Boolean sol recording whether or not the algorithm converged within the maximum number of iterations it was allowed to perform.

Step 1: Set up local variables from the passed vectors.
Step 2: Set up the vector b, the RHS of Ax = b, from the finite difference method, using the boundary data g, the nonhomogeneous function f, and ∆x = ∆y.
Step 3: WHILE ( (I) the counter < MS and (II) the error between the most recent and the previous approximation > ε ) do Gauss-Seidel iterations with the initial approximation Wtemp.
Step 4: Send the approximate solution W, startPos, and sol to the master machine. STOP.
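A minimal sketch of such a worker solver, written directly on the five-point stencil rather than on an assembled matrix A (an equivalent formulation; all names are ours), shows how MS alone separates the exact from the inexact variant:

```matlab
function [W, sol] = worker_poisson(W, f, h, MS, tol)
% Gauss-Seidel sweeps for -Delta u = f on one subdomain. The boundary
% rows/columns of W already hold the data g sent by the master; f is
% sampled on the same mesh as W.
[ny, nx] = size(W);
sol = false;                               % convergence flag returned to master
for it = 1:MS
    diffmax = 0;
    for j = 2:ny-1                         % in-place sweep = Gauss-Seidel
        for i = 2:nx-1
            Wnew = 0.25*(W(j-1,i) + W(j+1,i) + W(j,i-1) + W(j,i+1) + h^2*f(j,i));
            diffmax = max(diffmax, abs(Wnew - W(j,i)));
            W(j,i) = Wnew;
        end
    end
    if diffmax < tol                       % exact solver: MS large, exits here;
        sol = true;                        % inexact solver: small MS caps the work
        return
    end
end
end
```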

3 Numerical Experiments

For the implementation we use f = 2 sin(πy/b) + (π/b)² x(a − x) sin(πy/b) and g = 0 on the boundary, so that the exact solution of the Poisson equation (1) becomes u_exact = x(a − x) sin(πy/b) on Ω = {0 < x < a, 0 < y < b}. We use PCs with Intel Pentium 4 processors (1.5 GHz), 256 MB of RAM, and 40 GB hard drives, connected via a local area network. The operating system is Windows XP. The code is written in Matlab version 6. In all of our experiments we use the mesh size ∆x = ∆y = 10⁻¹ and thus the tolerance ε = 10⁻².
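This pair is easy to verify (a routine check we add for the reader): with u = x(a − x) sin(πy/b),

$$
u_{xx} = -2\sin(\pi y/b), \qquad u_{yy} = -(\pi/b)^2\, x(a - x)\sin(\pi y/b),
$$

so −∆u = 2 sin(πy/b) + (π/b)² x(a − x) sin(πy/b) = f, and u vanishes on all four sides of Ω, matching g = 0.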

3.1 Varying Size of Overlap

By subdomain size we mean the following. For a given domain Ω we evenly subdivide Ω into N nonoverlapping subdomains, say Ω′_i, such that $\overline{\Omega} = \bigcup_{i=1}^{N} \overline{\Omega_i'}$, where $\overline{\Omega}$ denotes the closure of Ω. We let the size of Ω′_i be the subdomain size. We then extend each Ω′_i to obtain an overlapping decomposition Ω_i of Ω, on which the worker machines perform their computations.
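As a concrete sketch of this construction for vertical strips of the rectangle (our own illustration; a, N, and gamma as in Section 1.2):

```matlab
% Build overlapping strips Omega_i from an even split of 0 < x < a into N
% strips, each extended by gamma on both sides and clipped to [0, a], so that
% adjacent strips overlap by 2*gamma = dist(Gamma_i, Gamma_{i+1}).
w = a / N;                              % nonoverlapping strip width
xl = zeros(N,1); xr = zeros(N,1);
for i = 1:N
    xl(i) = max(0, (i-1)*w - gamma);    % extended left edge of Omega_i
    xr(i) = min(a, i*w + gamma);        % extended right edge of Omega_i
end
```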

Case 1A: Domain Ω = {0 < x < 3, 0 < y < 1}; the number of subdomains is 3. The exact solver method is used for the subdomain solvers. The subdomain size is 1. In this case we increase the size of the overlap from 0 to 0.5, that is, the number of overlapping mesh points from 1 to 11. The number of iterations in the master machine decreases as the size of the overlap increases. However, when the overlap size gets bigger than 0.2 (20%), the execution time jumps from 12.7 s to 18.3 s. In fact, this case shows that the fastest execution time is obtained when the size of the overlap is 0.2 (which, however, is very close to the case when the size of the overlap is 0), and it gets slower beyond 0.3. We think this is because beyond the overlap 0.3 the workers are solving unnecessarily large linear systems, which becomes inefficient. This suggests that the optimal size of the overlap is within 20% of the subdomain size.

Table 1. Case 1A
# of mesh points (size of overlap) | # iterations in Master | Execution time (s) | max |u − u_exact|
 1 (0)   | 6 | 12.8 | 6.256 × 10⁻³
 3 (0.1) | 5 | 13.8 | 1.332 × 10⁻³
 5 (0.2) | 4 | 12.7 | 1.512 × 10⁻³
 7 (0.3) | 4 | 18.3 | 1.656 × 10⁻³
 9 (0.4) | 3 | 19.1 | 1.636 × 10⁻³
11 (0.5) | 3 | 19.0 | 1.668 × 10⁻³

Table 2. Case 1B
# of mesh points (size of overlap) | # iterations in Master | Execution time (s) | max |u − u_exact|
 1 (0)   | 12 | 73.5 | 2.412 × 10⁻¹
 5 (0.2) |  8 | 60.1 | 1.237 × 10⁻³
 9 (0.4) |  6 | 60.9 | 1.531 × 10⁻³
11 (0.5) |  5 | 59.1 | 1.532 × 10⁻³
13 (0.6) |  5 | 61.8 | 1.596 × 10⁻³
15 (0.7) |  4 | 56.1 | 1.537 × 10⁻³
17 (0.8) |  4 | 63.3 | 1.588 × 10⁻³

Case 1B: Domain Ω = {0 < x < 6, 0 < y < 2}; the number of subdomains is 3. The exact solver method is used for the subdomain solvers. In this case the subdomain size is 4. This case shows that for overlap sizes bigger than 0.7 there is no improvement in either the number of master iterations or the execution time. Only the first case (no overlap) does not converge, and the fastest execution time is obtained when the overlap size is 0.7, which is 35% of the subdomain size.

3.2 Varying Number of Domains

Case 2A: Domain Ω = {0 < x < 6, 0 < y < 2}, with an overlap of 15 mesh points. The exact subdomain solver is used. Note that "Ave. # of worker iter's" means the average number of iterations in the worker machines on the first (largest) master iteration. Case 2A shows no improvement in the execution time as the number of subdomains increases. In fact the data show that one subdomain, that is, no distributed computing, is the fastest. This happens because the machines spend most of their time communicating with each other, and thus we conclude that the size of the linear system (the domain) is not sufficiently large to see any improvement from distributing the algorithm.

Table 3. Case 2A
# of subdomains | # iterations in Master | Execution time (s) | max |u − u_exact| | Ave. # of worker iter's
1 | 1 | 43.9 | 1.618 × 10⁻³ | 690
2 | 4 | 54.6 | 1.526 × 10⁻³ | 598
3 | 4 | 55.9 | 1.537 × 10⁻³ | 530
4 | 5 | 47.9 | 1.559 × 10⁻³ | 484
5 | 5 | 46.7 | 1.487 × 10⁻³ | 456

Table 4. Case 2B
# of subdomains | # iterations in Master | Execution time (s) | max |u − u_exact| | Total iterations
1 |  1 | 1572.2 | 1.397 × 10⁻³ |  2758
2 | 11 | 1007.7 | 8.719 × 10⁻⁴ |  5500
3 | 11 |  866.9 | 6.990 × 10⁻⁴ |  8250
4 | 12 |  413.1 | 9.750 × 10⁻⁴ | 12000
5 | 12 |  359.8 | 7.622 × 10⁻⁴ | 15000

Case 2B: Domain Ω = {0 < x < 10, 0 < y < 4}, with an overlap of 21 mesh points. The inexact subdomain solver (250 iterations per worker machine) is used. This case shows that the execution time decreases significantly as the number of subdomains (worker machines) increases. Unfortunately we only had 5 workers to test with, so we do not have any more data. It would be interesting to see how the execution time changes as the number of workers increases further.

3.3 Inexact vs. Exact Subdomain Solver

Case 3A: Domain Ω = {0 < x < 6, 0 < y < 2}, with an overlap of 15 mesh points. The number of subdomains is 3.

Table 5. Case 3A: Inexact subdomain solver
MS: # of iter. in worker | # iter. in Master | Exec. time (s) | max |u − u_ex| | Total iter's
 3 | 64 | 66.3 | 2.695 × 10⁻¹ |  576
 4 | 55 | 61.2 | 2.062 × 10⁻¹ |  660
 5 | 48 | 53.6 | 1.717 × 10⁻¹ |  720
10 | 31 | 41.7 | 8.00 × 10⁻²  |  930
15 | 24 | 42.3 | 4.11 × 10⁻²  | 1080

In this case we obtained results by varying the maximum number of iterations MS in the inexact solver and compared them with the results of the exact solver. The fastest execution time for the inexact subdomain solver is obtained when the maximum number of iterations in the workers is set to MS = 10. Comparing this with the execution time of the exact solver, 52.4 s, we see that the inexact solver is faster when the number of iterations in the workers is set to 10.

Table 6. Case 3A: Exact subdomain solver
# of iter. in Master | # iter. in worker1 | # iter. in w2 | # iter. in w3 | max |u − u_ex|
1       |  499 |  593 |  497 | 2.5425
2       |  441 |  484 |  428 | 1.975 × 10⁻¹
3       |  303 |  358 |  290 | 1.100 × 10⁻²
4       |  195 |  200 |  183 | 1.540 × 10⁻²
Totals: | 1438 | 1635 | 1398 |
Total iter. 4471; Exec. time 52.4 s

Case 3B: Domain Ω = {0 < x < 10, 0 < y < 4}, with an overlap of 15 mesh points and four subdomains. Without the distributed algorithm, using 2552 iterations, we obtain an execution time of 907 s with max |u − u_exact| = 8.480 × 10⁻³. In this case the fastest execution time for the inexact solver occurs when MS is 100, at 270.3 s, while the exact solver takes 686.9 s to execute. Notice that the domain is large enough to see the effect of the distributed algorithm (compare the execution time for the single domain, that is, no workers: 907.0 s).

Table 7. Case 3B: Inexact subdomain solver
MS: # of iter. in worker | # iterations in Master | Execution time (s) | max |u − u_exact| | Total iterations
  3 | 226 | 1117.2 | 1.0211       |  2712
  5 | 170 |  835.4 | 6.782 × 10⁻¹ |  3400
 10 | 110 |  652.8 | 3.495 × 10⁻¹ |  4400
 15 |  84 |  527.3 | 2.203 × 10⁻¹ |  5040
 25 |  58 |  395.3 | 1.270 × 10⁻¹ |  5800
 50 |  35 |  342.7 | 5.300 × 10⁻² |  7000
100 |  21 |  270.3 | 2.100 × 10⁻² |  8400
250 |  12 |  376   | 2.400 × 10⁻² | 12000

For both Case 3A and Case 3B, in fact for all of our data, we use the stopping criterion max |u^{k−1} − u^k| < ε = 10⁻² to terminate the algorithm. Thus the result turns out to be much closer to the exact solution (of order 10⁻³ in Table 8), whereas the difference from the previous approximation is still of order 10⁻².

4 Conclusions

Our data show that the size of the linear system (the domain) is the key factor in determining the size of the overlap, the number of workers, and the number of iterations MS in the inexact solver. In fact, for smaller linear systems, relatively small values suffice in all three cases for efficient convergence, whereas for larger systems relatively large values are needed. Our experiments provide guidance on how, specifically, the size of the overlap, the number of workers, and the number of iterations MS in the inexact solver can be chosen for different sizes of linear systems for efficient distributed computing.

Table 8. Case 3B: Exact subdomain solver
# of iter. in Master | # iter. in w1 | # iter. in w2 | # iter. in w3 | # iter. in w4 | max |u − u_ex|
1       | 1153 | 1532 | 1531 | 1151 | 1.370 × 10
2       | 1107 | 1399 | 1393 | 1091 | 4.899
3       |  944 | 1243 | 1240 |  928 | 1.7819
4       |  830 | 1065 | 1059 |  815 | 6.331 × 10⁻¹
5       |  682 |  903 |  899 |  667 | 2.2237 × 10⁻¹
6       |  560 |  730 |  725 |  545 | 7.460 × 10⁻²
7       |  422 |  567 |  563 |  407 | 2.130 × 10⁻²
8       |  302 |  405 |  401 |  288 | 2.400 × 10⁻³
9       |  189 |  260 |  256 |  179 | 5.900 × 10⁻³
Totals: | 6189 | 8104 | 8067 | 6071 |
Total iter. 23064; Exec. time 686.9 s

References

1. L. Andrade, parmatlab, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectType=file&objectId=217.
2. P.E. Bjørstad and O.B. Widlund, To overlap or not to overlap: A note on a domain decomposition method for elliptic problems, SIAM J. Numer. Anal. 23(6) (1989), 1093–1120.
3. X.-C. Cai, A family of overlapping Schwarz algorithms for nonsymmetric and indefinite elliptic problems, in Domain-Based Parallelism and Problem Decomposition Methods in Computational Science and Engineering, SIAM (1995), 1–19.
4. X.-C. Cai, W.D. Gropp and D.E. Keyes, A comparison of some domain decomposition and ILU preconditioned iterative methods for nonsymmetric elliptic problems, Lin. Alg. Applics. (1994).
5. T.F. Chan and D. Goovaerts, On the relationship between overlapping and nonoverlapping domain decomposition methods, SIAM J. Matrix Anal. Appl. 13(2) (1992), 663.
6. T.F. Chan and T.P. Mathew, Domain decomposition algorithms, Acta Numerica (1994), 61–143.
7. M. Dryja and O. Widlund, An additive variant of the Schwarz alternating method for the case of many subregions, Tech. Rep. 339, Courant Inst., New York Univ., 1987.
8. W.D. Gropp and D.E. Keyes, Domain decomposition on parallel computers, Impact of Computing in Science and Engineering 1 (1989), 421–439.
9. D.E. Keyes, Domain decomposition: a bridge between nature and parallel computers, NASA ICASE Technical Report 92-44, NASA Langley Research Center, Hampton, VA 23681-0001, 1992.
10. D.E. Keyes and W.D. Gropp, A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation, SIAM J. Sci. Statist. Comput. 8 (1987), 166–202.
11. P.L. Lions, On the Schwarz alternating method II: stochastic interpretation and order properties, in Domain Decomposition Methods, SIAM (1989), 47–70.
12. P. Rydesater, TCP/IP toolbox, http://petrydpc.itm.mh.se/tools.
13. B.F. Smith, P.E. Bjørstad and W.D. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic PDEs, Cambridge University Press, 1996.
