PIPELINE IMPLEMENTATIONS OF DIRICHLET–NEUMANN AND NEUMANN–NEUMANN WAVEFORM RELAXATION METHODS

arXiv:1605.08503v1 [math.NA] 27 May 2016

BANKIM C. MANDAL∗ AND BENJAMIN W. ONG†

Abstract. The work of this article is motivated by the increasing demand for high-resolution simulation of complex systems, as well as the increasing availability of multiprocessor supercomputers with thousands of cores that help to reduce the computational cost significantly. One attractive way of speeding up the computation is to parallelize the solution process using domain decomposition (DD) methods, or to reformulate existing methods in a manner that allows for many concurrent operations. In this paper, we reformulate two recently developed DD methods, namely the Dirichlet–Neumann and Neumann–Neumann Waveform Relaxation (DNWR and NNWR) methods, so that successive waveform iterates can be computed in a parallel pipeline fashion after an initial start-up cost. We illustrate the performance of the pipeline implementation for parabolic problems with numerical results, and study the communication costs for various implementations.

Key words. Dirichlet–Neumann, Neumann–Neumann, Waveform Relaxation, Parallel Computing, Domain Decomposition, Space–time Parallel Method

AMS subject classifications. 65M55, 65Y05, 65M20

∗Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931 ([email protected])
†Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931 ([email protected])

1. Introduction. Dirichlet–Neumann waveform relaxation (DNWR) and Neumann–Neumann waveform relaxation (NNWR) methods [8, 9, 16, 12, 7] have been formulated and analyzed recently for parabolic and hyperbolic partial differential equations (PDEs). These iterative methods are based on non-overlapping domain decomposition in space, where the iterations require subdomain solves with Dirichlet or Neumann boundary conditions. The superlinear convergence behavior of both the DNWR and NNWR methods has previously been shown for the heat equation [8, 9]; finite step convergence to the exact solution for the wave equation has also been shown [7, 17]. Both the DNWR and NNWR methods belong to the family of waveform relaxation (WR) methods. In the classical formulation, one solves the space-time subproblem over the entire time horizon at each iteration, before communicating interface data across subdomains. WR methods have their origin in the work of Picard and Lindelöf [21, 14] in the 19th century; waveform relaxation was introduced as a parallel approach for solving systems of ODEs in [13]. Since then, the WR framework has been combined with numerical approaches for solving elliptic problems to tackle time-dependent parabolic PDEs [10, 11]. New transmission conditions have also been developed, known as optimized Schwarz WR (OSWR) methods, in order to achieve convergence with non-overlapping domains, or, in general, faster convergence for overlapping domains; OSWR approaches for parabolic problems [5] and for hyperbolic problems [6] have both been explored.

This paper discusses a pipeline implementation of the DNWR and NNWR algorithms [15], which are extensions of the Dirichlet–Neumann [1, 3, 19, 18] and Neumann–Neumann [2, 22, 23] algorithms to solve space–time PDEs. The pipeline implementation of the DNWR and NNWR algorithms subdivides the entire time domain into smaller time blocks. The space–time subproblem is solved on each time block, and updated transmission conditions are transmitted before advancing to the next time block. This enables many concurrent subdomain solves on different space–time blocks, resulting in efficient parallelization on modern supercomputers. A pipeline implementation was previously introduced for classical Schwarz waveform relaxation methods [20].

Fig. 2.1: Decomposition of an R^1 domain into non-overlapping subdomains Ω_1, . . . , Ω_N with interfaces x_1, . . . , x_{N−1}.

The theoretical presentation of the algorithms is for the time-dependent PDE

    ∂u/∂t − Lu = f(x, t),    x ∈ Ω,  t ∈ [0, T],
    u(x, 0) = u_0(x),        x ∈ Ω,                                   (1.1)
    u(x, t) = g(x, t),       x ∈ ∂Ω,  t ∈ [0, T],

where L is a spatial operator, Ω ⊂ R^d is a bounded domain, and ∂Ω is its smooth boundary. This paper is broken into two main sections. In Section 2, we review the NNWR method for solving equation (1.1) before presenting the pipeline approach and numerical studies. The DNWR algorithm is outlined in Section 3, along with various pipeline approaches and numerical results.

2. Neumann–Neumann Waveform Relaxation Method. The NNWR algorithm for the model problem (1.1) in the multi-subdomain setting was previously proposed and analyzed in [8]. In this method, a spatial domain Ω is first partitioned into non-overlapping subdomains {Ω_i, 1 ≤ i ≤ N}. Figure 2.1 illustrates a simple 1D decomposition, although more complicated 2D or 3D decompositions (with cross points) are also possible. For i = 1, . . . , N set Γ_i := ∂Ω_i \ ∂Ω, Λ_i := {j ∈ {1, . . . , N} : Γ_i ∩ Γ_j has nonzero measure} and Γ_{ij} := ∂Ω_i ∩ ∂Ω_j, so that the interface of Ω_i can be rewritten as Γ_i = ∪_{j∈Λ_i} Γ_{ij}. We denote by n_{ij} the unit outward normal for Ω_i on the interface Γ_{ij}. The NNWR algorithm starts with an initial guess w_{ij}^{[0]}(x, t) along the interfaces Γ_{ij} × (0, T), j ∈ Λ_i, i = 1, . . . , N, and then performs a two-step iteration for each waveform iterate k. This two-step iteration consists of first solving a “Dirichlet subproblem” on each space-time subdomain,

    ∂_t u_i^{[k]} − L u_i^{[k]} = f,        in Ω_i,
    u_i^{[k]}(x, 0) = u_0(x),               in Ω_i,
    u_i^{[k]} = g,                          on ∂Ω_i \ Γ_i,                    (2.1)
    u_i^{[k]} = w_{ij}^{[k−1]},             on Γ_{ij},  j ∈ Λ_i,


followed by solving an auxiliary “Neumann subproblem” [k]

[k]

∂t ψi − Lψi [k] ψi (x, 0) [k] ψi [k] ∂nij ψi

= = = =

0, 0, 0, [k] [k] ∂nij ui + ∂nji uj ,

in Ωi , in Ωi , on ∂Ωi \ Γi , on Γij , j ∈ Λi .

The iteration ends with an update of the Dirichlet traces,   [k] [k] [k−1] [k] wij (x, t) = wij (x, t) − θ ψi Γij ×(0,T ) + ψj Γij ×(0,T ) ,

(2.2)

(2.3)

where θ ∈ (0, 1] is a relaxation parameter. Although the NNWR algorithm can be formulated on complicated subdomains in higher dimensions, we present algorithms and results for strip-like subdomains in R^1 (Figure 2.1). Suppose Ω := (0, L) is decomposed into Ω_i := (x_{i−1}, x_i), i = 1, . . . , N. Denote the initial guess (iterate 0) as {w_i^{[0]}(t)}_{i=1}^{N−1} on the interfaces x_i. For i = 1, . . . , N, the equations (2.1) simplify to

    ∂_t u_i^{[k]} − L u_i^{[k]} = f,                          (x, t) ∈ Ω_i × [0, T],
    u_i^{[k]}(x, 0) = u_0(x),                                 x ∈ Ω_i,
    u_i^{[k]}(x_{i−1}, t) = { g(x_{i−1}, t),        if i = 1,
                            { w_{i−1}^{[k−1]}(t),   for i > 1,                        (2.4)
    u_i^{[k]}(x_i, t)     = { w_i^{[k−1]}(t),       for i < N,
                            { g(x_i, t),            if i = N,

and the auxiliary equations (2.2) simplify to

    ∂_t ψ_i^{[k]} − L ψ_i^{[k]} = 0,                          (x, t) ∈ Ω_i × [0, T],
    ψ_i^{[k]}(x, 0) = 0,                                      x ∈ Ω_i,
    { ψ_1^{[k]}(x_0, t) = 0,                                                              if i = 1,
    { −∂_x ψ_i^{[k]}(x_{i−1}, t) = (∂_x u_{i−1}^{[k]} − ∂_x u_i^{[k]})(x_{i−1}, t),       for i > 1,        (2.5)
    { ∂_x ψ_i^{[k]}(x_i, t) = (∂_x u_i^{[k]} − ∂_x u_{i+1}^{[k]})(x_i, t),                for i < N,
    { ψ_N^{[k]}(x_N, t) = 0,                                                              if i = N.

The update of the Dirichlet traces, equation (2.3), simplifies to

    w_i^{[k]}(t) = w_i^{[k−1]}(t) − θ ( ψ_i^{[k]}(x_i, t) + ψ_{i+1}^{[k]}(x_i, t) ).        (2.6)

Remark 2.1. The auxiliary equations are only solved if the waveform iterates have not converged. The convergence of the NNWR algorithm has been shown in [8]. Specifically, for θ = 1/4, the NNWR algorithm (2.4)–(2.5)–(2.6) converges superlinearly with the estimate

    max_{1≤i≤N−1} ‖w_i^{[k]}‖_{L∞(0,T)} ≤ ( √6 / (1 − e^{−(2k+1) h_min^2 / T}) )^{2k} e^{−k^2 h_min^2 / T} max_{1≤i≤N−1} ‖w_i^{[0]}‖_{L∞(0,T)},

where h_min = min{x_i − x_{i−1} : i = 1, . . . , N} is the minimum subdomain width.
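To make the discrete interface operations concrete, the following minimal Python sketch shows the two quantities exchanged across one interface in an NNWR iteration: the jump in Neumann data that forms the interface condition in (2.5), and the Dirichlet-trace update (2.6). The grid sizes, the first-order one-sided differences used for the normal derivatives, and all array names are illustrative choices of ours, not prescribed by the paper.

    import numpy as np

    nt, dx, theta = 100, 1e-2, 0.25

    # u_left[m, :] and u_right[m, :] hold the subdomain solutions of the Dirichlet
    # problems (2.4) at time step m; the last column of u_left and the first column
    # of u_right both live on the shared interface x_i.
    u_left = np.random.rand(nt, 64)      # placeholder data, for illustration only
    u_right = np.random.rand(nt, 64)

    # Jump in the Neumann data at the interface, cf. the interface condition in (2.5):
    # (d/dx u_i - d/dx u_{i+1})(x_i, t_m), approximated with one-sided differences.
    jump = (u_left[:, -1] - u_left[:, -2]) / dx - (u_right[:, 1] - u_right[:, 0]) / dx

    # After the auxiliary solves, psi_left[m] and psi_right[m] are the interface traces
    # of psi_i and psi_{i+1}; the Dirichlet trace is relaxed as in (2.6) with theta = 1/4.
    psi_left = np.zeros(nt)              # placeholders for the auxiliary traces
    psi_right = np.zeros(nt)
    w_old = np.zeros(nt)                 # previous Dirichlet trace w_i^{[k-1]}(t_m)
    w_new = w_old - theta * (psi_left + psi_right)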


2.1. Classical NNWR implementation. A pseudo-code for a classical implementation of the NNWR in R^1 is given in Algorithm 1. If the domain is broken into N non-overlapping subdomains, this implementation uses a total of N processing cores to approximate the solution. Either 2(N − 1)(2K − 1) or 4(N − 1)K messages are needed (depending on when the stopping criterion is satisfied), each transmitting Nt words, where K is the number of waveform iterates computed and Nt is the number of time steps. In this simplified pseudo-code, it is assumed that identical time discretizations are taken in each subdomain. A generalization where each subdomain might take different discretizations is possible, provided an interpolation of the Neumann or Dirichlet traces is computed before lines 10 and 16 of Algorithm 1.

     1: MPI Initialization;
     2: for i = 1 to N do in parallel
     3:     for m = 1 to Nt do
     4:         Guess w_i^{[0]}(t_m) on Γ_i for i = 1, . . . , (N − 1);
     5:     Set k = 1;
     6:     while not converged do
     7:         for m = 1 to Nt do
     8:             Solve equation (2.4) for u_i^{[k]}(x, t_m);
     9:         Compute jump in Neumann data along interfaces;
    10:         Transmit/Receive Neumann data;
    11:         Check for convergence;
    12:         if not converged then
    13:             for m = 1 to Nt do
    14:                 Solve equation (2.5) for ψ_i^{[k]}(x, t_m);
    15:             Update Dirichlet traces, w_i^{[k]}(t_m), equation (2.6);
    16:             Transmit/Receive Dirichlet data;
    17:             Check for convergence;
    18:         k ← k + 1;

Algorithm 1: The classical implementation of the NNWR algorithm is able to utilize N computing cores if the domain is broken into N non-overlapping subdomains.
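The “Check for convergence” steps in Algorithm 1 require a global decision across all subdomains. The paper does not spell out the stopping criterion; a common choice, sketched below with mpi4py as an assumption of ours (not the authors' implementation), is to reduce the maximum change in the locally owned interface traces over all processes.

    from mpi4py import MPI
    import numpy as np

    def converged(w_new, w_old, tol=1e-8, comm=MPI.COMM_WORLD):
        """Global convergence test on the Dirichlet traces owned by this subdomain."""
        local_err = float(np.max(np.abs(w_new - w_old))) if w_new.size else 0.0
        global_err = comm.allreduce(local_err, op=MPI.MAX)   # same value on every rank
        return global_err < tol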

2.2. Pipeline NNWR implementation. A pipeline implementation of the NNWR algorithm allows for higher concurrency at the expense of increased communication: multiple waveform iterates are simultaneously evaluated after an initial start-up cost. The main idea is to decompose the time window into non-overlapping blocks, and to transmit messages to available processors after the evaluation of each time block, so that other processors can simultaneously compute solutions to the auxiliary equations or the next waveform iterate. We first illustrate this for a simple two-subdomain, two-time-block example. Let 0 = T_0 < T_1 < T_2 = T and let Ω_{ij} denote the space-time domain Ω_i × [T_{j−1}, T_j]. The pipeline implementation starts with solving the Dirichlet subproblem, equations (2.4), for u^{[1]} in Ω_{11} and Ω_{21} in parallel with two processors. As soon as this computation is completed, messages are sent to available processors. The algorithm then proceeds with solving the Dirichlet subproblems for u^{[1]} in Ω_{12} and Ω_{22}, as well as solving the auxiliary equation (2.5) for ψ^{[1]} in Ω_{11} and Ω_{21}, simultaneously with four processors. If the waveform iterates have not converged, messages are sent to the appropriate processors, and the algorithm proceeds with solving the Dirichlet subproblems for u^{[2]} in Ω_{11} and Ω_{21}, as well as solving the auxiliary equation (2.5) for ψ^{[1]} in Ω_{12} and Ω_{22}, simultaneously with four processors. A graphical illustration of the pipeline framework for this example is shown in Figure 2.2.

Fig. 2.2: A pipeline NNWR implementation for a two-domain, two time block example. (Schematic: on each subdomain Ω_i, the Dirichlet solves u^{[k]} ∈ Ω_{ij} and auxiliary solves ψ^{[k]} ∈ Ω_{ij} are staggered across the time blocks along the wall-time axis.)

A pseudo-code for a pipeline implementation of the NNWR in R^1 is given in Algorithm 2. If the domain is broken into N non-overlapping subdomains, this implementation is able to utilize up to min(NJ, 2NK) processing cores, where K is the number of waveform iterates computed and J is the number of time blocks. The number of messages increases, with up to 4J(N − 1)K messages needed in the algorithm. The size of each message decreases to Nt/J words (transmission of Dirichlet data) or 2Nt/J words (transmission of both Dirichlet and Neumann data).

Remark 2.2. Unlike the classical NNWR implementation, where one iterates until convergence, Algorithm 2 requires a specification of K, the number of waveform iterates to be computed. One can use a priori error estimates to pick K intelligently, but it is likely that there will either be wasted work (if convergence is obtained for k < K) or an unconverged solution if K is not large enough. All is not lost, however, if K is not large enough, since the computation can be restarted using the most accurate Dirichlet traces. Also, in support of dynamic resource allocation and fault resiliency, there is active development to allow MPI communicators to expand or shrink. However, there are no plans for adopting dynamic MPI communicators into the MPI standard in the near future.

Remark 2.3. The pipeline NNWR implementation has a start-up and shut-down phase before multiple waveform iterates can be computed in parallel, i.e., at the start and end of the computation, some processing cores sit idle. For K full waveform iterates and J time blocks, the peak theoretical parallel efficiency is

    { 2K / (2K + J − 1),   if 2K > J,
    { J / (2K + J − 1),    if 2K < J.                                   (2.7)
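A small helper evaluating (2.7), together with a few sample values for K = 4 (the value used in the first experiment of Section 2.3), is sketched below; the helper and the chosen values of J are our own illustration.

    def peak_efficiency(K, J):
        """Peak theoretical parallel efficiency (2.7) of the pipeline NNWR."""
        # the two branches coincide when 2K = J
        return 2 * K / (2 * K + J - 1) if 2 * K > J else J / (2 * K + J - 1)

    for J in (8, 32, 200):
        print(J, round(peak_efficiency(4, J), 3))   # -> 0.533, 0.821, 0.966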

    MPI Initialization;
    for i = 1 to N do in parallel
        for m = 1 to Nt do
            Guess w_i^{[0]}(t_m) on Γ_i for i = 1, . . . , N − 1;
        for k = 1 to K do in parallel
            for q = 1 to 2 do in parallel
                if q = 1 then                                  /* Dirichlet update */
                    for j = 1 to J do
                        if k > 1 then
                            receive updated Dirichlet data;
                        for m = 1 to Nt/J do
                            Solve equation (2.4) for u_i^{[k]}(x, t_m);
                        Compute jump in Neumann data along interfaces;
                        Send Neumann data;
                    Check for convergence. If converged, break;
                else                                           /* Auxiliary update */
                    for j = 1 to J do
                        Receive Neumann data;
                        for m = 1 to Nt/J do
                            Solve equation (2.5) for ψ_i^{[k]}(x, t_m);
                        Update Dirichlet traces, w_i^{[k]}(t_m), equation (2.6);
                        if k < K then
                            Send Dirichlet data;
                    Check for convergence. If converged, break;

Algorithm 2: The pipeline implementation of the NNWR algorithm is able to utilize min(NJ, 2NK) computing cores if the domain is broken into N non-overlapping subdomains.
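The wall-clock schedule implied by the dependencies in Algorithm 2 (and illustrated in Figure 2.2) can be written down explicitly: the Dirichlet solve of iterate k on time block j can start at step j + 2(k − 1), and the corresponding auxiliary solve one step later. The bookkeeping below (task tuples, step numbering) is our own illustration, not code from the paper.

    from collections import defaultdict

    def nnwr_schedule(N, K, J):
        """Map each wall-clock step to the (phase, subdomain, iterate, block) tasks it runs."""
        steps = defaultdict(list)
        for i in range(1, N + 1):            # subdomain
            for k in range(1, K + 1):        # waveform iterate
                for j in range(1, J + 1):    # time block
                    steps[j + 2 * (k - 1)].append(("u", i, k, j))        # Dirichlet solve
                    steps[j + 2 * (k - 1) + 1].append(("psi", i, k, j))  # auxiliary solve
        return steps

    sched = nnwr_schedule(N=2, K=2, J=2)     # the two-domain, two-block example of Fig. 2.2
    for s in sorted(sched):
        print(s, sched[s])
    # 2K + J - 1 = 5 wall-clock steps in total; the busiest steps run
    # min(N*J, 2*N*K) = 4 tasks at once, the core count behind the efficiency (2.7).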

2.3. Numerical Experiments. As mentioned in the introduction, the discretization and convergence of the NNWR algorithms have already been discussed and established for hyperbolic and parabolic PDEs [7, 8, 15]. In this section, we are interested in the efficacy of the pipeline implementation, where many waveform iterates can be concurrently computed. We compare the computational time for approximating solutions to u_t = u_xx, subject to the initial condition u(x, 0) = (x − 0.5)^2 − 0.25 and homogeneous Dirichlet boundary data. The computational domain Ω × [0, T] is [0, 1] × [0, 0.1]. A centered finite difference is used to approximate the spatial operator, and a backward Euler time integration is used. Each subdomain solves the linear system for each time advance by performing a forward and backward substitution using a pre-factored LU decomposition of the corresponding matrix.

The first numerical experiment validates the qualitative behavior of the theoretical peak efficiency, equation (2.7), as the number of time blocks, J, is varied. The number of full waveform iterates is fixed at K = 4, the number of subdomains is fixed at N = 6, and the spatial and temporal discretizations are fixed at Nx = 30000 and Nt = 8000. A total of 48 processing cores are used in this first experiment. Figure 2.3 displays the expected behavior: for a large number of time blocks, J, the algorithm is able to utilize all processors in a pipeline fashion for a larger percentage of the computation (at the expense of increased communication), in this case leading to higher efficiency. Here, an efficiency close to 1 means that the pipeline NNWR implementation is able to compute K full waveform iterates in almost the same wall time as the classical NNWR implementation using N processors to compute one partial (Dirichlet/Neumann) waveform iterate. The data used to generate Figure 2.3 is summarized in Table 2.1.
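The subdomain solver just described can be sketched in a few lines: centered differences in space, backward Euler in time, and an LU decomposition computed once so that each time step reduces to a forward and backward substitution. The problem sizes below are kept small for illustration; the experiments use much larger Nx and Nt, and each subdomain would assemble only its own portion of the grid.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    nx, nt = 200, 100                           # interior points and time steps (illustrative)
    dx, dt = 1.0 / (nx + 1), 0.1 / nt
    x = np.linspace(dx, 1.0 - dx, nx)

    # Backward Euler for u_t = u_xx:  (I - dt*D2) u^{m+1} = u^m  (+ boundary data).
    D2 = (np.diag(-2.0 * np.ones(nx)) + np.diag(np.ones(nx - 1), 1)
          + np.diag(np.ones(nx - 1), -1)) / dx**2
    A = np.eye(nx) - dt * D2
    lu, piv = lu_factor(A)                      # factor once per subdomain

    u = (x - 0.5) ** 2 - 0.25                   # initial condition from Section 2.3
    for m in range(nt):
        rhs = u.copy()                          # homogeneous Dirichlet data: no boundary terms
        u = lu_solve((lu, piv), rhs)            # forward/backward substitution per time step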

Fig. 2.3: Efficiency of the pipeline NNWR implementation as a function of the number of time blocks, J. For a large number of time blocks, J, the algorithm is able to utilize all processors in a pipeline fashion for a larger percentage of the computation, leading to higher efficiency. (Axes: efficiency from 0.8 to 0.95; number of time blocks J from 8 to 200.)

Table 2.1: Raw data used to compute the efficiency of the pipeline NNWR implementation as a function of the number of time blocks, J.

    # Time blocks        8        16       32       50       80       100      160      200
    Wall time in sec.    470.17   419.32   397.51   388.72   383.25   381.50   380.32   378.32

A weak scaling study is performed for the second numerical experiment. For a fixed spatial, temporal and time-block discretization, the number of processors is varied to compute a differing number of waveform iterates. Figure 2.4 shows that the pipeline NNWR implementation is able to scale weakly with reasonable efficiency. Again, an efficiency of 1 means that the pipeline NNWR implementation (using min(NJ, N(2K − 1)) cores) is able to compute K waveform iterates in the same wall time as the classical NNWR implementation using N processors to compute one waveform iterate. The data used to generate Figure 2.4 is summarized in Table 2.2.

Table 2.2: Waveform iterate and wall time chart for Figure 2.4.

    # Waveform iterates    1     8     16    32
    Wall time in sec.      35    42    40    48.5
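As a rough consistency check of Table 2.2, if the K = 1 wall time is taken as the reference (an assumption on our part: the quoted efficiency is defined relative to one classical waveform iterate, which the K = 1 pipeline run approximates), the implied efficiencies are close to those plotted in Figure 2.4.

    wall = {1: 35.0, 8: 42.0, 16: 40.0, 32: 48.5}     # Table 2.2
    for K, t in wall.items():
        print(K, round(wall[1] / t, 2))               # -> 1.0, 0.83, 0.88, 0.72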

Fig. 2.4: Weak scaling for pipeline NNWR: efficiency vs. waveform iterates (numerical data and theoretical data). N = 16 subdomains, J = 500, (2K − 1)N processor cores, Nx = 32000, Nt = 4000. (Plotted efficiencies lie between roughly 0.7 and 1 for K = 1 to 32.)

3. Dirichlet–Neumann Waveform Relaxation Method. Our second method of interest is the DNWR method, which was first introduced and analyzed in [9]. Although there are several possible arrangements (in terms of placing Dirichlet and Neumann boundary conditions along the artificial boundaries), we focus on the particular arrangement proposed in [4]. The algorithm proceeds as follows: given initial guesses along the interfaces, we start an iteration with a Dirichlet subproblem solve (Dirichlet boundary conditions on both boundaries) in the middle subdomain, followed by several mixed Neumann–Dirichlet subproblem solves in the adjacent subdomains. Suppose a spatial domain Ω is partitioned into N non-overlapping subdomains Ω_i, i = 1, . . . , N, without any cross-points. Figure 3.1 illustrates a simple decomposition in R^1, although one can consider this algorithm in more complicated 2D domains. In Figure 3.1, the letters ‘D’ and ‘N’ in red denote the Dirichlet and Neumann boundary conditions imposed along the artificial interfaces. Let u_i^{[k]} denote the restriction of the solution u of (1.1) to Ω_i in the k-th waveform iterate. For i = 1, . . . , N − 1, set Γ_i := ∂Ω_i ∩ ∂Ω_{i+1}. We further define Γ_0 = Γ_N = ∅. n_{i,j} is defined to be the unit outward normal for Ω_i on the interface Γ_j, j = i − 1, i (for Ω_1, Ω_N we have only n_{1,2} and n_{N,N−1} respectively).

Fig. 3.1: Decomposition of an R^1 domain into many non-overlapping strip-like subdomains. (The labels ‘N’ and ‘D’ mark the Neumann and Dirichlet boundary conditions imposed along the artificial interfaces x_1, . . . , x_{N−1} on either side of the middle subdomain Ω_m.)

Given initial guesses g_i^{[0]}(x, t) along the interfaces Γ_i × (0, T), i = 1, . . . , N − 1, the DNWR algorithm consists of a “Dirichlet subproblem” solve in the middle subdomain Ω_m, with m = ⌈N/2⌉, for waveform iterate k,

    ∂_t u_m^{[k]} − L u_m^{[k]} = f,        in Ω_m,
    u_m^{[k]}(x, 0) = u_0(x),               in Ω_m,
    u_m^{[k]} = g,                          on ∂Ω ∩ ∂Ω_m,                     (3.1)
    u_m^{[k]} = g_i^{[k−1]},                on Γ_i,  i = m − 1, m,

followed by mixed “Dirichlet–Neumann subproblem” solves for m > i ≥ 1 and m + 1 ≤ j ≤ N,

    ∂_t u_i^{[k]} − L u_i^{[k]} = f,                           in Ω_i,
    u_i^{[k]}(x, 0) = u_0(x),                                  in Ω_i,
    u_i^{[k]} = \tilde{g}_i^{[k−1]},                           on ∂Ω_i \ Γ_i,
    ∂_{n_{i,i+1}} u_i^{[k]} = −∂_{n_{i+1,i}} u_{i+1}^{[k]},    on Γ_i,
                                                                              (3.2)
    ∂_t u_j^{[k]} − L u_j^{[k]} = f,                           in Ω_j,
    u_j^{[k]}(x, 0) = u_0(x),                                  in Ω_j,
    ∂_{n_{j,j−1}} u_j^{[k]} = −∂_{n_{j−1,j}} u_{j−1}^{[k]},    on Γ_{j−1},
    u_j^{[k]} = \check{g}_j^{[k−1]},                           on ∂Ω_j \ Γ_{j−1}.

The method ends with the update conditions along the interfaces,

    g_i^{[k]}(x, t) = θ u_i^{[k]}|_{Γ_i×(0,T)} + (1 − θ) g_i^{[k−1]}(x, t),        1 ≤ i < m,
                                                                                              (3.3)
    g_j^{[k]}(x, t) = θ u_{j+1}^{[k]}|_{Γ_j×(0,T)} + (1 − θ) g_j^{[k−1]}(x, t),    m ≤ j < N,

where θ ∈ (0, 1], and for i = 1, . . . , m − 1 and j = m + 1, . . . , N,

    \tilde{g}_i^{[k−1]} = { g,                 on ∂Ω ∩ ∂Ω_i,
                          { g_{i−1}^{[k−1]},   on Γ_{i−1},

    \check{g}_j^{[k−1]} = { g,                 on ∂Ω ∩ ∂Ω_j,
                          { g_j^{[k−1]},       on Γ_j.

Remark 3.1. Note that, in our setting, for an odd N the middle subdomain Ω_m is indeed the positional middle subdomain, which is not the case for an even N. However, the algorithm is still valid if we assign the ‘middle subdomain’ to be any one of the subdomains between 1 and N; the efficiency will not be the same in that case, see [9].

Similar to the NNWR in Section 2, we present algorithms and results for subdomains in R^1, where a spatial domain Ω := (0, L) is decomposed into non-overlapping subdomains Ω_i := (x_{i−1}, x_i), i = 1, . . . , N (Figure 3.1).

Remark 3.2. The convergence results for the DNWR algorithm (3.1)–(3.2)–(3.3) have been presented in [9] for the heat equation in R^1. For θ = 1/2, the DNWR algorithm (3.1)–(3.2)–(3.3) in Ω = (0, L) for N (> 2) subdomains of unequal sizes h_i, 1 ≤ i ≤ N, with m = ⌈N/2⌉, h_max := max_{1≤i≤N} h_i and h_min := min_{1≤i≤N} h_i, converges superlinearly with the estimate

    max_{1≤i≤N−1} ‖g_i^{[k]}‖_{L∞(0,T)} ≤ ( (N − 4 + 2 h_max / h_m) erfc( k h_min / (2√T) ) )^k max_{1≤i≤N−1} ‖g_i^{[0]}‖_{L∞(0,T)}.

The case N = 2 is analyzed separately in [8].
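In discrete form, the relaxation step (3.3) is a pointwise convex combination of the neighbouring subdomain trace and the previous interface guess; the following short sketch (array names and sizes are ours, purely for illustration) shows the update for a single interface with the value θ = 1/2 analyzed in Remark 3.2.

    import numpy as np

    theta = 0.5
    g_old = np.zeros(100)            # previous interface guess g_i^{[k-1]}(t_m)
    u_trace = np.random.rand(100)    # trace of u_i^{[k]} (or u_{j+1}^{[k]}) on that interface
    g_new = theta * u_trace + (1.0 - theta) * g_old    # update (3.3)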


Table 3.1: Steps for the DNWR iteration for five subdomains, two waveform iterations.

    Steps    Ω1       Ω2       Ω3       Ω4       Ω5
    s = 1                      DD[1]
    s = 2             DN[1]             ND[1]
    s = 3    DN[1]             DD[2]             ND[1]
    s = 4             DN[2]             ND[2]
    s = 5    DN[2]                               ND[2]

3.1. Classical DNWR implementation. The original algorithm for general N subdomains can be parallelized in an effective way. As soon as the subproblems in the immediate neighboring subdomains are solved, a new waveform iteration can be started in the middle subdomain. A step-by-step computation table for five subdomains is included in Table 3.1; the superscript in square brackets indicates the iteration number. Optimally, this implementation can use up to min{2K, ⌈N/2⌉} concurrent processing cores for a setting of N subdomains and K waveform iterations. The total number of steps to compute K waveform iterations is 1 + 2(K − 1) + ⌊N/2⌋, and (N − 1)(2K − 1) messages of size Nt words are needed, where Nt is the number of time steps. A pseudo-code for this method in R^1 is given in Algorithm 3. Here we consider a uniform time discretization for each of the subdomains. However, a generalization with more heterogeneity is possible, provided an interpolation of the Neumann or Dirichlet traces is computed in lines 13, 22 and 25 of Algorithm 3.
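The step and concurrency counts just quoted are easy to tabulate; the helper below (our own convenience code) evaluates 1 + 2(K − 1) + ⌊N/2⌋ and min{2K, ⌈N/2⌉} and, for N = 5 and K = 2, reproduces the five-step schedule of Table 3.1 with at most three subdomains active at once.

    from math import ceil, floor

    def dnwr_classical_counts(N, K):
        """Total sequential steps and peak concurrency of the classical DNWR sweep."""
        steps = 1 + 2 * (K - 1) + floor(N / 2)
        peak_cores = min(2 * K, ceil(N / 2))
        return steps, peak_cores

    print(dnwr_classical_counts(5, 2))   # -> (5, 3), matching Table 3.1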


     1: MPI Initialization;
     2: for i = 1 to N do in parallel
     3:     for n = 1 to Nt do
     4:         Guess g_i^{[0]}(t_n) on Γ_i for i = 1, . . . , (N − 1);
        /* Loop of pre-determined steps */
     5: for s = 1 to 1 + 2(K − 1) + ⌊N/2⌋ do
     6:     for i = 1 to N do in parallel
     7:         if i = ⌈N/2⌉ is permissible then
     8:             if required in step s then
     9:                 Receive Dirichlet data from left and right;
    10:             for n = 1 to Nt do
    11:                 Solve equation (3.1) for u_i^{[k]}(x, t_n);
    12:             Compute Neumann data along interfaces;
    13:             Transmit Neumann data;
    14:         else if i is permissible in step s then
    15:             Receive Neumann data from right (left);
    16:             if required in step s then
    17:                 Receive Dirichlet data from left (right);
    18:             for n = 1 to Nt do
    19:                 Solve equation (3.2) for u_i^{[k]}(x, t_n);
    20:             Update Dirichlet trace, g_i^{[k]}(t_n) (or g_{i−1}^{[k]}(t_n)), equation (3.3);
    21:             if i > 1 or i < N then
    22:                 Compute Neumann data along left (right) interface;
    23:             Transmit Dirichlet data to right (left);
    24:             if i > 1 or i < N then
    25:                 Transmit Neumann data to left (right);

Algorithm 3: The classical implementation of the DNWR algorithm is able to utilize min{2K, ⌈N/2⌉} computing cores if K waveform iterations are computed for N non-overlapping subdomains.

3.2. Pipeline DNWR implementation. In this section, we rewrite the DNWR algorithm in a manner similar to the pipeline NNWR algorithm described in Section 2.2. This enables the computation of multiple waveform iterations in a pipeline fashion. We decompose the given time window into smaller time blocks, and communicate information to other processors after the computation on each time block, so that other processors can simultaneously work on different time blocks and/or different waveform iterations. We illustrate the pipeline evaluation for a simple three-subdomain, two-time-block example. Similar to Section 2.2, suppose [0, T] is decomposed as 0 = T_0 < T_1 < T_2 = T and Ω_{ij} denotes the space-time domain Ω_i × [T_{j−1}, T_j]. The pipeline implementation starts with solving the Dirichlet subproblem, equation (3.1), for u^{[1]} in the middle subdomain Ω_{21} with one processor. As soon as this computation is completed, Neumann data are computed and sent to the processors associated with the neighboring subdomains Ω_{11} and Ω_{31}. The algorithm then proceeds with solving the mixed Dirichlet–Neumann subproblems (3.2) for u^{[1]} in Ω_{11} and Ω_{31}, as well as solving a Dirichlet subproblem (3.1) for u^{[1]} in Ω_{22}, simultaneously with three processors. Then messages are sent to the appropriate processors, and the algorithm proceeds with solving the Dirichlet–Neumann subproblems for u^{[1]} in Ω_{12} and Ω_{32}, as well as solving a Dirichlet subproblem in the next waveform iteration for u^{[2]} in Ω_{21}, simultaneously with three processors. A graphical description of the pipeline framework for this example is shown in Figure 3.2. In general, if we evaluate K waveform iterations in a domain that is decomposed into N spatial subdomains and a large number of time blocks J, then one can use up to NK concurrent processors with expected parallel efficiency J/(J + ⌊N/2⌋ + 2(K − 1)) (due to the initial start-up cost). A pseudo-code for the pipeline DNWR in R^1 is given in Algorithm 4. It is evident that the pipeline DNWR implementation has a start-up and shut-down phase before multiple waveform iterates can be computed in parallel.
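The expected parallel efficiency J/(J + ⌊N/2⌋ + 2(K − 1)) quoted above can be spot-checked for parameter values of the same order as the experiments in Section 3.3; the helper and the chosen values of K are our own illustration.

    def dnwr_pipeline_efficiency(N, K, J):
        """Expected efficiency of the pipeline DNWR with N*K cores and J time blocks."""
        return J / (J + N // 2 + 2 * (K - 1))

    for K in (1, 8, 32, 64):
        print(K, round(dnwr_pipeline_efficiency(16, K, 500), 3))
    # -> 0.984, 0.958, 0.877, 0.789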


Fig. 3.2: A pipeline DNWR implementation for a three-subdomain, two time block example. (Schematic: the middle subdomain Ω_2 first solves u^{[1]} in Ω_{21}; Ω_1, Ω_3 and Ω_2 then proceed with u^{[1]} in Ω_{11}, Ω_{31} and Ω_{22}, and so on along the wall-time axis, with u^{[2]} following one block behind.)

Fig. 3.3: Three-dimensional structure of the pipeline DNWR algorithm, each plane representing one time block (the 1st, 2nd and 3rd time levels are staggered along the wall-time axis).

Figure 3.3 gives a snapshot of the first four computation steps of the pipeline DNWR algorithm for five subdomains. Each vertical plane corresponds to a new time block that is added to the existing computation structure. On each such plane, a complete DNWR method is computed on a particular time block. Light red dots indicate the first waveform iteration, and dark red dots the second. Note that the same 2D pyramidal structure of Table 3.1 for the classical DNWR method is repeated on each plane, with a shift of one horizontal step.


    MPI Initialization;
    for i = 1 to N do in parallel
        for n = 1 to Nt do
            Guess g_i^{[0]}(t_n) on Γ_i for i = 1, . . . , N − 1;
        for k = 1 to K do in parallel
            if i = ⌈N/2⌉ then                                  /* Middle subdomain */
                for j = 1 to J do
                    if k > 1 then
                        receive updated Dirichlet data from left and right;
                    for n = 1 to Nt/J do
                        Solve equation (3.1) for u_i^{[k]}(x, t_n);
                    Compute Neumann data along interfaces;
                    Send Neumann data to immediate left and right subdomains;
            else                                               /* Other subdomains */
                for j = 1 to J do
                    Receive Neumann data from right (left);
                    for n = 1 to Nt/J do
                        Solve equation (3.2) for u_i^{[k]}(x, t_n);
                    Update Dirichlet traces, g_i^{[k]}(t_n) (or g_{i−1}^{[k]}(t_n)), equation (3.3);
                    if i > 1 or i < N then
                        Compute Neumann data along left (right) interface;
                    if k < K then
                        Send Dirichlet data to right (left);
                    if i > 1 or i < N then
                        Transmit Neumann data to left (right);

Algorithm 4: The pipeline implementation of the DNWR algorithm is able to utilize NK computing cores if the domain is broken into N non-overlapping subdomains.

3.3. Numerical Experiments. In this section, we run a few experiments for the pipeline DNWR algorithm, where many waveform iterates can be concurrently computed. We compare the computational time for the same model problem as in Section 2.3. The first numerical experiment is a weak scaling study for a small number of processor cores. Figure 3.4 verifies the expected behavior: for a large number of time blocks, J = 400, the pipeline DNWR uses NK processing cores to compute K waveform iterates for N subdomains. We see that the efficiency is close to the expected theoretical efficiency for different numbers of subdomains. A large-scale weak scaling experiment is performed and shown in Figure 3.5. We consider a large number of time blocks, J = 500, and the implementation is for 16 subdomains. The pipeline DNWR implementation using NK cores also gives the expected efficiency gain.

Fig. 3.4: [Superior] Weak scaling for pipeline DNWR: wall time vs. number of cores, for N = 8, 15 and 24 subdomains (together with the corresponding theoretical values), for a fixed number of time levels J = 400. We vary the number of cores NK with the waveform level K. Nx = 30000, Nt = 4000.

Fig. 3.5: [MSU stampede] Weak scaling for pipeline DNWR: wall time vs. waveform iterates (numerical data and theoretical data). N = 16 subdomains, J = 500, NK processor cores, Nx = 32000, Nt = 5000. (Wall times lie between roughly 34 and 40 seconds for K = 1 to 64.)

4. Conclusions. In this paper, we have reformulated the DNWR and NNWR methods to allow for pipeline-parallel computation of the waveform iterates after an initial start-up cost. Theoretical estimates for the parallel speedup and communication overhead are presented, along with scaling studies that show the effectiveness of the pipeline DNWR and NNWR algorithms.


REFERENCES [1] P. E. Bjørstad and O. B. Widlund, Iterative Methods for the Solution of Elliptic Problems on Regions Partitioned into Substructures, SIAM J. Numer. Anal., (1986). [2] J. F. Bourgat, R. Glowinski, P. L. Tallec, and M. Vidrascu, Variational Formulation and Algorithm for Trace Operator in Domain Decomposition Calculations, in Domain Decomposition Methods, T. F. Chan, R. Glowinski, J. Priaux, and O. B. Widlund, eds., SIAM, 1989, pp. 3–16. [3] J. H. Bramble, J. E. Pasciak, and A. H. Schatz, An Iterative Method for Elliptic Problems on Regions Partitioned into Substructures, Mathematics of Computation, (1986). [4] D. Funaro, A. Quarteroni, and P. Zanolli, An Iterative Procedure with Interface Relaxation for Domain Decomposition Methods, SIAM J. Numer. Anal., 25 (1988), pp. 1213–1236. [5] M. J. Gander and L. Halpern, Optimized Schwarz Waveform Relaxation for Advection Reaction Diffusion Problems, SIAM J. Num. Anal., 45 (2007), pp. 666–697. [6] M. J. Gander, L. Halpern, and F. Nataf, Optimal Schwarz Waveform Relaxation for the One Dimensional Wave Equation, SIAM J. Num. Anal., 41 (2003), pp. 1643–1681. [7] M. J. Gander, F. Kwok, and B. C. Mandal, Dirichlet-Neumann and Neumann-Neumann Waveform Relaxation for the Wave Equation, in Domain Decomposition Methods in Science and Engineering XXII, T. Dickopf, M. J. Gander, L. Halpern, R. Krause, and L. Pavarino, eds., Lecture Notes in Computational Science and Engineering, Springer International Publishing, 2016, pp. 501–509. [8] , Dirichlet-Neumann and Neumann-Neumann Waveform Relaxation Algorithms for Parabolic Problems, to appear in Electronic Transactions on Numerical Analysis, (arXiv:1311.2709). , Dirichlet-Neumann Waveform Relaxation Method for the 1D and 2D Heat and Wave [9] Equations in Multiple subdomains, submitted, (arXiv:1507.04011). [10] M. J. Gander and A. M. Stuart, Space-time continuous analysis of waveform relaxation for the heat equation, SIAM J. for Sci. Comput., 19 (1998), pp. 2014–2031. [11] E. Giladi and H. Keller, Space time domain decomposition for parabolic problems, Tech. Report 97-4, Center for research on parallel computation CRPC, Caltech, 1997. [12] F. Kwok, Neumann-Neumann Waveform Relaxation for the Time-Dependent Heat Equation, in Domain Decomposition in Science and Engineering XXI, J. Erhel, M. J. Gander, L. Halpern, G. Pichot, T. Sassi, and O. B. Widlund, eds., vol. 98, Springer-Verlag, 2014, pp. 189–198. [13] E. Lelarasmee, A. Ruehli, and A. Sangiovanni-Vincentelli, The waveform relaxation method for time-domain analysis of large scale integrated circuits, IEEE Trans. Compt.Aided Design Integr. Circuits Syst., 1 (1982), pp. 131–145. ¨ f, Sur l’application des m´ [14] E. Lindelo ethodes d’approximations successives ` a l’´ etude des int´ egrales r´ eelles des ´ equations diff´ erentielles ordinaires, Journal de Math´ ematiques Pures et Appliqu´ ees, (1894). [15] B. C. Mandal, Convergence Analysis of Substructuring Waveform Relaxation Methods for Space-time Problems and Their Application to Optimal Control Problems, 2014. Thesis (Ph.D.)–University of Geneva. [16] , A Time-Dependent Dirichlet-Neumann Method for the Heat Equation, in Domain Decomposition in Science and Engineering XXI, J. Erhel, M. J. Gander, L. Halpern, G. Pichot, T. Sassi, and O. B. Widlund, eds., vol. 98, Springer-Verlag, 2014, pp. 467–475. , Neumann-Neumann Waveform Relaxation Algorithm in Multiple Subdomains for Hy[17] perbolic Problems in 1d and 2d, to appear in Numerical Methods for Partial Differential Equations, (arXiv:1507.04008). 
[18] L. Martini and A. Quarteroni, An Iterative Procedure for Domain Decomposition Methods: a Finite Element Approach, SIAM, in Domain Decomposition Methods for PDEs, I, (1988), pp. 129–143. [19] , A Relaxation Procedure for Domain Decomposition Method using Finite Elements, Numer. Math., (1989). [20] B. W. Ong, S. High, and F. Kwok, Pipeline schwarz waveform relaxation, in Domain Decomposition Methods in Science and Engineering XXII, T. Dickopf, M. J. Gander, L. Halpern, R. Krause, and L. Pavarino, eds., Lecture Notes in Computational Science and Engineering, Springer International Publishing, 2016, pp. 179–187. [21] E. Picard, Sur l’application des m´ ethodes d’approximations successives ` a l’´ etude de certaines ´ equations diff´ erentielles ordinaires, Journal de Math´ ematiques Pures et Appliqu´ ees, (1893). [22] Y. D. Roeck and P. L. Tallec, Analysis and Test of a local domain decomposition preconditioner, in Domain Decomposition Methods for PDEs, I, R. Glowinski et al., ed.,

16

B. C. Mandal and B. W. Ong

Philadelphia, 1991, SIAM, pp. 112–128. [23] P. L. Tallec, Y. D. Roeck, and M. Vidrascu, Domain decomposition methods for large linearly elliptic three-dimensional problems, J. of Comput. and App. Math., (1991).