A High-Performance Parallel Implementation of the ...

A High-Performance Parallel Implementation of the Certified Reduced Basis Method David J. Knezevica,∗, John W. Petersonb a Department

of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139 USA b Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX 78758-4497

Abstract The Certified Reduced Basis method (herein RB method) is a popular approach for model reduction of parametrized partial differential equations. The RB method involves: (i ) an expensive Offline stage in which a reduced-order model is constructed and (ii ) a comparatively inexpensive Online stage in which the reduced model is evaluated for given parameter values. In the present work, and in concert with the RB method itself, we emphasize a two-level hierarchy of computational effort: a high-performance implementation on parallel supercomputers for the Offline stage, and a lightweight implementation for the Online stage suitable for laptop, desktop, and other “thin” computing resources. Our particular Offline implementation is of key importance and is the focus of this paper. We discuss two opportunities for parallelization in the Offline context: standard (domain-decomposition) parallelization of the underlying “truth” finite element solves, as well as a novel parallel partition of the discrete parameter set employed to “train” the reduced basis. We show numerical results for two examples: a many-parameter heat transfer problem and a natural convection fluid flow problem, neither of which would be computationally feasible without the aforementioned parallelization of the Offline implementation. In each case we also demonstrate the significant speedup in the Online stage, compared to the truth, which is a hallmark of the RB method. Keywords: Reduced basis method, model reduction, a posteriori error bounds, finite element methods, high-performance computing, parametrized partial differential equations, Boussinesq equation, heat transfer

1. Introduction High-fidelity computational simulation of parametrized partial differential equations (PDEs) is fundamentally important in many scientific and engineering contexts. However, classical high-fidelity numerical discretizations — finite ∗ Corresponding

author.

Preprint submitted to CMAME

May 26, 2010

element methods, finite difference methods, spectral methods — are often computationally expensive, which limits their applicability in situations in which the PDE needs to be solved for many different parameters (optimization, inverse problems, multiscale analysis), or in which real-time evaluation is desirable (control applications, “in the field” design). To address this limitation a large body of model reduction literature has emerged [1, 2, 3, 4]. The basic goal in model reduction is to develop a low-dimensional representation that accurately captures the dynamics of a high-fidelity numerical discretization over a specific parameter regime of interest. One well-established approach to model reduction of parametrized PDEs is the certified Reduced Basis (RB) method [5, 6, 7, 8, 9]. In the RB method, a very low-dimensional approximation to the “solution manifold” of a parametrized PDE is constructed, typically resulting in a reduced order model that achieves orders of magnitude savings in computation time and memory. Throughout this paper we shall assume there exists a sufficiently accurate (for our purposes) finite element discretization of a parametrized PDE, and that we seek to construct a reduced-order model based on this discretization via the RB method. Employing the standard RB nomenclature, we shall henceforth refer to the high-fidelity discretization as the “truth” model. It has been wellestablished [9] that, for a class of PDEs that satisfy the “affine hypothesis,” the RB method allows rapid and highly-accurate approximation of the highfidelity (truth) model and, moreover, incorporates inexpensive a posteriori error bounds that provide a rigorous measure of the error incurred in the model reduction process. The focus of this paper, however, is to demonstrate that by exploiting opportunities for parallelism in the RB framework, we can realize the full potential of the RB methodology for large-scale PDE simulations. The “payoff” of model reduction clearly increases in direct proportion to the complexity and expense of the full-fidelity simulation. As mentioned previously, the RB framework is naturally separated into an “inexpensive” Online stage and an “expensive” Offline stage. In the inexpensive Online stage the reduced-order model and corresponding error bounds can be evaluated for user-specified input parameters — the cost of the Online calculations are independent of the truth discretization and hence can be performed very rapidly and with limited computational resources [10]. The Offline stage, on the other hand, corresponds to the construction of the reduced order model. In this stage we appeal to the truth discretization to develop basis functions that are highly tailored to the solution manifold, as well as to perform other truth model dependent preprocessing for subsequent Online evaluation steps. Our high-performance implementation of the Offline stage of the RB dichotomy enables robust treatment of problems with many parameters as well as problems with extremely expensive truth solves — both of these tasks are out of reach without the ability to exploit parallel computing resources. 1.1. RB Method Preliminaries We suppose that the system input µ ∈ D ⊂ RP (where D is a bounded set) enters as a parameter in a PDE which describes the relevant physical phenomena 2

over an appropriate spatial domain Ω ⊂ Rd , d = 1, 2 or 3. We may consider either steady-state or time-dependent problems; in the time-dependent case we suppose 0 ≤ t ≤ tf is the time interval of interest. This PDE yields (i) a field variable over Ω, u(µ) ∈ X(Ω) (where X(Ω) is an appropriate function space), and (ii) a scalar output of interest, s(µ) ∈ R, which can be expressed as a (say) linear functional of the field variable, s(µ) = `(u(µ)). In this simple discussion we consider only a single scalar output, but in actual practice we often consider multiple outputs. Note that the parameter dependence proceeds from the PDE through the field variable and finally to the engineering output. We suppose that the parametrized PDE is discretized by finite elements in space, and in the time-dependent case, by a (say) Backward Euler scheme in time. We then obtain a high-fidelity finite element approximation by computing the Galerkin projection over a finite element approximation subspace X N (⊂ X) of large dimension N . For any µ ∈ D, our truth approximation is given by uN (µ) ∈ X N and sN (µ) = `(uN (µ)) ∈ R. We note that, given our restriction to µ ∈ D, all solutions of interest necessarily reside on the parametrically-induced manifold MN ≡ {uN (µ) | µ ∈ D}. We observe that this manifold is relatively low-dimensional and in many cases of practical interest it can be expected to be smooth. We assume that the truth approximation provides, for sufficiently large N , the desired accuracy. In addition, and especially for the example problems considered here, we assume that the solution of the truth system is of sufficient computational expense (in the context of our intended use of said solution) that we also require an accelerator for the mapping from µ ∈ D to uN (µ) and sN (µ). Hence we pursue the RB approach as a low-dimensional approximation to the truth approximation. Assuming an RB approximation space (which we shall call XN (⊂ X N ) where dim(XN ) = N ) has been constructed1 we then consider Galerkin projection over XN to obtain for any µ ∈ D, uN (µ) and sN (µ) = `(uN (µ)). As part of the RB method, we also compute rigorous a posteriori bounds, ∆N (µ) and ∆sN (µ), for the error in the RB field approximation and the RB output approximation, respectively. Thus for any µ ∈ D, we have kuN (µ) − uN (µ)kX ≤ ∆N (µ) and |sN (µ)−sN (µ)| ≤ ∆sN (µ). The existence of these bounds allows us to claim that our RB approximation is certified. The RB approximation space XN is specifically designed to accurately approximate functions which reside on the parametrically-induced manifold of interest, MN : indeed, XN is developed as the span of well-chosen snapshots on MN . (In contrast, even in a mesh-adaptive context, the truth finite element approximation space X N can represent a large class of functions very distant from MN .) The reduction in dimension from N to N in conjunction with an Offline-Online computational approach provides the RB advantage in the real1 In practice we employ the Greedy algorithm to construct X , see Section 2 for details. N Additionally, in the time-dependent case, the reduced basis approximation inherits the same temporal discretization as the truth.

3

time or many-query contexts. We now discuss the Offline-Online decomposition in more detail. 1.2. Offline-Online Decomposition In the Offline stage we develop the RB space: we identify optimal (combinations of) snapshots of MN and we “precompute” various parameter-independent functionals of these snapshots implicated in subsequent RB approximations and associated RB error bounds. This Offline stage is expensive, O(N γ ) FLOPs, where γ is a problem-dependent factor related to the computational cost of the truth solves. The Offline stage yields a (problem-dependent) Online Dataset; this dataset is relatively small — it scales only with Q and N , where Q measures the parametric complexity of our PDE. In the Online stage we invoke the Online Dataset to perform rapid certified output evaluation: given any µ ∈ D we calculate the RB output approximation sN (tk ; µ) and the associated RB output error bound, ∆sN (tk ; µ), 0 ≤ k ≤ K. This Online stage is very inexpensive — the FLOPs count also scales only with Q and N , with N N . Since the computational complexity of the Online stage is independent of N it is evidently appealing to perform model reduction of expensive truth models, e.g. high-resolution discretizations of nonlinear PDEs posed in three spatial dimensions. Our high-performance parallel implementation of the Offline stage makes this feasible; in essence bringing the power of a supercomputer to a desktop machine or an even more limited platform such as a smartphone [10]. Of course, the Offline stage is computationally intensive (much more so than a “single” traditional truth solve) and therefore the RB paradigm is not relevant in all circumstances. It is, however, highly advantageous in two situations of practical interest. The first is when rapid, reliable real-time response is required, such as for informing real-time engineering design or optimization decisions “in the field,” or for embedded control systems. In this context, the Offline computation time is essentially irrelevant. The second situation is when the parametrized PDE needs to be evaluated for many different parameters — the so-called many-query context — which is relevant for inverse problems, optimization, uncertainty quantification and multiscale analysis. In the many-query context we typically require m 1 evaluations of an RB model, hence the Offline computation time is amortized over the m queries and becomes negligible as m → ∞. The remainder of this paper is arranged in the following way: in Section 2, we provide a brief overview of the reduced basis method. We employ a simple one-parameter model problem in order to illustrate the ideas behind the methodology, and briefly discuss some possible generalizations. In Section 3, we discuss the new features of our high-performance parallel implementation of the Offline stage of the RB method, emphasizing the key areas in which parallel algorithms have been employed. In Section 4 we present some RB results for two computationally intensive problems: a “Swiss cheese” thermal conduction problem with parametrized thermal conductivities and a natural convection problem in an enclosed cavity. Both problems are based on high-resolution three-dimensional finite element truth discretizations. The former demonstrates the efficacy of our 4

implementation for a problem with “many” parameters while the latter demonstrates robust model reduction for a nontrivial nonlinear problem. In each case the Offline stage was performed on the TeraGrid supercomputer Ranger at the Texas Advanced Computing Center (TACC). 2. Certified Reduced Basis Formulation In this section we describe the main ideas of the RB method. We focus on a simple one-parameter model problem in order to simplify our discussion, and we indicate generalizations during the discussion. For more details on the material in this section see, for example, the review article [9]. 2.1. Model Problem To fix ideas we consider steady heat conduction in a (say, polygonal) domain Ω = Ω1 ∪ Ω2 : the normalized thermal conductivity in Ω1 (respectively, Ω2 ) is unity (respectively, κ). We apply a uniform unit heat source over the entire domain Ω. We require that the temperature field, u, vanish — zero Dirichlet conditions — on the domain boundary ∂Ω. We take for our output of interest, s, the integral of the temperature over Ω1 . We consider a single (P = 1) parameter: µ ≡ κ, the conductivity in Ω2 ; D, the parameter domain, is given by the interval [1, 10]. In mathematical terms, u(µ) ∈ X, where X = H01 (Ω); here H01 (Ω) = {v ∈ 1 H | v|∂Ω = 0}, H 1 (Ω) = {v ∈ L2 (Ω) | |∇v| ∈ L2 (Ω)}, and L2 (Ω) is the space of square integrable Rfunctions over Ω. We associate to the space p X the inner product (w, v)X ≡ Ω ∇w · ∇v and induced norm kwkX ≡ (w, w)X R and to the space L2 (Ω) the inner product (w, v) ≡ Ω wv and induced norm p kwk ≡ (w, w). We then R R define the continuous and coercive bilinear forms a1 ≡ Ω1 ∇w · ∇v, a2 ≡ Ω2 ∇w · ∇v, and a(w, v; µ) ≡ a1 (w, v) + µa2 (w, v), ∀w, v ∈ X, (1) R R and the bounded linear forms f (v) = Ω v and `(v) = Ω2 v. We can now provide the weak statement of our PDE: given µ ∈ D, find u(µ) ∈ X such that a(u(µ), v; µ) = f (v), ∀v ∈ X; evaluate s(µ) = `(u(µ)). We note that (1) is a special case of a more general class of bilinear forms. We say that a bilinear form a is “affine in parameter” (or more precisely, “affine in functions of the parameter”) if we can write a(w, v; µ) =

Q X

Θq (µ)aq (w, v),

(2)

q=1

where the Θq : D → R are easily evaluated (O(1) FLOPs) parameter-dependent coefficient functions and the aq are parameter-independent continuous bilinear forms. Similar expansions may be developed for f and `. The assumption (2) is a prerequisite for the crucial Offline-Online decomposition alluded to in the Introduction. 5

In fact many problems admit a representation of the form (2) which can either be identified by inspection or, in the case of certain geometric variations, be constructed in an automated fashion [9]. More generally, if the PDE is not amenable to an affine decomposition, we can develop an affine representation which approximates the PDE [11]; in certain cases rigorous bounds can then be developed that quantify the additional “consistency” errors introduced by this approximation [12]. We next define the truth finite element approximation. We introduce a triangulation of Ω, T N , to which we associate a standard conforming polynomial finite element approximation space X N . We can then provide the truth approximation: given µ ∈ D, find uN (µ) ∈ X N such that a(uN (µ), v; µ) = f (v), ∀v ∈ X N . The truth output is then evaluated as sN (µ) = `(uN (µ)). 2.2. Reduced Basis Approximation We shall assume that we are given Nmax nested RB approximation spaces X1 ⊂ X2 ⊂ · · · XN · · · ⊂ XNmax ; here XN is of dimension N . The RB approximation uN (µ) ∈ XN is expressed in terms of (·, ·)X -orthonormal basis functions {ζi }i=1,...,Nmax : N X uN (x; µ) = ωN j (µ)ζj (x), (3) j=1

where x is a point in Ω. The ζi ∈ X N — orthonormalized snapshots from MN — are in practice characterized by the nodal values ζj (xi ), 1 ≤ i ≤ N , 1 ≤ j ≤ N , where xi denotes a node of the triangulation T N . At the conclusion of this section we shall be in a position to succinctly summarize the algorithm by which these RB spaces, and in particular the underlying snapshots, are selected. We can now provide the RB Galerkin approximation: given µ ∈ D, find uN (µ) ∈ XN such that a(uN (µ), v; µ) = f (v), ∀v ∈ XN ; evaluate sN (µ) = `(uN (µ)). We know by Céa’s Lemma that this approximation is optimal (in the energy norm): the Galerkin projection chooses the best linear combination of snapshots. We can further develop the corresponding matrix equations (inserting (3) into our weak form and testing on v = ζi , 1 ≤ i ≤ N ): given µ ∈ D, find ωN (µ) ∈ RN , the solution to AN (µ)ωN (µ) = FN , and evaluate sN (µ) = LT ωN (µ). Here AN i,j (µ) = a(ζj , ζi ; µ), 1 ≤ i, j ≤ N, is our stiffness matrix, FN i = f (ζi ), 1 ≤ i ≤ N, is our load vector, and LN i = `(ζi ), 1 ≤ i ≤ N, is our output vector. Note that AN , FN , and LN are principal submatrices/subvectors of ANmax , FNmax , and LNmax , respectively, thanks to the hierarchical nature of the RB spaces. We could simply calculate, for any given query instance µ, the matrix elements AN i,j (µ) = a(ζj , ζi ; µ), 1 ≤ i, j ≤ N, as N 2 integrations over Ω. However this approach is very expensive — O(N N 2 ) — and indeed perhaps even more expensive than direct calculation of the truth finite element approximation uN (µ), sN (µ). Instead, and in anticipation of our Offline-Online strategy, we must partition the computation of ωN (µ) and sN (µ) into two complementary components: a “construction” part which will be expensive — operation count

6

O(N ) — but which will not depend on the query instance µ and hence may be performed in the pre-deployment period; and an “evaluation” part which will depend on the query instance but which will be inexpensive: operation count O(N ), not O(N ). The key enabler is the “affine in parameter” structure of the bilinear form: (1) implies that a(ζj , ζi ; µ) = a1 (ζj , ζi ) + µa2 (ζj , ζi ), 1 ≤ i, j ≤ N , and hence that AN (µ) = A1N + µA2N , where A1N i,j = a1 (ζj , ζi ) and A2N i,j = a2 (ζj , ζi ), 1 ≤ i, j ≤ N , are parameter-independent “proto-stiffness” matrices. (In the operation count and storage estimates developed below we shall extend this argument to the general affine expansions described by (2). In that case we will of course have Q proto-stiffness matrices.) The Offline-Online partition is now evident. In the construction part we compute (and store) the elements of A1N and A2N in O(N N 2 ) FLOPs. In the evaluation part, for a given query instance µ, we first form AN (µ) = A1N + µA2N , from (the stored) A1N and A2N , in O(N 2 ) FLOPs; we then solve the small system AN (µ)ωN (µ) = FN in O(N 3 ) FLOPs. Finally, we calculate sN (µ) = LTN ωN (µ) in O(N ) FLOPs. (Note that AN (µ) is small but, unfortunately, dense.) The operation count for the evaluation part is therefore independent of the truth resolution, N . 2.3. RB a posteriori error bounds We can also readily develop rigorous a posteriori error bounds for this model problem which will apply to the more general “affine” case. As in the finite element context, the a posteriori error bounds here are intimately tied to the residual r(v; µ) ≡ f (v) − a(uN (µ), v; µ), ∀v ∈ X N . We denote the Riesz representation of the residual by R(µ) ∈ X N , which satisfies the Poisson problem (R(µ), v)X = r(v; µ),

∀v ∈ X N .

(4)

We may then introduce our X-norm error bounds for the field variable and output: ∆N (µ) ≡ ∆sN (µ)

kR(µ)kX

≡

C` kR(µ)kX ,

(5) (6)

where C` = supv∈X N `(v)/kvkX . Note that C` is a calculable constant by the Riesz representation theorem: let L ∈ X N satisfy (L, v)X = `(v),

∀v ∈ X N ,

(7)

then C` = kLkX . The X-norm error bound ∆N (µ) has a familiar form: the dual norm of the residual divided by a lower bound for the relevant stability (here coercivity, more generally inf-sup) constant; for our simple model problem a lower bound for the stability constant is unity, but in general this will not be the case. It can be readily demonstrated that for any µ in D (and for any N , 1 ≤ N ≤ Nmax ), kuN (µ) − uN (µ)kX ≤ ∆N (µ) and |sN (µ) − sN (µ)| ≤ ∆sN (µ). 7

In anticipation of our Offline-Online strategy we must now find a partition of the computation of kR(µ)kX into construction and evaluation components. Towards that end, we insert the residual expression into (4) and invoke the affine structure of our bilinear form, (1), and our representation (3), to obtain (R(µ), v)X = f (v) −

N X

ωN i (µ)a1 (ζi , v)

i=1

−

N X

µωN i (µ)a2 (ζi , v),

∀v ∈ X N .

i=1

We then apply linear superposition to express R(µ) as R(µ) =

2N +1 X

Φi (µ)Gi ,

(8)

i=1

for Φi : D → R, Gi ∈ X N , 1 ≤ i ≤ 2N + 1.2 Note that the Gi , 1 ≤ i ≤ N , are independent of µ. Finally, from (8) it follows that q (9) ∆N (µ) = ΦT (µ)ΛΦ(µ), ∆sN (µ) = C` ∆N (µ) , where Λi,j ≡ (Gi , Gj )X , 1 ≤ i, j ≤ 2N + 1. The construction-evaluation partition is then clear: in the construction part we form and store the µ-independent quantity Λ (and C` ) in O(N N 2 ) FLOPs; in the evaluation part we calculate the error bound from (9) in O(4N 2 ) FLOPs. We can encapsulate the computational tasks associated with RB approximation and RB error estimation in the following procedures. Construction is represented as [S eval ] = Construction({ζn }n=1,...,Nmax ), (10) where S eval ≡ {A1Nmax , A2Nmax , FNmax , LNmax , Λ, C` }. Evaluation is represented as [ωN (Ξ), ∆N (Ξ), sN (Ξ), ∆sN (Ξ), µ∗ ] = Evaluation(Ξ; N ; S eval ),

(11)

where Ξ ⊂ D is either a single value or a set of values of the parameter, and µ∗ = arg maxµ∈Ξ ∆N (µ) (which will arise in the Greedy Procedure discussed in Section 2.4). We now summarize the operation counts under the general affine hypothesis (2). The operation count for Construction is O(N QN 2 + N Q2 N 2 ) and the storage requirement for S eval is O(QN 2 + Q2 N 2 ). The operation count for Evaluation is O(QN 2 + N 3 + Q2 N 2 ); there is no explicit dependence on the number of parameters P , though in general larger P will require larger N for any given error tolerance. 2 We provide the explicit form: Φ = 1, Φ (µ) = ω 1 2 N 1 (µ), Φ3 (µ) = ωN 2 (µ), . . . , ΦN +2 = µωN 1 (µ), . . . , Φ2N +1 = µωN N (µ); ∀v ∈ X N , (G1 , v)X = f (v), (G2 , v)X = −a1 (ζ1 , v), (G3 , v)X = −a1 (ζ2 , v), . . ., (GN +2 , v)X = −a2 (ζ1 , v), . . ., (G2N +1 , v)X = −a2 (ζN , v).

8

2.4. The Greedy Procedure To identify our snapshots and hence our RB spaces XN , 1 ≤ N ≤ Nmax , we apply the Greedy Procedure [13] detailed in Algorithm 1. The spaces are constructed sequentially: at each iteration we append the snapshot from N train MN } which is least well-represented by the current Ξtrain ≡ {u (µ) | µ ∈ Ξ RB approximation space (as measured by the RB field approximation error bound). Note that in Line 8 of Algorithm 1, I is the identity operator and ΠXN is the orthogonal projection onto XN with respect to the (·, ·)X inner product. Algorithm 1 Greedy Algorithm. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:

specify Ξtrain ⊂ D of size ntrain and tolerance . set ζ1 = uN (µ1 ) (normalized: kζ1 kX = 1, µ1 arbitrary in D). set N = 1, X1 = span(ζ1 ). set S eval = Construction(ζ1 ). set [.,err,.,.,µ∗ ] = Evaluation(Ξtrain ; 1; S eval ). while err > do N = err (for “reporting” purposes); ζN +1 = (I − ΠXN )uN (µ∗ ) (normalized); N ← N + 1, XN ← XN ⊕ span(ζN +1 ), ; S eval = Construction({ζi }i=1,...,N ) (update); set [.,err,.,.,µ∗ ] = Evaluation (Ξtrain , N, S eval ); end while set Nmax ← N . The operation count for the Greedy Algorithm (for elliptic PDEs) is 2 2 3 O(N γ Nmax ) + O(N QNmax ) + O(N Q2 Nmax ) + O(ntrain QNmax ) 3 4 + O(ntrain Q2 Nmax ) + O(ntrain Nmax ).

(12)

The first term relates to the calculation of the snapshots — Line 8 of Algorithm 1. The exponent γ ≥ 1 relates to the efficiency of the truth solution procedure. The remainder of the terms relate to Lines 10 and 11 of Algorithm 1: the operation counts associated with “integration” of Construction and Evaluation from N = 1 to N = Nmax . We make two crucial observations: the update snapshot is determined by the error bound and not the true error; the truth solution is calculated, in Line 8, only for the winning “arg max” candidate, deduced in Line 11. As ntrain → ∞, the cost of the “many query” calculation ∆N (Ξtrain ) is determined solely by the inexpensive Evaluation — the expensive Construction is asymptotically negligible. This strategy ensures O(ntrain )+O(N ) rather than O(ntrain N ) complexity, and thus permits consideration of very large training sets Ξtrain for which MN Ξtrain will indeed be a good surrogate for the actual manifold of interest MN ≡ {uN (µ) | µ ∈ D}. In actual practice, the Greedy Procedure can identify very effective RB spaces. Large training samples, enabled by our error bound “importance sampling” strategy and Construction–Evaluation many-query procedure, are of course imperative in particular for larger P . 9

2.5. Extensions of the methodology The above discussion was restricted to the case of linear elliptic PDEs, but the RB method has been applied to a wide range of problem types. We now provide an overview of generalizations that have been developed, including construction of stability factor lower bounds, linear parabolic and quadraticallynonlinear PDEs, and primal-dual formulations. In Section 2.1, the stability factor (the coercivity constant for the “stiffness” operator) was deduced by inspection. In general, however, we must calculate a lower bound for parameter-dependent stability factors in an Offline-Online “Successive Constraint Method” (SCM) [14], which recasts the problem of finding a stability factor lower bound as a linear program. The SCM Offline stage is rather onerous and requires the solution of a series of truth eigenvalue problems in order to obtain the constraints for the linear program. The Online cost, however, depends only on Q and is typically negligible compared to the RB Online cost. The RB formulation in the linear parabolic case follows along much the same lines as in Section 2.1 [15, 16, 17]. For the sake of illustration let us now consider a time-dependent version of the linear elliptic model problem discussed previously. Let the time interval t ∈ [0, tf ] be divided into equal time steps of size ∆t = tf /K and suppose our truth model employs an unconditionally stable Backward Euler temporal discretization. Then, given uN 0 ∈ X N , for each tk , k = 1, . . . , K, the truth problem takes the form: find uN k (µ) ∈ X N satisfying 1 m(uN k (µ) − uN k−1 (µ), v; µ) + a(uN k (µ), v; µ) = f (v; µ), ∀v ∈ X N . (13) ∆t Supposing the mass matrix satisfies an affine decomposition analogous to (2), the RB framework can be applied modulo some relatively simple modifications. For example, in the Construction operation we now require extra terms related to the RB mass matrix MN i,j = m(ζj , ζi ; µ), 1 ≤ i, j ≤ N , and in Evaluation we solve an RB version of (13) and compute associated error bounds for 1 ≤ k ≤ K. Algorithm 1 also requires some modification. First, the truth solve at the winning “arg max” candidate µ∗ (implied in Line 8 of Algorithm 1) becomes the time-dependent truth calculation (13). Also, a pure Greedy strategy is no longer optimal in the time-dependent case since it does not account for temporal causality. A better approach (proposed in [15]) is to employ the Proper Orthogonal Decomposition (POD) method in time and Greedy selection in parameter. For specified n and {wk ∈ X N , 1 ≤ k ≤ K}, we denote the POD “method of snapshots” [18] operation as {ζ i ∈ X, 1 ≤ i ≤ n} = POD({wk ∈ X N , 1 ≤ k ≤ K}, n). This operation requires the solution of a dense K × K eigenvalue problem and returns n ≤ K X-orthonormal functions {ζ i ∈ X, 1 ≤ i ≤ n} such that Pn = span{ζ i , 1 ≤ i ≤ n} satisfies the optimality property Pn = arg inf

Yn ⊂Y

X K k=1

inf kwk − wk2X

w∈Yn

10

1/2 ,

(14)

2: 3: .. .

set {ζ1 , . . . , ζδN } = POD({(I − ΠXN )uN k (µ1 ), k = 1, . . . , K}, δN ). set N = δN , XN = span({ζ1 , . . . , ζδN }). .. .

8: 9:

{ζN +1 , . . . , ζN +δN } = POD({(I − ΠXN )uN k (µ∗ ), k = 1, . . . , K}, δN ); N ← N + δN , XN ← XN ⊕ span({ζN +1 , . . . , ζN +δN });

Table 1: Modifications to Algorithm 1 to obtain the POD(t)-Greedy(µ) algorithm. where Y ⊂ span{wk , 1 ≤ k ≤ K} and dim(Yn ) = n. We finally arrive at a description of the POD(t)-Greedy(µ) Algorithm by replacing Lines 2, 3, 8 and 9 of Algorithm 1 as shown in Table 1. Note that the Construction operation does not explicitly depend on K, but that in the general parabolic case Evaluation incurs an extra O(KN 3 + KQ2 N 2 ) cost due to the need to solve the linear system and evaluate the error bound at each time step. In the case of linear time invariant (LTI) parabolic problems, such as (13), this extra cost reduces to only O(KN 2 ) due to a backforward solve at each time step; the LU-decomposition and O(Q2 ) assembly of parameter-dependent error bound quantities need only be done once. As a further simplification in the LTI case, we may “train” an RB space based on an impulse function in order to treat, with a single RB approximation, any temporal control function specified in the Online stage [16]. It is well known that primal-dual formulations can be exploited to improve the numerical approximation of output quantities. The extension to a primaldual formulation to obtain “superconvergent” output approximations (and error bounds) in the spirit of [19] is well-established in the RB literature. This requires the construction of a second RB space for the dual system associated with a given output functional and the augmentation of the RB output prediction with a dual correction term. The RB primal-dual formulation has been be developed for elliptic [9] and LTI parabolic [16] problems; the primal-dual output error bound is the product of the primal and the dual error bounds, and hence it typically yields improved output approximation and corresponding error bounds for a given amount of computational effort. From the point of view of implementation one simply constructs independent RB spaces for the primal and dual systems, hence the preceding discussion of the Greedy algorithm and the Offline-Online decomposition applies directly in the primal-dual context as well. We have emphasized linear PDEs so far, but we note that the RB method can also be extended to nonlinear PDEs. PDEs with quadratic nonlinearities (e.g. Burgers’ equation, Navier-Stokes equations) can be treated efficiently with relatively few modifications to the RB formulation [20, 21, 22], with two important caveats: (i ) the error bound evaluation requires O(N 4 ) operations and consequently the Online cost is greater than in the linear case, and (ii ) for the time-dependent quadratically nonlinear case the RB error bounds grow exponentially as a function of time. The latter effectively limits the parameter range

11

and/or the final time that can be considered before the error bounds become impractically large. Note that exponential error bound growth is inescapable in the sense that it reflects inherent instability. Such instability is well-known in the Navier-Stokes equations, for example, and also plagues a posteriori error analyses in the finite element context. Higher-order polynomial nonlinearities also satisfy an affinedecomposition but the computational cost of the RB error bound evaluation scales as O(N 2r ) where r is the order of the nonlinearity. Therefore, it is generally impractical to consider r > 2 nonlinearities within the conventional RB framework. The methodological extensions of the RB method to time-dependent, primaldual, and polynomially-nonlinear problems, as well as problems with nontrivial stability factors (requiring the SCM) can all be encapsulated within the generic elliptic framework of Sections 2.1 to 2.4 with minor modifications to Algorithm 1 and appropriate “overloading” of the Construction and Evaluation operations. In fact, this notion naturally leads us to pursue an object-oriented implementation of the RB framework which provides a high-level description of the RB method that is user-friendly, flexible and extensible. Remark 1. The preceding discussion focuses on parametrized parabolic PDEs; model reduction for hyperbolic PDEs is also of interest but is much more difficult due to limited smoothness in parameter. The development of RB methods for hyperbolic equations is currently an active research topic. Remark 2. PDEs that do not satisfy the affine expansion hypothesis (1) can be treated via the empirical interpolation method (EIM) [11], which is an important extension of the RB methodology that constructs an approximate affine decomposition. This approach can also be employed to treat more general (higher-order polynomial and non-polynomial) nonlinearities efficiently [23, 24]. Remark 3. Another beneficial extension that has been examined in the context of model reduction is to adaptively subdivide the parameter domain into a set of disjoint parameter space “elements” to yield a more efficient (i.e. lower dimensional) reduced order model on each element [25, 26, 27, 28]. In the context of the RB method, this strategy is referred to as an hp-RB approach [27, 28] by analogy with hp-refinement in the finite element context. The hp-RB method can be accommodated by wrapping the algorithms discussed above in an outer loop that controls the adaptive splitting of D. 3. High-Performance Implementation of the RB Framework In Section 2 we introduced the key concepts in the RB framework through a simple one-parameter elliptic problem, and indicated a range of possible generalizations. We now discuss our implementation of the RB framework as an extension of the open source C++ parallel finite element library libMesh [29]. Initially developed as a modular “plug-in” which could be easily added to an 12

existing libMesh installation, the RB implementation is now available for download within the libMesh library itself. We employ existing libMesh library functionality for all the truth finite element operations including (i ) PDE solves at “greedily” selected parameter values, (ii ) truth eigensolves required by the SCM, (iii ) Poisson solves to compute the Riesz representations for the residual terms and outputs, and (iv ) “integrations” over Ω to obtain the data set S eval . We also utilize libMesh’s interfaces to PETSc [30] and SLEPc [31] for sparse linear algebra and eigenvalue solver functionality, respectively. The object-oriented organization of the RB code is illustrated in Figure 1. The base class RBBase implements functionality common across all problem types: it stores the “corners” in RP of the parameter domain D, the training set Ξtrain ⊂ D, and the set of parameter-independent operators and parameterdependent functions from the affine expansion of the PDE. RBBase is extended (via the C++ inheritance mechanism) in two directions which encapsulate, in a broad sense, the two major classes of systems (steady, time-dependent, and nonlinear PDE systems and SCM eigensystems) which must be solved in the RB method. We note that RBBase is a template class, hence the two specific instantiations of RBBase shown in Figure 1, RBBase and RBBase, represent independent inheritance branches and not e.g. a “multiple inheritance” object design. The RBSystem class (left side of the inheritance tree in Figure 1) represents the RB implementation for elliptic problems. Through its descendants it also provides support for linear parabolic (TransientRBSystem) and quadraticallynonlinear parabolic (QNTransientRBSystem) PDEs. These classes inherit from libMesh’s LinearImplicitSystem in order to perform truth finite element solves, and each class provides an appropriate implementation of the Greedy algorithm and the Construction and Evaluation operations. The second direction in which RBBase is extended is for the SCM: RBSCMSystem inherits from libMesh’s EigenSystem class in order to access the library’s SLEPc eigensolver interface, and is responsible for implementing the standard SCM algorithm from [32]. QNTransientSCMSystem extends RBSCMSystem to implement the modified SCM required for quadratically nonlinear time-dependent problems [22]. 3.1. Opportunities for parallelization in the Offline stage We exploit two key opportunities for parallelization in the Offline stage of the certified RB framework. First, we employ libMesh to parallelize the N dependent operations of the RB method mentioned at the beginning of this section. In addition, libMesh also handles the more “mundane” yet critical components required in the truth solve, such as the input and parallel partitioning of the computational mesh, and the output of truth solutions in formats recognized by popular visualization software. libMesh enables us to treat largescale truth discretizations because it employs domain decomposition with message passing and thread-based parallelism in order to accelerate matrix assembly and distribute memory consumption among multiple processors. libMesh

13

Figure 1: Class diagram for rbOOmit. Solid lines with open arrowheads represent C++ inheritance, with arrows pointing from child to parent class. Dashed lines represent specific instantiations of template classes. For example, the RBBase class is a template class, and we make use of two specific instantiations of it which are based on the LinearImplicitSystem and EigenSystem classes. Class names in italics are part of the LibMesh [29] finite element library. also leverages the parallel preconditioners and iterative solvers from PETSc and SLEPc to solve large-scale finite element systems efficiently. While Krylov subspace iterative solvers are sufficiently general and capable of handling the systems which must be solved in the RB method, in some cases it can be advantageous to employ a direct solver for the Riesz representation Poisson problems required to compute the Gi from (8). These Poisson solves involve a single system matrix and multiple right-hand sides: a prototypical scenario in which direct solvers can outperform their iterative cousins. We employ PETSc’s interface to the parallel sparse direct solver MUMPS [33] for this purpose, though for sufficiently large problems direct solvers may still become infeasible due to their large memory footprint. The other opportunity for parallelization is perhaps less obvious but equally lucrative in certain circumstances. We noted earlier that the Greedy strategy in which RB error bounds are evaluated on the set Ξtrain enables very large training sets because the evaluation of the RB error bounds is very “cheap.” Nevertheless, there are two situations in which the error bound sampling on Ξtrain tends to dominate the computational cost in the Offline stage. The first is problems with many parameters, in which — due to the “curse of dimensionality” — we need to choose very large training sets in order to obtain a reasonable coverage of D. The other situation is quadratically nonlinear problems in which error bound evaluation requires O(N 4 ) operations. In this scenario, when N 1, the RB solves required by the “arg max” operation may require a significant proportion of the overall runtime of the Offline stage. The key point to note, however, is that the “arg max” operation in the Greedy (and POD(t)–Greedy(µ)) algorithm is “embarrassingly parallel.” We 14

initialize our parallel implementation of the Greedy algorithm by partitioning the training set Ξtrain into disjoint subsets {Pi , i = 1, . . . , nproc } — where nproc denotes the number of processors. The parallel algorithm then proceeds with a local “arg max” operation performed simultaneously on all processors. Finally, an inexpensive parallel “arg max” is performed for the nproc local “winning” candidate parameters to find the global maximizer from Ξtrain . This simple strategy reduces the computation time for the ntrain RB solves by a factor of nproc . 4. Numerical Results We now consider two example problems that illustrate the RB techniques discussed above. Both problems are computationally very expensive in the Offline stage and would not be feasible without parallel computing resources. Nevertheless, the RB method allows us to generate reduced order models with rigorous error bounds that can be evaluated on a standard desktop or laptop computer in real-time or near real-time. All Offline computations were performed on the Ranger supercomputer (located at the Texas Advanced Computing Center), which has in total 62,976 cores, 123TB of memory, and a theoretical peak performance of 579 TFLOPs. 4.1. Transient Thermal Conduction We consider the heat equation — transient thermal conduction — in a three– dimensional “Swiss cheese” configuration. This is a generalization of the steadystate “thermal block” problem posed on the unit square in two spatial dimensions in [34]; here we consider a time-dependent formulation, a more complicated domain Ω and a much more expensive truth discretization. The domain Ω is given by the unit cube [0, 1]3 \ S, where S is a lattice of 27 spheres of radius 1/5 and respective centers [0, 1/2, 1] × [0, 1/2, 1]. (Note we directly consider the nondimensional version of the problem based on the standard conduction scalings.) We subdivide Ω into 27 subdomains, Ω = ∪27 j=1 Ωj , the Ωj are shown in Figure 2. The governing equations are most easily described in weak form: find u(t; κ) ∈ X for 0 ≤ t ≤ tf = 1 such that Z Ω

Z Z Z 26 X ∂u v+ κj ∇u · ∇v + ∇u · ∇v = v, ∂t Ωj Ω1 Γb j=1

∀v ∈ X,

(15)

where κj is the thermal conductivity of region Ωj relative to the thermal conductivity of region Ω1 (note we assume uniform specific heats and hence the κj may also be interpreted as scaled thermal diffusivities), X ≡ {v ∈ H 1 (Ω) | vΓt = 0}, Γb is the bottom boundary “z = 0” (on which we impose inhomogeneous flux boundary conditions), and Γt is the top boundary “z = 1” (on which we impose

15

Figure 2: Subdomains for the “Swiss cheese” problem. Each subdomain is associated with a distinct parametrized thermal conductivity.

zero Dirichet conditions). We impose homogeneous natural (zero) flux conditions on all other boundaries. We choose P = 26 parameters, µj = κj , 1 ≤ j ≤ 26, and specify the parameter domain D = [0.5, 2]26 . We consider one output, Z 1 u(t; µ), s(t; µ) = `(u(t; µ)) = |Ω1 | Ω1 the average temperature over Ω1 , where Ω1 is the intersection of Ω with the cube [0, 1/4]3 . We pursue a primal-dual approach for this LTI parabolic problem, where the dual problem runs “backwards” in time [16]. The weak formulation for the dual problem is: find z(t; κ) ∈ X for 0 ≤ t ≤ tf such that Z

26

∂z X − v + κj Ω ∂t j=1

Z

Z ∇v · ∇z +

Ωj

∇v · ∇z = 0,

∀v ∈ X,

(16)

Ω1

R with Ω v z(tf ) = `(v). We select a backward Euler temporal discretization with K = 100 time levels for both the primal and dual problems and employ a truth finite element approximation space with N = 241, 520 degrees of freedom. Both the primal and dual systems are implemented via the TransientRBSystem class discussed in Section 3, and the SCM is not required here due to the simple “parametrically coercive” [9] structure of (15) and (16). In Figure 4 we show the primal and dual truth solutions at t = 1 and t = 0, respectively, for the parameter value µ = (2, 2, 2, 2, 0.5, 2, 0.5, 0.5, 2, 2, 2, 2, 2, 2, 2, 2, 0.5, 2, 2, 2, 2, 2, 2, 2, 2, 2). 16

(17)

This parameter specifies the minimum thermal conductivity in Ω9 and the adjacent subdomains Ω6 , Ω8 , Ω18 , and hence we observe a “hot corner” in Figure 4a.

(a) Exterior view

(b) Interior cutaway view

Figure 3: Exterior (3a) and interior (3b) views of the “Swiss cheese” mesh used for the many-parameter heat transfer application. The mesh as shown in this figure is composed of linear tetrahedra and has 32,307 vertices and 168,926 elements. A once-uniformly-refined version of this grid which was used in the calculations has 241,520 vertices and 1,351,408 elements. All grids used in this work were created with the Gmsh [35] mesh generator. We employ the POD(t)-Greedy(µ) for both the primal and dual problems and set Nmax = 100 in both cases. To investigate the behavior of the Greedy in this “many-parameter” problem, we perform the Offline stage three times with log-randomly generated training sets of size ntrain = 102 , 104 , and 106 . The Offline burden due to the expensive truth and (in the ntrain = 106 case) large ntrain is considerable: both forms of parallelization from Section 3.1 are crucial. Here we used 512 processors of Ranger for the Offline computations, and the total Offline wall-clock time in the ntrain = 106 case was approximately 31 hours. The time spent evaluating RB error bounds for the ntrain = 106 calculation was roughly an hour (per processor). The embarrassingly parallel “arg max” operation discussed in Section 3.1 is therefore clearly essential, and in this case saved over 500 hours of computation time vs. an equivalent serial algorithm. In Fig. 5 we show the (normalized) POD(t)-Greedy(µ) convergence, εN ≡

∗ ∆K N (µ ) , kuN (tK ; µ∗ )kL2 (Ω)

∗ 2 where ∆K N (µ ) denotes the L (Ω)-norm RB error bound at time level K, for the primal RB space (the dual convergence is analogous and hence is omitted) with the three choices of ntrain . We observe that the POD(t)-Greedy(µ) convergence

17

(a)

(b)

Figure 4: Primal (4a) and dual (4b) truth solutions at t = 1 and t = 0, respectively, for the parameter in (17). reported in the ntrain = 102 case is overly optimistic due to the coarseness of the training set — here Ξtrain is a poor surrogate for D. However, the ntrain = 104 and 106 cases agree quite well which suggests that these larger training sets provide a reasonable surrogate for D. Of course, even the ntrain = 106 training set is extremely sparse in the 26-dimensional parameter domain (e.g. a tensor product grid with only two points in each parameter direction would have 226 ≈ 6.7 × 107 points) but the POD(t)-Greedy(µ) nevertheless performs well due to the highly smooth “parametrically coercive” [9] nature of this problem. Many-parameter model reduction is a challenging problem and therefore, in order to gain more insight into the nature of our RB approximation spaces, it is interesting to examine the distribution of the parameter values selected by the greedy algorithm. We consider the primal POD(t)-Greedy(µ) calculation with ntrain = 106 (recall that Ξtrain is log-uniform randomly distributed) and we generate individual histograms for each of the 26 parameters, µj , 1 ≤ j ≤ 26, based on the 100 greedily-selected µ values. We specify nbins equally-sized bins within the range [µmin , µmax ] = [0.5, 2] and let µj (i), 1 ≤ i ≤ nbins denote the number of occurrences of µj within bin i. To simplify the presentation we then generate a histogram for the “composite” variable hµi, such that 26

hµi(i) ≡

1 X µj (i) 26 j=1

1 ≤ i ≤ nbins .

(18)

This composite histogram for nbins = 10 is given in Fig. 6, where it is compared to a histogram for a log-uniform random sample of data points over the same parameter range. The greedily-selected parameter values are biased towards the endpoints of the µ domain relative to the log-uniform random sample. The tendency of the Greedy algorithm to select parameter values near the endpoints may also be related to the endpoint clustering seen in classical interpolation theory, for example the optimal approximation properties provided by Cheby18

ntrain=1e+02 1e+04 1e+06

log10 (εN)

0

-0.5

-1

-1.5

10

20

30

40

50

60

70

80

90

100

N

Figure 5: Greedy convergence for the primal basis for three different choices of ntrain . shev points in one dimension [36]. In contrast to the Chebyshev-like bias of the Greedy in the ntrain = 106 case, we note that in the ntrain = 102 case each parameter in the training set is selected exactly once (recall that Nmax = 100 also) so that the Greedy reduces to log-random parameter selection. In addition to the statistical distribution of the µ values, it is also instructive to look at the different spatial configurations of the parameters selected by the greedy algorithm, in particular the jumps in thermal diffusivity between adjacent subregions. We define the average jump for a given µ vector as 1 X hδµi ≡ |µi − µj | (19) |I| (i,j)∈I

where I is the set of all ordered pairs (i, j) for which i < j and regions Ωi and Ωj share a common surface. For the 3 × 3 × 3 configuration considered in the “Swiss cheese” problem (see Fig. 2) there are 54 unique ordered pairs in I. A histogram of hδµi values computed for the ntrain = 106 case is given in Fig. 7, where it is compared to the jumps computed for a large log-uniform random sample of data. The jumps in the greedily-selected dataset are, on average, larger than those from the log-uniform random sample. The greedy sample size is rather small, which makes it difficult to draw firm conclusions, but it seems likely that the greedy algorithm detects higher error in µ configurations with larger diffusivity jumps, thereby preferentially selecting them for inclusion in the reduced basis. We now proceed to the Online stage. In order to test the accuracy of the primal-dual RB approximation, we perform RB solves on a 5000 point lograndom test set, Ξtest , for the RB spaces generated by the ntrain = 102 and 19

20 Number of occurrences

Number of occurrences

20

15

10

5

15

10

5

0

0 0.6

0.8

1

1.2

1.4

1.6

1.8

2

0.6

(a) Greedy-selected

0.8

1

1.2

1.4

1.6

1.8

2

(b) Log-uniform random

Figure 6: Histograms of averaged greedy-selected (6a) and log-uniform randomly selected (6b) parameter values for the “Swiss cheese” problem with ntrain = 106 . The log-uniform histogram was created from a large set of sample data points and scaled to match the averaged greedy-selected hµi values. The bias of the greedy-selected values towards the endpoints 0.5 and 2 of the parameter domain relative to the log-uniform sample is evident. ntrain = 106 training sets. We compute the maximum relative output error test bound over Ξtest , ∆s,rel ), defined as max (Ξ test ∆s,rel ) ≡ max max (Ξ test µ∈Ξ

∆sN (tf ; µ) , sN (tf ; µ) − ∆sN (tf ; µ)

(20)

where ∆sN (tf ; µ) (resp. sN (tf ; µ)) denotes the primal-dual output bound (resp. dual-corrected RB output) at the final time, tf . Note that the denominator in (20) is a “worst-case” surrogate for the truth output sN (tf ; µ). In Table 2 we present results for N = 25, 50, 75, and 100 along with corresponding Online computation times for a single primal-dual RB solve.3 The data clearly indicate the trade-off between RB accuracy and Online cost. It is perhaps surprising that the RB accuracy in Table 2 is only mildly affected by the size of the training set. On reflection, however, this is not unexpected. Due to the sparsity of the ntrain = 102 training set in the 26dimensional parameter domain D, we expect very little “redundancy” among the ntrain = 102 parameter values. On the other hand, we note that the role of the Greedy algorithm is to compress redundancy within Ξtrain , but if the points in the ntrain = 102 training set are already maximally independent there is little benefit in employing the Greedy algorithm. This argument suggests that a preferable approach for this many-parameter problem is to employ the training set refinement procedure from [34] (an adaptive training set refinement strategy is also considered in [37]), since there is evidently little benefit in employing a richer training set before the RB approximation has reached some threshold level of accuracy. However, we temper the latter point by emphasizing that 3 Computation

times are averaged over Ξtest and are from an AMD Opteron 2382 processor.

20

25

20

20 Number of occurrences

Number of occurrences

25

15

10

5

15

10

5

0

0 0.35

0.45

0.55

0.65

0.75

0.35

(a) Greedy-selected

0.45

0.55

0.65

0.75

(b) Log-uniform random

Figure 7: Histograms of average jumps hδµi from (19) in spatially-adjacent greedy-selected (7a) and log-uniform randomly selected (7b) parameter values for the “Swiss cheese” problem with ntrain = 106 . The log-uniform histogram was created from a large set of sample data points and scaled to match the averaged greedy-selected hδµi values. N ntrain 25 50 75 100

test ∆s,rel ) max (Ξ 2 = 10 ntrain = 106

1.13 × 10−1 2.73 × 10−2 1.03 × 10−2 5.22 × 10−3

1.16 × 10−1 2.44 × 10−2 1.05 × 10−2 4.34 × 10−3

Online time (seconds) 3.48 × 10−2 1.01 × 10−1 2.10 × 10−1 2.74 × 10−1

Table 2: Relative output error bounds over the test set Ξtest containing 5000 log-randomly selected points and corresponding Online evaluation timings. these conclusions are specific to highly smooth PDEs such as our heat equation example; PDEs that are less smooth in parameter generally require a more thorough exploration of D for effective model reduction. Nevertheless, our primary motivation for considering training sets of fixed size here is to shed further light on the behavior of the standard Greedy algorithm in many-parameter problems, e.g. through Figures 6 and 7; though we also note that a training set enrichment scheme is easily accommodated within our RB implementation. Finally, we note from Table 2 that for the ntrain = 106 primal-dual RB spaces with N = 100 we obtain an Online output approximation with relative error bound less than 0.5% in 0.274 seconds — compared to approximately 200 seconds for a truth solve on 512 processors of Ranger. 4.2. Unsteady Boussinesq Equations We next consider a three-dimensional version of the unsteady natural convection problem from [38]. The nondimensional, coupled conservation of mo-

21

mentum, energy, and mass equations for this problem are: √ ∂V 1 V · ∇V + ∇ · V V + Gr Pr∇P + √ ∂t 2 Gr Pr √ − ∇2 V + Gr PrT g ˆ=0

(21)

1 1 2 ∂T + √ V · ∇T + ∇ · T V − ∇ T =0 ∂t Pr 2 Gr Pr

(22)

∇·V =0

(23)

where V ≡ (V1 , V2 , V3 ) is the nondimensional velocity vector, P is the nondimensional pressure, T is the nondimensional temperature, and g ˆ ≡ (− sin φ, 0, − cos φ) is a unit vector in the direction of gravity. Gr is the Grashof number, a measure of the strength of buoyant effects relative to thermal diffusion, and Pr = 0.71 is the Prandtl number (we assume the fluid is air). Additional details of the nondimensionalization can be found in [38]. The nondimensional domain is Ω = (0, 5)3 \P where P is the pillar (2.4, 2.6)× (1.5, 3.5) × (0, 1). We solve for the field variables V , P , and T throughout Ω. The “roof” of the cavity is maintained at temperature T = 0, the sides and base of the cavity are perfectly thermally insulated, and the top and sides of the pillar are subject to a uniform heat flux of magnitude Gr. We impose no-slip velocity conditions on all boundaries. The truth weak formulation for (21)–(23) requires only minor modifications from the two-dimensional case in [38], hence we omit it here for the sake of brevity.4 We consider the time interval t ∈ [0, 0.16], and employ a Crank-Nicolson temporal discretization with K = 100 timesteps. The threedimensional finite element mesh is composed of quadratic tetrahedral elements and has 71,279 nodes and 50,574 elements. Cutaway views showing the internal details of the tetrahedralization are given in Fig. 10; the total number of degrees of freedom is N = 294, 502. The RB and SCM implementations in this case were based on the classes QNTransientRBSystem and QNTransientSCMSystem, respectively. We consider a two-tuple parameter µ ≡ (µ1 , µ2 ) ≡ (Gr, φ) ∈ D ≡ [4000, 6000]× [0, 0.2]. Our goal is to study the parametric dependence of the temperature in regions near the top of the heated pillar in the presence of natural convection. 4 Note that we employ skew-symmetric convection operators in (21) and (22) to recover discrete stability properties and simplify RB error bound derivation.

22

Therefore, we use the following L2 (Ω)-bounded functionals of T Z 1 sn (t; µ) = `n (T (t; µ), µ) = T (t; µ) ; µ1 |Dn | Dn

(24)

as outputs, where D1 = [2.2, 2.4] × [1.5, 3.5] × [1, 1.1], D2 = [2.4, 2.6] × [1.5, 3.5] × [1, 1.1], and D3 = [2.6, 2.8]×[1.5, 3.5]×[1, 1.1] are three small rectangular prisms above the pillar. We denote the RB approximations to the outputs from (24) as sN,n (t; µ), n = 1, 2, 3. A cross-section of the domain geometry and output regions is given in Figure 8, cross-sections of the truth solution at t = 0.16 for (Gr, φ) = (6000, 0.2) are shown in Figure 11.

φ g 5 D1

1/5 1

D2

D3 Ω

5

Figure 8: Cross-section at y = 2.5 of the computational domain; note that Ω does not include the pillar which is shaded in red. Cross-sections of the output regions D1 , D2 and D3 are also indicated. We construct a (primal-only) RB space via the POD(t)-Greedy(µ) algorithm.5 We specify = 9 × 10−4 for the target tolerance and δN = 3 for the number of POD modes to include at each greedy iteration; we further specify a 32 × 32 tensor product train sample Ξtrain of uniformly spaced points in D. The tolerance is satisfied for Nmax = 90 and the Greedy convergence is shown in Fig. 9. Note that in the quadratically nonlinear case we report a nominal error bound during the POD(t)-Greedy(µ) since the SCM cannot be performed until after the RB space has been generated [22]. Nevertheless, this does not compromise the rigor of the RB error bounds in the Online stage. All Offline calculations are performed on 128 processors on Ranger. The total time required for the entire Offline stage, including the SCM, is significant: about 42 hours. The “embarrassingly parallel” solves over Ξtrain during the greedy algorithm require approximately 24 total minutes in the calculation — a small percentage 5 A primal-dual formulation is not appealing in this case since (i) the beneficial effect on output error bounds would be limited due to exponential error bound growth, and (ii) we would need a dual system for each output.

23

of the overall Offline runtime. It is important to note, however, that had these solves not been parallelized in the manner discussed in this paper, the total wall-clock time for this stage of the calculation would have been approximately 128 × 24 minutes, or just over 51 hours.

1

0.5

log10 (εN)

0

-0.5

-1

-1.5

-2

-2.5

-3 10

20

30

40

50

60

70

80

90

N

Figure 9: Greedy convergence for the 3D Boussinesq pillar. In Figure 12, we show RB output approximations and corresponding RB error bounds obtained with N = 90 to the sn (t; µ) for parameter values (Gr, φ) = (4000, 0), (5000, 0.1), and (6000, 0.2). We note that, as expected, the “left” and “right” outputs (s1 and s3 , respectively) coincide when φ = 0 and their separation increases as φ increases. The Online computation time with N = 90 is 20.6 seconds on an AMD Opteron 2382 processor, which is significantly longer than in the “Swiss cheese” example due to the O(N 4 ) complexity of the error bounds in the quadratically nonlinear case. Nevertheless, this still represents a speedup factor of approximately 63 with respect to a single truth solve on Ranger, which requires 21.7 minutes on 128 processors. Acknowledgements We would like to thank Prof. Anthony Patera and Dr. Phuong Huynh of MIT, and Jens Eftang of NTNU, for numerous helpful discussions related to this work. This research was supported by AFOSR Grant No. FA9550-07-1-0425, OSD/AFOSR Grant No. FA9550-09-1-0613, and also by the National Science Foundation through TeraGrid resources provided by TACC under Grant No. TG-ASC100016.

24

(a) x = 2.5

(b) y = 2.5

Figure 10: x (10a) and y (10b) interior cutaway views of the mesh used in the three-dimensional unsteady Boussinesq calculations. The output regions, which sit just above the heated “pillar,” are meshed exactly. The pillar itself is a rectangular cavity at the base of the domain and is not meshed. The mesh is composed of quadratic tetrahedral elements and has 71,279 nodes and 50,574 elements. References [1] Z. J. Bai, Krylov subspace techniques for reduced-order modeling of largescale dynamical systems, Applied Numerical Mathematics 43 (1-2) (2002) 9–44. [2] J. Chen, S.-M. Kang, Model-order reduction of nonlinear MEMS devices through arclength-based Karhunen-Loéve decomposition, in: Proceeding of the IEEE international Symposium on Circuits and Systems, Vol. 2, 2001, pp. 457–460. [3] M. D. Gunzburger, J. Peterson, J. N. Shadid, Reduced-order modeling of time-dependent PDEs with multiple parameters in the boundary data, Comp. Meth. Appl. Mech. and Eng. 196 (2007) 1030–1047. [4] M. Meyer, H. G. Matthies, Efficient model reduction in non-linear dynamics using the Karhunen-Loève expansion and dual-weighted-residual methods, Computational Mechanics 31 (1-2) (2003) 179–191. [5] T. A. Porsching, M. Y. L. Lee, The reduced-basis method for initial value problems, SIAM Journal of Numerical Analysis 24 (1987) 1277–1287. [6] J. P. Fink, W. C. Rheinboldt, On the error behavior of the reduced basis technique for nonlinear finite element approximations, Z. Angew. Math. Mech. 63 (1) (1983) 21–28. [7] B. O. Almroth, P. Stern, F. A. Brogan, Automatic choice of global shape functions in structural analysis, AIAA Journal 16 (1978) 525–528. 25

(a) x = 2.5

(b) y = 2.5

Figure 11: x (11a) and y (11b) interior cutplane views of the truth solution for (Gr, φ) = (6000, 0.2). Color contours show the temperature field and the overlaid velocity vectors indicate the circulatory nature of the flow. [8] A. K. Noor, Recent advances in reduction methods for nonlinear problems, Comput. Struct. 13 (1981) 31–44. [9] G. Rozza, D. B. P. Huynh, A. T. Patera, Reduced basis approximation and a posteriori error estimation for affinely parametrized elliptic coercive partial differential equations, Archives Computational Methods in Engineering 15 (3) (2007) 229–275, http://dx.doi.org/10.1007/ s11831-008-9019-9. [10] D. B. P. Huynh, D. J. Knezevic, J. W. Peterson, A. T. Patera, High-fidelity real-time simulation on deployed platforms, submitted to Computers and Fluids. [11] M. Barrault, Y.Maday, N. C. Nguyen, A. T. Patera, An “Empirical Interpolation” Method: Application to Efficient Reduced-Basis Discretization of Partial Differential Equations, C. R. Acad. Sci. Paris, Série I. 339 (2004) 667–672. [12] J. L. Eftang, M. A. Grepl, A. T. Patera, A posteriori error bounds for the empirical interpolation method, C.R. Acad. Sci. Paris, Série I. Submitted. [13] C. Prud’homme, D. V. Rovas, K. Veroy, L. Machiels, Y. Maday, A. T. Patera, G. Turinici, Reliable real-time solution of parametrized partial differential equations: Reduced-basis output bound methods, J Fluids Engineering 124 (2002) 70–80. [14] D. B. P. Huynh, G. Rozza, S. Sen, A. T. Patera, A successive constraint linear optimization method for lower bounds of parametric coercivity and inf-sup stability constants, CR Acad Sci Paris Series I 345 (2007) 473–478.

26

[15] B. Haasdonk, M. Ohlberger, Reduced basis method for finite volume approximations of parametrized evolution equations, M2AN (Math. Model. Numer. Anal.) 42 (2) (2008) 277–302. [16] M. A. Grepl, A. T. Patera, A Posteriori error bounds for reduced-basis approximations of parametrized parabolic partial differential equations, M2AN (Math. Model. Numer. Anal.) 39 (1) (2005) 157–181. [17] D. J. Knezevic, A. T. Patera, A certified reduced basis method for the Fokker-Planck equation of dilute polymeric fluids: FENE dumbbells in extensional flow, SIAM Journal on Scientific Computing 32 (2) (2010) 793– 817, http://dx.doi.org/10.1137/090759239. [18] L. Sirovich, Turbulence and the dynamics of coherent structures. I. Coherent structures, Quart. Appl. Math. 45 (3) (1987) 561–571. [19] N. Pierce, M. B. Giles, Adjoint recovery of superconvergent functionals from PDE approximations, SIAM Review 42 (2) (2000) 247–264. [20] K. Veroy, C. Prud’homme, A. T. Patera, Reduced-basis approximation of the viscous burgers equation: Rigorous a posteriori error bounds, CR Acad Sci Paris Series I 337 (2003) 619–624. [21] K. Veroy, A. T. Patera, Certified real-time solution of the parametrized steady incompressible Navier-Stokes equations; Rigorous reduced-basis a posteriori error bounds, International Journal for Numerical Methods in Fluids 47 (2005) 773–788. [22] N. C. Nguyen, G. Rozza, A. T. Patera, Reduced basis approximation and a posteriori error estimation for the time-dependent viscous Burgers’ equation, Calcolo 46 (3) (2009) 157–185, http://dx.doi.org/10.1007/ s10092-009-0005-x. [23] M. A. Grepl, Y. Maday, N. C. Nguyen, A. T. Patera, Efficient reduced-basis treatment of nonaffine and nonlinear partial differential equations, M2AN (Math. Model. Numer. Anal.) 41 (2007) 575–605. [24] N. C. Nguyen, J. Peraire, An effcient reduced-order modeling approach for nonlinear parametrized partial differential equations, International Journal of Numerical Methods in Engineering 76 (2008) 27–55. [25] D. Amsallem, C. Farhat, Interpolation method for adapting reduced-order models and application to aeroelasticity, AIAA Journal 46 (7) (2008) 1803– 1813. [26] D. Amsallem, J. Cortial, C. Farhat, On-demand CFD-based aeroelastic predictions using a database of reduced-order bases and models, in: 47th AIAA Aerospace Sciences Meeting, 2009.

27

[27] J. L. Eftang, A. T. Patera, E. M. Rønquist, An hp Certified Reduced Basis Method for Parametrized Elliptic Partial Differential Equations, Submitted to SISC (2009). [28] J. L. Eftang, D. J. Knezevic, A. T. Patera, E. M. Rønquist, An hp certified reduced basis method for parametrized time-dependent partial differential equations, Submitted to MCMDS (2010). [29] B. S. Kirk, J. W. Peterson, R. M. Stogner, G. F. Carey, libMesh: A C++ library for parallel adaptive mesh refinement/coarsening simulations, Engineering with Computers 23 (3–4) (2006) 237–254. [30] S. Balay, V. Eijkhout, W. D. Gropp, L. C. McInnes, B. F. Smith, Efficient management of parallelism in object oriented numerical software libraries, in: E. Arge, A. M. Bruaset, H. P. Langtangen (Eds.), Modern Software Tools in Scientific Computing, Birkhäuser Press, 1997, pp. 163–202. [31] V. Hernandez, J. E. Roman, V. Vidal, SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems, ACM Trans. Math. Softw. 31 (3) (2005) 351–362, http://doi.acm.org/10.1145/1089014.1089019. [32] D. B. P. Huynh, G. Rozza, S. Sen, A. T. Patera, A successive constraint linear optimization method for lower bounds of parametric coercivity and infsup stability constants, C. R. Acad. Sci. Paris, Analyse Numérique 345 (8) (2007) 473–478. [33] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, Multifrontal parallel distributed symmetric and unsymmetric solvers, Computer Methods in Applied Mechanics and Engineering 184 (2000) 501–520. [34] S. Sen, Reduced basis approximation and a posteriori error estimation for many-parameter heat conduction problems, Numerical Heat Transfer, Part B: Fundamentals 54 (5) (2008) 369–389. [35] C. Geuzaine, J.-F. Remacle, Gmsh: A 3-D finite element mesh generator with built-in pre- and post-processing facilities, International Journal for Numerical Methods in Engineering 79 (11) (2009) 1309–1331, http://dx. doi.org/10.1002/nme.2579. [36] G. W. Stewart, Afternotes on Numerical Analysis, SIAM, Philadelphia, PA, 1996. [37] B. Haasdonk, M. Dihlmann, M. Ohlberger, A training set and multiple bases generation approach for parametrized model reduction based on adaptive grids in parameter space, Simtech preprint 2010-28, University of Stuttgart, submitted to MCMDS (April 2010). [38] D. J. Knezevic, N. C. Nguyen, A. T. Patera, Reduced basis approximation and a posteriori error estimation for the parametrized unsteady Boussinesq equations, PREPRINT, Submitted to Mathematical 28

Models and Methods in Applied Sciences. http://augustine.mit.edu/ methodology/papers/atp_M3AS_preprint_Nov09.pdf.

29

0.3

0.32

0.25

0.3

0.2

0.28

0.15

0.26

0.1

0.24

0.05

0.22

0 0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.2 0.1

0.16

(a) (Gr, φ) = (4000, 0)

0.11

0.12

0.13

0.14

0.15

0.16

(b) (4000, 0), Zoom near tf

0.3

0.32

0.25

0.3

0.2

0.28

0.15

0.26

0.1

0.24

0.05

0.22

0 0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.2 0.1

0.16

(c) (Gr, φ) = (5000, 0.1)

0.11

0.12

0.13

0.14

0.15

0.16

(d) (5000, 0.1), Zoom near tf 0.3

0.25

0.28

0.2

0.26

0.15

0.24

0.1

0.22

0.05

0.2

0 0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.18 0.1

0.16

(e) (Gr, φ) = (6000, 0.2)

0.11

0.12

0.13

0.14

0.15

0.16

(f) (6000, 0.2), Zoom near tf

Figure 12: The RB outputs sN,1 (tk ; µ) (blue N), sN,2 (tk ; µ) (red ), sN,3 (tk ; µ) (black ×), and associated error bounds (solid lines) as functions of time for three values of µ.

30

A High-Performance Parallel Implementation of the ...

A High-Performance Parallel Implementation of the ...

Suggest Documents

Synthesis of a unique highperformance

A PARALLEL IMPLEMENTATION OF THE COVARIANCE MATRIX ...

A PARALLEL IMPLEMENTATION OF THE COVARIANCE MATRIX ...

Highperformance work system implementation in small and medium ...

cudaBayesreg: Parallel Implementation of a Bayesian Multilevel ...

cudaBayesreg: Parallel Implementation of a Bayesian Multilevel ...

Implementation of a Massively Parallel Dynamic ...

Implementation of parallel tridiagonal solvers for a

Parallel implementation of a central decomposition ... - CiteSeerX

Parallel Implementation of a Data-Transpose ...

A Parallel Implementation of blockMesh for Quick

Parallel and GRID Implementation of a Large

A Parallel Implementation of Molecular Packing using

A Parallel Implementation of Singular Value

A Parallel implementation of Gram-Schmidt Algorithm

A parallel implementation of the LTSn method for a ... - CiteSeerX

HighPerformance Polybenzoxazine Nanocomposites Containing

Performance Evaluation of Parallel Implementation

Parallel Implementation of the Unified Flow Solver

Development of parallel implementation for the dendritic

Parallel Implementation of the Gauss-Seidel

Parallel Implementation of the Discontinuous Galerkin ... - CiteSeerX

High-speed Parallel Software Implementation of the

Parallel Implementation of the PHOENIX Generalized Stellar ...