Calibrating a large computer experiment simulating radiative shock hydrodynamics
Robert B. Gramacy Booth School of Business The University of Chicago
[email protected]
Derek Bingham Dept. of Statistics & Actuarial Science Simon Fraser University
[email protected]
James Paul Holloway, Michael J. Grosskopf, Carolyn C. Kuranz, Erica Rutter, Matt Trantham and Paul R. Drake Center for Radiative Shock Hydrodynamics University of Michigan {hagar,mikegros,ckuranz,ruttere,mtrantha,rpdrake}@umich.edu
Abstract

We consider adapting a canonical computer model calibration apparatus, involving coupled Gaussian process (GP) emulators, to a computer experiment simulating radiative shock hydrodynamics that is orders of magnitude larger than what can typically be accommodated. The conventional approach calls for thousands of large matrix inverses to evaluate the likelihood in an MCMC scheme. Our approach replaces that tedious idealism with a thrifty take on essential ingredients, synergizing three modern ideas in emulation, calibration, and optimization: local approximate GP regression, modularization, and mesh adaptive direct search. The new methodology is motivated both by necessity—considering our particular application—and by recent trends in the supercomputer simulation literature. A synthetic data application allows us to explore the merits of several variations in a controlled environment and, together with results on our motivating real data experiment, leads to noteworthy insights into the dynamics of radiative shocks as well as the limitations of the calibration enterprise generally.

Key words: emulator, tuning, nonparametric regression, big data, local Gaussian process, mesh adaptive direct search (MADS), modularization
1 Introduction
The rapid increase in computational power in recent decades has made computer models (or simulators) commonplace as a way to explore complex physical systems. In many settings, the aim for scientists is to perform model calibration: based on a limited set of field observations, estimate parameters of the computer model to best match simulation with reality and, consequently, build an accurate predictor for the field.
In our work, we are interested in model calibration for a specific type of shock wave. Radiative shocks occur when shock waves become fast and hot enough that radiation from the shocked matter dominates the energy transport. This leads to shocks that are challenging to simulate because both hydrodynamics and radiation transport are required to describe the physics. The main aim of the University of Michigan's Center for Radiative Shock Hydrodynamics (CRASH) is to develop and assess predictive capability for a radiative shock system using computer simulations of, and field experiments on, high-energy laser systems. A deterministic computer simulator for a radiative shock experiment was constructed—the CRASH code. The output of the computational model consists of a large space-time field that describes the evolution of the shock given specified initial conditions (the inputs to the code). From this output, features of the shock are extracted (e.g., the location of the shock in 0.5 nano-second intervals). A more detailed discussion of the application is given in Section 6. For the study conducted here, we have roughly 27K input-output combinations from simulations. In addition, a series of experiments was performed at the Omega laser facility at the University of Rochester (Boehly et al., 1997). Complicating matters somewhat is that some of the inputs to the computational model have values that are not known in the physical system. Thus, their values must be inferred. From a statistical standpoint the primary goal is to make predictions of features of radiative shocks in an unsampled regime with associated uncertainty and to identify the "best" values for the unknown inputs for effective prediction. In other words, model calibration is what is needed. Kennedy and O'Hagan (2001) were the first to propose a statistical framework for combining simulation output and field observations for model calibration. They propose a hierarchical model linking noisy field measurements from the physical system to the potentially biased output of the computer model at a "true" but unknown value of the calibration parameter(s). They specify an inferential framework for estimating calibration parameter(s), bias, and noise level using data from both processes. The backbone of the framework is a pair of interlocking Gaussian process (GP) priors: for (a) the simulator output and (b) bias. The hierarchical nature of the model, paired with Bayesian posterior inference, allows both data sources (simulated and field) to contribute to joint inference for all unknowns. The GP is a popular prior for deterministic computer model output (Sacks et al., 1989). In that context they are referred to as surrogate models or emulators for the simulator, where the conditional distribution, derived from (posterior) inference given training data, leads to accurate predictions of computer model output at new input locations. However, computational limitations arise as GP inference and prediction can require many expensive matrix decompositions.
Consequently, new methodology for model calibration is required when there are moderate to large numbers of computer model trials, as is increasingly common in the simulation literature (e.g., Kaufman et al., 2012; Paciorek et al., 2013). With our motivating radiative shock experiment in mind, our proposed method is designed for the following setting: (1) the number of computer model runs is too large for full GP emulation, so we deploy a sparse method; and (2) the number of computer model evaluations dwarfs the number of field observations. To deal with the computational challenges in this setting we propose three significant differences from typical model calibration. First,
we modularize the model fitting (Liu et al., 2009) and construct the emulator using only the simulator outputs, i.e., ignoring the information from field data at that stage. Unlike Liu et al., who argued for modularization on philosophical grounds, we do this for purely computational reasons. Our application involves so much computer simulation data that the relative potential contribution of field data to the emulation is negligible. Second, we insert a local approximate GP (Gramacy and Apley, 2014) in place of the traditional GP emulator. We argue that the locality of the approximation is particularly handy in the calibration context, which only requires predictions at a small handful of field data sites. Finally, we illustrate how mesh adaptive direct search (Audet and Dennis, Jr., 2006)—acting as glue between the computer model, bias, and noisy field data observations—can quickly find good settings of the tuning/calibration parameters and, as a byproduct, provide enough useful distributional information to replace an expensive sampling. The remainder of the paper is outlined as follows. Sections 2 and 3 outline the ingredients for large scale computer model calibration, while the actual recipe (and secret sauce) is given in Section 4, together with illustrations on synthetic data in Section 5. We return to the motivating example in Section 6 for an analysis in several variations. Section 7 concludes with a discussion.
2 Background
We consider data comprised of runs of a computer model M at a large set of inputs designed by a space-filling criterion (e.g., Johnson et al., 1990), and a much smaller number of observations from a physical or field experiment F, which may follow some statistical design while respecting limitations imposed by the experimental apparatus. We assume that the runs of M are deterministic, and that its input space fully contains that of F. There are two types of input to the simulator: (i) design variables, which can be adjusted, or at least measured, in the physical system; and (ii) calibration or tuning parameters, whose values are required to simulate the system, but are unknown in the field. Denote the design variables by the vector x, and let u be the vector of calibration parameters required to run the computer model (i.e., which takes inputs (x, u)). Our primary goal is predicting the result of new field data experiments, via M, which in turn means finding a good u.
2.1 Fully Bayesian calibration and a modularized approach
Kennedy and O'Hagan (2001, hereafter KOH) proposed a Bayesian framework for coupling M and F. Let y^F_j(x) denote the jth replication of the field run at x, and y^M(x, u) denote the (deterministic) output of a computer model run. KOH represent the real mean process y^R as the computer model output at the best setting of the tuning parameters, u*, plus a bias term acknowledging that there can be systematic discrepancies between the computer model and the underlying mean of the physical process. In symbols, the mean of the physical process
is $y^R(x) = y^M(x, u^*) + b(x)$. The field observations connect reality with data:

$$y^F_j(x) = y^R(x) + \varepsilon_{xj} = y^M(x, u^*) + b(x) + \varepsilon_{xj}, \qquad \varepsilon_{xj} \stackrel{\mathrm{iid}}{\sim} N(0, \sigma_\varepsilon^2), \quad j = 1, \dots, n_x. \quad (1)$$
The unknown parameters are u∗ , σε2 , and the bias b(·). KOH propose a Gaussian process (GP) prior for b(·), which we review in detail in the following sub-section. If evaluating the computer model is fast, then inference is made rather straightforward using residuals between computer model outputs and field observations, yjF (x) − y M (x, u), which can be computed at will for any (x, u) (Higdon et al., 2004). However, running the computer model is usually time consuming. In that case it is useful to use an emulator or surrogate model to stand in for y M (·, ·), which is a type of regression model yˆM (·, ·) fit to code outputs obtained on a computer experiment design of NM locations (x, u). KOH recommend a GP prior for y M , however rather than performing inference for y M separately, using just the NM runs, as is typical of a computer experiment in isolation (e.g., Morris et al., 1993), they recommend inference joint with b(·), u, and σε2 using both field observations and runs of the computer model. From a Bayesian perspective this is the coherent thing to do: infer all unknowns jointly given all data. It is also practical when the computer model is very slow, giving small NM . In that setup, the field data can be informative for emulation of y M (·, ·), especially when the bias b(·) is very small/zero and/or easy to estimate. However, the computation required for inference in this setup is fraught with challenges, especially in the fully Bayesian formulation recommended by KOH. The coupled b(·) and y M (·, ·) lead to parameter identification and MCMC mixing issues. Moreover, GP regression demands substantial computational effort when applied in isolation [Section 3.2], which can be compounded when coupled.
3 Elements of approximation
In this section, we review the main components of our contribution, motivated by difficulties associated with using GPs in the coupled hierarchical model of KOH.
3.1 Modularization
Some computer modelers prefer not to allow the field data to be "contaminated" by the imperfect computer model or emulator, especially when the computer model is under ongoing development (as is typical), and when one of the goals of the calibration exercise is to use estimates of the bias b(·) to guide development. Joseph (2006) and Santner et al. (2003) provide a pedagogical example where the fully Bayesian approach unproductively confounds emulator uncertainty with bias discrepancy. The example is reproduced by Liu et al. (2009), who argue for a more modular analysis in many Bayesian contexts generally, although in particular for calibration. They propose fitting the emulator ŷ^M(·, ·) independently, using only the outputs of runs at a design of N_M inputs (x, u). In this sense they "modularized" the
calibration endeavor, separating emulator-building from estimating bias. Whether you view modularization as an approximation to the full KOH framework, or as a different modeling philosophy, is simply a matter of perspective. In calibration pursuits, due to the relative costs, the number of computer model runs increasingly dwarfs the data available from the field, i.e., N_M >> N_F, making it unlikely that field data would substantively enhance the quality of the emulator, leaving only the risk that joint inference with the bias will obfuscate traditional computer model diagnostics, and possibly stunt their re-development or refinement. In many applications, such as in our motivating example, the data available for building an emulator (N_M) is too large to accommodate traditional GP inferential methodology, (fully) Bayesian or otherwise. In that case, modularization is a practical necessity, as are other shortcuts discussed momentarily.
3.2 Gaussian process emulation and sparse/local approximation
Gaussian process (GP) regression is canonical for emulating computer experiments (Santner et al., 2003). The reasons are many, including a nonlinear nonparametric fit, an easily differentiated log likelihood, and closed-form conditional distributions for prediction, which can interpolate if desired, as well as excellent prediction interval coverage. Technically, the GP is a prior over functions between x ∈ R^p and Y ∈ R such that any finite collection of Y-values is multivariate normal (MVN). Therefore, it is defined by a mean vector µ and covariance matrix Σ, whose values may be specified in terms of auxiliary parameters and x-values. Homoskedasticity and stationarity are common simplifying assumptions in the computer experiments literature, whereby Σ = σ²K for scalar σ², constant for all x, and correlations in K defined only in terms of distances x − x'. Generating realizations (Y's) given inputs (x's) involves evaluating µ, evaluating and then decomposing Σ, and sampling from an MVN. Performing GP regression requires applying the same logic, conditionally on data $D_N = (X_N, Y_N) = ([x_1^\top, \dots, x_N^\top]^\top, [y_1, \dots, y_N]^\top)$. Given values of the parameters, the predictive distribution for Y(x) at new x's is directly available from MVN conditionals. Integrating out σ² under a reference prior (see, e.g., Gramacy and Polson, 2011) yields a Student-t with mean and scale
$$\mu(x \mid D_N, K_N) = k^\top(x) K_N^{-1} Y_N, \quad (2)$$

$$\sigma^2(x \mid D_N, K_N) = \frac{\psi\,[K(x,x) - k^\top(x) K_N^{-1} k(x)]}{N}, \quad (3)$$
and N degrees of freedom, where $k^\top(x)$ is the N-vector whose ith component is $K_\theta(x, x_i)$, defining the correlation function given parameters $\theta$; $K_N$ is an $N \times N$ matrix whose entries are $K_\theta(x_i, x_j)$; and $\psi = Y_N^\top K_N^{-1} Y_N$. Inference for the parameters $\theta$ of the correlation function $K_\theta(\cdot, \cdot)$ can proceed via (e.g., Newton schemes based on derivatives of) the likelihood

$$p(Y \mid K_N) = \frac{\Gamma[N/2]}{(2\pi)^{N/2} |K_N|^{1/2}} \left(\frac{\psi}{2}\right)^{-\frac{N}{2}}. \quad (4)$$
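For concreteness, Eqs. (2)-(4) can be evaluated in a few lines of R. The sketch below assumes an isotropic Gaussian correlation with a single lengthscale theta and K(x, x) = 1; it is illustrative only, and is not the implementation relied upon later (which uses the laGP package).

gp.predict <- function(x, X, Y, theta)
{
  ## minimal sketch of Eqs. (2)-(4), assuming an isotropic Gaussian correlation
  N <- length(Y)
  K <- exp(-as.matrix(dist(X))^2 / theta)            ## N x N correlation matrix K_N
  kx <- exp(-colSums((t(X) - x)^2) / theta)          ## N-vector k(x)
  Ki <- solve(K)
  psi <- drop(t(Y) %*% Ki %*% Y)
  mu <- drop(t(kx) %*% Ki %*% Y)                     ## Eq. (2)
  s2 <- psi * (1 - drop(t(kx) %*% Ki %*% kx)) / N    ## Eq. (3)
  ldetK <- as.numeric(determinant(K)$modulus)        ## log |K_N|
  ll <- lgamma(N/2) - (N/2)*log(2*pi) - ldetK/2 - (N/2)*log(psi/2)  ## log of Eq. (4)
  list(mean = mu, scale = s2, df = N, llik = ll)
}

For example, gp.predict(c(0.5, 0.5), X, Y, theta = 0.1) returns the predictive mean, scale, degrees of freedom, and log likelihood for a design X with responses Y.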
Observe that prediction and inference (even sampling from the GP prior) require decomposing an N × N matrix to obtain K_N^{-1} and |K_N|. Thus for most choices of K_θ(·, ·) and point-inference schemes, data sizes N are limited to the low thousands. Bayesian approaches are even further limited, as orders of magnitude more likelihood evaluations (and matrix inversions) are typically required, e.g., for MCMC. Assuming stationarity can also sometimes be too restrictive, and unfortunately relaxation usually requires even more computation (e.g., Paciorek and Schervish, 2006; Ba and Joseph, 2012; Schmidt and O'Hagan, 2003). A key demand on the emulator in almost any computer modeling context, but especially for calibration, is that inference and prediction (at any/many x) be fast relative to running new simulations (at x). Otherwise, why bother emulating? As computers have become faster, computer experiments have become bigger, limiting the viability of standard GP emulation. A recurring theme of late in the search for emulators with larger capability is sparsity (e.g., Haaland and Qian, 2011; Sang and Huang, 2012; Kaufman et al., 2012; Eidsvik et al., 2013). Sparsity allows decompositions of large covariance matrices to be either avoided entirely, built up sequentially, or performed using fast sparse-matrix libraries. In this paper we use a recent sparse GP methodology developed by Gramacy and Apley (2014). They provide a localized approach to GP inference/prediction that is perfect for calibration, where the full inferential scheme (either KOH or modular) only requires ŷ^M(x, u) for (x, u)-values coinciding with field-data x-values, and u-values along the search path for u*, as we describe in Section 4. The idea is to focus expressly on the prediction problem at an input x. In what follows we use x generically, rather than (x, u), as inputs to ŷ^M. The scheme acknowledges that data input locations in X_N which are far from x have vanishingly small impact on the predictive equations (2-3). This is used as the basis of a search for locations X_n(x) ⊂ X_N which minimize Bayesian mean squared prediction error (MSPE). The search is performed in a greedy fashion, giving an approximate solution to the local design problem, using a scheme which efficiently updates the local GP approximation as new data points are added into the local design. Building a predictor in this way, ultimately using Eqs. (2-4) with a data subset D_n(x), can be performed in O(n³) time, which represents a significant savings if n << N. Pragmatically, one can choose n as large as computational constraints allow. Gramacy and Apley (2014) show empirically that these MSPE-based local designs lead to predictors which are more accurate than nearest neighbor—using the nearest X_N values to x—which are known to be sub-optimal (Vecchia, 1988; Stein et al., 2004). They also extend the scheme to provide local inference of the correlation structure, and thereby fit a globally nonstationary model. All calculations are independent for each x, so local inference and prediction on a dense set of x ∈ X can be trivially parallelized, accommodating emulation for N = 1M sized designs in under an hour (Gramacy et al., 2014). An implementation is provided in an R package, laGP (Gramacy, 2013). However, independent calculations for each x—while providing for nonstationarity and parallelization—yield a discontinuous global predictive surface, which can present challenges in the calibration context.
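To fix ideas, local approximate GP emulation of a large simulation campaign is available through the aGP wrapper in laGP. The snippet below is a sketch only: X, Y, and XX are placeholder names for the simulation design, its outputs, and the desired predictive locations, and end = 50 is the local design size matching the fidelity used later.

library(laGP)
## local approximate GP emulation, independently (and in parallel) for each row of XX
fit <- aGP(X = X, Z = Y, XX = XX, end = 50, method = "mspe", omp.threads = 4, verb = 0)
## fit$mean and fit$var contain the local predictive means and variances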
4 Proposed method
What we propose is thriftier than KOH in four ways, and thriftier than the modularized version in three ways: It (a) modularizes the KOH hierarchical model; (b) deploys local
approximate GP modeling via the laGP package for the emulator ŷ^M(x, u); with (c) maximum a posteriori (point) inference for b̂(x) under a GP prior; and (d) maximum a posteriori (point) inference for u. This last step is the "glue" that holds the whole framework together. Given a value for the calibration parameter, u, the rest of the scheme involves a cascade of straightforward Newton-style maximizing calculations. Below we describe an objective function which, when optimized, performs the desired calibration, giving an estimated value, û, for u*. The other quantities required to obtain, say, predictions of new y^F(x)-values can be quickly re-computed as needed given û and the data.
4.1 Calibration as optimization
Let the field data be denoted as D^F_{N_F} = (X^F_{N_F}, Y^F_{N_F}), where X^F_{N_F} is the design matrix of N_F field data inputs, paired with an N_F-vector of y^F observations, Y^F_{N_F}. Similarly, let D^M_{N_M} = ([X^M_{N_M}, U_{N_M}], Y^M_{N_M}) be the N_M computer model input-output combinations, with column-combined x- and u-design(s) and y^M outputs. Then, with an emulator ŷ^M(·, u) trained on D^M_{N_M}, let Ŷ^{M|u}_{N_F} = ŷ^M(X^F_{N_F}, u) denote the vector of N_F emulated output y-values at the X^F locations obtained under a setting, u, of the calibration parameter. With local approximate GP modeling, each ŷ^{M|u}_j value therein, for j = 1, ..., N_F, can be obtained independently of (and in parallel with) the others via a local sub-design X_{n_M}(x^F_j, u) ⊂ [X^M_{N_M}, U_{N_M}] and local inference for the correlation structure. The size of the local sub-design, n_M, is a fidelity parameter, meaning that larger values provide more accurate emulation at greater computational expense. Finally, denote the N_F-vector of fitted discrepancies as b̂_{N_F} = Y^F_{N_F} − Ŷ^{M|u}_{N_F}. Given these quantities, the objective function for calibration of u is the (log) joint probability density of observing Y^F_{N_F} at inputs X^F_{N_F}. This is obtained from the best fitting GP regression model trained on data D^{b̂}_{N_F} = (b̂_{N_F}, X^F_{N_F}), representing the bias estimate b̂, which can be fit without approximation since N_F is small.

Require: calibration parameter u, fidelity parameter n_M, computer data D^M_{N_M}, and field data D^F_{N_F}.
1: for j = 1, ..., N_F do
2:    I ← laGP(x^F_j, u | n_M, D^M_{N_M})         {get indices of local design}
3:    θ̂ ← mleGP(D^M_{N_M}[I])                     {local MLE of correlation parameter(s)}
4:    y^{M|u}_j ← muGP(x^F_j | D^M_{N_M}[I], θ̂)   {predictive mean emulation following Eq. (2)}
5: end for
6: b̂_{N_F} ← Y^F_{N_F} − Y^{M|u}                  {vectorized bias calculation}
7: D^{b̂}_{N_F} ← (b̂_{N_F}, X^F_{N_F})             {create data for estimating b̂(·) | u}
8: θ̂_b ← mleGP(D^{b̂}_{N_F})                       {full GP estimate of b̂ | u}
9: return predGP(Y^F_{N_F} | X^F_{N_F}, θ̂_b)      {multivariate Student-t density generalizing (2)-(3)}
Algorithm 1: Objective function evaluation for modularized local GP calibration.

Algorithm 1 represents this objective function in pseudocode for a more detailed second
look. In our implementation, steps 1-5 in the code are automated by applying a wrapper routine in the laGP package, called aGP, which loops over each element of the predictive grid, performing the local design, inference, and prediction stages, in parallel via OpenMP. With N_F and n_M small relative to N_M, the execution of the "for"-loop is extremely fast. In our examples to follow [Sections 5-6] we use a local neighborhood size of n_M = 50. The final two steps, 8-9, are facilitated by functions of the same names in the laGP package. The GP model for b(·), fit in line 8, estimates a nugget parameter to capture the noise term σ_ε² in (1), whereas the local approximate ones used for emulation, on line 3, do not. For situations where bias is known to be very small/zero, it is sensible to entertain a degenerate GP prior for b(·) with an identity correlation matrix. In that case, step 8 in Algorithm 1 is skipped and step 9 reduces to evaluating a predictive density under a (Rao-Blackwellized) i.i.d. normal likelihood with µ = 0, i.e., only averaging over σ_ε².
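The R sketch below mirrors Algorithm 1 under the assumptions just described (n_M = 50, a nugget-estimating full GP for the bias). All data objects are placeholders: XF/YF are the field design and responses, and XM/YM column-bind the simulation x- and u-designs with their outputs. Later versions of the laGP package also provide a function, fcalib, automating a similar objective; the version below is spelled out for transparency and is a sketch rather than our exact implementation.

library(laGP)

calib.obj <- function(u, XF, YF, XM, YM, nM = 50)
{
  ## steps 1-5: local approximate GP emulation at the field sites, given u
  XFu <- cbind(XF, matrix(rep(u, nrow(XF)), ncol = length(u), byrow = TRUE))
  emu <- aGP(XM, YM, XFu, end = nM, method = "mspe", verb = 0)
  ## step 6: fitted discrepancies
  bhat <- YF - emu$mean
  ## steps 7-8: full GP fit to the discrepancies, estimating lengthscale and nugget
  gpi <- newGP(XF, bhat, d = 0.1, g = 0.1, dK = TRUE)
  jmleGP(gpi)
  ## step 9: log (marginal) likelihood of the field data under that fit
  ll <- llikGP(gpi)
  deleteGP(gpi)
  ll
}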
4.2 Derivative-free optimization of the calibration objective
Objective function in hand, we turn to optimizing. The discrete nature of independent local design searches for ŷ^M(x^F_j, u) ensures that the objective is not continuous in u. In fact, as we illustrate in Section 6.3, it can look 'noisy', although it is in fact deterministic. This means that optimization with derivatives—even numerically approximated ones—is fraught with challenges. We opt for a derivative-free approach (see, e.g., Conn et al., 2009). Specifically, we use an implementation of the mesh adaptive direct search (MADS) algorithm (Audet and Dennis, Jr., 2006) called NOMAD (Le Digabel, 2011), via an interface for R provided by the crs package (Racine and Nie, 2012). MADS proceeds by successive pairs of search and poll steps, trying inputs to the objective function on a sequence of meshes which are refined in such a way as to guarantee convergence to a local optimum under very weak regularity conditions; for more details see Audet and Dennis, Jr. (2006). Direct, or so-called pattern search, methods such as these have become popular for many challenging optimization problems where derivative information is either not available, or where numerical approximations to derivatives may lead to unstable numerical behavior. We are not the first to use MADS/NOMAD in the context of computer modeling. MacDonald et al. (2012) used it to search for the smallest nugget leading to numerically stable matrix decompositions for near-interpolating GP emulation. Our use is novel in the calibration context. As MADS is a local solver, NOMAD requires initialization. We recommend choosing a starting u-value from evaluations on a small random space-filling design, however in our experiments [e.g., Section 5] starting at the center of the space performs almost as well.
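A sketch of the resulting optimization, using the snomadr interface to NOMAD from the crs package, is given below. The argument and return-value names follow our reading of that interface and may differ slightly across versions; the objective is negated because NOMAD minimizes, and a prior on u could be folded into obj as well.

library(crs)   ## provides snomadr(), an R interface to NOMAD

obj <- function(u) -calib.obj(u, XF, YF, XM, YM)   ## negate: NOMAD minimizes
out <- snomadr(eval.f = obj, n = 2, x0 = c(0.5, 0.5), bbin = c(0, 0),
               lb = c(0, 0), ub = c(1, 1), opts = list("MAX_BB_EVAL" = 100))
uhat <- out$solution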
4.3 Predictions for field data
Given u*, estimates of b̂(x) and ŷ^M(x, u*) can be obtained by, first, running back through steps 2-4 of Algorithm 1 to obtain a local design and correlation parameter, parallelized for many x; then performing steps 7-9 using the saved D^{b̂}_{N_F} and θ̂_b from the optimization. However, rather than evaluate a predictive probability, instead save the moments of the predictive density (step 9) at the new x locations. These can then be combined with the computer model emulation(s) obtained in step 4, thus adding bias to the computer model output to get a distribution for y^F(x)|u*, i.e., undoing step 6. Ideally, the full Student-t predictive density would be used here, in step 4, leading to a sum of Student-t random variables for ŷ^M(x, u) and b̂(x) comprising y^F(x)|u*. However, if N_F, n_M ≥ 30, summing normals suffices.
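In code, and continuing the sketch above, prediction at new field inputs XX given û might look as follows; gpi denotes the bias GP (re)fit at û as in Algorithm 1, all names are placeholders, and the final line applies the normal approximation just mentioned.

## emulate the computer model at (XX, uhat) and add the estimated bias
XXu <- cbind(XX, matrix(rep(uhat, nrow(XX)), ncol = length(uhat), byrow = TRUE))
em <- aGP(XM, YM, XXu, end = 50, method = "mspe", verb = 0)
pb <- predGP(gpi, XX, lite = TRUE)      ## moments of bhat(XX), including the noise/nugget
mean.yF <- em$mean + pb$mean            ## predictive mean for y^F(XX)
var.yF  <- em$var + pb$s2               ## normal-approximation variance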
5 Illustrations
In this section we entertain variations on a synthetic data-generating mechanism akin to one described most recently by Goh et al. (2013), who adapted an example from Bastos and O'Hagan (2009). It uses two-dimensional field data inputs x, and two-dimensional calibration parameters u, both residing in the unit cube. The computer model is specified as follows.

$$y^M(x, u) = \left(1 - e^{-\frac{1}{2x_2}}\right) \frac{1000 u_1 x_1^3 + 1900 x_1^2 + 2092 x_1 + 60}{100 u_2 x_1^3 + 500 x_1^2 + 4 x_1 + 20}. \quad (5)$$
The field data is generated as

$$y^F(x) = y^M(x, u^*) + b(x) + \varepsilon, \quad \mbox{where} \quad b(x) = \frac{10 x_1^2 + 4 x_2^2}{50 x_1 x_2 + 10} \quad \mbox{and} \quad \varepsilon \stackrel{\mathrm{iid}}{\sim} N(0, 0.5^2), \quad (6)$$
using u* = (0.2, 0.1). We keep this setup, however we diverge from previous uses in the size and generation of the input designs, and the number of field data replicates. Our simulation study is broken into two regimes, considering biased and unbiased variations, and is designed (i) to explore the efficacy of the proposed approach; (ii) to investigate performance in different scenarios (with/without bias, unreplicated and replicated experiments, etc.); and (iii) to motivate alternatives in our real data analysis in Section 6. Both simulation regimes involve 100 Monte Carlo (MC) repetitions, and proceed as follows. Each repetition uses a two-dimensional random Latin Hypercube sample (LHS, McKay et al., 1979) of size 50 (on the unit cube) for the field data design, and considers variations on the number of replicates for each unique design variable setting, x. Denote the number of replicates as n_x ∈ {1, 2, 10} for all x, leading to N_F ∈ {50, 100, 500}. The computer model design begins with a four-dimensional LHS of size 10000. It is then augmented with simulation trials that are aligned with the field data design. We take 10 points per unique input in the field data, differing only in the u-values, which are drawn collectively as a two-dimensional LHS of size 500. This gives N_M = 10500. Random Y^F and deterministic simulations of Y^M are generated at these N_F and N_M locations. In each MC repetition, a NOMAD search for û is initialized with the best value found on a maximin design of size 20, which is obtained by searching stochastically over a two-dimensional LHS of size 200. Vague independent Beta(2, 2) priors on each component of u discourage the solver from finding solutions that lie on the boundary of the search space. Finally, a two-dimensional LHS of size 1000 is used to generate an out-of-sample validation set of y^F values without noise, i.e., ε_xj = 0. Root mean-squared errors (RMSEs) and estimates û of u* are our main metrics of comparison.
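For reference, the data-generating mechanism of Eqs. (5)-(6), with u* = (0.2, 0.1) and noise standard deviation 0.5, translates directly into R as below; this is a transcription of the equations rather than of our full simulation code.

yM <- function(x, u)    ## computer model, Eq. (5)
{
  (1 - exp(-1/(2*x[2]))) *
    (1000*u[1]*x[1]^3 + 1900*x[1]^2 + 2092*x[1] + 60) /
    (100*u[2]*x[1]^3 + 500*x[1]^2 + 4*x[1] + 20)
}

bias <- function(x)     ## discrepancy b(x), from Eq. (6)
  (10*x[1]^2 + 4*x[2]^2) / (50*x[1]*x[2] + 10)

yF <- function(x, ustar = c(0.2, 0.1), sd = 0.5)   ## noisy field observation, Eq. (6)
  yM(x, ustar) + bias(x) + rnorm(1, sd = sd)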
In addition to varying the number of replicates, n_x, our comparators include variations on the calibration apparatus and the emulation of y^M. For example, we compare our local approximate modular approach [Section 4] to versions using the true calibration value, u*, a random value in the two-dimensional unit cube, u^r, and combinations of those where y^M is used directly—i.e., assuming free computer simulation, and thus bypassing the emulator ŷ^M. Together, these alternatives allow us to explore how the error in our estimates decomposes at each level of the approximate modularized calibration. Ideally, we would compare to a full GP modularized calibration, and a non-modularized (KOH) version. However, at the scale of problem we are considering—N_F as large as 500, and N_M > 10K—full and coupled GP inference, whether Bayesian/modular or otherwise, is not practical.
5.1 Unbiased calibration
Figure 1 summarizes results from our first regime: without bias, i.e., setting b(·) = 0 in Eq. (6). Consider the top-left panel first, which shows boxplots of RMSEs arranged by numbers of replicates (three groups of six separated by solid gray vertical lines), and then by the use of an emulator ŷ^M or not (gray dotted vertical lines). Observe that a random calibration parameter, u^r (labeled as "urand", the middle boxplot in each group of three), gives poor predictions of y^F. By contrast, using the correct u* with y^M directly (labeled "u*-M", fourth boxplot in each group of six), i.e., not estimating ŷ^M, leads to nearly perfect prediction. Contrasting with the corresponding "u*-Mhat" boxplots (first in each group of six) reveals the relative "cost" of emulating via ŷ^M with u*. Taken together, these 12 boxplots provide a reference for how good and bad things can get. The rest involve more realistic setups, and they show differentiating performance depending on the number of replicates n_x. The third and sixth boxplots (from the left) show RMSEs obtained with û and one field data replicate, n_x = 1. RMSEs obtained under y^M or ŷ^M are very similar, with the former (M) being slightly better. This indicates that the local approximate GP emulator is doing a good job as a surrogate for y^M. The story is similar when n_x = 2, giving slightly lower RMSEs (boxplots 9 and 12), as expected. Ten replicates (15 and 18) lead to greater differentiation between y^M and ŷ^M results, implying more replicates allow a more accurate and lower variance estimate û. Considering how bad things can get ("urand"), all of the other estimates are quite good relative to the best possible ("u*-M" and "u*-Mhat"). The bottom-left panel shows estimated predictive standard deviations (SDs) for each variation, whose corresponding RMSEs are directly above. SDs are calculated by factoring in the predictive variances from both stages: emulation uncertainty (if any), plus bias/noise components. The random calibration parameter, u^r, gives the greatest uncertainty, which is reassuring given its poor RMSEs. Uncertainties coming from ŷ^M and y^M are very similar. A numerical summary of RMSEs and SDs is provided in the bottom-right panel for convenience. The top-right panel shows estimated û-values for three representative cases, over each of the 100 MC repetitions. The others follow these trends and are omitted to reduce clutter. In all three cases the û-values found are along a straight line going through the true value u* = (0.2, 0.1). This is the case whether emulating with ŷ^M or using y^M directly, although we observe that when there are more replicates (n_x = 10 vs. n_x = 1) or when y^M is used
comparator       rmse    sd
u*-Mhat          0.051   0.524
urand-Mhat       0.235   0.586
uhat-Mhat        0.077   0.512
u*-M             0.000   0.506
urand-M          0.209   0.559
uhat-M           0.061   0.502
u*-Mhat-2        0.051   0.522
urand-Mhat-2     0.211   0.578
uhat-Mhat-2      0.059   0.514
u*-M-2           0.000   0.504
urand-M-2        0.223   0.569
uhat-M-2         0.054   0.501
u*-Mhat-10       0.051   0.523
urand-Mhat-10    0.250   0.593
uhat-Mhat-10     0.049   0.519
u*-M-10          0.000   0.502
urand-M-10       0.240   0.574
uhat-M-10        0.027   0.501
Figure 1: Comparison on unbiased data, 100 MC replicates. The top-left panel shows RMSE to the true response on hold-out sets, and the bottom-left panel summarizes the standard deviation of the corresponding predictor. The top-right panel shows three examples of the chosen calibration parameter(s) û. The true calibration parameter is shown at the intersection of the dashed blue lines. The axis and legend entries indicate if u is estimated ("uhat"), or if the true value is used ("u*"). Field data sets with 1, 2, and 10 replicates at each design location are shown, separated by solid gray vertical lines in the left panels. Dotted gray lines separate estimators using ŷ^M ("Mhat") and y^M ("M"). The whiskers of the "urand" comparators, which extend out to 0.6, are cut off to improve visualization. Finally, the table in the bottom-right gives numerical summaries of the boxplots on the left.
directly, the points cluster more tightly to the line and more densely near u*. We conclude that there is a ridge in the integrated likelihood for u. Beyond that ridge, information is weak, as the pull of our prior, towards the center of the space, is clearly visible. Further simulation (not shown) reveals that an even weaker/uniform prior on u does indeed bring the cloud closer to the true u*, however priors which are very close to uniform give û-values on the boundary, particularly near u_2 = 0.2. RMSE results in either case are nearly identical. To wrap up with timing, we report that the most expensive comparator ("uhat-Mhat-10") took between 159 and 388 seconds, averaging 232 seconds, over all 100 repetitions on a 16-core Intel Sandy Bridge 2.6 GHz Xeon machine. That large range is due to variation in the number of NOMAD optimization steps required, spanning 11 to 33, averaging 18.
5.2 Biased calibration
Figure 2 shows a similar suite of results for the full, biased, setup described in Eqs. (5-6). At a quick glance one notices: (1) the û estimates (top-right) are far from the true u* for all calibration alternatives considered; (2) the random setting u^r isn't much worse than the other options (top-left). Looking more closely, however, we can see that the û versions are performing the best in each section of the chart(s). These are giving the lowest RMSEs (top-left) and the lowest SDs (bottom-left). They are doing even better than with the true u* setting. So while we are not able to recover the true u*, we are able to predict the field data better with the values we do find. Our modularized approximate calibration method is excelling at one task, prediction of y^F, possibly at the expense of another, estimating u*. The explanation is nuanced. The bias (6) is not well approximated by a stationary process, and neither is (5) for that matter. But our GP for b̂ assumes stationarity, so there is clearly a mismatch with (6). The local approximate GP emulator does allow for adaptivity of the correlation structure over the input space, and thus can accommodate a degree of nonstationarity in (5). That explains why our emulations were very good, but not perfect, in the unbiased case [Figure 1]. In this biased case, the full posterior distribution, inferring both full and local GPs, is using the flexibility of the joint modeling apparatus to trade off responsibility, in effect exploiting a lack of identifiability in the model, which is a popular tactic in nonstationary modeling [further discussion in Section 7]. It is tuning û to obtain an emulator that better copes with a stationary discrepancy, resulting in a less parsimonious and larger magnitude estimate of b, but one for which b̂(·) + ŷ^M(·, û) gives good predictions of y^F(·). Meanwhile, the local GP is faced with a more demanding emulation task. Time-wise, the most expensive comparator ("uhat-Mhat-10") took between 538 and 1700 seconds, averaging 1049 seconds, over all 100 repetitions. The number of NOMAD optimization steps was similar to the unbiased case, ranging from 11 to 32, averaging 18. The main difference in computational cost compared to the unbiased case was in estimating the GP correlation structure for b̂, requiring O(N_F³) computations for N_F = 500.
comparator       rmse    sd
u*-Mhat          0.206   0.546
urand-Mhat       0.199   0.547
uhat-Mhat        0.182   0.523
u*-M             0.189   0.528
urand-M          0.186   0.528
uhat-M           0.167   0.509
u*-Mhat-2        0.165   0.536
urand-Mhat-2     0.159   0.537
uhat-Mhat-2      0.152   0.523
u*-M-2           0.144   0.521
urand-M-2        0.139   0.520
uhat-M-2         0.130   0.512
u*-Mhat-10       0.098   0.528
urand-Mhat-10    0.097   0.528
uhat-Mhat-10     0.094   0.524
u*-M-10          0.086   0.508
urand-M-10       0.084   0.508
uhat-M-10        0.085   0.507
Figure 2: Comparison on biased data, 100 MC replicates. The explanation of the panels is the same as for Figure 1.
6 Predictions for radiative shock hydrodynamics
An important goal for the Center for Radiative Shock Hydrodynamics (CRASH) at the University of Michigan is to use computer model output and experimental data to build a predictive model for a radiative shock system, with appropriate estimates of uncertainty. The shocks under study are ones where radiation from shocked matter dominates the energy transport and results in a complex evolutionary structure. They arise in practice from astrophysical phenomena (e.g., super-novae) and other high-temperature systems (e.g., see
McClarren et al., 2011; Drake et al., 2011). Our study here involves a suite of simulation output and a small set of N_F = 20 observations from radiative shock experiments to calibrate the CRASH simulator and to make predictions of features of radiative shocks. A series of experiments was conducted at the Omega laser facility at the University of Rochester (Boehly et al., 1997) where a high-energy laser is used to irradiate a beryllium disk located at the front end of a xenon (Xe) filled tube (Figure 3a). A high-speed shock wave is launched into the tube, and it is said to be a radiative shock if the energy flux emitted by the hot shocked material is equal to or larger than the flux of kinetic energy into the shock. Such shocks can have very different structures than non-radiative shocks. For example, they can lead to much higher compression of the shocked xenon than non-radiative shocks. Each
Figure 3: (a) Sketch of the apparatus used in the radiative shock experiments. A high energy laser is used to ignite the beryllium disk on the right, creating a shock wave that travels through the xenon filled tube. (b) Radiograph image of a radiative shock experiment.

physical observation is a radiograph image (Figure 3b), and the quantity of interest that is extracted from the image is the distance the wave has travelled at a pre-determined time. This response variable is called the shock location. The system design and calibration variables are listed in the first column of Table 1. The first three variables specify the thickness of the beryllium disk, the Xe fill pressure in the tube, and the observation time for the radiograph image. The next four variables are related to the geometry of the tube and the shape of the apparatus at its front end. Most of the physical experiments were performed on circular shock tubes with a small diameter (in the area of 575 microns), and the remaining experiments were conducted on circular tubes with a diameter of 1150 microns or with different nozzle configurations. The aspect ratio describes the shape of the tube (circular or oval). In our field experiments, the aspect ratios
are all 1, indicating a circular tube. The experiments to be predicted involve extrapolating to oval shaped tubes with an aspect ratio of 2. There are two calibration parameters: the laser energy scale-factor, which is related to the laser energy input (more on both shortly), and the electron flux limiter. The CRASH code uses transport equations that describe the amount of heat transferred between cells of a mesh at a given time, as predicted by diffusion theory, but requiring a down-scaling constant—the electron flux limiter—whose value for this application is not known.

Design variables
Input                        CE1               CE2               Field design
Be thickness (microns)       [18, 22]          21                21
Xe fill pressure (atm)       [1.100, 1.2032]   [0.852, 1.46]     [1.032, 1.311]
Time (nano-seconds)          [5, 27]           [5.5, 27]         6 values in [13, 28]
Tube diameter (microns)      575               [575, 1150]       {575, 1150}
Taper length (microns)       500               [460, 540]        500
Nozzle length (microns)      500               [400, 600]        500
Aspect ratio (microns)       1                 [1, 2]            1
Laser energy (J)             [3600, 3990]                        [3750.0, 3889.6]
Effective laser energy (J)*                    [2156.4, 4060]

Calibration parameters
Input                        CE1               CE2
Electron flux limiter        [0.04, 0.10]      0.06
Energy scale-factor          [0.40, 1.10]      [0.60, 1.00]

Table 1: Design and calibration variables and input ranges for computer experiments 1 (CE1) and 2 (CE2) and the field experiments. A single value means that the variable was constant for all simulation runs. *The effective laser energy is the laser energy times the energy scale-factor.

Radiative shocks involve many subprocesses, making them computationally challenging to simulate. Two simulation campaigns were performed using super-computers at Lawrence Livermore and Los Alamos National Laboratories. These two computer experiments were run for different reasons, but can be combined for the calibration exercise. The inputs for each code and the ranges explored are shown in the second and third columns of Table 1. Notice that the input ranges are different in the two computer experiments (denoted CE1 and CE2, respectively). Briefly, CE1 explores the input region for small, circular tubes, whereas CE2 investigates a similar input region, but also varies the tube diameter and nozzle geometry. Both computer experiment plans were derived from LHS designs. The electron flux limiter was held constant in CE2 because the code outputs were found to be relatively insensitive to this input during CE1. In addition, the thickness of the beryllium disk is also held constant in CE2 because improvements in manufacturing in the time between the simulation campaigns meant that the disk thickness had very little variability in the experiments. In total, we have 5002 simulator trials for the calibration task.
An input variable that requires some discussion is that of laser energy. In the physical system, the laser energy used to induce a shock is recorded by a technician. Things are a little more complicated for the simulations. Before CE1 it was felt that, due to some physical instabilities not modelled by the CRASH code, the shock speed would be too high, and thus the shock would be driven too far down the tube, for any specified laser energy. So for CE1 an effective laser energy was constructed from two input variables: laser energy and the laser energy scale-factor, varied over the ranges specified in the second column of Table 1. CE2 used effective laser energy directly. The analysis is performed using both laser energy and the laser energy scale-factor. To make the second experimental design coincide with the first in terms of inputs, the design is expanded to produce instances where pairs of laser energy and laser energy scale-factor are obtained by gridding values of the laser energy scale-factor and matching them with values of laser energy deduced from the effective laser energy values of the original design. When constructing the expanded design, our CRASH collaborators suggested constraining the scale-factors to be smaller than one, and no smaller than value(s) which, when divided into the effective laser energy, resulted in a laser energy of no more than 5000 joules. With those restrictions, we took a uniform grid of 100 settings of energy scale-factor, giving a total of 26458 input-output combinations to use in the calibration exercise.
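A minimal sketch of this expansion is given below; ce2 stands for a hypothetical data frame holding the CE2 runs with an EffectiveLaserEnergy column, and the grid range for the scale-factor is an assumption. Filtering on the 5000 J limit is one simple way to impose the stated constraint, so the sketch is illustrative rather than an exact reproduction of the 26458-run design.

sf   <- seq(0.4, 1, length = 100)                      ## grid of energy scale-factors
ce2x <- ce2[rep(1:nrow(ce2), each = length(sf)), ]     ## replicate each CE2 run over the grid
ce2x$EnergyScaleFactor <- rep(sf, nrow(ce2))
ce2x$LaserEnergy <- ce2x$EffectiveLaserEnergy / ce2x$EnergyScaleFactor
ce2x <- subset(ce2x, LaserEnergy <= 5000)              ## enforce the 5000 J constraint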
Figure 4: Marginal design for laser energy and energy scale-factor from both experiments.

Figure 4 shows the design over laser energy and energy scale-factor, combining CE1 with the expanded CE2 values. Observe that the design over derived laser energies from CE2 spans a larger range than those from CE1. From a data analysis perspective, the laser energy scale-factor is treated as a calibration parameter, and if it is found to be 1 then there was no need to down-scale the laser energy in the first place.
6.1 Modeling setup
Motivated by our synthetic data analysis above, and acknowledging the larger (and disparate unit) input space of the CRASH data [Table 1], our analysis to follow involves four variations derived from pairing biased versus unbiased setups (i.e., estimating a non-linear discrepancy with noise, or estimating only the noise, between field data and computer model runs) with two types of pre-processing of the input data. Our first type of pre-processing simply scales all inputs to lie in the unit 10-cube. We call this the 'isotropic' case, since all input directions share a common lengthscale. In the local GP emulator, ŷ^M, local isotropy does not preclude global anisotropy or even non-stationarity. However, the discrepancy b̂ has global reach, so isotropy can be restrictive—however with only twenty field data observations isotropy has the virtue of parsimony. In a second version we re-scale those inputs by a crude estimate of the global lengthscale obtained from small random subsets of the computer model run data. Specifically, we randomly sample 1000 elements of the full 26458-run design, in 100 replications, and save the maximum a-posteriori estimate of a separable lengthscale parameter from a Gaussian correlation function. The distribution of those lengthscales is summarized in Table 2. Observe
Input                    25%    50%    75%
Be thickness             0.17   0.64   1.05
Laser energy             1.94   2.11   2.33
Xe pressure              3.26   3.65   4.07
Aspect ratio             2.68   2.94   3.25
Nozzle length            3.54   3.85   4.20
Taper length             3.15   3.57   3.95
Tube diameter            3.26   3.55   3.77
Time                     0.51   0.69   0.91
Electron flux limiter    0.51   0.88   1.35
Energy scale-factor      2.48   2.73   2.98
Table 2: Summary of estimated lengthscales from a separable power correlation function applied 100 times to a random subsample of size 1000 from the full 26458-run design.

that while some inputs (the middle ones: Xe Pressure, Aspect Ratio, Nozzle Length, Taper Length, Tube Diameter) might cope well with a common lengthscale, the analysis suggests others require faster decay. Be Thickness, Time, and Electron Flux Limiter benefit from lengthscales roughly 4x shorter than those listed above; laser energy and energy scale-factor almost 2x. To challenge isotropy in both ŷ^M and b̂, our second pre-processing variation involves dividing the cube-scaled inputs by the square roots of the median lengthscales in Table 2. Besides pre-processing of the inputs, the experimental setup for calibration and prediction is identical to the one described in Section 5 except for the following. We initialize the search for û, a two-vector comprising the electron flux limiter and energy scale-factor, with a larger space-filling design (of size 200 compared to 20). Since we are not performing a Monte Carlo experiment with hundreds of repetitions, we can afford a more conservative, computationally costly, search. When estimating the discrepancy b̂, we apply a GP model to the subset of inputs which actually vary over more than two values in the field data (Laser Energy, Xe Pressure, Time). See the final column in Table 1. We drop Tube Diameter, which has only two unique settings, however the results aren't much changed when it is included.
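The lengthscale summaries in Table 2 can be reproduced, in sketch form, as follows; XM and YM denote the cube-scaled simulation inputs and outputs, and the return-value names from laGP's mleGPsep are our assumption rather than a guarantee across package versions.

library(laGP)
d  <- darg(NULL, XM)                      ## default lengthscale ranges/priors from the design
ds <- matrix(NA, 100, ncol(XM))
for(r in 1:100) {
  i   <- sample(1:nrow(XM), 1000)         ## random subsample of size 1000
  gpi <- newGPsep(XM[i,], YM[i], d = d$start, g = 1e-6, dK = TRUE)
  ds[r,] <- mleGPsep(gpi, tmin = d$min, tmax = d$max, ab = d$ab)$d  ## MAP separable lengthscales
  deleteGPsep(gpi)
}
apply(ds, 2, quantile, probs = c(0.25, 0.5, 0.75))   ## compare with Table 2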
6.2 Exploratory analysis
The first step in implementing the proposed calibration methodology is emulating the computational model. A small sensitivity study is performed in advance of model calibration. This is done easily with the proposed emulator and should be a routine step, to see which inputs have a substantial impact on the response and to head off any strange behavior in the computer model. Average main effect functions are computed for each input (Sobol, 1993) and displayed in Figure 5. Each panel of the plot gives the emulator response curve for an input, averaged over the remaining inputs, for both isotropic and separable specifications. Looking at Figure 5, we see that both covariance specifications give essentially the same results. Furthermore, the most influential inputs, marginally, appear to be Laser Energy, Time, and Laser Energy Scale-factor. The code is relatively less sensitive to the remaining inputs, on average. Keeping in mind that the aim of the prediction endeavor is to make an extrapolation: since the effect of the variable where there is the greatest degree of extrapolation (aspect ratio) is very small, there is some hope that the proposed model can be used to provide useful predictions in the unsampled regime.
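A sketch of how such an average main effect can be computed with the local approximate GP emulator is given below; the grid size, the crude uniform stand-in for a space-filling average, and the variable names are all illustrative assumptions.

library(laGP)
main.effect <- function(k, XM, YM, ngrid = 20, navg = 200)
{
  grid <- seq(0, 1, length = ngrid)                 ## values for the k-th (scaled) input
  base <- matrix(runif(navg * ncol(XM)), navg)      ## crude stand-in for an LHS over the others
  me <- sapply(grid, function(g) {
    XX <- base; XX[,k] <- g
    mean(aGP(XM, YM, XX, end = 50, verb = 0)$mean)  ## average emulated response
  })
  cbind(grid, me)
}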
Figure 5: Main effects plots for the emulated CRASH code.

Before moving to the calibration exercise—combining emulation and inference for
calibration parameters, noise and/or bias, and prediction—we report on a leave-one-out study to assess the predictive ability of the four variations on our methodology and gain confidence that it is capturing variability in the input space and between simulation and field data. In turn, each of the twenty field observations is deleted, models are fit to the remaining observations and (all) simulations, and the deleted observation is predicted. Figure 6 indicates
Figure 6: Leave-one-out predictions for the radiative shock location field data.

that all four methods are performing well, with none obviously dominating the others. Paired t-tests fail to detect differences in mean predictive ability amongst all pairs of comparators.
6.3 Model calibration
We turn now to a full analysis of the calibration exercise in four variations. The image plots in Figure 7 show the log likelihood surface interpolated from all evaluations of the objective [Algorithm 1], combining the initial design and NOMAD searches. The black dots indicate the û's thus found. The unbiased experiments took about 20 minutes to run on a 4-core hyperthreaded machine, whereas the biased ones took fifteen. That ordering would seem paradoxical, since the biased models have more quantities to estimate, however NOMAD convergence was faster for the biased versions, presumably because they provided a better fit. Several observations are noteworthy. All four variations reveal that the likelihood surface is much flatter for the electron flux limiter than for the energy scale-factor. There is consensus on a value of scale-factor between 0.75 and 0.8. The separable models, biased or unbiased, largely agree on a setting of the electron flux limiter, however the isotropic versions disagree with that setting and disagree amongst themselves. We attribute this divergence to the scales estimated in pre-processing from Table 2. Electron flux limiter benefits from a shorter lengthscale, which the isotropic version struggles to accommodate. Estimating a bias adds
Figure 7: Profile log likelihood surfaces for the calibration parameters, electron flux limiter and energy scale-factor, in four setups. Clockwise from top-left (maximum likelihood setting/filled circle): isotropic unbiased (0.091, 0.790); isotropic biased (0.076, 0.738); separable unbiased (0.057, 0.789); separable biased (0.059, 0.782).

fidelity to the model, bringing estimates closer to those obtained in the separable version(s), providing further illustration [augmenting the discussion in Section 5.2] of the dual role of the discrepancy estimates in the calibration framework. As alluded to earlier, one disadvantage of local approximate GP inference for ŷ^M is that the predictive surface is technically discontinuous, which can result in a 'noisy' profile likelihood in a search for û. When the data are highly informative about good û, leading to a peaky likelihood surface, the noise is negligible. However, when it is flatter the noise is
Figure 8: Slice(s) of the profile log likelihood surface over energy scale factor with the electron flux limiter fixed at its midway value in the range: isotropic on the left; separable on the right. In both plots, the left axis shows the scale for the unbiased model and the right axis the scale for the biased one.

Figure 8 shows both cases via a slice through the likelihood surface(s), fixing the electron flux limiter at its midway value. Being a more flexible model, with weaker identification, the biased setup yields a much shallower log likelihood surface. In the figure this is revealed by the right-hand y axes in both plots, compared to the left-hand ones. Correspondingly, the red dots for the biased likelihood values are noisier. This could be a problem for overly simplistic optimization tools, such as a Newton-like numerical method. However, NOMAD copes well.
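Slices like those in Figure 8 can be reproduced in spirit with a one-dimensional sweep using the obj() function sketched above; the grid bounds and the mid-range value for the electron flux limiter below are illustrative placeholders, not the settings used for the figure.

## sweep energy scale factor with the electron flux limiter held at an
## (illustrative) mid-range value; obj(), XM, ZM, XF, ZF as sketched above
efl.mid <- 0.065
esf <- seq(0.4, 1.1, length.out = 100)
ll  <- sapply(esf, function(e) obj(c(efl.mid, e), XM, ZM, XF, ZF))
plot(esf, ll, xlab = "Energy Scale Factor", ylab = "profile log likelihood")

Evaluating on a grid like this, rather than relying on a single optimizer trajectory, is also a cheap way to gauge how much noise the local approximation injects into the surface.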
6.4 Prediction
Next we aim to make predictions. The nominal settings of interest provided by the CRASH team are listed in Table 3. From past experiments, it was found that settings for certain inputs were not always achieved in practice when evaluated on the experimental apparatus (i.e., in the field). Table 3 therefore also includes distributions for those inputs, which we propagate through the predictive model(s) to account for uncertainty in the design variables. Our algorithmic approach to prediction was outlined in Section 4.3; we augment it here to account for uncertainties in the nominal system settings, leading to the following multi-stage scheme: (i) sample the input settings, $x$, from the distributions specified in Table 3; (ii) independently for each $x$, locally emulate the simulator output via $\hat{y}^M(x, \hat{u})$; (iii) obtain an estimate of the discrepancy $\hat{b}(x)$ together with an estimate of the noise $\hat{\sigma}_\varepsilon$; and (iv) sample a value from (ii) and add it to one from (iii). These steps are repeated many times to obtain an estimate of the predicted response and prediction intervals. A sketch of this scheme is given after Table 3 below.
Design variables

Input                      Nominal value   Distribution
Be thickness (microns)     21              Uniform(20.5, 21.5)
Laser energy (J)           3800            Normal(3800, 81.64)
Xe fill pressure (atm)     1.15            Normal(1.15, 0.10)
Tube diameter (microns)    1150
Taper length (microns)     500
Nozzle length (microns)    500
Aspect ratio (microns)     2
Time (ns)                  26

Table 3: Settings and distributions for the design variables in the 2012 experiments. The Be thickness is uniform over the specified range; the laser energy and Xe fill pressure are both normal with the specified mean and standard deviation.
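A minimal R sketch of steps (i)-(iv) follows, with the Table 3 distributions hard-coded. The names uhat (the calibrated setting from Section 6.3), bhat() (a fitted discrepancy predictor, identically zero in the unbiased variants), and s2hat (the estimated noise variance) are hypothetical stand-ins for quantities estimated earlier, and the coding/scaling of inputs to match the simulator design XM is omitted.

## (i) sample the design variables whose field values are uncertain (Table 3);
## the remaining inputs are held at their nominal values (laGP loaded earlier)
N <- 10000
X <- cbind(be     = runif(N, 20.5, 21.5),   # Be thickness (microns)
           laser  = rnorm(N, 3800, 81.64),  # laser energy (J)
           xe     = rnorm(N, 1.15, 0.10),   # Xe fill pressure (atm)
           tube   = 1150, taper = 500, nozzle = 500, aspect = 2, time = 26)

## (ii) local approximate GP emulation of the simulator at (x, uhat) for each x
XXu <- cbind(X, matrix(uhat, nrow = N, ncol = length(uhat), byrow = TRUE))
em  <- aGP(XM, ZM, XXu, verb = 0)

## (iii)-(iv) combine an emulator draw with the discrepancy estimate and a noise
## draw to build up the predictive sample (uncertainty in the estimated
## discrepancy, when a bias is fit, would be sampled here as well)
pred <- rnorm(N, em$mean, sqrt(em$var)) + bhat(X) + rnorm(N, 0, sqrt(s2hat))
quantile(pred, c(0.05, 0.5, 0.95))   # point prediction and a 90% interval

Kernel density estimates of pred, computed separately for the four variations, give curves like those in the left panel of Figure 9.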
Figure 9: Predictive densities for the 2012 experiments.

Figure 9, focusing first on the left panel, shows the predictive distributions for our four variations. We first observe that, on the scale of the response marginalized over all inputs (roughly from 1000 to 4500), the predictive distributions are remarkably similar for all methods, despite choosing different $u^*$ values for the electron flux limiter. However, observe that estimating bias leads to predictions (red densities) exhibiting a greater degree of uncertainty. Those models involve extra estimation steps, and the random values of the nominal settings from Table 3 filter through to the mean and variance of the estimated bias. That the mode of the final distribution under the biased model (dashed red) is distinctly larger than the others, while at the same time providing substantial spread for smaller values (but not larger ones, i.e., it is skewed towards the modes of the others), suggests that these predictions are the most conservative. This squares well with an a priori preference for estimating
bias and allowing separate lengthscales for each input. The right panel shows a boxplot version of the same distributions alongside the output of a field experiment subsequently performed at the nominal input settings in Table 3. From the plot we can see that all four distributions were quite accurate, with the greatest agreement shown by the separable biased variation.
7 Discussion
Motivated by an experiment from radiative shock hydrodynamics, we presented a new approach to model calibration that can accommodate large computer experiments, which are increasingly common in simulation-based applied work. The cost of computation continues to fall, with more and more processor cores packed onto motherboards and more nodes into computing clusters, whereas the costs of field work remain constant (or are possibly increasing). Although the established, fully Bayesian KOH approach to calibration has many desirable features, we believe that it is too computationally demanding to thrive in this environment. Something thriftier, retaining many of the salient features required for uncertainty quantification and experiment re-design, is essential going forward. Our method pairs local approximate Gaussian process (GP) emulation with a modularized approach to calibration, where the glue is a flexible derivative-free optimization method. The ingredients have been carefully chosen to work well together from an engineering standpoint. All software deployed is open source and available in R, and the extra subroutines we developed have been included in the laGP package on CRAN.

Our method's biggest drawback is that it does not average over uncertainty in the estimated calibration parameter $\hat{u}$. As demonstrated in Figure 7, output from the scheme can provide insight into the posterior for $u$, giving an indication of how robust a particular choice of $\hat{u}$ might be. However, we do not provide a method for sampling from that distribution, as we believe doing so would require too much computation to be practical.

We observed, as many have previously, that the calibration apparatus can yield excellent predictions even when the estimated $\hat{u}$ is far from the true value. This can be attributed to the extreme flexibility afforded by coupled nonparametric regression models, of which GPs are just one example, which further leverage an augmented design space: the calibration parameters, $u$. Authors have recently leveraged similar ideas towards tractable non-stationary modeling: Ba and Joseph (2012) proposed coupling GPs, and Bornn et al. (2012) proposed auxiliary input variables. We were surprised to discover that the calibration model, preceding these methods by nearly a decade, would seem to nest them both as special cases: the first without auxiliary inputs, and the second without bias.
References

Audet, C. and Dennis, Jr., J. (2006). "Mesh Adaptive Direct Search Algorithms for Constrained Optimization." SIAM Journal on Optimization, 17, 1, 188-217.
Ba, S. and Joseph, V. (2012). "Composite Gaussian process models for emulating expensive functions." Annals of Applied Statistics, 6, 4, 1838-1860.

Bastos, L. and O'Hagan, A. (2009). "Diagnostics for Gaussian Process Emulators." Technometrics, 51, 4, 425-438.

Boehly, T. R., Brown, D. L., Craxton, R. S., Keck, R. L., Knauer, J. P., Kelly, J. H., Kessler, T. J., Kumpan, S. A., Loucks, S. J., Letzring, S. A., Marshall, F. J., McCrory, R. L., Morse, S. F. B., Seka, W., Soures, J. M., and Verdon, C. P. (1997). "Initial performance results of the OMEGA laser system." Optics Communications, 133, 2, 495-506.

Bornn, L., Shaddick, G., and Zidek, J. (2012). "Modelling Nonstationary Processes Through Dimension Expansion." Journal of the American Statistical Association, 107, 497, 281-289.

Conn, A., Scheinberg, K., and Vicente, L. (2009). Introduction to Derivative-Free Optimization. Philadelphia: SIAM.

Drake, R. P., Doss, F. W., McClarren, R. G., Adams, M. L., Amato, N., Bingham, D., Chou, C. C., DiStefano, C., Fidkowski, K., Fryxell, B., Gombosi, T. I., Grosskopf, M. J., Holloway, J. P., van der Holst, B., Huntington, C. M., Karni, S., Krauland, C. M., Kuranz, C. C., Larsen, E., van Leer, B., Mallick, B., Marion, D., Martin, W., Morel, J. E., Myra, E. S., Nair, V., Powell, K. G., Rauchwerger, L., Roe, P., Rutter, E., Sokolov, I. V., Stout, Q., Torralva, B. R., Toth, G., Thornton, K., and Visco, A. J. (2011). "Radiative effects in radiative shocks in shock tubes." High Energy Density Physics, 7, 4, 130-140.

Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2013). "Estimation and prediction in spatial models with block composite likelihoods." Journal of Computational and Graphical Statistics. To appear.

Goh, J., Bingham, D., Holloway, J. P., Grosskopf, M. J., Kuranz, C. C., and Rutter, E. (2013). "Prediction and Computer Model Calibration Using Outputs From Multi-fidelity Simulators." Technometrics. To appear.

Gramacy, R., Niemi, J., and Weiss, R. (2014). "Massively Parallel Approximate Gaussian Process Regression." Tech. rep., The University of Chicago. ArXiv:1310.5182.

Gramacy, R. and Polson, N. (2011). "Particle Learning of Gaussian Process Models for Sequential Design and Optimization." Journal of Computational and Graphical Statistics, 20, 1, 102-118.

Gramacy, R. B. (2013). laGP: Local approximate Gaussian process regression. R package version 1.0.

Gramacy, R. B. and Apley, D. W. (2014). "Local Gaussian process approximation for large computer experiments." Journal of Computational and Graphical Statistics. To appear; see arXiv:1303.0383.
Haaland, B. and Qian, P. (2011). "Accurate Emulators for Large-Scale Computer Experiments." Annals of Statistics, 39, 6, 2974-3002.

Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2004). "Combining field data and computer simulations for calibration and prediction." SIAM Journal on Scientific Computing, 26, 2, 448-466.

Johnson, M. E., Moore, L. M., and Ylvisaker, D. (1990). "Minimax and maximin distance designs." Journal of Statistical Planning and Inference, 26, 2, 131-148.

Joseph, V. (2006). "Limit Kriging." Technometrics, 48, 4, 458-466.

Kaufman, C., Bingham, D., Habib, S., Heitmann, K., and Frieman, J. (2012). "Efficient Emulators of Computer Experiments Using Compactly Supported Correlation Functions, With An Application to Cosmology." Annals of Applied Statistics, 5, 4, 2470-2492.

Kennedy, M. and O'Hagan, A. (2001). "Bayesian Calibration of Computer Models (with discussion)." Journal of the Royal Statistical Society, Series B, 63, 425-464.

Le Digabel, S. (2011). "Algorithm 909: NOMAD: Nonlinear Optimization with the MADS algorithm." ACM Transactions on Mathematical Software, 37, 4, 44:1-44:15.

Liu, F., Bayarri, M., and Berger, J. (2009). "Modularization in Bayesian analysis, with emphasis on analysis of computer models." Bayesian Analysis, 4, 1, 119-150.

MacDonald, N., Ranjan, P., and Chipman, H. (2012). "GPfit: An R package for Gaussian Process Model Fitting using a New Optimization Algorithm." Tech. rep., Acadia University. ArXiv:1305.0759.

McClarren, R., Ryu, D., Drake, P., Grosskopf, M., Bingham, D., Chou, C.-C., Fryxell, B., van der Holst, B., Holloway, J., Kuranz, C., Mallick, B., Rutter, E., and Torralva, B. (2011). "A Physics Informed Emulator for Laser-Driven Radiating Shock Simulations." Reliability Engineering and System Safety, 96, 1, 1194-1207.

McKay, M., Conover, W., and Beckman, R. (1979). "A comparison of three methods for selecting values of input variables in the analysis of output from a computer code." Technometrics, 21, 2, 239-245.

Morris, M. D., Mitchell, T., and Ylvisaker, D. (1993). "Bayesian Design and Analysis of Computer Experiments: Use of Derivatives in Surface Prediction." Technometrics, 35, 243-255.

Paciorek, C., Lipshitz, B., Zhuo, W., Prabhat, Kaufman, C., and Thomas, R. (2013). "Parallelizing Gaussian Process Calculations in R." Tech. rep., University of California, Berkeley. ArXiv:1305.4886.

Paciorek, C. J. and Schervish, M. J. (2006). "Spatial Modelling Using a New Class of Nonstationary Covariance Functions." Environmetrics, 17, 5, 483-506.
Racine, J. S. and Nie, Z. (2012). crs: Categorical regression splines. R package version 0.15-18.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). "Design and Analysis of Computer Experiments." Statistical Science, 4, 409-435.

Sang, H. and Huang, J. Z. (2012). "A Full Scale Approximation of Covariance Functions for Large Spatial Data Sets." Journal of the Royal Statistical Society, Series B, 74, 1, 111-132.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. New York, NY: Springer-Verlag.

Schmidt, A. M. and O'Hagan, A. (2003). "Bayesian Inference for Nonstationary Spatial Covariance Structure via Spatial Deformations." Journal of the Royal Statistical Society, Series B, 65, 745-758.

Sobol, W. (1993). "Analysis of variance of 'Component stripping' decomposition of multiexponential curves." Computer Methods and Programs in Biomedicine, 39, 3, 243-257.

Stein, M. L., Chi, Z., and Welty, L. J. (2004). "Approximating Likelihoods for Large Spatial Data Sets." Journal of the Royal Statistical Society, Series B, 66, 2, 275-296.

Vecchia, A. (1988). "Estimation and model identification for continuous spatial processes." Journal of the Royal Statistical Society, Series B, 50, 297-312.