Nonstationary Gaussian Process Emulators with ...

Nonstationary Gaussian Process Emulators with Kernel Mixtures Victoria Volodina and Daniel B. Williamson Department of Mathematical Sciences College of Engineering, Mathematics and Physical Sciences, University of Exeter. Abstract Weakly stationary Gaussian processes are the principal tool in the statistical approaches to the design and analysis of computer experiments (or Uncertainty Quantification). Such processes are fitted to computer model output using a set of training runs to learn the parameters of the process covariance kernel. The stationarity assumption is often adequate, yet can lead to poor predictive performance when the model response exhibits nonstationarity, for example, if its smoothness varies across the input space. In this paper, we introduce a diagnostic-led approach to fitting nonstationary Gaussian process emulators by specifying finite mixtures of region-specific covariance kernels. Our method first fits a stationary Gaussian process and, if traditional diagnostics exhibit nonstationarity, those diagnostics are used to fit appropriate mixing functions for a covariance kernel mixture designed to capture the nonstationarity, ensuring an emulator that is continuous in parameter space and readily interpretable. We compare our approach to the principal nonstationary models in the literature and illustrate its performance on a number of idealised test cases and in an application to modelling the cloud parameterization of the French climate model.

Keywords: Emulation, Nonstationary model response, Bayesian Methods, Uncertainty Quantification, Diagnostics

1. INTRODUCTION Computer simulators (models) are widely used to model physical processes (Frierson et al. 2006, Farneti and Vallis 2009). These models are treated as mathematical functions, denoted f , of a large number of parameters and are often computationally expensive to run. Gaussian Process emulators are often used as ‘surrogates’ to complex computer simulators in such cases, in order to make inference that requires embedding the model as part of a Monte Carlo procedure, tractable. The usual approach to emulation is to fit a stationary Gaussian 1

process (Sacks et al. 1989, Kennedy and O’Hagan 2001, Santner et al. 2003) trained on a set of runs of the simulator. Gaussian Process emulators are then used in a number of inferential engines, such as uncertainty analysis (Oakley and O’Hagan 2002), sensitivity analysis (Oakley and O’Hagan 2004) and calibration (Kennedy and O’Hagan 2001). The usual stationarity assumption means that a common variance term and a correlation function that depends only on the distance between the two points of interest are used to fit a Gaussian process (Bastos and O’Hagan 2009). However, in practice simulators may behave differently across the input space. This can sometimes be captured in modelling by using complex regression functions in the prior mean of the Gaussian process so that the residual is approximately stationary (Rougier et al. 2009; Williamson et al. 2013). When the nonstationarity is not adequately captured in this way, prediction intervals produced by the emulators can be too narrow in the region of high residual variability of f (the emulator is over-confident). On the contrary, the emulator is under-confident in the input space where f is ‘well-behaved’ (Bastos and O’Hagan 2009; Montagna and Tokdar 2016; Oughton and Craig 2016). There is a large literature on nonstationary GP models in spatial statistics. For example, Gelfand et al. (2003) suggested modelling spatial processes through the mean function using spatially varying regression coefficients, with the coefficients themselves given a multivariate process model. Paciorek and Schervish (2006) proposed to use nonstationary Gaussian Processes for spatial modelling, by convolving spatially-varying kernel functions. These methods have not yet been implemented for computer experiments as it would take many more model runs than are usually possible to obtain designs that are sufficiently dense in the typically > 2 input dimensions that are worked with. Recently, new approaches to constructing nonstationary GP models have appeared in the uncertainty quantification field. Gramacy and Lee (2008) developed a Bayesian treed GP (TGP) model to deal with simulators that exhibit nonstationary behaviour along one of the axis. A Bayesian treed model is used to partition the input space by making splits on the value of a single variable, and in each partition (leaf of the tree) a stationary GP model is fitted. The advantage of the method is its ability to accurately model the possible discontinuities (‘jumps’ and ‘spikes’) in simulator output. However, this approach could run into difficulties when the number of simulator runs is limited and the variability in the 2

simulator behaviour is continuous. Ba and Joseph (2012) introduced composite Gaussian processes (CGP) to deal with nonstationarity. The idea is to use two independent and stationary GPs, where the first GP captures the global trend throughout the input space, while the second one is used for local adjustments around this global trend. A flexible variance model is introduced to capture the varying volatility across the input space. This is similar to the approach described above, capturing as much variability with a complex mean function and allowing local behaviour to be captured by a residual process. Montagna and Tokdar (2016) presented a new nonstationary GP model via latent input augmentation. The model consists of two stationary GPs, the first GP is expressed in terms of original inputs and the second GP depends on the latent input. The latent input is used to capture regions in the input space that correspond to abrupt changes in the simulator behaviour. In this paper we introduce a natural and interpretable approach to constructing Gaussian process emulators for computer experiments that aims to mimic an applied statistician’s approach to using diagnostics to fix initial stationary fits. Firstly, we perform standardized Leave One Out (LOO) diagnostics for a stationary GP emulator fitted to our training data. We specify a finite mixture model for the standardized LOO errors to identify L distinct regions of behaviour in the input space and to derive mixing functions over each of these regions. We then assign a Gaussian process prior for f whose covariance kernel is a mixture of L stationary covariance kernels, each belonging to the L previously identified regions. The article has the following structure. Section 2 gives an overview of the stationary GP emulator and describes the existing work in nonstationary GP modelling. Section 3 introduces our nonstationary GP emulator. Section 4 examines the performance of our nonstationary GP emulator for a number of idealised numerical examples. Section 5 presents a real-data application to the CNRM-CM5 Earth system model. In particular, the climate modellers we collaborate with are interested in calibrating cloud parameterizations within their global models so that the sub-grid scale processes they represent have similar performance to models that explicitly resolve these processes. During our work with this project we observed that some of the model outputs we needed to capture failed traditional diagnostics checks, motivating this work. Section 6 contains a discussion. 3

2. GAUSSIAN PROCESS EMULATORS 2.1

Stationary GPs

We define a computer simulator to be a function, f , that takes inputs x ∈ X and outputs a scalar, we discuss a generalisation to multiple outputs later in this section. We write the GP emulator as f (x)|β, σ 2 , δ, τ 2 ∼ GP h(x)T β, k(·, ·; σ 2 , δ) , Cov[f (x), f (x0 )] = k(x, x0 ; σ 2 , δ) = σ 2 r(x, x0 ; δ) + τ 2 1 x = x0 .

(1)

There are a number of popular forms of correlation function leading to a positive definite kernel, k, used routinely in the literature on computer experiments, such as the power exponential 0

r(x, x ) = exp

n

p X xj − x0j φj o , − δ j j=1

with 0 < φj ≤ 2, and φj = 2 corresponding to the popular squared exponential. The Matern covariance function is also popular (Santner et al. 2003, Rasmussen and Williams 2004). We can therefore consider f (x) to be the sum of 3 processes f (x) = h(x)T β + (x) + ν(x)

(2)

where h(x)T β represents a global response surface that captures dominant features of the model output, (x) is a correlated residual process capturing input dependent deviation from the surface and ν(x) is a nugget process (Andrianakis and Challenor 2012; Gramacy and Lee 2012) either representing noise in the simulator (such as the internal variability of a climate model) (Williamson et al. 2017) or for capturing variability due to inactive inputs not used in (x) (Andrianakis and Challenor 2012), or to ensure non-singularity of covariance matrices in the Bayesian update (Gramacy and Lee 2012). Let X = (x1 , . . . , xn )T at which we obtain training runs F = (f (x1 ), . . . , f (xn )). Conditioned on the hyperparameters f (x)|X, F , β, σ 2 , τ 2 , δ ∼ GP m∗ (x), K ∗ (·, ·)

with m∗ (x) = h(x)T β + k(x, X)K −1 (F − h(x)T β) 4

(3)

k ∗ (x, x0 ) = k(x, x0 ) − k(x, X)K −1 k(X, x0 ) and K = k(X, X). Multivariate emulation often uses the scalar output emulator as a building block. For example, basis methods that project high dimensional output onto a basis such as that found by principal components, do this in order to fit scalar, stationary emulators to the coefficients (Higdon et al. 2008, Bayarri et al. 2009, Salter et al. 2018). Rougier (2008) introduced outerproduct emulators that included space, time and output type as inputs to a scalar, stationary emulator for high dimensional output. The multivariate Gaussian Processes introduced by Conti and O’Hagan (2010) used separable covariance functions between inputs and outputs, with the input covariance, the same stationary form used for scalar emulators. There are a number of popular methods for fitting GPs. Currin et al. (1991) fit the hyperparameters by maximum likelihood. Haylock and O’Hagan (1996) showed that fixing δ first led to a t-process posterior prediction for f (x) under π(β, σ 2 ) ∝ 1/σ 2 . This was extended to a Normal-inverse gamma update by Oakley and O’Hagan (2004), given δ. Full Bayes MCMC methods with specific priors on δ have been used e.g. Higdon et al. (2008), Kaufman and Sain (2010) and many others. In practice even with very flexible or well-chosen prior distributions, there can be difficulties with stationary emulators for highly nonstationary simulators. Figure 1 demonstrates the performance of a stationary GP emulator for a 2D function considered by Ba and Joseph in the R package CGP’s reference manual (Ba and Joseph 2014). From the top left panel, the mean surface is too rough in the highly smooth area for x1 and x2 large. We also see overconfidence in the region where f fluctuates rapidly, for small values of x1 and x2 , and underconfidence in the region where the function is relatively smooth, for large values of x1 and x2 from the top right panel. The diagnostic plots recommended in Bastos and O’Hagan (2009) shown in the lower left and right panels of Figure 1 indicate of problems we commonly encounter when building emulators in practice. The leave one out cross validation plot (bottom left) shows that a calibration using this emulator would be inefficient in identifying optimal regions of parameter space (Kennedy and O’Hagan 2001; Salter et al. 2018), and the heteroskedasticity in the standardized errors plot (bottom right) is the typical issue with nonstationarity. How can these errors be mitigated in practice?

5

Figure 1: Failure of stationary GP emulator. Top left: Posterior mean predictive surface produced by stationary GP emulator with 24 design points in red. Top right: Emulator performance for the cross section x1 = x2 . The dashed line is the true function values, the solid black line is the posterior mean predictive curve, and the grey areas denote two standard deviation prediction intervals. Bottom left: Leave One Out diagnostic plot against x1 . The predictions and two standard deviation prediction intervals are in black. The true function values are in green if they lie within two standard deviation prediction intervals, or red otherwise. Bottom right: Individual standardized errors of emulator predictions against x1 .

6

2.2

Existing work in nonstationary GP modelling

A common approach in mitigating nonstationarity is to fit a more complicated response surface h(x)T β that captures global nonstationarity, leaving a stationary process residual (Rougier et al. 2009; Vernon et al. 2010; Williamson et al. 2013). If the right functions can be found and added to h(x), they should be used, but in practice this can be extremely difficult and requires too much manual fitting. In applications where emulators are needed quickly or automatically (say 100’s or 1000’s must be built to describe a complex computer code), this may not be feasible in such a way as to leave a stationary residual. Ba and Joseph (2012) introduce the Composite Gaussian Process (CGP), which consists of two processes, f (x) = Zglobal (x) + σ(x)Zlocal (x). The first process replaces h(x)T β with a stationary GP and captures the global behaviour. They specify that Zglobal (x) is GP distributed with mean µ and covariance τ 2 g(x, x0 ). The second process captures the local variability by including a variance model σ 2 (x) = σ 2 v(x), where σ 2 is a variance term and v(x) is a standardized volatility function which fluctuates around 1. Zlocal (x) is a zero-mean GP with covariance l(x, x0 ). Both g(x, x0 ) and l(x, x0 ) are the Gaussian correlation functions: 0

g(x, x ) = exp −

p X

θk (xk −

x0k )2

0

l(x, x ) = exp −

,

k=1

p X

αk (xk −

x0k )2

k=1

with their own correlation parameters that satisfy 0 ≤ θ ≤ αl and αl ≤ α. To ensure that Zglobal (x) is smoother than Zlocal (x), αl is set to be moderately large. The parameters in the CGP model are derived using maximum-likelihood estimation, which means that the full account of parameter uncertainty is not taken in the predictions contrary to the fully Bayesian framework used for other nonstationary methods. Gramacy and Lee (2008) developed a treed partition model to divide the input space by making binary splits on the value of single variables recursively. The Bayesian approach is applied to the partitioning process by specifying a prior through a tree-generating process and the tree is averaged out by integrating over possible trees using Reversible Jump MCMC (RJ-MCMC). In each leaf of the tree a stationary Gaussian Process is fitted. Partitioning does not guarantee continuity of the fitted function, because the posterior predictive surface 7

conditional on a particular tree is discontinuous across the partition boundaries, which translates into higher posterior predictive uncertainty near region boundaries. However, Bayesian model averaging provides mean fitted functions that are relatively smooth in practice. Pope et al. (2018) provided an extension to the classification trees for modelling discontinuities in the model response. Voronoi tessellation is applied to divide the input space into disjoint regions and an independent Gaussian Process is fitted to each region. RJ-MCMC is used for implementation, which requires a larger number of MCMC chains and iterations per chain to ensure the convergence. Another approach to modelling nonstationary simulators is developed by Montagna and Tokdar (2016). They introduce a latent variable Z into the model, which does not have any physical meaning and is used to stretch the distance between the inputs with potentially abrupt changes in function values. The idea is that we model simulator f (x) as f (x) = µ(x) + [x, Z]; θ where the first term has the same meaning as in subsection 2.1 and [x, Z]; θ is a zeromean GP with a covariance function that depends on the original vector of inputs x and a latent input Z. The correlation function has the following form: 0

0

K ((x, Z), (x , Z )) = exp

n

−

p X

o φk (xk − x0k )2 − φp+1 (Z − Z 0 )2 .

k=1

The latent input Z is considered as a continuous function of original inputs, Z = g(x), and a stationary GP is defined: ` g|θ ∼ GP (0, K), p n X o 0 0 2 ` ` K(x, x ) = exp − φk (xk − xk ) k=1

p+1 p where θ = β, σ 2 , φl l=1 , φ`l l=1 . Hence the method relies on two stationary GPs, one

for the function of interest and another for the latent input: f |θ ∼ GP (µ, σ 2 K),

` g|θ ∼ GP (0, K).

The extended process f (x, Z) is stationary, while the marginal f (x) is not. Posterior pre˜ is derived by averaging over the set of hyperparameters θ and the posterior diction for x distributions of the latent inputs. Montagna and Tokdar (2016) demonstrated the approach 8

to be effective over a number of numerical examples, though it may be computationally demanding if the goal is fast prediction over the whole parameter space, as in a calibration exercise. It may also be difficult to interpret Z and so to use priors for the latent process that lead to good fits.

3. NONSTATIONARITY VIA DIAGNOSTIC-DRIVEN COVARIANCE MIXTURES When fitting an emulator in practice, we would normally begin by fitting a stationary Gaussian Process and examining diagnostics to assess whether the emulator was sufficient. Possible failure of the stationarity assumption can then be checked from the plots of standardized errors against the model inputs (Bastos and O’Hagan 2009). We may notice, as we do in Figure 1, that the model is ‘well-behaved’ in some regions of the input space but not in others. For example, the standardized errors are close to zero for x1 and x2 close to 1 in Figure 1, yet the model changes rapidly and the standardized errors are large for small values of x1 and x2 . Approaches such as TGP explicitly model these as regions of distinct behaviour by axis-aligned partitioning of the input space and fitting distinct Gaussian Processes to each region, imposing discontinuities in the covariance function. Our approach captures the distinct regional behaviours we see in stationary diagnostics, yet uses input-dependent mixing functions to ensure a continuous covariance kernel. We develop this approach below.

3.1

Nonstationary GP through mixtures of stationary processes

Suppose, upon examining the diagnostics of a stationary GP emulator, as above, we identify L input regions of distinct model behaviour, Xl , l = 1, . . . , L (see below for our method for identifying these regions). We define f (x) as: T

f (x) = h(x) β +

L X

λl (x)l (x) +

l=1

L X

zl (x)νl (x),

(4)

l=1

where λ1 (x), . . . , λL (x) are input-dependent mixture components on the unit simplex. Here l (x) are independent, mean zero, Gaussian processes with covariance kernel kl (·, ·), variance

9

σl2 and correlation length parameters δ l for each input region l, so that l (x)|σl2 , δ l

∼ GP

0, kl (·, ·; σl2 , δ l )

.

The final term of equation (4) is a nugget process term. We specify a region specific nugget process term for each input region, νl (x) ∼ N (0, τl2 ),

zl (x) = 1 λl (x) = max λk (x) , k

allowing the nugget to vary between the regions. Given λ(x) our nonstationary GP is therefore f (x)|β, λ(x), σ 2 , τ 2 , δ ∼ GP h(x)T β, k(·, ·; σ 2 , δ)

(5)

with 0

2

k(x, x ; δ, σ ) =

L X

0

λl (x)λl (x )kl (x, x

0

; σl2 , δ l )

+1 x=x

l=1

0

L X

zl (x)zl (x0 )τl2 ,

l=1

so that the covariance kernel for our nonstationary GP is a mixture of stationary covariance kernels. This formulation allows us to specify a different type of process behaviour in L regions similar to TGP. However by mixing GPs in this way we can have non-zero covariance between the points from different regions. Our approach is related to the covariance kernel specification introduced by Banerjee et al. (2004) in spatial statistics. There, L knots, xl , l = 1, . . . , L, are specified in 2D space by applying a rectangle-partitioning approach and finding the centroids of the resulting rectangles. The optimal choice of L is determined by considering BIC (Schwarz 1978). Banerjee et al. (2004) employ a mixture λ(x) = α(x, xl ) that depends on the centroids via a distance function. Our approach avoids the requirement to find centroids, which will be more difficult in the p > 2 dimensions we typically encounter in computer experiments, allowing a more general partitioning based on smooth surfaces in p dimensions.

3.2

Modelling the mixture components

We consider the λ(x) as probabilities indicating the dominant local behaviour around x as described by l (x) and νl (x). Define f˜ (x) to be an ‘out of the box’ stationary emulator 10

(described in the beginning of section 4), fitted using design X and computer model runs F . Let ei denote the leave-one-out standardized cross validation residuals: ei =

f (xi ) − E[f˜ (x)|F −i ] SD[f˜ (x)|F −i ]

where F −i = (f (x1 ), . . . , f (xi−1 ), f (xi+1 ), . . . , f (xn )). We define a latent indicator process s(x) s(x) ∼ Multinomial(λ1 (x), . . . , λL (x)), and model the ei via ei |s(xi ) = l ∼ Normal(0, ζl ), where ζl is the standard deviation for the distribution of standardized errors in region l. Then, we can fit the λl (x) via, for example, categorical regression exp(g(x)T αl ) , λl (x) = PL T 0 l0 =1 exp(g(x) αl ) with a suitable prior π(α, ζ). In the examples and attached code we specify priors for parameters, ζl ∼ log N (−1, 1) and αl ∼ Normal(0, 5), representing weak prior information(Almond 2014; Stan Development Team 2017b). We constrain the prior standard deviation parameter to follow the ordering, ζ1 ≤ ζ2 ≤ · · · ≤ ζL , solving the problem of having multiple modes in the posterior distribution for mixture models, and ensuring good mixing of our Markov chains (Almond 2014). We use Stan (Stan Development Team 2017a), based on Hamiltonian Monte Carlo (HMC), for our inference, to enable users of our code to be very flexible with their prior choices. HMC does not provide sampling for discrete parameters (Stan Development Team 2017b), therefore, the posterior of the discrete group allocation indices, s = (s(x1 ), . . . , s(xn )), cannot be sampled directly and so we integrate s out in the likelihood. Mixture components are computed a posteriori. The regression coefficient parameters are sampled from the joint posterior, integrating over s, Z p(α|e, ζ) ∝

p(e|s, ζ)p(s|α)p(α)p(ζ)ds.

Sinse s is discrete, this is equivalent to p(α|e, ζ) ∝

n Y

L X

i=1

l=1

! P r(s(xi ) = l|α)p(ei |ζl ) p(α)p(ζ) 11

The mixture components for x are computed for all M posterior samples (α)M and the mean ˆ over all posterior samples is found λ(x): M 1 X ˆ λ(x; αm ). λ(x) = M m=1

ˆ is to operate with joint prior distribution An alternative to this approach for fitting λ π(β, σ 2L , τ 2L , δ L , λ(x)L , L) using reversible jump MCMC (Green 1995; Kim et al. 2005). Reverseble jump MCMC could be time-consuming and operationally expensive, requiring a longer warm-up period for the mixing of Markov chains and introducing the convergence problems (Green 1995; Pope et al. 2018). Also this model specification could raise identifiability issues for x, in particular to which input region x should be allocated. We discuss this in section 6. ˆ Our approach is diagnostic-led, automatable and fast. Fixing λ(x) in this way resembles the common and effective Cross Validation (CV) approach for estimating covariance hyperparameters for stationary GPs (Rasmussen and Williams 2004, Bachoc 2013).

4. CASE STUDIES We consider the performance of stationary and our proposed nonstationary GP emulators on a set of test functions. To avoid ‘cherry picking’ we use our in-house code for constructing stationary GP emulators based in Stan. The code is available in supplementary material. The default options of our code select h(x) = (1, x1 , . . . , xp )T and weakly informative prior β 2:p ∼ N (0, 10) with uniform intercept. We use a truncated normal prior for σ with mean ασ as the residual standard error from an OLS fit with h(x), and βσ the squared difference of the standard deviation of the training runs and ασ . The weakly informative δk ∼ Gamma(4, 4) is used for the correlation length parameters (Stan Development Team 2017b). We assign τ 2 to 0.0001 to facilitate stable matrix inversion as suggested by Andrianakis and Challenor (2012). To ensure comparability, we specify the same forms of prior distribution for model parameters β 2:p and σ, δ, τ 2 L for our nonstationary GP emulator. We use the same default prior settings in all examples that follow. We also demonstrate the performance of Bayesian TGP (Gramacy and Lee 2008) and composite GP (CGP) (Ba and Joseph 2012). To implement TGP and CGP, we used R 12

packages TGP with default settings and fixed nugget parameter at 0.01, note the nugget has a different interpretation to τ 2 in TGP (Gramacy 2007), and CGP with default settings (Ba and Joseph 2014). Both R packages are available from CRAN. We do not consider the performance of the nonstationary GP emulator suggested by Montagna and Tokdar (2016), analysed in subsection 2.2, because there is no available package for this method.

4.1

The 2D ‘wavy’ function

In this subsection, we examine the performance of the emulators on the nonstationary function considered by Ba and Joseph (2012). The ‘wavy’ function has the equation f (x1 , x2 ) = sin(1/(0.7x1 + 0.3)(0.7x2 + 0.3)) (x1 , x2 ∈ [0, 1]). It fluctuates rapidly when x1 and x2 is small, but gradually becomes smooth as x1 and x2 increases toward 1. See Figure 2.

Figure 2: True function for the two-dimensional numerical example and 24-run maximin distance LHD red points used as design.

We use a 24-run maximin distance Latin Hypercube (LHC) (Morris and Mitchell 1995) to train our emulators. Firstly, we construct a stationary GP emulator and plot the standardized errors to see if there is a particular subspace in the input space where the behaviour is different. From Figure 3 we notice that the variability of the errors depends on both inputs and there are two distinct identifiable regions of standardized error behaviour. We 13

Figure 3: Top row : ei against x1 and x2 . Bottom row : coloured ei : the deep blue colour corresponds to the higher probability of a point being allocated to region 1, while the deep red colour corresponds to the higher probability of a point being allocated to region 2.

therefore specify the number of distinct regions L = 2. For x1 , x2 < −0.5, the standardized errors exhibit large variability, i.e. standardized errors ranging from -1.5 to 2, while for input values greater than -0.5 the standardized errors variability significantly decreases. We use the model described in section 3.2 to derive mixing functions. Figure 4 demonstrates the performance of all four emulators for the wavy function. The predictive mean surface, fˆ, (top row) produced by our nonstationary emulator and by CGP resemble the image of the true function surface, while the stationary GP emulator and TGP emulator produce a number of ridges across the input space. From the second row of Figure 4 we observe that the stationary GP emulator produces the lowest values of predictive standard deviation around the design points, but due to the small values of the correlation length parameters, the information is quickly ‘dying’ away from the design points. Our nonstationary GP emulator and CGP manage to learn and explore the function behaviour for high values of x1 and x2 based on a few design points due to the stronger correlation structure in this region. 14

Figure 4: Comparison between stationary GP (st-GP ), our nonstationary GP (nst-GP ), TGP, and CGP. First row : posterior predictive mean surface and root mean squared error (rmse). Second row : posterior predictive standard deviation and maximum posterior predictive standard deviation (max psd). Third row : cross-section performance at x1 = x2 . The dashed line is the true function, the solid black line is the posterior mean predictive curve, and the grey areas denote two standard deviation prediction intervals.

The behaviour of the wavy function changes along the line where x1 = x2 . Figure 4 demonstrates the cross-section plot for the stationary model and shows the limitations of the stationary emulator, i.e. it is overconfident in the region where the function behaviour changes rapidly. Both the TGP and CGP models are under-confident across the cross-section, while our nonstationary GP emulator produces the lowest prediction intervals among all of the methods, through is perhaps under-confident during the transition to smoother behaviour x1 , x2 ≥ −0.5.

15

4.2

Nonstationarity in 5 dimensions

We now examine the performance of our method in higher dimensions, using the 5D function f (x) = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 sin(β6 x5 ), with β5 , β6 changing in each of 5 different intervals in x5 , as shown in Figure 5. Intervals in white correspond to five separate function behaviours, while the intervals in blue are a mixture of functions from two neighbouring regions to ensure continuity. Our choices of β5 , β6 impose significant variability in smoothness with changes in x5 . We choose to vary the stationarity properties along one axis in order to favour TGP, which partitions the input

−10

−5

Y

0

5

space along one input.

−1.0

−0.5

0.0

0.5

1.0

x5

Figure 5: Our 5D function f (x) plotted against x5

We generate a 4-extended LHC of size 100 following the methodology presented by Williamson (2015). The 100 member LHC is composed of four, 25 member LHCs, each added sequentially, ensuring that the composite design is orthogonal and space-filling at each stage of extension. We will compare the performance of all methods using Leave One Latin Hypercube Out (LOLHO) diagnostics, i.e. each row represents the predictions generated for a left out from the design LHC by refitted emulators. LOLHO diagnostics offer a sterner test for an emulator than LOO diagnostics and allow us to assess which areas of the input space do not validate well (Williamson 2015). To construct our nonstationary GP emulator we consider plots of standardized errors for each left out LHC, shown in Figure 6, to see if there is a particular subspace in the input space where the behaviour is different. We observe heteroskedasticity against x5 and see 16

that the standardized errors become highly correlated as x5 increases. We specify L = 2, allowing our model to be as simple as possible. Through the true function has L = 5, in practice, we will always try to specify as few regions as possible initially, and L = 2 works well in this example.

Figure 6: ei against the x5 for four sub-designs. The deep blue colour corresponds to the higher probability of a point being allocated to region 1, while the deep red colour corresponds to the higher probability of a point being allocated to region 2.

Figure 7 shows the LOLHO diagnostic plots for each of our 4 emulators (columns) for each of the sub-designs considered (rows). We observe that the main issue with our stationary GP emulator manifests in the wide prediction intervals produced in the regions where the test function is ‘well-behaved’ for x5 close to 1. Despite TGP outperforming other methods in the region where the function is ‘well-behaved’, it performs poorly for small values of x5 across all ensembles, perhaps due to a small number of design points in this region. Across all designs our proposed nonstationary GP emulator and CGP produce similar results. To perform a fair comparison between our stationary and proposed nonstationary GP emulators to TGP and CGP, we used the ‘out of the box’ specification defined in the beginning of the section and our proposed nonstationary GP emulator performed well across all of the sub-designs. However, we could implement further steps to demonstrate the potential for superior performance of our proposed method. Figure 8 shows the standardized error plots generated by our stationary GP emulator for the first sub-design, demonstrating the nonstationarity against x5 . In practice, noticing such behaviour would lead us to give stronger prior information for δ5 (Higdon et al. 2008; Williamson and Blaker 2014). We achieve this by keeping the same Gamma(4, 4) prior for δ5 but specifying a smoother prior 17

1.0

0.0

0.5

1.0

0.0

0.5

10 0 -20 10 0 -20

0.0

0.5

1.0

10 0

0.0

0.5

1.0

-1.0

0.0

0.5

1.0

10 0

0

10

CGP: Ens 4

1.0

-20

-20

0.5

1.0

-20

-1.0

Y

0.0

-1.0

TGP: Ens 4

10

-1.0

0.5

CGP: Ens 3

10

1.0

-20

1.0

1.0

-20

0.5

Y

0.5

0.5

Y

0.0

0

10 0

0.0

0.0

0

10

-1.0

nst-GP: Ens 4

-20

-1.0

-1.0

TGP: Ens 3

-20

Y

1.0

0.0

Y

1.0

0

10 0

0.5

st-GP: Ens 4

-1.0

CGP: Ens 2

10

-1.0

nst-GP: Ens 3

-20

0.0

1.0

-20

Y

1.0

st-GP: Ens 3

-1.0

0.5

0

10 -20

Y

0.5

0.0

TGP: Ens 2

0

10 0

0.0

-1.0

nst-GP: Ens 2

-20

-1.0

Y

0

-1.0

Y

0.5

st-GP: Ens 2

Y

0.0

-20

Y

0 -20

Y

0 -20

-1.0

CGP: Ens 1

10

TGP: Ens 1

10

nst-GP: Ens 1

10

st-GP: Ens 1

-1.0

0.0

0.5

1.0

-1.0

0.0

0.5

1.0

Figure 7: Leave One Latin Hypercube Out (LOLHO) plots for stationary GP (st-GP ), our nonstationary GP (nst-GP ), TGP and CGP. Each row is constructed by leaving one LHC out. The posterior mean and two standard deviation prediction intervals produced by emulators are in black. The true function values are in green if they lie within two standard deviation prediction intervals, or red otherwise.

in the other 4 dimensions via δ1 , . . . , δ4 ∼ Gamma(42, 9) (this distribution was chosen by using the MATCH elicitation tool (Morris et al. 2014) to capture a reasonable distribution giving more weight to longer correlation lengths). Figure 9 demonstrates the superior performance of our non-stationary GP emulator with this weakly informed choice of priors for the correlation length parameters across all sub-designs.

18

-1.0

-0.5

0.0

0.5

1.0

-0.5

0.0

0.5

1.0

-1.0

-0.5

x2

0.0

0.5

1.0

2 -2

-1

0

std.err

1

2 1 -1 -2

-1 -2

-1.0

x1

Ens 1

0

std.err

1

2

Ens 1

0

std.err

1 -2

-1

0

std.err

1 0 -2

-1

std.err

Ens 1

2

Ens 1

2

Ens 1

-1.0

-0.5

x3

0.0

0.5

1.0

-1.0

-0.5

x4

0.0

0.5

1.0

x5

Figure 8: ei for first sub-design against the model inputs.

0.0 0.5 1.0

-1.0

0.0 0.5 1.0

0 10 -20 10 -20

Y -1.0

10 -20

Y 0.0 0.5 1.0

-1.0

0.0 0.5 1.0

0.0 0.5 1.0

0 10 -20

Y

0 10

CGP: Ens 4

-20

Y -20

0.0 0.5 1.0

0

10 -20

Y

-1.0

TGP: Ens 4

0 10

0 10 -20

Y

-1.0

-1.0

CGP: Ens 3

0

10 -20

0.0 0.5 1.0 nst-GP: Ens 4

0.0 0.5 1.0

0.0 0.5 1.0 TGP: Ens 3

0

10 0 -20

Y

-1.0

0.0 0.5 1.0

0

10 -20

Y 0.0 0.5 1.0 nst-GP: Ens 3

0.0 0.5 1.0

-1.0

CGP: Ens 2

0

10 0 -1.0

st-GP: Ens 4

-1.0

0.0 0.5 1.0 TGP: Ens 2

-20

Y

0 -20

0.0 0.5 1.0 st-GP: Ens 3

-1.0

-1.0

nst-GP: Ens 2

10

st-GP: Ens 2

-1.0

Y -20

Y -20

Y -20 -1.0

CGP: Ens 1

0 10

TGP: Ens 1

0 10

nst-GP: Ens 1

0 10

st-GP: Ens 1

-1.0

0.0 0.5 1.0

-1.0

0.0 0.5 1.0

Figure 9: Leave One Latin Hypercube Out plots similar to figure 7 with a modified prior for correlation length parameters for our nonstationary GP emulator (nst-GP ).

19

4.3

Stationarity check example

Following Ba and Joseph (2012) we check the performance of our emulator on stationary data. We perform LOO diagnostics for our stationary and nonstationary GP models on sample paths generated by three different zero-mean bivariate Gaussian processes with varying design X and correlation length parameters δ. We start by producing plots of standardized errors and we observe from Figure 10 there is no sign of unusual behaviour of standardized errors against x1 . We specify L = 2 as the default setting for the number of distinct regions. The mixture model identifies two distinct regions for the last ensemble.

Figure 10: ei plotted against x1 for three randomly sampled stationary paths. The deep blue colour corresponds to the higher probability of a point being allocated to region 1, while the deep red colour corresponds to the higher probability of a point being allocated to region 2.

Figure 11 demonstrates the posterior samples of region-specific parameters generated by our nonstationary GP model for the last ensemble and we notice that the histograms of posterior distributions for region specific parameters look similar across the regions. We observe that the performance of our stationary GP emulator (top row) and nonstationary GP emulator (bottom row) are almost identical in Figure 12.

20

Figure 11: Histograms of posterior samples of nonstationary GP model parameters for ensemble 3.

Figure 12: Top row : LOO diagnostic against x1 for stationary GP (st-GP ) for three randomly sampled stationary paths. Bottom row : LOO diagnostic against x1 for nonstationary GP (nstGP ). The posterior mean predictions and two standard deviation prediction intervals are in black. The true values are in ether green, if they are within the prediction intervals, or red otherwise.

21

5. CNRM-CM5 EXPERIMENTS In this section we present work that has been done as part of the ANR (The French National Research Agency) funded HIGH-TUNE project. The primary objective of the project is to improve and tune the boundary-layer cloud parameterization of the French General Circulation Model (Voldoire et al. 2013). Boundary-layer clouds are crucial component climate model because of their effect on water and energy cycles of atmosphere and surface temperature (Bony and Dufresne 2005). These clouds are not resolved by solvers at the resolution used for running GCM’s and so are parameterized inside the models through a set of approximating equations dependent on a number of ‘free’ (internal) parameters. In the climate community the estimation of these parameters is called tuning (Hourdin et al. 2017). The HIGH-TUNE project aims to improve the representation of boundary-layer clouds by comparing single-column versions of the GCM’s (SCM’s) in different regions of the globe with explicit 3D high-resolution Large-Eddy simulations (LES) over the same regions. In particular, history matching (Williamson et al. 2015, 2017), a statistical approach to tuning, will be performed in order to rule out the free parameter choices for the SCM’s that are not effectively mimicking the LES results. It is common practice to treat LES as surrogate observations for tuning, because it is difficult to observe real clouds at the required temporal and spatial scales. In climate modelling this concept is termed ‘process based tuning’. Emulators for the SCM’s will be required whenever new parameterizations are tried. In our work with the HIGH-TUNE team we have found using routine stationary GP’s insufficient for this task and so we present the performance of our nonstationary method within this application. The tuning, what we would term calibration, of these models is beyond the scope of this paper. We will consider the average potential temperature generated by the SCM by varying nine input parameters of interest. Firstly we identified the physically plausible ranges of the input model parameters with the HIGH-TUNE team and standardized these to the range [-1, 1], which is the usual practice in constructing emulators for computer experiments. We generated a 160 member LHC composed of four, 40 member LHCs (Williamson 2015), similar to the design for the 5D example in Section 4.2. 22

Figure 13 plots the average potential temperature against each input. We see that the average potential temperature varies most with the input ALFX, i.e. the variability in the response of the average potential temperature increases as ALFX increases. For our stationary and nonstationary GP emulators we specified a linear mean function in the nine inputs. To validate the performance of our emulators we used LOLHO diagnostics (Williamson 2015).

Figure 13: Average potential temperature against nine standardized inputs to SCM (Single Column Model).

23

Ensemble 2

2

0

-2

0

-2

-1.0

-0.5

0.0

ALFX

0.5

1.0

Ensemble 4

2

2

std.err

std.err

std.err

2

Ensemble 3

std.err

Ensemble 1

0

-2

-2

-1.0

-0.5

0.0

0.5

1.0

ALFX

0

-1.0

-0.5

0.0

ALFX

0.5

1.0

-1.0

-0.5

0.0

0.5

1.0

ALFX

Figure 14: Coloured ei against ALFX for four sub-designs. The deep blue colour corresponds to the higher probability of a point being allocated to region 1, while the deep red colour corresponds to the higher probability of a point being allocated to region 2.

We specified L = 2 allowing our model to be as simple as possible and derive the mixture functions. We also checked the performance of the model with L = 3, but haven’t observed any improvement in the emulator performance. We compare the performance of our nonstationary GP emulator to the stationary GP emulator, TGP and CGP. From Figure 15 we observe that the stationary GP emulator fails to recognise the variability of the model response in relation to ALFX, i.e. the length of two standard deviation prediction intervals is the same across the whole range of ALFX. TGP demonstrates satisfactory performance for all four validation ensembles, however due to the hard partitioning mentioned in subsection 2.2 the two standard deviation prediction intervals increase significantly for the ALFX>0. CGP performs well in the input region where the model is well-behaved, however is over-confident in the region where the model response varies the most, especially for sub-designs 2 and 4. Our nonstationary GP emulator demonstrates a gradual increase in the length of prediction intervals with increasing ALFX as we observe in the data.

24

0.0 0.5 1.0

305.8 305.2

Y -1.0

0.0 0.5 1.0

-1.0

0.0 0.5 1.0

-1.0

0.0 0.5 1.0

305.8 305.0

Y

305.0

Y

305.0

Y 0.0 0.5 1.0

305.8

ALFX CGP: Ens 2

305.8

ALFX TGP: Ens 2

305.8

ALFX nst-GP: Ens 2

-1.0

0.0 0.5 1.0

-1.0

0.0 0.5 1.0 ALFX

st-GP: Ens 3

nst-GP: Ens 3

TGP: Ens 3

CGP: Ens 3

-1.0

0.0 0.5 1.0

305.0

Y

305.0

Y

305.0

Y

305.0

0.0 0.5 1.0

305.8

ALFX

305.8

ALFX

305.8

ALFX

-1.0

0.0 0.5 1.0

-1.0

0.0 0.5 1.0 ALFX CGP: Ens 4

ALFX

-1.0

0.0 0.5 1.0

305.2

Y

305.2

Y

305.2

Y 0.0 0.5 1.0

-1.0

ALFX

305.8

ALFX TGP: Ens 4

305.8

ALFX nst-GP: Ens 4

305.8

ALFX st-GP: Ens 4

305.2 -1.0

305.8 305.2

Y -1.0

ALFX

305.8

-1.0

CGP: Ens 1

st-GP: Ens 2

305.8

-1.0

Y

305.8

0.0 0.5 1.0

305.0

Y

-1.0

Y

TGP: Ens 1

305.2

Y

305.8

nst-GP: Ens 1

305.2

Y

st-GP: Ens 1

0.0 0.5 1.0 ALFX

-1.0

0.0 0.5 1.0 ALFX

Figure 15: Comparison between stationary GP (st-GP ), nonstationary GP (nst-GP ), TGP, and CGP on modelling average potential temperature on four validation designs. Each row is constructed by leaving one LHC out. The posterior mean and two standard deviation prediction intervals produced by emulators are in black. The green and red points are the model values, coloured ‘green’ if they lie within two standard deviation prediction intervals and ‘red’ if they lie outside.

25

6. DISCUSSION In this paper we have introduced a non-stationary, diagnostic-driven, GP emulation of computer models via covariance kernel mixtures. We construct our nonstationary GP emulator in two steps. Firstly, we check if the stationary GP emulator performs well by considering plots of individual standardized errors against inputs in order to identify any signs of nonstationarity/heteroskedasticity (Bastos and O’Hagan 2009). We then fit a mixture model to the individual standardized errors to produce a mixture function prescribing the covariance kernel mixture for a nonstationary GP. We specify region-specific stationary covariance kernels, then establish the covariance kernel for our nonstationary GP as a mixture of these. The numerical examples together with the real-data application demonstrated the competitive performance of our method at its default settings compared to the main nonstationary methods implemented in software, TGP and CGP. A motivation for our approach is to mimic the approach we take to building emulators for computer models in practice. We may fit many such emulators to different model output over a session with our collaborators and use diagnostics from stationary fits to decide how to proceed. Our method treats these diagnostics as data that we then use to fit nonstationary models, making the fitting of many emulators more automatic, so that ultimately it can be done in house by the modellers themselves using our software. There are a number of possible extensions to the methodology presented. Firstly, we specify the number of regions L by considering plots of standardized errors against the model inputs, and the choice of L is based on subjective expert judgement. The alternative is to construct an automated process by considering WAIC and/or BIC measures and choose the number of regions L which provides the best WAIC and/or BIC measures (Almond 2014). Secondly, we may remove the 2 stage approach altogether by operating with joint prior distribution π(β, σ 2L , τ 2L , δ L , λ(x)L , L) using reversible jump MCMC (Green 1995; Kim et al. 2005) to attempt full Bayesian inference. Intelligent choice of prior distribution in order to avoid confounding will be important here. Finally, we would be interested to further develop diagnostics for types of nonstationarity and tailor methods/priors to these. Acknowledgements Daniel Williamson was funded by EPSRC fellowship (EP/K019112/1) and by the

26

NERC EuroClim project (NE/M006123/1). The authors have benefited from the discussions and data provision with members of the HIGH-TUNE project and would like to thank Frederic Hourdin, Fleur Couvreux and the climate modelling teams at LMDZ and CNRM.

References Almond, R. G. (2014), “A comparison of two MCMC algorithms for hierarchical mixture models,” CEUR Workshop Proceedings, 1218, 1–19. Andrianakis, I. and Challenor, P. G. (2012), “The effect of the nugget on Gaussian process emulators of computer models,” Computational Statistics and Data Analysis, 56, 4215– 4228. Ba, S. and Joseph, V. R. (2012), “Composite Gaussian process models for emulating expensive functions,” Annals of Applied Statistics, 6, 1838–1860. — (2014), CGP: Composite Gaussian process models, r package version 2.0-2. Bachoc, F. (2013), “Cross Validation and Maximum Likelihood estimations of hyperparameters of Gaussian processes with model misspecification,” Computational Statistics & Data Analysis, 66, 55–69. Banerjee, S., Gelfand, A. E., Knight, J. R., and Sirmans, C. F. (2004), “Spatial Modeling of House Prices Using Normalized Distance-Weighted Sums of Stationary Processes,” Journal of Business and Economic Statistics, 22, 206–213. Bastos, L. S. and O’Hagan, A. (2009), “Diagnostics for Gaussian Process Emulators,” Technometrics, 51, 425–438. Bayarri, M. J., Berger, J. O., Cafeo, J., Garcia-Donato, G., Liu, F., Palomo, J., Parthasarathy, R. J., Paulo, R., Sacks, J., and Walsh, D. (2007), “Computer model validation with functional output,” The Annals of Statistics, 35, 1874–1906.

27

Bayarri, M. J., Berger, J. O., Calder, E. S., Dalbey, K., Lunagomez, S., Patra, A. K., Pitman, E. B., Spiller, E. T., and Wolpert, R. L. (2009), “Using Statistical and Computer Models to Quantify Volcanic Hazards,” Technometrics, 51, 402–413. Bony, S. and Dufresne, J. (2005), “Marine boundary layer clouds at the heart of tropical cloud feedback uncertainties in climate models,” Geophysical Research Letters, 32, L20806. Chipman, H. A., George, E. I., and McCulloch, R. E. (1998), “Bayesian CART Model Search,” Journal of the American Statistical Association, 93, 935–948. Conti, S. and O’Hagan, A. (2010), “Bayesian emulation of complex multi-output and dynamic computer models,” Journal of Statistical Planning and Inference, 140, 640–651. Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991), “Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments,” Journal of the American Statistical Association, 86, 953–963. Farneti, R. and Vallis, G. K. (2009), “An Intermediate Complexity Climate Model (ICCMp1) based on the GFDL flexible modelling system,” Geoscientific Model Development, 2, 73– 88. Frierson, D. M. W., Held, I. M., and Zurita-Gotor, P. (2006), “A Gray-Radiation Aquaplanet Moist GCM. Part I: Static Stability and Eddy Scale,” Journal of the Atmospheric Sciences, 63, 2548–2566. Gelfand, A. E., Kim, H. J., Sirmans, C. F., and Banerjee, S. (2003), “Spatial modeling with spatially varying coefficient processes,” Journal of the American Statistical Association, 98, 387–396. Gettelman, A. and Rood, R. B. (2016), Demystifying Climate Models, vol. 2. Gramacy, R. B. (2007), “An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models,” Journal Of Statistical Software, 19, 1–46. Gramacy, R. B. and Lee, H. K. (2012), “Cases for the nugget in modeling computer experiments,” Statistics and Computing, 22, 713–722. 28

Gramacy, R. B. and Lee, H. K. H. (2008), “Bayesian Treed Gaussian Process Models With an Application to Computer Modeling,” Journal of the American Statistical Association, 103, 1119–1130. Green, P. (1995), “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, 82, 711–732. Haylock, R. and O’Hagan, A. (1996), “On inference for outputs of computationally expensive algorithms with uncertainty on the inputs,” Bayesian Statistics, 5, 629–637. Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008), “Computer Model Calibration Using High-Dimensional Output,” Journal of the American Statistical Association, 103, 570–583. Hoffman, M. D. and Gelman, A. (2011), “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” 15, 1593–1623. Hourdin, F., Mauritsen, T., Gettelman, A., Golaz, J.-C., Balaji, V., Duan, Q., Folini, D., Ji, D., Klocke, D., Qian, Y., Rauser, F., Rio, C., Tomassini, L., Watanabe, M., and Williamson, D. (2017), “The Art and Science of Climate Model Tuning,” Bulletin of the American Meteorological Society, 98, 589–602. Joseph, V. R. and Hung Ying (2008), “Orthogonal-Maximin Latin Hypercube Designs,” 18, 171–186. Kaufman, C. G. and Sain, S. R. (2010), “Bayesian functional {ANOVA} modeling using Gaussian process prior distributions,” Bayesian Analysis, 5, 123–149. Kennedy, M. and O’Hagan, A. (2000), “Predicting the output from a complex computer code when fast approximations are available,” Biometrika, 87, 1–13. Kennedy, M. C. and O’Hagan, A. (2001), “Bayesian calibration of computer models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 425–464. Kim, H.-M., Mallick, B. K., and Holmes, C. C. (2005), “Analyzing Nonstationary Spatial Data Using Piecewise Gaussian Processes,” Journal of the American Statistical Association, 100, 653–668. 29

Montagna, S. and Tokdar, S. T. (2016), “Computer Emulation with Nonstationary Gaussian Processes,” SIAM/ASA Journal on Uncertainty Quantification, 4, 26–47. Morris, D. E., Oakley, J. E., and Crowe, J. A. (2014), “A web-based tool for eliciting probability distributions from experts,” Environmental Modelling & Software, 52, 1–4. Morris, M. D. and Mitchell, T. J. (1995), “Exploratory Designs for Computer Experiments,” J. Statist. Plann. Inference, 43, 381–402. Oakley, J. and O’Hagan, A. (2002), “Bayesian inference for the uncertainty distribution of computer model outputs,” Biometrika, 89, 769–784. Oakley, J. E. and O’Hagan, A. (2004), “Probabilistic sensitivity analysis of complex models: a Bayesian approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 751–769. Oughton, R. H. and Craig, P. S. (2016), “Hierarchical Emulation: A Method for Modeling and Comparing Nested Simulators,” Journal on Uncertainty Quantification, 4, 495–519. Paciorek, C. J. and Schervish, M. J. (2006), “Spatial modelling using a new class of nonstationary covariance functions,” Environmetrics, 17, 483–506. Pope, C. A., Gosling, J. P., Barber, S., Johnson, J., Yamaguchi, T., Feingold, G., and Blackwell, P. (2018), “Modelling spatial heterogeneity and discontinuities using Voronoi tessellations,” . Rasmussen, C. E. and Williams, C. K. I. (2004), Gaussian processes for machine learning., vol. 14. Rougier, J. (2008), “Efficient Emulators for Multivariate Deterministic Functions,” Journal of Computational and Graphical Statistics, 17, 827–843. Rougier, J., Guillas, S., Maute, A., and Richmond, A. D. (2009), “Expert Knowledge and Multivariate Emulation: The ThermosphereIonosphere Electrodynamics General Circulation Model (TIE-GCM),” Technometrics, 51, 414–424.

30

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), “Design and Analysis of Computer Experiments,” Statistical Science, 4, 409–423. Salter, J. M., Williamson, D. B., Scinocca, J., and Kharin, V. (2018), “Uncertainty quantification for spatio-temporal computer models with calibration-optimal bases,” . Santner, T. J., Williams, B. J., and Notz, W. I. (2003), The Design and Analysis of Computer Experiments, Springer Series in Statistics, New York, NY: Springer New York. Schwarz, G. (1978), “Estimating the Dimension of a Model,” The Annals of Statistics, 6, 461–464. Severijns, C. A. and Hazeleger, W. (2005), “Optimizing Parameters in an Atmospheric General Circulation Model,” Journal of Climate, 18, 3527–3535. Stan Development Team (2017a), “RStan: the R interface to Stan,” R package version 2.17.2. — (2017b), “Stan Modeling Language User’s Guide and Reference Manual,” . Vernon, I., Goldsteiny, M., and Bowerz, R. G. (2010), “Galaxy formation: A Bayesian uncertainty analysis,” Bayesian Analysis, 5, 619–670. Voldoire, A., Sanchez-Gomez, E., Salas y Mélia, D., Decharme, B., Cassou, C., Sénési, S., Valcke, S., Beau, I., Alias, A., Chevallier, M., Déqué, M., Deshayes, J., Douville, H., Fernandez, E., Madec, G., Maisonnave, E., Moine, M.-P., Planton, S., Saint-Martin, D., Szopa, S., Tyteca, S., Alkama, R., Belamari, S., Braun, A., Coquart, L., and Chauvin, F. (2013), “The CNRM-CM5.1 global climate model: description and basic evaluation,” Climate Dynamics, 40, 2091–2121. Williamson, D. (2015), “Exploratory ensemble designs for environmental models using kextended Latin Hypercubes,” Environmetrics, 26, 268–283. Williamson, D. and Blaker, A. T. (2014), “Evolving Bayesian Emulators for Structured Chaotic Time Series, with Application to Large Climate Models,” SIAM/ASA Journal on Uncertainty Quantification, 2, 1–28.

31

Williamson, D., Blaker, A. T., Hampton, C., and Salter, J. (2015), “Identifying and removing structural biases in climate models with history matching,” Climate Dynamics, 45, 1299– 1324. Williamson, D., Goldstein, M., Allison, L., Blaker, A., Challenor, P., Jackson, L., and Yamazaki, K. (2013), “History matching for exploring and reducing climate model parameter space using observations and a large perturbed physics ensemble,” Climate Dynamics, 41, 1703–1729. Williamson, D., Goldstein, M., and Blaker, A. (2012), “Fast linked analyses for scenariobased hierarchies,” Journal of the Royal Statistical Society. Series C: Applied Statistics, 61, 665–691. Williamson, D. B., Blaker, A. T., and Sinha, B. (2017), “Tuning without over-tuning: Parametric uncertainty quantification for the NEMO ocean model,” Geoscientific Model Development, 10, 1789–1816.

32