Submitted to the Annals of Statistics

PRINCIPLES OF EXPERIMENTAL DESIGN FOR GAUSSIAN PROCESS EMULATORS OF DETERMINISTIC COMPUTER EXPERIMENTS

By Benjamin Haaland^{1,2} and Vaibhav Maheshwari^{2,3}

arXiv:1411.7049v1 [math.ST] 25 Nov 2014

^1 Georgia Institute of Technology, ^2 Duke-NUS Graduate Medical School, and ^3 Renal Research Institute
E-mail: [email protected]; [email protected]

Abstract: Computer experiments have become ubiquitous in science and engineering. Commonly, runs of these simulations demand considerable time and computing, making experimental design extremely important in gaining high quality information with limited time and resources. Principles of experimental design are proposed and justified which ensure high nominal, numeric, and parameter estimation accuracy for Gaussian process emulation of deterministic simulations. The space-filling properties "small fill distance" and "large separation distance" are only weakly conflicting and ensure well-controlled nominal, numeric, and parameter estimation error, while non-stationarity requires a greater density of experimental inputs in regions of the input space with more quickly decaying correlation. This work will provide scientists and engineers with robust and practically useful overarching principles for selecting combinations of simulation inputs with high information content.

Keywords and phrases: Computer Experiment, Emulation, Experimental Design, Interpolation, Gaussian Process, Reproducing Kernel Hilbert Space

1. Introduction. Scientists and engineers use complex mathematical models implemented in large computer codes, or computer experiments, to study real systems for which actual experimentation is often infeasible. For example, a computational fluid dynamics simulation could be used to compare outflow rates of various sclera flap geometries in trabeculectomy [21], or mosquito population dynamics could be coupled with dengue transmission models to study urban dengue control [6]. Often, a thorough exploration of the unknown simulation function, or mean simulation function, is wanted. However, the simulation is typically expensive enough that this exploration must be conducted very wisely. A seemingly high-quality solution is to evaluate the expensive simulation at several well-distributed data sites and then build an inexpensive approximation, or emulator, for the simulation. The accuracy of an emulator depends very strongly on the manner in which data is collected from the expensive function. However, even the apparently high-quality experimental designs proposed by the relatively small cadre of experts in experimental design for simulations are largely based on non-rigorous analogy to results from numerical integration, or on assumptions which are quite dissimilar from those used in practice. For example, consider maxi-min Latin hypercube designs, which seem to have high information content in deterministic computer experiments. A Latin hypercube is an experimental design for several continuous inputs which has the property that each one-dimensional projection of the design achieves maximal one-dimensional uniformity. A maxi-min Latin hypercube is a Latin hypercube design whose nearest two points are farthest apart. Latin hypercube designs were originally proposed for numerical integration [14] and can be shown to achieve a faster rate of convergence to the true value of the integral due to their one-dimensional uniformity [20]. It is believed that this one-dimensional uniformity will lead to a more accurate emulator, but there are few results in place. Presently, there is a pressing need for robust and practically useful principles of experimental design to guide scientists and engineers in collecting data from their simulation experiments [25]. Traditional experimental design principles such as replication, blocking, and randomization are not applicable in the context of deterministic computer experiments. Here, we develop principles of data collection for Gaussian process emulation of deterministic computer experiments which are broadly applicable and rigorously justified.

Three sources of inaccuracy will be considered: nominal error, numeric error, and emulation error due to parameter estimation. Error in parameter estimation refers to the difference between the true and estimated parameters of the emulator; numeric error refers to the difference between a computed quantity and the quantity that one wants to compute, which results from floating point arithmetic; and nominal error refers to the difference between the quantity that one wants to compute and the target quantity. Stationary and non-stationary situations, as well as regression functions, will be considered. The overall approach will be to provide bounds for each of the nominal, numeric, and parameter estimation errors in terms of properties of the experimental design $X$. It will be shown that the nominal, numeric, and parameter estimation criteria are only weakly conflicting, and lead to broadly similar experimental designs.

The remainder of this article is organized as follows. In Sections 3, 4, and 5, bounds on the nominal, numeric, and parameter estimation error, respectively, are developed. In each section, experimental design characteristics which lead to small error bounds are discussed and a few examples are given. In Section 6, the implications of these principles are briefly discussed.

2. Preliminaries. Let $f : \Omega \to \mathbb{R}$ denote the function linking a computer experiment's input to its output, for $\Omega \subset \mathbb{R}^d$. Further, let $\hat f_\vartheta$ denote the nominal emulator at a particular value of the parameters $\vartheta$, and $\tilde f_\vartheta$ denote the numeric emulator at a particular value of the parameters $\vartheta$. The numeric emulator represents the emulator which is calculated using floating point arithmetic, while the nominal emulator represents the idealized, exact arithmetic, version thereof. Then, for any norm $\|\cdot\|$, particular value of the parameters $\vartheta^*$, and corresponding parameter estimate $\hat\vartheta$, the normed deviation of the emulator from the computer experiment can be decomposed into nominal, numeric, and parameter estimation components using the triangle inequality as shown below.

(1)
$$
\|f - \tilde f_{\hat\vartheta}\|
= \|f - \hat f_{\vartheta^*} + \hat f_{\vartheta^*} - \hat f_{\hat\vartheta} + \hat f_{\hat\vartheta} - \tilde f_{\hat\vartheta}\|
\le \underbrace{\|f - \hat f_{\vartheta^*}\|}_{\text{nominal}}
+ \underbrace{\|\hat f_{\hat\vartheta} - \tilde f_{\hat\vartheta}\|}_{\text{numeric}}
+ \underbrace{\|\hat f_{\vartheta^*} - \hat f_{\hat\vartheta}\|}_{\text{parameter}}.
$$

Note that inequality (1) does not make any assumption about the norm or type of emulator used. It is also noteworthy that this error decomposition considers numeric error in evaluation of the interpolator and does not consider numeric error in the parameter estimation process. Explicit consideration of numeric error in parameter estimation would result in a fourth term in the error decomposition. Hereafter, the $L_2$ norm on the domain of interest $\Omega$ will be considered,

(2)
$$
\|g\| = \|g\|_{L_2(\Omega)} = \left( \int_\Omega g(x)^2 \, dx \right)^{1/2}.
$$

For the $L_2$ norm (2) and any expectation $E$ we have

(3)
$$
E\|g\| = E\left( \int_\Omega g(x)^2 \, dx \right)^{1/2} \le \left( \int_\Omega E g(x)^2 \, dx \right)^{1/2},
$$

by Jensen's inequality and Tonelli's theorem [4]. Applying relation (3) to the error decomposition (1) gives

(4)
$$
\begin{aligned}
E\|f - \tilde f_{\hat\vartheta}\|
&\le E\|f - \hat f_{\vartheta^*}\| + E\|\hat f_{\hat\vartheta} - \tilde f_{\hat\vartheta}\| + E\|\hat f_{\vartheta^*} - \hat f_{\hat\vartheta}\| \\
&\le \left( \int_\Omega E\{f(x) - \hat f_{\vartheta^*}(x)\}^2 \, dx \right)^{1/2}
+ \left( \int_\Omega E\{\hat f_{\hat\vartheta}(x) - \tilde f_{\hat\vartheta}(x)\}^2 \, dx \right)^{1/2}
+ \left( \int_\Omega E\{\hat f_{\vartheta^*}(x) - \hat f_{\hat\vartheta}(x)\}^2 \, dx \right)^{1/2}.
\end{aligned}
$$

Throughout, consider a Gaussian Process (GP) model for interpolation, $f \sim \mathrm{GP}(h(\cdot)'\beta, \Psi_\theta(\cdot,\cdot))$, for some fixed, known regression functions $h(\cdot)$, and let $\vartheta = (\beta' \; \theta')'$. It is assumed that $\Psi_\theta(\cdot,\cdot)$ is a positive definite function [24]. For a particular dataset $X = \{x_1, \ldots, x_n\}$ and input of interest $x$, the best linear unbiased predictor (BLUP) is

(5)
$$
\hat f_\vartheta(x) = h(x)'\hat\beta + \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}\left( f(X) - H(X)\hat\beta \right),
$$

where $\Psi_\theta(A, B) = \{\Psi_\theta(a_i, b_j)\}$ and $f(A) = \{f(a_i)\}$ for $A = \{a_i\}$ and $B = \{b_j\}$, $H(X)$ has rows $h(x_i)'$, $\hat\beta = \left( H(X)'\Psi_\theta(X, X)^{-1}H(X) \right)^{-1} H(X)'\Psi_\theta(X, X)^{-1} f(X)$, and $\theta$ equals the vector of true correlation parameters [17]. Note that the BLUP, as shown in (5), is the nominal emulator.
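To make (5) concrete, the BLUP can be assembled from a handful of linear solves. The following Python sketch is ours, not part of the paper; the kernel, regression function, and test data are illustrative assumptions, and a numerically careful implementation would reuse a Cholesky factorization of $\Psi_\theta(X, X)$ rather than repeated generic solves.

```python
import numpy as np

def blup(x, X, fX, h, psi):
    """Nominal BLUP (5): h(x)'beta_hat + Psi(x,X) Psi(X,X)^{-1} (f(X) - H beta_hat)."""
    H = np.stack([h(xi) for xi in X])                      # n x p regression matrix
    K = np.array([[psi(u, v) for v in X] for u in X])      # Psi_theta(X, X)
    k = np.array([psi(x, v) for v in X])                   # Psi_theta(x, X)
    Kf, KH = np.linalg.solve(K, fX), np.linalg.solve(K, H)
    beta = np.linalg.solve(H.T @ KH, H.T @ Kf)             # generalized least squares
    return h(x) @ beta + k @ np.linalg.solve(K, fX - H @ beta)

# Toy illustration: one input dimension, constant mean, Gaussian correlation
psi = lambda u, v: float(np.exp(-np.sum((4.0 * (u - v)) ** 2)))
h = lambda x: np.array([1.0])
X = np.linspace(0.0, 1.0, 8)[:, None]
fX = np.sin(2.0 * np.pi * X[:, 0])
print(blup(np.array([0.3]), X, fX, h, psi))
```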

3. Nominal Error. Focusing on the first term in the error decomposition (4), the nominal or mean squared prediction error (MSPE) is given by [17]

(6)
$$
E\{f(x) - \hat f_\vartheta(x)\}^2
= \Psi_\theta(x, x)
- \begin{pmatrix} h(x)' & \Psi_\theta(x, X) \end{pmatrix}
\begin{pmatrix} 0 & H(X)' \\ H(X) & \Psi_\theta(X, X) \end{pmatrix}^{-1}
\begin{pmatrix} h(x) \\ \Psi_\theta(X, x) \end{pmatrix}.
$$

Note that throughout this section, the unknown parameters are taken at their true values. In line with intuition, the following proposition shows that increasing the number of data points always reduces the nominal error.

Proposition 3.1. If $f \sim \mathrm{GP}(h(\cdot)'\beta, \Psi_\theta(\cdot,\cdot))$, for fixed, known regression functions $h(\cdot)$ and $X_1 \subseteq X_2$, then $\mathrm{MSPE}_2 \le \mathrm{MSPE}_1$, where $\mathrm{MSPE}_1$ and $\mathrm{MSPE}_2$ denote the MSPE of the BLUPs based on $X_1$ and $X_2$, respectively.

Proof. Express $\mathrm{MSPE}_2$ from equation (6) in terms of partitioned matrices,

(7)
$$
\mathrm{MSPE}_2 = \Psi_\theta(x, x)
- \begin{pmatrix} a_1' & a_2' \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}^{-1}
\begin{pmatrix} a_1 \\ a_2 \end{pmatrix},
$$

where
$$
a_1 = \begin{pmatrix} h(x) \\ \Psi_\theta(X_1, x) \end{pmatrix}, \quad
a_2 = \Psi_\theta(X_2^*, x), \quad
B_{11} = \begin{pmatrix} 0 & H(X_1)' \\ H(X_1) & \Psi_\theta(X_1, X_1) \end{pmatrix},
$$
$$
B_{12} = \begin{pmatrix} H(X_2^*)' \\ \Psi_\theta(X_1, X_2^*) \end{pmatrix}, \quad
B_{21} = B_{12}', \quad
B_{22} = \Psi_\theta(X_2^*, X_2^*),
$$
and $X_2^* = X_2 \setminus X_1$.

Applying partitioned matrix inverse results [9] and simplifying (7) gives
$$
\mathrm{MSPE}_2 = \mathrm{MSPE}_1 - (a_2 - B_{21}B_{11}^{-1}a_1)'\, B_{22\cdot1}^{-1}\, (a_2 - B_{21}B_{11}^{-1}a_1),
$$
where $B_{22\cdot1} = B_{22} - B_{21}B_{11}^{-1}B_{12}$. Now, the proof is completed by showing that $B_{22\cdot1}$ is non-negative definite ($B_{22\cdot1} \succeq 0$). Once again applying partitioned matrix inverse results gives
$$
\begin{aligned}
B_{22\cdot1} &= \Psi_\theta(X_2^*, X_2^*)
- \begin{pmatrix} H(X_2^*) & \Psi_\theta(X_2^*, X_1) \end{pmatrix}
\begin{pmatrix} 0 & H(X_1)' \\ H(X_1) & \Psi_\theta(X_1, X_1) \end{pmatrix}^{-1}
\begin{pmatrix} H(X_2^*)' \\ \Psi_\theta(X_1, X_2^*) \end{pmatrix} \\
&= \Psi_\theta(X_2^*, X_2^*) - \Psi_\theta(X_2^*, X_1)\Psi_\theta(X_1, X_1)^{-1}\Psi_\theta(X_1, X_2^*) \\
&\quad + \left( H(X_2^*) - \Psi_\theta(X_2^*, X_1)\Psi_\theta(X_1, X_1)^{-1}H(X_1) \right)
\left( H(X_1)'\Psi_\theta(X_1, X_1)^{-1}H(X_1) \right)^{-1} \\
&\quad\quad \times \left( H(X_2^*) - \Psi_\theta(X_2^*, X_1)\Psi_\theta(X_1, X_1)^{-1}H(X_1) \right)'.
\end{aligned}
$$
The first two terms are non-negative definite because they represent a conditional variance. The third term is non-negative definite since $H(X_1)'\Psi_\theta(X_1, X_1)^{-1}H(X_1) \succeq 0$.
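The monotonicity in Proposition 3.1 is easy to check numerically. The following sketch is our own illustration, with an arbitrary Gaussian kernel and constant regression function; it evaluates (6) for a nested pair of designs via the bordered matrix in (6).

```python
import numpy as np

def mspe(x, X, h, psi):
    """MSPE (6) via the bordered matrix [[0, H'], [H, Psi(X, X)]]."""
    H = np.stack([h(xi) for xi in X])
    K = np.array([[psi(u, v) for v in X] for u in X])
    p = H.shape[1]
    B = np.block([[np.zeros((p, p)), H.T], [H, K]])
    a = np.concatenate([h(x), np.array([psi(x, v) for v in X])])
    return psi(x, x) - a @ np.linalg.solve(B, a)

psi = lambda u, v: float(np.exp(-np.sum((3.0 * (u - v)) ** 2)))
h = lambda x: np.array([1.0])
rng = np.random.default_rng(0)
X1 = rng.random((6, 2))
X2 = np.vstack([X1, rng.random((4, 2))])           # X1 is a subset of X2
x = np.array([0.4, 0.6])
print(mspe(x, X2, h, psi) <= mspe(x, X1, h, psi))  # True, per Proposition 3.1
```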

Consider controlling the inner part of the bound on the nominal error given by the first term of (4). The MSPE in the inner part of the nominal error is given by (6). Applying partitioned matrix inverse results, (6) can be rewritten as

(8)
$$
\begin{aligned}
E\{f(x) - \hat f_\vartheta(x)\}^2
&= \Psi_\theta(x, x) - \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \\
&\quad + \left( h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right)'
\left( H(X)'\Psi_\theta(X, X)^{-1}H(X) \right)^{-1}
\left( h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right).
\end{aligned}
$$

Initially, the uppermost terms in (8), which provide the MSPE for a model with mean zero or no regression functions, are bounded. We examine two cases. The first covers many situations where stationarity is assumed, while the second considers a model of non-stationarity in correlation, adapted from [3]. In the definitions below, $\varphi(\cdot)$ is a decreasing function of its non-negative argument.

Case 1: $\Psi_\theta(u, v) = \sigma^2 \varphi(\|\Theta(u - v)\|_2)$. Write the overall bound on the uppermost terms of the MSPE (8) in terms of the maximum of local bounds,

(9)
$$
\sup_{x \in \Omega} E\{f(x) - \hat f_\vartheta(x)\}^2
= \max_i \sup_{x \in V_i(\Theta)} E\{f(x) - \hat f_\vartheta(x)\}^2,
$$

where $V_i(\Theta) = \{x \in \Omega : d_\Theta(x, x_i) \le d_\Theta(x, x_j) \; \forall j \ne i\}$ is a Voronoi covering [2] of $\Omega$ with respect to a Mahalanobis-like distance [13] $d_\Theta(u, v) = \|\Theta(u - v)\|_2$. Throughout, we will consider the vector and matrix norm $\|A\|_2 = \sqrt{\lambda_{\max}(A'A)}$, along with the notation $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ for the maximum and minimum eigenvalues of their (diagonalizable) arguments. The maximum over $i$ in (9) can be controlled by imposing a uniform bound on each of its components. For an arbitrary set $A \subset \Omega$ and covariance function $\Psi_\theta$, the uppermost terms in (8) can be locally bounded as

(10)
$$
\begin{aligned}
\sup_{x \in A} \; \Psi_\theta(x, x) - \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x)
&\le \sup_{x \in A} \; \Psi_\theta(x, x) - \frac{\|\Psi_\theta(X, x)\|_2^2}{\lambda_{\max}(\Psi_\theta(X, X))} \\
&\le \sup_{x \in A} \; \Psi_\theta(x, x) - \frac{\|\Psi_\theta(X, x)\|_2^2}{n \sup_{u,v \in \Omega} \Psi_\theta(u, v)} \\
&\le \sup_{x \in A} \; \Psi_\theta(x, x) - \frac{\inf_{x \in A} \Psi_\theta(x_i, x)^2}{n \sup_{u,v \in \Omega} \Psi_\theta(u, v)},
\end{aligned}
$$

where the first inequality is true because $a'B^{-1}a \ge \lambda_{\min}(B^{-1})\|a\|_2^2$ and $\lambda_{\min}(B^{-1}) = 1/\lambda_{\max}(B)$, the second inequality is true since Gershgorin's theorem [22] implies

(11)
$$
\lambda_{\max}(\Psi_\theta(X, X)) \le \max_j \sum_{i=1}^n \Psi_\theta(x_i, x_j) \le n \sup_{u,v \in \Omega} \Psi_\theta(u, v),
$$

and the final inequality is true because the supremum of a sum is bounded above by the sum of the supremums and the sum of squares $\|\cdot\|_2^2$ is larger than any one of its elements squared. Note that the term $n \sup_{u,v \in \Omega} \Psi_\theta(u, v)$ in bound (10) is constant over experimental designs. For $A = V_i(\Theta)$ and $\Psi_\theta(u, v) = \sigma^2 \varphi(\|\Theta(u - v)\|_2)$, the bound in (10) is

(12)
$$
\sigma^2 \left( \varphi(0) - \frac{\varphi\left( \sup_{x \in V_i(\Theta)} d_\Theta(x_i, x) \right)^2}{n \varphi(0)} \right).
$$

The maximum over $i$, corresponding to the bound in (9), is
$$
\sigma^2 \left( \varphi(0) - \frac{\varphi\left( \max_i \sup_{x \in V_i(\Theta)} d_\Theta(x_i, x) \right)^2}{n \varphi(0)} \right).
$$

Note that

(13)
$$
\max_i \sup_{x \in V_i(\Theta)} d_\Theta(x_i, x) = \sup_{x \in \Omega} \min_i d_\Theta(x_i, x)
$$

is the fill distance with respect to the distance $d_\Theta$. So, the supremum of the MSPE over possible inputs, for a GP model with mean zero, can be controlled by demanding that the non-spherical fill distance (13) is small. More importantly, a uniform bound on the terms (12) is achieved by an experimental design $X$ for which all the $\sup_{x \in V_i(\Theta)} d_\Theta(x_i, x)$ are the same. That is, all the Voronoi cells have the same maximum distance with respect to $d_\Theta$ from their data point to their edge.
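The fill distance (13) is straightforward to approximate by scanning a dense set of candidate inputs. Below is a sketch of this computation (our own, using Monte Carlo candidates on $\Omega = [0, 1]^d$; a fine regular grid would serve equally well).

```python
import numpy as np

def fill_distance(X, Theta, n_cand=50_000, seed=1):
    """Approximates sup_{x in Omega} min_i d_Theta(x_i, x) on Omega = [0, 1]^d."""
    rng = np.random.default_rng(seed)
    cand = rng.random((n_cand, X.shape[1]))
    CT, XT = cand @ Theta.T, X @ Theta.T           # work in the d_Theta geometry
    d2 = ((CT[:, None, :] - XT[None, :, :]) ** 2).sum(axis=-1)
    return float(np.sqrt(d2.min(axis=1)).max())

rng = np.random.default_rng(0)
print(fill_distance(rng.random((23, 2)), np.eye(2)))
```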

Case 2: $\Psi_\theta(u, v) = \sigma^2 \left( \omega_1(u)\omega_1(v)\varphi(\|\Theta_1(u - v)\|_2) + \omega_2(u)\omega_2(v)\varphi(\|\Theta_2(u - v)\|_2) \right)$. For Case 2, assume $\omega_1(\cdot), \omega_2(\cdot) \ge 0$ have Lipschitz continuous derivatives on $\Omega$, $\omega_1(\cdot)^2 + \omega_2(\cdot)^2 = 1$, $\Theta_1, \Theta_2$ are non-singular, and $\lambda_{\max}(\Theta_1'\Xi_2'\Xi_2\Theta_1) < 1$, where $\Xi_2 = \Theta_2^{-1}$. The final assumption can be interpreted as $\varphi(\|\Theta_2(\cdot - \cdot)\|_2)$ being narrower than $\varphi(\|\Theta_1(\cdot - \cdot)\|_2)$. For this covariance function, the uppermost terms in (8) can be locally bounded on a set $A_i \subset \Omega$ with $x \in A_i$, following the development in (10), as

(14)
$$
\sigma^2 \left( \varphi(0) - \frac{\left( \inf_{x \in A_i} \{\omega_1(x_i)\omega_1(x)\varphi(\|\Theta_1(x_i - x)\|_2) + \omega_2(x_i)\omega_2(x)\varphi(\|\Theta_2(x_i - x)\|_2)\} \right)^2}{n \varphi(0)} \right).
$$

Similar to Case 1, write the overall bound on the uppermost terms of the MSPE (8) in terms of the maximum of local bounds,
$$
\sup_{x \in \Omega} E\{f(x) - \hat f_\vartheta(x)\}^2
\le \max_i \sup_{x \in V_i^*} E\{f(x) - \hat f_\vartheta(x)\}^2,
$$

where $V_i^* = V_i(\Theta_1) \cup V_i(\Theta_2)$. Note that $V_i(\Theta_1)$ and $V_i(\Theta_2)$ often do not differ strongly. For example, if $\Theta_2 = a\Theta_1$, then $V_i(\Theta_1) = V_i(\Theta_2)$. Taking $A_i = V_i^*$, the infimum in (14) can, in turn, be bounded below, thereby bounding (14) above, as

(15)
$$
\begin{aligned}
&\inf_{x \in V_i^*} \{\omega_1(x_i)\omega_1(x)\varphi(\|\Theta_1(x_i - x)\|_2) + \omega_2(x_i)\omega_2(x)\varphi(\|\Theta_2(x_i - x)\|_2)\} \\
&\quad \ge \inf_{x \in V_i^*} \left\{ \omega_1(x_i)\omega_1(x)\varphi\left( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \right) + \omega_2(x_i)\omega_2(x)\varphi\left( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \right) \right\}.
\end{aligned}
$$

The Lipschitz derivatives of $\omega_1(\cdot), \omega_2(\cdot)$ and Taylor's theorem [16] imply
$$
\omega_1(x) = \omega_1(x_i) + R_1(x, x_i) \quad \text{and} \quad \omega_2(x) = \omega_2(x_i) + R_2(x, x_i),
$$
where $|R_1(x, x_i)| \le k_1\|x_i - x\|_2$ and $|R_2(x, x_i)| \le k_2\|x_i - x\|_2$. The bound (15) can, in turn, be bounded below with

(16)
$$
\omega_1(x_i)^2 \varphi\left( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \right)
+ \omega_2(x_i)^2 \varphi\left( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \right)
- \varphi(0)(k_1 + k_2) \sup_{x \in V_i^*} \{\|x_i - x\|_2\}.
$$

For tractability, the second term in (16) is bounded uniformly across the design space as
$$
- \sup_{x \in V_i^*} \{\varphi(0)(k_1 + k_2)\|x_i - x\|_2\} \ge -\varphi(0)(k_1 + k_2) \max_i \sup_{x \in V_i^*} \|x_i - x\|_2.
$$

Next, consider an experimental design for which the bounds

(17)
$$
\omega_1(x_i)^2 \varphi\left( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \right)
+ \omega_2(x_i)^2 \varphi\left( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \right)
- \varphi(0)(k_1 + k_2) \max_i \sup_{x \in V_i^*} \|x_i - x\|_2
$$

are uniform over $i$. One might expect that regions of the design space with less weight on the global, long range, correlation $\varphi(\|\Theta_1(\cdot - \cdot)\|_2)$ and more weight on the local, short range, correlation $\varphi(\|\Theta_2(\cdot - \cdot)\|_2)$ would require more closely spaced design points, and vice versa. Roughly speaking, this expectation holds true. Consider two design points $x_i$ and $x_j$ along with corresponding (union of) Voronoi cell sizes $\sup_{x \in V_i^*} d_{\Theta_1}(x_i, x)$, $\sup_{x \in V_i^*} d_{\Theta_2}(x_i, x)$, $\sup_{x \in V_j^*} d_{\Theta_1}(x_j, x)$, and $\sup_{x \in V_j^*} d_{\Theta_2}(x_j, x)$. Suppose that the points in the input space near $x_i$ have more weight on the global, long range, correlation than the points in the input space near $x_j$, and the points in the input space near $x_j$ have more weight on the local, short range, correlation than the points in the input space near $x_i$, in the sense that

(18)
$$
\begin{aligned}
\omega_1(x_i)^2 \left( \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \Big) - \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \Big) \right)
&\ge \omega_1(x_j)^2 \left( \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_1}(x_j, x) \Big) - \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_2}(x_j, x) \Big) \right)
\quad \text{and} \\
\omega_2(x_i)^2 \left( \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \Big) - \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \Big) \right)
&\le \omega_2(x_j)^2 \left( \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_1}(x_j, x) \Big) - \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_2}(x_j, x) \Big) \right).
\end{aligned}
$$

Uniformity of the bounds (17) along with $\omega_1(\cdot)^2 + \omega_2(\cdot)^2 = 1$ implies

(19)
$$
\begin{aligned}
&\omega_1(x_i)^2 \left( \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \Big) - \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \Big) \right)
- \omega_1(x_j)^2 \left( \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_1}(x_j, x) \Big) - \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_2}(x_j, x) \Big) \right) \\
&\qquad = \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_2}(x_j, x) \Big) - \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \Big)
\end{aligned}
$$
and
$$
\begin{aligned}
&\omega_2(x_j)^2 \left( \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_1}(x_j, x) \Big) - \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_2}(x_j, x) \Big) \right)
- \omega_2(x_i)^2 \left( \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \Big) - \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) \Big) \right) \\
&\qquad = \varphi\Big( \sup_{x \in V_j^*} d_{\Theta_1}(x_j, x) \Big) - \varphi\Big( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) \Big).
\end{aligned}
$$

Combining (18) with (19) gives

(20)
$$
\sup_{x \in V_j^*} d_{\Theta_1}(x_j, x) \le \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x)
\quad \text{and} \quad
\sup_{x \in V_j^*} d_{\Theta_2}(x_j, x) \le \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x),
$$

since $\varphi$ is a decreasing function of its non-negative argument. That is, a uniform bound on (17) is achieved by an experimental design $X$ which has smaller (union of) Voronoi cells, with respect to either $d_{\Theta_1}$ or $d_{\Theta_2}$, in regions with more emphasis on the local, more quickly decaying, correlation and less emphasis on the global, more slowly decaying, correlation. Note that the global and local emphases at $x_i$ are given concretely by $\omega_1(x_i)^2 \big( \varphi( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) ) - \varphi( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) ) \big)$ and $\omega_2(x_i)^2 \big( \varphi( \sup_{x \in V_i^*} d_{\Theta_1}(x_i, x) ) - \varphi( \sup_{x \in V_i^*} d_{\Theta_2}(x_i, x) ) \big)$, respectively.

Recall that Cases 1 and 2 relate to the uppermost terms in (8), which without further development provide the MSPE for a model with mean zero. Now, we consider the lowermost terms in (8), which are relevant for Gaussian process models with a mean or non-null regression component. The lowermost terms in (8) can be bounded above as

(21)
$$
\begin{aligned}
&\left( h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right)'
\left( H(X)'\Psi_\theta(X, X)^{-1}H(X) \right)^{-1}
\left( h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right) \\
&\quad \le \left\| h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right\|_2^2 \Big/ \lambda_{\min}\left( H(X)'\Psi_\theta(X, X)^{-1}H(X) \right) \\
&\quad \le n \sup_{u,v \in \Omega} \Psi_\theta(u, v) \left\| h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right\|_2^2 \Big/ \lambda_{\min}\left( H(X)'H(X) \right).
\end{aligned}
$$

The first inequality is true because $a'B^{-1}a \le \lambda_{\max}(B^{-1})\|a\|_2^2$ and $\lambda_{\max}(B^{-1}) = 1/\lambda_{\min}(B)$, and the second inequality is true because $\lambda_{\min}(A'B^{-1}A) \ge \lambda_{\min}(A'A)\lambda_{\min}(B^{-1}) = \lambda_{\min}(A'A)/\lambda_{\max}(B)$ and the implication (11) of Gershgorin's theorem [22]. Note that the term $n \sup_{u,v \in \Omega} \Psi_\theta(u, v)$ does not depend on the experimental design.

The components of the squared Euclidean norm $\|h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x)\|_2^2$ are squared errors for an interpolator of the regression functions. Intuitively, we might expect these squared interpolation errors to behave in a manner similar to the MSPE for the Gaussian process model with mean zero. In fact, it has been shown above that if the regression functions are draws from a Gaussian process with mean zero and a covariance structure as described in Cases 1 and 2, then the expectation of these squared errors can be controlled through the experimental design as described above. That is, an experimental design which gives low nominal error in the mean zero case will also make the term $\|h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x)\|_2^2$ small.
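The denominator term $\lambda_{\min}(H(X)'H(X))$ in (21) is computable directly. As a small illustration of the balance discussed in the paragraphs below (a sketch of our own, assuming a constant-plus-linear regression basis), a design pushed to the corners of $[0, 1]^2$ typically yields a larger value than a generic design with the same run size.

```python
import numpy as np

def lmin_HtH(X):
    """lambda_min(H(X)'H(X)) for the constant-plus-linear basis h(x) = (1, x')'."""
    H = np.hstack([np.ones((len(X), 1)), X])
    return float(np.linalg.eigvalsh(H.T @ H)[0])

rng = np.random.default_rng(0)
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X_corners = np.repeat(corners, 6, axis=0)      # 24 runs pushed to the corners
X_generic = rng.random((24, 2))                # 24 generic runs
print(lmin_HtH(X_corners), lmin_HtH(X_generic))  # corner design typically larger
```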

Alternatively, a reproducing kernel Hilbert space (RKHS) [1, 24] may be defined as the completion of the function space spanned by $\{\Psi_\theta(x_i, \cdot) : x_i \in \Omega\}$ with respect to the inner product $\langle \sum_i a_i \Psi_\theta(x_i, \cdot), \sum_j b_j \Psi_\theta(y_j, \cdot) \rangle = \sum_{i,j} a_i b_j \Psi_\theta(x_i, y_j)$. Many commonly selected regression functions, for example constant, linear, polynomial, and spline, will also lie in the RKHSs induced by many of the common covariance functions. For example, the Gaussian kernel induces an RKHS of functions with infinitely many continuous derivatives, and Matérn kernels induce RKHSs of functions with a fixed number of continuous derivatives. If the selected regression functions lie in the RKHS induced by the chosen covariance function, then deterministic RKHS interpolation error bounds as a decreasing function of the fill distance, such as Theorem 5.1 in [8], can be applied. Another alternative would be to choose as regression functions covariance function (half) evaluations $\{\Psi_\theta(x_i, \cdot) : i \in I\}$ at a well-distributed set of centers $I$. These regression functions are capable of approximating a broad range of mean functions and have the appealing feature that the lowermost term in the bound (8) is then identically zero.

The term $\lambda_{\min}(H(X)'H(X))$ in the denominator of (21) indicates that (at least for regression functions which do not make $h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \equiv 0$), the design properties implied by the mean zero development in Cases 1 and 2 need to be balanced with traditional experimental design properties. Two common scenarios are of particular interest. First, consider a constant regression function, a mean parameter. In this situation, $H(X)'H(X) = n$ irrespective of experimental design. Second, consider linear regression functions in each dimension in addition to the constant. If each linear function is expressed on the same scale, $\lambda_{\min}(H(X)'H(X))$ will be large for an experimental design with points far from $\bar x$, the average design value, and whose orientations $x_i - \bar x$ emphasize each basis vector in an orthonormal basis of $\mathbb{R}^d$ equally. For the common situation where $\Omega = [0, 1]^d$, $\lambda_{\min}(H(X)'H(X))$ will be maximized for a design with equal numbers of points in each of the corners of $[0, 1]^d$. So, high-quality experimental designs for Gaussian process models with linear regression mean components will balance the fill distance-based criteria described for Cases 1 and 2 with the push of design points to the "corners" of $\Omega$. Similarly, high-quality experimental designs for Gaussian process models with quadratic regression mean components will balance the fill distance-based criteria described for Cases 1 and 2 with the push of design points to the edges and middle of the design space.

Example near optimal 23 run experimental designs for the nominal situations described in Case 1 (stationarity), Case 2 (non-stationary correlation), and Case 1 along with linear regression functions are illustrated in the top left, top right, and bottom panels, respectively, of Figure 1. For each case, $\varphi(d) = \exp\{-d^2\}$. For Case 2, $\omega_1(u)^2 = 1 - \|u\|_2/2$, $\omega_2(u)^2 = \|u\|_2/2$, $\Theta_1 = 1 \cdot I_2$, $\Theta_2 = 10 \cdot I_2$, and $\sigma^2 = 1$, while for Case 1 along with linear regression functions $\Theta = 2 \cdot I_2$ and $\sigma^2 = 1$. As expected, in the first panel, illustrating the stationary situation, the design points lie near a triangular lattice (subject to edge effects). Similarly, in the second panel, illustrating the non-stationary correlation situation, the design points in the upper right, where the shorter range, more quickly decaying, correlation is emphasized, are more dense than in the lower left, where the longer range, more slowly decaying, correlation is emphasized. Further, in the third panel, illustrating the impact of regression functions, the design points balance fill distance and a push towards the corners of the input space.

Finding designs which minimize (or nearly minimize) the error bounds is challenging. Here, we adopted the following approach. For the stationary situation, the optimization routine was initialized at a triangular lattice, scaled to minimize the fill distance. For both the non-stationary and stationary with regression functions situations, a homotopy continuation [10] approach, which slowly transitioned from the stationary objective function to either the non-stationary or the stationary with regression functions objective function, was used. Nelder-Mead black box optimization along with penalties to enforce input space constraints was used throughout [16].

Figure 1. Top Left Panel: Near optimal experimental design with respect to nominal error for stationary correlation. Top Right Panel: Near optimal experimental design with respect to nominal error for the Case 2 model of non-stationary correlation with $\varphi(d) = \exp\{-d^2\}$, $\omega_1(u)^2 = 1 - \|u\|_2/2$, $\omega_2(u)^2 = \|u\|_2/2$, $\Theta_1 = 1 \cdot I_2$, $\Theta_2 = 10 \cdot I_2$. Bottom Panel: Near optimal experimental design with respect to nominal error for stationary correlation and a linear regression function for each dimension.

4. Numeric Error. In Section 3, it has been shown that increasing the number of data points will decrease the nominal error. On the other hand, the numeric error can become arbitrarily large by the addition of new data sites. Here, we develop bounds on the numeric error in terms of properties of the experimental design by adapting and extending results in [7], [8], and [24]. The numeric accuracy of Gaussian process emulation depends on the accuracy of floating point matrix manipulations. Floating point numbers are the rounded versions of the targeted numbers with which computers actually perform calculations. Commonly, computer and software have 15 digits of accuracy, meaning that
$$
\frac{\|\tilde x - x\|_2}{\|x\|_2} \le 10^{-15},
$$

where $x$ denotes the actual value and $\tilde x$ denotes the value that the computer stores. Throughout, we will make use of the following lemma on the accuracy of floating point matrix inversion, which is a combination of Lemmas 2.7.1 and 2.7.2 in [7].

Lemma 4.1. Suppose $Ax = b$ and $\tilde A \tilde x = \tilde b$ with $\|A - \tilde A\|_2 \le \delta\|A\|_2$, $\|b - \tilde b\|_2 \le \delta\|b\|_2$, and $\kappa(A) = r/\delta < 1/\delta$ for some $\delta > 0$. Then, $\tilde A$ is non-singular,

(22)
$$
\frac{\|\tilde x\|_2}{\|x\|_2} \le \frac{1 + r}{1 - r}, \qquad
\frac{\|x - \tilde x\|_2}{\|x\|_2} \le \frac{2\delta}{1 - r}\,\kappa(A),
$$

where $\kappa(A) = \|A\|_2 \|A^{-1}\|_2$. Additionally, note that for (conformable) $a$, $\tilde a$, $b$, and $\tilde b$,

(23)
$$
\left| a'b - \tilde a'\tilde b \right|
= \left| a'(b - \tilde b) - (\tilde a - a)'\tilde b \right|
\le \|a\|_2 \|b - \tilde b\|_2 + \|\tilde a - a\|_2 \|\tilde b\|_2.
$$

The inner portion of the numeric error in the second term in (4) can be bounded as follows. Here, and throughout, $A^{-1}b$ and $\tilde A^{-1}\tilde b$ denote the solutions to the linear systems $Ax = b$ and $\tilde A \tilde x = \tilde b$, respectively, as opposed to the actual matrix multiplication. We have

(24)
$$
\begin{aligned}
\left| \hat f_{\hat\vartheta}(x) - \tilde f_{\hat\vartheta}(x) \right|
&= \Big| \left( h(x) - \tilde h(x) \right)'\hat\beta
- \left( \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}H(X) - \tilde\Psi_\theta(x, X)\tilde\Psi_\theta(X, X)^{-1}\tilde H(X) \right)\hat\beta \\
&\qquad + \left( \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}f(X) - \tilde\Psi_\theta(x, X)\tilde\Psi_\theta(X, X)^{-1}\tilde f(X) \right) \Big| \\
&\le \|h(x) - \tilde h(x)\|_2 \|\hat\beta\|_2
+ \left\| \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}H(X) - \tilde\Psi_\theta(x, X)\tilde\Psi_\theta(X, X)^{-1}\tilde H(X) \right\|_2 \|\hat\beta\|_2 \\
&\qquad + \left| \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}f(X) - \tilde\Psi_\theta(x, X)\tilde\Psi_\theta(X, X)^{-1}\tilde f(X) \right| \\
&\le \|h(x) - \tilde h(x)\|_2 \|\hat\beta\|_2
+ \left| \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}f(X) - \tilde\Psi_\theta(x, X)\tilde\Psi_\theta(X, X)^{-1}\tilde f(X) \right| \\
&\qquad + \sqrt{ \sum_{j=1}^p \left( \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}h_j(X) - \tilde\Psi_\theta(x, X)\tilde\Psi_\theta(X, X)^{-1}\tilde h_j(X) \right)^2 } \; \|\hat\beta\|_2,
\end{aligned}
$$

where $h_j(X)$ and $\tilde h_j(X)$ denote the $j$th regression function evaluated at $X$ and its floating point approximation, respectively. Let $u = \Psi_\theta(X, X)^{-1}f(X)$, $\tilde u = \tilde\Psi_\theta(X, X)^{-1}\tilde f(X)$, $v_j = \Psi_\theta(X, X)^{-1}h_j(X)$, and $\tilde v_j = \tilde\Psi_\theta(X, X)^{-1}\tilde h_j(X)$. Then, (24) along with inequality (23) implies

(25)
$$
\begin{aligned}
\left| \hat f_{\hat\vartheta}(x) - \tilde f_{\hat\vartheta}(x) \right|
&\le \|h(x) - \tilde h(x)\|_2 \|\hat\beta\|_2
+ \|\Psi_\theta(x, X)\|_2 \|u - \tilde u\|_2
+ \|\Psi_\theta(x, X) - \tilde\Psi_\theta(x, X)\|_2 \|\tilde u\|_2 \\
&\qquad + \sqrt{ \sum_{j=1}^p \left( \|\Psi_\theta(x, X)\|_2 \|v_j - \tilde v_j\|_2 + \|\Psi_\theta(x, X) - \tilde\Psi_\theta(x, X)\|_2 \|\tilde v_j\|_2 \right)^2 } \; \|\hat\beta\|_2.
\end{aligned}
$$

Now, we state a few typical assumptions on the computer and software floating point accuracy.

Assumption 4.1. Take $\kappa(\Psi_\theta(X, X)) = r/\delta$ for $r < 1$ and
$$
\begin{gathered}
\|h(x) - \tilde h(x)\|_2 \le \delta\|h(x)\|_2, \quad
\|f(X) - \tilde f(X)\|_2 \le \delta\|f(X)\|_2, \quad
\|h_j(X) - \tilde h_j(X)\|_2 \le \delta\|h_j(X)\|_2, \\
\|\Psi_\theta(X, X) - \tilde\Psi_\theta(X, X)\|_2 \le \delta\|\Psi_\theta(X, X)\|_2, \quad \text{and} \quad
\|\Psi_\theta(x, X) - \tilde\Psi_\theta(x, X)\|_2 \le \delta\|\Psi_\theta(x, X)\|_2.
\end{gathered}
$$

Under Assumption 4.1, Lemma 4.1 can be applied to (25) to obtain

(26)
$$
\begin{aligned}
\left| \hat f_{\hat\vartheta}(x) - \tilde f_{\hat\vartheta}(x) \right|
&\le \delta\|h(x)\|_2 \|\hat\beta\|_2
+ \|\Psi_\theta(x, X)\|_2 \frac{2\delta}{1 - r}\,\kappa(\Psi_\theta(X, X))\|u\|_2
+ \delta\|\Psi_\theta(x, X)\|_2 \frac{1 + r}{1 - r}\|u\|_2 \\
&\qquad + \sqrt{ \sum_{j=1}^p \left( \|\Psi_\theta(x, X)\|_2 \frac{2\delta}{1 - r}\,\kappa(\Psi_\theta(X, X))\|v_j\|_2 + \delta\|\Psi_\theta(x, X)\|_2 \frac{1 + r}{1 - r}\|v_j\|_2 \right)^2 } \; \|\hat\beta\|_2.
\end{aligned}
$$

Note that $\|u\|_2 = \|\Psi_\theta(X, X)^{-1}f(X)\|_2 \le \|\Psi_\theta(X, X)^{-1}\|_2\|f(X)\|_2 = \|f(X)\|_2/\lambda_{\min}(\Psi_\theta(X, X))$. Similarly, $\|v_j\|_2 \le \|h_j(X)\|_2/\lambda_{\min}(\Psi_\theta(X, X))$. Using these facts along with $r < 1$ and grouping terms gives

(27)
$$
\left| \hat f_{\hat\vartheta}(x) - \tilde f_{\hat\vartheta}(x) \right|
\le \delta\|h(x)\|_2 \|\hat\beta\|_2
+ \frac{2\delta}{1 - r}\|\Psi_\theta(x, X)\|_2
\left( \|f(X)\|_2 + \|\hat\beta\|_2 \sqrt{ \sum_{j=1}^p \|h_j(X)\|_2^2 } \right) g(X, \Psi_\theta),
$$

where
$$
g(X, \Psi_\theta) = \frac{\kappa(\Psi_\theta(X, X)) + 1}{\lambda_{\min}(\Psi_\theta(X, X))}.
$$

For experimental designs which are not too small and have reasonable parameter estimation properties, $\|\hat\beta\|_2$ will be of a similar magnitude to $\|\beta\|_2$. Further, the terms $\|\Psi_\theta(x, X)\|_2$, $\|f(X)\|_2$, and $\|h_j(X)\|_2$ are Monte Carlo approximations to $\sqrt{n}\|\Psi_\theta(x, \cdot)\|_{L_2(\Omega)}$, $\sqrt{n}\|f(\cdot)\|_{L_2(\Omega)}$, and $\sqrt{n}\|h_j(\cdot)\|_{L_2(\Omega)}$, respectively, with respect to the distribution of the experimental design. That is, the terms in the bound (27), aside from $g(X, \Psi_\theta)$, influence the numeric accuracy only weakly and vanishingly, and the bound depends on the experimental design primarily through $g(X, \Psi_\theta)$. The implication (11) of Gershgorin's theorem [22] shows that $g(X, \Psi_\theta)$ can be bounded in terms of the minimum eigenvalue of $\Psi_\theta(X, X)$ as

(28)
$$
g(X, \Psi_\theta) \le \frac{1}{\lambda_{\min}(\Psi_\theta(X, X))}
\left( \frac{n \sup_{u,v \in \Omega} \Psi_\theta(u, v)}{\lambda_{\min}(\Psi_\theta(X, X))} + 1 \right).
$$
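The quantity $g(X, \Psi_\theta)$ is computable directly from the correlation matrix, and it degrades sharply when two design points nearly coincide. A small sketch of our own, assuming a Gaussian correlation with an arbitrary rate parameter, illustrates this:

```python
import numpy as np

def g(X, theta=10.0):
    """g(X, Psi_theta) = (kappa(Psi) + 1) / lambda_min(Psi) from (27),
    for the Gaussian correlation exp{-theta * ||u - v||^2}."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    ev = np.linalg.eigvalsh(np.exp(-theta * D2))   # eigenvalues, ascending
    return float((ev[-1] / ev[0] + 1.0) / ev[0])

rng = np.random.default_rng(0)
X = rng.random((20, 2))
X_bad = np.vstack([X, X[:1] + 1e-4])   # nearly duplicated design point
print(g(X), g(X_bad))                   # poor separation inflates the bound
```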

We adapt results from [24] to provide a lower bound for $\lambda_{\min}(\Psi_\theta(X, X))$, and thereby an upper bound for (28), which can be applied in Case 1 (stationarity) and Case 2 (non-stationary correlation). The following (angular frequency, unitary) definition of the Fourier transform will be used throughout the development below.

Definition 4.1. For $f \in L_1(\mathbb{R}^d)$ define the Fourier transform [19]
$$
\hat f(\omega) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} f(x) e^{-i\omega'x} \, dx.
$$

For a continuous, positive definite, translation invariant kernel $\Phi$ which has Fourier transform $\hat\Phi \in L_1(\mathbb{R}^d)$,

(29)
$$
\sum_{j=1}^n \sum_{k=1}^n \alpha_j \alpha_k \Phi(x_j - x_k)
= (2\pi)^{-d/2} \sum_{j=1}^n \sum_{k=1}^n \alpha_j \alpha_k \int_{\mathbb{R}^d} e^{i\omega'(x_j - x_k)} \hat\Phi(\omega) \, d\omega
= (2\pi)^{-d/2} \int_{\mathbb{R}^d} \Big| \sum_{j=1}^n \alpha_j e^{i\omega'x_j} \Big|^2 \hat\Phi(\omega) \, d\omega,
$$

for $\alpha \in \mathbb{R}^n$, $x_i \in \mathbb{R}^d$. Representation (29) implies that a lower bound for $\sum_{j,k} \alpha_j \alpha_k \Phi(x_j - x_k)$ is provided by $\sum_{j,k} \alpha_j \alpha_k \Upsilon(x_j - x_k)$, where $\Upsilon$ has $\hat\Upsilon(\omega) \le \hat\Phi(\omega)$. Consider $\Upsilon_M$ with

(30)
$$
\hat\Upsilon_M(\omega) = \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^d M^d \pi^{d/2}} (\chi_M * \chi_M)(\omega),
$$

where $M > 0$, $\hat\Phi^*(M) = \inf_{\|\omega\|_2 \le 2M} \hat\Phi(\omega)$, $\chi_M(\omega) = 1$ for $\|\omega\|_2 \le M$ and $0$ otherwise, and $*$ denotes the convolution operator $(f * g)(x) = \int_{\mathbb{R}^d} f(y) g(x - y) \, dy$.

For $\|\omega\|_2 > 2M$, $\hat\Upsilon_M(\omega) = 0 \le \hat\Phi(\omega)$. On the other hand, for $\|\omega\|_2 \le 2M$,
$$
\hat\Upsilon_M(\omega) = \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^d M^d \pi^{d/2}} \int_{\mathbb{R}^d} \chi_M(t)\chi_M(\omega - t) \, dt
\le \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^d M^d \pi^{d/2}} \,\mathrm{vol}\, B(0, 2M)
= \hat\Phi^*(M) \le \hat\Phi(\omega).
$$

So, $\hat\Upsilon_M(\omega) \le \hat\Phi(\omega)$ for all $\omega \in \mathbb{R}^d$. The candidate $\Upsilon_M$ can be recovered from the inverse Fourier transform $(\hat\Upsilon_M)^\vee$,
$$
\Upsilon_M(t) = \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^d M^d \pi^{d/2}} (\chi_M * \chi_M)^\vee(t)
= \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^d M^d \pi^{d/2}} (2\pi)^{d/2} \left( (\chi_M)^\vee(t) \right)^2
= \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^{d/2}} \|t\|_2^{-d} J_{d/2}(M\|t\|_2)^2,
$$
where $J_\nu$ is a Bessel function of the first kind. A proof of the final equality is given as Lemma 12.2 in [24]. Define $\Upsilon_M(0)$ as

(31)
$$
\Upsilon_M(0) \equiv \lim_{t \to 0} \Upsilon_M(t) = \frac{\hat\Phi^*(M)}{\Gamma(d/2 + 1)} \left( \frac{M}{2^{3/2}} \right)^d.
$$

This limit follows from the Taylor series representation of the Bessel function [23]. Now, a lower bound on the quadratic form involving $\Upsilon_M$ is developed.

(32)
$$
\begin{aligned}
\sum_{j=1}^n \sum_{k=1}^n \alpha_j \alpha_k \Upsilon_M(x_j - x_k)
&= \sum_{j=1}^n \alpha_j^2 \Upsilon_M(0) + \sum_{j \ne k} \alpha_j \alpha_k \Upsilon_M(x_j - x_k) \\
&\ge \sum_{j=1}^n \alpha_j^2 \Upsilon_M(0) - \sum_{j \ne k} |\alpha_j||\alpha_k| \Upsilon_M(x_j - x_k) \\
&\ge \sum_{j=1}^n \alpha_j^2 \Upsilon_M(0) - \frac{1}{2}\sum_{j \ne k} (\alpha_j^2 + \alpha_k^2) \Upsilon_M(x_j - x_k) \\
&= \sum_{j=1}^n \alpha_j^2 \Upsilon_M(0) - \sum_{j=1}^n \alpha_j^2 \sum_{k=1, k \ne j}^n \Upsilon_M(x_j - x_k)
= \sum_{j=1}^n \alpha_j^2 \left( \Upsilon_M(0) - \sum_{k=1, k \ne j}^n \Upsilon_M(x_j - x_k) \right).
\end{aligned}
$$

Each $|\Upsilon_M(x_j - x_k)|$ can be bounded in terms of the separation distances
$$
q_j = \frac{1}{2} \min_{k=1,\ldots,n,\; k \ne j} \|x_j - x_k\|_2
\quad \text{and} \quad
q = \min_j q_j.
$$

For $m \in \mathbb{N}$, let $E_{jm} = \{x \in \mathbb{R}^d : mq_j \le \|x_j - x\|_2 < (m + 1)q_j\}$. Then, every $x_k$, $k \ne j$, is contained in exactly one $E_{jm}$. Further, every $B(x_k, q)$ is essentially disjoint and completely contained in $\{x \in \mathbb{R}^d : mq_j - q \le \|x_j - x\|_2 < (m + 1)q_j + q\}$. So, each $E_{jm}$ can contain no more than
$$
\frac{((m + 1)q_j + q)^d - (mq_j - q)^d}{q^d}
= \left( \frac{(m + 1)q_j}{q} + 1 \right)^d - \left( \frac{mq_j}{q} - 1 \right)^d
$$
data points. We now make use of the following lemma.

Lemma 4.2. For $d \in \mathbb{N}$ and $q_j \ge q > 0$,
$$
\left( \frac{(m + 1)q_j}{q} + 1 \right)^d - \left( \frac{mq_j}{q} - 1 \right)^d \le (3q_j/q)^d m^{d-1}.
$$

Proof. Take $d = 1$; then
$$
\frac{(m + 1)q_j}{q} + 1 - \left( \frac{mq_j}{q} - 1 \right) = 2 + q_j/q \le 3q_j/q.
$$
Now, assume the result is true for $1 \le d^* < d$. Let $c = 3q_j/q$. Then,
$$
\begin{aligned}
&\left( \frac{(m + 1)q_j}{q} + 1 \right)^{d-1} - \left( \frac{mq_j}{q} - 1 \right)^{d-1} \le c^{d-1} m^{d-2} \\
\implies &\left( \frac{(m + 1)q_j}{q} + 1 \right)^d - \left( \frac{(m + 1)q_j}{q} + 1 \right)\left( \frac{mq_j}{q} - 1 \right)^{d-1} \le \left( \frac{(m + 1)q_j}{q} + 1 \right) c^{d-1} m^{d-2} \\
\implies &\left( \frac{(m + 1)q_j}{q} + 1 \right)^d - \left( \frac{mq_j}{q} - 1 \right)^d
\le \left( \frac{(m + 1)q_j}{q} + 1 \right) c^{d-1} m^{d-2}
+ \left( \frac{q_j}{q} + 2 \right)\left( \frac{mq_j}{q} - 1 \right)^{d-1}.
\end{aligned}
$$

The proof is completed by showing that the right-hand side of the final inequality is bounded above by $c^d m^{d-1}$. The right-hand side can be represented in terms of $c$ as
$$
\begin{aligned}
\left( \frac{(m + 1)c}{3} + 1 \right) c^{d-1} m^{d-2} + \left( \frac{c}{3} + 2 \right)\left( \frac{mc}{3} - 1 \right)^{d-1}
&= c^d m^{d-1} \left[ \frac{m + 1}{3m} + \frac{1}{mc} + \left( \frac{1}{3} + \frac{2}{c} \right)\left( \frac{1}{3} - \frac{1}{mc} \right)^{d-1} \right] \\
&\le c^d m^{d-1} \left[ \frac{1}{3} + \frac{1}{3m} + \frac{1}{mc} + \left( \frac{1}{3} - \frac{1}{mc} \right)^{d-1} \right] \\
&\le c^d m^{d-1} \left[ \frac{1}{3} + \frac{1}{3m} + \frac{1}{mc} + \frac{1}{3} - \frac{1}{mc} \right]
= c^d m^{d-1} \left[ \frac{2}{3} + \frac{1}{3m} \right] \le c^d m^{d-1},
\end{aligned}
$$
where the first inequality is true because $1/3 + 2/c \le 1$ and the second inequality is true because $(1/3 - 1/(mc))^{d-1}$ is a decreasing function of $d \ge 2$.

Lemma 4.2 implies that each $E_{jm}$ contains no more than $(3q_j/q)^d m^{d-1}$ points. Note that on $E_{jm}$, $\Upsilon_M(x_j - x_k)$ is bounded above as

(33)
$$
\begin{aligned}
\Upsilon_M(x_j - x_k)
&= \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^{d/2}} \|x_j - x_k\|_2^{-d} J_{d/2}(M\|x_j - x_k\|_2)^2 \\
&\le \frac{\hat\Phi^*(M)\,\Gamma(d/2 + 1)}{2^{d/2}} \|x_j - x_k\|_2^{-d} \frac{2^{d+2}}{\pi M \|x_j - x_k\|_2} \\
&= \Upsilon_M(0) \frac{\Gamma^2(d/2 + 1)}{\pi} \left( \frac{4}{M\|x_j - x_k\|_2} \right)^{d+1}
\le \Upsilon_M(0) \frac{\Gamma^2(d/2 + 1)}{\pi} \left( \frac{4}{Mmq_j} \right)^{d+1},
\end{aligned}
$$

where the first inequality follows from the Bessel function bound provided in Lemma 3.3 of [15]. Combining Lemma 4.2 with (33) gives
$$
\sum_{k=1, k \ne j}^n \Upsilon_M(x_j - x_k)
\le \Upsilon_M(0) \sum_{m=1}^\infty \frac{\Gamma^2(d/2 + 1)}{\pi} \left( \frac{4}{Mmq_j} \right)^{d+1} (3q_j/q)^d m^{d-1}
= \Upsilon_M(0) \frac{\Gamma^2(d/2 + 1)\pi}{18} \left( \frac{q}{q_j} \right) \left( \frac{12}{Mq} \right)^{d+1},
$$
where the equality follows from the fact that $\sum_{m=1}^\infty m^{-2} = \pi^2/6$. Now, taking $M = c_*/q$ and referring back to (32), the quadratic form can be bounded as

(34)
$$
\sum_{j=1}^n \sum_{k=1}^n \alpha_j \alpha_k \Phi(x_j - x_k)
\ge \Upsilon_{c_*/q}(0) \sum_{j=1}^n \alpha_j^2 \left( 1 - \frac{\Gamma^2(d/2 + 1)\pi}{18} \left( \frac{q}{q_j} \right) \left( \frac{12}{c_*} \right)^{d+1} \right).
$$

A minor generalization of this quadratic form bound is summarized in the following theorem.

Theorem 4.1. Suppose $\Phi$ is a positive definite, translation invariant kernel with Fourier transform $\hat\Phi \in L_1(\mathbb{R}^d)$. Then,
$$
\sum_{j=1}^n \sum_{k=1}^n \alpha_j \alpha_k \Phi(x_j - x_k)
\ge \Upsilon_{c_*/q(\Theta)}(0) \sum_{j=1}^n \alpha_j^2 \left( 1 - \frac{\Gamma^2(d/2 + 1)\pi}{18} \left( \frac{q(\Theta)}{q_j(\Theta)} \right) \left( \frac{12}{c_*} \right)^{d+1} \right),
$$
for $c_* > 0$, where $\Upsilon_M(0)$ is defined in (31) and the separation distances with respect to the Mahalanobis-like distance $d_\Theta$, $\Theta$ non-singular, are given by
$$
q_j(\Theta) = \frac{1}{2} \min_{k=1,\ldots,n,\; k \ne j} d_\Theta(x_j, x_k)
\quad \text{and} \quad
q(\Theta) = \min_j q_j(\Theta).
$$
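The separation distances entering Theorem 4.1 are cheap to evaluate. A sketch of our own, for an arbitrary design and rescaling matrix:

```python
import numpy as np

def separation(X, Theta):
    """q_j(Theta) = 0.5 min_{k != j} d_Theta(x_j, x_k), q(Theta) = min_j q_j(Theta)."""
    XT = X @ Theta.T
    D = np.sqrt(((XT[:, None, :] - XT[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)        # exclude the k = j terms
    qj = 0.5 * D.min(axis=1)
    return qj, float(qj.min())

rng = np.random.default_rng(0)
qj, q = separation(rng.random((23, 2)), np.eye(2))
print(q)
```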

Note that the generalization follows by applying the previous development to the transformed space, $v \mapsto v^* = \Theta v$. This result is now applied in Case 1 and Case 2 to provide numeric error bounds for stationary and non-stationary correlation situations, respectively.

Case 1: $\Psi_\theta(u, v) = \sigma^2 \varphi(\|\Theta(u - v)\|_2)$. Here,

(35)
$$
\begin{aligned}
\lambda_{\min}(\Psi_\theta(X, X))
&= \min_{\|a\|_2 = 1} \sum_{i,j} a_i a_j \Psi_\theta(x_i, x_j)
= \sigma^2 \min_{\|a\|_2 = 1} \sum_{i,j} a_i a_j \varphi(\|\Theta(x_i - x_j)\|_2) \\
&= \sigma^2 \min_{\|a\|_2 = 1} \sum_{i,j} a_i a_j \varphi(\|x_i^* - x_j^*\|_2)
= \sigma^2 \min_{\|a\|_2 = 1} \sum_{i,j} a_i a_j \Phi(x_i^* - x_j^*)
\ge \sigma^2 \min_{\|a\|_2 = 1} \sum_{i=1}^n a_i^2 \ell_i(\Theta),
\end{aligned}
$$

where $\Phi(x^* - y^*) = \varphi(\|x^* - y^*\|_2)$ and

(36)
$$
\ell_i(\Theta) = \Upsilon_{c_*/q(\Theta)}(0) \left( 1 - \frac{\Gamma^2(d/2 + 1)\pi}{18} \left( \frac{q(\Theta)}{q_i(\Theta)} \right) \left( \frac{12}{c_*} \right)^{d+1} \right).
$$

The lower bound (35) is maximized for $\ell_i(\Theta)$ constant over $i$ and as large as possible. This implies $q_i(\Theta) = q(\Theta)$ for all $i$. Now, the lower bound depends on

(37)
$$
\Upsilon_{c_*/q(\Theta)}(0) = \frac{\hat\Phi^*(c_*/q(\Theta))}{\Gamma(d/2 + 1)} \left( \frac{c_*/q(\Theta)}{2^{3/2}} \right)^d,
$$

which is an increasing function of $q(\Theta)$ that approaches zero as $q(\Theta)$ approaches zero. That is, in the stationary situation given by Case 1, numeric accuracy is preserved for designs which are well-separated.

Case 2: $\Psi_\theta(u, v) = \sigma^2 \left( \omega_1(u)\omega_1(v)\varphi(\|\Theta_1(u - v)\|_2) + \omega_2(u)\omega_2(v)\varphi(\|\Theta_2(u - v)\|_2) \right)$. Here, assume additionally that $\Theta_2 = a\Theta_1$ for some $a > 1$. Slightly coarsen the bounds by replacing $\Upsilon_M(0)$ with its monotone decreasing in $M$ lower bound $\tilde\Upsilon_M^0 = \inf_{m \in [r_*, M]} \Upsilon_m(0)$, where $r_* = c_*/\max_{x,y \in \Omega} d_{\Theta_1}(x, y)$. Let $\tilde\ell_i$ denote the coarsened version of (36). Then,

(38)
$$
\begin{aligned}
\lambda_{\min}(\Psi_\theta(X, X))
&= \min_{\|a\|_2 = 1} \sum_{i,j} a_i a_j \Psi_\theta(x_i, x_j) \\
&= \sigma^2 \min_{\|a\|_2 = 1} \sum_{i,j} a_i a_j \left( \omega_1(x_i)\omega_1(x_j)\varphi(\|\Theta_1(x_i - x_j)\|_2) + \omega_2(x_i)\omega_2(x_j)\varphi(\|\Theta_2(x_i - x_j)\|_2) \right) \\
&\ge \sigma^2 \min_{\|a\|_2 = 1} \sum_i a_i^2 \left( \omega_1(x_i)^2 \tilde\ell_i(\Theta_1) + \omega_2(x_i)^2 \tilde\ell_i(\Theta_2) \right),
\end{aligned}
$$

where $\hat\Phi$ is the Fourier transform of $\Phi$ defined by $\Phi(x^* - y^*) = \varphi(\|x^* - y^*\|_2)$ in (36) and (37). The lower bound (38) is maximized for $\omega_1(x_i)^2 \tilde\ell_i(\Theta_1) + \omega_2(x_i)^2 \tilde\ell_i(\Theta_2)$ constant over $i$ and as large as possible. Consider two design points $x_i$ and $x_j$, and suppose that the points in the input space near $x_i$ have more weight on the global, long range, correlation than the points in the input space near $x_j$, and the points in the input space near $x_j$ have more weight on the local, short range, correlation than the points in the input space near $x_i$, in the sense that

(39)
$$
\omega_1(x_i)^2 (\tilde\ell_i(\Theta_2) - \tilde\ell_i(\Theta_1)) \ge \omega_1(x_j)^2 (\tilde\ell_j(\Theta_2) - \tilde\ell_j(\Theta_1)),
\qquad
\omega_2(x_i)^2 (\tilde\ell_i(\Theta_2) - \tilde\ell_i(\Theta_1)) \le \omega_2(x_j)^2 (\tilde\ell_j(\Theta_2) - \tilde\ell_j(\Theta_1)).
$$

Here, we consider the situation where $q_i(\Theta_1)$ and $q_i(\Theta_2)$ are small across $i$, the situation where a bound on the numeric error is most relevant. For $q(\Theta_1)$ and $q(\Theta_2)$ sufficiently small, $\tilde\Upsilon^0_{c_*/q(\Theta)}$ is strictly increasing in $q$. Further, the assumption $\Theta_2 = a\Theta_1$ for some $a > 1$ implies $q(\Theta_1)/q_i(\Theta_1) = q(\Theta_2)/q_i(\Theta_2)$ and $q(\Theta_2) > q(\Theta_1)$. Together, these facts imply $\tilde\ell_i(\Theta_2) > \tilde\ell_i(\Theta_1)$. Uniformity of the bounds (38) along with $\omega_1(\cdot)^2 + \omega_2(\cdot)^2 = 1$ gives

(40)
$$
\begin{aligned}
\omega_1(x_i)^2 (\tilde\ell_i(\Theta_1) - \tilde\ell_i(\Theta_2)) - \omega_1(x_j)^2 (\tilde\ell_j(\Theta_1) - \tilde\ell_j(\Theta_2)) &= \tilde\ell_j(\Theta_2) - \tilde\ell_i(\Theta_2), \\
\omega_2(x_j)^2 (\tilde\ell_j(\Theta_1) - \tilde\ell_j(\Theta_2)) - \omega_2(x_i)^2 (\tilde\ell_i(\Theta_1) - \tilde\ell_i(\Theta_2)) &= \tilde\ell_j(\Theta_1) - \tilde\ell_i(\Theta_1).
\end{aligned}
$$

Combining (39) and (40) with the fact that $\tilde\ell_i(\Theta)$ is an increasing function of $q_i(\Theta)$ for small $q_i(\Theta)$ gives

(41)
$$
q_j(\Theta_1) < q_i(\Theta_1) \quad \text{and} \quad q_j(\Theta_2) < q_i(\Theta_2).
$$

That is, a uniform bound on (38) is achieved by an experimental design $X$ which has smaller separation distance, with respect to either $d_{\Theta_1}$ or $d_{\Theta_2}$, in regions with more emphasis on the local, more quickly decaying, correlation and less emphasis on the global, more slowly decaying, correlation. Note that in the numeric accuracy context, the global and local emphases, for small $q_i(\Theta_1)$ and $q_i(\Theta_2)$, at $x_i$ are given concretely by $\omega_1(x_i)^2 (\tilde\ell_i(\Theta_2) - \tilde\ell_i(\Theta_1))$ and $\omega_2(x_i)^2 (\tilde\ell_i(\Theta_2) - \tilde\ell_i(\Theta_1))$, respectively.

Example near optimal 23 run experimental designs for the numeric situations described in Case 1 (stationarity) and Case 2 (non-stationary correlation) are illustrated in the left and right panels, respectively, of Figure 2. For both cases, $\varphi(d) = \exp\{-d^2\}$, the so-called Gaussian correlation function. Despite its widespread use, this correlation function has particularly poor numeric properties and requires quite large $\Theta_1$ and $\Theta_2$ to achieve reasonable numeric performance. For Case 2, $\omega_1(u)^2 = 1 - (1 + \exp\{-25(u_1 - 1/2)\})^{-1}$, $\omega_2(u)^2 = (1 + \exp\{-25(u_1 - 1/2)\})^{-1}$, $\Theta_1 = 40 \cdot I_2$, $\Theta_2 = 100 \cdot I_2$, and $\sigma^2 = 1$. As expected, in the first panel, illustrating the stationary situation, the design points lie near a triangular lattice (subject to edge effects), similar to but expanded towards the edges of the design space relative to the nominal design. Similarly, in the second panel, illustrating the non-stationary correlation situation, the design points on the right-hand side, where the shorter range, more quickly decaying, correlation is emphasized, are more dense than on the left-hand side, where the longer range, more slowly decaying, correlation is emphasized. While the provided bounds hold for all $c_* > 0$, the actual value of the bounds depends on the selected value of $c_*$. Here, we take $c_* = 1.1 \times 12 \left( \pi\Gamma^2(d/2 + 1)/18 \right)^{1/(d+1)}$. Similarly to the nominal examples, for the stationary situation, the optimization routine was initialized at a triangular lattice, scaled to maximize the separation distance, while for the non-stationary situation, a homotopy continuation [10] approach along with Nelder-Mead [16] was used.

5. Parameter Estimation. Consider maximum likelihood estimation and let $E$ denote the expectation conditional on $X$ and $f(X)$. Then, for $n$ not too small,

(42)
$$
E\{\hat f_{\vartheta^*}(x) - \hat f_{\hat\vartheta}(x)\}^2
\approx \frac{\partial \hat f_{\vartheta^*}(x)}{\partial \vartheta_*'}\,\mathrm{Var}(\hat\vartheta)\,\frac{\partial \hat f_{\vartheta^*}(x)}{\partial \vartheta_*}
\approx \frac{\partial \hat f_{\vartheta^*}(x)}{\partial \vartheta_*'}\, I(\vartheta_*)^{-1}\,\frac{\partial \hat f_{\vartheta^*}(x)}{\partial \vartheta_*},
$$

where $I(\vartheta_*) = E\left( \frac{\partial \ell}{\partial \vartheta_*}\frac{\partial \ell}{\partial \vartheta_*'} \right)$ denotes the information matrix and $\ell$ denotes the log-likelihood of the data $f(X)$. Roughly, a high-quality design for parameter estimation will have $\left\| \partial \hat f_{\vartheta^*}(x)/\partial \vartheta_* \right\|_2$ small and $\lambda_{\min}(I(\vartheta_*))$ large.

Arrange the vector of parameters as $\vartheta = (\beta' \; \theta')'$ and $\theta = (\sigma^2 \; \varrho')'$. Throughout Section 5, we will use matrix differentiation; see, for example, [12].
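The approximation (42) is the usual delta method: a gradient of the predictor in the parameters, weighted by the inverse information. Below is a generic finite-difference sketch of our own; `predictor` and `info_inv` are hypothetical user-supplied pieces, with the information matrix assembled from the block expressions derived in the remainder of this section.

```python
import numpy as np

def pred_var_delta(x, vartheta, predictor, info_inv, eps=1e-6):
    """Delta-method approximation (42): grad' I(vartheta)^{-1} grad, with the
    gradient of vartheta -> predictor(x, vartheta) taken by central differences."""
    grad = np.zeros_like(vartheta, dtype=float)
    for i in range(len(vartheta)):
        e = np.zeros_like(vartheta, dtype=float)
        e[i] = eps
        grad[i] = (predictor(x, vartheta + e) - predictor(x, vartheta - e)) / (2 * eps)
    return float(grad @ info_inv @ grad)
```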

Figure 2. Left Panel: Near optimal experimental design with respect to numeric error for stationary correlation. Right Panel: Near optimal experimental design with respect to numeric error for the Case 2 model of non-stationary correlation with $\varphi(d) = \exp\{-d^2\}$, $\omega_1(u)^2 = 1 - (1 + \exp\{-25(u_1 - 1/2)\})^{-1}$, $\omega_2(u)^2 = (1 + \exp\{-25(u_1 - 1/2)\})^{-1}$, $\Theta_1 = 40 \cdot I_2$, and $\Theta_2 = 100 \cdot I_2$.

Then, the vector of derivatives of the emulator with respect to the unknown parameter values, $\partial \hat f_\vartheta(x)/\partial \vartheta$, has block components

(43)
$$
c_1 = \frac{\partial \hat f_\vartheta(x)}{\partial \beta}
= \frac{\partial}{\partial \beta}\left\{ h(x)'\beta + \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}(f(X) - H(X)\beta) \right\}
= h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x),
\qquad
c_2 = \frac{\partial \hat f_\vartheta(x)}{\partial \sigma^2} = 0.
$$

Developing an expression for $\partial \hat f_\vartheta(x)/\partial \varrho$ is more complex and broken into a few parts. Let $\delta(X) = f(X) - H(X)\beta$ and let $u$ denote the dimension of $\varrho$. Then,

(44)
$$
c_3 = \frac{\partial \hat f_\vartheta(x)}{\partial \varrho}
= \left( \frac{\partial \Psi_\theta(x, X)}{\partial \varrho}\Psi_\theta(X, X)^{-1}
+ (I_u \otimes \Psi_\theta(x, X)) \frac{\partial \Psi_\theta(X, X)^{-1}}{\partial \varrho} \right) \delta(X).
$$

Note that
$$
0 = \frac{\partial \Psi_\theta(X, X)\Psi_\theta(X, X)^{-1}}{\partial \varrho}
= \frac{\partial \Psi_\theta(X, X)}{\partial \varrho}\Psi_\theta(X, X)^{-1}
+ (I_u \otimes \Psi_\theta(X, X)) \frac{\partial \Psi_\theta(X, X)^{-1}}{\partial \varrho}.
$$

So,

(45)
$$
\frac{\partial \Psi_\theta(X, X)^{-1}}{\partial \varrho}
= -(I_u \otimes \Psi_\theta(X, X)^{-1}) \frac{\partial \Psi_\theta(X, X)}{\partial \varrho}\Psi_\theta(X, X)^{-1}.
$$

Plugging (45) into equation (44) gives the third block component

(46)
$$
c_3 = \frac{\partial \hat f_\vartheta(x)}{\partial \varrho}
= \left( \frac{\partial \Psi_\theta(x, X)}{\partial \varrho}
- (I_u \otimes \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}) \frac{\partial \Psi_\theta(X, X)}{\partial \varrho} \right) \Psi_\theta(X, X)^{-1} \delta(X).
$$

Now, we develop an expression for $I(\vartheta_*)$. We consider a fixed underlying kernel $\Phi$ and parameters $\varrho$ which rescale the input differences. This is a special case of our Case 1 (stationarity) assumption which is commonly used in practice. Results for a more general input rescaling as described in Case 1 would be broadly similar, albeit more complex. Results for Case 2 (non-stationary correlation) would require several additional modeling assumptions and are not developed here. Throughout the parameter estimation section, take $\Psi_\theta(u, v) = \sigma^2 \varphi(\|\mathrm{diag}\{\varrho\}(u - v)\|_2)$ and define $\Phi_\varrho(u - v) = \varphi(\|\mathrm{diag}\{\varrho\}(u - v)\|_2)$ and $\Phi(u - v) = \varphi(\|u - v\|_2)$, so that

(47)
$$
\Psi_\theta(u, v) = \sigma^2 \Phi_\varrho(u - v) = \sigma^2 \Phi(\mathrm{diag}\{\varrho\}(u - v)).
$$

Then, up to an additive constant, the log-likelihood is
$$
\ell = -\frac{1}{2} \log\det \Psi_\theta(X, X) - \frac{1}{2}(f(X) - H(X)\beta)'\Psi_\theta(X, X)^{-1}(f(X) - H(X)\beta).
$$
So,

(48)
$$
\frac{\partial \ell}{\partial \beta} = H(X)'\Psi_\theta(X, X)^{-1}(f(X) - H(X)\beta),
\qquad
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(f(X) - H(X)\beta)'\Phi_\varrho(X - X)^{-1}(f(X) - H(X)\beta).
$$

The derivative of $\ell$ with respect to $\varrho$ can be broken into three parts via the chain rule,

(49)
$$
\frac{\partial \ell}{\partial \varrho}
= \underbrace{\frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X))'}{\partial \varrho}}_{A}
\underbrace{\frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)}}_{B}
\underbrace{\frac{\partial \ell}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)^{-1}}}_{C}.
$$

These parts can be treated in turn. First, consider part B. Similarly to (45),

(50)
$$
\begin{aligned}
0 = \frac{\partial (\mathrm{vec}\,(\Psi_\theta(X, X)^{-1}\Psi_\theta(X, X)))'}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)}
&= (\Psi_\theta(X, X) \otimes I_n)\frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)}
+ (I_n \otimes \Psi_\theta(X, X)^{-1})\frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X))'}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)} \\
\implies
\frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)}
&= -(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1}).
\end{aligned}
$$

Next, consider part C,

(51)
$$
\frac{\partial \ell}{\partial\, \mathrm{vec}\,\Psi_\theta(X, X)^{-1}}
= \frac{1}{2}\left[ \mathrm{vec}\,\Psi_\theta(X, X) - (f(X) - H(X)\beta) \otimes (f(X) - H(X)\beta) \right].
$$

For part A,
$$
\frac{\partial \{\Psi_\theta(X, X)\}}{\partial \varrho}
= \sigma^2 \frac{\partial \Phi(\mathrm{diag}\{\varrho\}(x_i - x_j))}{\partial \varrho}
= \sigma^2 \frac{\partial (\mathrm{diag}\{\varrho\}(x_i - x_j))'}{\partial \varrho}\,\nabla\Phi(\mathrm{diag}\{\varrho\}(x_i - x_j))
= \sigma^2\,\mathrm{diag}\{x_i - x_j\}\,\nabla\Phi(\mathrm{diag}\{\varrho\}(x_i - x_j)).
$$

Let $C_\theta$ denote the $n^2 \times d$ matrix whose $[n(j - 1) + i]$th row is $\sigma^2\,\nabla\Phi(\mathrm{diag}\{\varrho\}(x_i - x_j))'\,\mathrm{diag}\{x_i - x_j\}$. Then,

(52)
$$
C_\theta' = \frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X))'}{\partial \varrho}.
$$
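For the Gaussian kernel $\Phi(t) = \exp\{-t't\}$, the gradient is $\nabla\Phi(t) = -2t\exp\{-t't\}$, so $C_\theta$ can be assembled row by row. A sketch of our own follows; the double loop favors clarity over speed.

```python
import numpy as np

def C_theta(X, rho, sigma2=1.0):
    """C_theta from (52) for Phi(t) = exp{-t't}: row [n(j-1)+i] equals
    sigma^2 * grad Phi(diag(rho)(x_i - x_j))' diag(x_i - x_j)."""
    n, d = X.shape
    C = np.empty((n * n, d))
    for j in range(n):
        for i in range(n):
            dx = X[i] - X[j]
            t = rho * dx
            grad = -2.0 * t * np.exp(-(t @ t))   # grad Phi at diag(rho)(x_i - x_j)
            C[n * j + i] = sigma2 * grad * dx    # right-multiplied by diag(x_i - x_j)
    return C

rng = np.random.default_rng(0)
C = C_theta(rng.random((23, 2)), rho=np.array([3.0, 3.0]))
M = C.T @ C
print(np.linalg.eigvalsh(M))   # lambda_min and kappa of C'C enter Theorem 5.1
```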

Equations (50), (51), and (52) can be plugged into equation (49) to give
$$
\frac{\partial \ell}{\partial \varrho}
= -\frac{1}{2} C_\theta'(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1})
\left[ \mathrm{vec}\,\Psi_\theta(X, X) - (f(X) - H(X)\beta) \otimes (f(X) - H(X)\beta) \right]
= -\frac{1}{2} C_\theta' \left[ \mathrm{vec}\,\Psi_\theta(X, X)^{-1} - \Psi_\theta(X, X)^{-1}\delta(X) \otimes \Psi_\theta(X, X)^{-1}\delta(X) \right],
$$
where $\delta(X) = f(X) - H(X)\beta$. So, the information matrix has block components

(53)
$$
\begin{aligned}
I_{11} = I(\beta, \beta) &= -E\, \frac{\partial^2 \ell}{\partial \beta \partial \beta'} = H(X)'\Psi_\theta(X, X)^{-1}H(X), \\
I_{21} = I(\sigma^2, \beta) &= -E\, \frac{\partial^2 \ell}{\partial \sigma^2 \partial \beta'}
= \frac{1}{\sigma^4} E (f(X) - H(X)\beta)'\Phi_\varrho(X - X)^{-1}H(X) = 0', \\
I_{31} = I(\varrho, \beta) &= -E\, \frac{\partial^2 \ell}{\partial \varrho \partial \beta'}
= C_\theta' \left( \Psi_\theta(X, X)^{-1}H(X) \otimes \Psi_\theta(X, X)^{-1} E(f(X) - H(X)\beta) \right) = 0, \\
I_{22} = I(\sigma^2, \sigma^2) &= -E\, \frac{\partial^2 \ell}{\partial \sigma^2 \partial \sigma^2}
= -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6} E (f(X) - H(X)\beta)'\Phi_\varrho(X - X)^{-1}(f(X) - H(X)\beta) \\
&= -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}\,\mathrm{trace}\left( \Phi_\varrho(X - X)^{-1}\sigma^2\Phi_\varrho(X - X) \right) = \frac{n}{2\sigma^4}, \\
I_{32} = I(\varrho, \sigma^2) &= -E\, \frac{\partial^2 \ell}{\partial \varrho \partial \sigma^2}
= \frac{1}{2} C_\theta' E\left( -\frac{1}{\sigma^4}\,\mathrm{vec}\,\Phi_\varrho(X - X)^{-1}
+ \frac{2}{\sigma^6}\,\Phi_\varrho(X - X)^{-1}\delta(X) \otimes \Phi_\varrho(X - X)^{-1}\delta(X) \right) \\
&= \frac{1}{2} C_\theta' \left( -\frac{1}{\sigma^4}\,\mathrm{vec}\,\Phi_\varrho(X - X)^{-1} + \frac{2}{\sigma^4}\,\mathrm{vec}\,\Phi_\varrho(X - X)^{-1} \right)
= \frac{1}{2\sigma^4} C_\theta'\,\mathrm{vec}\,\Phi_\varrho(X - X)^{-1}.
\end{aligned}
$$

Developing a formula for $I(\varrho, \varrho)$ is more complex and broken into parts.

(54)
$$
\begin{aligned}
I(\varrho, \varrho) = -E\, \frac{\partial^2 \ell}{\partial \varrho \partial \varrho'}
&= \frac{1}{2} E\left( I_d \otimes \left[ (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})' - (\delta(X)'\Psi_\theta(X, X)^{-1} \otimes \delta(X)'\Psi_\theta(X, X)^{-1}) \right] \right) \frac{\partial C_\theta}{\partial \varrho} \\
&\quad + \frac{1}{2} E\left( -\frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'}{\partial \varrho}
+ \frac{\partial (\delta(X)'\Psi_\theta(X, X)^{-1} \otimes \delta(X)'\Psi_\theta(X, X)^{-1})}{\partial \varrho} \right) C_\theta.
\end{aligned}
$$

Note that the expectation of the first term in (54) is zero, since
$$
E\left( \delta(X)'\Psi_\theta(X, X)^{-1} \otimes \delta(X)'\Psi_\theta(X, X)^{-1} \right)
= E\left( \mathrm{vec}\,(\Psi_\theta(X, X)^{-1}\delta(X)\delta(X)'\Psi_\theta(X, X)^{-1}) \right)'
= (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'.
$$

So,

(55)
$$
I(\varrho, \varrho) = \frac{1}{2}\left( -C_\theta'(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1})
- E\, \frac{\partial \left( \mathrm{vec}\,(\Psi_\theta(X, X)^{-1}\delta(X)\delta(X)'\Psi_\theta(X, X)^{-1}) \right)'}{\partial \varrho} \right) C_\theta.
$$

The expectation in (55) is

(56)
$$
\begin{aligned}
E\, \frac{\partial \left( \mathrm{vec}\,(\Psi_\theta(X, X)^{-1}\delta(X)\delta(X)'\Psi_\theta(X, X)^{-1}) \right)'}{\partial \varrho}
&= \frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'}{\partial \varrho}\left( E\,\delta(X)\delta(X)'\Psi_\theta(X, X)^{-1} \otimes I_n \right) \\
&\quad + \frac{\partial (\mathrm{vec}\,\Psi_\theta(X, X)^{-1})'}{\partial \varrho}\left( I_n \otimes E\,\delta(X)\delta(X)'\Psi_\theta(X, X)^{-1} \right) \\
&= -2 C_\theta'(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1}).
\end{aligned}
$$

Plugging (56) into (55) gives

(57)
$$
I_{33} = I(\varrho, \varrho) = \frac{1}{2} C_\theta'(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1}) C_\theta.
$$

Using partitioned matrix inverse results [9] and noting that $c_2$, $I_{21}$, $I_{12}$, $I_{31}$, and $I_{13}$ are matrices of zeros, the expression for the approximate expected parameter estimation error (42) is

(58)
$$
E\{\hat f_{\vartheta^*}(x) - \hat f_{\hat\vartheta}(x)\}^2
\approx c_1' I_{11}^{-1} c_1 + c_3' (I_{33} - I_{32}I_{22}^{-1}I_{23})^{-1} c_3.
$$

The first term on the right-hand side of (58) can be bounded above as

(59)
$$
\begin{aligned}
c_1' I_{11}^{-1} c_1
&= \left( h(x)' - \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1}H(X) \right)
\left( H(X)'\Psi_\theta(X, X)^{-1}H(X) \right)^{-1}
\left( h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right) \\
&\le \frac{\lambda_{\max}(\Psi_\theta(X, X))}{\lambda_{\min}(H(X)'H(X))}
\left\| h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right\|_2^2 \\
&\le \frac{n \sup_{u,v \in \Omega} \Psi_\theta(u, v)}{\lambda_{\min}(H(X)'H(X))}
\left\| h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right\|_2^2.
\end{aligned}
$$

The second term on the right-hand side of the approximate parameter estimation error expression (58) has

(60)
$$
\begin{aligned}
c_3' (I_{33} - I_{32}I_{22}^{-1}I_{23})^{-1} c_3
&= c_3' \left[ I_{33}^{-1} + I_{33}^{-1}I_{32}\left( I_{22} - I_{23}I_{33}^{-1}I_{32} \right)^{-1} I_{23}I_{33}^{-1} \right] c_3 \\
&= c_3' I_{33}^{-1/2} \left[ I_d + I_{33}^{-1/2}I_{32}\left( I_{22} - I_{23}I_{33}^{-1}I_{32} \right)^{-1} I_{23}I_{33}^{-1/2} \right] I_{33}^{-1/2} c_3 \\
&\le \|c_3\|_2^2 \,\lambda_{\max}\left( I_{33}^{-1} \right)\,
\lambda_{\max}\left( I_d + I_{33}^{-1/2}I_{32}\left( I_{22} - I_{23}I_{33}^{-1}I_{32} \right)^{-1} I_{23}I_{33}^{-1/2} \right) \\
&= \|c_3\|_2^2 \,\lambda_{\max}\left( I_{33}^{-1} \right)
\Bigg( 1 + \underbrace{\frac{I_{23}I_{33}^{-1}I_{32}}{I_{22} - I_{23}I_{33}^{-1}I_{32}}}_{D} \Bigg).
\end{aligned}
$$

Note that part D is an increasing function of $I_{23}I_{33}^{-1}I_{32}$, which can be bounded above as

(61)
$$
\begin{aligned}
I_{23}I_{33}^{-1}I_{32}
&= \frac{1}{2\sigma^4}(\mathrm{vec}\,\Psi_\theta(X, X)^{-1})' C_\theta
\left( C_\theta'(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1})C_\theta \right)^{-1}
C_\theta'\,\mathrm{vec}\,\Psi_\theta(X, X)^{-1} \\
&\le \frac{1}{2\sigma^4} \frac{\|\mathrm{vec}\,\Psi_\theta(X, X)^{-1}\|_2^2\,\lambda_{\max}(C_\theta C_\theta')}{\lambda_{\min}(\Psi_\theta(X, X)^{-2})\,\lambda_{\min}(C_\theta'C_\theta)}
\le \frac{n}{2\sigma^4}\,\kappa(C_\theta'C_\theta)\,\kappa(\Psi_\theta(X, X))^2,
\end{aligned}
$$

where $\kappa(A) = \lambda_{\max}(A)/\lambda_{\min}(A)$, the condition number of the diagonalizable matrix $A$. The first equality in (61) follows from the expressions for $I_{32}$ and $I_{33}$ in (53) and (57), and the inequalities follow from properties of eigenvalues. Further, the term $I_{22} = \frac{n}{2\sigma^4}$ does not depend on the design configuration. The term $\lambda_{\max}(I_{33}^{-1})$ can be bounded above as

(62)
$$
\lambda_{\max}(I_{33}^{-1}) = \frac{1}{\lambda_{\min}(I_{33})}
= \frac{2}{\lambda_{\min}\left( C_\theta'(\Psi_\theta(X, X)^{-1} \otimes \Psi_\theta(X, X)^{-1})C_\theta \right)}
\le \frac{2\lambda_{\max}(\Psi_\theta(X, X))^2}{\lambda_{\min}\left( C_\theta'C_\theta \right)}
\le \frac{2\left( n \sup_{u,v \in \Omega} \Psi_\theta(u, v) \right)^2}{\lambda_{\min}\left( C_\theta'C_\theta \right)},
$$

where the final inequality follows from Gershgorin's theorem [22]. We summarize this development in the following theorem.

Theorem 5.1. Suppose $f(\cdot) \sim \mathrm{GP}(h(\cdot)'\beta, \sigma^2\varphi(\|\mathrm{diag}\{\varrho\}(\cdot - \cdot)\|_2))$ for fixed, known regression functions $h(\cdot)$ and positive definite $\varphi(\|\mathrm{diag}\{\varrho\}(\cdot - \cdot)\|_2)$. Let $\hat\vartheta$ denote the maximum likelihood estimator of the unknown parameters $\vartheta = (\beta' \; \sigma^2 \; \varrho')'$. Then, an approximate upper bound for $E\{\hat f_{\vartheta^*}(x) - \hat f_{\hat\vartheta}(x)\}^2$ is given by

(63)
$$
\frac{n \sup_{u,v \in \Omega} \Psi_\theta(u, v)}{\lambda_{\min}(H(X)'H(X))}
\left\| h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x) \right\|_2^2
+ 2\|c_3\|_2^2
\left( \frac{\left( n \sup_{u,v \in \Omega} \Psi_\theta(u, v) \right)^2}{\lambda_{\min}\left( C_\theta'C_\theta \right)} \right)
\left( \frac{1}{1 - \kappa(C_\theta'C_\theta)\,\kappa(\Psi_\theta(X, X))^2} \right),
$$

where $H(X)$ has rows $h(x_i)'$, $c_3$ is defined in (46), and $C_\theta$ is defined immediately before (52). This upper bound is approximate in the sense that for a sequence of experimental designs for which the maximum likelihood estimates converge, the probability that the upper bound is violated goes to zero.

Both $h(x) - H(X)'\Psi_\theta(X, X)^{-1}\Psi_\theta(X, x)$ and $c_3 = \frac{\partial \Psi_\theta(x, X)}{\partial \varrho} - (I_u \otimes \Psi_\theta(x, X)\Psi_\theta(X, X)^{-1})\frac{\partial \Psi_\theta(X, X)}{\partial \varrho}$ are nominal interpolation errors, respectively for the regression functions and (the transpose of) the Jacobian of $\Psi_\theta(X, x)$ with respect to the correlation parameters. As discussed towards the end of Section 3, we expect the norms of both of these interpolation errors to behave in a manner similar to Gaussian process or RKHS interpolation. That is, the norms of both of these terms will be small for experimental designs which are high-quality with respect to nominal error. The term $\|\delta(X)\|_2 = \|f(X) - H(X)\beta\|_2$ is a Monte Carlo approximation to $\sqrt{n}\left\| f(\cdot) - h(\cdot)'\beta \right\|_{L_2(\Omega)}$

with respect to the distribution of the data. Further, as discussed in the nominal error section, the $\lambda_{\min}(H(X)'H(X))$ term in the denominator of (59) implies a balance between traditional design properties, pushing points towards the boundaries of the design space for linear regression functions, for example, and space-filling properties. The term $\kappa(\Psi_\theta(X, X))$ can be controlled by ensuring that design points do not become poorly separated, as discussed in Section 4. Theorem 5.1 and the above discussion imply that the parameter estimation error can be controlled by ensuring that the experimental design has good nominal and numeric properties, in addition to controlling $\lambda_{\min}(C_\theta'C_\theta)$ and $\kappa(C_\theta'C_\theta)$. The matrix $C_\theta'C_\theta$ is, in general, a sum of outer products, $\sum_{i,j} c_{ij}c_{ij}'$, where the $c_{ij}$ are given by

(64)
$$
c_{ij} = \sigma^2 \nabla\Phi(\mathrm{diag}\{\varrho\}(x_i - x_j))'\,\mathrm{diag}\{x_i - x_j\}.
$$

Now, we examine the terms $c_{ij}$ in (64). We will restrict our attention to underlying kernels $\Phi$ which are radially decreasing in the sense that $\Phi(\delta_1) \ge \Phi(\delta_2)$ if $\|\delta_1\|_2 \le \|\delta_2\|_2$. For radially decreasing underlying kernels $\Phi$, the term $c_{ij}$ is near zero if $x_i - x_j$ is near zero or far from zero, while the term $c_{ij}$ has negative components if the difference $x_i - x_j$ is slightly beyond the location where $\Phi(\mathrm{diag}\{\varrho\}(\cdot))$ is decreasing most rapidly along each coordinate axis. See, for example, Figure 3, showing both components of $\nabla\Phi(\mathrm{diag}\{\varrho\}(\cdot))$ in the upper panels and both components of $\nabla\Phi(\mathrm{diag}\{\varrho\}(\cdot))'\,\mathrm{diag}\{\cdot\}$ in the lower panels, for $\Phi(d) = \exp\{-d'd\}$ and $\varrho = (1 \; 2)'$. Pairs of points $x_i$, $x_j$ whose difference lies slightly beyond the location where $\Phi(\mathrm{diag}\{\varrho\}(\cdot))$ is decreasing most rapidly along each coordinate axis have potential to increase eigenvalues of $C_\theta'C_\theta$. $\lambda_{\min}(C_\theta'C_\theta)$ is large and $\kappa(C_\theta'C_\theta)$ is small for sets of differences $\{x_i - x_j\}$ which balance the differences along coordinate axes, in the sense that
$$
n_k \max_d \left( \left\{ \nabla\Phi(\mathrm{diag}\{\varrho\}(d))'\,\mathrm{diag}\{d\} \right\}_k \right)^2
\approx n_l \max_d \left( \left\{ \nabla\Phi(\mathrm{diag}\{\varrho\}(d))'\,\mathrm{diag}\{d\} \right\}_l \right)^2,
$$
for $k, l = 1, \ldots, d$, where $\{\nabla\Phi(\mathrm{diag}\{\varrho\}(d))'\,\mathrm{diag}\{d\}\}_k$ denotes element $k$ of $\nabla\Phi(\mathrm{diag}\{\varrho\}(d))'\,\mathrm{diag}\{d\}$ and $n_k$ denotes the number of differences (of length slightly beyond the location where $\Phi(\mathrm{diag}\{\varrho\}(\cdot))$ is decreasing most rapidly) along coordinate axis $k$. In the example described above and illustrated in Figure 3, $\{\nabla\Phi(\mathrm{diag}\{\varrho\}(d))'\,\mathrm{diag}\{d\}\}_1^2 \approx (0.8)^2$ and $\{\nabla\Phi(\mathrm{diag}\{\varrho\}(d))'\,\mathrm{diag}\{d\}\}_2^2 \approx (0.4)^2$, so an experimental design solely targeting the eigenvalues of $C_\theta'C_\theta$ would have roughly $n_1$ differences $x_i - x_j = (\pm 1 \; 0)'$ and $n_2$ differences $x_i - x_j = (0 \; \pm 0.5)'$, where

$$
n_1(0.8)^2 = n_2(0.4)^2 \implies n_1 = \frac{n_2}{4}.
$$

Figure 3. Upper Panels: Both components of $\nabla\Phi(\mathrm{diag}\{\varrho\}(\cdot))$. Lower Panels: Both components of $\nabla\Phi(\mathrm{diag}\{\varrho\}(\cdot))'\,\mathrm{diag}\{\cdot\}$. $\Phi(d) = \exp\{-d'd\}$ and $\varrho = (1 \; 2)'$.

Consider an example situation with $\Phi(d) = \exp\{-d'd\}$ and $\varrho = (3 \; 3)'$. An experimental design maximizing $\lambda_{\min}(C_\theta'C_\theta)$ and minimizing $\kappa(C_\theta'C_\theta)$ is shown in the left panel of Figure 4. There are 11 points at the middle location and 3 at each peripheral location. In particular, the design which is optimal with respect to $C_\theta'C_\theta$ is not space-filling. A near optimal experimental design with respect to the upper bound in Theorem 5.1 is shown in the right panel of Figure 4. The influence of $\lambda_{\min}(C_\theta'C_\theta)$ and $\kappa(C_\theta'C_\theta)$ is substantially less than the influence of the space-filling properties controlling the nominal and numeric error.

Figure 4. Left Panel: Experimental design maximizing $\lambda_{\min}(C_\theta'C_\theta)$ and minimizing $\kappa(C_\theta'C_\theta)$ for $\Phi(d) = \exp\{-d'd\}$ and $\varrho = (3 \; 3)'$. Note that there are 11 points at the middle location and 3 at each peripheral location. Right Panel: Near optimal experimental design with respect to the upper bound in Theorem 5.1.

6. Discussion. Broadly applicable and rigorously justified principles of experimental design for Gaussian process emulation of deterministic computer experiments have been developed. The space-filling properties "small fill distance" and "large separation distance", potentially with respect to an input space rescaling to accommodate varying rates of correlation decay depending on displacement orientation, are only weakly conflicting and ensure well-controlled nominal, numeric, and parameter estimation error. The presence of non-stationarity in correlation requires a higher density of input locations in regions with more emphasis on the local, more quickly decaying, correlation, relative to input locations in regions with more emphasis on the global, more slowly decaying, correlation. The inclusion of regression functions results in near optimal designs which balance the traditional experimental design properties of the regression functions with space-filling properties, while consideration of error in parameter estimation results in near optimal designs slightly favoring pairs of input locations having particular lengths and orientations. The influence on the accuracy of emulation of regression functions and error in parameter estimation appears to be substantially less than the influence of the space-filling properties "small fill distance" and "large separation distance".

It is noteworthy that a model of effect sparsity, where a number of input variables have little or no impact on the response, can be obtained by taking the corresponding row and column entries of $\Theta$ near zero. In the context of this effect sparsity model, experimental designs which are space-filling in lower dimensional projections would be favored.

This work has several limitations. All results are in terms of controlling error rates with upper bounds. Actual error rates (of the nominal, numeric, or parameter estimation variety) could be somewhat less in a particular situation. A notable example is the minimum integrated mean squared error twin point designs of [5]. Further, no consideration is given to numeric error in parameter estimation, and this error could be substantial, especially if the design is poor with respect to information about the parameters. However, given the secondary importance of experimental design properties specific to parameter estimation, this source of error is not expected to strongly impact the error in interpolation. Also, the discussed model for non-stationarity accounts for only non-constant correlation decay across the input space and, in particular, does not allow non-constant underlying variability in the Gaussian process model. However, non-constant underlying variability can be modeled as $\Psi(u, v) = \sigma(u)\sigma(v)\Phi(u - v)$, and this non-stationary model behaves intuitively, with regions having more underlying variability requiring a higher density of points than regions having relatively less variability. The results follow in a manner similar to non-stationarity in correlation, although they are in fact simpler, and this development is omitted due to space constraints. Lastly, the impact on interpolator accuracy of low-order functional ANOVAs, which might also be expected to favor designs which are space-filling in lower dimensional projections, has not been examined.

REFERENCES

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337-404.
[2] Aurenhammer, F. (1991). Voronoi diagrams - a survey of a fundamental geometric data structure. ACM Comp. Surv. 23 345-405.
[3] Ba, S. and Joseph, V. R. (2012). Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 6 1838-1860.
[4] Bartle, R. G. (1995). The Elements of Integration and Lebesgue Measure. Wiley, New York.
[5] Crary, S. B. (2002). Design of computer experiments for metamodel generation. Analog Integrated Circuits and Signal Processing 32 7-16.
[6] Ellis, A. M., Garcia, A. J., Focks, D. A., Morrison, A. C., and Scott, T. W. (2011). Parameterization and sensitivity analysis of a complex simulation model for mosquito population dynamics, dengue transmission, and their control. The American Journal of Tropical Medicine and Hygiene 85 257-264.
[7] Golub, G. H. and Van Loan, C. F. (1989). Matrix Computations, 2nd ed. Johns Hopkins University Press, Baltimore.
[8] Haaland, B. and Qian, P. Z. G. (2011). Accurate emulators for large-scale computer experiments. Ann. Statist. 39 2974-3002.
[9] Harville, D. A. (2008). Matrix Algebra From a Statistician's Perspective. Springer, New York.
[10] Eaves, B. C. (1972). Homotopies for computation of fixed points. Math. Prog. 3 1-22.
[11] Ipsen, I. C. F. and Nadler, B. (2009). Refined perturbation bounds for eigenvalues of Hermitian and non-Hermitian matrices. SIAM Journal on Matrix Analysis and Applications 31 40-53.
[12] Magnus, J. R. and Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics, revised ed. Wiley, New York.
[13] Mahalanobis, P. C. (1936). On the generalised distance in statistics. Proc. Nat. Inst. Sci. India 2 49-55.
[14] McKay, M. D., Conover, W. J., and Beckman, R. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21 239-245.
[15] Narcowich, F. J. and Ward, J. D. (1992). Norm estimates for the inverses of a general class of scattered-data radial-function interpolation matrices. Journal of Approximation Theory 69 84-109.
[16] Nocedal, J. and Wright, S. J. (1999). Numerical Optimization, Vol. 2. Springer, New York.
[17] Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statist. Sci. 4 409-423.
[18] Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer, New York.
[19] Stein, E. M. and Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, Princeton.
[20] Stein, M. L. (1987). Large sample properties of simulations using Latin hypercube sampling. Technometrics 29 143-151.
[21] Tse, K. M., Lee, H. P., Shabana, N., Loon, S. C., Watson, P. G., and Thean, S. Y. (2011). Do shapes and dimensions of scleral flap and sclerostomy influence aqueous outflow in trabeculectomy? A finite element simulation approach. British Journal of Ophthalmology 96 432-437.
[22] Varga, R. S. (2004). Geršgorin and His Circles. Springer-Verlag, Berlin.
[23] Watson, G. N. (1995). A Treatise on the Theory of Bessel Functions. Cambridge University Press, New York.
[24] Wendland, H. (2005). Scattered Data Approximation. Cambridge University Press, New York.
[25] Wu, C. J. (2014). Post-Fisherian experimentation: from physical to virtual. J. Amer. Statist. Assoc., ahead of print.

H. Milton Stewart School of Industrial and Systems Engineering 755 Ferst Drive, NW Atlanta, GA 30332-0205 USA

