Parameter space exploration with Gaussian process trees Robert B. Gramacy
[email protected] U.C. Santa Cruz Advancement Proposal June 7, 2004 Abstract Large scale computer simulations can be time-consuming to run. Sweeps over input parameters needed to obtain even qualitative understanding of the simulation output can consequently be prohibitively expensive. Thus, there is a need for computationally inexpensive surrogate models that can be used in place of simulation to adaptively select new settings of input parameters and map the response with far fewer simulation runs. This proposal outlines the foundation of a general methodology for modeling and adaptive sampling. Binary trees are used to recursively partition the input space, and Gaussian process models are fit within each partition. Trees facilitate non-stationarity and a Bayesian interpretation provides an explicit measure of predictive uncertainty that can be used to guide future sampling. Our methods are illustrated on several examples, including a motivating example involving computational fluid dynamics simulation of a NASA reentry vehicle. This document concludes with a rather large set of possible extensions of this approach that may appear in a dissertation.
1
Introduction
Computer simulation is by now an accepted tool for providing insight into complex phenomena. As computing power has advanced so too has the fidelity of the simulations. The drive towards higher fidelity simulation means that accurate modeling taxes even the fastest of computers. Computational fluid dynamics simulations in which fluid flow phenomena are modeled are an excellent example — fluid flows over complex surfaces may be modeled accurately but only at the cost of supercomputer resources. A simulation model defines a mapping (perhaps non-deterministic) from parameters describing the input to one or more output responses. Without an analytic representation of this mapping, the simulation must be run for many different inputs in order to build up an understanding of the simulation’s possible outcomes. Computational expense and/or high dimensional input usually prohibits a naive approach to the mapping of the response surface. A computationally inexpensive approximation to the simulation (O’Hagan et al., 1999) with active learning is one possible remedy. If the approximation is a good match to the simulation, then samples may be drawn in regions of the input space where the output response is changing significantly. For models which return both predictions and associated confidences, regions can be identified where the model is unsure of the response. We focus on the Gaussian process (GP) as a suitable approximation for a number of reasons. GPs are conceptually straightforward, easily accommodate prior knowledge in the form of covariance functions, and return a confidence around predictions. In spite of these benefits there are three important difficulties in applying standard GPs in our setting. Firstly, inference on the GP scales poorly with the number of data points; typically requiring time in O(N 3 ), where N is the number of data points. Secondly, GP models are usually stationary in that the same covariance structure is used throughout the entire input space. In the applications we have in mind, where subsonic flow is quite different than supersonic flow, this limitation is unacceptable. Thirdly, the error (standard deviation) associated with a predicted response under a GP
1
model does not directly depend on any of the previously observed output responses. Instead, it depends only upon the previously sampled input settings {xi }N i=1 and the correlation matrix C(xi , xj ). All of these shortcomings may be addressed by partitioning the input space into regions, and fitting separate GPs within each region. Partitioning allows for non-stationary behavior, and can ameliorate some of the computational demands (by fitting models to less data). Finally, a fully Bayesian interpretation yields uncertainty measures for predictive inference which can help direct future sampling. The foundations of the techniques in this dissertation proposal draw upon two successful previous approaches to similar problems. The use of trees and recursive partitioning for achieving non-stationarity has a long history (Breiman et al., 1984; Denison et al., 1998)— the Bayesian application of which is well worked out— and the use of GPs for active learning has also seen recent success (Seo et al., 2000). Alternative approaches for non-stationary modeling include mixtures of GPs (Tresp, 2001) and infinite mixtures of GPs (Rasmussen & Ghahramani, 2002). The remainder of the paper is structured as follows. We define the problem and review the necessary background in Section 2. Section 3 provides details on the use of Bayesian treed GP models including inference and prediction. Section 4 considers how the treed GP model is used to adaptively select input parameters, and Section 5 presents results on real and simulated data. Many of the more detailed arguments and derivations are left to the appendices. Throughout the document we allude to the shortcomings of the current approach where applicable. Finally, we conclude with a rather lengthy discussion of proposed work in Section 6, outlining many possible avenues for future research.
2
Background and related work
We model the simulation output as (Sacks et al., 1989) t(x) = β > x + w(x)
(1)
where t is the (possibly multivariate) output of the computer model, x is a particular (multivariate) input value, β are linear trend coefficients, and w(x) is a zero mean random process with covariance C(x, x0 ) = σ 2 K(x, x0 ), and K is a correlation matrix. Stationary Gaussian processes (Sacks et al., 1989; Santner et al., 2003) are a popular example of a model that fits this description. As discussed in the introduction, we require more flexibility than offered by a stationary GP. To achieve non-stationarity we turn to binary trees, using them to partition the input space, and then fit a GP to each partition. This approach bears some similarity to the models of Kim et al. (2002), who fit separate GPs in each element of a Voronoi tessellation. Our approach is better geared toward problems with a smaller number of distinct partitions, leading to a simpler overall model. Using a Voronoi tessellation allows an intricate partitioning of the space, but has the trade-off of added complexity and can produce a final model that is difficult to interpret. The complexities of this added flexibility are not warranted in our application.
2.1
Stationary Gaussian Processes
GPs are a popular kernel-based method for regression and classification. Though the method can be traced back to Kriging (Matheron, 1963), it is only recently that they have been broadly applied in machine learning. Consider a training set D = {xi , ti }N i=1 of mX -dimensional input parameters and mY -dimensional simulation outputs. We indicate the collection of inputs as the N × mX matrix X whose ith row is x> i . A GP (Seo et al., 2000) is a collection of random variables Y(x) indexed by x having a jointly Gaussian distribution for any subset of indices. It is specified by a mean µ (x) = E Y(x) and correlation function K(x, x0 ) = σ12 E [Y(x) − µ (x)][Y(x0 ) − µ (x0 )]> . Given a set of observations D, the resulting density over outputs at a new point x is easily found to be Gaussian with mean variance
yˆ(x) = k> (x)K−1 t, and > σy2ˆ (x) = σ 2 [K(x, x) − k> (x)K−1 N k (x)].
2
For simplicity we assume that the output is scalar (i.e., we are modeling each output response independently and so mY = 1) so that the image of the covariance function is a scalar. For now, the linear trend term in (1) is zero. We define k> (x) to be the N -vector whose ith component is K(x, xi ), K to be the N × N matrix with i, j element K(xi , xj ), and t to be the N -vector of observations with i component ti . It is important to note that the uncertainty, σy2ˆ (x), associated with the prediction has no direct dependence on the observed simulation outputs t. Typically, the covariance function depends on hyperparameters which are determined either by maximizing the likelihood of D or integrating over them.
2.2
Bayesian Treed Models
A tree model partitions the input space and infers a separate model within each partition. Partitioning is often done by making binary splits on the value of a single variable (e.g., speed > 0.8) so that partition boundaries are parallel to coordinate axes. Partitioning is recursive, so each new partition is a sub-partition of a previous one. For example, a first partition may divide the space in half by whether the first variable is above or below its midpoint. The second partition will then divide only the space below (or above) the midpoint of the first variable, so that there are now three partitions (not four). Since variables may be revisited, there is no loss of generality by using binary splits as multiple splits on the same variable will be equivalent to a non-binary split. These sorts of models are often referred to as Classification and Regression Trees (CART) (Breiman et al., 1984). CART has become popular because of its ease of use, clear interpretation, and ability to provide a good fit in many cases. The Bayesian approach is straightforward to apply to tree models (Chipman et al., 1998; Denison et al., 1998). Key is the specification of a meaningful prior for the size of the tree. Here we follow Chipman et. al who specify the prior through a tree-generating process. Starting with a null tree (all data in a single partition), the tree T is probabilistically split recursively, with each partition η being split with probability psplit (η, T ) = a(1 + qη )−b where qη is the depth of η in T and a and b are parameters chosen to give an appropriate size and spread to the distribution of trees. More details are available in Chipman et. al (1998). We expect a relatively small number of partitions, and choose these parameters accordingly. As part of the process prior, we further require that each new region have at least five data points, since the parameters of a GP cannot be effectively estimated if there are too few points in a partition.
3
Non-stationary GPs via Trees
For many computer models, a stationary Gaussian process is insufficient to capture the different behavior of the output in different parts of the space. For example, in the CFD example mentioned in the introduction, behavior near Mach one is quite different than behavior well below or above this level. The output is often fairly linear in local regions away from Mach one, with much more curvature around Mach one. Thus fitting a single Gaussian process model to the whole space mis-represents the correlation structure in one or both parts. We introduce a non-stationary process based on fitting a stationary Gaussian process in each partition of a tree, which allows us to divide the input space into coherent regions and fit separate mean and correlation structures in each region. To start things off we define the model conditional on a particular tree. Then, we later discuss integrating over possible trees, using reversible-jump Markov chain Monte Carlo. Prediction is also conditioned on the tree structure, and so is also averaged over in the posterior to get a full accounting of uncertainty.
3.1
Hierarchical Model
A tree T recursively partitions the input space into into R non-overlapping regions: {rν }R ν=1 . Each region rν contains data Dν = {Xν , tν }, consisting of nν observations. Each split in the tree is based on a selected dimension uj ∈ {1, . . . , mX } and an associated split criterion sj , so that one of the resulting sub-partitions consist of those observations in {Xν , tν } with the uj th parameter less than sj , and the other contains those
3
observations greater than or equal to sj . Thus, the structure of the tree is determined by a hierarchy of splitting criteria {uj , sj }, j = 1, . . . , dR/2e.
{v1 , s1 }
T:
X[:, u1] ≥ s1
X[:, u1] < s1
{v2 , s2 }
D3 = {X3 , Z3 }
X[:, u2 ] ≥ s2
X[:, u2] < s2
D1 = {X1 , Z1 }
D2 = {X2 , Z2 }
Figure 1: An example tree with two splits, resulting in R = 3 partitions. Splitting locations are chosen discretely from X: uj ∈ {1, . . . , m} chooses the splitting dimension sj ∈ X[:, vj ] column uj of X satisfying ancestral partitioning. Figure 1 shows an example tree. In this example D1 contains x’s whose u1 coordinate is less than s1 and whose u2 coordinate is less than s2 . Like D1 , D2 has x’s where coordinate u1 < s1 , but differs from D1 in that the u2 coordinate must be bigger than or equal to s2 . Finally, D3 contains the rest of the x’s differing from those in D1 and D2 because the u1 coordinate of its x’s is greater than or equal to s1 . To avoid confusion, it is worth noting that the corresponding z values accompany the x’s of each region. Given a tree T , we fit a stationary GP with linear trend (1) independently within each region. The nν × nν covariance matrix for the process in the νth region is denoted Kν and the linear trend coefficients > are β ν . We denote the full set of coefficients across all regions as β > = [β > 1 , . . . , β R ] (and similarly for all
4
other region-specific parameters). The hierarchical generative model we use is:1 tν |βν , σν2 , dν , gν ∼ N (Fν βν , σν2 Kν ), β ν |σν2 , W, β0 ∼ N (β 0 , σν2 W) β0 ∼ N (µ, B), W
σν2 −1
(2)
∼ IG(α0 /2, q0 /2), ∼ W ((ρV)−1 , ρ),
with Fν = (1, Xν ), and W is a (mX + 1)× (mX + 1) matrix. N , IG, and W are the Normal, Inverse-Gamma, and Wishart distributions, respectively. The GP correlation structure for each partition, Kν , is chosen from an isotropic power family with a fixed power p0 , but unknown range dν and nugget gν parameters: [(xj − xk )> (xj − xk )]p0 + gν δj,k (3) Kν (xj , xk ) = exp − dν where δ·,· is the Kronecker delta function. For notational convenience we continue to refer K as a correlation matrix, even though with the nugget term, g, in K(·, ·) of Eq. (3) it is no longer technically a correlation matrix. The nugget, as depicted in (3), is a way of introducing measurement error into the stochastic process which, though peculiar, has some advantages. For further details, and other thoughts on the nugget, please refer to Appendix A. Other possible correlation functions are considered as part of our proposed work. See Section 6.1.1. Parameters to the correlation function are given bimodal hierarchical priors dν ∼ G(αd1 , γd1 ) + G(αd2 , γd2 ) gν ∼ G(αg1 , γg2 ) + G(αg2 , γg2 ), where G is the gamma distribution– the definition of which, for clarity, is included below: G(θ|α, β) =
β α θα−1 exp{−θα}, Γ(α)
θ > 0.
Hierarchical mixture-priors on d and g can express our prior belief that the global covariance structure is non-stationary. Otherwise, non-informative G(ε, ε), ε small, can be used. Below, we shall refer to parameters to such hierarchical priors as γ. Two components are intended to represent the “more interesting” (rapidly changing) and less interesting (flat) regions of the space. Finally, priors need to be placed on γ. For j ∈ {1, 2}: αdj ∼ G(1, λαdj ) αgj ∼ G(1, λαgj )
γdj ∼ G(1, λγdj ) γgj ∼ G(1, λγgj )
(4) (5)
Finally, constants µ, B, V, ρ, α0 , q0 , λ∗ , p0 are treated as known. Below we shall use αd as a shorthand for {αd1 , αd2 }; similarly for αg , γd , and γg .
3.2
Prediction
Prediction under the above GP model is straightforward (Hjort & Omre, 1994). The predicted value of t at x is normally distributed with mean yˆ(x) = E(t(z)| data, x ∈ Dν ) = f > (x)β ν + kν (x)> K−1 ν (tν − Fν β ν ), 1 We
omit the dependence on T .
5
(6)
and variance σ ˆ (x)2 = Var(z(x)| data, x ∈ Dν ) −1 = σν2 [κ(x, x) − q> ν (x)Cν qν (x)],
(7)
where > −1 C−1 ν = (Kν + Fν WFν )
qν (x) = kν (x) + Fν Wν f (x)
>
κ(x, y) = Kν (x, y) + f (x)Wf (x)
with f > (x) = (1, x> ), and kν (x) is a nν −vector with kν,j (x) = Kν (x, xj ), for all xj ∈ Xν . One has to be careful when using the above kriging equations with the definition of the correlation matrix K as given in (3). In particular, the nugget term only applies when computing the correlation between a data location and itself. It does not apply for duplicate locations with the same coordinates. For more details see Appendix A.2.
3.3
Estimating the model parameters
The data Dν = {X, t}ν are used to estimate the parameters θ ν ≡ {βν , σν2 , dν , gν }, for ν = 1, . . . , R. Parameters to the hierarchical priors (θ0 = {W, β0 , γ}) depend only on {θν }R ν=1 . Conditional on the tree SR T , we write the full set of parameters as θ = θ0 ∪ ν=1 θν . Samples from the posterior distribution of θ are gathered using Markov chain Monte Carlo (MCMC) (Gelman et al., 1995). Sampling requires that we find the complete conditional distributions for each of the parameters. Some of them ({d, g, σ 2 }ν ) are sampled more efficiently if we partially marginalize their full conditionals, as we can analytically integrate out dependence on many of the other parameters. First we list the full conditionals for the parameters associated with the linear trend. Since we use conjugate priors, these can be sampled using Gibbs steps. Below we simply state the results derived for the parameters in (2). The full derivations are included in Appendix B.1. ˜ , σ2 V ˜ ) β |rest ∼ N (β ν
ν
ν
βν
where ˜ = V ˜ (F> K−1 tν + W−1 β ) β ν 0 ν ν βν
−1 −1 −1 Vβ˜ν = (F> ) , ν Kν Fν + W
(8)
and ˜ ,V ˜ ) β0 |rest ∼ N (β 0 β0 where Vβ˜0 =
B
−1
+W
−1
r X i=0
σν−2
!−1
˜ = V˜ β 0 β0
B
−1
µ+W
−1
r X i=1
β ν σν−2
!
(9)
and W−1 |rest ∼ W ρV+VTˆ , ρ + r where VTˆ =
r X 1 (βν − β0 )(βν − β0 )> . 2 σ ν i=1
6
(10)
Analytically integrating out β and σ 2 gives a marginal posterior for dν and gν , which can then be used for more efficient MCMC. As before, we simply quote the results here, and the details are left to Appendix B.2. p(dν ,gν |t, β0 , W) = |Vβ˜ν | (2π)nν |Kν ||W|
! 21
α0 /2
(q0 /2)
(α0 +nν )/2
[(q0 + ψν )/2]
Γ [(α0 + nν )/2] p(dν , gν ), Γ [α0 /2]
(11)
> −1 −1 ˜ > V−1 β ˜ . Eq. (11) can be used to iteratively obtain draws for d and where ψν = t> β0 − β ν Kν tν + β 0 W ν β˜ν ν g. For example, when conditioning on gν , the Metropolis-Hastings (MH) acceptance ratio for dν is
q(dν ) p(d∗ν |t, β 0 , W, gν ) p(d∗ν |αd , γd ) q(d∗ν ) p(dν |t, β 0 , W, gν ) p(dν |αd , γd ) where q(dν ) = q(dν |d∗ν ) and q(d∗ν ) = q(d∗ν |dν ) are the backward and forward proposal probabilities the old width parameter given the new width parameters, and vice versa, respectively. An analogous ratio exists for gν . The conditional distribution of σν2 with β ν integrated out is σν2 |dν , g, β0 , W ∼ IG((α0 + nν )/2, (q0 + ψν )/2)
(12)
which allows Gibbs sampling. The full derivation of (12) is also included in Appendix B.2. Hyperparameters for the priors of dν and gν , Eqs. (4) and (5) if desired, would require MH draws.
3.4
Tree Structure
Integrating out dependence on the tree structure T is accomplished by reversible-jump MCMC (RJ-MCMC) (Richardson & Green, 1997). We implement the tree operations grow, prune, change, and swap similar to those in Chipman et al. (1998). Tree proposals can change the size of the parameter space (θ). To keep things simple, proposals for new parameters— via an increase in the number of partitions R— are drawn from their priors, thus eliminating the Jacobian term usually present in RJ-MCMC. New splits are chosen uniformly from the set of marginalized input locations X. Swap and change tree operations are straightforward because the number of partitions (and thus parameters) stays the same. In a change operation we propose moving an existing split-point {u, s}, to either the next greater or lesser value of s (s+ or s− ) along the uth dimension of (marginalized) locations from dR/2e X. This is accomplished by sampling s0 uniformly from the set {uν , sν }ν=1 × {+, −}. Parameters θr in regions below the split-point {u, s0 } are held fixed. Uniform proposals and priors on split-points cause the MH acceptance ratio for change to reduce to a simple likelihood ratio. The swap operation is similar, however we slightly augment the one described in Chipman et al. (1998). Swaps proposed on parent-child internal nodes which split on the same variable are always rejected because a child region below both parents becomes empty after the operation. Figure 2 gives an illustration. However, if instead a rotate operation from Binary Search Trees (BSTs) is performed, the the proposal will almost always accept. Rotations are a way of adjusting the configuration (and thus height) of a BST without violating the BST property. Red-Black Trees make extensive use of rotate operations (Cormen et al., 1990). In the context of a Bayesian MCMC tree proposal, rotations encourage better mixing of the Markov chain by providing a more dynamic set of candidate nodes for pruning, thereby helping it escape local minima. Figure 3 shows an example of a successful (right) rotation where the swap of Figure 2 failed. Since the partitions at the leaves remain unchanged, the likelihood ratio of a proposed rotate is always 1. The only “active” part of the MH acceptance ratio is the prior on tree T , preferring trees of minimal depth. Still, calculating the acceptance ratio for a rotate is non-trivial because the depth two of the sub-tress change. Sub-trees T1 and T3 of Figure 3 change depth, either increasing or decreasing respectively, depending on the direction of the rotation. In a right-rotate, nodes in T1 decrease in depth, while those in T3 increase. The 7
{1, 5}
{1, 3}
T:
T swapped: X[:, 1] < 5
X[:, 1] ≥ 5
X[:, 1] < 3
swap
swap {1, 3}
T3
{1, 5}
X[:, 1] ≥ 3
X[:, 1] < 3
T1
X[:, 1] ≥ 3
T2
X[:, 1] < 5
T2
X[:, 1] ≥ 5
∅
T1
Figure 2: Swapping on the same variable is always rejected because one of the leaves corresponds to an empty region. T1 , T2 , T3 are arbitrary sub-trees (could be leaves).
{1, 5}
{1, 3}
T rotated:
T: X[:, 1] < 5
X[:, 1] ≥ 5
X[:, 1] < 3
T1
X[:, 1] ≥ 3
rotate (right)
rotate {1, 3}
X[:, 1] < 3
T3
{1, 5}
T1
X[:, 1] < 5
X[:, 1] ≥ 3
T2
T2
X[:, 1] ≥ 5
T3
Figure 3: Rotating on the same variable is almost always accepted. T1 , T2 , T3 are arbitrary sub-trees (could be leaves). opposite is true for left-rotation. If I = {Ii , I` } is the set of nodes (internals and leaves) of T1 and T3 , before rotation, which increase in depth after rotation, and D = {Di , D` } are those (internals and leaves) which
8
decrease in depth, then the MH acceptance ratio for a rotate is p(T1∗ )p(T3∗ ) p(T ∗ ) = p(T ) p(T1 )p(T3 ) Q Q Q Q −b −b −b −b η∈D` [1 − aqη ] η∈Di aqη η∈I` [1 − a(2 + qη ) ] η∈Ii a(2 + qη ) Q Q Q . = Q −b −b −b −b η∈D` [1 − a(1 + qη ) ] η∈Di a(1 + qη ) η∈I` [1 − a(1 + qη ) ] η∈Ii a(1 + qη )
(13)
The MH acceptance ratio for a right-rotate is analogous. Grow and prune operations are more complex because they add or remove partitions, and thus cause a change in the dimension of the parameter space. The first step for either operation is to select a leaf node (for grow), or the parent of a pair of leaf nodes (for prune). We choose the node uniformly from the set of legal candidates. When a new region r is added, new parameters {d, g}r must be proposed, and when a region is taken away the parameters must be absorbed by the parent region, or discarded. When evaluating the MH acceptance ratio for either operation we marginalize over the {β, σ 2 }r parameters. One of the newly grown children is selected (uniformly) to receive the d and g parameters of its parent. To ensure that the resulting Markov chain is ergodic and reversible, the other new sibling draws its d and g parameters from their priors. Symmetrically, prune operations randomly select parameters d and g for the consolidated node from one of the children being absorbed. If the grow or prune operation is accepted, σr2 can next be drawn from its marginal posterior (with βr integrated out) after which draws for β r and the other parameters for the rth region can then proceed as usual. Let {X, t} be the data at the new parent node η at depth qη , and {X1 , t1 } and {X2 , t2 } be the new child data (both at depth qη + 1) created by the new split {u, s}. Also, let P be the set of pruneable nodes of T , and G the number of growable nodes respectively. The Metropolis-Hastings acceptance ratio for grow is: |P| + 1 a(1 + qη )−b (1 − a(2 + qη )−b )2 p(d1 , g1 |t1 , β 0 , W)p(d2 , g2 |t2 , β0 , W) × × . |G| 1 − a(1 + qη )−b p(d, g|t, β 0 , W) The prune operation is analogous: |G| + 1 p(d, g|Z, β0 , T) 1 − a(1 + dη )−b × × . |P| p(d1 , g1 |Z, β 0 , T)p(d2 , g2 |Z2 , β0 , T) (1 − a(2 + dη )−b )2 a(1 + dη )−b
4
Adaptive Sampling
Much of the current work in large-scale computer models starts by evaluating the model over a complete grid of points. After the full grid has been run, a human may identify interesting regions and perform additional runs if desired. In this section we discuss improvements to this approach. First is a quick review of active learning, the subtopic of Machine Learning research into which adaptive sampling falls. Then, onto adaptive sampling together with a short treatment of a basic idea from the experimental design literature, Latin hypercube sampling (Box et al., 1978; Fisher, 1935).
4.1
Active Learning
Informally, active learning is the process of selecting design sites or configurations with the goal of extracting the maximum information at the cost of increasing the size of the data set. In the Machine Learning literature (Fine, 1999; Angluin, 1998; Fine et al., 2000; Atlas et al., 1990), active learning, or equivalently query learning or selective sampling, refers to the situation where a learning algorithm has some (perhaps limited) control over the inputs it trains on. Active learning is a relatively new paradigm that is finding utility in many applications. For example, it is currently being used to aid in computational drug design/discovery by helping to find compounds that are active against a biological target (Warmuth et al., 2001; Warmuth et al., 2003). To the best of my knowledge, ours is the first application of active learning that uses nonstationary modeling to help select small designs. 9
Non-stationary models like the treed kriging model presented in Section 3 fit independent stationary models in different regions of the input space. Thus the uncertainty in the model, and in the uncertainty in the response can vary over the input space. Reduction in model uncertainty can almost surely be obtained if future input samples are chosen wisely. Active learning in the context of estimating response surfaces is what we are calling adaptive sampling. Supposing it was possible to start with relatively small spacefilling “peppering” of input data, adaptive sampling proceeds by fitting a model, estimating predictive uncertainty, and then choosing future data samples where a known response would cause the largest reduction in uncertainty. The process repeats until a desired threshold in predictive uncertainty is met. In this iterative fashion the model adapts to the data, and the (new) data either reinforces or suggests a modification to the (old) model.
4.2
Choosing new samples
The current thrust of our research has been in designing, developing, implementing, and debugging the appropriate non-stationary model, which we believe is the Bayesian treed kriging model described in Section 3. Less time, unfortunately, has been allotted to finding the best adaptive sampling scheme. Nevertheless, we have come up with a simple heuristic that works quite well. In Section 6.2 we outline our future plans for adaptive sampling. Having described the predictive algorithm used to model P (t|x), we now consider how to choose new sampling locations based on this distribution. Two criteria have been previously proposed. The simplest ˜ choice is to maximize the information gained about model parameters {θ, T } by selecting the location x which has the greatest standard deviation in predicted output (Mackay, 1992). Given its simplicity this ˜ minimizing the resulting expected is the method we explored first. An alternative measure is to select x squared error averaged over the input space (Cohn, 1996). This approach is being explored as part of our current and proposed work (Section 6.2.2). A comparison between these two methods using standard GPs appears in (Seo et al., 2000). To further improve our adaptive sampling we shall exploit Latin hypercube (LH) designs (Box et al., 1978). LHs systematically choose points that are spread out, taking on values throughout the region, but in different combinations across dimensions, thereby obtaining nearly full coverage with fewer points than a full gridding. To create a LH (McKay et al., 1979) with n samples in a mX -dimensional space, one starts with an nmX grid over the search space. For each row in the first dimension of the grid, a row in each other dimension is chosen randomly without replacement, so that exactly one sample point appears in each row for each dimension. Within these chosen grid cells, the actual sample point is typically chosen randomly. In one dimension a LH design is equivalent to a complete grid, but as the number of dimensions grows, the number of points in the LH design stays constant, and the computational savings grow exponentially with mX . As mentioned above, two key ingredients are necessary for adaptive sampling: 1. A spatially dependent (or input dependent) metric of model uncertainty. 2. Based on the metric, a method of estimating which new spatial location, if added into the data set, will yield the largest decrease in model uncertainty. A side effect of fitting our Bayesian treed kriging model with MCMC is that we get samples of the predictive distribution at new design points. These samples provide a rather convenient metric of model uncertainty; namely the width (or norm) of their outer (5% and 95%) predictive quantiles. Our heuristic for choosing the next adaptive sample involves predicting at a grid of new design points (or better yet, a LH sample), assessing the model uncertainty at each new location, and choosing one (or some) of those points to be added into the data set. We came up with several possible ways of accomplishing this: • sampling probabilistically: treating the difference in quantiles as a discrete distribution and choosing randomly. • taking the maximum: choosing the design locations with the highest quantile based error. 10
• A natural combination of the above two heuristics is to exponentiate the quantile-based discrete distribution (mentioned in the first bullet). This will create a distribution which accentuates higher mass locations, and tones down the low ones. In the limit, the result is that only the max has any mass.
4.3
Waiting for requested responses
When the response at a new design location is the the output of a complicated and computationally intensive computer simulation, the result may not be available immediately. However, we might want the adaptive sampling scheme to continue to select new points without waiting for the requested response. For example, in a parallel computing environment, it may be just as easy to request ten responses as it is to demand one evaluation of the computer code. These codes usually run iteratively until some convergence threshold is met. This means that all requested responses may not be available simultaneously. Some sort of place-holder for pending requests is needed so that multiple processors are not all working on the same (or nearby) design points simultaneously. One possibility is to estimate the response with its predicted mean as a surrogate, until the result is available.
5
Results and Discussion
In this section we demonstrate an adaptive sampling scheme based on the Bayesian treed GP model of Section 3. Given N previous samples and their responses we use the model and its predictive quantiles to select a new location at which to request a response. For all experiments herein this is accomplished as follows: 15,000 MCMC rounds are performed, in which the parameters θ|T are updated. Every fourth round we also update the tree structure (T ) by drawing probabilistically from the discrete distribution {2/5, 1/5, 1/5, 1/5}, and attempting a {change, grow, prune, swap} operation accordingly. The first 5,000 of the 15,000 rounds are treated as burn-in, after which predictions are made using the parameters sampled during the remaining 10,000 rounds. Suppose that there are currently N locations xi for which we have a response ti .2 An initial LH sample of size N0 is used to get things started. At the beginning of the MCMC rounds we lay down a LH sample of N 0 new locations on which to predict. Quantiles (95th and 5th) are computed at each of the N 0 predictive locations. Based on their difference, one is selected. Every third adaptive sample is chosen probabilistically, treating the quantiles as a discrete distribution, while the rest are chosen by taking the maximum. Probabilistic samples are taken for robustness, as the maximum is only the optimal choice when the model is specified completely correctly. Finally, a response is elicited at the chosen input location, and the pair is then added into the data. The process is repeated, the model re-fit, and another adaptive sample is chosen from a new set of N 0 LH samples.
5.1
Synthetic Data
The following (highly non-stationary) synthetic data sets help demonstrate a proof of concept. 5.1.1
1-d Sinusoidal data set:
Our first example is a simulated data set on the input space [0, 60]. The true response is (Higdon, 2002): 1 πx 4πx t(x) = sin + cos θ(x − 35.75) (14) 5 5 5 where θ is the step function defined by θ(x) = 1 if x > 0 and θ(x) = 0 otherwise. Zero mean Gaussian noise with sd = 0.1 is added to the response. This data set typifies the type of non-stationary response surface that our model was designed to exploit. Higdon et al. used a smaller input domain and did not include a region where the response is flat. 2 We
first translate and re-scale the data so that it lies on the unit cube in N (y)CN qN (y)]
σ ˆy2 (y) = σ 2 [κ(y, y) − qN +1 (y)> C−1 N +1 qN +1 (y)], using the notation for the predictive variance given in (7). By using the partition inverse equations (Barnett, 1979) for a covariance matrix CN +1 in terms of CN , we obtained a nice expression for ∆σy2 (x): ∆ˆ σy2 (x) =
2 −1 σ 2 q> N (y)CN qN (x) − κ(x, y) −1 κ(x, x) − q> N (x)CN qN (x)
.
The details of this derivation is included in Appendix C. Rather than considering x and y on a grid, following the example of Seo et al. we are currently using a LH sample of predictive locations similar to the ALM-like technique described in Section 4. The reduction in predictive variance that would be obtained by adding x into the data set is calculated by averaging over y: 0
N 1 X ∆σ (x) = 0 ∆ˆ σy2 i (x) N i=1 2
One of the benefits of ALC is that ∆σ 2 (x) is easily approximated using MCMC methods. Applying ALC has yielded impressive preliminary results. In particular, adaptive samples are less heavily concentrated near the boundaries of the partitions. 21
As part of my future work I plan to further study approximations to optimal design techniques like the ALC and ALM algorithms. Also it will be interesting to try other methods of the choosing candidate locations. It seems more feasible to use the theory from S-DACE to solve this problem, rather than the fulladaptive sampling problem, and then use ALC and ALM-like algorithms to sub-sample from the candidates, and rank them for possible inclusion into the design. Until a better plan presents itself, this approach will be the main focus of our future work on adaptive sampling.
6.3
Implementation
Ultimately, the goal of this project is to develop a useful tool. We would like to give NASA a program that adaptively selects samples, controls a super-cluster in order to obtain responses where requested, and gracefully interacts with the scientist monitoring its progress. Our code has come a long way from the Matlab and R (R Development Core Team, 2003) prototypes that we developed a year ago. We have fast C code for implementing the Gaussian process models which employ ATLAS for optimized linear algebra, and C++ classes for the tree structure. Still, we are far from having a useful tool to give NASA. A major issue is that the model does not scale well as the data sets get large. Even with the fastest processors and the most optimized linear algebra libraries, inverting a 10000×10000 matrix is slow, especially if it needs to be done at least once every MCMC round, for 10000 rounds. We have some ideas for speeding things up a bit. For example, other authors have had success with kriging implementations that avoid directly inverting large matrices by employing iterative methods like the Conjugate Gradient algorithm (Nychka, 2003). Also, iterative techniques for Matrix inversion, like the SJM or Jacobi methods, converge quickly provided they are supplied with a good initial guess. When proposing new correlation matrices, we often have good guesses in the form of old matrices. Furthermore, it has been demonstrated that careful thresholding of the correlation function can often lead to useful approximate and sparse correlation matrices (Nychka et al., 2002). Fast sparse matrix multiplication routines could significantly speed up iterative inversion methods. A nice feature of our treed model is that it is highly parallelizeable. Conditional on the tree structure, each region is independent and so inference on the Gaussian process model each region can proceed in parallel. Moreover, conditional on the current estimates of model parameters, prediction is independent of inference. Of course, actually parallelizing the code is not a trivial task. Another possible way to speed things up is the idea from Section 6.1.2 of aggressive partitioning. Or, we can give up on being fully Bayesian. For example, we could fit a subset of the parameters using Maximum Likelihood methods. Rather than attempt to choose adaptive samples by averaging quantities over all MCMC predictive variances (like ∆σy2 (x) from Section 6.2.2), we could instead condition them on maximum a posteriori (MAP) estimates. A similar approach can be taken to applying techniques from the S-DACE literature (Section 6.2.1). A second shortcoming in the code is that it is not adequately equipped to control a supercluster. An interface between the adaptive sampler and the cluster of CFD solvers needs to be defined. In particular, we have not decided how the sampler should behave while it is waiting for elicited responses (see Section 4.3), or how it should respond when there are multiple idle solvers waiting for work. We have not studied how the lag in eliciting and obtaining responses from the cluster affects future adaptive samples. If we temporarily use predicted means in place of unfinished computations, how does that affect the model uncertainty nearby? Many of the above issues have the highest priority in our list of future work. It would be nice to have something working over at NASA as soon as possible so that we can better gauge the practicality of our approach, as well as the success of our models on real experiments.
6.4
Summarizing the dissertation plan.
The following topics are considered primary, and will mostly likely contribute significantly to the dissertation: multiple dimensional width parameters in the Power family covariance function, increased communication among partitions, co-kriging, the ALC algorithm, choosing candidates for ALC and ALM, and whatever implementation details are necessary to make a practical tool for NASA. The Mat`ern family, intricate 22
hierarchical non-stationary correlation functions, optimal S-DACE techniques, etc., are topics which require further investigation before it can be determined what part they will play in our future research plans.
23
A
Thoughts on the nugget
A.1
Two models with measurement error
Following the development in Hjort & Omre (Hjort & Omre, 1994), a Gaussian process is often written as Z(X) = m(X, β) + ε(X).
(20)
The mean function m(X, β) is often taken to be linear in X: m(X, β) = F β. β are coefficient parameters and F = (1, X> )> . The process variance, governed by ε(X) is such that cov(Z(X), Z(X0 )) = cov(ε(X), ε(X)) = σ 2 K(X, X0 ), where K is a correlation function depicting the smoothness of the process. K(x,x’)
1.0
0.0
||x − x’||
Figure 13: Graphical depiction of the correlation function (22). Accordingly, observations zi × xi for i = 1, . . . , n are said to form a Gaussian process if they satisfy (Z1 , . . . , Zn )> ∼ Nn [Fβ, σ 2 K],
(21)
where correlation matrix K is usually constructed using a one of a family of parameterized correlation functions, such as the Gaussian (or Exponential) family: ||xi − xj ||2 K(xi , xj |d) = exp − (22) d Such correlation matrices should be positive definite with all entries less than or equal to one. Figure 13 shows a cartoon of how correlation, like that described by (22), decays as the distance ||xi − xj || increases. Relations for interpolation in terms predicted mean and errors can be obtained using multivariate normal theory (see Hjort & Omre). This kind of spatial interpolation is commonly called Kriging. It easily seen that ˆ i ) = zi , using a parameterized correlation function like the one in (22) results in a predictive mean of Z(x and error ˆ i ) − Z(xi )]2 = 0 σ ˆ 2 (xi ) = E[Z(x when xi corresponds to any of the input data locations x1 , . . . , xn . For x 6= xi , we have that σ ˆ 2 (x) > 0, increasing as the distance from x to closest xi gets large. An example interpolation is shown in Figure 14. However, if the modeler believes that the observations are subject to measurement error, then interpolation might not be the only goal. The goal may be to smooth the data rather than “connect the dots”. Figure 15 shows what a possible smoothing of the data presented in Figure 14 might look like. 24
Z
X
Figure 14: Data interpolated by Kriging, using a correlation function like that in (22) with a model like that in (20). Z
X
Figure 15: A smooth alternative to interpolation of the data presented in Figure 14
To smooth the data the model (20) must be augmented to include an additional variance term account for “measurement error”. However, this is not the approach most commonly taken in the Geostatistical community. They choose instead to add a so-called nugget term (η) directly into the definition of the correlation function (22), leaving the underlying model formulation (20) unchanged: ||xi − xj ||2 K(xi , xj |d, η) = exp − + ηI{i=j} , (23) d where I{·} is the boolean indicator function. Note that the matrix K resulting from (23) is no longer a correlation matrix (in the strictest sense) because its diagonal may have entries which are greater than one. Figure 16 shows the resulting (dis-continuous) exponential correlation function graphically. To my knowledge, the parameter η does not have a straightforward statistical interpretation. In fact, several authors advise against using this approach for this very reason. However, (23) gives that K(x, x|d, η) = 1 + η, which makes the prediction error non-zero for data locations xi (see Hjort & Omre): σ ˆ 2 (xi ) = σ 2 η,
and
ˆ i ) 6= zi Z(x
25
(unless η = 0),
K(x,x’)
1.0 + ETA
1.0
0.0
||x − x’||
Figure 16: Graphical depiction of the (dis-continuous) correlation function (23) with nugget.
provided that one is careful about the bookkeeping for the standard Kriging equations. See the following section for a note about constructing covariances carefully for predictive locations. Thus, the nugget does accomplish the goal of smoothing the data rather than interpolating. Predictive means and error-bars look similar to those drawn in Figure 15, although uncertainty is usually somewhat lower near observed data locations. The correct way to account for measurement error is to augment the model (20): Z(X) = m(X, β) + ε(X) + η(x),
(24)
where m and ε are as before, and η(x) is an independent zero-mean noise process, usually Gaussian. Given observations zi × xi for i = 1, . . . , n, the corresponding Gaussian process can be written as a sum of independent normals: (Z1 , . . . , Zm )> ∼ Nn [Fβ, σ 2 K] + Nn [0, τ 2 I] 2
2
0
∼ Nn [Fβ, (σ + τ )K ]
(25) (26)
where K0 is a (true) correlation matrix defined in terms of K(x, x0 |d) from (22) by K 0 (xi , xj |d, σ 2 , τ 2 ) =
σ2 (K(x, x0 |d) + τ 2 I), σ2 + τ 2
(27)
or, equivalently, we could have written the following formula for the correlation function K 0 (xi , xj |d, σ 2 , τ 2 ) =
σ2 2 (K(x, x0 ) + τI{i=j} ), σ2 + τ 2
(28)
which is essentially the a scaled version of (23), thus highlighting the equivalence using the model in (20) with a correlation that includes a nugget term (the nugget model), and a model like that in (24) that includes an explicit noise parameter. The main difference is that now all three ingredients (σ 2 , τ 2 , and K0 ) have meaningful statistical interpretations. Despite its less than satisfactory interpretability or statistical meaning, many authors have likely chosen the nugget model because its parameters are easy to estimate using Maximum Likelihood and Monte-Carlo based methods. See Cressie (Cressie, 1991) for arguments both for and against the use of the nugget approach in light of an available (but perhaps more complicated) more statistically meaningful model, particularly in the case of estimation of the semi-variogram and or correlogram. There are some advantages of the using the nugget approach in our current research (“Bayesian adaptive sampling using Gaussian process trees”) which is why we prefer this method. Since σ 2 is, in a sense, decoupled from the nugget it is possible to obtain Gibbs draws for σ 2 which would not be possible under (24) 26
using (26). Moreover, the simplified structure allows us to integrate out dependence on β in the conditional posterior for σ 2 , and further integrate out dependence on both β and σ 2 in each of the respective posteriors for d and η. These results, which would not be possible without the nugget model, are helpful in two ways: 1. Marginalizing out parameters from full conditionals is always desirable when using MCMC techniques for fitting the model. Doing so has the effect of lessening the correlation between the chains for the parameters, leading to faster convergence, and thus “better” draws from the posterior. 2. Treed Gaussian processes involve proposing tree operations that grow (and prune, move, or swap) partitions (leaves) from the tree. As part of Metropolis Hastings steps, likelihood ratios need to be computed in order to accept (or reject) a proposed tree operation. Having an expression the likelihood which does not depend on σ 2 and β (because they have been integrated out) is desirable for the reasons stated above, as well as for computational reasons, not to mention parsimony.
A.2
Careful bookkeeping when predicting with the nugget model.
One has to be a little bit careful when applying the kriging equations (like those in Hjort & Omre) in when using the nugget model formulation mentioned above. To help illustrate, below we will re-write the prediction equations that we use: The predicted value of z(x) at x is normally distributed with mean and variance zˆ(x) = f > (x)β + k(x)> K−1 (t − Fβ), σ ˆ (x)2 = σ 2 [κ(x, x) − q> (x)C−1 q(x)], where C−1 = (K + FWF> )−1 , q(x) = k(x) + FWf (x), f > (x) = (1, x> ), κ(x, y) = K(x, y) + f > (x)Wf (x), and k(x) is a n−vector with kj (x) = K(x, xj ), for all xj ∈ X. Here, our focus is mainly on the definitions of k(x), κ(x, y), and K(x, y) which are measurements of the correlation of a predictive location x and other locations y. Remember that the covariance matrix K is constructed using the definition for K(·, ·) from (23) giving correlations between the data locations xi ∈ X, and results in a covariance matrix K which has 1 + η along the diagonal. Notice that according to the (23) K(xi , xj ) = 1 + η when i = j. but when i 6= j we have that K(xi , xj ) ≤ 1 even in the case where xi = xj whence K(xi , xj ) = 1. Therefore, when computing k(x), κ(x, y) one has to be careful to make the distinction between the covariance between a point and itself, versus the covariance between multiple points with the same configurations (because they are different). For example, if one is considering a new set of predictive locations, yi ∈ Y, then k(yi ) has entries less than or equal to 1 (no nugget), with equality only when yi = xj for some xj ∈ X. Alternatively, the correlation matrix between pairs of predictive locations from Y satisfies the same properties as that of K, the correlation matrix of for the data locations (X).
27
B B.1
Estimating Parameters: Details Full Conditionals
β: p(βν |rest) ∝ p(Zν |βν , σν2 , dν , gν )p(β ν |β 0 , σν2 , W) = N (Zν |Fν β ν , σν2 Kν ) · N (β ν |β 0 , σν2 W) 1 0 −1 0 −1 ∝ exp − 2 (Zν − Fν β ν ) Kν (Zν − Fν β ν ) + (β ν − β0 ) W (β ν − β0 ) 2σν 1 0 0 0 0 0 −1 −1 −1 −1 ∝ exp − 2 −2Zν Kν Fν β ν + βν Fν Kν Fν β ν + βν W βν − 2βν W β0 2σν 1 0 0 −1 0 −1 0 −1 −1 = exp − 2 β ν (Fν Kν Fν + W )β ν − 2β ν (Fν Kν Zν + W β0 ) 2σν
giving
˜ , σ2 V ˜ ) βν |rest ∼ N (β ν ν βν
(29)
where ˜ = V ˜ (F0 K−1 Zν + W−1 β ), β ν ν ν 0 βν
−1 −1 Vβ˜ν = (F0ν K−1 ) ν Fν + W
which can be sampled using Gibbs. β0: p(β 0 |rest) = p(β|β0 , σ 2 , W)p(β 0 ) r Y = p(β 0 ) p(β ν |β 0 , σν2 , W) i=1
= N (β 0 |µ, B)
r Y
N (β ν |β 0 , σν2 W)
i=1
Y r 1 1 0 −1 0 −1 exp − (β − β0 ) W (β ν − β 0 ) ∝ exp − (β 0 − µ) B (β 0 − µ) 2 2σν ν i=1 ( " #) r X 1 1 0 −1 0 −1 = exp − (β 0 − µ) B (β 0 − µ) + (β − β 0 ) W (β ν − β0 ) 2 σ2 ν i=1 ν " #) ( r r X X β 1 0 −1 0 −1 0 0 ν σν−2 β 0 − 2β0 W−1 ∝ exp − β0 B β0 − 2β0 B µ + β0 W−1 2 2 σ ν i=1 i=1 ( " ! !#) r r X X 1 βν 0 0 −1 −1 −2 −1 −1 ∝ exp − β0 B + W σν β0 − 2β0 B µ + W 2 σ2 i=0 i=1 ν giving ˜ ,V ˜ ) β0 |rest ∼ N (β 0 β0
(30)
where Vβ˜0 =
B
−1
+W
−1
r X i=0
σν−2
!−1
˜ = V˜ β 0 β0
B
−1
µ+W
−1
r X i=1
28
βν σν−2
!
.
which can be sampled using Gibbs.
T−1 : p(W−1 |rest) = p(W)p(β|β0 , σ 2 , W) = W (W−1 |(ρV)−1 , ρ) ·
r Y
N (βν |β0 , σν W)
i=1
1 −1 ∝ |W | exp − tr((ρV)W ) × 2 ( ) r X 1 1 |W−1 |r/2 exp − (βν − β0 )0 W−1 (βν − β0 )} 2 i=1 σν2 −1 (ρ−m−1)/2
= |W−1 |(ρ+r−m−1)/2 × !#) " ( r X 1 1 0 −1 −1 (βν − β0 ) W (βν − β0 ) exp − tr((ρV)W ) + tr 2 σ2 i=1 ν
obtained because a scalar is equal to its trace. Applying more properties of the trace operation gives p(W
−1
|rest) ∝ |W
−1
|
ρ+r−k−1 2
r X (βν − β0 )(βν − β0 )0 ρV+ σν2 i=1
" 1 exp − tr 2 (
!
W
−1
!#)
which means W
−1
|rest ∼ W
r X 1 (βν − β0 )(βν − β0 )0 , ρ + r ρV+ 2 σ i=1 ν
!
(31)
and that Gibbs sampling is appropriate.
B.2
Marginalized Conditional Posteriors
We are interested in taking the complete conditional posteriors for the range dν and the nugget gν of each partition, and analytically integrating out β and σ 2 to get a marginal posterior, which can then be used for
29
more efficient MCMC. X1 × Z1 , . . . , X × Zr . Y p(dν , gν |Zν , β 0 , W) p(d, g|Z, β 0 , W) = ν
=
YZ Z
p(dν , gν , β ν , σν2 |Zν , β 0 , W) dβ ν dσν2
ν
∝
YZ Z
p(Zν |dν , gν , βν , σν2 )p(dν , gν , β ν , σν2 |β0 , W) dβν dσν2
ν
=
Y
=
Y
p(dν , gν )
Z
p(σν2 )
Z
p(Zν |dν , gν , β ν , σν2 )p(β ν |σν2 , β0 , W) dβ ν dσν2
p(dν , gν )
Z
p(σν2 )
Z
(2π)−
ν
ν
nν 2
m
1
1
σν−nν |Kν |− 2 × (2π)− 2 σν−m |W|− 2
1 × (2π)m/2 σνm |Vβ˜ν | 2 × N (βν |β˜ν , σν2 Vβ˜ν ) i 0 1 h 0 −1 0 −1 ˜ −1 ˜ dβ dσ 2 . × exp − 2 Zν K Zν + β 0 W β 0 − βν Vβ˜ β ν ) ν 2σν
Everything in the above inner integrand except the Normal Density (which integrates to 1) does not depend on β ν . Thus we have p(d, g|Z, β 0 , W) =
Y
|Vβ˜ν |
p(dν , gν ) ×
(2π)nν |Kν ||W|
ν
! 21 Z
σν−nν p(σν2 ) exp
ψν − 2 dσν2 , 2σν
˜ 0 V−1 β ˜ . Expanding the prior for σ 2 gives: where ψν = Z0ν K−1 Zν + β 00 W−1 β 0 − β ν β˜ ν ν ν
p(d, g|Z, β 0 , W) =
Y
|Vβ˜ν |
p(dν , gν ) ×
ν
×
=
Y
Z
nν (σν2 )− 2
(2π)nν |Kν ||W| α0 q0 2
Γ( α20 )
p(dν , gν ) ×
ν
×
q0 2
α20
Γ( α20 )
2
×
(σν2 )−(
α0 2
+1)
|Vβ˜ν | (2π)nν |Kν ||W| ν Γ( α0 +n ) 2 ν ( q0 +ψ 2
α0 +nν 2
) q0 + ψν × exp − dσν2 , 2σν2
! 21
Z
q0 ψν exp − 2 exp − 2 dσν2 2σν 2σν
! 21
α0 +nν
ν ( q0 +ψ ) 2 2 α0 +nν Γ( 2 )
(σν2 )−(
α0 +nν 2
+1)
since the integrand above is really IG((α0 + 2)/2, (q0 + ψν )/2) the integral evaluates to 1, and we have: p(d, g|Z, β 0 , W) =
Y ν
p(dν , gν ) ×
|Vβ˜ν | (2π)nν |Kν ||W|
! 21
×
q0 2 q0 +ψν 2
α20
× ν α0 +n 2
Γ
α0 +nν 2 Γ α20
.
(32)
Equation (32) can be used in place of the likelihood of the data conditional on all parameters. It can be thought of as a likelihood of the data, conditional on only the range parameter d, and the nugget g. Moreover, (32) can be used to iteratively obtain draws for d and g with the appropriate conditioning. 30
Using the same ideas we can complete conditional of σν2 with β ν integrated out, which strangely enough involves the same qν quantity: Z 2 p(σν |Zν , dν , gν , β0 , W) = p(β ν , σν2 |Zν , dν , gν , β0 , W) dβ ν Z = p(σν2 ) p(Zν |dν , gν , β ν , σν2 )p(β ν |σν2 , β0 , W) dβ ν |Vβ˜ν |
! 21
ψν −nν 2 σ p(σ ) exp − ν ν (2π)nν |Kν ||W| 2σν2 ψν q0 =∝ σν−nν (σν2 )−(α0 /2+1) exp − 2 exp − 2 2σν 2σν q + ψ 0 ν , = (σν2 )−((α0 +nν )/2+1) exp − 2σν2 =
which means that σν2 |d, g, β0 , W ∼ IG((α0 + nν )/2, (q0 + ψν )/2).
(33)
In addition to improving mixing, (33) will be useful for obtaining Gibbs draws for σ 2 , after an accepted grow or prune tree operations, when β may not be available.
31
C
Using ALC sampling with a hierarchical GP
We would like to use the partition inverse equations (Barnett, 1979) for a covariance matrix CN +1 in terms −1 of CN , so that we can obtain an equation for C−1 N +1 in terms of CN : CN +1 =
CN m>
m κ
C−1 N +1
=
> −1 [C−1 ] g N + gg µ > g µ
(34)
where m = [C(x1 , x), . . . , C(xN , x)], κ = C(x, x), for an N + 1st point x where C(·, ·) is the covariance function, and g = −µC−1 N m
−1 µ = (κ − m> C−1 . N m)
−1 If C−1 N is available, these partitioned inverse equations allow one to compute CN +1 , without explicitly −1 constructing CN +1 . Moreover, the partitioned inverse can be used to compute CN +1 with time in O(n2 ) rather than the usual O(n3 ). Using our notation for a hierarchically specified Gaussian Process, in the context of ALC sampling, we wish to invert the matrix, KN +1 + F> N +1 WFN +1
which is key to the computation of the predictive variance σ ˆ (x)2 . We have that KN kN (x) FN WF> FN Wf (x) > N KN +1 + FN +1 WFN +1 = + k> f (x)> WF> f (x)> Wf (x) N (x) K(x, x) N kN (x) + FN Wf (x) KN + FN WF> N . = > > K(x, x) + f (x)> Wf (x) k> N (x) + f (x) WFN (*) Using the notation CN = KN + FN WF> N , qN (x) = kN (x) + FN Wf (x), and κ(x, y) = K(x, y) + f (x)> Wf (y) we get the following simplification: CN qN (x) CN +1 = KN +1 + F> WF = . N +1 N +1 qN (x)> κ(x, y) −1 : Applying the partitioned inverse equations (34) gives the following nice expression for (KN +1 +F> N +1 WFN +1 )
C−1 N +1
= (KN +1 +
−1 F> N +1 WFN +1 )
=
> −1 [C−1 ] g N + gg µ > g µ
(35)
where g = −µC−1 N qN (x)
−1 µ = (κ(x, x) − qN (x)> C−1 N qN (x))
using the most recent definitions of CN and κ(·, ·), see (*). From here we wish to obtain an expression for the key quantity of the ALC algorithm from Seo et. al. (Seo et al., 2000). Namely, the reduction in variance at a point y given that the location x is added into the data: ∆ˆ σy2 (x) = σ ˆy2 − σ ˆy2 (x), where −1 > σ ˆy2 = σ 2 [κ(y, y) − q> N (y)CN qN (y)]
and
σ ˆy2 (y) = σ 2 [κ(y, y) − qN +1 (y)> C−1 N +1 qN +1 (y)].
32
Now, −1 −1 2 > ∆ˆ σy2 (x) = σ 2 [κ(y, y) − q> N (y)CN qN (y)] − σ [κ(y, y) − qN +1 (y)CN +1 qN +1 (y)] −1 > = σ 2 [qN +1 (y)> C−1 N +1 qN +1 (y) − qN (y)CN qN (y)]. −1 Focusing on q> N +1 (y)CN +1 qN +1 (y), we first decompose qN +1 :
qN +1 = kN +1 (y) + FN +1 Wf (y) FN kN (y) Wf (y) + = K(y, x) f > (x) kN (y) + FN Wf (y) qN (y) = = . κ(x, y) K(y, x) + f > (x)Wf (y) Turning our attention to C−1 N +1 qn+1 (y), with the help of (35): C−1 N +1 qN +1 (y) =
> −1 [C−1 ] g N + gg µ > g µ
qN (y) κ(x, y)
=
> −1 [C−1 ]qN (y) + gκ(x, y)) N + gg µ > g qN (y) + µκ(x, y)]
=
qN (y) κ(x, y)
,
then −1 q> N +1 (y)CN +1 qN +1 (y)
>
> −1 (C−1 )qN (y) + gκ(x, y)) N + gg µ g> qN (y) + µκ(x, y)
−1 > −1 = q> )qN (y) + gκ(x, y)] N (y)[(CN + gg µ
+ κ(x, y)[g> qN (y) + µκ(x, y)]. Finally: −1 > ∆ˆ σy2 (x) = σ 2 [qN +1 (y)> C−1 N +1 qN +1 (y) − qN (y)CN qN (y)]. > −1 = σ 2 [q> qN (y) + 2κ(x, y)g> qN (y) + µκ(x, y)2 ] N (y)gg µ > −2 = σ 2 µ[q> qN (y) + 2µ−1 κ(x, y)g> qN (y) + κ(x, y)2 ] N (y)gg µ 2 −1 = σ 2 µ q> + κ(x, y) , N (y)gµ
and some minor re-arranging after plugging in for µ and g gives: ∆ˆ σy2 (x)
=
2 −1 σ 2 q> N (y)CN qN (x) − κ(x, y) −1 κ(x, x) − q> N (x)CN qN (x)
33
.
References Angluin, D. (1998). Atlas, L., Cohn, D., Ladner, R., El-Sharkawi, M., Marks, R., Aggoune, M., & Park, D. (1990). Training connectionist networks with queries and selective sampling. Advances in Neural Information Processing Systems, 566–753. Barnett, S. (1979). Matrix methods for engineers and scientists. McGraw-Hill. Bates, R. A., Buck, R. J., Riccomagno, E., & Wynn, H. P. (1996). Experimental design and observation for large systems. Journal of the Royal Statistical Society Series B., 58, 77–94. Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978). Statistics for experimenters. New York: Wiley. Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth. Brown, P. J., Le, N., & Zidek, J. (1994). Multivatiate spatial interpolation and exposure to air pollutants. Canadien Journal of Statistics, 22, 459–510. Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design, a review. Statistical Science, 10 No. 3, 273–1304. Chipman, H., George, E., & McCulloch, R. (1998). Bayesian CART model search (with discussion). Journal of the American Statistical Association, 93, 935–960. Cohn, D. A. (1996). Neural network exploration using optimal experimental design. Advances in Neural Information Processing Systems (pp. 679–686). Morgan Kaufmann Publishers. Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1990). Introduction to algorithms. The MIT Electrical Engineering and Computer Science Series. MIT Press/McGraw Hill. Cressie, N. A. (1991). Statistics for spatial data. John Wiley and Sons, Inc. Denison, D., Mallick, B., & Smith, A. (1998). A Bayesian CART algorithm. Biometrika, 85, 363–377. Draper, N. R., & Hunter, W. G. (1966). Design of experiments for parameter estimation in multi-response situations. Biometrika, 53, 525–533. Draper, N. R., & Hunter, W. G. (1967). The use of prior distributions in the design of experiments for parameter estimation in nonlinear situations: multivariate case. Biometrika, 54, 622–665. DuMouchel, W., & Jones, B. (1985). Model robust response surface designs: scaling two-level factorials. Biometrika, 72, 513–526. DuMouchel, W., & Jones, B. (1994). A simple Bayesian modification of D-optimal designs to reduce dependence on and assumed model. Technometrics, 36, 37–47. Fine, S. (1999). Knowledge acquisition in statistical learning theory. Fine, S., Gilad-Bachrach, R., & Shamir, E. (2000). Learning using query by committee, linear separation and random walks. Eurocolt ’99, 1572 of LNAI, 34–49. Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.
Higdon, D. (2002). Space and space-time modeling using process convolutions. Quantitative Methods for Current Environmental Issues (pp. 37–56). London: Springer-Verlag.
Hjort, N. L., & Omre, H. (1994). Topics in spatial statistics. Scandinavian Journal of Statistics, 21, 289–357.
Johns, C., Nychka, D., Kittel, T., & Daly, C. (2003). Infilling sparse records of spatial fields. JASA Applications.
Kim, H.-M., Mallick, B. K., & Holmes, C. C. (2002). Analyzing non-stationary spatial data using piecewise Gaussian processes (Technical Report). Texas A&M University – Corpus Christi.
Le, N., Sun, L., & Zidek, J. (1999). Bayesian spatial interpolation and backcasting using the Gaussian-generalized inverted Wishart model (Technical Report 185). University of British Columbia, Statistics Department.
Leon, A. C. P. D., & Atkinson, A. C. (1991). Optimum experimental design for discriminating between two rival models in the presence of prior information. Biometrika, 78, 601–608.
MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 589–603.
Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58, 1246–1266.
McKay, M. D., Conover, W. J., & Beckman, R. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21, 239–245.
Müller, P., & Parmigiani, G. (1996). Numerical evaluation of information theoretic measures. In D. A. Berry, K. M. Chaloner and J. F. Geweke (Eds.), Bayesian Analysis of Statistics and Econometrics: Essays in Honor of Arnold Zellner, 397–406. Wiley, New York.
Nychka, D. (2003). The matrix reloaded: computations for large spatial data sets. SAMSI/GSP Workshop on Spatiotemporal Statistics, Boulder, CO. http://www.cgd.ucar.edu/stats/pub/nychka/manuscripts/matrix.pdf.
Nychka, D., Wikle, C., & Royle, J. (2002). Multiresolution models for nonstationary spatial covariance functions. Statistical Modelling, 2, 215–332.
O'Hagan, A. (1985). Curve fitting and optimal design for prediction (with discussion). Journal of the Royal Statistical Society, Series B, 40, 1–41.
O'Hagan, A., Kennedy, M. C., & Oakley, J. E. (1999). Uncertainty analysis and other inference tools for complex computer codes. Bayesian Statistics 6 (pp. 503–524). Oxford University Press.
R Development Core Team (2003). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3.
Rasmussen, C. E., & Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. Advances in Neural Information Processing Systems. MIT Press.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, Methodological, 59, 731–758.
Rogers, S. E., Aftosmis, M. J., Pandya, S. A., Chaderjian, N. M., Tejnil, E. T., & Ahmad, J. U. (2003). Automated CFD parameter studies on distributed parallel computers. 16th AIAA Computational Fluid Dynamics Conference. AIAA Paper 2003-4229.
Sacks, J., Welch, W. J., Mitchell, T. J., & Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4, 409–435.
Santner, T. J., Williams, B. J., & Notz, W. I. (2003). The design and analysis of computer experiments. New York, NY: Springer-Verlag.
Schmidt, A. M., & O'Hagan, A. (2003). Bayesian inference for nonstationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society, Series B, 65, 745–758.
Seo, S., Wallat, M., Graepel, T., & Obermayer, K. (2000). Gaussian process regression: active data selection and test point rejection. Proceedings of the International Joint Conference on Neural Networks (IJCNN 2000) (pp. 241–246). IEEE.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric curve fitting. Journal of the Royal Statistical Society, Series B, 47, 1–52.
Spezzaferri, F. (1988). Non-sequential designs for model discrimination and parameter estimation. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith (Eds.), Bayesian Statistics, vol. 3, 777–783. Oxford University Press.
Stein, M. L. (1999). Interpolation of spatial data. New York, NY: Springer.
Tresp, V. (2001). Mixtures of Gaussian processes. Advances in Neural Information Processing Systems 13 (pp. 654–660). MIT Press.
Warmuth, M. K., Liao, J., Rätsch, G., Mathieson, M., Putta, S., & Lemmen, C. (2003). Support vector machines for active learning in the drug discovery process. Journal of Chemical Information Sciences, 43(2), 667–672.
Warmuth, M. K., Rätsch, G., Mathieson, M., Liao, J., & Lemmen, C. (2001). Active learning in the drug discovery process. Advances in Neural Information Processing Systems. Vancouver, BC, Canada.
Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T., & Morris, M. D. (1992). Screening, predicting, and computer experiments. Technometrics, 34, 15–25.