Experimental Design for the Heteroscedastic Model
Alexis Boukouvalas, Dan Cornford Neural Computing Research Group, Aston University
July 9, 2009
A. Boukouvalas, D. Cornford
MUCM
1/23
Overview
Refresher of the heteroscedastic Gaussian process emulator; experimental results on the Yuhba and rabies models.
Experimental design using the Fisher Information Matrix: motivation, derivation, and experimental results (monotonicity, submodularity, optimization space using exhaustive search, design criterion test).
Open questions and conclusions.
Random Output Simulators
Stochastic simulator: a mapping that produces random output given a fixed set of inputs.
Observational model:
\[ y_i(x_i) = t_i(x_i) + \varepsilon(x_i). \quad (1) \]
Algorithm
Main idea Use a coupled system of GPs to evaluate the mean and variance.
1. Estimate a standard homoscedastic GP, GH, by maximum likelihood on the two sets of observations t = {tr, ts}.
2. Train a GP, GS, on the log variance. For set r we use the corrected sample variance; for set s we sample from GH to estimate the variance at that point.
3. Estimate the heteroscedastic GP GM to jointly predict the mean and variance (next slide).
4. If s is non-empty, set GH = GM and repeat from step 2 until convergence.
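The coupled iteration above can be sketched in code. This is a minimal illustration only, assuming a one-dimensional input, a squared-exponential kernel with fixed hyperparameters, and a replicated set r alone (so the set-s sampling step and hyperparameter re-estimation are omitted); all function names are ours, not from any particular library.

```python
import numpy as np

def rbf(A, B, ls=0.3, var=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = (A[:, None, 0] - B[None, :, 0]) ** 2
    return var * np.exp(-0.5 * d2 / ls**2)

def gp_predict(Xtr, t, Xte, noise):
    # GP predictive mean with a heteroscedastic diagonal noise term.
    K = rbf(Xtr, Xtr) + np.diag(noise)
    Ks = rbf(Xtr, Xte)
    return Ks.T @ np.linalg.solve(K, t)

def coupled_gp(X, Y, n_iter=5):
    """Y: (N, n) array of n replicates at each of N design points (set r)."""
    t = Y.mean(axis=1)                       # sample means are the targets
    n = Y.shape[1]
    log_var = np.log(Y.var(axis=1, ddof=1))  # corrected sample variance
    for _ in range(n_iter):
        # GS: GP on the log variance, evaluated back at the training sites.
        r = gp_predict(X, log_var, X, noise=np.full(len(X), 1e-2))
        # GM: mean GP with noise R / n (variance of a mean of n replicates).
        # In the full algorithm, set-s targets would be refreshed from GH here.
        mu = gp_predict(X, t, X, noise=np.exp(r) / n)
    return mu, np.exp(r)
```

With only set r the loop converges immediately; the iteration matters once set s is resampled from GH each pass.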
GM estimation
For set r the target values are the sample means, not individual random samples of the underlying process. Since the sample mean is distributed as \(\mathcal{N}(m, \sigma^2/n)\), the noise variance must be divided by the number of realizations when predicting the mean at the training points. The predictive distribution equations are (we omit the mean function, although its inclusion is straightforward):
\[ \mu_* = K_*^T (K + RN^{-1})^{-1} \bar{t}, \]
\[ \Sigma_* = K_{**} + R_* - K_*^T (K + RN^{-1})^{-1} K_*, \]
where \(K = c(\cdot,\cdot)\) is the training-point covariance; \(R = \mathrm{diag}[r(x_1) \dots r(x_N)]\) is the variance estimate from GS (diagonal, since we assume independent noise); \(N = \mathrm{diag}(n_1 \dots n_N)\) is the number of samples at each training point; and \(K_*\), \(K_{**}\) and \(R_*\) are the corresponding test-point matrices.
We use the most likely value of the variance from GS; another option would be Monte Carlo.
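These predictive equations can be sketched numerically. A minimal illustration, assuming a 1-D squared-exponential kernel with fixed hyperparameters; linear interpolation stands in for the GS prediction of R at the test points (a simplification, not the method itself):

```python
import numpy as np

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls**2)

def gm_predict(X, tbar, r, n, Xs):
    """Predictive mean/covariance of GM.
    X: training inputs, tbar: sample means, r: variance estimates from GS,
    n: replicates per training point, Xs: test inputs."""
    K = rbf(X, X)
    A = K + np.diag(r / n)                # K + R N^{-1}
    Ks = rbf(X, Xs)                       # K_*
    rs = np.interp(Xs, X, r)              # stand-in for GS prediction of R_*
    mu = Ks.T @ np.linalg.solve(A, tbar)
    Sigma = rbf(Xs, Xs) + np.diag(rs) - Ks.T @ np.linalg.solve(A, Ks)
    return mu, Sigma
```

Note how the diagonal noise added to K is R/n, while the full noise R_* reappears in the predictive covariance: replication shrinks the noise on the mean targets but not the simulator's own variance.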
Heteroscedastic Model: Joint Process Derivation
Our observation equation at a given design point \(x_i\) is:
\[ t_i(x_i) = y_i(x_i) + \varepsilon(x_i). \]
Likelihood:
\[ p(\bar{t}_i \mid \bar{y}_i) = p(\bar{t}_i \mid y_i) = \mathcal{N}\!\left(\bar{t}_i \,\middle|\, y_i, \frac{\sigma^2(x_i)}{n_i}\right), \]
where \(n_i\) is the number of replicate observations and \(\sigma^2(x_i)\) the true variance at location \(x_i\). We estimate the true variance by the predictive mean of GS. Due to independence of the noise we can write the likelihood in matrix form for all observations \(1 \dots N\):
\[ p(\bar{t} \mid \bar{y}) = p(\bar{t} \mid y) = \mathcal{N}(\bar{t} \mid y, RP^{-1}), \]
where \(R = \mathrm{diag}(\sigma^2(x_i))_{i=1}^N\) and \(P = \mathrm{diag}(n_i)_{i=1}^N\).
Our zero-mean GP prior is: \(p(y) = \mathcal{N}(y \mid 0, K)\).
Heteroscedastic Model: Joint Process Derivation Continued
The marginal observation density can then be calculated:
\[ p(\bar{t}) = \int p(\bar{t} \mid y)\, p(y)\, dy = \int \mathcal{N}(\bar{t} \mid y, RP^{-1})\, \mathcal{N}(y \mid 0, K)\, dy = \mathcal{N}(\bar{t} \mid 0,\; C_\mu = K + RP^{-1}). \]
Condition on the known sites to obtain the predictive distribution:
\[ p(\bar{t}_* \mid \bar{t}) = \mathcal{N}\big( K(x_*, x)^T (K(x,x) + R(x)P(x)^{-1})^{-1} \bar{t},\;\; K(x_*, x_*) + R(x_*)P(x_*)^{-1} - K(x_*, x)^T (K(x,x) + R(x)P(x)^{-1})^{-1} K(x_*, x) \big). \]
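The marginal density \(\mathcal{N}(\bar{t} \mid 0, C_\mu)\) is what maximum likelihood optimises. A small sketch of its log-density via a Cholesky factorisation (illustrative helper, not from any library):

```python
import numpy as np

def log_marginal(tbar, K, r, n):
    # log N(tbar | 0, C) with C = K + R P^{-1}, R = diag(r), P = diag(n).
    C = K + np.diag(r / n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, tbar))
    return (-0.5 * tbar @ alpha
            - np.log(np.diag(L)).sum()          # -0.5 * log det C
            - 0.5 * len(tbar) * np.log(2 * np.pi))
```

The Cholesky route avoids forming \(C_\mu^{-1}\) explicitly and gives the log determinant for free from the factor's diagonal.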
Heteroscedastic Simulated Example: 'Yuhba' function
\[ y = 2\left(e^{-30(x-0.25)^2} + \sin(\pi x^2)\right) - 2 + \exp(\sin(2\pi x))\, \mathcal{N}(0,1), \]
where \(\mathcal{N}(0,1)\) is the standard normal distribution.
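A direct transcription of the Yuhba function, assuming the \(\mathcal{N}(0,1)\) term is an independent standard-normal draw per evaluation:

```python
import numpy as np

def yuhba(x, rng):
    """One stochastic draw of the Yuhba test function at inputs x."""
    mean = 2 * (np.exp(-30 * (x - 0.25) ** 2) + np.sin(np.pi * x ** 2)) - 2
    sd = np.exp(np.sin(2 * np.pi * x))   # input-dependent noise level
    return mean + sd * rng.standard_normal(x.shape)
```

The noise standard deviation ranges from \(e^{-1} \approx 0.37\) to \(e \approx 2.72\) across the input space, which is what makes the example strongly heteroscedastic.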
Yuhba function: Total number of evaluations fixed
[Figure panels: (b) Root Mean Squared Error, (c) Mahalanobis]
Comparison of emulator fit where the total number of simulator evaluations is fixed at different levels. Notation: 30T3 = 30 design points, each with 3 replicates. Results shown for totals of 90, 300, 400, 600 and 1600 simulator evaluations.
CSL Simulator Overview
Rabies disease propagation simulator with two vector species: raccoon dogs and foxes.
Two types of output: time series and summary statistics for each run.
Stochastic simulator: output is stochastic but not normally distributed.
14 inputs, 1 output (time to disease extinction).
Performance
[Figure panels: (a) Mahalanobis, (b) Elapsed Time, (c) RMSE, (d) MSE Variance]
Comparing sparse approximation methods to replicate design
Mahalanobis error of the projected-process approximation of Kersting et al. (4000 design points using 1000 support points) vs. a replicated design (1000 design points × 4 replicates).
Screening
Interpretation of the Gaussian process: all input factors have been sphered, so the length scales can be used for importance ranking. With a mean function, the length scales apply to the residual process only.
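Sphering can be done by whitening the inputs with a Cholesky factor of their empirical covariance; a small illustrative sketch (our own helper, assuming a full-rank covariance):

```python
import numpy as np

def sphere(X):
    """Whiten inputs: zero mean, identity empirical covariance."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)       # empirical covariance of the inputs
    L = np.linalg.cholesky(C)          # C = L L^T
    return np.linalg.solve(L, Xc.T).T  # L^{-1} x_c per row
```

After this transform the per-dimension scales of the inputs no longer confound the fitted length scales, which is what makes the ranking comparison fair.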
Interpreting the variance emulator (GS) by looking at the regression coefficients (Coeff) and correlation length scales (Scale):

FACTOR        COEFF     FACTOR       SCALE
RacDensity    0.1608    RacRabid     1.4281
RacDeath      0.0633    FoxInf       1.4594
RacBirth      0.0200    FoxRabid     1.5047
Fisher Information Matrix

The FIM is a \(p \times p\) symmetric matrix, where \(p\) is the number of unknown parameters:
\[ \mathcal{F} = - \int \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \left[ \ln f(X; \theta) \right] f(X; \theta) \, dX. \]
Given \(X\) distributed as \(\mathcal{N}(\mu(\theta), \Sigma(\theta))\), the \((i,j)\) element of the FIM is:
\[ \mathcal{F}_{ij} = \frac{\partial \mu^T}{\partial \theta_i} \Sigma^{-1} \frac{\partial \mu}{\partial \theta_j} + \frac{1}{2} \operatorname{tr}\!\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_i} \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_j} \right). \quad (2) \]
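Equation (2) is easy to evaluate once the mean and covariance derivatives are available as arrays. A small sketch (our own helper names), which can be sanity-checked against the textbook FIM of a univariate \(\mathcal{N}(\mu, \sigma^2)\), namely \(\mathrm{diag}(1/\sigma^2,\, 1/(2\sigma^4))\) for \(\theta = (\mu, \sigma^2)\):

```python
import numpy as np

def fim_gaussian(mu, Sigma, dmu, dSigma):
    """Eq. (2): FIM for X ~ N(mu(theta), Sigma(theta)).
    dmu: list of d(mu)/d(theta_i) vectors; dSigma: list of d(Sigma)/d(theta_i)."""
    p = len(dmu)
    Si = np.linalg.inv(Sigma)
    F = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            F[i, j] = (dmu[i] @ Si @ dmu[j]
                       + 0.5 * np.trace(Si @ dSigma[i] @ Si @ dSigma[j]))
    return F
```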
Derivatives of the joint mean function

The derivatives of the joint mean function with respect to the covariance parameters of GM and GS are:
\[ \frac{\partial \mu_{GM*}}{\partial \theta_\mu} = \frac{\partial K_{\mu*}^T}{\partial \theta_\mu} C_\mu^{-1} \bar{t} - K_{\mu*}^T C_\mu^{-1} \frac{\partial C_\mu}{\partial \theta_\mu} C_\mu^{-1} \bar{t}, \quad (3) \]
\[ \frac{\partial \mu_{GM*}}{\partial \theta_\Sigma} = - K_{\mu*}^T C_\mu^{-1} \frac{\partial R}{\partial \theta_\Sigma} P^{-1} C_\mu^{-1} \bar{t}. \quad (4) \]
The R matrix is diagonal with elements \(R_{ii} = \exp(r(x_i))\), and hence its derivative is \(\frac{\partial R_{ii}}{\partial \theta_\Sigma} = \exp(r(x_i)) \frac{\partial r(x_i)}{\partial \theta_\Sigma}\), where \(r(x_i)\) is the most likely prediction of the variance from GS at point \(x_i\). The derivative of \(r\) is
\[ \frac{\partial r(x_i)}{\partial \theta_\Sigma} = \frac{\partial K_{\Sigma*}^T}{\partial \theta_\Sigma} C_\Sigma^{-1} \lambda^2 - K_{\Sigma*}^T C_\Sigma^{-1} \frac{\partial C_\Sigma}{\partial \theta_\Sigma} C_\Sigma^{-1} \lambda^2. \quad (5) \]
Derivatives of the joint variance

The derivatives of the joint variance function with respect to the covariance parameters of GM and GS are:
\[ \frac{\partial \Sigma_{GM*}}{\partial \theta_\mu} = \frac{\partial K_{\mu**}}{\partial \theta_\mu} - \Xi - \Xi^T + K_{\mu*}^T C_\mu^{-1} \frac{\partial K_\mu}{\partial \theta_\mu} C_\mu^{-1} K_{\mu*}, \quad (6) \]
\[ \frac{\partial \Sigma_{GM*}}{\partial \theta_\Sigma} = \frac{\partial R_*}{\partial \theta_\Sigma} P_*^{-1} + K_{\mu*}^T C_\mu^{-1} \frac{\partial R}{\partial \theta_\Sigma} P^{-1} C_\mu^{-1} K_{\mu*}, \quad (7) \]
where \(\Xi = \frac{\partial K_{\mu*}^T}{\partial \theta_\mu} C_\mu^{-1} K_{\mu*}\).
Biased Estimation: Empirical Parameter Covariance
We compute the empirical covariance of the squared exponential kernel parameters using 2000 samples of a heteroscedastic GP.
Monotonicity
Submodularity
Definition: A function \(F\) is submodular iff for \(A \subseteq B\) and \(\varepsilon \notin B\):
\[ F(A \cup \{\varepsilon\}) - F(A) \ge F(B \cup \{\varepsilon\}) - F(B). \]
Theorem (Nemhauser et al., 1978): If \(F\) is a monotone submodular function over a finite ground set with \(F(\emptyset) = 0\), then the greedy optimization algorithm is within a constant factor \(\left(1 - \left(\tfrac{k-1}{k}\right)^k\right)\) of the optimal strategy for \(k\) design points.
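A greedy design sketch under these assumptions, using \(f(A) = \log\det(I + K_A/\sigma^2)\), a standard monotone submodular design score (our choice for illustration; the slides use the FIM criterion instead). On a small candidate set the Nemhauser bound can be checked against exhaustive search:

```python
import numpy as np

def logdet_score(K, idx, noise=0.1):
    # f(A) = log det(I + K_A / noise): monotone submodular, f(empty) = 0.
    sub = K[np.ix_(idx, idx)]
    return np.linalg.slogdet(np.eye(len(idx)) + sub / noise)[1]

def greedy_design(K, k):
    # Repeatedly add the candidate with the largest marginal gain.
    chosen = []
    for _ in range(k):
        rest = [i for i in range(len(K)) if i not in chosen]
        gains = [logdet_score(K, chosen + [i]) for i in rest]
        chosen.append(rest[int(np.argmax(gains))])
    return chosen
```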
Is the FIM a submodular function?
X-axis: design size. Y-axis: number of submodularity violations over 100 realizations.
Optimization space using Exhaustive Search
Using the FIM criterion, pick 6 points from a 30-point candidate set (1,623,160 candidate designs).
Design Criterion Experiment: Designs Considered
Design Criterion Test: Empirical Parameter Variance vs FIM
Summary

Heteroscedastic Framework: The approach improves upon existing methods in both accuracy and computational efficiency, in terms of inference and prediction time. In combination with a discrepancy model and real-world observations, this method could facilitate the efficient statistical calibration of stochastic simulators.

Open Questions:
Is the FIM a good design criterion for correlated non-linear models? Zhu and Stein show monotonicity of the Fisher Information Matrix with respect to the empirical parameter variance, but this is empirical evidence only.
When doing maximum likelihood for the GP parameters, are they identifiable and consistently estimable? In some cases (see Stehlík) the Matérn parameters are not. Under what conditions are they consistently estimable for the squared exponential?
References
K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. "Most Likely Heteroscedastic Gaussian Process Regression". In Proc. 24th International Conference on Machine Learning, 2007.
Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. "Regression with Input-dependent Noise: A Gaussian Process Treatment". Advances in Neural Information Processing Systems. The MIT Press, 1998.
Werner Müller and Milan Stehlík. "Issues in the Optimal Design of Computer Experiments". IFAS Research Paper Series, July 2007.
Zhengyuan Zhu and Michael L. Stein. "Spatial Sampling Design for Parameter Estimation of the Covariance Function". Journal of Statistical Planning and Inference, 2005.