Local Kernel Dimension Reduction in Approximate Bayesian Computation

Jin Zhou
Kenji Fukumizu
September 6, 2016
Abstract

Approximate Bayesian Computation (ABC) has been widely used in applications involving intractable likelihood functions. Instead of explicitly evaluating the likelihood function, ABC approximates the posterior distribution by first jointly sampling the parameter and data, and then accepting the pair based on the distance between the data and the observation. The efficiency of the sampling depends on the distance function, which in turn depends on the dimensionality of the data. It is common practice to use summary statistics in the distance function to reduce the dimensionality, and the construction of summary statistics that are both low dimensional and sufficient is an important issue. This paper proposes Local Gradient Kernel Dimension Reduction (LGKDR) to construct low dimensional summary statistics that aim at sufficiency with respect to the parameters to be estimated. The proposed method identifies a linear sufficient subspace of the original summary statistics and applies a weighting kernel to concentrate on the local properties near the observation point. The low dimensional statistics are formed by projecting the original summary statistics onto that subspace. Unlike many other dimension reduction methods, no assumptions are made on the marginal distribution of the original variables or on the form of the regression model, permitting use in a wide range of applications. Experiments show that the proposed method successfully constructs low dimensional summary statistics without specific design or prior domain knowledge, and achieves competitive or better performance compared to other dimension reduction methods used in ABC.
1 Introduction
Monte Carlo methods are prominent tools in sampling and inference problems. While the celebrated Markov Chain Monte Carlo (MCMC) methods have found success in applications in which the likelihood function is known up to an unknown constant, MCMC cannot be used in scenarios where the likelihood is intractable. In these cases, if the problem can be characterized by a generative model, Approximate Bayesian Computation (ABC) is often a candidate approach. ABC is a Monte Carlo method that approximates the posterior distribution by jointly generating simulated data and parameters and then sampling according to the distance between the simulated data and the observation, without evaluating the likelihood. ABC was first introduced in population genetics [1][2] and has since been applied to a range of complex applications including dynamical systems [3], ecology [4], Gibbs random fields [5] and demography [6].

The accuracy of ABC depends not only on the Monte Carlo error, but also on the distance function used in the sampling step. Since the probability of generating a sample that is identical to the observation is essentially zero, instead of sampling from the true posterior, a set of simulated samples that are close to the observation is accepted, where "closeness" is characterized by the distance function. The approximation error is proportional to the threshold used in the distance function. It is therefore desirable to use a threshold that is as small as possible, but the efficiency of sampling decreases accordingly: for a fixed distance function, the choice of threshold trades off accuracy against efficiency. In practical applications the trade-off is even more serious when the data are high dimensional. In this case, distance functions can no longer characterize the "closeness" of the samples properly, substantially worsening the performance of the sampling. This is known as the curse of dimensionality.

To circumvent this problem, summary statistics of the original data are used in the distance function instead of the original data. It is then crucial that the summary statistics are both low dimensional and sufficient. Formally, assume the generative model p(y|θ) of observation y given
parameter θ, and consider summary statistics s_obs = G_s(y_obs) and s = G_s(y), where G_s : Y → S is the mapping from the original sample space Y to the space S of low dimensional summary statistics. The posterior distribution p(θ|y_obs) is approximated by p(θ|s_obs), which is constructed as p(θ|s_obs) = ∫ p_ABC(θ, s|s_obs) ds, with

\[
p_{ABC}(\theta, s \mid s_{obs}) \propto p(\theta)\, p(s \mid \theta)\, K(\|s - s_{obs}\| / h), \tag{1}
\]
where K is a smoothing kernel with bandwidth h. If the summary statistics s are sufficient, it can be shown that the approximation (1) converges to the posterior p(θ|s_obs) as h goes to zero [7].

As discussed above, the performance of ABC relies on the summary statistics and on the sampling algorithm. Many publications are devoted to improving the sampling: the rejection method [8], Markov Chain Monte Carlo (MCMC) [9] and more advanced methods such as sequential Monte Carlo [10][3] are popular choices. In this paper, we focus on the construction of summary statistics. In early works on ABC, summary statistics were chosen by domain experts in an ad-hoc manner. This is adequate when the dimensionality is small, but choosing appropriate summary statistics is much more difficult for complex models. As applications of ABC have expanded to a wider range of disciplines, dimension reduction methods have been introduced as a principled way to construct summary statistics. In this setting, a set of redundant and hopefully sufficient summary statistics
is first chosen by domain experts; these are often called the initial summary statistics. A dimension reduction method is then applied to yield low dimensional summary statistics while preserving sufficiency. Several dimension reduction methods have been proposed for ABC, such as entropy based subset selection [11], partial least squares [12], neural networks [13] and the expected posterior mean [14]. The entropy based subset selection methods work well only when the set of low dimensional summary statistics is a subset of the initial summary statistics, and their computational complexity increases exponentially with the size of the initial summary statistics. The partial least squares and neural network methods aim to capture the nonlinear relationships among the original summary statistics. In both cases a specific form of the regression function is assumed, and the performance of these algorithms depends on whether the assumptions are met. A comprehensive review [15] discusses the methods mentioned above and compares their performance. It reports that the expected posterior mean method (Semi-automatic ABC) [14] produces relatively better results than the other methods in a variety of experiments.

Semi-automatic ABC [14] uses the estimated posterior mean as the summary statistic. A pilot run of ABC is first used to sample data from a truncated region of parameter space with non-negligible probability mass. The posterior mean is then estimated from the simulated data and used as the summary statistic in a subsequent formal run of ABC. A linear model of the form θ_i = β^(i) f(y) + ε_i is used in the estimation, where f(y) is a vector of possibly non-linear transforms of the data. For each application, the features f(y) have to be carefully designed to achieve a good estimate; however, there is no principled method for identifying a good set of features for a particular application. For simplicity, a vector of powers of the data (y, y², y³, y⁴, ...) is often used as f(y), as noted in [14].

To provide a principled way to capture higher order non-linearity and to realize an automatic construction of summary statistics, we introduce kernel based sufficient dimension reduction. This paper proposes a localized version of Gradient based Kernel Dimension Reduction (GKDR) [16]. GKDR estimates the sufficient low dimensional subspace by solving an eigenvalue problem involving gradients in a reproducing kernel Hilbert space. In the original GKDR, the estimate is based on an average over all data points of matrices representing the derivatives, with the aim of reducing variance; in ABC, however, inference is based on a single observation point. We therefore propose a localized GKDR that averages over simulated points in a small neighborhood around the observation, with each point weighted by a distance metric measuring the difference between the simulated data and the observation. Another proposal is to use different summary statistics for different parameters: the sufficient subspaces for different parameters can differ, depending on the particular problem, and in some cases, as will be shown later, applying separate dimension reduction procedures yields a better estimate. As demonstrated, our method achieves similar or better results in
comparison with Semi-automatic ABC [14]. For applications involving strong non-linearity, substantial improvements are obtained. The paper is organized as follows. In Section 2, we review GKDR and introduce its localized modification, followed by a discussion of computational considerations. In Section 3, we show simulation results for several common ABC applications and compare the proposed method with Semi-automatic ABC.
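To make the sampling step concrete, the following is a minimal sketch of rejection ABC with summary statistics and a smoothing-kernel acceptance step in the spirit of (1). The generative model, prior, summary statistics, bandwidth, and all function names here are illustrative placeholders, not the models or settings used in the experiments of this paper.

```python
# Minimal rejection-ABC sketch: draw theta from the prior, simulate data,
# summarize, and accept with probability K(||s - s_obs|| / h) as in (1).
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=50):
    # Toy generative model: n i.i.d. observations from N(theta, 1).
    return rng.normal(theta, 1.0, size=n)

def summary(y):
    # Toy initial summary statistics G_s(y): sample mean and variance.
    return np.array([y.mean(), y.var()])

def abc_rejection(y_obs, n_draws=20000, h=0.1):
    # Gaussian smoothing kernel used as the acceptance probability.
    s_obs = summary(y_obs)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)      # draw theta from a flat prior p(theta)
        s = summary(simulate(theta))        # simulate data and compute s = G_s(y)
        dist = np.linalg.norm(s - s_obs)
        if rng.uniform() < np.exp(-0.5 * (dist / h) ** 2):
            accepted.append(theta)
    return np.array(accepted)

y_obs = simulate(1.5)
posterior_samples = abc_rejection(y_obs)
print(posterior_samples.mean(), posterior_samples.std())
```

Shrinking h tightens the acceptance around s_obs, which illustrates the accuracy/efficiency trade-off discussed above.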
2 Local Kernel Dimension Reduction
In this section, we review Gradient based Kernel Dimension Reduction (GKDR) and propose its localized modification, Local GKDR. A discussion is given at the end of the section.
2.1 Gradient based Kernel Dimension Reduction
Consider an observation (s, θ), where s ∈ R^m is the vector of initial summary statistics and θ ∈ R is the parameter to be estimated in a specific ABC application. Assume that there is a d-dimensional subspace U ⊂ R^m, d < m, such that

\[
\theta \perp\!\!\!\perp s \mid B^{\top} s, \tag{2}
\]

where B = (β_1, ..., β_d) is the orthogonal projection matrix from R^m to R^d. The columns of B span U and B^T B = I_d. Condition (2) shows that given
B^T s, θ is independent of the initial summary statistics s. It is then sufficient to use the d-dimensional vector z = B^T s as the summary statistics. The subspace U is called the effective dimension reduction (EDR) space [17] in the classical dimension reduction literature. While a tremendous number of works have been published on estimating the EDR space, in this paper we propose to use GKDR, which makes no strong assumptions on the marginal distribution or the variable type. The following is a brief review of GKDR; for further details, we refer to [18][19][16].

Let B = (β_1, ..., β_d) ∈ R^{m×d} be the projection matrix to be estimated, and z = B^T s. We assume (2) holds, so that p(θ|s) = p̃(θ|z). The gradient of the regression function, denoted ∇_s, satisfies

\[
\nabla_s = \frac{\partial E(\theta \mid s)}{\partial s} = \frac{\partial E(\theta \mid z)}{\partial s} = B\, \frac{\partial E(\theta \mid z)}{\partial z}, \tag{3}
\]
which shows that the gradients are contained in the EDR space. Consider the matrix M = E[∇_s ∇_s^T] = B A B^T, where A_ij = E[ ∂E(θ|z)/∂z_i · ∂E(θ|z)/∂z_j ], i, j = 1, ..., d. The projection directions β lie in the subspace spanned by the eigenvectors of M, so they can be estimated by an eigenvalue decomposition. In GKDR, the matrix M is estimated by the kernel method described below.

Let Ω be a non-empty set. A real valued kernel k : Ω × Ω → R is called positive definite if Σ_{i,j=1}^{n} c_i c_j k(x_i, x_j) ≥ 0 for any x_1, ..., x_n ∈ Ω and c_1, ..., c_n ∈ R. Given a positive definite kernel k, there exists a unique reproducing kernel Hilbert
space (RKHS) H associated with it such that: (1) k(·, x) spans H; (2) H has the reproducing property [20]: for all x ∈ Ω and f ∈ H, ⟨f, k(·, x)⟩ = f(x).

Given a training sample (s_1, θ_1), ..., (s_n, θ_n), let k_S(s_i, s_j) = exp(−‖s_i − s_j‖²/σ_S²) and k_Θ(θ_i, θ_j) = exp(−‖θ_i − θ_j‖²/σ_Θ²) be Gaussian kernels defined on R^m and R, associated with the RKHSs H_S and H_Θ, respectively. Under boundedness assumptions on the conditional expectation E(θ|S = s) and on the average gradient functional with respect to z, these functionals can be estimated using cross-covariance operators defined on the RKHSs, and the consistency of their empirical estimators is guaranteed [16]. Using these estimators, we construct a covariance matrix of average gradients as
\[
\widehat{M}_n(s_i) = \nabla k_S(s_i)^{\top} (G_S + n\epsilon_n I_n)^{-1} G_\Theta (G_S + n\epsilon_n I_n)^{-1} \nabla k_S(s_i), \tag{4}
\]
where G_S and G_Θ are the Gram matrices with entries k_S(s_i, s_j) and k_Θ(θ_i, θ_j), respectively, ∇k_S(s_i) ∈ R^{n×m} is the derivative of the kernel k_S(·, s_i) with respect to s_i, and ε_n is the regularization coefficient. This matrix can be viewed as a straightforward extension of the covariance matrix in principal component analysis (PCA); the data here are the features in the RKHS representing the gradients instead of the gradients in the original space.

The averaged estimator M̃ = (1/n) Σ_{i=1}^{n} M̂_n(s_i) is computed over the training sample (s_1, θ_1), ..., (s_n, θ_n). Finally, the projection matrix B is estimated by taking the d eigenvectors corresponding to the d largest eigenvalues of M̃, just as in PCA, where d is the dimension of the estimated subspace. In this paper, we assume d is known.
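As a concrete illustration of the procedure above, the following is a compact sketch of the GKDR estimate: it forms the Gram matrices, evaluates M̂_n(s_i) as in (4) for each training point, averages, and takes the top d eigenvectors. The bandwidths sigma_s and sigma_t and the regularization coefficient eps_n are illustrative choices, and the closed-form kernel derivative assumes the Gaussian kernel k_S defined above.

```python
# Sketch of GKDR under the Gaussian-kernel assumptions stated in the text.
import numpy as np

def gram_gaussian(x, sigma):
    # Gram matrix with entries k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2).
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma ** 2)

def gkdr(S, Theta, d, sigma_s=1.0, sigma_t=1.0, eps_n=1e-3):
    """S: (n, m) initial summary statistics; Theta: (n,) or (n, p) parameters."""
    n, m = S.shape
    Theta = Theta.reshape(n, -1)
    G_S = gram_gaussian(S, sigma_s)
    G_T = gram_gaussian(Theta, sigma_t)
    R = np.linalg.inv(G_S + n * eps_n * np.eye(n))   # (G_S + n eps_n I_n)^{-1}
    core = R @ G_T @ R
    M_avg = np.zeros((m, m))
    for i in range(n):
        # Rows of grad are the derivatives of k_S(s_j, s_i) with respect to s_i.
        grad = 2.0 / sigma_s ** 2 * (S - S[i]) * G_S[:, [i]]
        M_avg += grad.T @ core @ grad                # M_hat_n(s_i) as in (4)
    M_avg /= n
    # Projection matrix B: eigenvectors of the d largest eigenvalues, as in PCA.
    eigvals, eigvecs = np.linalg.eigh(M_avg)
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]
```

With B estimated, for example by B = gkdr(S, Theta, d=2), the low dimensional summary statistics are obtained by the projection z = S @ B and can then be used in the distance function of ABC.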
2.2 Local GKDR
As discussed above, the estimator M̃ is obtained by averaging over the training sample s_i. When GKDR is applied to ABC, only one observation is available. We therefore propose to generate a set of training data using the generative model and to introduce a weighting mechanism that concentrates on the observation and avoids using samples generated in regions of low probability density. Given simulated data X_1, ..., X_N and a weight kernel K_w : R^m → R, we propose the local GKDR estimator

\[
\tilde{M} = \frac{1}{N} \sum_{i=1}^{N} K_w(X_i)\, \widehat{M}(X_i), \tag{5}
\]
where M̂(X_i) is the m × m matrix computed as in (4) at the point X_i and K_w(X_i) is the corresponding weight. K_w(x) can be any weighting kernel. In the numerical experiments, a triweight kernel is used, which is written as

\[
K_w(X_i) = (1 - u^2)^3 \, \mathbf{1}_{\{u \le 1\}},
\]
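Under the same assumptions as the GKDR sketch in Section 2.1 (and reusing its gram_gaussian helper), the following is a minimal sketch of the local estimator (5). The scaling u = ‖X_i − s_obs‖ / h inside the triweight weight, with a user-chosen bandwidth h, is an assumption made here for illustration only.

```python
# Sketch of local GKDR: the estimator (5) with triweight weights centered at s_obs.
import numpy as np

def triweight(u):
    # Triweight weighting kernel: (1 - u^2)^3 on u <= 1, zero otherwise.
    return np.where(u <= 1.0, (1.0 - u ** 2) ** 3, 0.0)

def local_gkdr(X, Theta, s_obs, d, h=1.0, sigma_s=1.0, sigma_t=1.0, eps_n=1e-3):
    """X: (N, m) simulated initial statistics; Theta: (N,) or (N, p) parameters."""
    N, m = X.shape
    Theta = Theta.reshape(N, -1)
    G_S = gram_gaussian(X, sigma_s)          # helper from the GKDR sketch above
    G_T = gram_gaussian(Theta, sigma_t)
    R = np.linalg.inv(G_S + N * eps_n * np.eye(N))
    core = R @ G_T @ R
    # Weight each simulated point by its distance to the observed statistics.
    w = triweight(np.linalg.norm(X - s_obs, axis=1) / h)
    M_local = np.zeros((m, m))
    for i in range(N):
        grad = 2.0 / sigma_s ** 2 * (X - X[i]) * G_S[:, [i]]
        M_local += w[i] * (grad.T @ core @ grad)     # K_w(X_i) * M_hat(X_i), eq. (5)
    M_local /= N
    eigvals, eigvecs = np.linalg.eigh(M_local)
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # projection matrix B
```

Points far from the observation receive zero weight, so the estimated subspace reflects the local behavior of the regression function near s_obs, as intended by the localization.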