arXiv:1510.03105v3 [stat.CO] 16 Feb 2016

Kernel Sequential Monte Carlo

Ingmar Schuster*, CEREMADE, Université Paris Dauphine, France. ingmar.schuster@dauphine.fr

Heiko Strathmann*, Gatsby Unit, University College London, UK. heiko.strathmann@gmail.com

Brooks Paige, Department of Engineering, University of Oxford, UK. brooks@robots.ox.ac.uk

Dino Sejdinovic, Department of Statistics, University of Oxford, UK. dino.sejdinovic@stats.ox.ac.uk

*Equal contribution.

Abstract

Bayesian posterior inference with Monte Carlo methods has a fundamental role in statistics and probabilistic machine learning. Target posterior distributions arising in increasingly complex models often exhibit high degrees of nonlinearity and multimodality, and pose substantial challenges to traditional samplers. We propose the kernel sequential Monte Carlo (KSMC) framework, which builds emulator models of the current particle system in a reproducing kernel Hilbert space and uses the emulator's geometry to inform local proposals. KSMC is applicable when gradients are unknown or prohibitively expensive, and inherits the superior performance of SMC on multi-modal targets as well as its ability to estimate model evidence. The strengths of the proposed methodology are demonstrated on a series of challenging synthetic and real-world examples.

1. Introduction

Monte Carlo methods have become one of the main inference tools of statistics and machine learning over the last thirty years (Robert & Casella, 2004; Andrieu et al., 2003). They are used to numerically approximate intractable integrals with respect to a given target distribution, e.g. a Bayesian posterior. Importantly, they also provide means to quantify uncertainty in the form of variance estimates, credible intervals, and regions of high posterior density. The most widely used Monte Carlo method today is Markov chain Monte Carlo (MCMC), where a Markov chain is constructed that is ergodic with respect to the desired target; running the chain long enough yields an approximate sample from the target. In another line of research, sequential Monte Carlo (SMC) methodology, which was traditionally employed for filtering problems with a sequence of time-varying target distributions (Doucet et al., 2001b), has been employed as an alternative to MCMC in the context of a single (static) target distribution (Chopin, 2002; Del Moral et al., 2006; Cappé et al., 2008; Schäfer & Chopin, 2013; Fearnhead & Taylor, 2013), with particular success on multi-modal targets and for estimating model evidence. Sequential Monte Carlo methods often introduce a series of auxiliary intermediate distributions, beginning with an "easy" distribution and sampling from each of them in turn until the final target is reached. In contrast to the use of SMC for inference in time-series models, where each intermediate distribution is typically defined on a successively larger latent space, SMC methods for static target distributions define an artificial, smooth series of incremental targets. This series can be constructed in a number of ways, e.g. by tempering the target density (Neal, 1998; Del Moral et al., 2006) or by including data points (and thus likelihood terms) sequentially (Chopin, 2002). The population Monte Carlo (PMC) methods introduced by Cappé et al. (2004) can be viewed as a special case of SMC which targets the final distribution at each iteration. The original work on PMC exhibited striking resemblance to commonly used MCMC methods such as random walk Metropolis, often finding that the same proposal kernel produces better estimates with PMC than with MCMC (Douc et al., 2007).

Kernel methods have recently been employed in the context of adaptive MCMC (Sejdinovic et al., 2014; Strathmann et al., 2015). By modelling a Markov chain trajectory in a reproducing kernel Hilbert space (RKHS) and using the geometry therein, it is possible to significantly improve mixing on nonlinear target distributions. Covariance in the RKHS can be used to construct an adaptive random walk scheme, Kernel Adaptive Metropolis-Hastings (KAMH), with proposals that are locally aligned with the target density (Sejdinovic et al., 2014). Gradients of exponential families in the RKHS can be used to construct Kernel Hamiltonian Monte Carlo (KHMC), an algorithm that behaves increasingly like HMC as it sees more data (Strathmann et al., 2015). Both KAMH and KHMC fall back to a random walk in previously unexplored regions, inheriting convergence properties such as geometric ergodicity on log-concave targets (cf. Proposition 3 in Strathmann et al., 2015). In addition, these methods have the remarkable property that they do not require gradient information of the target.

In this paper, we develop a framework for kernel sequential Monte Carlo (KSMC) for sampling from static models. KSMC represents the particle system of SMC algorithms in an RKHS, similarly to previous work using un-weighted samples in adaptive MCMC (Sejdinovic et al., 2014; Strathmann et al., 2015). We argue, however, that KSMC is a more natural framework for employing such RKHS-based representations and the proposals arising from them. Namely, kernel-based adaptive MCMC samplers suffer from a difficult-to-tune exploration-exploitation trade-off: a decaying adaptation schedule is required for both KAMH and KHMC in order to ensure convergence to the correct target and even ergodicity of the resulting Markov chain, and there is currently no principled guidance on selecting such a schedule. In sharp contrast, in SMC (and PMC) adaptation can proceed without limitation: the choice of an adaptation schedule is completely circumvented, allowing us to build ever better RKHS-based models of the target as more samples are generated. The learned geometry of the corresponding "emulator" model is used to construct proposal distributions, both for rejuvenation (MCMC) steps inside SMC algorithms and for importance sampling directly. We apply this framework to different SMC strategies and study the advantages and inherent trade-offs of the resulting algorithms. We first introduce the Kernel Adaptive Sequential Monte Carlo sampler (KASMC), where the global covariance estimate of the adaptive SMC sampler (ASMC) of Fearnhead & Taylor (2013) is replaced with the kernel-informed local covariance of Sejdinovic et al. (2014). Like ASMC itself, KASMC starts as a standard random walk and then smoothly transitions into kernel-informed proposals that can significantly improve sampling efficiency over ASMC.

A second algorithm we consider uses the (approximate) infinite-dimensional kernel exponential family models of Strathmann et al. (2015) to estimate target gradients. Using this estimate, we develop Kernel Gradient Importance Sampling (KGRIS), a gradient-free version of the Gradient Importance Sampling (GRIS) of Schuster (2015). KGRIS is a novel adaptation of kernel gradient estimation ideas for constructing Langevin diffusions.

Like their gradient-based counterparts, these kernelised algorithms can lead to substantially improved performance over adaptive random walk methods. Unlike them, since no gradient or higher-order information is required, our family of KSMC algorithms can be used in combination with IS frameworks for sampling from doubly intractable targets, such as SMC² (Chopin et al., 2013) and IS² (Tran et al., 2013). These recently established methods are usable exactly where more refined gradient-based samplers such as MALA, HMC, and GRIS are not applicable, since gradient information is unavailable there.

Outline. The paper proceeds as follows. We review SMC samplers in Section 2.1 and emulators based on positive definite kernels in Section 2.2. The kernel SMC samplers are introduced in Section 3, where the trade-offs of the different variants are also discussed. In Section 4, we compare two instances of KSMC algorithms to the existing adaptive SMC sampler and to MCMC algorithms on a synthetic nonlinear target, on sampling the hyper-parameters of a Gaussian process classifier, and on a challenging stochastic volatility model. The last section concludes.

2. Background

2.1. Sequential Monte Carlo

For ease of exposition, we start with the special case of sequential Monte Carlo samplers called Population Monte Carlo (PMC). SMC samplers are then described, although not in full generality, in order to keep the distinction simple for the reader. For details of both algorithms, we refer to Cappé et al. (2004; 2008) for PMC and to Del Moral et al. (2006) for SMC samplers.

PMC is based on the observation that one can sample from a target distribution π by drawing a proposal distribution q from a density p_q over proposals. Then, generating a sample X ∼ q, we can use it with the weight w(X) = π(X)/q(X) and the usual IS estimator to obtain a consistent and asymptotically unbiased estimate of the integral of interest. The density over proposals p_q can be arbitrary, and samples from it need not be independent. In its initial formulation, starting from a system of N initial samples called particles¹, Population Monte Carlo samples from a target π by drawing a new particle X_i' from a local proposal q(·|X_i) conditioned on the value of the corresponding particle X_i in the previous iteration. An importance re-sampling step is then applied, re-sampling from {X_i'}_{i=1}^N proportionally to the weights w_i = π(X_i')/q(X_i'|X_i), resulting in an approximate sample from π. The algorithm is iterated, using the latest un-weighted sample to inform new local proposal distributions, which can look much like Markov proposals within Metropolis-Hastings; for instance, the first proposals used in PMC methods were random-walk steps (Douc et al., 2007; Cappé et al., 2004).

¹ While the original PMC paper used the term "population", we stick with the SMC nomenclature.
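To make the PMC iteration concrete, the following is a minimal sketch, assuming a Gaussian random-walk local proposal (the first proposal type used in the PMC literature); names such as pmc_step and log_target are illustrative, not from the paper.

```python
import numpy as np

def pmc_step(particles, log_target, step=1.0, rng=np.random.default_rng()):
    """One PMC iteration: propose locally, importance-weight, resample."""
    N, d = particles.shape
    # local random-walk proposal X_i' ~ q(.|X_i) = N(X_i, step^2 I)
    proposals = particles + step * rng.standard_normal((N, d))
    # w_i = pi(X_i') / q(X_i'|X_i); additive constants of the log-densities
    # cancel after the normalisation below
    log_q = -0.5 * np.sum((proposals - particles) ** 2, axis=1) / step ** 2
    log_w = np.array([log_target(x) for x in proposals]) - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # importance resampling yields an approximate un-weighted sample from pi
    return proposals[rng.choice(N, size=N, p=w)]
```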

Sequential Monte Carlo samplers in the more general case introduce an artificial sequence of incremental target densities π_0, …, π_T (Del Moral et al., 2006), culminating in π_T ≡ π. Though it may be difficult to construct an importance sampling proposal density for π directly, these intermediate densities are designed so that consecutive targets π_t, π_{t+1} are close, e.g. under a divergence measure like Kullback-Leibler or total variation distance. Thus, sampling from π_{t+1} is easy given a set of samples from π_t. At iteration t = 0, we generate a set of samples and accompanying weights {(w_{0,i}, X_{0,i})}_{i=1}^N, referred to as particles, from a proposal distribution q_0, where w_{0,i} = π_0(X_{0,i})/q_0(X_{0,i}). We apply importance resampling, resulting in N equally weighted particles {(1/N, X̄_{0,i})}_{i=1}^N. We then perform a rejuvenation move for each X̄_{0,i} by a Markov kernel leaving π_0 invariant (which can be thought of as a single MCMC step). At iteration t+1 > 0, we set X_{t+1,i} = X̄_{t,i} and approximate π_{t+1} by the weighted particle system {(π_{t+1}(X_{t+1,i})/π_t(X_{t+1,i}), X_{t+1,i})}_{i=1}^N. This system is then again resampled and rejuvenated using MCMC, and so on.

Kernel-based proposals such as those developed in Sejdinovic et al. (2014) and Strathmann et al. (2015) can be used either for MCMC rejuvenation steps inside SMC or directly as local importance sampling proposals. Namely, the proposal distribution q is constructed based on modelling the particle system {(w_{t,i}, X_{t,i})}_{i=1}^N in an RKHS.

2.2. Kernel Adaptive MCMC

Sejdinovic et al. (2014) introduced KAMH, a method for adapting the proposal distribution in a Metropolis-Hastings MCMC algorithm based on the history of the Markov chain x = {x_1, x_2, …}. When the target distribution exhibits strong non-linear dependencies across dimensions, adaptation of a global (location-independent) covariance matrix as done by Haario et al. (2001) can be inefficient. For this reason, Sejdinovic et al. (2014) represent the covariance of the target as an empirical Gaussian measure with mean μ_x := (1/|x|) Σ_{x∈x} k(x, ·) and covariance (1/|x|) Σ_{x∈x} k(x, ·) ⊗ k(x, ·) − μ_x ⊗ μ_x in a possibly infinite-dimensional RKHS induced by a positive definite kernel function k (Smola et al., 2007). The RKHS measure is supported only on a finite-dimensional subspace and can be sampled from exactly.

Given an (un-weighted) particle X̄_j, Sejdinovic et al. (2014) showed that for certain kernels it is possible to integrate out the RKHS proposal analytically, which elegantly results in a closed-form Gaussian proposal density in the input space. This proposal at particle X̄_j is given by

q_KAMH(·|X̄_j) = N(·|X̄_j, γ²I + ν² M_{X,X̄_j} C M_{X,X̄_j}^⊤),   (1)

where C = I − (1/n) 11^⊤ is a centering matrix and M_{X,X̄_j} = 2[∇_x k(x, X_1)|_{x=X̄_j}, …, ∇_x k(x, X_N)|_{x=X̄_j}] collects kernel gradients with respect to all particles. Additional exploration noise with variance γ² prevents the proposal from collapsing in unexplored regions of the input space. This ensures that the proposal puts mass everywhere in the space, resulting in ergodic Markov chains and valid importance samplers.

When using a linear kernel for k, the covariance matrix γ²I + ν² M_{X,X̄_j} C M_{X,X̄_j}^⊤ simply reduces to an estimate of the global covariance of the current target π_t; in MCMC, this results exactly in the adaptive Metropolis algorithm of Haario et al. (2001). Using a Gaussian radial basis function (RBF) kernel instead results in a covariance matrix that locally aligns with the structure of the posterior in the proximity of X̄_j.

While KAMH can lead to improved mixing in MCMC, its proposals are still based on a random walk. Strathmann et al. (2015) constructed an algorithm that adaptively learns the gradient structure of the Markov chain history and simulates Hamiltonian dynamics using the learned gradients. This is done by fitting an infinite-dimensional exponential family model with an un-normalised density exp(⟨f, k(x, ·)⟩_H − A(f)), where ⟨f, k(x, ·)⟩_H = f(x) is the inner product between the natural parameters f and the sufficient statistics k(x, ·) in an infinite-dimensional RKHS H, and A(f) is the (intractable) log-partition function. Remarkably, it is possible to efficiently estimate f via minimising the expected L² error of ∇_x f(x) without dealing with A(f). Combining this model with a further layer of approximation based on random Fourier features (Rahimi & Recht, 2007) allows for efficient on-line updates of the emulator. The approximate estimator takes the simple form f(x) = θ^⊤ φ_x, where φ_x ∈ R^m is an embedding of x into an m-dimensional feature space, and θ ∈ R^m is estimated by θ̂ = C⁻¹b from samples x. Here b and C are simple averages of the form

b := −(1/|x|) Σ_{x∈x} Σ_{ℓ=1}^d φ̈_x^ℓ,   C := (1/|x|) Σ_{x∈x} Σ_{ℓ=1}^d φ̇_x^ℓ (φ̇_x^ℓ)^⊤,

with element-wise derivatives φ̇_x^ℓ := ∂φ_x/∂x_ℓ and φ̈_x^ℓ := ∂²φ_x/∂x_ℓ²; cf. Proposition 2 in Strathmann et al. (2015). The resulting KHMC algorithm leads to similarly substantial improvements over random walks as HMC does, but without requiring gradient information of the target.

Any form of adaptation in MCMC requires care in order to preserve ergodicity of the resulting Markov chain; some form of vanishing adaptation scheme is needed (Rosenthal, 2011; Andrieu & Thoms, 2008). For a pure MCMC implementation using the KAMH kernel, it was proposed to update the proposal family with vanishing probability (Sejdinovic et al., 2014; Strathmann et al., 2015). This, however, creates a difficult-to-quantify exploration/exploitation trade-off, and it is unclear at which point the target geometry has been estimated accurately. By using kernel proposals within SMC, we circumvent this problem, as vanishing adaptation is not necessary for the overall validity of the algorithm.

2.3. Intractable Likelihoods and Evidence Estimation

SMC remains a valid algorithm in the presence of intractable likelihoods, i.e. when likelihood values cannot be evaluated but can be estimated unbiasedly. The likelihood estimate can be obtained using an importance sampling or SMC procedure, as in Tran et al. (2013) and Chopin et al. (2013). A simple way to think about such nested estimation schemes is that one sets up an algorithm on an extended sampling space, which spans the actual parameters of interest as well as any nuisance variables necessary for estimating the likelihood. Naturally, when sampling on a larger space, the variance of estimates increases.

Intractable likelihoods usually mean that gradients of the target distribution are unavailable; consequently, efficient gradient-based sampling schemes such as MALA or HMC cannot be used. Particles are therefore moved by random walk schemes, which often fail to account for the local covariance structure of the target and scale poorly with increasing dimension. As seen above, RKHS-informed proposals are able to incorporate local covariance structure without accessing gradients. This makes them an ideal candidate for use in IS or SMC with intractable likelihoods.

An important issue in Bayesian model selection and averaging is that of estimating the normalizing constant, or evidence.

The evidence is the marginal probability of the data under a model and can easily be estimated in SMC instantiations (Doucet & Johansen, 2009; Fearnhead & Taylor, 2013). This enables routine computation of Bayes factors and posterior model probabilities while also sampling from a posterior over the parameters of each model. Under the assumption that the normalizing constant Z_0 of π_0 (the distribution used to initially set up the particle system) is known, one can estimate the ratio of normalizing constants of any two consecutive targets by

Z_t/Z_{t−1} ≈ (1/N) Σ_{j=1}^N w_{t,j}   (2)

for w_{t,j} = π_t(X_{t−1,j})/π_{t−1}(X_{t−1,j}), and thus an estimate of Z = Z_T can be found recursively by

Z = Z_T ≈ Z_0 Π_{t=1}^T (1/N) Σ_j w_{t,j},   (3)

starting with the known value Z_0. When the likelihood is intractable and importance weights are noisy, evidence estimation is still valid; for a theoretical account of evidence estimates using IS², see Lemma 3 of Tran et al. (2013).

2.4. Existing SMC Algorithms

The adaptive SMC sampler. As we saw in Section 2.1, SMC provides an approximation to a sequence of distributions (Doucet & Johansen, 2009). To sample from a static target, we define an artificial sequence of target distributions π_0, …, π_T between a distribution π_0 that is easy to sample from (for example the prior over the parameters in a Bayesian setup) and the actual distribution of interest π = π_T (e.g. the posterior), resulting in what is called an SMC sampler in the terminology of Del Moral et al. (2006). The main realization of the adaptive SMC (ASMC) sampler studied by Fearnhead & Taylor (2013) is that, before moving the particles through a Markov kernel κ_θ leaving π_t invariant, one can use the information provided by the re-weighted particle system to set the parameters θ of the Markov kernel. In particular, the global covariance Σ_t of π_t can be approximated and a scaling parameter ν² can be updated, so as to use a proposal distribution N(·|X̄_{t,i}, ν²Σ_t + γ²I) within MH to construct κ_θ for the rejuvenation step (where θ subsumes Σ_t and ν²).

Gradient IS. Orthogonally, Schuster (2015) proposed Gradient Importance Sampling (GRIS), an algorithm using a discretised Langevin diffusion with an importance sampling correction instead of the usual Metropolis-Hastings correction. For a given target distribution π̃ and a previously generated sample X, GRIS uses the proposal distribution q(X) = N(X + D(∇ log π̃(X)), s_d Σ), where Σ is fitted to the covariance of π̃ using previous samples, and the drift function D(·) and s_d are parameters of the algorithm. The target π̃ might be part of a sequence of target distributions as in SMC samplers, or the same static distribution for several iterations of importance sampling, following PMC arguments (Cappé et al., 2004). In numerical experiments, GRIS compares favorably to its closest MCMC relatives, such as the adaptive MALTA algorithm (Atchadé, 2006) and adaptive Metropolis (Haario et al., 2001).
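As a concrete illustration of Eqs. (2) and (3), the following sketch accumulates the evidence estimate in log space for numerical stability. It assumes Z_0 = 1 (e.g. a normalised prior as π_0), and the variable names are illustrative.

```python
import numpy as np

def log_evidence(log_incremental_weights):
    """Recursive evidence estimate of Eqs. (2)-(3), computed in log space.

    log_incremental_weights: one length-N array per bridge step t, holding
    log(pi_t(X_{t-1,j}) / pi_{t-1}(X_{t-1,j})). Assumes Z_0 = 1; otherwise
    add log(Z_0) to the result.
    """
    log_Z = 0.0
    for log_w in log_incremental_weights:
        # log of (1/N) * sum_j w_{t,j}, via the log-sum-exp trick
        m = np.max(log_w)
        log_Z += m + np.log(np.mean(np.exp(log_w - m)))
    return log_Z
```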

3. Kernel Adaptive Sequential Monte Carlo

We now develop the kernel sequential Monte Carlo framework. Essentially, this applies the kernel-induced density emulators to SMC samplers; first and foremost, this means fitting the emulators to weighted particles. Furthermore, we experimentally found that Rao-Blackwellization was crucial for stabilising the emulators. Once an emulator is fitted, it can be used in either of two ways: to generate proposals for MH rejuvenation steps, or as an importance density in PMC. In their full generality, SMC samplers would allow the use of these emulators both as importance densities and for MCMC rejuvenation steps, but for clarity we avoid this combination in the experiments.

3.1. Kernel Adaptive Rejuvenation

Using a weighted RKHS fit, both emulators presented in Section 2.2 can be used for the rejuvenation step of an SMC sampler. More specifically, at time step t+1 we target the distribution π_{t+1}, based on a particle system approximating π_t. After re-weighting, the new system {(w_{t+1,i}, X_{t+1,i})}_{i=1}^N is a weighted approximation to π_{t+1}. We can use this approximation to fit the emulator and generate rejuvenation proposals from it, which are then corrected by Metropolis-Hastings. For the experiments with the rejuvenation type of SMC sampler, we use the KAMH algorithm, which yields covariance matrices for Gaussian proposals that locally align with the target (Sejdinovic et al., 2014). The resulting kernel adaptive SMC sampler (KASMC) inherits KAMH's ability to explore non-linear targets more efficiently than proposals based on estimating global covariance structure, such as in Fearnhead & Taylor (2013) and Haario et al. (2001). Figure 1 shows a simple comparison of a global (ASMC) and a local (KASMC) proposal distribution. For this emulator, even a fit using the particle system after a re-sampling step is enough to improve performance over ASMC; KASMC thus exhibits a certain robustness, as re-sampling increases the variance of the fit and should generally be avoided in this context.
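For illustration, here is a minimal sketch of the proposal covariance of Eq. (1) for a Gaussian RBF kernel. The function and parameter names are ours, and the weighted-particle fit used in KASMC is abstracted away (one could, e.g., resample the weighted system first).

```python
import numpy as np

def kamh_covariance(Z, x_bar, sigma, nu2, gamma2):
    """Covariance of the KAMH proposal (Eq. 1) at particle x_bar.

    Assumes the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    Z: (n, d) array of particles the emulator is fitted to.
    """
    n, d = Z.shape
    diff = x_bar - Z                                           # (n, d)
    k = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))  # k(x_bar, Z_i)
    # rows of M are 2 * grad_x k(x, Z_i), evaluated at x = x_bar
    M = 2 * (-diff / sigma ** 2) * k[:, None]
    C = np.eye(n) - np.ones((n, n)) / n                        # centering matrix
    return gamma2 * np.eye(d) + nu2 * M.T @ C @ M

# the rejuvenation proposal at x_bar is then N(x_bar, kamh_covariance(...))
```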

Figure 1. Proposal distributions around one of many particles (blue) for each KASMC (red) and ASMC (green). KASMC proposals locally align to the target density while ASMC’s global covariance estimate might result in poor MH rejuvenation moves.

3.2. Kernel-Induced Importance Densities

Another way to use kernel-based emulators is for generating proposals that are corrected by importance sampling rather than Metropolis-Hastings. In this second approach, a kernel emulator generates samples for which importance weights are then computed. As an example of a kernelised PMC algorithm, we use the kernel gradient emulator of Strathmann et al. (2015) in its finite-dimensional approximation (KMC finite). Rather than simulating Hamiltonian dynamics to generate a proposal, we only take single gradient steps; this keeps the risk of divergence due to wrongly estimated gradients low. We arrive at a variant of Gradient IS (Schuster, 2015) without the need for closed-form gradients: Kernel Gradient IS (KGRIS). Fitting the approximate kernel gradient emulator on weighted data is straightforward, as all involved terms are simple averages. After new samples are generated and importance-weighted, the emulator is updated with this new information, and the procedure iterates.

The original PMC paper introduces re-sampling in order to deal with un-weighted instead of weighted samples (Cappé et al., 2004). While some of the subsequent papers avoid re-sampling altogether (Cappé et al., 2008; Cornuet et al., 2012), we use re-sampling as a way to obtain the locations X for our Markov-type proposals q(·|X), due to a better-behaved variance in high dimensions: with re-sampling, Monte Carlo variance grows only as O(d) instead of O(exp(d)), where d is the dimensionality (cf. Doucet & Johansen, 2009; Del Moral, 2004). We focus on the case where the same target distribution π is used at every iteration. Given N re-sampled particles from the last iteration and the updated emulator q, we draw one sample X_i' ∼ q(·|X_i) for each i and weight it with w_i = π(X_i')/q̃(X_i'). In the simplest case, q̃(X_i') = q(X_i'|X_i). While this weighting scheme was used in Cappé et al. (2004) and provides asymptotically consistent estimates, we found it to be unstable under weighted updates of the kernel-based gradient emulator of Strathmann et al. (2015). This problem is solved by using the Rao-Blackwellised importance weights w_i = π(X_i') / Σ_{j=1}^N q(X_i'|X_j), where j ranges over all particles.

3.3. SMC versus PMC for Kernel-Based Proposals

In the wider SMC community, there seems to be a consensus that using an artificial sequence of target distributions for sampling from a static target is to be preferred, based on the fact that the coverage of the final target is better in these tempering-style algorithms. It does, however, entail a considerable computational investment in those iterations where only an intermediate target is considered. We also note that on-line updates of the kernel emulator are not possible in this setting: the target changes in every iteration. The opposite is true in PMC, where the actual distribution of interest is targeted in every iteration. Here, a popular approximation technique for kernels is a good fit: by expressing the emulator model in terms of finite-dimensional random Fourier features (Rahimi & Recht, 2007), we can perform cheap on-line updates (Strathmann et al., 2015). The emulator can therefore accumulate information over all PMC iterations without the computational effort of re-computing its solution.
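A minimal sketch of one KGRIS iteration with Rao-Blackwellised weights follows. The gradient emulator is abstracted as grad_est (a fitted surrogate for ∇ log π; its training is not shown), and all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kgris_step(particles, log_target, grad_est, drift_scale, Sigma,
               rng=np.random.default_rng()):
    """One KGRIS iteration: single gradient step, then Rao-Blackwellised IS."""
    N, d = particles.shape
    # proposal q(.|X_i) = N(X_i + drift_scale * grad_est(X_i), Sigma)
    means = particles + drift_scale * np.array([grad_est(x) for x in particles])
    props = np.array([rng.multivariate_normal(m, Sigma) for m in means])
    # Rao-Blackwellised weights: w_i = pi(X_i') / sum_j q(X_i'|X_j)
    log_p = np.array([log_target(x) for x in props])
    q_mix = np.array([sum(multivariate_normal.pdf(props[i], mean=m, cov=Sigma)
                          for m in means) for i in range(N)])
    log_w = log_p - np.log(q_mix)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # resample to obtain the proposal locations for the next iteration
    return props[rng.choice(N, size=N, p=w)], props, w
```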

4. Evaluation

We begin by studying the performance of KASMC on a simple non-linear target. The second experiment considers KASMC for estimating Bayesian model evidence in a model with an intractable likelihood on a real-world dataset. The final experiment uses a challenging stochastic volatility model with S&P 500 data from Chopin et al. (2013) to evaluate KGRIS.

4.1. Improved Convergence on a Synthetic, Non-Linear Target

We first study the convergence of KASMC compared to existing algorithms on a simple benchmark example: the strongly twisted banana-shaped distribution in d = 8 dimensions used in Sejdinovic et al. (2014). This distribution is a multivariate Gaussian with a non-linearly transformed second component.

Model. Let X ∼ N(0, Σ) be a multivariate normal in d ≥ 2 dimensions, with Σ = diag(v, 1, …, 1), which undergoes the transformation X → Y, where Y_2 = X_2 + b(X_1² − v) and Y_i = X_i for i ≠ 2. We write Y ∼ B(b, v).

Figure 2. Improved convergence of all mixed moments up to order 3 of KASMC compared to using standard static or adaptive Metropolis-Hastings steps.

It is clear that E[Y] = 0 and that

B(y; b, v) = N(y_1; 0, v) N(y_2; b(y_1² − v), 1) Π_{j=3}^d N(y_j; 0, 1).
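For reference, a short sketch for sampling from B(b, v) and evaluating its unnormalised log-density, following the transformation and factorisation above (function names are ours):

```python
import numpy as np

def sample_banana(n, d=8, b=0.1, v=100.0, rng=np.random.default_rng()):
    """Draw n samples from the banana distribution B(b, v)."""
    X = rng.standard_normal((n, d))
    X[:, 0] *= np.sqrt(v)                        # first component has variance v
    Y = X.copy()
    Y[:, 1] = X[:, 1] + b * (X[:, 0] ** 2 - v)   # non-linear warp of component 2
    return Y

def log_banana(y, b=0.1, v=100.0):
    """Unnormalised log-density of B(b, v)."""
    return (-0.5 * y[0] ** 2 / v
            - 0.5 * (y[1] - b * (y[0] ** 2 - v)) ** 2
            - 0.5 * np.sum(y[2:] ** 2))
```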

Experimental details. We use v = 100 and b = 0.1, and compare SMC algorithms using different rejuvenation MH steps: a static random walk Metropolis move (RW-SMC) with the asymptotically optimal scaling ν = 2.38/√d (Roberts et al., 1997), ASMC, and KASMC using a Gaussian RBF kernel. For the latter two algorithms, all particles are used to compute the proposal, and a fixed learning rate of λ = 0.1 is chosen to adapt the scale parameters. Starting with particles from a multivariate Gaussian N(0, 50²), we use a geometric bridge that reaches the target B(y; b = 0.1, v = 100) in 20 steps. We repeat the experiment over 30 runs.

Results. Figure 2 shows that KASMC achieves faster convergence of the first three moments, i.e. in MMD distance² to a large benchmark sample.

² The Maximum Mean Discrepancy, here using a polynomial kernel of order 3, quantifies differences in all mixed moments up to order 3 between two independent sets of samples.

4.2. Evidence Estimation in Gaussian Process Classification

Following Sejdinovic et al. (2014), we consider Bayesian classification on the UCI Glass dataset, discriminating window glass from non-window glass using a Gaussian process (GP). The induced posterior was previously found to be non-linear (Strathmann et al., 2015; Sejdinovic et al., 2014). In Sejdinovic et al. (2014), samples from the posterior density over the hyper-parameters of the covariance function of the GP were simulated (with the GP latent variables integrated out). We emphasize a different point here: a major gain of SMC over MCMC is the ability to estimate the normalizing constant, or model evidence, which is straightforward within the SMC framework. Model evidence is especially useful in Bayesian model selection, as it is equivalent to the probability of the dataset under the model. Again, we evaluate KASMC on this problem.

Model. Consider the joint distribution of latent variables f, labels y (with covariate matrix X), and hyper-parameters θ, given by


p(f, y, θ) = p(θ) p(f|θ) p(y|f),

where f|θ ∼ N(0, K_θ), with K_θ modelling the covariance between latent variables evaluated at the input covariates: (K_θ)_{ij} = exp(−(1/2) Σ_{d=1}^D (x_{i,d} − x_{j,d})²/ℓ_d²), with θ_d = log ℓ_d².

Figure 3. Estimating the model evidence of a GP using the IS² framework. The plot shows the MC variance over 50 runs as a function of the size of the particle system.

Consider the binary logistic classifier, i.e. p(y_i|f_i) = 1/(1 + exp(−y_i f_i)) with y_i ∈ {−1, 1}. In order to perform Bayesian model selection (i.e. comparing different covariance functions), we need to estimate the model evidence of the marginal posterior over the hyper-parameters, that is

Z = ∫ p(y|θ) p(θ) dθ,

where the marginal likelihood p(y|θ) is intractable for non-Gaussian likelihoods p(y|f). In the MCMC context, pseudo-marginal MCMC methods (Andrieu & Roberts, 2009) enable exact inference by replacing p(y|θ) with the unbiased estimate

p̂(y|θ) := (1/n_imp) Σ_{i=1}^{n_imp} p(y|f^(i)) p(f^(i)|θ) / q(f^(i)|θ),   (4)

where {f^(i)}_{i=1}^{n_imp} ∼ q(f|θ) are n_imp importance samples. In Sejdinovic et al. (2014, Section 5.2), the importance distribution q(f|θ) is chosen as the Laplace approximation of p(f|y, θ) ∝ p(y|f) p(f|θ).

Experimental details. The setting naturally fits into the IS² framework described in Section 2.3. Using the IS² framework, we estimate the model evidence for the GP classifier described above, equipped with a standard Gaussian Automatic Relevance Determination (ARD) covariance kernel. We do not tune the number of 'inner' importance samples, but follow Sejdinovic et al. (2014) and use n_imp = 100. The ground-truth model evidence was established by running 20 SMC instances using N = 1000 particles and a bridge length of 30, and averaging their evidence estimates.

The actual experiment is performed using N = 100 particles and a bridge length of 20, which we repeat for 50 runs. All experiments use a bridge starting from the prior on the log hyper-parameters, p(θ_d) = N(0, 5²). The learning rate is set to the constant λ_t = 1, and adaptation targets a 0.23 acceptance rate; the kernel parameter of KASMC is tuned on-line using the median heuristic.

Results. Figure 3 shows that the estimates of KASMC exhibit less variance than those of ASMC. The bias was slightly improved, but comparable.

4.3. Stochastic Volatility Model with Intractable Likelihood

Stochastic volatility models are a particularly challenging class of Bayesian inverse problems. As time-series models, they often involve high-dimensional nuisance variables, which usually cannot be integrated out analytically. Furthermore, risk management necessitates accounting for parameter and model uncertainty, and models have to capture the non-linearities in the data (Barndorff-Nielsen & Shephard, 2001). We concentrate here on the prediction of the daily volatility of asset prices, reusing the model and dataset studied by Chopin et al. (2013) to evaluate KGRIS.

Model. Let s_t be the value of some financial asset on day t; then y_t = 10^{5/2} log(s_t/s_{t−1}) are called the log-returns (upscaled for numerical reasons). We model the observed log-return y_t as dependent on a latent v_t through the observation equation

y_t = μ + β v_t + √v_t ε_t   for t ≥ 1,

where ε_t is a sequence of i.i.d. standard Gaussian errors and v_t is assumed to be a stationary stochastic process known as the actual volatility.

Chopin et al. (2013) develop a hierarchical model for v_t based on the idea of analytically integrating a continuous-time volatility model over daily intervals (for details see Chopin et al., 2013; Barndorff-Nielsen & Shephard, 2001). Using this construction, the (discrete-time) v_t is parameterised by the stationary mean ξ and variance ω² of the so-called spot volatility, and by the exponential decay λ of its auto-correlation. This results in the following model for the actual volatility v_t:

k ∼ Pois(λξ²/ω²),   c_{1:k} ∼ U(t, t + 1),   e_{1:k} ∼ Exp(ξ/ω²),

z_{t+1} = z_t exp(−λ) + Σ_{j=1}^k e_j exp(−λ(t + 1 − c_j)),

v_{t+1} = λ⁻¹ (z_t − z_{t+1} + Σ_{j=1}^k e_j),   x_{t+1} = (v_{t+1}, z_{t+1})^⊤,

where z_t is the discretely sampled spot volatility process and (v_{t+1}, z_{t+1})^⊤ is the Markovian representation of the state process. The variables k, c_{1:k}, and e_{1:k} are generated independently for each time period, and for k = 0 the set 1:k is defined to be empty. The dynamics imply a Γ(ξ²/ω², ξ/ω²) stationary distribution for z_t, which is also used as the initial distribution of z_0. The parameters of the model are subsumed under θ = (μ, β, ξ, ω², λ). We use the same vague priors μ ∼ N(0, σ² = 2), β ∼ N(0, σ² = 2), ξ ∼ Exp(0.2), ω² ∼ Exp(0.2), λ ∼ Exp(1). Chopin et al. (2013) developed a sampler for θ based on iterated batch importance sampling, using nested SMC with pseudo-marginal MCMC moves to integrate out the x_t, and dubbed their approach SMC².

Experimental details. In our experiment, we use KGRIS proposals for θ in a Population Monte Carlo setting, i.e. without resorting to MCMC moves at all. We reuse the code developed for the original SMC² paper to integrate out the x_t and thus obtain likelihood estimates, with the same settings for the algorithm parameters. The observed s_t are the 753 observations from consecutive days of the S&P 500 index also used by Chopin et al. (2013). KGRIS uses particle systems of increasing sizes, with each particle going through 100 iterations. See Figure 6 in the Appendix for a plot of the pair-wise marginals of this posterior. Due to the lack of analytically available gradients, we compare two gradient-free PMC versions: KGRIS, using the gradient emulator with the random feature approximation of Strathmann et al. (2015), and a random-walk PMC with global covariance adaptation in the style of Haario et al. (2001).

Results. Figure 4 shows that the incorporated gradients lead to better performance of KGRIS in estimating the target's covariance matrix. This is in line with the finding that GRIS improves over pure random walk methods (Schuster, 2015).

Figure 4. Convergence of RMSE for estimating all elements of the posterior covariance matrix of the stochastic volatility model (adaptive Metropolis PMC vs. KGRIS, as a function of population size).
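To make the latent dynamics concrete, here is a minimal sketch of one transition of the volatility process defined above; the parameter and function names are ours (lam = λ, omega2 = ω²).

```python
import numpy as np

def volatility_step(z_t, lam, xi, omega2, rng=np.random.default_rng()):
    """Simulate x_{t+1} = (v_{t+1}, z_{t+1}) of the actual-volatility model."""
    k = rng.poisson(lam * xi ** 2 / omega2)
    c = rng.uniform(0.0, 1.0, size=k)               # jump times within (t, t+1)
    e = rng.exponential(scale=omega2 / xi, size=k)  # Exp(xi / omega^2) jump sizes
    z_next = z_t * np.exp(-lam) + np.sum(e * np.exp(-lam * (1.0 - c)))
    v_next = (z_t - z_next + np.sum(e)) / lam       # integrated (actual) volatility
    return v_next, z_next
```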

5. Conclusion

In this paper, we developed a framework for kernel sequential Monte Carlo samplers. Kernel emulators can be used both for MCMC rejuvenation steps in tempered samplers and for inducing importance densities. Of the evaluated variants of kernel SMC, KASMC exploits the local covariance structure of the target through positive definite kernel functions, using RKHS-informed MH moves inside SMC; KGRIS uses gradient emulators to construct a gradient-free version of GRIS. Kernel SMC combines the advantages of SMC for multi-modal targets and for estimating model evidence with RKHS-based information on the target distribution. It is especially attractive in the case where likelihoods are intractable but can be unbiasedly estimated, fitting naturally into the existing IS² and SMC² frameworks.

We evaluated KASMC on the synthetic strongly twisted banana target and on a problem of evidence estimation for a GP classification model. On the banana problem, the ability of KASMC to adapt to non-linear support gave it a clear edge over the ASMC algorithm, which adapts to global rather than local covariance structure. In the GP classification problem, KASMC can be used for evidence approximation in a straightforward way, unlike MCMC. The problem is particularly challenging since it includes an intractable likelihood, which itself has to be estimated through importance sampling. KASMC again performed slightly better than ASMC on this non-linear posterior over the hyper-parameters of the GP. The most challenging problem, a stochastic volatility model, was used to evaluate KGRIS. This model accounts for long- and short-term effects, and 753 nuisance variables had to be integrated out using a nested SMC algorithm. KGRIS compared favourably to a similar algorithm without gradient estimates.


We found that when using a sequence of distributions leading up to the actual target (the SMC case), computationally cheap approximations of kernel emulators make little sense, as they cannot be updated on-line. Furthermore, especially for the gradient emulator, high variance can be a problem; a weighted and Rao-Blackwellised fit is vital for its successful application. A very natural extension of this work is to consider the gradient emulator for importance sampling using Hamiltonian dynamics, an idea proposed by Neal (2011) and Naesseth & Lindsten (2015).

References

Andrieu, C. and Roberts, G. O. The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725, 2009.
Andrieu, C. and Thoms, J. A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373, 2008.
Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, pp. 5–43, 2003.
Atchadé, Y. F. An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8:235–254, 2006.
Barndorff-Nielsen, O. E. and Shephard, N. Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society: Series B, 63(2):167–241, 2001.
Cappé, O., Guillin, A., Marin, J.-M., and Robert, C. P. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.
Cappé, O., Douc, R., Guillin, A., Marin, J.-M., and Robert, C. P. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.
Chopin, N. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.
Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. SMC²: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.
Cornuet, J.-M., Marin, J.-M., Mira, A., and Robert, C. P. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.
Del Moral, P. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer, 2004.
Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
Douc, R., Guillin, A., Marin, J.-M., and Robert, C. P. Convergence of adaptive mixtures of importance sampling schemes. Annals of Statistics, 35(1):420–448, 2007.
Doucet, A., de Freitas, N., and Gordon, N. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice. Springer, 2001a.
Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer, 2001b.
Doucet, A. and Johansen, A. M. A tutorial on particle filtering and smoothing: fifteen years later. In Handbook of Nonlinear Filtering, 2009.
Fearnhead, P. and Taylor, B. M. An adaptive sequential Monte Carlo sampler. Bayesian Analysis, (2):411–438, 2013.
Haario, H., Saksman, E., and Tamminen, J. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
Ihler, A. T., Fisher, J. W., Moses, R. L., and Willsky, A. S. Nonparametric belief propagation for self-localization of sensor networks. IEEE Journal on Selected Areas in Communications, 23(4):809–819, 2005.
Lan, S., Streets, J., and Shahbaba, B. Wormhole Hamiltonian Monte Carlo. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
Naesseth, C. A. and Lindsten, F. Importance sampling with Hamiltonian dynamics. NIPS 2015 Workshop on Scalable Monte Carlo Methods, 2015.
Neal, R. M. Annealed importance sampling. Technical report, University of Toronto, 1998.
Neal, R. M. New Monte Carlo methods based on Hamiltonian dynamics. In MaxEnt Workshop, 2011.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Neural Information Processing Systems, pp. 1–8, 2007.
Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer, second edition, 2004.
Roberts, G. O., Gelman, A., and Gilks, W. R. Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7(1):110–120, 1997.
Rosenthal, J. S. Optimal proposal distributions and adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pp. 93–112. Chapman & Hall, 2011.
Schäfer, C. and Chopin, N. Sequential Monte Carlo on large binary sampling spaces. Statistics and Computing, 23(2):163–184, 2013.
Schuster, I. Gradient importance sampling. arXiv preprint arXiv:1507.05781, 2015.
Sejdinovic, D., Strathmann, H., Lomeli, M. G., Andrieu, C., and Gretton, A. Kernel adaptive Metropolis-Hastings. In International Conference on Machine Learning (ICML), pp. 1665–1673, 2014.
Smola, A., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, volume 4754 of Lecture Notes in Computer Science, pp. 13–31. Springer, 2007.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS, 2015.
Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. Importance sampling squared for Bayesian inference in latent variable models. arXiv preprint arXiv:1309.3339, 2013.


Appendices

A. Implementation Details

In this section, we cover a number of implementation details for using KASMC in practice, such as optimal scaling, adaptive re-sampling, and re-weighting between iterations.

A.1. Scaling Parameters

Similar to other MH proposals, KAMH has a free scaling parameter, denoted ν², which we would like to adapt after each SMC iteration. To accomplish this tuning, we use the standard framework of stochastic approximation for tuning MCMC kernels (Andrieu & Thoms, 2008), i.e. tuning the acceptance rate α_t towards the asymptotically optimal acceptance rate α_opt = 0.234 for random walk proposals (Rosenthal, 2011). More precisely, after the MCMC rejuvenation step, a Rao-Blackwellised estimate α̂_t of the expected acceptance probability is available by simply averaging the acceptance probabilities of all MH proposals. Then, set

ν²_{t+1} = ν²_t + λ_t (α̂_t − α_opt)   (5)

for some non-increasing sequence λ_1, …, λ_T. This strategy of approximating the optimal scaling assumes that consecutive targets are close enough that the acceptance rate when using ν²_t to target π_t provides information about the expected acceptance rate when using ν²_t with target π_{t+1}. As an alternative, one could treat ν²_t as an auxiliary random variable and define a distribution over it designed to maximize expected utility, the approach taken in the adaptive SMC sampler (Fearnhead & Taylor, 2013).

A.2. Construction of a Target Sequence

One possibility for constructing a sequence of distributions is the geometric bridge defined by π_t ∝ π_0^{1−ρ_t} π^{ρ_t} for some initial distribution π_0, where (ρ_t)_{t=1}^T is an increasing sequence satisfying ρ_T = 1. This is the construction used in the experimental section. Another construction is the mixture π_t ∝ (1 − ρ_t)π_0 + ρ_t π. When π is a Bayesian posterior, one can also add more data with increasing t, e.g. by defining the intermediate distributions as π_t(X) = π(X|d_1, …, d_⌊ρ_t D⌋), where d_j is a data point and D is the number of data points. This results in an online inference algorithm called Iterated Batch Importance Sampling (IBIS) (Chopin, 2002). In IBIS especially, we can apply non-diminishing adaptation, unlike in adaptive MCMC. When using a distribution sequence that computes the posterior density π on the full dataset (such as the geometric bridge or the mixture sequence), one can reuse the intermediate samples for posterior estimation: as the value of π is computed anyway, we reuse the weight π(X)/π_{t−1}(X) for posterior estimation while employing π_t(X)/π_{t−1}(X) to inform the proposal distributions at iteration t. This way, the evaluation of π (which is typically costly) is put to good use for improving the posterior estimate.

A.3. Re-weighting and Adaptive Re-sampling

Our algorithm returns the weighted approximation to the final target because this approximation has lower variance than the re-sampled particle system (Doucet & Johansen, 2009). For the same reason, in practice re-sampling might not be performed at every iteration; rather, re-sampling only when the effective sample size (ESS) for the current target falls below a certain threshold will decrease Monte Carlo variance. For details, we refer to reviews of SMC (Doucet & Johansen, 2009; Doucet et al., 2001a). Furthermore, care should be taken with the implementation of re-weighting: caching values between iterations saves much computation time.
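Summarising Sections A.1 and A.2, the following sketch shows the geometric bridge in log space and the stochastic-approximation scale update of Eq. (5); the names are illustrative.

```python
import numpy as np

def bridge_logpdf(log_pi0, log_pi, rho):
    """Geometric bridge pi_t proportional to pi_0^(1-rho) * pi^rho, in log space."""
    return lambda x: (1.0 - rho) * log_pi0(x) + rho * log_pi(x)

def update_scale(nu2, acc_probs, lam_t, alpha_opt=0.234):
    """Eq. (5): stochastic-approximation update of the scaling parameter nu^2.

    acc_probs: acceptance probabilities of all MH rejuvenation proposals;
    their average is the Rao-Blackwellised acceptance-rate estimate.
    """
    return nu2 + lam_t * (np.mean(acc_probs) - alpha_opt)
```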

B. Additional Experiments

B.1. A Multi-modal, Non-Linear Application: Sensor Network Localisation

We next study the performance of KASMC on a multi-modal target arising in a real-world application: inferring the locations of S sensors within a network, as discussed in Ihler et al. (2005) and Lan et al. (2014). While a typical application in robotics would involve locations changing in time, we here focus on the static case: assume a number of stationary sensors that measure distances to each other in a 2-dimensional space. A distance measurement is successful with a probability that decays exponentially in the squared distance; otherwise the observation is missing. If a distance is measured, it is corrupted by Gaussian noise. The posterior over the unknown sensor locations forms an extremely constrained, non-linear, and multi-modal distribution induced by the spatial set-up.

Model. Assume S sensors with unknown locations {x_i}_{i=1}^S ⊆ R². Define an indicator variable Z_{i,j} ∈ {0, 1} for the distance Y_{i,j} ∈ R₊ between a pair of sensors (x_i, x_j) being either observed (Z_{i,j} = 1) or not (Z_{i,j} = 0), according to

Z_{i,j} ∼ Binom(1, exp(−‖x_i − x_j‖²₂ / (2R²))).

If the distance is observed, then Y_{i,j} is corrupted by Gaussian noise, i.e.

Y_{i,j} | Z_{i,j} = 1 ∼ N(‖x_i − x_j‖, σ²),

and Y_{i,j} = 0 otherwise. Previously, Ihler et al. (2005) focussed on MAP estimation of the sensor locations, and Lan et al. (2014) focussed on a well-conditioned case (S = 8 sensors and B = 3 base sensors with known locations) that results in almost no ambiguity in the posterior. We argue that Bayesian quantification of uncertainty is more important in cases where noise and missing measurements do not allow reconstructing the sensor locations exactly. We therefore reuse the dataset from Lan et al. (2014) (R = 0.3, σ² = 0.02)³, but only use the first S = 3 locations/observations. In order to encourage ambiguities in the localisation task, we only use the first 2 base sensors of Lan et al. (2014) with known locations, which each observe distances to the S unknown sensors but not to each other. Unlike Lan et al. (2014), we use a Gaussian prior N(0.5, I) to avoid the posterior being situated in a bounded domain.

Experimental set-up. We run KASMC using 10000 particles and a bridge length of 50, and MCMC-KAMH for 50 · 10000 iterations, of which we discard half as burn-in. Both KASMC and KAMH were initialised with sample(s) from the prior. Tuning parameters ν² are set using a diminishing adaptation schedule λ_t = 1/√t for KAMH and a fixed learning rate λ_t = 1 for KASMC. We use a Gaussian RBF kernel with a bandwidth set using the median heuristic⁴. In order to compare ASMC to KASMC, we created a benchmark sample via running 100 standard MCMC chains (randomly initialised to cover all modes), each for 50000 iterations, discarding half the samples as burn-in and randomly down-sampling to a size of 100. We then compute the empirical MMD distance to the output of the individual algorithms, averaged over 10 runs.

Results. Figure 5 shows the marginalised posterior for one run each of KASMC (SMC) and KAMH (MCMC), where we matched the number of likelihood evaluations (500000). MCMC is not able to traverse between the multiple modes and interpretations of the data, in contrast to SMC. Figure 5 furthermore reveals the non-linear structure of the posterior. For the chosen number of sensors, ASMC and KASMC perform similarly. With fewer sensors, i.e. more ambiguity, KASMC produces samples with both smaller MMD distance from a benchmark sample and smaller variance. For example, for a set-up with S = 2 and 1000 particles, we get an MMD distance to a benchmark sample of 0.76 ± 0.4 for KASMC and 0.94 ± 0.7 for ASMC.

³ Downloaded from http://www.ics.uci.edu/~slan/lanzi/CODES_files/WHMC-code.zip on 8/Oct/2015.
⁴ The median heuristic for Gaussian RBF kernels chooses the bandwidth to match the median pairwise distance between samples.
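The following is a sketch of the unnormalised log-posterior under the model above; the layout of the observation matrices and the handling of pairs that cannot observe each other (e.g. base-base pairs) are simplifying assumptions of ours.

```python
import numpy as np

def sensors_log_posterior(x, Y, Z, base, R=0.3, sigma2=0.02):
    """Unnormalised log posterior of the S unknown sensor locations.

    x: (S, 2) candidate locations; base: (B, 2) known locations.
    Y, Z: distance and indicator matrices over all S + B sensors
    (assumed layout: unknown sensors first; only observing pairs included).
    """
    locs = np.vstack([x, base])
    n = locs.shape[0]
    lp = -0.5 * np.sum((x - 0.5) ** 2)            # N(0.5, I) prior
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((locs[i] - locs[j]) ** 2)
            p_obs = np.exp(-d2 / (2 * R ** 2))    # detection probability
            if Z[i, j] == 1:                      # observed, noisy distance
                lp += np.log(p_obs) - 0.5 * (Y[i, j] - np.sqrt(d2)) ** 2 / sigma2
            else:                                 # missing observation
                lp += np.log1p(-p_obs)
    return lp
```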


Figure 5. Posterior samples of unknown sensor locations (in color) by kernel-based SMC and MCMC on the sensors dataset. The set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in the posterior. SMC recovers all modes while MCMC does not. The posterior has a clear non-linear structure.

Figure 6. Samples from the doubly stochastic volatility model used in Section 4.3.