An intro to particle methods for parameter inference in state-space models
PhD course FMS020F–NAMS002 “Statistical inference for partially observed stochastic processes”, Lund University
http://goo.gl/sX8vU9
Umberto Picchini Centre for Mathematical Sciences, Lund University
This lecture will explore possibilities offered by particle filters, also known as Sequential Monte Carlo (SMC) methods, when applied to parameter estimation problems. We will not give a thorough introduction to SMC methods. Instead, we will introduce the most basic (and most popular) SMC method, with the main goal of performing inference for θ, a vector of unknown parameters entering a state-space / Hidden Markov model.

Results anticipation
Thanks to recent and astonishing results, we show how it is possible to construct a practical method for exact Bayesian inference on the parameters of a state-space model.
Acknowledgements (early enough, not to forget...)
Part of the presented material, slides, MATLAB codes etc. has been kindly shared by: Magnus Wiktorsson (Matematikcentrum, Lund); Fredrik Lindsten (Automatic Control, Linköping).
Working framework
We consider hereafter the (important) case where data are modelled according to a Hidden Markov Model (HMM), often denoted state-space model (SSM). (In recent literature the terms SSM and HMM have been used interchangeably. We do the same here, though elsewhere it is sometimes assumed that HMM → discrete state space and SSM → continuous state space.)
A state-space model refers to a class of probabilistic graphical models that describes the probabilistic dependence between latent state variables and the observed measurements of a dynamical system.
Our system of interest is {Xt}t⩾0. This is a latent/unobservable stochastic process. (Unavailable) values of the system on a discrete time grid {0, 1, ..., T} are X0:T = {X0, ..., XT} (we consider integer times t for ease of notation).
Actual observations (data) are noisy: y1:T = {y1, ..., yT}. That is, at time t each yt is a noisy/corrupted version of the true state of the system Xt.
{Xt}t⩾0 is assumed to be a Markovian stochastic process. Observations yt are assumed conditionally independent given the corresponding latent state Xt. We are going to formalize this on the next slide.
[email protected])
SSMs as “graphical models”
Graphically:

[Diagram: a Markov chain of latent states ... → Xt−1 → Xt → Xt+1 → ..., with each observation Yt−1, Yt, Yt+1 depending only on the corresponding hidden state.]

Yt | Xt = xt ∼ p(yt|xt)   (observation density)
Xt+1 | Xt = xt ∼ p(xt+1|xt)   (transition density)
X0 ∼ π(x0)   (initial distribution)
Example: Gaussian random walk
A most trivial example (linear, Gaussian SSM):

xt = xt−1 + ξt,   ξt ∼iid N(0, q²)
yt = xt + et,   et ∼iid N(0, r²)

Therefore

p(xt|xt−1) = N(xt; xt−1, q²)
p(yt|xt) = N(yt; xt, r²)

Notation: here and in the following, N(µ, σ²) is the Gaussian distribution with mean µ and variance σ². N(x; µ, σ²) is the evaluation at x of the pdf of a N(µ, σ²) distribution.
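To make this concrete, here is a minimal MATLAB sketch that simulates the model forward. The values of T, q, r and x0 are illustrative choices of mine, not values from the slides.

% Minimal sketch: simulate the Gaussian random walk SSM.
% T, q, r, x0 are illustrative choices (q and r are standard deviations).
T = 100; q = 0.5; r = 1; x0 = 0;
x = zeros(T,1); y = zeros(T,1);
xold = x0;
for t = 1:T
    x(t) = xold + q*randn;    % x_t | x_{t-1} ~ N(x_{t-1}, q^2)
    y(t) = x(t) + r*randn;    % y_t | x_t     ~ N(x_t, r^2)
    xold = x(t);
end
plot(1:T, x, '-', 1:T, y, 'o'); legend('latent x','data y')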
[email protected])
Notation reminder
Always remember: X0 is NOT the state at the first observational time; that one is X1. Instead, X0 is the (typically unknown) initial state of the process {Xt}. X0 can be set to a deterministic constant (as in the example we are going to discuss soon).
Markov property of hidden states

Briefly: Xt (and actually the whole future Xt+1, Xt+2, ...) given Xt−1 is independent of anything that happened before time t − 1:

p(xt|x0:t−1, y1:t−1) = p(xt|xt−1),   t = 1, ..., T

Also, the past is independent of the future given the present:

p(xt−1|xt:T, yt:T) = p(xt−1|xt)
Conditional independence of measurements
The current measurement yt, given xt, is conditionally independent of the measurement and state histories:

p(yt|x0:t, y1:t−1) = p(yt|xt)
The Markov property on {Xt } and the conditional independence on {Yt } are the key ingredients for defining an HMM or SSM.
In general {Xt } and {Yt } can be either continuous– or discrete–valued stochastic processes. However in the following we assume {Xt } and {Yt } to be defined on continuous spaces.
Our goal
In all previous slides we haven’t made explicit reference to the presence of unknown constants that we wish to estimate. For example, in the Gaussian random walk example the unknowns would be the variances (q², r²) (and perhaps the initial state x0, in some situations). In that case we would like to estimate θ = (q², r²) using observations y1:T. More generally...

Main goal
We introduce general methods for SSMs producing inference for the vector of parameters θ. We will be particularly interested in Bayesian inference.

Of course θ can contain all sorts of unknowns, not just variances.
[email protected])
A quick look into the final goal
p(y1:T|θ) is the likelihood of the measurements conditionally on θ. π(θ) is the prior density of θ (we always assume continuous-valued parameters). It encloses knowledge about θ before we “see” our current data y1:T. Bayes’ theorem gives the posterior distribution:

π(θ|y1:T) = p(y1:T|θ)π(θ) / p(y1:T) ∝ p(y1:T|θ)π(θ)

Inference based on π(θ|y1:T) is called Bayesian inference. p(y1:T) is the marginal likelihood (evidence), independent of θ.
Goal: calculate (sample draws from) π(θ|y1:T).
Remark: θ is a random quantity in the Bayesian framework.
[email protected])
As you know... (remember we set some background requirements for taking this course)
Bayesian inference can rarely be performed without using some Monte Carlo sampling. Except for simple cases (e.g. data y1, ..., yT independently distributed from some member of the exponential family), it is usually impossible to write the likelihood p(y1:T|θ) in closed form.
Since the early ’90s, MCMC (Markov chain Monte Carlo) has opened the possibility to perform Bayesian inference in practice.
Are we obsessed with Bayesian inference? NO! However, it sometimes offers the easiest way to deal with complex, non-trivial models. The Bayesian approach actually opens the possibility for a surprising result (see later on...).
[email protected])
The likelihood function for SSMs
First of all, notice that in the Bayesian framework, since θ is random, we do not simply write the likelihood function as pθ(y1:T) nor p(y1:T; θ), but we must condition on θ and write p(y1:T|θ).
In a SSM data are not independent, they are only conditionally independent → complication!:

p(y1:T|θ) = p(y1|θ) ∏_{t=2}^{T} p(yt|y1:t−1, θ) = ?

We don’t have a closed-form expression for the product above because we do not know how to calculate p(yt|y1:t−1, θ). Let’s see why.
[email protected])
In a SSM the observed process is assumed to depend on the latent Markov process {Xt}: we can write

p(y1:T|θ) = ∫ p(y1:T, x0:T|θ) dx0:T = ∫ p(y1:T|x0:T, θ) × p(x0:T|θ) dx0:T
          = ∫ ∏_{t=1}^{T} p(yt|xt, θ) × p(x0|θ) ∏_{t=1}^{T} p(xt|xt−1, θ) dx0:T

where the first product uses conditional independence and the second uses Markovianity.

Problems
The expression above is a (T + 1)-dimensional integral ☹
For most (nontrivial) models, transition densities p(xt|xt−1) are unknown ☹
[email protected])
Special cases
Despite the analytic difficulties, finding approximations for the likelihood function is possible (we’ll consider some approaches soon).
In some simple cases, closed-form solutions do exist: for example, when the SSM is linear and Gaussian (see the Gaussian random walk example) the classic Kalman filter gives the exact likelihood.
In the SSM literature, important (Gaussian) approximations are given by the extended and unscented Kalman filters. However, approximations offered by particle filters (a.k.a. sequential Monte Carlo) are presently the state of the art for general non-linear non-Gaussian SSMs.
[email protected])
What if we had p(y1:T|θ)...
Ideally, if we had p(y1:T|θ) we could use Metropolis-Hastings (MH) to sample from the posterior π(θ|y1:T). At iteration r + 1 we have:
1. The current value is θr. Propose a new θ∗ ∼ q(θ∗|θr), e.g. θ∗ ∼ N(θr, Σθ) for some covariance matrix Σθ.
2. Draw u ∼ U(0, 1) and accept θ∗ if u < min(1, A), where

A = [p(y1:T|θ∗) / p(y1:T|θr)] × [π(θ∗) / π(θr)] × [q(θr|θ∗) / q(θ∗|θr)]

Then set θr+1 := θ∗ and p(y1:T|θr+1) := p(y1:T|θ∗). Otherwise, reject θ∗ and set θr+1 := θr. Set r := r + 1, go to 1 and repeat.
By repeating steps 1-2 as many times as wanted, we are guaranteed that, after discarding a “long enough” number of iterations (burn-in), the remaining draws form a Markov chain (hence dependent values) having π(θ|y1:T) as their stationary distribution.
Therefore, if you have produced R = R1 + R2 iterations of Metropolis-Hastings, where R1 is a sufficiently long burn-in, for scalar θ you can then plot the histogram of the last R2 draws θR1+1, ..., θR. Such a histogram approximates the density π(θ|y1:T), up to a Monte Carlo error induced by using a finite R2.
For a vector-valued θ ∈ Rᵖ, create p separate histograms of the draws pertaining to each component of θ. Such histograms represent the posterior marginals π(θj|y), j = 1, ..., p.
Brush up MH with a simple example
A simple toy problem not related to state-space modelling.
Data: n = 1,000 observations, i.i.d. yj ∼ N(3, 1). Now assume µ = E(yj) unknown. Estimate µ.
Assume for example a “wide” prior: µ ∼ N(2, 2²) (do not look at data!).
Gaussian random walk to propose µ∗ = µr + 0.1 · ξ, with ξ ∼ N(0, 1). Notice Gaussian proposals are symmetric, hence q(µ∗|µr) = q(µr|µ∗), therefore

A = [p(y1:n|µ∗) / p(y1:n|µr)] × [π(µ∗) / π(µr)]
Start far away, at µ = 10. Produce 10,000 iterations.
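A compact MATLAB sketch of this random-walk Metropolis run (my own toolbox-free implementation; the log-posterior is coded only up to additive constants, which cancel in the acceptance ratio):

% Random-walk Metropolis for the toy example: y_j ~iid N(mu,1), prior mu ~ N(2,2^2).
n = 1000; y = 3 + randn(n,1);                 % simulated data, true mu = 3
logpost = @(mu) -0.5*sum((y-mu).^2) - 0.5*((mu-2)/2)^2;  % log-lik + log-prior, up to constants
R = 10000; mu = zeros(R,1); mu(1) = 10;       % start far away
lp = logpost(mu(1));
for r = 2:R
    mustar = mu(r-1) + 0.1*randn;             % symmetric Gaussian proposal
    lpstar = logpost(mustar);
    if log(rand) < lpstar - lp                % accept with probability min(1,A)
        mu(r) = mustar; lp = lpstar;
    else
        mu(r) = mu(r-1);
    end
end
histogram(mu(501:end))                        % posterior after a 500-draw burn-in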
[Figure: left, trace plot of µ over the 10,000 MCMC iterations; right, posterior π(µ|y) after burn-in, with the prior overlaid. True value µ = 3.]
The posterior on the right plot discards the initial 500 draws (burn-in).
Remark
Notice Metropolis-Hastings is not an optimization algorithm. Unlike in maximum likelihood estimation, here we are not trying to converge towards some mode. What we want is to explore thoroughly (i.e. sample from) π(θ|data), including its tails.
This way we can directly assess the uncertainty about θ by looking at π(θ|data), instead of having to resort to asymptotic arguments regarding the sampling distribution of θ̂n when n → ∞ as in ML theory.
Example: a nonlinear SSM

xt = 0.5 xt−1 + 25 xt−1/(1 + xt−1²) + 8 cos(1.2(t − 1)) + vt
yt = 0.05 xt² + et

with a deterministic x0 = 0, measurements at times t = 1, 2, ..., 100, and vt ∼ N(0, q), et ∼ N(0, r), all independent. Data generated with q = 0.1 and r = 1.

[Figure: simulated latent states X and observations Y against time t = 0, ..., 100.]
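For reference, a short MATLAB sketch generating data from this model (here q and r enter as variances, as in the slide):

% Simulate the nonlinear SSM with q = 0.1, r = 1 (variances) and x0 = 0.
T = 100; q = 0.1; r = 1;
x = zeros(T,1); y = zeros(T,1); xold = 0;     % deterministic x0 = 0
for t = 1:T
    x(t) = 0.5*xold + 25*xold/(1+xold^2) + 8*cos(1.2*(t-1)) + sqrt(q)*randn;
    y(t) = 0.05*x(t)^2 + sqrt(r)*randn;
    xold = x(t);
end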
Priors: q ∼ InvGamma(0.01, 0.01), r ∼ InvGamma(0.01, 0.01).

[Figure: prior density InvGamma(0.01, 0.01) plotted on (0, 2].]

Recall true values are q = 0.1 and r = 1. The chosen priors cover both.
Results anticipation...
Perform an MCMC (how about the likelihood? we’ll see this later). R = 10,000 iterations (generous burn-in = 3,000 iterations). Proposals: qnew ∼ N(qold, 0.2²), and similarly for rnew. Start far away: initial values qinit = 1, rinit = 0.1.

[Figure: trace plots of q and r over the 10,000 iterations (acceptance rate ≈ 11%) and empirical posterior densities p(q|y1:T) and p(r|y1:T) after burn-in.]
How do we deal with general SSMs for which the exact likelihood p(y1:T|θ) is unavailable? That’s the case of the previous example. For example, the previously analysed model is nonlinear in xt and therefore the exact Kalman filter can’t be used.
General Monte Carlo integration
Recall our likelihood function:

p(y1:T|θ) = p(y1|θ) ∏_{t=2}^{T} p(yt|y1:t−1, θ)

We can write p(yt|y1:t−1, θ) as

p(yt|y1:t−1, θ) = ∫ p(yt|xt, θ) p(xt|y1:t−1, θ) dxt = E(p(yt|xt, θ))

Use Monte Carlo integration: generate N draws from p(xt|y1:t−1, θ), then invoke the law of large numbers:
produce N independent draws xtⁱ ∼ p(xt|y1:t−1, θ), i = 1, ..., N;
for each xtⁱ compute p(yt|xtⁱ, θ); by the LLN we have

(1/N) ∑_{i=1}^{N} p(yt|xtⁱ, θ) → E(p(yt|xt, θ)),   N → ∞

and the error term is O(N^{−1/2}) regardless of the dimension of xt.
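As a side note, a tiny MATLAB illustration of plain Monte Carlo integration (my own toy example, not from the slides): approximate E(g(X)) for X ∼ N(0, 1) with g(x) = exp(x), whose true value is exp(1/2); the error indeed shrinks roughly as N^{−1/2}.

% LLN in miniature: the sample mean converges to the expectation.
g = @(x) exp(x);                              % E(g(X)) = exp(1/2) for X ~ N(0,1)
for N = [1e2 1e4 1e6]
    est = mean(g(randn(N,1)));
    fprintf('N = %7d  estimate = %.4f  abs. error = %.4f\n', N, est, abs(est-exp(0.5)));
end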
SMC and the bootstrap filter
But how to generate “good” draws (particles) xtⁱ ∼ p(xt|y1:t−1, θ)? Here “good” means that we want particles such that the values of p(yt|xtⁱ, θ) are not negligible (they “explain” a large fraction of the integrand).
For SSMs, sequential Monte Carlo (SMC) is the winning strategy.
We will NOT give a thorough introduction to SMC methods. We only use a few notions to solve our parameter inference problem.
Importance sampling
Let’s remove for a moment the dependence on θ.

p(yt|y1:t−1) = ∫ p(yt|x0:t) p(x0:t|y1:t−1) dx0:t
             = ∫ p(yt|xt) p(xt|xt−1) p(x0:t−1|y1:t−1) dx0:t
             = ∫ [p(yt|xt) p(xt|xt−1) p(x0:t−1|y1:t−1) / h(x0:t|y1:t)] h(x0:t|y1:t) dx0:t

where h(·) is an arbitrary (positive) density function called the “importance density”. Choose one easy to simulate from.
1. Simulate N samples x0:tⁱ ∼ h(x0:t|y1:t), i = 1, ..., N.
2. Construct importance weights wtⁱ = p(yt|xtⁱ) p(xtⁱ|xt−1ⁱ) p(x0:t−1ⁱ|y1:t−1) / h(x0:tⁱ|y1:t).
3. p(yt|y1:t−1) = E(p(yt|xt)) ≈ (1/N) ∑_{i=1}^{N} wtⁱ.
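A generic (non-SSM) MATLAB illustration of the idea, with my own toy choices: estimate E_p(X²) = 1 for the target p = N(0, 1), drawing from the importance density h = N(0, 2²).

% Importance sampling sketch: weights w_i = p(x_i)/h(x_i).
N = 1e5; g = @(x) x.^2;
x = 2*randn(N,1);                             % x_i ~ h = N(0, 2^2)
w = 2*exp(-0.5*x.^2 + 0.5*(x/2).^2);          % ratio of the two Gaussian pdfs
est = mean(w.*g(x))                           % approximates E_p(X^2) = 1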
Importance Sampling
Importance sampling is an appealing and revolutionary idea for Monte Carlo integration.
However, generating at each time a “cloud of particles” x0:tⁱ is not really computationally appealing, and it’s not clear how to do so.
We need something enabling a sort of sequential mechanism, as t increases.
Sequential Importance Sampling

When h(·) is chosen in an intelligent way, an important property is the one that allows a sequential update of the weights. After some derivation on the board, we have (see p. 121 in Särkkä¹ and p. 5 in Cappé et al.²):

wtⁱ ∝ [p(yt|xtⁱ) p(xtⁱ|xt−1ⁱ) / h(xtⁱ|x0:t−1ⁱ, y1:t)] · wt−1ⁱ

However, the dependence of wt on wt−1 is a curse (with remedy), as described on the next slide.

¹ Särkkä, available here.
² Cappé, Godsill and Moulines, available here.
The particle degeneracy problem
Particle degeneracy occurs when at time t all but one of the importance weights wtⁱ are close to zero. This implies a poor approximation to p(yt|y1:t−1).
Notice that when a particle gets a zero weight (or a “small” positive weight that your computer sets to zero → numerical underflow) that particle is doomed! Since

wtⁱ ∝ [p(yt|xtⁱ) p(xtⁱ|xt−1ⁱ) / h(xtⁱ|x0:t−1ⁱ, y1:t)] · wt−1ⁱ

if for a given i we have wt−1ⁱ = 0, particle i will have zero weight for all subsequent times.
[email protected])
The particle degeneracy problem
For example, suppose a new data point yt is encountered and this is an outlier. Then most particles will end up far away from yt and their weights will be ≈ 0. Those particles will die. Eventually, for a time horizon T which is long enough, all but one particle will have zero weight. This means that the methodology is very fragile and particle degeneracy has affected SMC methods for decades.
The Resampling idea
A life-saving solution is to use resampling with replacement (Gordon et al.³).
1. Interpret w̃tⁱ as the probability to sample xtⁱ from the weighted set {xtⁱ, w̃tⁱ, i = 1, ..., N}, with w̃tⁱ := wtⁱ / ∑i wtⁱ.
2. Draw N times with replacement from the weighted set. Replace the old particles with the new ones {x̃t¹, ..., x̃tᴺ}.
3. Reset weights wtⁱ = 1/N (the resampling has destroyed the information on “how” we reached time t).
Since resampling is done with replacement, a particle with a large weight is likely to be drawn multiple times. Particles with very small weights are not likely to be drawn at all. Nice!

³ Gordon, Salmond and Smith. IEEE Proceedings F. 140(2) 1993.
Sequential Importance Sampling with Resampling (SISR)
We are now in the position to introduce a method that samples from h(·), so that these samples can be used as if they were from p(x0:t|y1:t−1), provided they are appropriately re-weighted.
1. Sample xt¹, ..., xtᴺ ∼ h(·) (same as before).
2. Compute weights wtⁱ (same as before).
3. Normalize weights w̃tⁱ = wtⁱ / ∑_{i=1}^{N} wtⁱ.
4. We have a discrete probability distribution {xtⁱ, w̃tⁱ}, i = 1, ..., N.
5. Resample N times with replacement from the set {xt¹, ..., xtᴺ} having associated probabilities {w̃t¹, ..., w̃tᴺ}, to generate another sample {x̃t¹, ..., x̃tᴺ}.
6. {x̃t¹, ..., x̃tᴺ} is an approximate sample from p(x0:t|y1:t−1).
Sketch of proof on page 111 in Wilkinson’s book.
[email protected])
The next animation illustrates the concept of sequential importance sampling with resampling, using N = 5 particles.
Light blue: observed trajectory (data).
Dark blue: forward simulation of the latent process {Xt}.
Pink balls: particles xtⁱ.
Green balls: selected particles x̃tⁱ from resampling.
Red curve: density p(yt|xt).
Bootstrap filter
This is the method we will actually use to ease Bayesian inference. The bootstrap filter is a simple example of SISR.
Choose h(xtⁱ|x0:t−1ⁱ, y1:t) := p(xtⁱ|xt−1ⁱ).
Recall that in SISR, after performing resampling, we reset all weights → wt−1ⁱ = 1/N. Thus

wtⁱ ∝ [p(yt|xtⁱ) p(xtⁱ|xt−1ⁱ) / h(xtⁱ|x0:t−1ⁱ, y1:t)] · wt−1ⁱ = (1/N) p(yt|xtⁱ) ∝ p(yt|xtⁱ),   i = 1, ..., N

This is a SISR strategy, therefore we still have that the xtⁱ are approximate samples from p(x0:t|y1:t−1).
[email protected])
Bootstrap filter in detail
1. t = 0 (initialize): x0ⁱ ∼ π(x0), assign w̃0ⁱ = 1/N, i = 1, ..., N.
2. At the current t, assume we have the weighted particles {xtⁱ, w̃tⁱ}.
3. From the current sample, resample with replacement N times to obtain {x̃tⁱ, i = 1, ..., N}.
4. From your model, propagate forward xt+1ⁱ ∼ p(xt+1|x̃tⁱ), i = 1, ..., N.
5. Compute wt+1ⁱ = p(yt+1|xt+1ⁱ) and normalize weights w̃t+1ⁱ = wt+1ⁱ / ∑_{i=1}^{N} wt+1ⁱ.
6. Set t := t + 1 and if t < T go to step 2.
A MATLAB sketch of these steps follows.
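Here is a sketch of the bootstrap filter for the nonlinear SSM example, returning the log-likelihood estimate discussed on the next slides. The function name and interface are mine; theta = [q, r] collects the two variances.

function loglik = bootstrapfilter(y, theta, N)
% Sketch of a bootstrap filter for the nonlinear SSM example.
q = theta(1); r = theta(2);
T = length(y);
x = zeros(N,1);                               % all particles start at x0 = 0
loglik = 0;
for t = 1:T
    % propagate forward by simulating the model (no transition density needed)
    x = 0.5*x + 25*x./(1+x.^2) + 8*cos(1.2*(t-1)) + sqrt(q)*randn(N,1);
    % log-weights log N(y_t; 0.05 x_t^2, r), kept on log-scale for stability
    logw = -0.5*log(2*pi*r) - (y(t) - 0.05*x.^2).^2/(2*r);
    c = max(logw); w = exp(logw - c);
    loglik = loglik + c + log(sum(w)) - log(N);   % log p_hat(y_t | y_1:t-1)
    % multinomial resampling with replacement (inverse-CDF method)
    cdf = cumsum(w)/sum(w);
    idx = zeros(N,1);
    for i = 1:N, idx(i) = find(cdf >= rand, 1); end
    x = x(idx);
end
end

For example, loglik = bootstrapfilter(y, [0.1 1], 500) gives the kind of estimate used below.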
So the bootstrap filter by Gordon et al. (1993)⁴ easily provides what we need!

p̂(yt|y1:t−1) = (1/N) ∑_{i=1}^{N} wtⁱ

Finally, a likelihood approximation:

p̂(y1:T) = p̂(y1) ∏_{t=2}^{T} p̂(yt|y1:t−1)

Put θ back in the notation so as to obtain: approximate maximum likelihood θmle = argmaxθ p̂(y1:T; θ), or exact Bayesian inference by using p̂(y1:T|θ) inside Metropolis-Hastings.
Why exact?? Let’s check it after the example...

⁴ Gordon, Salmond and Smith. IEEE Proceedings F. 140(2) 1993.
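A sketch of how the pieces combine into a particle marginal Metropolis-Hastings loop, assuming the bootstrapfilter function sketched above and a user-supplied logprior (both names are mine). Note that the likelihood estimate at the current θ is stored and never recomputed.

function theta = pmmh(y, thetainit, stepsd, R, N)
% Pseudo-marginal MH sketch: the exact likelihood in the MH ratio is
% simply replaced by its unbiased bootstrap-filter estimate.
d = numel(thetainit);
theta = zeros(R,d); theta(1,:) = thetainit(:)';
ll = bootstrapfilter(y, theta(1,:), N);
lp = logprior(theta(1,:));                    % assumed user-supplied log-prior
for r = 2:R
    thetastar = theta(r-1,:) + stepsd.*randn(1,d);  % symmetric Gaussian RW proposal
    lpstar = logprior(thetastar);
    accepted = false;
    if isfinite(lpstar)
        llstar = bootstrapfilter(y, thetastar, N);  % fresh unbiased estimate
        if log(rand) < (llstar + lpstar) - (ll + lp)  % q terms cancel by symmetry
            theta(r,:) = thetastar; ll = llstar; lp = lpstar; accepted = true;
        end
    end
    if ~accepted, theta(r,:) = theta(r-1,:); end
end
end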
Back to the nonlinear SSM example
We can now comment on how we obtained the previously shown results (reproposed here). We used the bootstrap filter with N = 500 particles and R = 10,000 MCMC iterations.
Forward propagation: xt+1 = 0.5 xt + 25 xt/(1 + xt²) + 8 cos(1.2 t) + vt+1.
Weights: wt+1ⁱ = N(yt+1; 0.05 (xt+1ⁱ)², r).
[Figure (same as before): trace plots of q and r (acceptance rate ≈ 11%) and empirical posterior densities p(q|y1:T) and p(r|y1:T) after burn-in.]
Numerical issues with particle degeneracy
When coding your algorithms, try to consider the following before normalizing weights.
Code unnormalised weights on the log-scale: e.g. when wtⁱ = N(yt|xtⁱ), the exp() in the Gaussian pdf will likely produce an underflow (wtⁱ = 0) for xtⁱ far from yt. Solution: reason in terms of logw instead of w.
However, afterwards we necessarily have to go back to w := exp(logw) and then normalize, and the above might still not be enough. Solution: subtract the maximum (log)weight from each (log)weight, i.e. set logw := logw − max(logw). This is totally ok: the relative importance of particles is unaffected, since all weights are only scaled by the same constant exp(c), with c = max(logw).
[email protected])
Safe (log)likelihood computation
As usual, better to compute the (approximate) loglikelihood instead of the likelihood:

log p̂(y1:T|θ) = ∑_{t=1}^{T} ( −log N + log ∑_{i=1}^{N} wtⁱ )

At time t set ct = max{log wtⁱ, i = 1, ..., N} and set wtⁱ∗ := exp(log wtⁱ − ct).
Once all the wtⁱ∗ are available, compute

log ∑_{i=1}^{N} wtⁱ = ct + log ∑_{i=1}^{N} wtⁱ∗

(the latter follows from wtⁱ = wtⁱ∗ exp{ct}, which you don’t need to evaluate).
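In MATLAB this is essentially one line (a standard log-sum-exp helper; the name logsumw is mine):

% Given the N log-weights at time t, return log(sum_i w_t^i) safely.
logsumw = @(logw) max(logw) + log(sum(exp(logw - max(logw))));
% running update at each time t:  loglik = loglik - log(N) + logsumw(logw);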
[email protected])
Incredible result
Quite astonishingly, Andrieu and Roberts⁵ proved that plugging an unbiased estimate of the likelihood function into the MCMC routine is sufficient to obtain exact Bayesian inference!
That is, using the acceptance ratio

A = [p̂(y1:T|θ∗) / p̂(y1:T|θ)] × [π(θ∗) / π(θ)] × [q(θ|θ∗) / q(θ∗|θ)]

will return a Markov chain with stationary distribution π(θ|y1:T), regardless of the finite number N of particles used to approximate the likelihood! The good news is that E(p̂(y1:T|θ)) = p(y1:T|θ) with p̂(y1:T|θ) obtained via SMC.

⁵ Andrieu and Roberts (2009), Annals of Statistics, 37(2) 697–725.
The previous result will be considered, in my opinion, one of the most important statistical results of the XXI century. In fact, it offers an “exact-approximate” approach, where because of computing limitations we can only produce N < ∞ particles, while still being reassured that we obtain exact (Bayesian) inference under minor assumptions.
But let’s give a rapid (technically informal) look at why it works.

Key result: unbiasedness (Del Moral 2004⁶)
We have that

E(p̂(y1:T|θ)) = ∫ p̂(y1:T|θ, ξ) p(ξ) dξ = p(y1:T|θ)

with ξ ∼ p(ξ) the vector of all random variates generated during SMC (both to propagate forward the state and to perform particle resampling).

⁶ Easier to look at Pitt, Silva, Giordani, Kohn. J. Econometrics 171, 2012.
To prove the exactness of the approach we look at the (easier and less general) argument in sec. 2.2 of Pitt, Silva, Giordani, Kohn, J. Econometrics 171, 2012. To simplify the notation, take y := y1:T.
Let π̂(θ, ξ|y) be the approximate joint posterior of (θ, ξ) obtained via SMC:

π̂(θ, ξ|y) = p̂(y|θ, ξ) p(ξ) π(θ) / p(y)

(notice ξ and θ are assumed a priori independent)
Notice we put p(y), not p̂(y), at the denominator: this follows from the unbiasedness assumption, since we obtain

∫∫ p̂(y|θ, ξ) p(ξ) π(θ) dξ dθ = ∫ π(θ) { ∫ p̂(y|θ, ξ) p(ξ) dξ } dθ = ∫ π(θ) p(y|θ) dθ = p(y).
[email protected])
The exact (unavailable) posterior of θ is

π(θ|y) = p(y|θ) π(θ) / p(y)

therefore the marginal likelihood (evidence) is

p(y) = p(y|θ) π(θ) / π(θ|y)

and, substituting this expression for p(y),

π̂(θ, ξ|y) = p̂(y|θ, ξ) p(ξ) π(θ) / p(y) = [π(θ|y) / p(y|θ)] · p̂(y|θ, ξ) p(ξ)
Now, we know that applying an MCMC targeting π̂(θ, ξ|y) and then discarding the output pertaining to ξ corresponds to integrating out ξ from the posterior:

∫ π̂(θ, ξ|y) dξ = [π(θ|y) / p(y|θ)] ∫ p̂(y|θ, ξ) p(ξ) dξ = π(θ|y)

where the last equality uses E(p̂(y|θ)) = p(y|θ).
We are thus performing a pseudo-marginal approach: “marginal” because we disregard ξ; “pseudo” because we use p̂(·), not p(·). Therefore we proved that using MCMC on an (artificially) augmented posterior, then discarding from the output all the random variates created during SMC, returns exact Bayesian inference.
Notice that discarding the ξ is something that we naturally do in Metropolis-Hastings, hence nothing strange is happening here. The ξ are just instrumental, uninteresting variates, independent of θ and independent of {Xt}.
[email protected])
One more example
Let’s study the stochastic Ricker model:

yt ∼ Pois(φ Nt)
Nt = r · Nt−1 exp(−Nt−1 + et),   et ∼ N(0, σ²)

Nt is the size of a population at time t; r is the growth rate; φ is a scale parameter; et is environmental noise.
So this example has an observation model for yt given by a discrete distribution. Classic Kalman-based filtering can’t accommodate discreteness.
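A forward simulator for this model is a few lines of MATLAB (a sketch; poissrnd requires the Statistics Toolbox; parameter values are those used for the data example below):

% Forward-simulate the stochastic Ricker model.
T = 50; r = 44.7; sigma = 0.3; phi = 10;
N = zeros(T,1); y = zeros(T,1); Nold = 7;     % starting size N0 = 7
for t = 1:T
    N(t) = r*Nold*exp(-Nold + sigma*randn);   % N_t = r N_{t-1} exp(-N_{t-1} + e_t)
    y(t) = poissrnd(phi*N(t));                % y_t ~ Pois(phi N_t)
    Nold = N(t);
end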
[email protected])
We can use the R package pomp to fit this model. pomp supports a number of inference methods of interest to us:
particle marginal methods
iterated filtering
approximate Bayesian computation (ABC)
synthetic likelihoods
...and more.
Essentially, a user who’s interested in fitting a SSM can benefit from a well-tested platform and experiment with a number of possibilities. I had never used it until 4 days ago (!). But now I have a working example (also as an optional exercise!).
Here are 50 observations from the Ricker model at times (1, 2, ..., 50) generated with (r, σ, φ) = (44.7, 0.3, 10) and starting size N0 = 7 at t = 0.
[Figure “myobservedricker”: panels showing y, the latent N and the noise e against time t = 0, ..., 50.]
You can also notice the path of the (latent) process Nt, as well as the (not so interesting) realizations of et.
[email protected])
The pomp pmcmc function runs a pseudo-marginal MCMC. Assume we are only interested in (r, φ) and keep σ = 0.3 constant (“known”). Here we use 500 particles with 5,000 MCMC iterations. We start at (r, φ) = (7.4, 5). We assumed flat (improper) priors.
[Figure: PMCMC convergence diagnostics from pomp: trace plots of loglik, log.prior, nfail and the parameters r and phi over the 5,000 PMCMC iterations.]
We used a non-adaptive Gaussian random walk (variances are kept fixed). You might get better results with an adaptive version.
A topic set as an (optional) exercise is to think about why the method fails at estimating parameters of a nearly deterministic (smaller σ) stochastic Ricker model. Furthermore, for Nt deterministic (σ = 0), where particle filters are not applicable (nor needed), exact likelihood calculation is also challenging.
Great coverage of this issue is in Fasiolo et al. (2015) arXiv:1411.4564, comparing particle marginal methods, approximate Bayesian computation, iterated filtering and more. A very much recommended read.
Conclusions
We have outlined a powerful methodology for exact Bayesian inference, regardless of the number of particles N.
A too small N will have a negative impact on chain mixing → many rejections, sticky chain.
The methodology is perfectly suited for state-space modelling.
The methodology is “plug-and-play”, as besides knowing the observations density p(yt|xt) nothing else is required.
In fact, we only need the ability to forward-simulate xt given xt−1, which can be performed numerically from the state model.
The above implies that knowledge of the transition density expression p(xt|xt−1) is not required. BINGO!
[email protected])
Software
I’m not very much updated on all available software for parameter inference via SMC (I tend to write my code from scratch). Some possibilities are:
LibBi (C++ template library)
Biips (C++ with interfaces to MATLAB/R)
demo("PMCMC") in the R package smfsb
the R package pomp
accompanying MATLAB code for the book by S. Särkkä, available here
the MATLAB code for the presented example, available at the course webpage
[email protected])
Important issues
How to tune the number of particles?
Doucet, Pitt, and Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv:1210.1871 (2012).
Pitt, dos Santos Silva, Giordani and Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics 171, no. 2 (2012): 134-151.
Sherlock, Thiery, Roberts and Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. arXiv:1309.7209 (2013).
Important issues

Comparison of resampling schemes:
Douc, Cappé and Moulines. Comparison of resampling schemes for particle filtering. http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=5755
Hol, Schön and Gustafsson. On resampling algorithms for particle filters. http://people.isy.liu.se/rt/schon/Publications/HolSG2006.pdf
Appendix
Multinomial resampling
This is the simplest to explain (though not the most efficient) resampling scheme.
At time t we wish to sample from a population of weighted particles (xtⁱ, w̃tⁱ, i = 1, ..., N). What we actually do is to sample particle indices N times, with replacement, from the population (i, w̃tⁱ, i = 1, ..., N). This will be a sample of size N from a multinomial distribution.
Pick a particle from the “urn”: the larger its probability w̃tⁱ, the more likely it is to be picked. Record its index i and put it back in the urn. Repeat for a total of N times.
To code the sampling procedure we just need to recall the inverse-transform method. For a generic random variable X, let FX be an invertible cdf. We can sample an x from FX using x := FX⁻¹(u), with u ∼ U(0, 1).
[email protected])
For example, let’s start from the simplest multinomial distribution, the Bernoulli distribution: X ∈ {0, 1} with p = P(X = 1), 1 − p = P(X = 0). Draw u ∼ U(0, 1): if u > 1 − p set x := 1, otherwise x := 0.
For the multinomial case it is a simple generalization. Drop time t and set w̃ⁱ = pⁱ. FX is a staircase with N steps. Shoot a u ∼ U(0, 1) and return the index

i₀ := min{ i ∈ {1, ..., N} : ( ∑_{i′=1}^{i} pⁱ′ ) − u > 0 }.
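In MATLAB the whole scheme takes a few lines (a sketch; the function name is mine):

function idx = multinomialresample(w)
% Multinomial resampling via the inverse-CDF formula above.
% w: (possibly unnormalized) weights; idx: N indices drawn with replacement.
N = numel(w);
cdf = cumsum(w(:))/sum(w);
idx = zeros(N,1);
for n = 1:N
    idx(n) = find(cdf - rand > 0, 1);   % i0 = min{ i : (sum_{i'<=i} p_i') - u > 0 }
end
end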
[email protected])
Cool reads on Bayesian methods (titles are linked)
You have an engineering/signal-processing background: check S. Särkkä, “Bayesian Filtering and Smoothing” (free PDF from the author!).
You are a data scientist: check K. Murphy, “Machine Learning: a probabilistic perspective”.
You are a theoretical statistician: C. Robert, “The Bayesian Choice”.
You are interested in bioinformatics/systems biology: check D. Wilkinson, “Stochastic Modelling for Systems Biology, 2ed.”.
You are interested in inference for SDEs with applications to life sciences: check the book by Wilkinson above and C. Fuchs, “Inference for diffusion processes”.
[email protected])
Cool reads on Bayesian methods (titles are linked)
You are a computational statistician: check the “Handbook of MCMC”. Older (but excellent) titles are: J. Liu, “Monte Carlo strategies in Scientific Computing” and Casella-Robert, “Monte Carlo Statistical Methods”.
You want a practical, hands-on and (almost) maths-free introduction: check “The BUGS Book” and “Doing Bayesian Data Analysis”.