Simulation-based Bayesian inference for discretely observed Markov models using a factorised posterior distribution

Simon R. White∗ MRC Biostatistics Unit, Cambridge CB2 0SR, UK

Theodore Kypraios School of Mathematical Sciences, University of Nottingham, NG7 2RD, UK

Simon Preston School of Mathematical Sciences, University of Nottingham, NG7 2RD, UK

John Crowe School of Electrical and Electronic Engineering, University of Nottingham, NG7 2RD, UK

Abstract Growing computing power means that simulation-based methods of inference, which do not require calculation of the likelihood, are becoming increasingly useful. Recent papers, e.g. Toni et al. [8], have stimulated considerable interest in this area. We consider an approach suitable for discretely observed Markov models that involves writing the Bayesian posterior density as a product of factors, and then using simulation-based inference within each factor. This has the advantage of treating the data symmetrically, and in the context of approximate Bayesian computation (ABC) typically enables a more stringent threshold to be set, making the posterior “less approximate”, or even exact. However, extra costs are entailed in working with the product density. We use two examples to illustrate the approach and discuss the trade-offs compared with other simulation-based approaches.

Keywords: Approximate Bayesian Computation; Simulation; Stochastic Lotka-Volterra

∗Author for correspondence ([email protected])


1 Outline

Suppose our data are a set of observations denoted X = {x1, . . . , xn} = {x(t1), . . . , x(tn)} of a state variable x ∈ Rm at time points t1, . . . , tn. According to Bayes' theorem, if these data arise from a model with parameters θ, and if prior beliefs about θ are expressed via the density π(θ), then the posterior density is

\[
\pi(\theta \mid X) = \frac{\pi(X \mid \theta)\,\pi(\theta)}{\int_{\theta} \pi(X \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta} \propto \pi(X \mid \theta)\,\pi(\theta),
\tag{1}
\]

where π(X|θ) is the likelihood. Typically, direct calculation of the posterior density is not possible because the normalising constant ∫θ π(X|θ)π(θ) dθ is difficult to evaluate, but provided the likelihood is tractable one can draw samples from the posterior distribution using, for example, rejection sampling [1]. If the likelihood is intractable, alternative options include using data augmentation [2] or, as we focus on here, using simulation-based methods that rely only on being able to sample realisations from the model. Consider the following rejection-sampling algorithm:

Algorithm 1 Exact Bayesian Computation (EBC)
1: Sample θ∗ from π(θ).
2: Simulate a dataset X∗ = {x∗(t1), . . . , x∗(tn)} from the model using parameters θ∗.
3: Accept θ∗ if X∗ = X, otherwise reject.
4: Repeat.

In principle, this algorithm generates samples exactly from π(θ|X), although its practical utility is limited to models whose likelihood corresponds to a discrete probability distribution, i.e., whose state variables can adopt only integer values (xi ∈ Zm); otherwise the probability of acceptance is zero. For continuous distributions, step 3 can be replaced with

3′: Accept θ∗ if d(X, X∗) ≤ ε, otherwise reject,

where d(X, X∗) is a measure of "distance" between the observed and simulated datasets, and ε is a tolerance value. This is the Approximate Bayesian Computation (ABC) algorithm introduced by Tavaré et al. [4]. ABC methodology has been further developed by Pritchard et al. [5] and Beaumont et al. [6], and during the past decade has been extended and used within MCMC [7] and Sequential Monte Carlo (SMC) [8] frameworks. Samples drawn using ABC are from the density π(θ | d(X, X∗) ≤ ε), which converges to π(θ|X) in the limit ε → 0. The tolerance ε must be chosen small enough that the approximate posterior is acceptably close to the true posterior, but large enough to give a sufficiently high acceptance rate.

A natural extension of ABC is SMC-ABC; this uses samples from the prior distribution of the parameters (called particles), which are propagated through a sequence of intermediate distributions, π(θ | d(X, X∗) ≤ εi), where the tolerances εi for i = 1, . . . , T are chosen such that ε1 > · · · > εT ≥ 0. In other words, these intermediate distributions evolve towards π(θ|X). SMC-ABC can work very well [8], although its efficiency depends largely on certain tunable aspects, such as the choice of the tolerances εi and the kernel via which the particles are propagated, and without care the approach can fail altogether due to "particle degeneracy" [9].

For simulation-based methods, computing resources are always limiting, and a guiding principle is to take every opportunity to exploit model structure to minimise computational costs. In this letter, we consider a simulation-based approach for a particular (but fairly broad) class of models, namely those that have the Markov property and whose vector of state variables is observable at discrete time points. For such models we propose a simulation-based algorithm that: i) takes advantage of the model's Markov structure, ii) requires a minimal amount of tuning, and iii) enables "less approximate", or even exact, Bayesian inference for the parameters of the model.
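As a point of reference for what follows, the sketch below (ours, not taken from the paper) implements the generic rejection scheme of Algorithm 1 with the ABC acceptance rule of step 3′; the callables sample_prior, simulate and distance are illustrative placeholders to be supplied for a particular model.

```python
import numpy as np

def abc_rejection(data, sample_prior, simulate, distance, eps, n_accept):
    """Rejection sampler of Algorithm 1 with ABC step 3'.

    sample_prior() draws theta from pi(theta); simulate(theta) returns a
    simulated dataset X*; distance(data, sim) returns d(X, X*).  Setting
    eps = 0 with an exact-match distance recovers EBC for discrete models.
    """
    accepted = []
    while len(accepted) < n_accept:
        theta = sample_prior()             # step 1: draw from the prior
        sim = simulate(theta)              # step 2: simulate a dataset
        if distance(data, sim) <= eps:     # step 3': accept if within tolerance
            accepted.append(theta)
    return np.array(accepted)
```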

2 Methods

The Markov property enables the likelihood to be written as

\[
\pi(X \mid \theta) = \left( \prod_{i=2}^{n} \pi(x_i \mid x_{i-1}, \ldots, x_1, \theta) \right) \pi(x_1 \mid \theta)
                   = \left( \prod_{i=2}^{n} \pi(x_i \mid x_{i-1}, \theta) \right) \pi(x_1 \mid \theta),
\tag{2}
\]

and hence the posterior as

\[
\pi(\theta \mid X) \propto \pi(X \mid \theta)\,\pi(\theta)
                   \propto \pi(x_1 \mid \theta)\,\pi(\theta) \prod_{i=2}^{n} \frac{\pi(x_i \mid x_{i-1}, \theta)\,\pi(\theta)}{\pi(\theta)}
                   \propto \pi(\theta)^{1-n}\,\pi(\theta \mid x_1) \prod_{i=2}^{n} \pi(\theta \mid x_i, x_{i-1}).
\tag{3}
\]

Essentially, in (3) the density of the posterior distribution of θ | X = (x1, . . . , xn) has been decomposed into a product of the densities of the posterior distributions θ | (xi, xi−1), for i = 2, . . . , n. We can estimate the posterior density π(θ|X) as follows:

Algorithm 2 Piece-Wise Exact (Approximate) Bayesian Computation: PW-EBC (PW-ABC)
for i = 2 to n do
  a: Apply the EBC (ABC) algorithm to draw exact (approximate) samples from π(θ|xi, xi−1);
  b: Calculate the kernel density estimate (KDE), π̂(θ|xi, xi−1).
end for
Calculate an estimate, π̂(θ|X), of π(θ|X) by replacing in (3) the densities π(θ|xi, xi−1) with their corresponding KDEs, π̂(θ|xi, xi−1).

PW-EBC and PW-ABC have a major advantage over EBC and ABC: for a given tolerance ε, the rejection-sampling acceptance probability is substantially higher when drawing samples from π(θ|xi, xi−1) than when drawing from π(θ|X). PW-EBC and PW-ABC also benefit from being trivial to parallelise: factors can be computed in parallel, and so can samples within factors.

PW-EBC and PW-ABC require calculating kernel density estimates. Sophisticated methods are available for computing KDEs quickly [see, for example, 10], but for the examples in this letter we found it sufficient to adopt the simple approach of sums of Gaussian basis functions with standard automatic bandwidth selectors [see, for example, 11]. The only subtlety was that, for parameters with bounded support (e.g. parameters that can only be positive), we calculated KDEs for simple transformed versions of the parameters in order to avoid "edge effects"; these transformations were chosen such that the transformed parameters have support on the whole of the real line.

In the following section we show via two examples how inference based on the PW-EBC algorithm performs well and leads to substantial computational savings compared with simulation-based approaches that do not exploit the Markov property.
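Before turning to the examples, here is a minimal sketch of Algorithm 2 (PW-EBC) for a single scalar parameter and scalar discrete states, with Gaussian KDEs via scipy.stats.gaussian_kde evaluated on a uniform grid of transformed-parameter values. It is illustrative rather than the authors' implementation: the callables sample_prior, simulate_step, transform and log_prior_t are placeholders, and for simplicity it conditions on the first observation so that π(θ|x1) = π(θ) in (3).

```python
import numpy as np
from scipy.stats import gaussian_kde

def pw_ebc_posterior(x, sample_prior, simulate_step, transform, log_prior_t,
                     grid, n_accept=500):
    """Sketch of Algorithm 2 (PW-EBC) for a single scalar parameter theta.

    x             : observed Markov chain x_1, ..., x_n (scalar, discrete states)
    sample_prior  : () -> theta, one draw from pi(theta)
    simulate_step : (theta, x_prev) -> one simulated transition of the model
    transform     : maps theta to an unbounded scale (e.g. log or logit),
                    to avoid KDE edge effects for bounded parameters
    log_prior_t   : grid -> log prior density of the *transformed* parameter
    grid          : uniformly spaced transformed-parameter values
    """
    n = len(x)
    log_post = np.zeros_like(grid, dtype=float)
    for i in range(1, n):                        # factors pi(theta | x_i, x_{i-1})
        samples = []
        while len(samples) < n_accept:           # EBC within each factor (eps = 0)
            theta = sample_prior()
            if simulate_step(theta, x[i - 1]) == x[i]:
                samples.append(transform(theta))
        log_post += np.log(gaussian_kde(samples)(grid))   # log of the factor's KDE
    # Combine as in (3); conditioning on x_1 gives pi(theta | x_1) = pi(theta),
    # so the prior correction becomes pi(theta)^(2 - n):
    log_post += (2 - n) * log_prior_t(grid)
    log_post -= log_post.max()                   # stabilise before exponentiating
    post = np.exp(log_post)
    return post / (post.sum() * (grid[1] - grid[0]))   # normalise on the grid
```

For a multi-dimensional θ the same combination of per-factor KDEs applies, with the densities evaluated on a suitable multivariate grid.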

3 Two examples

We illustrate our method with two examples. The first is an integer-valued time series model, INAR(1). For this model the likelihood is available (albeit awkward to compute), which enables comparison of our approach with a "gold standard" MCMC approach. The second is a stochastic Lotka–Volterra model; this is a simple example from a common class of models (widely appropriate for stochastic chemical kinetics, for instance) in which the likelihood, and therefore standard methods of inference, are unavailable.

3.1 Integer-Valued Autoregressive Models

Integer-valued time series occur in many contexts as counts of events or objects in consecutive time intervals, or at consecutive points in time; for example, the number of patients in a hospital at a specific time. Parameter inference for these models can be somewhat challenging [see, e.g., 12, 13], but PW-EBC (PW-ABC) offers a simple approach.

The model

Consider the following integer-valued autoregressive model of order p, known as INAR(p):

\[
X_t = \sum_{i=1}^{p} \alpha_i \circ X_{t-i} + Z_t, \qquad t \in \mathbb{Z},
\tag{4}
\]

where the Zt for t > 1 are independent and identically distributed integer-valued random variables with E[Zt²] < ∞, assumed independent of the Xt. Here we assume Zt ∼ Po(λ). Each operator αi ◦ denotes binomial thinning, defined by

\[
\alpha_i \circ W =
\begin{cases}
\mathrm{Binomial}(W, \alpha_i), & W > 0, \\
0, & W = 0,
\end{cases}
\]

for a non-negative integer-valued random variable W. In addition, all of the thinning operations αi ◦, i = 1, . . . , p, are assumed to be performed independently. We consider the simplest example of this model, INAR(1) [see, for example, 14], supposing that we have some observed data X = {x1, . . . , xn} from this model and wish to make inference for the parameters (α, λ).
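For concreteness, here is a minimal sketch (ours, not the authors' code) of simulating an INAR(1) path via binomial thinning with Poisson innovations; the function name and arguments are illustrative.

```python
import numpy as np

def simulate_inar1(alpha, lam, x0, n, rng=None):
    """Simulate n steps of an INAR(1) process, X_t = alpha o X_{t-1} + Z_t,
    with binomial thinning and Poisson(lam) innovations."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n + 1, dtype=int)
    x[0] = x0
    for t in range(1, n + 1):
        thinned = rng.binomial(x[t - 1], alpha) if x[t - 1] > 0 else 0
        x[t] = thinned + rng.poisson(lam)
    return x

# e.g. the data-generating settings used in the results below:
# data = simulate_inar1(alpha=0.7, lam=1.0, x0=10, n=100)
```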

Results

To test our approach, we generated 100 observations from an INAR(1) process using parameters (α, λ) = (0.7, 1) and X(0) = 10 (Figure S2 in Supplementary Material), and then attempted to draw inference using Uniform(0,1) and Exp(0.01) priors for α and λ, respectively. For the EBC algorithm the probability of acceptance is around 10⁻¹⁰⁰, which is prohibitively small. The ABC algorithm requires a value of ε so large that sequential approaches (e.g. SMC-ABC) are needed. Using PW-EBC (essentially using ε = 0) we were able to draw samples from π(θ|xi, xi−1) for all 99 factors, and still achieve acceptance rates of around 9% on average per factor. In other words, exploiting the factorised posterior (3) enabled exact inference.

[Figure 1: Histograms of the marginal posterior densities of the INAR(1) model parameters from i) the PW-EBC approach (solid line) and ii) the MCMC algorithm (bars): (a) logit binomial parameter, logit(α); (b) log Poisson parameter, log(λ). Vertical lines indicate the true values. (Online version in colour.)]

The results plotted in Figure 1 show that the posterior densities for logit(α) and log(λ) match closely those obtained from the "gold-standard" MCMC approach. Here we have used p = 1, for which the likelihood is available [see, for example, 14], for the purpose of comparison with the MCMC approach. However, we could have applied PW-EBC equally easily for p > 1, a case in which the likelihood is essentially intractable and for which problem-specific approaches have been developed (e.g. variants of the "expectation-maximisation" algorithm and conditional least squares [see, for example, 13, and references therein]); PW-EBC remains as simple and well suited for p > 1 as for p = 1.
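The per-factor step that makes this exactness possible is easy to state in code. The sketch below (illustrative only, not the authors' implementation) draws exact samples from a single factor π(α, λ | xi, xi−1) under the priors stated above, interpreting Exp(0.01) as an exponential distribution with rate 0.01 (mean 100).

```python
import numpy as np

def ebc_inar1_factor(x_prev, x_curr, n_accept=1000, rng=None):
    """Exact (eps = 0) samples from one factor pi(alpha, lambda | x_i, x_{i-1})
    of the INAR(1) posterior, with alpha ~ Uniform(0, 1) and lambda ~ Exp(0.01)
    (interpreted here as rate 0.01, i.e. mean 100)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    while len(samples) < n_accept:
        alpha = rng.uniform(0.0, 1.0)               # prior draw for alpha
        lam = rng.exponential(scale=100.0)          # prior draw for lambda
        thinned = rng.binomial(x_prev, alpha) if x_prev > 0 else 0
        if thinned + rng.poisson(lam) == x_curr:    # accept only an exact match
            samples.append((alpha, lam))
    return np.array(samples)
```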

3.2 Stochastic Lotka-Volterra Dynamics

The stochastic Lotka–Volterra (LV) model we consider here is a model of predator–prey dynamics and an example of a stochastic discrete-state-space continuous-time Markov process [see, for example, 15]. Models of this type commonly arise when modelling chemical kinetics, and we can think of the predator–prey dynamics in those terms: the predators and prey are two populations of "reactants" subject to three "reactions", namely prey birth, predation and predator death. Inference is simple if the type and precise time of each reaction is observed. However, a more common setting is one in which the population sizes are observable only at discrete time points, in which case the likelihood is not available and inference is much more difficult. Reversible-jump MCMC methods have been developed in this context [16], but they require considerable expertise to implement. On the other hand, because realisations of such models can be simulated straightforwardly (for instance using the so-called "Gillespie algorithm" [17]), simulation-based approaches are an attractive alternative.

The model

Let Y1 and Y2 denote the number of prey and predators respectively, and suppose Y1 and Y2 are subject to the following reactions:

\[
Y_1 \xrightarrow{r_1} 2Y_1, \qquad
Y_1 + Y_2 \xrightarrow{r_2} 2Y_2, \qquad
Y_2 \xrightarrow{r_3} \varnothing,
\]

which respectively represent prey birth, predation and predator death. We consider the problem of making inference about the vector of rates r = (r1, r2, r3) based on time-course data for Y1 and Y2.
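Because inference here relies only on forward simulation, a Gillespie simulator for this model is all that is needed on the modelling side. The following is a minimal sketch (ours, not the authors' code), assuming the standard mass-action hazards r1·Y1, r2·Y1·Y2 and r3·Y2 for the three reactions.

```python
import numpy as np

def gillespie_lv(r1, r2, r3, y1_0, y2_0, t_end, rng=None):
    """Gillespie simulation of the stochastic LV model, assuming the standard
    mass-action hazards r1*Y1, r2*Y1*Y2 and r3*Y2 for the three reactions."""
    rng = np.random.default_rng() if rng is None else rng
    t, y1, y2 = 0.0, y1_0, y2_0
    times, states = [t], [(y1, y2)]
    while t < t_end:
        hazards = np.array([r1 * y1, r2 * y1 * y2, r3 * y2], dtype=float)
        total = hazards.sum()
        if total == 0.0:                       # no further reactions can occur
            break
        t += rng.exponential(1.0 / total)      # time to the next reaction
        j = rng.choice(3, p=hazards / total)   # which reaction fires
        if j == 0:
            y1 += 1                            # prey birth
        elif j == 1:
            y1, y2 = y1 - 1, y2 + 1            # predation
        else:
            y2 -= 1                            # predator death
        times.append(t)
        states.append((y1, y2))
    return np.array(times), np.array(states)

# e.g. the settings used below: gillespie_lv(10.0, 0.1, 10.0, 1000, 1000, 1.5)
```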

Results

We generated a realisation from the stochastic LV model using r = (10, 0.1, 10) and Y1(0) = Y2(0) = 1000, and, to simulate a challenging sparse dataset, we sampled 13 observations at time points drawn from a uniform distribution on (0, 1.5) (see Figure S3, Supplementary Material). We chose these parameters to follow closely the example in Section 3.1.2 of Toni et al. [8], and we used the same priors: π(r1) ∼ U(0, 28), π(r2) ∼ U(0, 0.4) and π(r3) ∼ U(0, 28). Attempting to employ ABC leads to a small probability of acceptance (< 10⁻¹⁰) even for a large ε (> 10³). In contrast, by exploiting the factorised posterior (3) we can use PW-EBC (i.e. set ε = 0) and perform exact inference with a typical acceptance rate of 10⁻⁷.

[Figure 2: Marginal posterior densities of the transformed parameters for the Lotka-Volterra example: (a) scaled logit prey birth rate, logit(r1/28); (b) scaled logit interaction rate, logit(r2/0.4); (c) scaled logit predator death rate, logit(r3/28). Solid vertical lines indicate the true values and dashed lines the 95% credible interval. (Online version in colour.)]

The marginal posteriors for the transformed parameters are shown in Figure 2. These plots show that PW-EBC has performed very well even for this sparse dataset.


4 Summary

These examples show how exploiting a model's Markov structure to factorise the posterior distribution can give dramatic computational savings in simulation-based inference. In both examples we set the tolerance ε = 0 to make the inference exact. More generally, for problems that demand it, one can use ε > 0 (resulting in an ABC scheme). There is much potential for embedding such an approach within sequential schemes of the type considered by Toni et al. [8].

References

[1] B.D. Ripley. Stochastic Simulation. John Wiley & Sons, New York, 1987. doi: 10.1002/9780470316726.

[2] M.A. Tanner and W.H. Wong. The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc., 82(398):528–550, 1987.

[3] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, London, 1996.

[4] S. Tavaré, D.J. Balding, R.C. Griffiths, and P. Donnelly. Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518, 1997.

[5] J.K. Pritchard, M.T. Seielstad, A. Perez-Lezaun, and M.W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol., 16(12):1791–1798, 1999.

[6] M.A. Beaumont, W. Zhang, and D.J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.

[7] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proc. Natl Acad. Sci. USA, 100(26):15324, 2003.

[8] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M.P. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface, 6:187–202, 2009. doi: 10.1098/rsif.2008.0172.

[9] N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–551, 2002. doi: 10.1093/biomet/89.3.539.

[10] L. Greengard and J. Strain. The fast Gauss transform. SIAM J. Sci. Comput., 12(1):79–94, 1991. doi: 10.1137/0912004.

[11] B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.

[12] P. Neal and T. Subba Rao. MCMC for integer-valued ARMA processes. J. Time Ser. Anal., 28(1):92–110, 2007. doi: 10.1111/j.1467-9892.2006.00500.x.

[13] E. McKenzie. Discrete variate time series. In Stochastic Processes: Modelling and Simulation, volume 21 of Handbook of Statistics, pages 573–606. North-Holland, Amsterdam, 2003.

[14] M.A. Al-Osh and A.A. Alzaid. First-order integer-valued autoregressive (INAR(1)) process. J. Time Ser. Anal., 8(3):261–275, 1987. doi: 10.1111/j.1467-9892.1987.tb00438.x.

[15] D.J. Wilkinson. Stochastic Modelling for Systems Biology. Chapman & Hall/CRC, Boca Raton, FL, 2006.

[16] R.J. Boys, D.J. Wilkinson, and T.B. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 18(2):125–135, 2008. doi: 10.1007/s11222-007-9043-x.

[17] D.T. Gillespie. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem., 81(25):2340–2361, 1977.