Playing Russian Roulette with Intractable Likelihoods

Mark Girolami∗, Anne-Marie Lyne, Heiko Strathmann, Daniel Simpson, Yves Atchadé

∗ [email protected]

arXiv:1306.4032v1 [stat.ME] 17 Jun 2013

June 19, 2013

Abstract

A general scheme to exploit Exact-Approximate MCMC methodology for intractable likelihoods is suggested. By representing the intractable likelihood as an infinite Maclaurin or geometric series expansion, unbiased estimates of the likelihood can be obtained by finite-time stochastic truncations of the series via Russian Roulette sampling. Whilst the estimates of the intractable likelihood are unbiased, for unbounded unnormalised densities they induce a signed measure in the Exact-Approximate Markov chain Monte Carlo procedure, which will introduce bias in the invariant distribution of the chain. By exploiting results from the Quantum Chromodynamics literature, the signed measures can be employed in an Exact-Approximate sampling scheme in such a way that expectations with respect to the desired target distribution are preserved. This provides a general methodology to construct Exact-Approximate sampling schemes for a wide range of models, and the methodology is demonstrated on well-known examples such as posterior inference of coupling parameters in Ising models and defining the posterior for Fisher-Bingham distributions defined on the d-sphere. A large-scale example is provided for a Gaussian Markov Random Field model, with fine-scale mesh refinement, describing the ozone column data. To our knowledge this is the first time that fully Bayesian inference over a model of this size has been feasible without the need to resort to any approximations. Finally, a critical assessment of the strengths and weaknesses of the methodology is provided, with pointers to ongoing research.¹

¹ Code to replicate all results reported can be downloaded from http://www.ucl.ac.uk/roulette

1 Introduction

A fundamental open problem of growing importance in the widespread application of Markov chain Monte Carlo (MCMC) methods for Bayesian computation is the definition of transition kernels for target distributions with data densities that are analytically or computationally intractable. Specifically, we consider doubly-intractable distributions of the form originally described in Murray et al. (2006). As a working example, consider a Bayesian inference problem where data $y \in \mathcal{Y}$ is used to make posterior inferences about the variables $\theta \in \Theta$ that define a statistical model. A prior distribution defined by a density $\pi(\theta)$ is adopted and the data density is given by $p(y|\theta) = f(y;\theta)/Z(\theta)$, where $f(y;\theta)$ is an unnormalised function of the data and parameters and $Z(\theta) = \int f(x;\theta)\,dx$ is the associated normalising term. The posterior density follows in the usual form as

$$\pi(\theta|y) = p(y|\theta) \times \pi(\theta) \times \frac{1}{Z(y)} = \frac{f(y;\theta)}{Z(\theta)} \times \pi(\theta) \times \frac{1}{Z(y)}, \qquad (1)$$

where $Z(y) = \int p(y|\theta)\pi(\theta)\,d\theta$. Bayesian inference proceeds by taking posterior expectations of functions of interest, i.e.

$$\mathbb{E}_{\pi(\theta|y)}\{\varphi(\theta)\} = \int \varphi(\theta)\,\pi(\theta|y)\,d\theta, \qquad (2)$$

and Monte Carlo estimates of the above expectations can be obtained by employing MCMC methods (Gelman et al., 2003; Gilks, 1999; Liu, 2001; Robert and Casella, 2010). To construct a Markov chain whose

invariant distribution has density $\pi(\theta|y)$, a transition kernel is constructed by employing a proposal distribution $q(\theta'|\theta)$ and an acceptance probability equal to

$$\alpha(\theta',\theta) = \min\left\{1,\; \frac{f(y;\theta')\pi(\theta')}{f(y;\theta)\pi(\theta)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)} \times \frac{Z(\theta)}{Z(\theta')}\right\}. \qquad (3)$$

Assuming that all terms appearing in the expression for the acceptance probability are analytic or computable, the MCMC simulation can proceed in a standard manner. The problem arises when the value of the normalising term for the data density, $Z(\theta)$, cannot be obtained, either because it is non-analytic or because it is uncomputable with finite computational resource. This situation is far more widespread in modern statistical applications than a cursory review of the literature would suggest, and it forms one of the main current challenges for computational statistics methodology.

1.1 Intractable Target Distributions

There is a growing class of statistical models for which the data distribution is doubly-intractable, and thus exact inference over such models is infeasible. The doubly-intractable nature of the data density considered in this paper is distinct from the situation common in Approximate Bayesian Computation (ABC), where even computation of the function $f(y;\theta)$ is infeasible. In ABC, there is no alternative but to develop methodology that is approximate, and this is a major area of current research activity in computational statistics, with examples of recent contributions being Fearnhead and Prangle (2012); Marin et al. (2011); Mengersen et al. (2013). The distributions considered in this paper are doubly-intractable in the sense that the normalising constant is intractable; see also Atchadé et al. (2013) for an asymptotically consistent approach to the problem. Atchadé et al. (2013) construct an adaptive MCMC scheme with two parallel processes, one performing the sampling while the other approximates the normalising constant. The approximation is achieved by spatial smoothing of an importance sampling estimate computed over a grid. In practice the approach suffers from the curse of dimensionality, as the quality of the importance sampling estimate depends on the number and location of the grid points, which need to grow exponentially with the dimension of the space; this obviously limits the applicability of the methodology. Likewise, Caimo and Friel (2011); Everitt (2013); Liang (2010) consider approximate methods applied to inference over lattice-based models and the emerging area of social networks.

The areas of application that require inference over doubly-intractable distributions are many. In directional statistics, see Walker (2011) for a specific sampling scheme for the parameters of the Fisher-Bingham distribution; in machine learning, model structures such as Boltzmann machines induce intractable posterior distributions (Salakhutdinov and Hinton, 2009); the intractable nature of distributions for stochastic diffusions has been addressed in Beskos et al. (2006); Fearnhead et al. (2008), where the Poisson and Generalised Poisson estimators are developed; likewise in Markov Random Field (MRF) models, e.g. Ising, Potts colouring, autologistic, and spatial point process models, the intractable nature of the normalising terms needs to be addressed, e.g. Caimo and Friel (2011); Gu and Zhu (2001); Hughes et al. (2011). Finally, in massive-scale Gaussian Markov Random Field (GMRF) models (Aune et al., 2012), approximate methods based on composite likelihoods (Eidsvik et al., 2013) have been explored to address the difficulty of computing the normalising term in such massive mesh-based models.

1.2 Introducing auxiliary variables

An exact sampling methodology for distributions with intractable normalising constants is proposed in Walker (2011), where a Reversible-Jump MCMC (RJMCMC) sampling scheme is developed that cleverly gets around the intractable nature of the normalising term. Consider the univariate distribution $p(y|\theta) = f(y;\theta)/Z(\theta)$, where $N$ i.i.d. observations $y_i$ are available. In its most general form, the method requires that $y$ belongs to some bounded interval $[a,b]$ and that there exists a constant $M < +\infty$ such that $f(y;\theta) < M$ for all $\theta$ and $y$ (here it is assumed that $[a,b] = [0,1]$ and $M = 1$). The method introduces auxiliary


variables $\nu \in (0,\infty)$, $k \in \{0,1,\ldots\}$, $\{s\}^{(k)} = (s_1,\ldots,s_k)$, with joint density

$$f(\nu, k, \{s\}^{(k)}, y|\theta) \propto \frac{\exp(-\nu)\,\nu^{k+N-1}}{k!} \prod_{j=1}^{k}\big(1 - f(s_j;\theta)\big)\,\mathbb{1}(0 < s_j < 1) \prod_{i=1}^{N} f(y_i;\theta).$$

Integrating out $\nu$ and $s^{(k)}$ and summing over all $k$ returns the data distribution $\prod_{i=1}^{N} p(y_i|\theta)$. An RJMCMC scheme is proposed to sample from the joint density $f(\nu, k, \{s\}^{(k)}, y|\theta)$, and this successfully gets around the intractable nature of the normalising term. However, the methodology has some limitations to its generality. Firstly, the unnormalised density function must be strictly bounded from above to ensure the positivity of the terms in the first product. This restricts the methodology to the class of strictly bounded functions; however, this is not overly restrictive, as many functional forms for $f(y_i;\theta)$ are bounded, e.g. when there is finite support, or when $f(y_i;\theta)$ takes an exponential form with strictly negative argument. Even if the function to be sampled is bounded, finding bounds that are tight is extremely difficult, and the choice of bound directly impacts the efficiency of the sampling scheme constructed; see e.g. Ghaoui and Gueye (2009) for bounds on binary lattice models. Ideally we would wish to relax the requirement for the data $y$ to belong to a bounded interval, but if we integrate with respect to each $s_j$ over an unbounded interval then we can no longer return $1 - Z(\theta)$, and the sum over $k$ will therefore no longer define a convergent geometric series equalling $Z(\theta)$. This last requirement particularly restricts the generality and further use of this specific sampling method for intractable distributions.
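To see why the marginalisation works, the following sketch traces the integration under Walker's assumption that $[a,b] = [0,1]$, so that $\int_0^1 f(s;\theta)\,ds = Z(\theta)$; it is a reconstruction from the stated joint density rather than a derivation given in the original text. Integrating each $s_j$ over $(0,1)$ and then $\nu$ over $(0,\infty)$ gives

$$\int_0^1 \big(1 - f(s_j;\theta)\big)\,ds_j = 1 - Z(\theta), \qquad \int_0^\infty e^{-\nu}\,\nu^{k+N-1}\,d\nu = (k+N-1)!\,,$$

so summing over $k$ yields

$$\sum_{k=0}^{\infty} \frac{(k+N-1)!}{k!}\,\big(1-Z(\theta)\big)^k \prod_{i=1}^{N} f(y_i;\theta) = \frac{(N-1)!}{Z(\theta)^{N}} \prod_{i=1}^{N} f(y_i;\theta) \;\propto\; \prod_{i=1}^{N} p(y_i|\theta),$$

using the negative binomial series $\sum_k \binom{k+N-1}{N-1} x^k = (1-x)^{-N}$ for $|x| < 1$, which is exactly where the convergence requirement on $1 - Z(\theta)$, and hence the boundedness of the support, enters.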

1.3 Valid Metropolis-Hastings-type transition kernels

Two ingenious MCMC solutions to the doubly-intractable problem were proposed in Møller et al. (2006) and Murray et al. (2006), where Markov chain transition kernels employing single-sample unbiased estimates of the ratio of the troublesome normalising constants are used to define correct MCMC sampling schemes. From standard equalities of ratios of normalising terms, Murray et al. (2006) employed the unbiased estimate

$$\frac{Z(\theta)}{Z(\theta')} \approx \frac{f(x;\theta)}{f(x;\theta')} \quad \text{where} \quad x \sim \frac{f(x;\theta')}{Z(\theta')},$$

which is plugged into the expression for the acceptance probability, thus defining a noisy transition kernel as discussed in e.g. Nicholls et al. (2012). These methods present a major methodological step forward, but they require the capability to draw perfect samples from the model, e.g. Propp and Wilson (1996), which restricts the widespread applicability of this class of methods. Attempts have been made to relax the requirement of perfect sampling from the model, such as the Double Metropolis-Hastings Sampler of Liang (2010). However, despite claims of exactness, this scheme is only approximate and does not guarantee convergence to the correct invariant distribution. Additionally, a number of 'Inexact-Approximate' schemes have been suggested, with analysis of the gap between the true target and the distribution to which the various chains converge; see, for example, Nicholls et al. (2012); Wang and Atchadé (2013), with applications to inference in network and lattice models motivating this strand of work, e.g. Caimo and Friel (2011); Everitt (2013). However, methodology that provides convergence to the exact target distribution is desirable, and therefore we turn to the 'Exact-Approximate' class of simulation-based methods.

2 Exact-Approximate Sampling Methods

The Exact-Approximate class of methods is particularly appealing in that they have the fewest restrictions placed upon them and provide the most general and extensible MCMC methods for intractable distributions. To appeal to the Exact-Approximate or pseudo-marginal MCMC schemes (Andrieu and Roberts, 2009; Beaumont, 2003; Doucet et al., 2012) for such distributions, we require unbiased and positive estimates of the target density, giving an acceptance probability of the form

$$\alpha(\theta',\theta) = \min\left\{1,\; \frac{\hat{\pi}(\theta'|y)}{\hat{\pi}(\theta|y)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)}\right\}, \qquad (4)$$

where the estimate at each proposal is propagated forward as described in Andrieu and Roberts (2009); Beaumont (2003). The remarkable feature of this scheme is that the corresponding transition kernel has an invariant distribution with $\theta$-marginal given precisely by the desired posterior distribution, $\pi(\theta|y)$. In fact this is a rather straightforward consequence of the appearance of the Monte Carlo error in the plug-in estimates of the target distribution. Representing the Monte Carlo estimation error with a random variable $\xi \sim P_\theta$ and writing $\hat{\pi}(\theta|y) = \pi(\theta,\xi|y)$, it is straightforward to check that the Metropolis-Hastings algorithm with target $\Pi(d\theta, d\xi|y) := \pi(\theta,\xi|y)\,d\theta\,P_\theta(d\xi)$ and proposal $Q(\theta,\xi; d\theta', d\xi') := q(\theta'|\theta)\,d\theta'\,P_{\theta'}(d\xi')$ has acceptance probability given by (4), and, given the unbiasedness of $\pi(\theta,\xi|y)$, $\Pi(d\theta, d\xi|y)$ clearly admits the required target $\pi(\theta|y)$ as its marginal distribution. This result was highlighted in the statistical genetics literature (Beaumont, 2003), then popularised and formally analysed in Andrieu and Roberts (2009), with important developments such as Particle MCMC (Doucet et al., 2012) proving to be extremely powerful and useful in a large class of emerging statistical models. It is interesting to note that the problem of Exact-Approximate inference was first considered in the Quantum Chromodynamics literature almost thirty years ago, predating the statistics literature by some twenty years. This was motivated by the need to reduce the computational effort of obtaining values for the strength of bosonic fields in defining a Markov process to simulate configurations following a specific law; see for example Bakeyev and De Forcrand (2001); Bhanot and Kennedy (1985); Joo et al. (2003); Kennedy and Kuti (1985); Lin et al. (2000).
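As a concrete illustration of the mechanism, the following minimal sketch implements a generic pseudo-marginal random walk Metropolis step. It is not the implementation used in this paper: `log_prior` and `unbiased_lik_estimate` are hypothetical placeholders to be supplied for a specific model, and the likelihood estimate is assumed positive here (signed estimates are treated in Section 2.2).

```python
import numpy as np

def pseudo_marginal_mh(theta0, unbiased_lik_estimate, log_prior,
                       n_iters=10000, step=0.5, rng=None):
    """Pseudo-marginal Metropolis-Hastings: the likelihood estimate for the
    current state is recycled, never refreshed, which is what preserves the
    exact target as the theta-marginal of the invariant distribution."""
    rng = np.random.default_rng() if rng is None else rng
    theta, lik = theta0, unbiased_lik_estimate(theta0, rng)
    samples = []
    for _ in range(n_iters):
        # Symmetric random walk proposal, so the q-ratio in (4) cancels.
        theta_prop = theta + step * rng.standard_normal(np.shape(theta))
        lik_prop = unbiased_lik_estimate(theta_prop, rng)  # fresh estimate at proposal only
        log_alpha = (np.log(lik_prop) + log_prior(theta_prop)
                     - np.log(lik) - log_prior(theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, lik = theta_prop, lik_prop  # accept: carry the estimate forward
        samples.append(theta)
    return np.array(samples)
```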

2.1 Proposed methodology

This paper takes inspiration from the Exact-Approximate framework (Andrieu and Roberts, 2009) and the previously published Physics literature addressing the problem of Exact-Approximate simulation for doubly-intractable targets. For each $\theta$ and $y$, we show that one can construct random variables $\{V_\theta^{(j)}, j \geq 0\}$ (where dependence on $y$ is omitted) such that the series

$$\hat{\pi}(\theta, \{V_\theta^{(j)}\}|y) := \sum_{j=0}^{\infty} V_\theta^{(j)}$$

is finite almost surely, has finite expectation, and satisfies $\mathbb{E}\big[\hat{\pi}(\theta, \{V_\theta^{(j)}\}|y)\big] = \pi(\theta|y)$. We propose two such series expansions of the posterior distribution, one using a convergent geometric series expansion (Section 3.1), and the other defined by a Maclaurin series expansion (Section 3.2). Although unbiased, these estimators are not practical as they involve infinite series. We propose a solution that introduces a computationally feasible truncation of the infinite sum which, crucially, remains unbiased. We achieve this goal using one of a number of possible Russian Roulette procedures well known in the Physics literature. More precisely, we introduce a random time $\tau_\theta$ such that, with $\xi := (\tau_\theta, \{V_\theta^{(j)}, 0 \leq j \leq \tau_\theta\})$, the estimate

$$\hat{\pi}(\theta, \xi|y) := \sum_{j=0}^{\tau_\theta} V_\theta^{(j)}$$

satisfies

$$\mathbb{E}\big[\hat{\pi}(\theta, \xi|y)\,\big|\,\{V_\theta^{(j)}, j \geq 0\}\big] = \sum_{j=0}^{\infty} V_\theta^{(j)}.$$

2.2 The Sign Problem

If the known function $f(y;\theta)$ forming the estimate of the target is bounded, then theoretically the whole procedure can proceed without difficulty, assuming the bound provides efficiency of sampling. However, in the much more general situation where the function is not bounded, there is a complication: the unbiased estimate $\hat{\pi}(\theta,\xi|y)$ is not guaranteed to be nonnegative (although its expectation is nonnegative). This prevents us from directly plugging the estimator $\hat{\pi}(\theta,\xi|y)$ into the Exact-Approximate framework in the case of unbounded functions.

The problem of such unbiased estimators returning negative-valued estimates turns out to be a well-studied issue in the Quantum Monte Carlo literature, see e.g. Lin et al. (2000). It is known as the Sign Problem², which in its most general form is NP-hard (non-deterministic polynomial-time hard) (Troyer and Wiese, 2005), and at present it seems that a completely general and practical solution is an outstanding open problem in both computational statistics and quantum Monte Carlo.

² Workshops devoted to the Sign Problem, e.g. the International Workshop on the Sign Problem in QCD and Beyond, are held regularly: http://www.physik.uni-regensburg.de/sign2012/

As we cannot guarantee in the general case that the sign problem will not occur, we follow Lin et al. (2000) and show that, with a weighting of expectations, it is still possible to compute any integral of the form $\int h(\theta)\pi(\theta|y)\,d\theta$ by Markov chain Monte Carlo. Without any loss of generality, we can write $\hat{\pi}(\theta,\xi|y) = \hat{p}(\theta,\xi|y)/Z(y)$, where $Z(y)$ is some intractable normaliser, $\hat{p}(\theta,\xi|y)$ can be evaluated, and, by the unbiasedness of $\hat{\pi}(\theta,\xi|y)$, $Z(y) = \int\!\!\int \hat{p}(\theta,\xi|y)\,P_\theta(d\xi)\,d\theta$. Although the measure $\hat{p}(\theta,\xi|y)\,P_\theta(d\xi)\,d\theta/Z(y)$ integrates to one, it is not a probability measure because of the sign problem. Define $\sigma(\theta,\xi|y) := \mathrm{sign}(\hat{p}(\theta,\xi|y))$, where $\mathrm{sign}(x) = 1$ when $x > 0$, $\mathrm{sign}(x) = -1$ if $x < 0$, and $\mathrm{sign}(x) = 0$ if $x = 0$. Furthermore, define $|\hat{p}|(\theta,\xi|y) := |\hat{p}(\theta,\xi|y)|$ as the absolute value of the measure, so that $\hat{p}(\theta,\xi|y) = \sigma(\theta,\xi|y)\,|\hat{p}|(\theta,\xi|y)$. Suppose that we wish to compute the expectation

$$\int h(\theta)\pi(\theta|y)\,d\theta = \int\!\!\int h(\theta)\,\hat{\pi}(\theta,\xi|y)\,P_\theta(d\xi)\,d\theta.$$

We can write the above integral as

$$\int h(\theta)\pi(\theta|y)\,d\theta = \frac{1}{Z(y)}\int\!\!\int h(\theta)\,\hat{p}(\theta,\xi|y)\,P_\theta(d\xi)\,d\theta = \frac{\int\!\!\int h(\theta)\,\sigma(\theta,\xi|y)\,|\hat{p}|(\theta,\xi|y)\,P_\theta(d\xi)\,d\theta}{\int\!\!\int \sigma(\theta,\xi|y)\,|\hat{p}|(\theta,\xi|y)\,P_\theta(d\xi)\,d\theta} = \frac{\int\!\!\int h(\theta)\,\sigma(\theta,\xi|y)\,\check{\pi}(d\theta,d\xi|y)}{\int\!\!\int \sigma(\theta,\xi|y)\,\check{\pi}(d\theta,d\xi|y)}, \qquad (5)$$

where $\check{\pi}(d\theta,d\xi|y)$ is the distribution

$$\check{\pi}(d\theta,d\xi|y) := \frac{|\hat{p}|(\theta,\xi|y)\,d\theta\,P_\theta(d\xi)}{\int\!\!\int |\hat{p}|(\theta,\xi|y)\,d\theta\,P_\theta(d\xi)}.$$

As in the Exact-Approximate framework outlined above, the Metropolis-Hastings algorithm with target distribution $\check{\pi}(d\theta,d\xi|y)$ and proposal kernel $Q(\theta,\xi; d\theta', d\xi') = q(\theta'|\theta)\,d\theta'\,P_{\theta'}(d\xi')$ has acceptance probability

$$\min\left\{1,\; \frac{|\hat{p}|(\theta',\xi'|y)}{|\hat{p}|(\theta,\xi|y)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)}\right\}.$$

Such a Metropolis-Hastings scheme is easily implementable and, in view of (5), the output of this MCMC procedure gives an importance-sampling-type estimate for the desired expectation $\int h(\theta)\pi(\theta|y)\,d\theta$. We describe the procedure more systematically in Section 10, and we discuss in particular how to compute the effective sample size of the resulting Monte Carlo estimate. The following section addresses the issue of constructing the unbiased estimator to be used in the overall MCMC scheme.
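A sketch of the resulting estimator: given a chain targeting the absolute measure $\check{\pi}$, expectations are recovered as a ratio of sign-weighted averages, as in (5). The function names are illustrative, and the effective-sample-size deflation shown is the standard rough characterisation from the sign-problem literature rather than a formula taken from this paper.

```python
import numpy as np

def sign_corrected_expectation(h_values, signs):
    """Estimate E[h(theta)] under the true posterior from MCMC output that
    targets the absolute measure: ratio of sign-weighted sums, as in (5)."""
    h_values, signs = np.asarray(h_values), np.asarray(signs)
    return np.sum(signs * h_values) / np.sum(signs)

def sign_deflated_ess(signs):
    """Rough effective-sample-size deflation due to the sign problem:
    n * (mean sign)^2; as the average sign approaches zero, the variance of
    the ratio estimator blows up."""
    signs = np.asarray(signs)
    return len(signs) * np.mean(signs) ** 2
```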

3 Unbiased Estimation of the Likelihood Function

The foundational component of Exact-Approximate MCMC is an unbiased and positive estimator of the target density. In the general methodology developed here, an unbiased estimator of the normalising term is required. However, to ensure that expectations with respect to the target distribution are preserved for unbounded functions, it is not essential for the estimate of the intractable distribution to be strictly positive, and we exploit this characteristic in our methodology. In outline, the intractable distribution is first written in terms of a nonlinear function of the non-analytic/uncomputable normalising term. For example, in Equation (1), the nonlinear function is the simple reciprocal $1/Z(\theta)$; an equivalent, albeit somewhat convoluted, representation would be $\exp(-\log Z(\theta))$.

This function is then represented by a convergent Maclaurin series expansion in terms of independent, unbiased estimates $\hat{Z}(\theta)$ of the normalising term. The infinite series expansion can then be stochastically truncated without introducing bias. These two components, (1) unbiased independent estimates of the normalising constant and (2) unbiased stochastic truncation of the infinite series representation, produce an unbiased, though not strictly positive, estimate of the intractable distribution. The final two components of the overall methodology consist of (3) constructing an MCMC scheme which targets a distribution proportional to the absolute value of the unbiased estimator, and (4) computing Monte Carlo estimates with respect to the desired intractable distribution as detailed in the previous section. In the following sections, we present two forms of unbiased estimators of the target distribution (not strictly positive for unbounded functions) that employ unbiased estimates of $Z(\theta)$. Such unbiased estimates of the normalising term can straightforwardly be obtained with carefully designed importance sampling estimators, such as Sequential Monte Carlo samplers (Del Moral et al., 2006) or similar schemes such as Annealed Importance Sampling (AIS) (Neal, 1998).

3.1 Geometrically Tilted Estimator

Consider a tilted form of a density function, which forms the basis of saddlepoint approximation methods, see e.g. Goutis and Casella (1999); Reid (1988). Such tilted expressions, where an embedding into an exponential family is constructed, are the basis for the classical cumulant-based finite truncation approximations, such as the Edgeworth and Gram-Charlier expansions (Hall, 1992). In this context, however, we employ the tilting to remove the bias from an approximation and produce an unbiased estimator of a density. The target density can be written as a geometrically tilted correction of the biased estimator $\tilde{\pi}(\theta|y) = f(y;\theta)/\tilde{Z}(\theta)$, where $\tilde{Z}(\theta)$ is some approximation, e.g. an unbiased importance sampling estimate, an upper bound, or a deterministic approximation. Then, using the multiplicative correction

$$\pi(\theta|y) = \tilde{\pi}(\theta|y) \times c(\theta)\left[1 + \sum_{n=1}^{\infty} \kappa(\theta)^n\right], \qquad (6)$$

it is straightforward to note that, given $Z(\theta)$ and $\tilde{Z}(\theta)$ are strictly positive, the representation holds with $\kappa(\theta) = 1 - c(\theta)Z(\theta)/\tilde{Z}(\theta)$, where the constant $c(\theta)$ ensures $|\kappa(\theta)| < 1$. The convergence of the geometric series ensures that

$$\pi(\theta|y) = \tilde{\pi}(\theta|y) \times c(\theta)\left[1 + \sum_{n=1}^{\infty} \kappa(\theta)^n\right] = \tilde{\pi}(\theta|y) \times \frac{c(\theta)}{1-\kappa(\theta)} = \tilde{\pi}(\theta|y) \times \frac{\tilde{Z}(\theta)}{Z(\theta)}.$$

Therefore, with an infinite number of independent unbiased estimates of $Z(\theta)$, each denoted $\hat{Z}_i(\theta)$, an unbiased estimate of the target density is

$$\hat{\pi}(\theta|y) = \tilde{\pi}(\theta|y) \times c(\theta)\left[1 + \sum_{n=1}^{\infty} \prod_{i=1}^{n}\left(1 - c(\theta)\frac{\hat{Z}_i(\theta)}{\tilde{Z}(\theta)}\right)\right]. \qquad (7)$$

Notice that the series in (7) is finite a.s. and we can interchange summation and expectation if

$$\mathbb{E}\left|1 - c(\theta)\frac{\hat{Z}_i(\theta)}{\tilde{Z}(\theta)}\right| < 1.$$

Since $\mathbb{E}(|X|) \leq \mathbb{E}^{1/2}(|X|^2)$, a sufficient condition for this is $0 < c(\theta) < 2Z(\theta)\tilde{Z}(\theta)/\mathbb{E}\big(\hat{Z}_1^2(\theta)\big)$, which is slightly more stringent than $|\kappa(\theta)| < 1$. Under this assumption, the expectation of $\hat{\pi}(\theta|y)$ is

$$\mathbb{E}\{\hat{\pi}(\theta|y)\} = \tilde{\pi}(\theta|y) \times c(\theta)\left[1 + \sum_{n=1}^{\infty} \prod_{i=1}^{n}\left(1 - c(\theta)\frac{\mathbb{E}\{\hat{Z}_i(\theta)\}}{\tilde{Z}(\theta)}\right)\right] = \tilde{\pi}(\theta|y) \times c(\theta)\left[1 + \sum_{n=1}^{\infty} \kappa(\theta)^n\right] = \pi(\theta|y).$$


Therefore, the essential property $\mathbb{E}\{\hat{\pi}(\theta|y)\} = \pi(\theta|y)$ required for Exact-Approximate MCMC is satisfied by this tilted correction. The use of a convergent geometric series has been exploited previously by Booth (2007) to estimate the reciprocal of an integral. The following sections describe a number of ways to stochastically truncate the infinite sum in Equation (7) in a fully unbiased manner, thus providing a practical solution to the problem of obtaining an unbiased estimator. There are two difficulties with this estimator. It will be difficult in practice to find a $c(\theta)$ that ensures the series in (7) is convergent in the absence of knowledge of the actual value of $Z(\theta)$. By ensuring that $\tilde{Z}(\theta)/c(\theta)$ is a strict upper bound on $Z(\theta)$, denoted $Z_U$, guaranteed convergence (and positivity) of the geometric series is established. However, this may not be a satisfactory construction, as upper bounds on normalising constants are typically loose (see, for example, Ghaoui and Gueye, 2009), making the ratio $Z(\theta)/Z_U$ extremely small and, therefore, $\kappa(\theta) \approx 1$. In this case, the convergence of the geometric series will be particularly slow, which has practical implications for the stochastic truncation of the infinite series, as will be seen in the following sections. A more practical approach is to take a pilot-run estimate $Z_P(\theta)$ of the normalising constant, characterising the level of estimator variance, and then use this to control the variability of the ratio $c(\theta)Z(\theta)/Z_P(\theta)$ by careful selection of the global variable $c(\theta)$. An alternative to the geometrically tilted estimate, which does not have the practical issue of ensuring that the region of convergence is maintained, is now described.
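For concreteness, here is a sketch of how the series terms in (7) can be generated from a stream of unbiased estimates $\hat{Z}_i(\theta)$. The function and argument names are illustrative, and the stochastic truncation of the returned terms is deferred to Section 4.

```python
import numpy as np

def geometric_series_terms(z_hats, c, z_tilde):
    """Terms phi_0, phi_1, ... of the tilted geometric series in (7):
    phi_0 = 1 and phi_n = prod_{i<=n} (1 - c * Zhat_i / Ztilde).
    The unbiased density estimate is pi_tilde * c * (truncated sum of terms)."""
    terms, prod = [1.0], 1.0
    for z_hat in z_hats:
        prod *= 1.0 - c * z_hat / z_tilde
        terms.append(prod)
    return np.array(terms)
```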

3.2 Unbiased estimators using an exponential auxiliary variable

The introduction of an auxiliary variable $\nu \sim \mathrm{Expon}(Z(\theta))$ defines a joint distribution of the form

$$\pi(\theta,\nu|y) = \big[Z(\theta)\exp(-\nu Z(\theta))\big] \times f(y;\theta) \times \frac{1}{Z(\theta)} \times \pi(\theta) \times \frac{1}{Z(y)} \qquad (8)$$

$$\phantom{\pi(\theta,\nu|y)} = \exp(-\nu Z(\theta)) \times f(y;\theta) \times \pi(\theta) \times \frac{1}{Z(y)}. \qquad (9)$$

Therefore, an Exact-Approximate sampling scheme can be constructed by using the following unbiased estimator:

$$\hat{\pi}(\theta,\nu|y) = \widehat{\exp(-\nu Z(\theta))} \times f(y;\theta) \times \pi(\theta) \times \frac{1}{Z(y)}.$$

The Maclaurin series expansion of the exponential function gives

$$\exp(-\nu Z(\theta)) = 1 + \sum_{n=1}^{\infty} \frac{(-\nu)^n}{n!}\, Z(\theta)^n, \qquad (10)$$

suggesting an unbiased estimator of the form

$$\widehat{\exp(-\nu Z(\theta))} = 1 + \sum_{n=1}^{\infty} \frac{(-\nu)^n}{n!} \prod_{i=1}^{n} \hat{Z}_i(\theta), \qquad (11)$$

where $\{\hat{Z}_i(\theta), i \geq 1\}$ are i.i.d. random variables with expectation equal to $Z(\theta)$. Since $n!$ grows faster than the exponential, this series is always well defined (finite almost surely). The issue of unbiased truncation of the series will be discussed in the following section, though at this point it should be noted that the series is alternating in sign and, as such, any stochastic truncation of the Maclaurin expansion will not be guaranteed to be positive, although this is not a requirement for the methodology. As in the previous section, a preliminary estimate or approximation $\tilde{Z}(\theta)$ (e.g. an upper bound, if one is available) can be employed by noting that

$$\exp(-\nu Z(\theta)) = \exp(-\nu\tilde{Z}(\theta)) \times \exp\big(\nu(\tilde{Z}(\theta) - Z(\theta))\big) = \exp(-\nu\tilde{Z}(\theta)) \times \left(1 + \sum_{n=1}^{\infty} \frac{\nu^n}{n!}\big(\tilde{Z}(\theta) - Z(\theta)\big)^n\right),$$

which yields a tilted estimator of the form

$$\widehat{\exp(-\nu Z(\theta))} = \exp(-\nu\tilde{Z}(\theta)) \times \left(1 + \sum_{n=1}^{\infty} \frac{\nu^n}{n!} \prod_{i=1}^{n}\big(\tilde{Z}(\theta) - \hat{Z}_i(\theta)\big)\right). \qquad (12)$$

In Fearnhead et al. (2008), the Generalised Poisson Estimator, originally proposed in Beskos et al. (2006), is employed to estimate transition functions that are similar to (12). Here again, this series is finite almost surely with finite expectation. The choice of which estimator to employ will be problem dependent; in situations where it is difficult to guarantee convergence of the geometric series, this form of estimator may be more suitable. The final element of the proposed methodology is the unbiased truncation of the infinite series estimators, which is discussed in the following section.
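A corresponding sketch for the terms of the tilted Maclaurin series (12); again the names are illustrative, and the terms would be fed to one of the truncation schemes of Section 4. Note that the terms can alternate in sign, which is why the sign correction of Section 2.2 may be needed.

```python
import numpy as np
from math import factorial

def exp_series_terms(nu, z_hats, z_tilde):
    """Terms of the tilted series (12): the n-th term is
    exp(-nu * Ztilde) * nu^n / n! * prod_{i<=n} (Ztilde - Zhat_i)."""
    terms, prod = [1.0], 1.0
    for n, z_hat in enumerate(z_hats, start=1):
        prod *= z_tilde - z_hat
        terms.append(nu ** n / factorial(n) * prod)
    return np.exp(-nu * z_tilde) * np.array(terms)
```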

4 Unbiased Truncation of Infinite Sums: Russian Roulette

Two unbiased estimators of nonlinear functions of a normalising constant have been considered. Both rely on the availability of an unbiased estimator of the normalising constant and a series representation of the nonlinear function. We now require a computationally feasible means of obtaining the desired estimator without explicitly computing the infinite sum and, crucially, without introducing any bias into the final estimate. It transpires that there are a number of ways to randomly truncate the convergent infinite sum $S(\theta) = \sum_{i=0}^{\infty} \phi_i(\theta)$ in an unbiased manner; see Papaspiliopoulos (2011) for a good review of such methods.

4.1 Single Term Weighted Truncation

The simplest unbiased truncation method is to define a set of probabilities $\{q_k\}$, draw an integer index $k$ with probability $q_k$, and return $\phi_k(\theta)/q_k$ as the estimator. It is easy to see that the estimator is unbiased, as $\mathbb{E}\{\hat{S}(\theta)\} = \sum_k q_k\,\phi_k(\theta)/q_k = S(\theta)$. The probabilities should be chosen to minimise the variance of the estimator, see e.g. Fearnhead et al. (2008). As an example, each index could be drawn from a Poisson distribution, $k \sim \mathrm{Poiss}(\lambda)$, with $q_k = \lambda^k \exp(-\lambda)/k!$. The problem with Poisson truncation in the case of a geometric series, where $\phi_k(\theta) = \phi(\theta)^k$, is that the corresponding variance is infinite, since the factorial $k!$ grows faster than the exponential. If instead we use a geometric distribution as the importance distribution, the variance is finite subject to some conditions on the choice of $p$, the parameter of the geometric distribution. To see this, note that, as $k$ is chosen with probability $q_k = p^k(1-p)$, the second moment $\mathbb{E}[\hat{S}^2] = \sum_{k=0}^{\infty} \phi_k^2/q_k = \sum_{k=0}^{\infty} \phi_k^2/\big(p^k(1-p)\big)$ is finite if $\lim_{k\to\infty} \big|\phi_{k+1}^2/(p\,\phi_k^2)\big| < 1$.
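A minimal sketch of this truncation with the geometric importance distribution; `phi` is a hypothetical function returning the k-th series term.

```python
import numpy as np

def single_term_estimate(phi, p, rng=None):
    """Single Term Weighted Truncation: draw k with probability
    q_k = p^k (1 - p) and return phi(k) / q_k, an unbiased estimate of
    sum_k phi(k)."""
    rng = np.random.default_rng() if rng is None else rng
    k = rng.geometric(1.0 - p) - 1   # numpy's geometric is supported on {1, 2, ...}
    return phi(k) / (p ** k * (1.0 - p))
```

For the geometric series $\phi_k(\theta) = \phi(\theta)^k$, the finite-variance condition above amounts to requiring $p > \phi(\theta)^2$.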

4.2 Russian Roulette

An alternative unbiased truncation that exhibits superior performance in practice is based on a Monte Carlo scheme, known as Russian Roulette, which is employed extensively in the modelling of neutron scattering in nuclear physics and ray tracing in computer graphics. The procedure is based on a sequence of probabilities $\{q_j, j \geq 1\}$, $q_j \in (0,1]$. Using these numbers, and i.i.d. uniform $\mathcal{U}(0,1)$ random variables $\{U_j, j \geq 1\}$, we find the first time $k \geq 1$ such that $U_k \geq q_k$. The Russian Roulette estimate of $S(\theta)$ is then

$$\hat{S}(\theta) = \phi_0(\theta) + \sum_{j=1}^{k-1} \frac{\phi_j(\theta)}{\prod_{i=1}^{j} q_i},$$

where throughout we adopt the convention that an empty sum is zero and an empty product is one. We refer the reader to the Appendix for a more detailed discussion, where it is shown that if $\lim_{n\to\infty} \prod_{j=1}^{n} q_j = 0$, the Russian Roulette truncation terminates with probability one. Therefore, setting $S_k = \phi_0(\theta) + \sum_{j=1}^{k-1} \phi_j(\theta)/\prod_{i=1}^{j} q_i$, we have

$$\mathbb{E}\big[\hat{S}(\theta)\big] = \sum_{k=1}^{\infty}\left\{\prod_{i=1}^{k-1} q_i\right\}(1-q_k)\,S_k = \sum_{k=1}^{\infty} S_k\prod_{i=1}^{k-1} q_i - \sum_{k=1}^{\infty} S_k\prod_{i=1}^{k} q_i = \phi_0(\theta) + \sum_{k=1}^{\infty} S_{k+1}\prod_{i=1}^{k} q_i - \sum_{k=1}^{\infty} S_k\prod_{i=1}^{k} q_i = \phi_0(\theta) + \sum_{k=1}^{\infty} \phi_k(\theta) = S(\theta).$$

It can also be shown (see Appendix) that if $\sum_{n\geq 1} \phi_n S_{n+1} < \infty$, then the variance of $\hat{S}(\theta)$ is finite subject to certain conditions. For a geometric series $\phi_k(\theta) = \phi(\theta)^k$, if one chooses $q_j = q$, then these conditions hold provided $q > \phi(\theta)^2$. In general there is a trade-off between the computing time of the scheme and the variance of the returned estimate. If the selected $q_j$ are close to unity, the variance is small but the computing time is high; if the $q_j$ are close to zero, the computing time is fast but the variance can be very high, possibly infinite. Again in the case of the geometric series, choosing $q_j = q = \phi(\theta)$ works reasonably well in practice (see the sketch below). Now that the complete Exact-Approximate MCMC scheme has been detailed, the following section illustrates the methodology on some models that are doubly-intractable.
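The sketch below implements this truncation for a generic term function; `phi` and `q` are hypothetical callables (the j-th term and the j-th continuation probability), and the cap `max_terms` is only a safety guard added for illustration.

```python
import numpy as np

def russian_roulette_estimate(phi, q, rng=None, max_terms=1_000_000):
    """Russian Roulette truncation of S = sum_j phi(j): after term j - 1,
    survive to term j with probability q(j); surviving terms are reweighted
    by the inverse of their cumulative survival probability."""
    rng = np.random.default_rng() if rng is None else rng
    total, weight = phi(0), 1.0
    for j in range(1, max_terms):
        if rng.uniform() >= q(j):     # 'death': stop and return the running sum
            return total
        weight *= q(j)
        total += phi(j) / weight
    return total
```

For example, for a geometric series with ratio $\phi(\theta)$, taking `q = lambda j: phi_theta` reproduces the choice $q_j = \phi(\theta)$ suggested above.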

5 Experimental Evaluation

5.1 Ising Lattice Spin Models

Ising models are examples of doubly-intractable distributions over which it is challenging to perform inference. They form a prototype for priors for image segmentation and for autologistic models, e.g. Gu and Zhu (2001); Hughes et al. (2011); Møller et al. (2006). Current methods such as the Exchange algorithm (Murray et al., 2006) require access to a perfect sampler (Propp and Wilson, 1996), and will be computationally burdensome and restricted to applications where perfect simulation from the model is available. A practical alternative to perfect sampling is employed in Caimo and Friel (2011), where an auxiliary MCMC run is used to approximately simulate from the model. This is inexact and introduces bias, but it is hoped that the bias has little practical impact. We compare this approximate scheme with our exact methodology in this section. To apply our Exact-Approximate methodology, all that is required is the ability to produce an unbiased estimate of the normalising constant, which can be obtained using annealed importance sampling.

For an $N \times N$ grid of spins $y = (y_1,\ldots,y_{N^2})$, $y_i \in \{+1,-1\}$, the Ising model has likelihood

$$p(y;\alpha,\beta) = \frac{1}{Z(\alpha,\beta)} \exp\left(\alpha\sum_i y_i + \beta\sum_{i\sim j} y_i y_j\right), \qquad (13)$$

where $i$ and $j$ index the rows and columns of the lattice, and the notation $i \sim j$ denotes summation over all neighbours in the Markov blanket. The parameters $\alpha$ and $\beta$ indicate the strength of the external field and of the interactions between neighbours, respectively. The normalising constant,

$$Z(\alpha,\beta) = \sum_{y} \exp\left(\alpha\sum_i y_i + \beta\sum_{i\sim j} y_i y_j\right), \qquad (14)$$

requires summation over all $2^{N^2}$ possible configurations of the model, which is computationally infeasible even for moderately sized lattices.

Experiments were carried out on a small $10 \times 10$ lattice to enable a detailed comparison of the various algorithms. A configuration was simulated using a Gibbs sampler with parameters set at $\alpha = 0$ and $\beta = 0.2$. Inference was carried out over the posterior distribution $p(\beta|y)$ ($\alpha = 0$ was assumed fixed). A standard Metropolis-Hastings sampling scheme was used, with a normal proposal distribution centred at the current value and acceptance rates tuned to around 40%. A uniform prior on $[0,1]$ was set over $\beta$. At each iteration, an unbiased estimate of the normalised likelihood was obtained using the geometric series construction described above.
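As an illustration of the ingredients, a sketch of the unnormalised Ising log-likelihood $\log f(y;\alpha,\beta)$ from (13), which is what an annealed importance sampler needs to evaluate; free boundary conditions are assumed here, as the boundary convention is not stated in the text.

```python
import numpy as np

def ising_log_f(y, alpha, beta):
    """Unnormalised Ising log-likelihood from (13) on a square lattice of
    +/-1 spins, counting each horizontal and vertical neighbour pair once
    (free boundaries assumed)."""
    field = alpha * np.sum(y)
    interactions = beta * (np.sum(y[:-1, :] * y[1:, :])
                           + np.sum(y[:, :-1] * y[:, 1:]))
    return field + interactions
```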


Figure 1: Traces of samples using geometric tilting with (a) Russian Roulette, (b) Poisson truncation, and (c) the Exchange algorithm for comparison. Note that these samples were not drawn from the posterior distribution, p(β|y), but from the (normalised) absolute value of the estimated density.

            Roulette   Poisson   Exchange
Mean         0.2021     0.2009     0.2013
Std. dev.    0.0627     0.0630     0.0626
ESS          2913       3065       1727

Figure 2: Monte Carlo estimates of the mean and standard deviation of the posterior distribution p(β|y) using the three algorithms described. The geometric tilting estimates have been corrected for negative estimates.

Annealed importance sampling with 2,000 importance samples and 2,000 temperature levels was used to produce the unbiased estimates $\hat{Z}(\alpha,\beta)$. The infinite geometric series was truncated using both the Poisson truncation and Russian Roulette. For comparison, the posterior distribution was also sampled using the approximate form of the Exchange algorithm as advocated in Caimo and Friel (2011), with samples drawn at each iteration using a Gibbs sampler run for 50,000 steps. All three chains were run for 20,000 iterations and the second half of the samples used for subsequent Monte Carlo estimates. Obviously the exact posterior mean and standard deviation are not available for comparison, but the estimates from the three methods agree well, as seen in Table 2. The traces in Figure 1 show that the algorithms mix rather well, and Figures 4 and 5 show that the estimates of the mean and standard deviation agree well. If a smaller number of temperature levels or importance samples is used in the estimates $\hat{Z}(\alpha,\beta)$, then some sticking is seen in the trace. This is due to an overestimation of the likelihood, which makes it difficult for a subsequent proposal to be accepted. Therefore, the lower the variance of the estimates, the less severe the problem. The autocorrelation functions (Figure 3) and the effective sample sizes (Table 2) of both Russian Roulette and Poisson truncation outperform the approximate Exchange algorithm in this example; of course it is possible to improve the performance of our algorithm by using more computation, whereas this is not possible with the approximate Exchange algorithm. It should be noted that the approximate form of the Exchange algorithm in this guise is significantly less computationally intensive. However, in reality, if one wishes to converge to the exact posterior, a perfect sample would need to be drawn, and this would considerably increase the computation required. In terms of the effect of the Sign Problem in these simulations, there was only one negative estimate of $\hat{Z}(\theta)$ using Russian Roulette truncation and none using Poisson truncation. An upper bound is available for the Ising model, corresponding to setting all spins to +1; however, as this bound is very loose, it is impractical in this context. Hence the availability of a method to deal with negative estimates frees us, in this case, from atrocious upper bounds that would explode the asymptotic variance of the chains. Furthermore, as negative estimates were extremely rare, the variance inflation due to the sign trick is almost non-existent.



Figure 3: Autocorrelation plots for samples drawn from the posterior distribution p(β|y) of the 10 × 10 Ising model using three methods: (a) Geometric with Roulette truncation, (b) Geometric with Poisson truncation, and (c) Approximate Exchange.


Figure 4: Plots of the running mean for the posterior distribution p(β|y) of a 10 × 10 Ising model using three methods: (a) Geometric with Roulette truncation, (b) Geometric with Poisson truncation, and (c) Approximate Exchange.


Figure 5: Plots of the running standard deviation for the posterior distribution p(β|y) of a 10 × 10 Ising model using three methods: (a) Geometric with Roulette truncation, (b) Geometric with Poisson truncation, and (c) Approximate Exchange.


5.2 The Fisher-Bingham Distribution on a Sphere

The Fisher-Bingham distribution (Kent, 1982) is constructed by constraining a multivariate Gaussian vector to lie on the surface of a $d$-dimensional unit-radius sphere, $S_d$. Its form is

$$p(y|A) \propto \exp\{y' A y\}, \qquad (15)$$

where $A$ is a $d \times d$ symmetric matrix and, from here on, we take $d = 3$. After rotation to principal axes, $A$ is diagonal and so the probability density can be written

$$p(y|\lambda) \propto \exp\left\{\sum_{i=1}^{d} \lambda_i y_i^2\right\}. \qquad (16)$$

This is invariant under addition of a constant factor to each $\lambda_i$, so for identifiability we take $0 = \lambda_1 \geq \lambda_2 \geq \lambda_3$. The normalising constant $Z(\lambda)$ is given by

$$Z(\lambda) = \int_{S} \exp\left\{\sum_{i=1}^{d} \lambda_i y_i^2\right\} \mu(dy), \qquad (17)$$

where $\mu(dy)$ represents Hausdorff measure on the surface of the sphere. Very few papers have presented Bayesian posterior inference over this distribution due to the intractable nature of $Z(\lambda)$. However, in a recent paper, Walker (2011) uses the auxiliary variable method outlined in the Introduction to sample from $p(\lambda|y)$. We can apply our version of the Exact-Approximate methodology, as we can use importance sampling to obtain unbiased estimates of the normalising constant.

Twenty data points were simulated using an MCMC sampler with $\lambda = [0, 0, -2]$, and posterior inference was carried out by drawing samples from $p(\lambda_3|y)$, i.e. it was assumed that $\lambda_1 = \lambda_2 = 0$. Our Exact-Approximate methodology was applied using geometric tilting with Russian Roulette truncation. A uniform distribution on the surface of the sphere was used to draw importance samples for the estimates of $Z(\lambda)$. The proposal distribution for the parameters was Gaussian with mean given by the current value, a uniform prior on $[-5, 0]$ was set over $\lambda_3$, and the chain was run for 20,000 iterations. Walker's auxiliary variable technique was also implemented for comparison using the same prior, with the chain run for 200,000 samples and then thinned by taking every 10th sample to reduce the strong autocorrelation between samples. In each case the final 10,000 samples were used for Monte Carlo estimates. In the Russian Roulette method, only six negative estimates were observed in 10,000 estimates. The estimates of the mean and standard deviation of the posterior agree well (Table 7); however, the effective sample size and autocorrelation of our method are superior, as seen in Figure 6. Note that it is also possible to obtain an upper bound on the importance sampling estimates for the Fisher-Bingham distribution: if we change the identifiability constraint to $0 = \lambda_1 \leq \lambda_2 \leq \lambda_3$, we have a convex sum in the exponent which can be maximised by giving unit weight to the largest $\lambda$, i.e. $\sum_{i=1}^{d} \lambda_i y_i^2 < \lambda_{\max}$, and we can compute $\tilde{Z}(\theta)$ as $\frac{1}{N}\sum_n \exp(\lambda_{\max})/g(y_n)$, where $g(y)$ is the importance distribution.
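A sketch of the uniform-sphere importance sampling estimator of $Z(\lambda)$ in (17) used above; the function name is illustrative. Since the uniform density on $S^2$ is $g(y) = 1/(4\pi)$, the estimator is $4\pi$ times the sample mean of the integrand.

```python
import numpy as np

def fisher_bingham_z_hat(lmbda, n_samples, rng=None):
    """Unbiased importance sampling estimate of Z(lambda) in (17) with
    uniform proposals on the unit sphere S^2."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.standard_normal((n_samples, 3))
    y /= np.linalg.norm(y, axis=1, keepdims=True)  # uniform points on the sphere
    return 4.0 * np.pi * np.mean(np.exp(y ** 2 @ np.asarray(lmbda)))
```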

6 A Large Scale Example: Exact-Approximate Methods for the Ozone Data Set

In this section, we apply the methods of this paper to a large spatial data set for which it is computationally infeasible to compute the required densities exactly. In particular, we consider the total column ozone data set that has been used many times in the literature to test algorithms for large spatial problems (Aune et al., 2012; Bolin and Lindgren, 2011; Cressie and Johannesson, 2008; Eidsvik et al., 2013; Jun and Stein, 2008). The data, shown in Figure 8, consist of N = 173,405 ozone measurements gathered by a satellite with a passive sensor that measures back-scattered light (Cressie and Johannesson, 2008). While a full analysis of this data set would require careful modelling of both the observation process and the uncertainty of the field, for the sake of simplicity we will focus on fitting a stationary model. The size of this data set, as well as its spatial extent, makes exact computation of the log-likelihood extremely challenging.



Figure 6: Sample traces and autocorrelation plots for the Fisher-Bingham distribution for the geometric tilting with Russian Roulette truncation ((a) and (b)) and Walker's method ((c) and (d)).

                                  Roulette   Walker
Estimate of mean                  -2.377     -2.334
Estimate of standard deviation     1.0622     1.024
ESS                                1356       212

Figure 7: Estimates of the posterior mean and standard deviation of the posterior distribution using roulette and Walker’s method for the Fisher-Bingham distribution. An estimate of the effective sample size (ESS) is also shown based on 10,000 MCMC samples.



Figure 8: Normalised observations of the ozone data, aligned with a map of the world.

We shall show how to construct a computationally feasible unbiased estimator of the log-likelihood and then combine this with Russian Roulette and the random walk Metropolis algorithm to perform inference. As far as we are aware, this is the first time that full and exact Bayesian inference has been feasible and performed on a model and data set of this size.

6.1 The model

We model the data using the following three-stage hierarchical model:

$$y_i \mid x, \kappa, \tau \sim \mathcal{N}(Ax, \tau^{-1}I),$$
$$x \mid \kappa \sim \mathcal{N}\big(0, Q(\kappa)^{-1}\big), \qquad (18)$$
$$\kappa \sim \log_2\mathcal{N}(0, 100), \quad \tau \sim \log_2\mathcal{N}(0, 100),$$

where $Q(\kappa)$ is the precision matrix of a Matérn SPDE model defined on a fixed triangulation of the globe, and $A$ is a matrix that evaluates the piecewise linear basis functions in such a way that $x(s_i) = [Ax]_i$. The parameter $\kappa$ controls the range over which the correlation between two values of the field is essentially zero (Lindgren et al., 2011). The precision matrix $Q$ is sparse, which allows both for low-memory storage and for fast matrix-vector products. In this paper, the triangulation over which the SPDE model is defined has $n = 196{,}002$ vertices that are spaced in a fairly regular way around the globe. This allows us to perform piecewise linear spatial prediction over the entire globe. As the observation process is Gaussian, a straightforward calculation shows that

$$x \mid y, \kappa, \tau \sim \mathcal{N}\big(\tau(Q(\kappa) + \tau A^T A)^{-1}A^T y,\; (Q(\kappa) + \tau A^T A)^{-1}\big). \qquad (19)$$

Given the hierarchical model in (18), we are interested in sampling the parameters $\kappa$ and $\tau$. To this end, we sample their joint posterior distribution given the observations $y$, marginalised over the latent field $x$: $\pi(\kappa,\tau|y) \propto \pi(y|\kappa,\tau)\pi(\kappa)\pi(\tau)$. To compute this expression, we need the marginal likelihood $\pi(y|\kappa,\tau)$, which in this case is available analytically since $\pi(y|x,\tau)$ and $\pi(x|\kappa)$ are both Gaussian. Convolving these gives

$$\pi(y|\kappa,\tau) = \int \pi(y|x,\kappa,\tau)\pi(x|\kappa)\,dx = \mathcal{N}\big(0, \tau^{-1}I + AQ(\kappa)^{-1}A^T\big).$$

Using the matrix inversion lemma, the log-posterior is

$$2\mathcal{L}(\theta) := 2\log\pi(y|\kappa,\tau) = C + \log\det(Q(\kappa)) + N\log(\tau) - \log\det(Q(\kappa) + \tau A^T A) - \tau y^T y + \tau^2 y^T A(Q(\kappa) + \tau A^T A)^{-1}A^T y. \qquad (20)$$

6.2 An unbiased estimator of the likelihood

The standard method for computing the log-determinant of a precision matrix is to compute its Cholesky factorisation $Q = LL^T$, where $L$ is lower triangular. Then, by the properties of the determinant, the log-determinant can be computed trivially as $\log\det(Q) = 2\sum_{i=1}^{N}\log(L_{ii})$. Unfortunately, when the precision matrix is very large, it is typically infeasible to compute and store a Cholesky factorisation. Although we cannot compute the exact log-determinant without a Cholesky factorisation, we can employ the methods described in this paper to construct an unbiased estimate of the determinant and of the overall marginal likelihood, to be used in an Exact-Approximate MCMC scheme. To construct an unbiased estimator of $\log\det(Q)$, we note that $\log\det(Q) = \mathrm{tr}(\log(Q)) = \mathbb{E}_z\big(z^T\log(Q)z\big)$, where $z$ is a vector of i.i.d. centred, unit-variance random variables (Bai et al., 1996). Therefore, an unbiased estimator of the log-determinant can be constructed through Monte Carlo estimates of the expectation with respect to the distribution of $z$. This unbiased estimate is then employed in a Russian Roulette truncated Maclaurin expansion of the exponential function to obtain the required unbiased estimate of the overall Gaussian likelihood. Aune et al. (2012) used rational approximations to compute each $z^T\log(Q)z$ to machine precision, and they introduced a graph colouring method that massively reduces the variance of the Monte Carlo estimator. The evaluation of the log-likelihood (20) also requires $(Q(\kappa) + \tau A^T A)^{-1}A^T y$, which we compute to numerical precision using a preconditioned conjugate gradient method.
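A sketch of the trace identity in estimator form, shown with a dense matrix logarithm for clarity; at the scale of Section 6 the products $z^T\log(Q)z$ are instead computed matrix-free via the rational approximations and solvers of Aune et al. (2012).

```python
import numpy as np
from scipy.linalg import logm

def log_det_estimate(Q, n_probes, rng=None):
    """Unbiased Monte Carlo estimate of log det(Q) = tr(log Q) = E[z' log(Q) z]
    using Rademacher probe vectors (i.i.d. +/-1 entries have zero mean and
    unit variance). Dense logm is for illustration only."""
    rng = np.random.default_rng() if rng is None else rng
    L = logm(Q)                       # matrix logarithm of the SPD matrix Q
    z = rng.choice([-1.0, 1.0], size=(n_probes, Q.shape[0]))
    return float(np.mean(np.einsum('pi,ij,pj->p', z, L, z)))
```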

6.3 Russian Roulette MCMC Scheme

As all that is required by the proposed methodology is a number of independent unbiased estimates, the scheme is inherently parallel. As the computational and numerical implementation details of this particular application are involved, a full description would detract from the central contribution of the paper; the details are provided in the software that implements the Russian Roulette methodology for this application, which can be downloaded from http://www.ucl.ac.uk/roulette. A further detailed computational description will be published in a companion application paper. Using the unbiased estimator of the Gaussian marginal likelihood in a Russian Roulette random walk Metropolis algorithm, we are able to obtain Monte Carlo estimates based on the joint posterior $\pi(\kappa,\tau|y)$. An acceptance rate of 12% was achieved in this experiment, and a run of 5,000 samples produced the density plot for the joint posterior shown in Figure 9. The purpose of this example is to demonstrate that it is possible to perform exact Bayesian inference on large-scale models using the techniques described in this paper. Indeed, the speed of the MCMC method and the quality of mixing will scale with the number of compute nodes available; in this case a relatively small number of nodes was used. The possibility of massive-scale parallel implementation is now available, and it is clear that this methodology opens the way to performing exact Bayesian inference on models of the massive scale seen in climate and oceanographic studies, e.g. Frangos et al. (2010).

7 Discussion and Conclusion

The capability to perform Exact-Approximate MCMC on a wide class of doubly-intractable distributions has been established in this paper. It has been demonstrated on various classes of problems, including a massive-scale GMRF model for which the proposed methodology is, at present, the only methodology available for exact Bayesian inference. The methods described are not reliant on the ability to simulate exactly from the underlying model, only on the availability of unbiased estimates of the inverse of a normalising term, which makes them applicable to a wider range of problems than has been the case to date.

κ

−10 −9 −8 −7 −6 −11.353

−11.3519

−11.3509

τ

−11.3498

Figure 9: Density estimate of log p(τ, κ|y).

The inherent computational parallelism of the methodology, due to it only requiring a number of independent estimates of normalising constants, indicates opportunities to deploy this Exact-Approximate form of inference on the types of large-scale models and data prominent in Inverse Problems (IP) and Uncertainty Quantification (UQ).

The development of this method, which returns an unbiased estimate of the target distribution, is based on the stochastic truncation of a multiplicatively tilted correction to an approximation of the desired density. If the intractable likelihood is composed of a bounded function and a non-analytic normalising term, then the proposed methodology can proceed to full MCMC with no further restriction. However, in the more general case, where an unbounded function forms the likelihood, the almost-sure guarantee of positivity of the unbiased estimates is lost. The potential bias induced by this lack of strict positivity is dealt with by adopting the scheme employed in the QCD literature, where an absolute-measure target distribution is used in the MCMC and a sign-corrected Monte Carlo estimate ensures that expectations with respect to the actual posterior measure are preserved. The inflation of the Monte Carlo error in such estimates is a function of the severity of the sign problem, and this has been characterised in our work. What has been observed in the experimental evaluation is that, for the examples considered, the sign problem is not a practical issue when the variance of the estimates of the normalising terms is well controlled, and this has been achieved by employing Sequential Monte Carlo sampling in some of the examples.

A general solution to the sign problem for Exact-Approximate MCMC-based inference over doubly-intractable distributions remains an open problem. In its most general representation within Quantum Monte Carlo, it is recognised that the sign problem is NP-hard, which indicates that a practical and elegant solution may remain elusive for some time to come. Nevertheless, the methodology presented in this paper provides a general scheme with which Exact-Approximate MCMC for Bayesian inference can be deployed on a large class of statistical models. This opens up further opportunities in statistical science and in the related areas of science and engineering that depend on simulation-based inference schemes.

8 Acknowledgements

M.A. Girolami is most grateful to Arnaud Doucet for numerous motivating discussions regarding this work. M.A. Girolami is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) via the Established Career Research Fellowship EP/J016934/1 and the Programme Grant Enabling Quantification of Uncertainty for Large-Scale Inverse Problems, EP/K034154/1, http://www.warwick.ac.uk/equip. He also gratefully acknowledges support from a Royal Society Wolfson Research Merit Award.


References

Andrieu, C. and G. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37(2), 697–725.

Atchadé, Y., N. Lartillot, and C. Robert (2013). Bayesian computation for intractable normalizing constants. Brazilian Journal of Probability and Statistics, to appear.

Aune, E., D. Simpson, and J. Eidsvik (2012). Parameter estimation in high dimensional Gaussian distributions. Technical Report Statistics 5/2012, NTNU.

Bai, Z., G. Fahey, and G. Golub (1996). Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics 74(1), 71–89.

Bakeyev, T. and P. De Forcrand (2001). Noisy Monte Carlo algorithm reexamined. Physical Review D 63(5), 54505.

Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164(3), 1139–1160.

Beskos, A., O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 333–382.

Bhanot, G. and A. Kennedy (1985). Bosonic lattice gauge theory with noise. Physics Letters B 157(1), 70–76.

Bolin, D. and F. Lindgren (2011). Spatial models generated by nested stochastic partial differential equations, with an application to global ozone mapping. Annals of Applied Statistics 5, 523–550.

Booth, T. (2007). Unbiased Monte Carlo estimation of the reciprocal of an integral. Nuclear Science and Engineering 156(3), 403–407.

Caimo, A. and N. Friel (2011). Bayesian inference for exponential random graph models. Social Networks 33(1), 41–55.

Cressie, N. A. C. and G. Johannesson (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(1), 209–226.

Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 411–436.

Doucet, A., M. Pitt, and R. Kohn (2012). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv preprint arXiv:1210.1871.

Eidsvik, J., B. A. Shaby, B. J. Reich, M. Wheeler, and J. Niemi (2013). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics, to appear.

Everitt, R. (2013). Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics.

Fearnhead, P., O. Papaspiliopoulos, and G. O. Roberts (2008). Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(4), 755–777.

Fearnhead, P. and D. Prangle (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74, 419–474.

Frangos, M., Y. Marzouk, K. Willcox, and B. van Bloemen Waanders (2010). Surrogate and reduced-order modeling: a comparison of approaches for large-scale statistical inverse problems. In Computational Methods for Large-Scale Inverse Problems and Quantification of Uncertainty. Wiley.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2003). Bayesian Data Analysis. Chapman and Hall/CRC.

Ghaoui, L. E. and A. Gueye (2009). A convex upper bound on the log-partition function for binary distributions. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, pp. 409–416.

Gilks, W. R. (1999). Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC.

Goutis, C. and G. Casella (1999). Explaining the saddlepoint approximation. The American Statistician 53(3), 216–224.

Gu, M. and H. Zhu (2001). Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2), 339–355.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer.

Hughes, J., M. Haran, and P. C. Caragea (2011). Autologistic models for binary data on a lattice. Environmetrics 22(7), 857–871.

Joo, B., I. Horvath, and K. Liu (2003). The Kentucky noisy Monte Carlo algorithm for Wilson dynamical fermions. Physical Review D 67(7), 074505.

Jun, M. and M. L. Stein (2008). Nonstationary covariance models for global data. The Annals of Applied Statistics 2(4), 1271–1289.

Kennedy, A. and J. Kuti (1985). Noise without noise: a new Monte Carlo method. Physical Review Letters 54(23), 2473–2476.

Kent, J. (1982). The Fisher-Bingham distribution on the sphere. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 44, 71–80.

Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation 80(9), 1007–1022.

Lin, L., K. Liu, and J. Sloan (2000). A noisy Monte Carlo algorithm. Physical Review D 61(7), 074505.

Lindgren, F., H. Rue, and J. Lindström (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(4), 423–498.

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer.

Marin, J.-M., P. Pudlo, C. Robert, and R. Ryder (2011). Approximate Bayesian computational methods. Statistics and Computing 21(2), 289–291.

Mengersen, K., P. Pudlo, and C. Robert (2013). Bayesian computation via empirical likelihood. Proceedings of the National Academy of Sciences 110(4), 1321–1326.

Møller, J., A. Pettitt, R. Reeves, and K. Berthelsen (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458.

Murray, I., Z. Ghahramani, and D. MacKay (2006). MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 359–366.

Neal, R. M. (1998). Annealed importance sampling. Statistics and Computing 11, 125–139.

Nicholls, G., C. Fox, and A. Watt (2012). Coupled MCMC with a randomized acceptance probability. arXiv preprint arXiv:1205.6857.

Papaspiliopoulos, O. (2011). Monte Carlo probabilistic inference for diffusion processes: a methodological framework. In D. Barber, A. T. Cemgil, and S. Chiappa (Eds.), Bayesian Time Series Models, pp. 82–99. Cambridge University Press.

Propp, J. and D. Wilson (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms 9(1-2), 223–252.

Reid, N. (1988). Saddlepoint methods and statistical inference. Statistical Science 3(2), 213–227.

Robert, C. and G. Casella (2010). Introducing Monte Carlo Methods with R. Springer.

Salakhutdinov, R. and G. Hinton (2009). Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Volume 5, pp. 448–455.

Troyer, M. and U.-J. Wiese (2005). Computational complexity and fundamental limitations to fermionic quantum Monte Carlo simulations. Physical Review Letters 94, 170201.

Walker, S. (2011). Posterior sampling when the normalising constant is unknown. Communications in Statistics - Simulation and Computation 40(5), 784–792.

Wang, J. and Y. Atchadé (2013). Bayesian inference of exponential random graph models for large social networks. Communications in Statistics - Simulation and Computation.

9 Appendix: Russian Roulette

Consider estimating the sum $S = \sum_{k \ge 0} \alpha_k$, assumed finite. Let $\{q_j,\ j \ge 0\}$ be given with $q_j \in (0,1]$, and let $\{U_j,\ j \ge 1\}$ be a sequence of i.i.d. uniform $U(0,1)$ random variables. Define the random time
$$\tau = \inf\{k \ge 1:\ U_k \ge q_k\},$$
where $\inf \emptyset = +\infty$. Define, for $k \ge 1$,
$$S_k = \sum_{j=0}^{k-1} \frac{\alpha_j}{\prod_{i=1}^{j} q_i},$$
where we adopt the convention that $\prod_{i=a}^{b} \cdot = 1$ if $a > b$. For completeness, we set $S_\infty = \infty$. The Russian Roulette estimate of $S$ is
$$\hat{S} = S_\tau.$$

Once the probabilities $\{q_j\}$ are given, $\hat{S}$ is straightforward to simulate on a computer, and it provides an unbiased estimate of $S$.

Proposition 9.1. Suppose that $\lim_{n\to\infty} \prod_{i=1}^{n} q_i = 0$. Then $\tau$ is finite almost surely, and $\mathbb{E}(\hat{S}) = S$.

Proof. $\mathbb{P}(\tau = \infty) = \lim_{n\to\infty} \mathbb{P}(\tau > n) = \lim_{n\to\infty} \prod_{i=1}^{n} q_i = 0$. From the definition, $\mathbb{E}(\hat{S}) = \mathbb{E}(S_\tau) = \sum_{k=1}^{\infty} S_k\, \mathbb{P}(\tau = k)$. Since $\mathbb{P}(\tau = k) = (1 - q_k) \prod_{i=1}^{k-1} q_i$, we have
$$\sum_{k=1}^{\infty} S_k\, \mathbb{P}(\tau = k) = \sum_{k=1}^{\infty} S_k (1 - q_k) \prod_{i=1}^{k-1} q_i = \sum_{k=1}^{\infty} S_k \prod_{i=1}^{k-1} q_i - \sum_{k=1}^{\infty} S_k \prod_{i=1}^{k} q_i$$
$$= \alpha_0 + \sum_{k=1}^{\infty} S_{k+1} \prod_{i=1}^{k} q_i - \sum_{k=1}^{\infty} S_k \prod_{i=1}^{k} q_i = \alpha_0 + \sum_{k=1}^{\infty} \alpha_k = S,$$
where the last line uses $(S_{k+1} - S_k) \prod_{i=1}^{k} q_i = \alpha_k$. Thus $\sum_{k=1}^{\infty} S_k\, \mathbb{P}(\tau = k)$ is a convergent series with sum equal to $\sum_{i \ge 0} \alpha_i$.
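As an illustration, the following minimal Python sketch (ours, not code from the paper; the function names and the geometric series below are arbitrary choices for demonstration) simulates $\hat{S} = S_\tau$ for user-supplied term and continuation-probability sequences.

```python
import numpy as np

def russian_roulette(alpha, q, rng):
    """One draw of the Russian Roulette estimator S_hat = S_tau.

    alpha : callable, alpha(j) is the j-th series term (j = 0, 1, ...)
    q     : callable, q(j) in (0, 1] is the j-th continuation probability
    """
    S = alpha(0)       # S_1 = alpha_0; returned as-is if we stop at the first check
    weight = 1.0       # running product q_1 * ... * q_j
    j = 1
    # tau = inf{k >= 1 : U_k >= q_k}: keep extending while U_j < q_j
    while rng.uniform() < q(j):
        weight *= q(j)
        S += alpha(j) / weight   # add alpha_j / prod_{i=1}^j q_i
        j += 1
    return S

rng = np.random.default_rng(0)
# Geometric example: alpha_k = 0.5**k, so S = 2; rule-of-thumb choice q_j = 0.5
draws = [russian_roulette(lambda k: 0.5**k, lambda j: 0.5, rng)
         for _ in range(100_000)]
print(np.mean(draws))   # ~2.0, illustrating unbiasedness
```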

The expected value of the stopping time $\tau$ can be found as follows.

Proposition 9.2. Suppose that $\sum_{n \ge 1} \prod_{i=1}^{n} q_i < \infty$. Then $\mathbb{E}(\tau) < \infty$; more precisely,
$$\mathbb{E}(\tau) = 1 + \sum_{k=1}^{\infty} \prod_{i=1}^{k} q_i.$$

Proof. $\mathbb{E}(\tau) = \sum_{k=1}^{\infty} k\, \mathbb{P}(\tau = k)$. So, as above,
$$\sum_{k=1}^{\infty} k\, \mathbb{P}(\tau = k) = \sum_{k=1}^{\infty} k (1 - q_k) \prod_{i=1}^{k-1} q_i = \sum_{k=1}^{\infty} k \prod_{i=1}^{k-1} q_i - \sum_{k=1}^{\infty} k \prod_{i=1}^{k} q_i$$
$$= 1 + \sum_{k=1}^{\infty} (k+1) \prod_{i=1}^{k} q_i - \sum_{k=1}^{\infty} k \prod_{i=1}^{k} q_i = 1 + \sum_{k=1}^{\infty} \prod_{i=1}^{k} q_i.$$

Next we verify that the variance of the estimator is finite, and gain some insight into how the probabilities $\{q_k;\ k \ge 1\}$ can be chosen to ensure this.

Proposition 9.3. Suppose that
$$\sum_{n \ge 1} \alpha_n S_{n+1} < \infty.$$
Then $\operatorname{Var}(\hat{S}) < \infty$ and is given by
$$\operatorname{Var}(\hat{S}) = \alpha_0^2 + 2 \sum_{n=1}^{\infty} \alpha_n S_{n+1} - \sum_{n=1}^{\infty} \frac{\alpha_n^2}{\prod_{i=1}^{n} q_i} - S^2.$$

Proof. $\mathbb{E}(S_\tau^2) = \sum_{k=1}^{\infty} S_k^2\, \mathbb{P}(\tau = k)$. Using the same telescoping-sum calculation, we have
$$\sum_{k=1}^{\infty} S_k^2\, \mathbb{P}(\tau = k) = \sum_{k=1}^{\infty} S_k^2 (1 - q_k) \prod_{i=1}^{k-1} q_i = \alpha_0^2 + \sum_{k=1}^{\infty} (S_{k+1}^2 - S_k^2) \prod_{i=1}^{k} q_i.$$
But $(S_{k+1}^2 - S_k^2) \prod_{i=1}^{k} q_i = \alpha_k (S_k + S_{k+1})$, and since $S_k = S_{k+1} - \alpha_k / \prod_{i=1}^{k} q_i$, therefore
$$\sum_{k=1}^{\infty} S_k^2\, \mathbb{P}(\tau = k) = \alpha_0^2 + \sum_{k=1}^{\infty} \alpha_k (S_k + S_{k+1}) = \alpha_0^2 + 2 \sum_{k=1}^{\infty} \alpha_k S_{k+1} - \sum_{k=1}^{\infty} \frac{\alpha_k^2}{\prod_{i=1}^{k} q_i}.$$
We can conclude that if $\sum_{k=1}^{\infty} \alpha_k S_{k+1} < \infty$, then $\sum_{k=1}^{\infty} \alpha_k^2 \prod_{j=1}^{k} \frac{1}{q_j} < \infty$ and $\sum_{k=1}^{\infty} S_k^2\, \mathbb{P}(\tau = k) < \infty$. Thus $\operatorname{Var}(\hat{S}) = \mathbb{E}(S_\tau^2) - S^2 < \infty$, and has the stated expression.

As an example, if $\alpha_i = \alpha^i$ for $\alpha \in (0,1)$, and we choose $q_i = q$ for some $q \in (0,1)$, then $\alpha_n S_{n+1} \sim \alpha^n r^n$, where $r = \alpha/q$. Therefore if $\alpha^2/q < 1$, the condition of Proposition 9.3 is satisfied and $\operatorname{Var}(\hat{S}) < \infty$. But if $q > \alpha^2$ is chosen too close to 1, the computing time of the algorithm will be high: by Proposition 9.2, $\mathbb{E}(\tau) = 1 + \sum_{k \ge 1} q^k = 1/(1-q)$ in this case. Although this variance/computing-speed trade-off can be investigated analytically, a rule of thumb that works well in simulations is to choose $q = \alpha$.
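A small numerical illustration of this trade-off under the same geometric setup (again our sketch, not part of the original development; the function name rr_variance and the truncation level n_max are arbitrary choices) evaluates the variance formula of Proposition 9.3 together with $\mathbb{E}(\tau) = 1/(1-q)$ over a grid of $q$. The sums are accumulated via the recursion $\alpha^n S_{n+1} = \alpha \cdot (\alpha^{n-1} S_n) + (\alpha^2/q)^n$, which avoids overflow when $r = \alpha/q > 1$.

```python
import numpy as np

alpha = 0.5
S = 1.0 / (1.0 - alpha)   # true sum of the geometric series

def rr_variance(q, n_max=5000):
    """Var(S_hat) from Proposition 9.3 for alpha_k = alpha**k, q_k = q,
    with the infinite sums truncated at n_max (terms decay like (alpha^2/q)^n)."""
    ratio = alpha * alpha / q    # alpha^2 / q, must be < 1 for a finite variance
    term = 1.0                   # alpha^0 * S_1 = 1
    sum_aS = sum_ratio = 0.0
    rpow = 1.0
    for n in range(1, n_max + 1):
        rpow *= ratio                 # (alpha^2/q)^n = alpha_n^2 / prod_i q_i
        term = alpha * term + rpow    # alpha^n * S_{n+1}
        sum_aS += term
        sum_ratio += rpow
    return 1.0 + 2.0 * sum_aS - sum_ratio - S * S

for q in (0.3, 0.5, 0.7, 0.9):
    print(f"q = {q}: Var(S_hat) = {rr_variance(q):6.3f}, E(tau) = {1/(1-q):5.1f}")
```

Smaller $q$ gives cheaper but noisier estimates and larger $q$ the reverse, consistent with $q = \alpha$ as a compromise.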


10 Appendix: Computing absolute measure expectations

Let $(\mathbb{X}, \mathcal{B})$ denote a general measurable space with a sigma-finite reference measure $dx$. Let $\pi : \mathbb{X} \to \mathbb{R}$ be a function, possibly taking negative values, such that $\int |\pi(x)|\,dx < \infty$. We assume that $\int \pi(x)\,dx > 0$, and we wish to compute the quantity
$$I = \frac{\int h(x) \pi(x)\,dx}{\int \pi(x)\,dx},$$
for some measurable function $h : \mathbb{X} \to \mathbb{R}$ such that $\int |h(x) \pi(x)|\,dx < \infty$. We introduce $\sigma(x) = \operatorname{sign}(\pi(x))$ and $p(x) = \frac{|\pi(x)|}{\int |\pi(x)|\,dx}$; thus $p$ is a probability density on $\mathbb{X}$. Suppose that we can construct an ergodic Markov chain $\{X_n,\ n \ge 0\}$ with invariant distribution $p$, for instance using the Metropolis-Hastings algorithm. An importance sampling-type estimate of $I$ is given by
$$\hat{I}_n = \frac{\sum_{k=1}^{n} \sigma(X_k) h(X_k)}{\sum_{k=1}^{n} \sigma(X_k)}.$$
$\hat{I}_n$ has the following properties, where $P$ denotes the transition kernel of the chain and $r = \int \sigma(x) p(x)\,dx = \int \pi(x)\,dx / \int |\pi(x)|\,dx$.

Proposition 10.1. 1. If the Markov chain $\{X_n,\ n \ge 0\}$ is phi-irreducible and aperiodic, then $\hat{I}_n$ converges almost surely to $I$ as $n \to \infty$.

2. Suppose that $\{X_n,\ n \ge 0\}$ is geometrically ergodic and $\int |h(x)|^{2+\epsilon} p(x)\,dx < \infty$ for some $\epsilon > 0$. Then
$$\sqrt{n}\left(\hat{I}_n - I\right) \xrightarrow{w} N\!\left(0, \sigma^2(h)\right),$$
where
$$\sigma^2(h) = \frac{C_{11} + I^2 C_{22} - 2 I C_{12}}{r^2},$$
with
$$C_{11} = \operatorname{Var}_p(\{h\sigma\}(X)) \sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\{h\sigma\}(X),\, P^{|j|}\{h\sigma\}(X)\right),$$
$$C_{22} = \operatorname{Var}_p(\sigma(X)) \sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\sigma(X),\, P^{|j|}\sigma(X)\right),$$
$$C_{12} = \frac{1}{2} \sqrt{\operatorname{Var}_p(\{h\sigma\}(X))\, \operatorname{Var}_p(\sigma(X))} \times \left( \sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\{h\sigma\}(X),\, P^{|j|}\sigma(X)\right) + \sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\sigma(X),\, P^{|j|}\{h\sigma\}(X)\right) \right).$$

Proof. Part (1) is a straightforward application of the law of large numbers for the Markov chain $\{X_n,\ n \ge 0\}$: as $n \to \infty$, $\hat{I}_n$ converges almost surely to
$$\frac{\int \sigma(x) h(x) p(x)\,dx}{\int \sigma(x) p(x)\,dx} = \frac{\int h(x) \pi(x)\,dx}{\int \pi(x)\,dx} = I.$$
For part (2), a bivariate central limit theorem using the Cramér-Wold device gives
$$\sqrt{n} \begin{pmatrix} \frac{1}{n} \sum_{k=1}^{n} \sigma(X_k) h(X_k) - rI \\ \frac{1}{n} \sum_{k=1}^{n} \sigma(X_k) - r \end{pmatrix} \xrightarrow{w} \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} C_{11} & C_{12} \\ C_{12} & C_{22} \end{pmatrix} \right),$$
where $C_{11}$, $C_{12}$ and $C_{22}$ are as given above. By the delta method, it follows that
$$\sqrt{n}\left(\hat{I}_n - I\right) \xrightarrow{w} \frac{Z_1 - I Z_2}{r} \sim N\!\left(0, \frac{C_{11} + I^2 C_{22} - 2 I C_{12}}{r^2}\right).$$
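To make the construction concrete, here is a self-contained Python sketch (our illustration, not code from the paper). The signed target pi_signed below is an arbitrary toy choice, a two-component Gaussian mixture with one negative weight, for which $\int \pi(x)\,dx = 1$ and, taking $h(x) = x$, $I = (1.5 \cdot (-1) - 0.5 \cdot 1)/1 = -2$. A random-walk Metropolis chain (step size 2.0, also an arbitrary choice) targets $p \propto |\pi|$, and the signs are carried along as in $\hat{I}_n$.

```python
import numpy as np

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def pi_signed(x):
    # Toy signed "density": integrates to 1.5 - 0.5 = 1
    return 1.5 * norm_pdf(x, -1.0, 1.0) - 0.5 * norm_pdf(x, 1.0, 1.0)

rng = np.random.default_rng(1)
n, x = 200_000, 0.0
num = den = 0.0
for _ in range(n):
    # Random-walk Metropolis step targeting p(x) proportional to |pi(x)|
    prop = x + rng.normal(scale=2.0)
    if rng.uniform() < abs(pi_signed(prop)) / abs(pi_signed(x)):
        x = prop
    s = np.sign(pi_signed(x))
    num += s * x     # sigma(X_k) h(X_k), with h(x) = x
    den += s         # sigma(X_k)
print("I_hat =", num / den)   # ~ -2
print("r_hat =", den / n)     # estimates r = int pi / int |pi|
```

The printed r_hat estimates $r$; values near zero would signal a severe positivity issue, as discussed at the end of this appendix.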

We can roughly approximate the asymptotic variance $\sigma^2(h)$ as follows. Suppose for simplicity that the Markov chain is reversible, so that
$$C_{12} = \sqrt{\operatorname{Var}_p(\{h\sigma\}(X))\, \operatorname{Var}_p(\sigma(X))} \sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\{h\sigma\}(X),\, P^{|j|}\sigma(X)\right).$$
Assume also that the mixing of the Markov chain is roughly the same across all the functions involved:
$$\sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\{h\sigma\}(X),\, P^{|j|}\{h\sigma\}(X)\right) = \sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\sigma(X),\, P^{|j|}\sigma(X)\right) = \sum_{j=-\infty}^{\infty} \frac{\operatorname{Corr}_p\!\left(\sigma(X),\, P^{|j|}\{h\sigma\}(X)\right)}{\operatorname{Corr}_p\!\left(\sigma(X), \{h\sigma\}(X)\right)} \equiv V,$$
where we also assume that $\operatorname{Corr}_p(\{h\sigma\}(X), \sigma(X)) \neq 0$. Therefore
$$\frac{\sigma^2(h)}{V} \approx \frac{\operatorname{Var}_p(\{h\sigma\}(X)) + I^2 \operatorname{Var}_p(\sigma(X)) - 2 I \operatorname{Cov}_p(\{h\sigma\}(X), \sigma(X))}{r^2} = \frac{r \check{\pi}(h^2 \sigma) + I^2 - 2 I r \check{\pi}(h\sigma)}{r^2},$$
where $\check{\pi} = \pi / \int \pi$ and $\check{\pi}(f) = \int f(x) \check{\pi}(x)\,dx$. By a Taylor approximation of $(h, \sigma) \mapsto h^2 \sigma$ around $(\check{\pi}(h), \check{\pi}(\sigma))$, it follows that $r \check{\pi}(h^2 \sigma) \approx \check{\pi}(h^2) + 2 I r \check{\pi}(h\sigma) - 2 I^2$, so that
$$\sigma^2(h) \approx \left( \check{\pi}(h^2) - I^2 \right) \times \frac{V}{r^2}.$$
Thus a quick approximation of the Monte Carlo variance of $\hat{I}_n$ is given by
$$\frac{1}{n} \left\{ \frac{\sum_{k=1}^{n} h^2(X_k) \sigma(X_k)}{\sum_{k=1}^{n} \sigma(X_k)} - \left( \frac{\sum_{k=1}^{n} h(X_k) \sigma(X_k)}{\sum_{k=1}^{n} \sigma(X_k)} \right)^{2} \right\} \times \frac{\hat{V}}{\left( \frac{1}{n} \sum_{k=1}^{n} \sigma(X_k) \right)^{2}},$$
where $\hat{V}$ is an estimate of the common autocorrelation sum; for example, $\hat{V}$ can be taken as the lag-window estimate of $\sum_{j=-\infty}^{\infty} \operatorname{Corr}_p\!\left(\{h\sigma\}(X), P^{|j|}\{h\sigma\}(X)\right)$.

The quantity $\frac{1}{n} \sum_{k=1}^{n} \sigma(X_k)$, which estimates $r$, is indicative of the severity of the positivity issue: the smaller $r$, the harder it is to estimate $I$ accurately.
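This quick approximation translates directly into code. The sketch below is ours: the function name quick_mc_variance, the Bartlett lag window, and max_lag are all arbitrary choices, and the inputs are arrays of $h(X_k)$ and $\sigma(X_k)$ stored during a run such as the one sketched earlier (kept as arrays rather than running sums). Note that for strongly signed targets the plug-in term $\check{\pi}(h^2) - I^2$ can occasionally go negative, in which case this rough approximation is uninformative and one should fall back on checking r_hat.

```python
import numpy as np

def quick_mc_variance(h_vals, sigma, max_lag=200):
    """Quick approximation of Var(I_hat_n) from chain output.

    h_vals : array of h(X_k); sigma : array of sigma(X_k) in {-1, +1}.
    Implements (pi_check(h^2) - I^2) * V_hat / (n * r_hat^2), with V_hat a
    Bartlett lag-window estimate of the autocorrelation sum for {h sigma}(X).
    """
    n = len(sigma)
    g = h_vals * sigma                                 # {h sigma}(X_k)
    r_hat = np.mean(sigma)                             # estimates r
    I_hat = g.sum() / sigma.sum()                      # estimates I
    m2_hat = (h_vals**2 * sigma).sum() / sigma.sum()   # estimates pi_check(h^2)
    gc = g - g.mean()
    # Empirical autocovariances of {h sigma}(X) up to max_lag
    acov = np.array([np.dot(gc[: n - k], gc[k:]) / n for k in range(max_lag)])
    w = 1.0 - np.arange(max_lag) / max_lag             # Bartlett weights
    V_hat = 1.0 + 2.0 * np.sum(w[1:] * acov[1:] / acov[0])
    return (m2_hat - I_hat**2) * V_hat / (n * r_hat**2)
```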
