Bayesian model comparison with the Hyvärinen score: computation and consistency

arXiv:1711.00136v1 [stat.ME] 31 Oct 2017

Stéphane Shao∗, Pierre E. Jacob∗, Jie Ding†, Vahid Tarokh†

Abstract. The Bayes factor is a widely used criterion in Bayesian model comparison and can be regarded as a ratio of out-of-sample predictive scores under the logarithmic scoring rule. However, when some of the candidate models involve vague or improper priors on their parameters, the Bayes factor features an arbitrary multiplicative constant that hinders its interpretation. As an alternative, we consider model comparison using the Hyvärinen score of Dawid & Musio (2015). We provide a method to consistently estimate this score for parametric models, using sequential Monte Carlo (SMC). In particular, we show that it can be estimated for models with tractable likelihoods via SMC samplers, and for nonlinear non-Gaussian state-space models by using SMC². We prove the asymptotic consistency of this new model selection criterion under strong regularity assumptions. We illustrate the method on diffusion models for population dynamics and Lévy-driven stochastic volatility models.

Keywords: Bayes factor, non-informative prior, model comparison, sequential Monte Carlo, state-space model.

1 Introduction

1.1 Bayesian model comparison

Our goal is to perform Bayesian model comparison in situations where the candidate models involve possibly vague or improper prior distributions on some of their parameters. The Bayes factor (Jeffreys, 1939) between two models – defined as the ratio of their respective marginal likelihoods – is a widely used approach to Bayesian model comparison. If one of the candidate models includes the data-generating process, that model is termed well-specified or correct, and the Bayes factor can be interpreted as a ratio of odds, which updates the relative probabilities of the models being correct. In the misspecified or M-open setting (Bernardo and Smith, 2000), the marginal likelihood can be interpreted as a measure of out-of-sample predictive performance assessed with the logarithmic scoring rule (e.g. Kass and Raftery, 1995; Key, Pericchi and Smith, 1999; Bernardo and Smith, 2000; Bissiri, Holmes and Walker, 2016). Scoring rules are loss functions for the task of predicting an observation y with a probability distribution p, and the logarithmic scoring rule quantifies predictive performance with − log p(y). The Bayes factor leads to consistent model selection (e.g. Dawid, 2011; Lee and MacEachern, 2011; Walker, 2013; Chib and Kuffner, 2016) as the number of observations goes to infinity, and so does its approximation by the Bayesian Information Criterion (e.g. Lv and Liu, 2014). The Bayes factor can thus be justified both asymptotically and non-asymptotically, in well-specified and misspecified settings. However, if any of the models involve vague or improper prior distributions on their parameters, the Bayes factor can take arbitrary values. Such priors can be sensibly motivated, either by practical or theoretical considerations; for instance, they arise naturally as reference priors like Jeffreys priors (e.g. Chapter 3 of Robert, 2007).
This limitation of the Bayes factor, sometimes referred to as Bartlett’s paradox (Bartlett, 1957; Kass and Raftery, 1995), is a long-standing challenge in Bayesian model comparison (Chapter 7 of Robert, 2007).

∗ Department of Statistics, Harvard University. Emails: [email protected], [email protected].
† School of Engineering and Applied Sciences, Harvard University. Emails: [email protected], [email protected].

Many approaches have been proposed to tackle this issue, either by modifying the Bayes factor (e.g. O’Hagan, 1995; Berger and Pericchi, 1996; Berger, Pericchi and Varshavsky, 1998; Berger and Pericchi, 2001) or by bypassing it altogether (e.g. Kamary, Mengersen, Robert and Rousseau, 2014, and references therein). One promising solution is to use a different scoring rule to assess the predictive performances of candidate models (Dawid and Musio, 2015; Dawid, Musio and Ventura, 2016). We consider the Hyvärinen score (Hyvärinen, 2005), which is the simplest choice of scoring rule that is proper, local, and homogeneous (Dawid and Lauritzen, 2005; Parry, Dawid and Lauritzen, 2012; Ehm and Gneiting, 2012). Homogeneity ensures that the score does not depend on normalizing constants of candidate densities, hence allowing for improper and vague priors. For any d_y-dimensional observation y ∈ R^{d_y} and twice differentiable density p on R^{d_y}, the Hyvärinen score is defined as

H(y, p) = 2\,\Delta_y \log p(y) + \|\nabla_y \log p(y)\|^2,    (1)
where ∇_y and Δ_y respectively denote the gradient and Laplacian operators with respect to the variable y. Making the prior more vague is asymptotically equivalent to multiplying the evidence p(y) by a positive constant. The log-derivatives in Eq. (1) get rid of such multiplicative constants, so that the Hyvärinen score is not affected as much as the logarithmic score by any arbitrary vagueness of the prior distribution. For instance, in a simple Normal location model with prior distribution µ ∼ N(µ_0, σ_0²) and likelihood Y | µ ∼ N(µ, 1), the prior predictive distribution is Y ∼ N(µ_0, 1 + σ_0²). The logarithmic score is then −log p(y) = (1/2) log(2π(1 + σ_0²)) + (y − µ_0)²/(2(1 + σ_0²)), which is not bounded in σ_0 as σ_0 → +∞, i.e. as the prior becomes more vague. In contrast, the Hyvärinen score is equal to H(y, p) = −2/(1 + σ_0²) + (y − µ_0)²/(1 + σ_0²)², which is bounded in σ_0. Given T observations y_{1:T} = (y_1, ..., y_T) ∈ Y^T and a finite set M of candidate models, each inducing a respective joint marginal density of (Y_1, ..., Y_T) denoted by p_M with M ∈ M, the new criterion selects the model with the smallest predictive sequential – or prequential (Dawid, 1984) – Hyvärinen score, defined as

H_T(M) = \sum_{t=1}^{T} H\big(y_t, p_M(dy_t \mid y_{1:t-1})\big),    (2)
where p_M(dy_t|y_{1:t-1}) denotes the conditional density of Y_t given (y_1, ..., y_{t-1}) under model M, with p_M(dy_1|y_{1:0}) denoting the prior predictive distribution of Y_1. These conditional densities result from integrals with respect to the sequence of partial posterior distributions, which need to be approximated by sequential numerical methods. In this paper, our main contribution is to use sequential Monte Carlo methods to consistently estimate prequential Hyvärinen scores, thereby enabling their use in Bayesian model comparison for general parametric models. More specifically, we show that this estimation can be achieved for models with tractable likelihoods via SMC samplers (Chopin, 2002; Del Moral, Doucet and Jasra, 2006; Zhou, Johansen and Aston, 2016). It can be further achieved for state-space models by using SMC² (Fulop and Li, 2013; Chopin, Jacob and Papaspiliopoulos, 2013) under the mild requirement that we can simulate the latent state process and evaluate the measurement density (Bretó, He, Ionides and King, 2009; Andrieu, Doucet and Holenstein, 2010), and under some integrability conditions. When these conditions are not met, we propose an alternative estimator of Hyvärinen scores using a combination of SMC² and kernel density estimation (Bhattacharya, 1967). Under regularity conditions allowing for misspecified settings, we prove that the prequential Hyvärinen score yields a consistent model selection criterion. We also propose a Hyvärinen score for discrete observations using a complete characterization of proper scoring rules on discrete spaces (McCarthy, 1956; Hendrickson and Buehler, 1971; Dawid, Lauritzen and Parry, 2012; Dawid, Musio and Columbu, 2017). This paper is organized as follows. In Section 2, we review the prequential Hyvärinen score and explain how it can be estimated via SMC samplers for models with tractable likelihoods.
In Section 3, we propose two methods relying on SMC² to estimate the prequential Hyvärinen score of nonlinear non-Gaussian state-space models with only tractable measurement density. In Section 4, we address the consistency of the prequential Hyvärinen score for model selection in the case of non-nested models with continuous observations. In Section 5, we extend the definition of the Hyvärinen score to discrete observations. In Section 6, we apply the prequential Hyvärinen score to the comparison of diffusion models for population dynamics, and Lévy-driven stochastic volatility models. Limitations and directions for future research are discussed in Section 7. Appendix A provides implementation details for the SMC and SMC² algorithms. Appendix B presents assumptions enabling the estimation of the Hyvärinen score using key identities. In Appendix C, we prove the consistency of the prequential Hyvärinen score for independent and identically distributed (i.i.d.) models under explicit regularity conditions. Consistency for general state-space models with dependent data is discussed at length under more high-level assumptions. Additional proofs and simulation studies can be found in the supplementary material, available on the webpage of the second author. The R code producing the figures is available at github.com/pierrejacob/bayeshscore.
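The contrast between the two scoring rules in the Normal location example above can be checked numerically. The following Python sketch (the paper's own code is in R; the function names and numerical values here are ours) evaluates the closed-form logarithmic and Hyvärinen scores of the prior predictive N(µ_0, 1 + σ_0²) as the prior becomes vaguer:

```python
import math

def log_score(y, mu0, sigma0):
    """-log p(y) for the prior predictive Y ~ N(mu0, 1 + sigma0^2)."""
    v = 1.0 + sigma0 ** 2
    return 0.5 * math.log(2 * math.pi * v) + (y - mu0) ** 2 / (2 * v)

def h_score(y, mu0, sigma0):
    """H(y, p) = 2 (d^2/dy^2) log p(y) + ((d/dy) log p(y))^2 for the same predictive."""
    v = 1.0 + sigma0 ** 2
    return -2.0 / v + (y - mu0) ** 2 / v ** 2

for sigma0 in (1.0, 10.0, 1000.0):
    print(sigma0, log_score(1.3, 0.0, sigma0), h_score(1.3, 0.0, sigma0))
# the log score grows like log(sigma0) as the prior gets vaguer,
# while the H-score stays bounded (and tends to 0 in this example).
```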

1.2 Terminology and notation

For conciseness, we will abbreviate the Hyvärinen score to H-score. The prequential Hyvärinen score H_T(M) of a model M will be referred to as the H-score of M. Given two models M_1 and M_2, the difference of their H-scores H_T(M_2) − H_T(M_1) will be termed the H-factor of M_1 against M_2. By analogy with the logarithm of the Bayes factor, the sign of the H-factor of M_1 against M_2 dictates which model to select: if it is strictly positive, it favors model M_1; if it is strictly negative, it favors model M_2. We denote random variables by capital letters, and realizations by lower case letters. We define N∗ = N\{0} and use the colon notation for sets of objects, e.g. y_{1:t} = {y_1, ..., y_t} for all t ∈ N∗, with the convention y_{1:0} = ∅. Unless specified otherwise, ‖·‖ denotes the Euclidean norm. Each observation y = (y_{(1)}, ..., y_{(d_y)})ᵀ is a vector of dimension d_y ∈ N∗ and takes values in Y ⊆ R^{d_y}. All the probability distributions are assumed to admit densities with respect to the Lebesgue measure whenever needed. We use p to denote a generic probability density or mass function. For a probability density function p, we use p(y) when regarding p as a real-valued function evaluated at y, and p(dy) when identifying p with its induced probability distribution. We let P⋆ (resp. E⋆) denote the probability (resp. expectation) induced by the data-generating mechanism of the stochastic process (Y_t)_{t∈N∗}. We use the abbreviation P⋆-a.s. for P⋆-almost surely. We let p⋆ denote the probability density or mass function associated with P⋆. We do not put any restrictions on p⋆, to allow for model misspecification. All the assumptions and results stated without explicit reference to any particular model are meant for all the candidate models. For notational convenience, we drop the explicit dependence on the model in the notation whenever there is no ambiguity, e.g. when making statements regarding a single model at a time.
When dealing concurrently with several models from a set M, we use the subscript M ∈ M to condition on a particular model. Each candidate model M is parametrized by a parameter θ_M, living in a space T_M ⊆ R^{d_{θ_M}} of dimension d_{θ_M} ∈ N∗. The H-score H_T(M) of a model M is a random variable that depends on the observations Y_{1:T}, although the notation makes this dependence implicit. When the context is clear, we use the notation H_T(M) to denote both the random score and its realization. For a differentiable function f on Y, we use ∂f(y_t)/∂y_{t(k)} or ∂f(y)/∂y_{(k)}|_{y=y_t} interchangeably to denote the k-th partial derivative of f evaluated at y_t ∈ Y. We use the following conventions: an exponential distribution with rate λ > 0, denoted by Exp(λ), has density x ↦ λ exp(−λx) for x > 0; a Gamma(α, β) distribution with shape α > 0 and rate β > 0 has density x ↦ β^α Γ(α)^{−1} x^{α−1} e^{−βx} for x > 0; a scaled inverse chi-square distribution with degrees of freedom ν > 0 and scale s > 0, denoted by Inv-χ²(ν, s²), corresponds to the distribution of the inverse of a Gamma(ν/2, s²ν/2) variable, and has density x ↦ (ν/2)^{ν/2} Γ(ν/2)^{−1} s^ν x^{−(ν/2+1)} e^{−νs²/(2x)} for x > 0; LN(µ, σ²) denotes a log-Normal distribution with location µ ∈ R and log-scale σ > 0, with density x ↦ x^{−1} (2πσ²)^{−1/2} e^{−(log(x)−µ)²/(2σ²)} for x > 0; Unif(a, b), with a < b, denotes a uniform distribution with support [a, b] ⊂ R; Poisson(λ) denotes a Poisson distribution with rate λ > 0 and probability mass function k ↦ e^{−λ} λ^k/k! for k ∈ N; finally, NB(m, v), with v > m > 0, denotes a negative binomial distribution parametrized by mean and variance, i.e. with probability mass function k ↦ \binom{k+r−1}{k} (1 − p)^r p^k for k ∈ N, where m = pr/(1 − p) and v = pr/(1 − p)².
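As a concrete check of the NB(m, v) parametrization, the following Python sketch (our illustration, not from the paper) recovers p = 1 − m/v and r = m²/(v − m) from the stated mean and variance relations, and verifies them against the probability mass function:

```python
from math import lgamma, exp, log

def nb_pmf(k, m, v):
    """pmf of NB(m, v), mean m and variance v > m > 0,
    using p = 1 - m/v and r = m^2/(v - m)."""
    p = 1.0 - m / v
    r = m * m / (v - m)
    # C(k + r - 1, k) (1 - p)^r p^k, written with log-Gamma so non-integer r is allowed
    return exp(lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(1.0 - p) + k * log(p))

m, v = 3.0, 5.0
mean = sum(k * nb_pmf(k, m, v) for k in range(200))
variance = sum((k - m) ** 2 * nb_pmf(k, m, v) for k in range(200))
# mean ≈ 3.0 and variance ≈ 5.0, matching the (m, v) parametrization
```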


2 Computing the H-score

2.1 H-score in a Bayesian approach

The H-score of a model M, defined in Eq. (2), can be written as

H_T(M) = \sum_{t=1}^{T} \sum_{k=1}^{d_y} \left[ 2\, \frac{\partial^2 \log p(y_t \mid y_{1:t-1})}{\partial y_{t(k)}^2} + \left( \frac{\partial \log p(y_t \mid y_{1:t-1})}{\partial y_{t(k)}} \right)^2 \right].    (3)

The marginal predictive densities appearing in Eq. (3) correspond to integrals with respect to posterior distributions, since p(y_t|y_{1:t-1}) = \int p(y_t|θ, y_{1:t-1})\, p(θ|y_{1:t-1})\, dθ. As noted in Dawid and Musio (2015), a judicious interchange of differentiation and integration under appropriate regularity conditions leads to

H_T(M) = \sum_{t=1}^{T} \sum_{k=1}^{d_y} \left( E\left[ 2\, \frac{\partial^2 \log p(y_t \mid y_{1:t-1}, \Theta)}{\partial y_{t(k)}^2} + \left( \frac{\partial \log p(y_t \mid y_{1:t-1}, \Theta)}{\partial y_{t(k)}} \right)^2 \,\middle|\, y_{1:t} \right] - E\left[ \frac{\partial \log p(y_t \mid y_{1:t-1}, \Theta)}{\partial y_{t(k)}} \,\middle|\, y_{1:t} \right]^2 \right),    (4)

where the conditional expectations are taken with respect to the posterior distributions Θ ∼ p(dθ|y_{1:t}). Assumptions under which Eq. (4) holds can be found in Appendix B. In general, expectations with respect to p(dθ|y_{1:t}) for all successive t ≥ 1 can be consistently estimated using sequential or annealed importance sampling (Neal, 2001) and SMC samplers (Chopin, 2002; Del Moral et al., 2006), as briefly described in Section 2.3. The case of state-space models with intractable likelihoods is presented in Section 3. For discrete observations, we provide an extension of the H-score in Section 5. Several remarks are in order. First, we need to contrast the prequential approach described above with a batch approach, where one would assess the prediction of y_{1:T} at once. Indeed, one could consider H(y_{1:T}, p(dy_{1:T})) instead of the prequential version \sum_{t=1}^{T} H(y_t, p(dy_t|y_{1:t-1})), which would allow for batch estimation with Markov chain Monte Carlo. However, the batch approach is generally not consistent for model selection (see Section 8.1 in Dawid and Musio, 2015). Therefore, the prequential framework not only has a natural interpretation that relates to sequential probability forecasts (Dawid, 1984), but is also necessary for consistency. This leads to the task of approximating all the successive predictive distributions p(dy_t|y_{1:t-1}). This distinction does not arise for the logarithmic scoring rule, for which we always have −log p(y_{1:T}) = −\sum_{t=1}^{T} \log p(y_t|y_{1:t-1}). Secondly, for i.i.d. models, different orderings of the observations lead to different sequences of predictive distributions, hence yielding different values of the H-score. This does not occur with the logarithmic score −log p(y_{1:T}). For large samples, this issue is mitigated by the convergence of the rescaled H-score to a limit that is the same for any ordering of the observations (see Section 4.1).
For small samples, one could average the H-score associated with multiple orderings, at the cost of extra computations. Finally, the terms of the sum in Eq. (4) might not be well-defined when improper posterior distributions arise as a result of using improper priors. If τ denotes the first index such that the posterior p(dθ|y_{1:τ}) is proper, then we would redefine the H-score as \sum_{t=τ}^{T} H(y_t, p(dy_t|y_{1:t-1})). For simplicity of exposition, we will hereafter assume that the posterior distributions of all the models at hand are proper after assimilating one observation.
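The two remarks above (the log-score factorizes over any ordering of the observations, while the prequential H-score depends on the ordering) can be verified on the conjugate Normal location model of Section 2.2. A Python sketch with all predictive quantities in closed form (function names and data values are ours):

```python
import math

def prequential_scores(ys, sigma0=2.0):
    """Log score and prequential H-score of y_{1:T} under
    Y_i | mu ~ N(mu, 1) i.i.d., with prior mu ~ N(0, sigma0^2)."""
    mu, prec = 0.0, sigma0 ** -2       # posterior mean and precision of mu
    log_s = h_s = 0.0
    for y in ys:
        pv = 1.0 / prec + 1.0          # predictive variance of Y_t given y_{1:t-1}
        log_s += 0.5 * math.log(2 * math.pi * pv) + (y - mu) ** 2 / (2 * pv)
        h_s += -2.0 / pv + (y - mu) ** 2 / pv ** 2
        mu = (prec * mu + y) / (prec + 1.0)   # conjugate posterior update
        prec += 1.0
    return log_s, h_s

ys = [0.3, -1.2, 2.1, 0.7]             # hypothetical observations
l1, h1 = prequential_scores(ys)
l2, h2 = prequential_scores(ys[::-1])
# l1 == l2 up to rounding (the log score is order-invariant),
# while h1 != h2 in general (the H-score depends on the ordering).
```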

2.2 Illustration with a Normal location model

Let M denote the following toy model:

Y_1, ..., Y_T | µ ∼ N(µ, 1)  i.i.d.,  with µ ∼ N(0, σ_0²),

where σ_0 > 0 is a known hyperparameter. For all t ∈ {0, ..., T}, we can use conjugacy to obtain µ | y_{1:t} ∼ N(µ_t, σ_t²), where σ_t² = (t + σ_0^{−2})^{−1}, µ_0 = 0, and µ_t = σ_t² \sum_{i=1}^{t} y_i for all t ∈ {1, ..., T}. The log-evidence under M is then

\log p(y_{1:T}) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log(\sigma_0^2 + 1) - \frac{y_1^2}{2(\sigma_0^2 + 1)} - \sum_{t=2}^{T} \left[ \frac{1}{2}\log(\sigma_{t-1}^2 + 1) + \frac{(y_t - \mu_{t-1})^2}{2(\sigma_{t-1}^2 + 1)} \right],

which behaves as −log σ_0 when σ_0 → +∞. This implies that the log-evidence diverges to −∞ as σ_0 increases. In other words, the Bayes factor between this model and virtually any other model can be made arbitrarily small by making the prior on µ arbitrarily vague. On the other hand, the H-score is given by

H_T(M) = -\frac{2}{\sigma_0^2 + 1} + \left( \frac{y_1}{\sigma_0^2 + 1} \right)^2 + \sum_{t=2}^{T} \left[ -\frac{2}{\sigma_{t-1}^2 + 1} + \left( \frac{y_t - \mu_{t-1}}{\sigma_{t-1}^2 + 1} \right)^2 \right] \xrightarrow[\sigma_0 \to +\infty]{} \sum_{t=2}^{T} \left[ -\frac{2(t-1)}{t} + \left( \frac{t-1}{t} \right)^2 \left( y_t - \frac{1}{t-1}\sum_{i=1}^{t-1} y_i \right)^2 \right].

Therefore, increasing σ_0 can only influence the H-score to a certain extent. The limit of H_T(M) as σ_0 → +∞ also unambiguously defines the value of the score under improper priors on µ.
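The σ_0 → +∞ limit above can be checked numerically. A Python sketch (our own, with hypothetical data values) computes the finite-σ_0 prequential H-score via the conjugate updates and compares it with the limiting formula:

```python
def h_score_normal(ys, sigma0):
    """Prequential H-score of the toy model, computed via conjugacy:
    predictive Y_t | y_{1:t-1} ~ N(mu_{t-1}, sigma_{t-1}^2 + 1)."""
    total, mu, prec = 0.0, 0.0, sigma0 ** -2   # prec = 1 / sigma_t^2
    for y in ys:
        pv = 1.0 / prec + 1.0
        total += -2.0 / pv + (y - mu) ** 2 / pv ** 2
        mu = (prec * mu + y) / (prec + 1.0)
        prec += 1.0
    return total

def h_score_limit(ys):
    """Limit of H_T(M) as sigma0 -> +infinity (improper flat prior on mu)."""
    total = 0.0
    for t in range(2, len(ys) + 1):
        ybar = sum(ys[: t - 1]) / (t - 1)
        total += -2.0 * (t - 1) / t + ((t - 1) / t) ** 2 * (ys[t - 1] - ybar) ** 2
    return total

ys = [0.3, -1.2, 2.1, 0.7]   # hypothetical observations
# h_score_normal(ys, 1e4) agrees with h_score_limit(ys) to high accuracy
```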

2.3 Approximation with sequential Monte Carlo

A sequential Monte Carlo sampler starts by sampling a set of N_θ particles θ^{(1:N_θ)} = (θ^{(1)}, ..., θ^{(N_θ)}) independently from an initial distribution q(dθ). The algorithm then assigns weights, resamples, and moves these particles in order to successively approximate the posterior distributions p(dθ|y_{1:t}) for each t ≥ 1. In particular, we can move the sample from the posterior p(dθ|y_{1:t-1}) to the posterior p(dθ|y_{1:t}) by successively targeting intermediate distributions whose densities are proportional to p(θ|y_{1:t-1}) p(y_t|y_{1:t-1}, θ)^{γ_{t,j}}, where 0 = γ_{t,0} < γ_{t,1} < ... < γ_{t,J_t} = 1 with J_t ∈ N∗. The temperatures γ_{t,j} can be determined adaptively to maintain a chosen level of non-degeneracy in the importance weights of the particles, e.g. by forcing the effective sample size to stay above a desired threshold or by imposing a minimum number of unique particles. The resampling steps can be performed with systematic resampling (see comparisons in Douc and Cappé, 2005; Murray, Lee and Jacob, 2016; Gerber, Chopin and Whiteley, 2017), and the move steps with Metropolis-Hastings kernels using independent proposals. In the numerical experiments of the present article, such proposals are obtained by fitting mixtures of Normal distributions on the current weighted particles. The initial distribution q(dθ) can be taken as the uniform distribution on a set (e.g. Fearnhead and Taylor, 2013), or the prior distribution p(dθ), when that prior is proper. More generally, q(dθ) should be chosen as an approximation of the first proper posterior, possibly obtained from preliminary Markov chain Monte Carlo runs. Further implementation details are provided in Appendix A. Sequential estimation of the H-score can thus be achieved at a cost comparable to that of estimating the log-evidence. Indeed, both can be obtained from the same SMC runs.
However, numerical experiments suggest that the estimator of the H-score tends to have a larger variance than the estimator of the log-evidence, for a given number of particles. This can be explained informally as follows. For the evidence, the Monte Carlo approaches approximate expectations of the form E[p(y_t|y_{1:t-1}, Θ) | y_{1:t-1}] with respect to the posterior p(dθ|y_{1:t-1}). On the other hand, the H-score involves expectations of the form E[∇_y log p(y_t|y_{1:t-1}, Θ) | y_{1:t}] with respect to the posterior p(dθ|y_{1:t}). When t is large, these expectations are with respect to similar distributions, whereas the integrands are respectively θ ↦ p(y_t|y_{1:t-1}, θ) and θ ↦ ∇_y log p(y_t|y_{1:t-1}, θ). For a range of models, the first type of integrand will be easier to integrate than the second, e.g. when the former is bounded in θ while the latter is not, as in the case of Normal location models (see Section 2.2).
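The tempering scheme of this section can be sketched in a few dozen lines. The Python code below is a minimal illustration, not the paper's implementation: it uses multinomial rather than systematic resampling, a single Normal rather than a mixture for the independent proposal, and the Normal location model of Section 2.2 as target. It moves particles through the sequence of posteriors, choosing each temperature increment by bisection so that the effective sample size (ESS) stays above a threshold:

```python
import math, random

random.seed(1)

def norm_logpdf(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

def smc_posterior_mean(ys, sigma0=10.0, n=1000, ess_min=0.5):
    """Adaptive-tempering SMC for mu in: Y_i | mu ~ N(mu, 1), mu ~ N(0, sigma0^2).
    Assimilates one observation at a time, tempering each likelihood increment."""

    def log_target(th, t, g):
        # prior x likelihood of y_1..y_{t-1} x tempered likelihood of y_t
        return (norm_logpdf(th, 0.0, sigma0 ** 2)
                + sum(norm_logpdf(y, th, 1.0) for y in ys[:t])
                + g * norm_logpdf(ys[t], th, 1.0))

    thetas = [random.gauss(0.0, sigma0) for _ in range(n)]
    for t in range(len(ys)):
        gamma = 0.0
        while gamma < 1.0:
            loglik = [norm_logpdf(ys[t], th, 1.0) for th in thetas]
            lmax = max(loglik)

            def ess_at(g):
                w = [math.exp((g - gamma) * (l - lmax)) for l in loglik]
                return sum(w) ** 2 / sum(x * x for x in w)

            g = 1.0
            if ess_at(1.0) < ess_min * n:   # bisect for the largest safe increment
                lo, hi = gamma, 1.0
                for _ in range(30):
                    mid = 0.5 * (lo + hi)
                    lo, hi = (mid, hi) if ess_at(mid) >= ess_min * n else (lo, mid)
                g = lo
            w = [math.exp((g - gamma) * (l - lmax)) for l in loglik]
            thetas = random.choices(thetas, weights=w, k=n)   # multinomial resampling
            # independent Metropolis-Hastings move, proposal fitted to the particles
            pm = sum(thetas) / n
            pv = sum((th - pm) ** 2 for th in thetas) / n + 1e-12
            for i in range(n):
                prop = random.gauss(pm, math.sqrt(pv))
                logr = (log_target(prop, t, g) - log_target(thetas[i], t, g)
                        + norm_logpdf(thetas[i], pm, pv) - norm_logpdf(prop, pm, pv))
                if math.log(random.random()) < logr:
                    thetas[i] = prop
            gamma = g
    return sum(thetas) / n

ys = [1.0, 2.0, 0.5, 1.5, 1.0]          # hypothetical data
est = smc_posterior_mean(ys)
# compare with the exact conjugate posterior mean: sum(ys) / (len(ys) + sigma0**-2)
```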

3 H-score for state-space models

The H-score raises additional computational questions in the case of state-space models. State-space models, also known as hidden Markov models, are a flexible and widely used class of time series models (Cappé, Moulines and Rydén, 2005; Douc, Moulines and Stoffer, 2014), which describe the observations (Y_t)_{t∈N∗} as conditionally independent given a Markov chain of latent states (X_t)_{t∈N∗} living in X ⊆ R^{d_x}. A state-space model with parameter θ ∈ T ⊆ R^{d_θ} specifies an initial distribution µ_θ(dx_1) of the first state X_1, a Markov kernel f_θ(dx_{t+1}|x_t) for the transition of the latent process, a measurement distribution g_θ(dy_t|x_t), and a prior distribution p(dθ) on the parameter. For t ≥ 1 and θ ∈ T, the filtering distributions are the conditional distributions p(dx_t|y_{1:t}, θ) of the latent state X_t at time t, given all the observations y_{1:t} up to time t.

3.1 Estimation via filtering means

The conditional predictive densities p(y_t|y_{1:t-1}, θ) appearing in Eq. (4) correspond to integrals over the latent states, i.e. p(y_t|y_{1:t-1}, θ) = \int p(x_t|y_{1:t-1}, θ)\, g_θ(y_t|x_t)\, dx_t. These integrals are in general intractable and Eq. (4) is not directly implementable for state-space models. Interchanging differentiation and integration under suitable regularity conditions yields the following proposition, whose results are similar to Fisher’s and Louis’ identities (see Proposition 10.1.6 in Cappé et al., 2005), except that the differentiation is with respect to the observation instead of the parameter.

Proposition 1. Under Assumption A2 (see Appendix B), we have the following two identities for all θ ∈ T, all observed values y_{1:T} ∈ Y^T, all k ∈ {1, ..., d_y}, and all t ∈ {1, ..., T}:

\frac{\partial \log p(y_t \mid y_{1:t-1}, \theta)}{\partial y_{t(k)}} = \int \frac{\partial \log g_\theta(y_t \mid x_t)}{\partial y_{t(k)}}\, p(x_t \mid y_{1:t}, \theta)\, dx_t,    (5)

\frac{\partial^2 \log p(y_t \mid y_{1:t-1}, \theta)}{\partial y_{t(k)}^2} + \left( \frac{\partial \log p(y_t \mid y_{1:t-1}, \theta)}{\partial y_{t(k)}} \right)^2 = \int \left[ \frac{\partial^2 \log g_\theta(y_t \mid x_t)}{\partial y_{t(k)}^2} + \left( \frac{\partial \log g_\theta(y_t \mid x_t)}{\partial y_{t(k)}} \right)^2 \right] p(x_t \mid y_{1:t}, \theta)\, dx_t.    (6)

The proof of Proposition 1 is presented in the supplementary material. The right hand sides of Eqs. (5)-(6) can be regarded as expectations with respect to the filtering distributions p(dx_t|y_{1:t}, θ). Applying Proposition 1 to each term of the sum in Eq. (4) and using the tower property of conditional expectations leads to

H_T(M) = \sum_{t=1}^{T} \sum_{k=1}^{d_y} \left( E\left[ 2\, \frac{\partial^2 \log g_\Theta(y_t \mid X_t)}{\partial y_{t(k)}^2} + \left( \frac{\partial \log g_\Theta(y_t \mid X_t)}{\partial y_{t(k)}} \right)^2 \,\middle|\, y_{1:t} \right] - E\left[ \frac{\partial \log g_\Theta(y_t \mid X_t)}{\partial y_{t(k)}} \,\middle|\, y_{1:t} \right]^2 \right),    (7)

where the expectations are taken with respect to the joint posterior distributions (Θ, X_t) ∼ p(dθ|y_{1:t}) p(dx_t|y_{1:t}, θ). For many state-space models, the measurement density g_θ(y|x) and its log-derivatives can be evaluated at any point (θ, y, x) ∈ T × Y × X.
Under the mild condition that we can simulate the transition kernel of the latent states, we can use the SMC² algorithm (Fulop and Li, 2013; Chopin et al., 2013) to consistently estimate all the conditional expectations appearing in Eq. (7). At each time t ∈ {1, ..., T}, SMC² produces a set of weighted particles targeting p(θ, x_t|y_{1:t}), which can be directly used to update the H-score. Regarding the initialization of SMC² in the presence of vague or improper priors, the considerations discussed in Section 2.3 apply. The implementation is detailed in Appendix A.
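Proposition 1 can be verified on a linear Gaussian state-space model, where the left-hand side of Eq. (5) is available exactly from the Kalman filter. The Python sketch below (our illustration with hypothetical parameter values; it fixes θ and runs a bootstrap particle filter rather than full SMC²) estimates the filtering expectation on the right-hand side of Eq. (5) and compares it with the exact predictive log-derivative:

```python
import math, random

random.seed(7)

# Hypothetical linear Gaussian model: X_1 ~ N(0, 1), X_{t+1} | X_t ~ N(phi X_t, q2),
# Y_t | X_t ~ N(X_t, tau2); here d/dy log g(y | x) = -(y - x)/tau2.
phi, q2, tau2 = 0.9, 0.25, 1.0

def simulate(T):
    x, ys = random.gauss(0.0, 1.0), []
    for _ in range(T):
        ys.append(x + random.gauss(0.0, math.sqrt(tau2)))
        x = phi * x + random.gauss(0.0, math.sqrt(q2))
    return ys

def kalman_derivs(ys):
    """Exact d/dy_t log p(y_t | y_{1:t-1}): equals -(y_t - m_pred)/(P_pred + tau2)."""
    m, P, out = 0.0, 1.0, []
    for y in ys:
        S = P + tau2
        out.append(-(y - m) / S)
        K = P / S
        m, P = m + K * (y - m), (1.0 - K) * P      # filter update
        m, P = phi * m, phi ** 2 * P + q2          # predict next state
    return out

def pf_derivs(ys, n=5000):
    """Right-hand side of Eq. (5): E[ -(y_t - X_t)/tau2 | y_{1:t} ],
    estimated by a bootstrap particle filter with theta held fixed."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    out = []
    for y in ys:
        w = [math.exp(-(y - x) ** 2 / (2 * tau2)) for x in xs]
        tot = sum(w)
        out.append(sum(wi * (-(y - x) / tau2) for wi, x in zip(w, xs)) / tot)
        xs = random.choices(xs, weights=w, k=n)                        # resample
        xs = [phi * x + random.gauss(0.0, math.sqrt(q2)) for x in xs]  # propagate
    return out

ys = simulate(10)
exact, approx = kalman_derivs(ys), pf_derivs(ys)
# the two sequences agree up to Monte Carlo error, as predicted by Eq. (5)
```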

3.2 Kernel density estimation

When Eqs. (5)-(6) do not hold, as illustrated by the stochastic volatility models in Section 6.2, we can directly estimate the partial derivatives of ỹ_t ↦ p(ỹ_t|y_{1:t-1}, θ) at the observed y_t, by using approximate draws from the conditional predictive distribution p(dy_t|y_{1:t-1}, θ). Approximate draws from p(dy_t|y_{1:t-1}, θ) can be obtained from a run of SMC², as long as one can sample from the measurement distribution g_θ(dy_t|x_t). For a chosen bandwidth h > 0 and a twice continuously differentiable kernel K integrating to 1, e.g. a standard Gaussian kernel K(u) = (2π)^{−1/2} exp(−u²/2), we can use n draws ỹ_t^{(1)}, ..., ỹ_t^{(n)} from p(dy_t|y_{1:t-1}, θ) to consistently estimate p(y_t|y_{1:t-1}, θ) by the kernel density estimator

\widehat{p}(y_t \mid y_{1:t-1}, \theta) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{y_t - \tilde{y}_t^{(i)}}{h} \right).

This kernel density estimator is twice differentiable with respect to y_t and leads to the following estimators

\frac{\partial \widehat{p}(y_t \mid y_{1:t-1}, \theta)}{\partial y_{t(k)}} = \frac{1}{nh} \sum_{i=1}^{n} \frac{\partial}{\partial y_{t(k)}} K\!\left( \frac{y_t - \tilde{y}_t^{(i)}}{h} \right),    (8)

\frac{\partial^2 \widehat{p}(y_t \mid y_{1:t-1}, \theta)}{\partial y_{t(k)}^2} = \frac{1}{nh} \sum_{i=1}^{n} \frac{\partial^2}{\partial y_{t(k)}^2} K\!\left( \frac{y_t - \tilde{y}_t^{(i)}}{h} \right),    (9)

which respectively converge to the partial derivatives ∂p(y_t|y_{1:t-1}, θ)/∂y_{t(k)} and ∂²p(y_t|y_{1:t-1}, θ)/∂y_{t(k)}², as n → +∞ and h → 0 (e.g. Bhattacharya, 1967). The bias-variance trade-off in the choice of the bandwidth h has been extensively discussed in the density estimation literature (e.g. Härdle, Marron and Wand, 1990, and references in Section 1.11 of Tsybakov, 2009). Although differentiating kernel density estimators can sometimes lead to poor estimators of density derivatives in terms of integrated squared error (Sasaki, Noh, Niu and Sugiyama, 2016), this is not a concern here since we are interested in the simpler task of estimating the partial derivatives of ỹ_t ↦ p(ỹ_t|y_{1:t-1}, θ) only at a single point y_t. In fact, Eqs. (8)-(9) coincide with a simplified version of the direct density derivative estimation approach proposed in Sasaki et al. (2016), where we would fit a constant regression function to the density derivatives and minimize a weighted integrated squared error that puts more weight around the observed y_t via the kernel K.
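With a standard Gaussian kernel, Eqs. (8)-(9) reduce to sums of K'(u) = −u φ(u) and K''(u) = (u² − 1) φ(u). A Python sketch (our illustration: draws from N(0, 1), a hand-picked bandwidth, and comparison against the known Gaussian density derivatives):

```python
import math, random

random.seed(3)

def phi(u):
    """Standard Gaussian kernel K(u); K'(u) = -u K(u), K''(u) = (u^2 - 1) K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_derivs(y, draws, h):
    """Estimates of p(y), p'(y), p''(y) as in Eqs. (8)-(9):
    each derivative of K((y - y_i)/h) brings an extra factor 1/h."""
    n = len(draws)
    us = [(y - x) / h for x in draws]
    p = sum(phi(u) for u in us) / (n * h)
    d1 = sum(-u * phi(u) for u in us) / (n * h ** 2)
    d2 = sum((u * u - 1.0) * phi(u) for u in us) / (n * h ** 3)
    return p, d1, d2

draws = [random.gauss(0.0, 1.0) for _ in range(100000)]
p, d1, d2 = kde_derivs(0.5, draws, h=0.3)   # bandwidth chosen by hand for this sketch
# true values for N(0, 1): p = phi(0.5), p' = -0.5 phi(0.5), p'' = (0.25 - 1) phi(0.5)
```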

4 Consistency of the H-score

The H-score can be justified for finite samples, as a scoring rule satisfying properties including propriety, locality, and homogeneity (Dawid et al., 2012; Ehm and Gneiting, 2012). In this section, we show that, as the number of observations grows, choosing the model with the smallest H-score eventually leads to selecting the model that is the closest to the data-generating process in a certain sense. In particular, the H-score asymptotically favors well-specified models over misspecified ones. We focus on statistical intuition rather than technicalities, and postpone the formal statements of the assumptions to the appendix. We examine two different settings: in Section 4.1, we consider i.i.d. models and assume the observations are truly generated as i.i.d.; in Section 4.2, we consider state-space models and put fewer restrictions on the form of the data-generating process. For ease of proof, we limit ourselves to non-nested models with continuous observations, leaving the more challenging settings of nested models and discrete observations for future work. The proofs are detailed in Appendix C and rely on the expression

H_T(M) = \sum_{t=1}^{T} E\Big[ H\big(y_t, p(dy_t \mid y_{1:t-1}, \Theta)\big) \,\Big|\, y_{1:t} \Big] + \sum_{t=1}^{T} \mathrm{Var}\left( \frac{\partial \log p(y_t \mid y_{1:t-1}, \Theta)}{\partial y_t} \,\middle|\, y_{1:t} \right),    (10)
which follows directly from rearranging the terms in Eq. (4). The key insight is that, as the number of observations grows and the posterior distribution p(dθ|y1:T ) concentrates to a point mass, the sum of the conditional expectations in Eq. (10) will eventually dominate and drive the behavior of the H-score, while the sum of the conditional variances acts as a penalty term that becomes negligible in non-nested settings. Some perspective on consistency of prequential scores can be found in Dawid and Musio (2015).

4.1 Independent data and i.i.d. models

In this section, we assume that (Y_t)_{t∈N∗} is a sequence of i.i.d. observations drawn from p⋆. We consider two i.i.d. models M_1 and M_2, each describing the data as Y_1, ..., Y_T | θ_j ∼ p_j(dy|θ_j) i.i.d., with parameter θ_j ∈ T_j and prior density p_j(θ_j), where j ∈ {1, 2}. Under well-established regularity conditions (e.g. Wald, 1949; Le Cam, 1953; Kiefer and Wolfowitz, 1956; Huber, 1967; Perlman et al., 1972, see Condition C1 in Appendix C.2.1), the posterior distribution p_j(θ_j|Y_{1:T}) under model M_j concentrates around some θ_j⋆ ∈ T_j, as T → +∞, P⋆-almost surely.

Further continuity and uniform integrability conditions ensure that posterior concentration leads to convergence of the posterior expectations and variances in Eq. (10), respectively to a finite limit and to 0. The last steps of the proof consist in a combined application of the Stolz-Cesàro theorem and the standard law of large numbers, which yields

\frac{1}{T} \big( H_T(M_2) - H_T(M_1) \big) \xrightarrow[T \to +\infty]{P_\star\text{-a.s.}} E_\star\big[ H\big(Y, p_2(dy \mid \theta_2^\star)\big) \big] - E_\star\big[ H\big(Y, p_1(dy \mid \theta_1^\star)\big) \big],    (11)

where the expectations are taken with respect to Y ∼ p⋆. If we make the additional assumption that the density p⋆ is twice differentiable with integrable H-score, then for each j ∈ {1, 2}, we can define the quantity

\Delta(p_\star, M_j) = E_\star\big[ H\big(Y, p_j(dy \mid \theta_j^\star)\big) \big] - E_\star\big[ H\big(Y, p_\star(dy)\big) \big].    (12)

When \lim_{\|y\| \to +\infty} p_\star(y)\, \nabla_y \log p_j(y \mid \theta_j^\star) = 0, the strict propriety of the H-score holds (Hyvärinen, 2005) and implies that ∆(p⋆, M_j) ≥ 0, with ∆(p⋆, M_j) = 0 if and only if p_j(dy|θ_j⋆) = p⋆(dy). In other words, ∆(p⋆, M_j) can be interpreted as a divergence between the data-generating distribution p⋆ and model M_j. This divergence is 0 only for well-specified models. Theorem 1 restates Eq. (11) in a more interpretable form.

Theorem 1. If M_1 and M_2 both satisfy Assumptions A1, and A3 to A7 (see Appendix C), then we have

\frac{1}{T} \big( H_T(M_2) - H_T(M_1) \big) \xrightarrow[T \to +\infty]{P_\star\text{-a.s.}} \Delta(p_\star, M_2) - \Delta(p_\star, M_1),

where ∆ is defined in Eq. (12) and satisfies ∆(p⋆, M_j) ≥ 0, with ∆(p⋆, M_j) = 0 if and only if p_j(dy|θ_j⋆) = p⋆(dy).

Under Theorem 1, the H-score asymptotically chooses the model M_j that is the closest to the data-generating distribution p⋆ with respect to the divergence ∆. In particular, if M_1 is well-specified and M_2 is misspecified, then we have ∆(p⋆, M_1) = 0 < ∆(p⋆, M_2), which leads to H_T(M_2) − H_T(M_1) > 0 for all sufficiently large T, P⋆-almost surely. In other words, the H-score eventually chooses the well-specified model M_1 over the misspecified model M_2.
The divergence appearing in Theorem 1 should be contrasted with the divergence associated with the log-score. Under similar assumptions, one can prove (e.g. Dawid, 2011) that the log-Bayes factor of M_1 against M_2 satisfies

\frac{1}{T} \Big( \big( -\log p(Y_{1:T} \mid M_2) \big) - \big( -\log p(Y_{1:T} \mid M_1) \big) \Big) \xrightarrow[T \to +\infty]{P_\star\text{-a.s.}} \mathrm{KL}(p_\star, M_2) - \mathrm{KL}(p_\star, M_1),

where KL(p⋆, M_j) = E⋆[−log p_j(dy|θ_j⋆)] − E⋆[−log p⋆(dy)] denotes the Kullback-Leibler divergence between p⋆ and p_j(dy|θ_j⋆) for each j ∈ {1, 2}. In other words, the log-score −log p(Y_{1:T}|M_j) asymptotically favors the model that is the closest to p⋆ with respect to the Kullback-Leibler divergence KL(p⋆, M_j), whereas the H-score H_T(M_j) asymptotically favors the model that is the closest to p⋆ with respect to the divergence ∆(p⋆, M_j). When only one of the candidate models is well-specified, the Bayes factor and the H-factor both agree on consistently selecting that well-specified model. When both M_1 and M_2 are misspecified, each criterion selects a model according to its own notion of divergence. In some settings, the geometries induced by these divergences may differ, leading the Bayes factor and the H-factor to select different misspecified models (see Section 4.3). In the case where E⋆[H(Y, p_1(dy|θ_1⋆))] = E⋆[H(Y, p_2(dy|θ_2⋆))], the limit in Eq. (11) becomes 0 and calls for a more careful look at the higher order penalty term formed by the conditional variances in Eq. (10). Quantifying the asymptotic behavior of these posterior variances would require finer posterior concentration results (e.g. Bernstein-von Mises type results and the multivariate delta method). Such refined asymptotics would be needed for instance if two well-specified models were nested, in the sense that {p_1(dy|θ_1) : θ_1 ∈ T_1} ⊂ {p_2(dy|θ_2) : θ_2 ∈ T_2}. For nested well-specified models, our simulation studies suggest that the H-score asymptotically selects the model of smallest dimension, like the Bayes factor typically does (e.g.
Moreno, Girón and Casella, 2010; Rousseau and Taeryon, 2012; Chib and Kuffner, 2016). Qualitatively, different dimensions of the parameter spaces T_1 and T_2 should lead to different concentration speeds of the respective posterior distributions, which would then be reflected in the posterior variances of ∂ log p(y_t|Θ)/∂y_t under each model, and drive the choice towards the well-specified model of smallest dimension. The particular case of nested Normal linear models is discussed in Sections 8 and 9 of Dawid and Musio (2015).

4.2 Dependent data and state-space models

In this section, the observations are no longer assumed to be i.i.d. and we consider two candidate state-space models, M_1 and M_2. As in the i.i.d. setting, proving the consistency of the H-score relies on suitable concentration of the posterior distribution p_j(θ_j|Y_{1:t}) for each j ∈ {1, 2}. For state-space models with dependent data, related concentration results have been derived in specific cases (e.g. Lijoi, Prünster and Walker, 2007; De Gunst and Shcherbakova, 2008; Shalizi, 2009; Gassiat, Rousseau et al., 2014; Douc et al., 2014; Douc, Olsson and Roueff, 2016, and references therein). However, to the best of our knowledge, general results on posterior concentration for possibly misspecified state-space models have yet to be established. As a consequence, our following result on consistency of H-scores for state-space models will use posterior concentration as a working assumption. Despite this technical limitation, we believe that our result provides valuable insight on the H-score and motivates further theoretical research in Bayesian asymptotics for state-space models. Our numerical examples suggest that concentration of posterior distributions can be a reasonable assumption in practice, even for complex state-space models such as the ones illustrated in Section 6 (see posterior density plots in the supplementary material). Further research could also weaken the assumptions made in the appendices. For each j ∈ {1, 2}, we assume that, P⋆-almost surely, the posterior distribution p_j(θ_j|Y_{1:T}) under model M_j concentrates around some θ_j⋆ ∈ T_j, as T → +∞. We also assume that the stochastic process (Y_t)_{t∈N∗} is strongly stationary, so that we can artificially extend its set of indices to negative integers and consider the two-sided process (Y_t)_{t∈Z}. Under suitable domination and integrability conditions, the latent Markov chain forgets its initial distribution quickly enough, so that we can provably define H(Y_t, p_j(dy_t|Y_{−∞:t−1}, θ_j⋆)) as the P⋆-almost sure limit of H(Y_t, p_j(dy_t|Y_{−m+1:t−1}, θ_j⋆)) when m → +∞, for all t ∈ N∗. A difficulty in proving the consistency of the H-score with dependent observations lies in showing that H_T(M_j) can be approximated arbitrarily well by some stationary analog, as T → +∞, in the sense of

\[
\frac{1}{T}\,\mathcal{H}_T(M_j) \;-\; \frac{1}{T}\sum_{t=1}^{T} \mathcal{H}\bigl(Y_t,\, p_j(dy_t\,|\,Y_{-\infty:t-1}, \theta_j^\star)\bigr) \;\xrightarrow[T\to+\infty]{\mathbb{P}_\star\text{-a.s.}}\; 0. \tag{13}
\]

Assuming the ergodicity of (Y_t)_{t∈Z} with suitable integrability conditions enables the application of Birkhoff's ergodic theorem to the stationary analog of the H-score, which leads to

\[
\frac{1}{T}\sum_{t=1}^{T} \mathcal{H}\bigl(Y_t,\, p_j(dy_t\,|\,Y_{-\infty:t-1}, \theta_j^\star)\bigr) \;\xrightarrow[T\to+\infty]{\mathbb{P}_\star\text{-a.s.}}\; \mathbb{E}_\star\bigl[\mathcal{H}\bigl(Y_1,\, p_j(dy_1\,|\,Y_{-\infty:0}, \theta_j^\star)\bigr)\bigr]. \tag{14}
\]

The limit statements in Eqs. (13)-(14) are the main ingredients we use to prove the consistency of the H-score. Assuming the previous quantities also exist for the data-generating distribution p⋆, we can define

\[
\Delta(p_\star, M_j) = \mathbb{E}_\star\bigl[\mathcal{H}\bigl(Y_1,\, p_j(dy_1\,|\,Y_{-\infty:0}, \theta_j^\star)\bigr)\bigr] - \mathbb{E}_\star\bigl[\mathcal{H}\bigl(Y_1,\, p_\star(dy_1\,|\,Y_{-\infty:0})\bigr)\bigr]. \tag{15}
\]

If M_1 and M_2 both satisfy Assumptions A1 to A4, and A8 to A12 (see Appendix C), then Eqs. (13)-(14) can be combined with Eq. (15) to yield

\[
\frac{1}{T}\bigl[\mathcal{H}_T(M_2) - \mathcal{H}_T(M_1)\bigr] \;\xrightarrow[T\to+\infty]{\mathbb{P}_\star\text{-a.s.}}\; \Delta(p_\star, M_2) - \Delta(p_\star, M_1). \tag{16}
\]

Upon justifying that we can define pj (dy1 |Y−∞:0 , θj? ) and p? (dy1 |Y−∞:0 ) as conditional densities of Y1 given (Yt )t≤0 , the divergence ∆ satisfies ∆(p? , Mj ) ≥ 0, with ∆(p? , Mj ) = 0 if and only if pj (dy1 |Y−∞:0 , θj? ) = p? (dy1 |Y−∞:0 ), P? -almost surely. As in the i.i.d. setting, this divergence is strictly positive for misspecified models.


4.3 Numerical illustration of consistency with Normal models

Inspired by Section 3.2 of O'Hagan (1995), we consider the two Normal models

\[
M_1:\ Y_1, \ldots, Y_T \,|\, \theta_1 \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta_1, 1), \qquad \theta_1 \sim \mathcal{N}(0, \sigma_0^2),
\]
\[
M_2:\ Y_1, \ldots, Y_T \,|\, \theta_2 \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \theta_2), \qquad \theta_2 \sim \text{Inv-}\chi^2(\nu_0, s_0^2).
\]

The positive hyperparameters are chosen as σ_0² = 10, ν_0 = 0.1, and s_0² = 1. We compare M_1 and M_2, under the following data-generating processes:

(1) Y_1, ..., Y_T i.i.d. ∼ N(1, 1), i.e. M_1 is well-specified while M_2 is not,
(2) Y_1, ..., Y_T i.i.d. ∼ N(0, 5), i.e. M_2 is well-specified while M_1 is not,
(3) Y_1, ..., Y_T i.i.d. ∼ N(4, 3), i.e. both M_1 and M_2 are misspecified,
(4) Y_1, ..., Y_T i.i.d. ∼ N(0, 1), i.e. both M_1 and M_2 are well-specified.

In this setting, conjugacy of the priors allows all the posterior distributions, scores, and divergences to be computed in closed form. In particular, given Y_1, ..., Y_T i.i.d. ∼ N(μ⋆, σ⋆²), the posteriors under M_1 and M_2 concentrate respectively around θ_1⋆ = μ⋆ and θ_2⋆ = σ⋆² + μ⋆². We compute ∆ and the Kullback-Leibler divergence for Normal densities analytically (see Section 6.1 of Dawid and Musio, 2015) and get the theoretical limits

\[
\Delta(p_\star, M_2) - \Delta(p_\star, M_1) = \frac{\mu_\star^2}{\sigma_\star^2\,(\mu_\star^2 + \sigma_\star^2)} - \frac{\bigl(\sigma_\star^2 - 1\bigr)^2}{\sigma_\star^2}, \tag{17}
\]
\[
\mathrm{KL}(p_\star, M_2) - \mathrm{KL}(p_\star, M_1) = \frac{1}{2}\log\!\left(\frac{\mu_\star^2 + \sigma_\star^2}{\sigma_\star^2}\right) - \frac{\sigma_\star^2 - 1 - \log \sigma_\star^2}{2}. \tag{18}
\]

For each of the four cases, we generate T = 1000 observations, and we perform 5 runs of SMC with N_θ = 1024 particles to estimate the log-Bayes factors and H-factors of M_1 against M_2. The results are shown in Figure 1. As expected in cases 1 and 2, the H-factor selects the well-specified model and diverges to infinity at a linear rate, with respective slopes matching the theoretical limits 0.5 and -3.2 from Eq. (17). Similar behavior is obtained for the log-Bayes factor, which correctly diverges to infinity at the same linear rate, with theoretical slopes given by Eq. (18). In case 3, both models are misspecified, and Eqs. (17)-(18) with (μ⋆, σ⋆²) = (4, 3) yield ∆(p⋆, M_2) − ∆(p⋆, M_1) ≈ −1.05 and KL(p⋆, M_2) − KL(p⋆, M_1) ≈ 0.47. This leads the Bayes factor and the H-factor to favor different misspecified models. This illustrates the existence of cases where the Bayes factor and the H-factor disagree, even asymptotically in the number of observations, although we have not yet encountered such cases in applications. Finally, in case 4, the theoretical slopes are exactly 0, while the models are of equal dimensions, hence no model prevails over the other.
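The limits in Eqs. (17)-(18) are easy to evaluate numerically; the following sketch (ours, not from the paper's R package) reproduces the slopes quoted above for the four cases:

```python
import math

def delta_diff(mu, var):
    # Eq. (17): Delta(p*, M2) - Delta(p*, M1), with (mu, var) = (mu*, sigma*^2)
    return mu**2 / (var * (mu**2 + var)) - (var - 1)**2 / var

def kl_diff(mu, var):
    # Eq. (18): KL(p*, M2) - KL(p*, M1)
    return 0.5 * math.log((mu**2 + var) / var) - 0.5 * (var - 1 - math.log(var))

cases = {1: (1, 1), 2: (0, 5), 3: (4, 3), 4: (0, 1)}
slopes = {c: (delta_diff(*p), kl_diff(*p)) for c, p in cases.items()}
# cases 1 and 2 give Delta-differences 0.5 and -3.2; case 3 gives about -1.05 and 0.47
```

In case 3 the two differences have opposite signs, which is exactly the disagreement between the H-factor and the Bayes factor discussed above.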
Note that Figure 1 features H-factors and log-Bayes factors overlaid on the same plots in order to track their evolution jointly, but their values should not be directly compared.

5 H-score for discrete observations

In this section, we propose an extension of the H-score to discrete observations, for which Eq. (1) is not directly applicable. We assume that each observation y = (y^{(1)}, ..., y^{(d_y)})^⊤ takes finite values (i.e. ‖y‖ < +∞) in some discrete space Y = ⟦a_1, b_1⟧ × ... × ⟦a_{d_y}, b_{d_y}⟧, where ⟦a_k, b_k⟧ = [a_k, b_k] ∩ Z and a_k, b_k ∈ Z ∪ {−∞, +∞}, with a_k < b_k for all k ∈ {1, ..., d_y}. For ease of exposition, assume for now that b_k − a_k ≥ 3 for all k ∈ {1, ..., d_y}. Let e_k denote the canonical vector of Z^{d_y} that has all coordinates equal to 0 except for its k-th coordinate that equals 1. For all y ∈ Y and all probability mass functions p on Y, we define the score function

\[
\mathcal{H}^D(y, p) = \sum_{k=1}^{d_y} \mathcal{H}_k^D(y, p), \tag{19}
\]

Figure 1. Estimated log-Bayes factors (log-BF, red) and H-factors (HF, blue) of M1 against M2 , computed for 5 replications (thin solid lines), under four i.i.d. data-generating processes: N (1, 1) (Case 1, top-left), N (0, 5) (Case 2, top-right), N (4, 3) (Case 3, bottom-left), and N (0, 1) (Case 4, bottom-right). See Section 4.3.

where

\[
\mathcal{H}_k^D(y, p) \;=\; 2\left(\frac{p(y+2e_k)-p(y)}{2\,p(y+e_k)} \;-\; \frac{p(y)-p(y-2e_k)}{2\,p(y-e_k)}\right) \;+\; \left(\frac{p(y+e_k)-p(y-e_k)}{2\,p(y)}\right)^{2} \tag{20}
\]

if a_k + 2 ≤ y^{(k)} ≤ b_k − 2, or else

\[
\mathcal{H}_k^D(y, p) \;=\;
\begin{cases}
\phantom{-}2\,\dfrac{p(y+2e_k)-p(y)}{2\,p(y+e_k)} & \text{if } y^{(k)} = a_k,\\[3mm]
\phantom{-}2\,\dfrac{p(y+2e_k)-p(y)}{2\,p(y+e_k)} + \left(\dfrac{p(y+e_k)-p(y-e_k)}{2\,p(y)}\right)^{2} & \text{if } y^{(k)} = a_k+1,\\[3mm]
-2\,\dfrac{p(y)-p(y-2e_k)}{2\,p(y-e_k)} + \left(\dfrac{p(y+e_k)-p(y-e_k)}{2\,p(y)}\right)^{2} & \text{if } y^{(k)} = b_k-1,\\[3mm]
-2\,\dfrac{p(y)-p(y-2e_k)}{2\,p(y-e_k)} & \text{if } y^{(k)} = b_k.
\end{cases} \tag{21}
\]

The expression of H_k^D in Eq. (20) can be regarded as a discrete analog of the H-score, where the partial derivatives are replaced by central finite differences. Proper scores for discrete observations can be entirely characterized as super-gradients of concave entropy functions (McCarthy, 1956; Hendrickson and Buehler, 1971; Dawid et al., 2012). Using this characterization, we can prove that H^D is indeed a proper scoring rule. If b_k = a_k + 1 (e.g. for binary data) or b_k = a_k + 2, we could still define H_k^D via Eq. (21) by ignoring the cases y^{(k)} = a_k + 1, or y^{(k)} = b_k − 1, or both. Alternatively, we could use instead a forward differencing scheme, such as

\[
\begin{cases}
\phantom{-}2\,\dfrac{p(y+e_k)-p(y)}{p(y)} + \left(\dfrac{p(y+e_k)-p(y)}{p(y)}\right)^{2} & \text{if } y^{(k)} = a_k,\\[3mm]
\phantom{-}2\left(\dfrac{p(y+e_k)-p(y)}{p(y)} - \dfrac{p(y)-p(y-e_k)}{p(y-e_k)}\right) + \left(\dfrac{p(y+e_k)-p(y)}{p(y)}\right)^{2} & \text{if } a_k < y^{(k)} < b_k,\\[3mm]
-2\,\dfrac{p(y)-p(y-e_k)}{p(y-e_k)} & \text{if } y^{(k)} = b_k.
\end{cases}
\]

All these definitions lead to scores that meet the requirements of being insensitive to prior vagueness, while being proper and local. Deciding which one to use is then a matter of preferences and further considerations, which are left for future research. The construction of HD and the proof of its propriety are detailed in the supplementary material. For parsimony of terminology, we keep referring to HD as the H-score for discrete observations.
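To make Eqs. (19)-(21) concrete, here is a minimal one-dimensional sketch (our illustration, assuming a pmf supported on {a, a+1, ...} with no upper bound, so only the left-boundary cases of Eq. (21) are needed):

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y) if y >= 0 else 0.0

def h_discrete_1d(y, p, a=0):
    """Discrete H-score of Eqs. (20)-(21) for a univariate pmf p on {a, a+1, ...}."""
    if y == a:      # left boundary: forward part only
        return 2 * (p(y + 2) - p(y)) / (2 * p(y + 1))
    if y == a + 1:  # one step in: forward part plus squared central difference
        return (2 * (p(y + 2) - p(y)) / (2 * p(y + 1))
                + ((p(y + 1) - p(y - 1)) / (2 * p(y))) ** 2)
    # interior: Eq. (20), central differences for both terms
    return (2 * ((p(y + 2) - p(y)) / (2 * p(y + 1))
                 - (p(y) - p(y - 2)) / (2 * p(y - 1)))
            + ((p(y + 1) - p(y - 1)) / (2 * p(y))) ** 2)

p = lambda y: poisson_pmf(y, 3.0)
scores = [h_discrete_1d(y, p) for y in range(6)]
```

Summing h_discrete_1d over the coordinates of a multivariate observation would give H^D(y, p) as in Eq. (19).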

6 Applications

6.1 Diffusion models for population dynamics of red kangaroos

In this example, we illustrate the use of the H-score for discrete observations by comparing three nonlinear non-Gaussian state-space models, describing the dynamics of a population of red kangaroos (Macropus rufus) in New South Wales, Australia. These models were compared in Knape and Valpine (2012) using Bayes factors, although the authors acknowledged the undesirable sensitivity of their results to their choice of prior distributions. The data (Caughley, Shepherd and Short, 1987) is a time series of 41 bivariate observations (Y_{1,t}, Y_{2,t}), formed by double transect counts of red kangaroos, measured at irregular time intervals ranging from two to six months, between 1973 and 1984 (see Figure 2). The small number of observations calls for a model selection criterion that is principled for finite samples, which rules out, e.g., the Bayesian Information Criterion. The models are nested and will be referred to as M_1, M_2, and M_3, by decreasing order of complexity. The largest model (M_1) is a logistic diffusion model. Particular versions include an exponential growth model (M_2) and a random-walk model (M_3). These models involve a latent population size (X_t) that follows a stochastic differential equation (SDE), and whose transition kernel is intractable in the case of M_1. Further motivation for this type of models can be found in Dennis and Costantino (1988); Knape and Valpine (2012). The specification of each model is provided below, where (W_t)_{t≥0} denotes a standard Brownian motion.

M_1: X_1 ∼ LN(0, 5); dX_t/X_t = (σ²/2 + r − bX_t) dt + σ dW_t; Y_{1,t}, Y_{2,t} | X_t, τ i.i.d. ∼ NB(X_t, X_t + τX_t²); with independent priors σ, τ, b i.i.d. ∼ Unif(0, 10) and r ∼ Unif(−10, 10).

M_2: X_1 ∼ LN(0, 5); dX_t/X_t = (σ²/2 + r) dt + σ dW_t; Y_{1,t}, Y_{2,t} | X_t, τ i.i.d. ∼ NB(X_t, X_t + τX_t²); with independent priors σ, τ i.i.d. ∼ Unif(0, 10) and r ∼ Unif(−10, 10).

M_3: X_1 ∼ LN(0, 5); dX_t/X_t = (σ²/2) dt + σ dW_t; Y_{1,t}, Y_{2,t} | X_t, τ i.i.d. ∼ NB(X_t, X_t + τX_t²); with independent priors σ, τ i.i.d. ∼ Unif(0, 10).

We perform 5 runs of SMC² to estimate the log-score (negative log-evidence) and H-score of each model, with an adaptive number N_x of latent particles. For model M_1, we use N_θ = 16384 particles in θ, N_x = 32 initial particles in x, and we simulate the latent process using the Euler-Maruyama method with discretization step ∆t = 0.001. For models M_2 and M_3, we solve the SDE analytically, and use N_θ = 4096 particles in θ, with N_x = 32 initial particles in x. The estimated log-scores and H-scores are shown in Figure 2, rescaled by the number of observations. Given the data, our approach would select model M_3. The same conclusion was reached by Knape and Valpine (2012) using Bayes factors. Their conclusion was mitigated by the sensitivity of the evidence to the choice of vague priors: for instance, changing the prior on r in model M_2 to Unif(−100, 100) effectively divides the evidence of M_2 by a factor 10. On the other hand, we have found the impact of that change of prior on the H-score to be indistinguishable from the Monte Carlo variation across the 5 runs.
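The Euler-Maruyama discretization used for M_1 can be sketched as follows (our illustration; the parameter values in the call are arbitrary, not estimates from the kangaroo data):

```python
import math
import random

def euler_maruyama_logistic(x0, r, b, sigma, t_end, dt=0.001, seed=1):
    """Simulate dX_t = X_t (sigma^2/2 + r - b X_t) dt + sigma X_t dW_t on [0, t_end]."""
    rng = random.Random(seed)
    x = x0
    for _ in range(int(t_end / dt)):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x += x * (sigma**2 / 2 + r - b * x) * dt + sigma * x * dw
        x = max(x, 1e-12)  # guard: the discretized path can cross zero
    return x

x_end = euler_maruyama_logistic(x0=1.0, r=0.3, b=0.1, sigma=0.2, t_end=1.0)
```

Setting sigma = 0 recovers deterministic logistic growth toward r/b, which is a convenient check of the drift.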

6.2 Lévy-driven stochastic volatility models

This example is a simulation study illustrating the consistency of the H-score, in the case of nonlinear non-Gaussian state-space models with continuous observations. More specifically, we consider the Lévy-driven stochastic volatility

Figure 2. Top panel: double transect counts of red kangaroos performed at 41 occasions. Middle and bottom panels: estimated log-scores and H-scores of the logistic diffusion model (M1 , red circles), the exponential growth model (M2 , yellow triangles), and the random-walk model (M3 , blue squares), rescaled by the number of observations and computed for 5 replications (thin solid lines). The average scores across replications (thick lines with shapes) are plotted for better readability. See Section 6.1.

models introduced by Barndorff-Nielsen and Shephard (2001) and used as a case study in Section 4.1 of Chopin et al. (2013). These models feature transition kernels whose densities are intractable, but that can be simulated. Such models describe the joint evolution of the log-returns Y_t and the instantaneous volatility V_t of a financial asset. The former is modeled as a continuous-time process driven by a Brownian motion, while the latter is modeled as a Lévy process. More insights can be found in Barndorff-Nielsen and Shephard (2001, 2002). Given a triplet of parameters (λ, ξ, ω), we can generate random variables (V_t, Z_t)_{t≥1} recursively using the following procedure:

\[
k \sim \text{Poisson}\!\left(\lambda\xi^2/\omega^2\right); \quad C_{1:k} \overset{\text{i.i.d.}}{\sim} \text{Unif}(t-1, t); \quad E_{1:k} \overset{\text{i.i.d.}}{\sim} \text{Exp}\!\left(\xi/\omega^2\right);
\]
\[
Z_0 \sim \text{Gamma}\!\left(\xi^2/\omega^2,\, \xi/\omega^2\right); \quad Z_t = e^{-\lambda} Z_{t-1} + \sum_{j=1}^{k} e^{-\lambda(t-C_j)} E_j; \quad V_t = \frac{1}{\lambda}\Bigl(Z_{t-1} - Z_t + \sum_{j=1}^{k} E_j\Bigr). \tag{22}
\]

We let (ε_t)_{t≥1} denote a sequence of i.i.d. standard Normal errors. The first model (M_1) describes the volatility as driven by a single factor, expressed in terms of a finite rate Poisson process.

M_1: (V_t, Z_t) generated from Eq. (22) given (λ, ξ, ω); X_t = (V_t, Z_t); Y_t = µ + βV_t + V_t^{1/2} ε_t; with independent priors λ ∼ Exp(1); ξ, ω² i.i.d. ∼ Exp(1/5); µ, β i.i.d. ∼ N(0, 10).
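The recursion in Eq. (22) is straightforward to implement; the following sketch (ours; the Poisson sampler and parameter values are illustrative) simulates a path of (V_t, Z_t) under the single-factor model:

```python
import math
import random

def sample_poisson(mu, rng):
    # Knuth's method: count uniforms until their product drops below exp(-mu);
    # adequate for the small means used here
    limit, k, prod = math.exp(-mu), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def simulate_bns(T, lam, xi, omega2, seed=0):
    """One path of (V_t, Z_t), t = 1..T, from the recursion in Eq. (22)."""
    rng = random.Random(seed)
    z = rng.gammavariate(xi**2 / omega2, omega2 / xi)  # Z_0: Gamma(shape, scale), i.e. rate xi/omega^2
    path = []
    for t in range(1, T + 1):
        k = sample_poisson(lam * xi**2 / omega2, rng)
        c = [rng.uniform(t - 1, t) for _ in range(k)]
        e = [rng.expovariate(xi / omega2) for _ in range(k)]
        z_new = math.exp(-lam) * z + sum(math.exp(-lam * (t - cj)) * ej for cj, ej in zip(c, e))
        v = (z - z_new + sum(e)) / lam  # nonnegative by construction
        path.append((v, z_new))
        z = z_new
    return path

path = simulate_bns(50, lam=0.01, xi=0.5, omega2=0.0625)
```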

The second model (M_2) introduces an additional independent component to drive the behavior of the volatility, leading to the multi-factor model presented below.

M_2: (V_{1,t}, Z_{1,t}) and (V_{2,t}, Z_{2,t}) generated from Eq. (22) independently given (λ_1, ξw_1, ωw_1) and (λ_2, ξw_2, ωw_2), with w_1 = w and w_2 = 1 − w; X_t = (V_{1,t}, V_{2,t}, Z_{1,t}, Z_{2,t}); V_t = V_{1,t} + V_{2,t}; Y_t = µ + βV_t + V_t^{1/2} ε_t; with independent priors λ_1 ∼ Exp(1); λ_2 − λ_1 ∼ Exp(1/2); w ∼ Unif(0, 1); ξ, ω² i.i.d. ∼ Exp(1/5); µ, β i.i.d. ∼ N(0, 10).

We simulate T = 1000 observations from a single-factor Lévy-driven stochastic volatility model with parameters λ = 0.01, ξ = 0.5, ω² = 0.0625, µ = 0, and β = 0. These values were chosen according to the simulation studies from

Figure 3. Top panel: log-returns simulated from a single-factor L´evy-driven stochastic volatility model with parameters λ = 0.01, ξ = 0.5, ω 2 = 0.0625, µ = 0, and β = 0. Middle and bottom panels: estimated log-Bayes factor (log-BF) and H-factor (HF) of a L´evy-driven single-factor model (M1 ) against a multi-factor model (M2 ), computed for 5 replications (thin solid lines). The average scores across replications (thick solid lines) are plotted for better readability. See Section 6.2.

Barndorff-Nielsen and Shephard (2002). The H-factor of M_1 against M_2 is computed via kernel density estimation (see Section 3.2) for 5 replications of SMC², using N_θ = 1024 particles in θ, and an adaptive number of particles in x starting at N_x = 128. The kernel density estimation is performed with a Gaussian kernel, using n = 1024 predictive draws and h = 0.1. The use of kernel density estimation is needed in this example due to the fact that Eqs. (5)-(6) do not hold for these particular stochastic volatility models. Indeed, under model M_1 with parameter θ = (λ, ξ, ω, µ, β), the derivative of the measurement log-density at time t = 1 is given by ∂ log g_θ(y_1|X_1)/∂y_1 = −V_1^{−1}(y_1 − µ − βV_1), whose absolute value is equivalent to V_1^{−1}|y_1 − µ| when V_1 goes to 0. For any fixed parameter θ and conditionally on (k, C_{1:k}), the variance

\[
V_1 = \lambda^{-1}\bigl(1 - e^{-\lambda}\bigr) Z_0 + \sum_{j=1}^{k} \lambda^{-1}\bigl(1 - e^{-\lambda(1-C_j)}\bigr) E_j
\]

is a linear combination of k + 1 independent Gamma random variables, whose respective shapes and rates are α_0 = ξ²ω^{−2}, α_j = 1 for j ∈ {1, ..., k}, and β_0 = ξω^{−2}λ(1 − e^{−λ})^{−1}, β_j = ξω^{−2}λ(1 − e^{−λ(1−C_j)})^{−1} for j ∈ {1, ..., k}. The conditional density of V_1 given (θ, k, C_{1:k}) is then equal to the uniformly converging series

\[
v \mapsto \sum_{i=0}^{+\infty} \gamma(i, k, \alpha_0, \beta_0)\, v^{\alpha_0 + k + i - 1}\, e^{-v/\beta_0}
\]

for v > 0, where the γ(i, k, α_0, β_0)'s are positive constants (Moschopoulos, 1984). In the neighborhood of 0, this density is equivalent to v^{ξ²ω^{−2}+k−1} up to a multiplicative constant, so that integrating |v^{−1}| with respect to this density yields an infinite integral whenever ξ²ω^{−2} + k ≤ 1, which happens with positive probability when ξ ≤ ω. By the law of total expectation, this implies that E[|∂ log g_θ(y_1|X_1)/∂y_1|] = +∞ for some values of θ, which prevents the use of Eqs. (5)-(6) to estimate the H-score of model M_1. Similar issues arise for model M_2. The estimated log-Bayes factor and H-factor of M_1 against M_2 are plotted in Figure 3.
In this setting, the two models are nested and well-specified, but their dimensions differ. We see that both criteria correctly select the single-factor model over the more complex multi-factor model.
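The kernel density estimation step mentioned above can be sketched as follows (our one-dimensional illustration of estimating H(y, p̂) from predictive draws with a Gaussian kernel; not the paper's implementation):

```python
import math

def kde_h_score(y, draws, h=0.1):
    """H(y, p_hat) = 2 (log p_hat)'' + ((log p_hat)')^2, with p_hat a Gaussian KDE of bandwidth h."""
    s0 = s1 = s2 = 0.0
    for x in draws:
        u = (y - x) / h
        k = math.exp(-0.5 * u * u) / (h * math.sqrt(2 * math.pi))
        s0 += k                        # p_hat(y), up to the common 1/n factor
        s1 += -u / h * k               # d/dy of each kernel term
        s2 += (u * u - 1) / h**2 * k   # d^2/dy^2 of each kernel term
    g1 = s1 / s0                       # (log p_hat)'(y); the 1/n factors cancel
    g2 = s2 / s0 - g1 * g1             # (log p_hat)''(y)
    return 2 * g2 + g1 * g1
```

With a single draw at 0, p̂ is exactly N(0, h²), so the output equals −2/h² + y²/h⁴, a convenient unit check.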


7 Discussion

The H-factor constitutes a competitive alternative to the Bayes factor. It is justified non-asymptotically, since it relies on assessing predictive performance using a proper local scoring rule, and it is robust to the arbitrary vagueness of prior distributions. It can be applied to a large variety of models, including nonlinear non-Gaussian state-space models with tractable measurement distributions and transition kernels that can be simulated, and it can be estimated sequentially with SMC or SMC², at a cost comparable to that of the Bayes factor. However, the H-score puts additional smoothness restrictions on the models, such as the twice differentiability of their predictive distributions with respect to the observations (see Dawid and Musio, 2015, and its rejoinder). Thus there are models for which the Bayes factor is applicable but not the H-factor. We have also discussed in Section 4.3 a case where the two criteria disagree, even asymptotically, unlike more standard criteria such as partial and intrinsic Bayes factors (Santis and Spezzaferri, 1999). Finally, the prequential form of the score might require some precautions in i.i.d. settings, where the observations are not naturally ordered and where different orderings yield different scores. For continuous observations and non-nested parametric models satisfying strong regularity assumptions, we have proved that the H-score leads to consistent model selection. The asymptotic behavior of the H-factor is determined by how close the candidate models are to the data-generating process, where closeness is quantified by a divergence associated with the H-score, in contrast to the Kullback-Leibler divergence associated with the Bayes factor. Our proofs rely on strong assumptions, but the numerical results indicate that the conclusions might hold in much more generality. The applicability of the H-score to Bayesian nonparametric models, for which sequential algorithms are available (e.g.
MacEachern, Clyde and Liu, 1999; Ulker, Gunsel and Cemgil, 2010; Arbel and Salomond, 2016; Griffin, 2017), remains an open question. The more challenging proofs of consistency for discrete observations, nested well-specified models, and nonparametric settings would be interesting avenues of research. Other alternatives to the log-evidence include Bayesian cross-validation criteria, e.g. \(\sum_{t=1}^{T} \log p(y_t \,|\, y_{-t})\), where y_{−t} = {y_s : 1 ≤ s ≤ T and s ≠ t} for all t ∈ {1, ..., T}. Such criteria would be applicable under weaker smoothness assumptions on the predictive densities, while still being robust to arbitrary vagueness of within-model prior distributions. Efficient computation of these criteria could be achieved for i.i.d. models using SMC methods (Bornn, Doucet and Gottardo, 2010); the case of state-space models would deserve further investigation. Another approach suggested in Kamary et al. (2014) is to cast model selection as a mixture estimation problem, which would also raise interesting computational challenges for time series models.
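For conjugate models, the cross-validation criterion above is available in closed form; e.g. under model M_1 of Section 4.3, each leave-one-out predictive is Gaussian (a small illustration of ours, with made-up data):

```python
import math

def loo_log_predictives(y, sigma0_sq=10.0):
    """Leave-one-out log p(y_t | y_{-t}) under Y|theta ~ N(theta, 1), theta ~ N(0, sigma0^2)."""
    n, total = len(y), sum(y)
    out = []
    for t in range(n):
        prec = 1.0 / sigma0_sq + (n - 1)   # posterior precision of theta given y_{-t}
        m = (total - y[t]) / prec          # posterior mean of theta given y_{-t}
        s2 = 1.0 / prec + 1.0              # predictive variance of y_t given y_{-t}
        out.append(-0.5 * math.log(2 * math.pi * s2) - (y[t] - m) ** 2 / (2 * s2))
    return out

cv = sum(loo_log_predictives([0.3, -1.2, 0.8, 0.1, 1.5]))
```

Unlike the log-evidence, this sum only involves posterior predictive densities, which is why it stays stable under arbitrarily vague within-model priors.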

Supplementary material and code The supplementary material is available on the second author’s webpage. It details the proofs of Eq. (4) and Proposition 1, the construction of the discrete H-score, and all the results used in Appendix C to derive the consistency of the H-score. It also provides posterior density plots for the models introduced in Section 6, as well as an additional simulation study illustrating the consistency of the H-score for ARMA models. The R package, along with reproducible examples, can be downloaded from github.com/pierrejacob/bayeshscore.

Acknowledgements Stephane Shao thanks the participants of Greek Stochastics iota in July 2017 for their helpful feedback. Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grant DMS-1712872.


References

Andrieu, C., Doucet, A. and Holenstein, R. (2010). Particle Markov Chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72 (3), 269–342.
Arbel, J. and Salomond, J.-B. (2016). Sequential quasi Monte Carlo for Dirichlet process mixture models. NIPS, Conference on Neural Information Processing Systems, Dec 2016.
Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society: Series B, 63 (2), 167–241.
Barndorff-Nielsen, O. E. and Shephard, N. (2002). Econometric analysis of realized volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society: Series B, 64 (2), 253–280.
Bartlett, M. S. (1957). A comment on D. V. Lindley's statistical paradox. Biometrika, 44 (3-4), 533–534.
Berger, J. O. and Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91 (433), 109–122.
Berger, J. O. and Pericchi, L. R. (2001). Objective Bayesian methods for model selection: introduction and comparison. Model Selection, IMS Lecture Notes – Monograph Series, 68, 135–207.
Berger, J. O., Pericchi, L. R. and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A: The Indian Journal of Statistics, 60, 307–321.
Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory. John Wiley & Sons.
Bhattacharya, P. (1967). Estimation of a probability density function and its derivatives. Sankhyā: The Indian Journal of Statistics, Series A, pp. 373–382.
Billingsley, P. (1995). Probability and measure. Wiley Series in Probability and Mathematical Statistics.
Bissiri, P. G., Holmes, C. and Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B, 78 (5), 1103–1130.
Bornn, L., Doucet, A. and Gottardo, R. (2010). An efficient computational approach for prior sensitivity analysis and cross-validation. Canadian Journal of Statistics, 38 (1), 47–64.
Bretó, C., He, D., Ionides, E. L. and King, A. A. (2009). Time series analysis via mechanistic models. The Annals of Applied Statistics, pp. 319–348.
Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in hidden Markov models. Springer Series in Statistics.
Caughley, G., Shepherd, N. and Short, J. (1987). Kangaroos, their ecology and management in the sheep rangelands of Australia. Cambridge University Press.
Chib, S. and Kuffner, T. A. (2016). Bayes factor consistency. Preprint, arXiv:1607.00292.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89 (3), 539–552.
Chopin, N., Jacob, P. E. and Papaspiliopoulos, O. (2013). SMC²: an efficient algorithm for sequential analysis of state-space models. Journal of the Royal Statistical Society: Series B, 75 (3), 397–426.
Chopin, N., Ridgway, J., Gerber, M. and Papaspiliopoulos, O. (2015). Towards automatic calibration of the number of state particles within the SMC² algorithm. Preprint, arXiv:1506.00570.


Dawid, A. P. (1984). The prequential approach. Journal of the Royal Statistical Society: Series A, 147 (2), 278–292.
Dawid, A. P. (2011). Posterior model probabilities. Handbook of the Philosophy of Science, 7, 607–630.
Dawid, A. P. and Lauritzen, S. L. (2005). The geometry of decision theory. In Proceedings of the Second International Symposium on Information Geometry and Its Applications, pp. 22–28.
Dawid, A. P., Lauritzen, S. and Parry, M. (2012). Proper local scoring rules on discrete sample spaces. The Annals of Statistics, 40 (1), 593–608.
Dawid, A. P. and Musio, M. (2015). Bayesian model selection based on proper scoring rules. Bayesian Analysis, 10 (2), 479–499.
Dawid, A. P., Musio, M. and Columbu, S. (2017). A note on Bayesian model selection for discrete data using proper scoring rules. Statistics and Probability Letters, 129, 101–106.
Dawid, A. P., Musio, M. and Ventura, L. (2016). Minimum scoring rule inference. Scandinavian Journal of Statistics, 43 (1), 123–138.
De Gunst, M. and Shcherbakova, O. (2008). Asymptotic behavior of Bayes estimators for hidden Markov models with application to ion channels. Mathematical Methods of Statistics, 17 (4), 342–356.
Del Moral, P., Doucet, A. and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68 (3), 411–436.
Dennis, B. and Costantino, R. (1988). Analysis of steady-state populations with the gamma-abundance model: application to tribolium. Ecology, 69, 1200–1213.
Douc, R. and Cappé, O. (2005). Comparison of resampling schemes for particle filtering. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pp. 64–69.
Douc, R., Moulines, E., Olsson, J., Van Handel, R. et al. (2011). Consistency of the maximum likelihood estimator for general hidden Markov models. The Annals of Statistics, 39 (1), 474–513.
Douc, R., Moulines, E. and Stoffer, D. (2014). Nonlinear Time Series: Theory, Methods and Applications with R Examples. 1st edn. Chapman and Hall/CRC.
Douc, R., Moulines, E. et al. (2012). Asymptotic properties of the maximum likelihood estimation in misspecified hidden Markov models. The Annals of Statistics, 40 (5), 2697–2732.
Douc, R., Olsson, J. and Roueff, F. (2016). Posterior consistency for partially observed Markov models. Preprint, arXiv:1608.06851.
Duan, J.-C. and Fulop, A. (2015). Density-tempered marginalized sequential Monte Carlo samplers. Journal of Business & Economic Statistics, 33 (2), 192–202.
Ehm, W. and Gneiting, T. (2012). Local proper scoring rules of order two. The Annals of Statistics, 40 (1), 609–637.
Fearnhead, P. and Taylor, B. M. (2013). An adaptive sequential Monte Carlo sampler. Bayesian Analysis, 8 (2), 411–438.
Fulop, A. and Li, J. (2013). Efficient learning via simulation: A marginalized resample-move approach. Journal of Econometrics, 176 (2), 146–161.


Gassiat, E., Rousseau, J. et al. (2014). About the posterior distribution in hidden Markov models with unknown number of states. Bernoulli, 20 (4), 2039–2075.
Gerber, M., Chopin, N. and Whiteley, N. (2017). Negative association, ordering and convergence of resampling methods. Preprint, arXiv:1707.01845.
Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian nonparametrics. Springer Science & Business Media.
Golightly, A. and Kypraios, T. (2017). Efficient SMC² schemes for stochastic kinetic models. Preprint, arXiv:1704.02791.
Griffin, J. E. (2017). Sequential Monte Carlo methods for mixtures with normalized random measures with independent increments priors. Statistics and Computing, 27 (1), 131–145.
Hardle, W., Marron, J. and Wand, M. (1990). Bandwidth choice for density derivatives. Journal of the Royal Statistical Society: Series B, pp. 223–232.
Hendrickson, A. D. and Buehler, R. J. (1971). Proper scores for probability forecasters. The Annals of Mathematical Statistics, 42 (6), 1916–1921.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1 (1), 221–233.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709.
Jeffreys, H. (1939). Theory of Probability. Oxford University Press.
Kamary, K., Mengersen, K., Robert, C. P. and Rousseau, J. (2014). Testing hypotheses via a mixture estimation model. Preprint, arXiv:1412.2044v2.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90 (430), 773–795.
Key, J. T., Pericchi, L. R. and Smith, A. F. (1999). Bayesian model choice: what and why. Bayesian statistics, 6, 343–370.
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, pp. 887–906.
Knape, J. and Valpine, P. D. (2012). Fitting complex population models by combining particle filters with Markov Chain Monte Carlo. Ecology, 93 (2), 256–263.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related results. University of California publications in Statistics, 1, 277–330.
Lee, J. and MacEachern, S. N. (2011). Consistency of Bayes estimators without the assumption that the model is correct. Journal of Statistical Planning and Inference, 141 (2), 748–757.
Lijoi, A., Prünster, I. and Walker, S. G. (2007). Bayesian consistency for stationary models. Econometric Theory, 23 (4), 749–759.
Lv, J. and Liu, J. S. (2014). Model selection principles in misspecified models. Journal of the Royal Statistical Society: Series B, 76 (1), 141–167.


MacEachern, S. N., Clyde, M. and Liu, J. S. (1999). Sequential importance sampling for nonparametric Bayes models: the next generation. Canadian Journal of Statistics, 27 (2), 251–267.
McCarthy, J. (1956). Measures of the value of information. Proceedings of the National Academy of Sciences of the United States of America, 42 (9), 654–655.
Moreno, E., Girón, F. J. and Casella, G. (2010). Consistency of objective Bayes factors as the dimension grows. The Annals of Statistics, 38 (4), 1937–1952.
Moschopoulos, P. G. (1984). The distribution of the sum of independent gamma random variables. Annals of the Institute of Statistical Mathematics, 37 (1), 541–544.
Murray, L. M., Lee, A. and Jacob, P. E. (2016). Parallel resampling in the particle filter. Journal of Computational and Graphical Statistics, 25 (3), 789–805.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11 (2), 125–139.
O'Hagan, A. (1995). Fractional Bayes factor for model comparison. Journal of the Royal Statistical Society: Series B, 57 (1), 99–138.
Parry, M., Dawid, A. P. and Lauritzen, S. (2012). Proper local scoring rules. The Annals of Statistics, 40 (1), 561–592.
Perlman, M. D. et al. (1972). On the strong consistency of approximate maximum likelihood estimators. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics.
Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94 (446), 590–599.
Robert, C. (2007). The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media.
Rousseau, J. and Taeryon, C. (2012). Bayes factor consistency in regression problems. HAL, archives ouvertes, hal-00767469.
Rudin, W. (1964). Principles of mathematical analysis. McGraw–Hill.
Santis, F. and Spezzaferri, F. (1999). Methods for default and robust Bayesian model comparison: the fractional Bayes factor approach. International Statistical Review, 67 (3), 267–286.
Sasaki, H., Noh, Y.-K., Niu, G. and Sugiyama, M. (2016). Direct density derivative estimation. Neural Computation, 28 (6), 1101–1140.
Shalizi, C. R. (2009). Dynamics of Bayesian updating with dependent data and misspecified models. Electronic Journal of Statistics, 3, 1039–1074.
Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer Series in Statistics.
Ulker, Y., Gunsel, B. and Cemgil, A. T. (2010). Sequential Monte Carlo samplers for Dirichlet process mixtures. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, AISTATS, 9.
Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20 (4), 595–601.


Walker, S. G. (2013). Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143 (10), 1621–1633.
Zhou, Y., Johansen, A. M. and Aston, J. A. D. (2016). Towards automatic model comparison: an adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25 (3).

A  Implementation

At each step t ∈ {0, ..., T}, an SMC sampler produces a set of N_θ particles θ_t^{(1:N_θ)} = (θ_t^{(1)}, ..., θ_t^{(N_θ)}) with associated normalized weights W_t^{(1:N_θ)} = (W_t^{(1)}, ..., W_t^{(N_θ)}), targeting the posterior distribution p(dθ|y_{1:t}) (Chopin, 2002; Del Moral et al., 2006). These particles are initialized as i.i.d. draws θ_0^{(1:N_θ)} from a proposal distribution q(dθ). When the prior p(dθ) is proper and can be sampled from, one can choose q(dθ) = p(dθ). Otherwise, q(dθ) should be chosen as an approximation of the first posterior distribution p(θ|y_{1:τ}) that is proper. Going from an approximate sample of the posterior p(dθ|y_{1:t}) to the next posterior p(dθ|y_{1:t+1}) is achieved by successively targeting the intermediate bridging distributions p_{γ_{t,j}}(dθ) ∝ p(dθ|y_{1:t}) p(y_{t+1}|y_{1:t}, θ)^{γ_{t,j}}, where the γ_{t,j}'s are well-chosen temperatures satisfying 0 = γ_{t,0} < γ_{t,1} < ... < γ_{t,J_t} = 1, with J_t ∈ N*. Let N denote the minimum effective sample size, fixed at some desired value. Given the current temperature γ_{t,j} < 1 and current particles θ_{t,j}^{(1:N_θ)} with weights W_{t,j}^{(1:N_θ)} targeting p_{γ_{t,j}}(dθ), the next temperature is determined adaptively and the particles are moved as follows:

1. For each m ∈ {1, ..., N_θ}, compute G_t^{(m)} = p(y_{t+1}|y_{1:t}, θ_{t,j}^{(m)}).

2. Find the largest γ_{t,j+1} ∈ (γ_{t,j}, 1] such that ESS(γ_{t,j}, γ_{t,j+1}, G_t^{(1:N_θ)}, θ_{t,j}^{(1:N_θ)}) ≥ N, where

   ESS(γ_{t,j}, γ_{t,j+1}, G_t^{(1:N_θ)}, θ_{t,j}^{(1:N_θ)}) = [ Σ_{m=1}^{N_θ} W_{t,j}^{(m)} (G_t^{(m)})^{γ_{t,j+1} − γ_{t,j}} ]² / Σ_{m=1}^{N_θ} [ W_{t,j}^{(m)} (G_t^{(m)})^{γ_{t,j+1} − γ_{t,j}} ]².

3. For each m ∈ {1, ..., N_θ}, compute w_{t,j+1}^{(m)} = W_{t,j}^{(m)} (G_t^{(m)})^{γ_{t,j+1} − γ_{t,j}} and set W_{t,j+1}^{(m)} = w_{t,j+1}^{(m)} / Σ_{i=1}^{N_θ} w_{t,j+1}^{(i)}.

4. If γ_{t,j+1} = 1 and t < T, then set γ_{t+1,0} = 0, θ_{t+1}^{(m)} = θ_{t+1,0}^{(m)} = θ_{t,j}^{(m)}, and W_{t+1}^{(m)} = W_{t+1,0}^{(m)} = W_{t,j+1}^{(m)}, for each m ∈ {1, ..., N_θ}. If γ_{t,j+1} < 1, then resample the particles according to the weights W_{t,j+1}^{(1:N_θ)}, reset the weights to W_{t,j+1}^{(m)} = 1/N_θ for each m ∈ {1, ..., N_θ}, and move the particles by sampling independently θ_{t,j+1}^{(m)} ∼ K(θ_{t,j}^{(m)}, dθ) for each m ∈ {1, ..., N_θ}, where K is a Markov kernel that leaves p_{γ_{t,j+1}} invariant. More generally, these moves can be performed as many times as desired in order to improve the rejuvenation rate of the particles, albeit at the expense of additional computations.

Starting from ESS(γ_{0,0}, γ_{0,0}, G_0^{(1:N_θ)}, θ_{0,0}^{(1:N_θ)}) = N_θ, we have ESS(γ_{t,j}, γ_{t,j}, G_t^{(1:N_θ)}, θ_{t,j}^{(1:N_θ)}) ≥ N at any given time, by construction. Besides, γ ↦ ESS(γ_{t,j}, γ, G_t^{(1:N_θ)}, θ_{t,j}^{(1:N_θ)}) is a continuous function of γ. These two facts guarantee the existence and uniqueness of the γ_{t,j+1} defined in step 2. Other diagnostics than the effective sample size could be used to monitor the degeneracy of the weights. In step 4, we choose to perform the resampling using systematic resampling (Douc and Cappé, 2005; Murray et al., 2016; Gerber et al., 2017). Regarding the Markov kernel K, we use a Metropolis-Hastings kernel, with a mixture of Normals as a proposal. This mixture of Normals is fitted to the latest set of weighted particles, using five components by default throughout our numerical experiments.

For general state-space models, the SMC² algorithm attaches one particle filter made of N_x particles x_t^{(1:N_x)} to each of the particles θ_t^{(m)}. These particle filters produce estimators p̂(y_{t+1}|y_{1:t}, θ_{t,j}^{(m)}), which are used in place of the intractable incremental likelihoods p(y_{t+1}|y_{1:t}, θ_{t,j}^{(m)}) in step 1. Our implementation of SMC² starts with a small initial number N_x, and adaptively doubles it whenever the acceptance rate of the moves in step 4 falls below some desired threshold. Increasing N_x is achieved by using conditional sequential Monte Carlo steps, which sample new filters made of a larger number of particles, as detailed in Section 3.6.2 of Chopin et al. (2013). We rely on the bootstrap particle filter for simplicity, but more efficient filters, such as the auxiliary particle filter (Pitt and Shephard, 1999), could be used within SMC², as illustrated in Golightly and Kypraios (2017). Other relevant considerations are discussed in Chopin, Ridgway, Gerber and Papaspiliopoulos (2015) and Duan and Fulop (2015).
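The adaptive choice of the next temperature in step 2 amounts to a one-dimensional search for the largest exponent that keeps the ESS above N. A minimal sketch in Python (the bisection search, its monotonicity assumption on the ESS, and the toy weights are our own illustration, not part of the paper):

```python
import numpy as np

def ess(log_G, log_W, delta):
    """ESS of the reweighted particles: (sum_m w_m)^2 / sum_m w_m^2,
    with w_m proportional to W_m * G_m^delta, computed on the log scale."""
    logw = log_W + delta * log_G
    logw -= logw.max()                       # stabilize before exponentiating
    w = np.exp(logw)
    return w.sum() ** 2 / (w ** 2).sum()

def next_temperature(log_G, log_W, gamma, ess_min, tol=1e-10):
    """Largest gamma' in (gamma, 1] with ESS >= ess_min, found by bisection
    (assumes the ESS decreases as the temperature increment grows)."""
    if ess(log_G, log_W, 1.0 - gamma) >= ess_min:
        return 1.0                           # can jump straight to the next posterior
    lo, hi = gamma, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(log_G, log_W, mid - gamma) >= ess_min:
            lo = mid
        else:
            hi = mid
    return lo

# toy run with hypothetical incremental log-likelihoods
rng = np.random.default_rng(1)
N = 512
log_W = np.full(N, -np.log(N))               # equal normalized weights
log_G = rng.normal(size=N)                   # stand-in for log p(y_{t+1} | y_{1:t}, theta_m)
gamma_next = next_temperature(log_G, log_W, gamma=0.0, ess_min=N / 2)
assert 0.0 < gamma_next <= 1.0
assert ess(log_G, log_W, gamma_next) >= N / 2 - 1e-6
```

The log-scale computation avoids overflow when the incremental likelihoods vary over many orders of magnitude, which is typical once tempering exponents accumulate.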

B  Identities for the H-score

Eq. (4) and Proposition 1 are proved in the supplementary material. They result from algebraic manipulations, under regularity assumptions guaranteeing the existence of all the relevant derivatives and integrals, as well as enabling differentiation under the integral sign. Such assumptions can be stated as follows.

Assumption A1. For all t ∈ N*, the following conditions hold:
(a) For all y_{1:t} ∈ Y^t, the function θ ↦ p(y_t|y_{1:t−1}, θ) p(θ|y_{1:t−1}) is integrable on T.
(b) For all (y_{1:t−1}, θ) ∈ Y^{t−1} × T, the function y_t ↦ p(y_t|y_{1:t−1}, θ) is twice differentiable on Y.
(c) For all k ∈ {1, ..., d_y}, there exist integrable functions h_{1,k} and h_{2,k} such that, for all (y_{1:t}, θ) ∈ Y^t × T,
|∂p(y_t|y_{1:t−1}, θ)/∂y_t(k)| p(θ|y_{1:t−1}) ≤ h_{1,k}(θ)  and  |∂²p(y_t|y_{1:t−1}, θ)/∂y_t(k)²| p(θ|y_{1:t−1}) ≤ h_{2,k}(θ).

Assumption A2. For all t ∈ N*, the following conditions hold:
(a) For all (y_{1:t}, θ) ∈ Y^t × T, the function x_t ↦ p(x_t|y_{1:t−1}, θ) g_θ(y_t|x_t) is integrable on X.
(b) For all (θ, x_t) ∈ T × X, the function y_t ↦ g_θ(y_t|x_t) is twice differentiable on Y.
(c) For all k ∈ {1, ..., d_y}, there exist integrable functions h_{3,k} and h_{4,k} such that, for all (y_{1:t}, θ, x_t) ∈ Y^t × T × X,
|∂g_θ(y_t|x_t)/∂y_t(k)| p(x_t|y_{1:t−1}, θ) ≤ h_{3,k}(x_t)  and  |∂²g_θ(y_t|x_t)/∂y_t(k)²| p(x_t|y_{1:t−1}, θ) ≤ h_{4,k}(x_t).
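For intuition, the quantity H appearing in these identities is the Hyvärinen score, which in the univariate case reads H(y, p) = 2 ∂² log p(y)/∂y² + (∂ log p(y)/∂y)². A small numerical sketch of this definition (our own check, using a finite-difference approximation and a Gaussian test density, neither of which appears in the paper):

```python
import numpy as np

def hyvarinen_score(logpdf, y, h=1e-4):
    """H(y, p) = 2 d^2/dy^2 log p(y) + (d/dy log p(y))^2, via central differences."""
    d1 = (logpdf(y + h) - logpdf(y - h)) / (2 * h)
    d2 = (logpdf(y + h) - 2 * logpdf(y) + logpdf(y - h)) / h ** 2
    return 2 * d2 + d1 ** 2

# check against the closed form for a Gaussian N(mu, sigma^2):
# H(y) = -2/sigma^2 + (y - mu)^2 / sigma^4
mu, sigma = 1.0, 2.0
logpdf = lambda y: -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
y = 0.3
exact = -2 / sigma ** 2 + (y - mu) ** 2 / sigma ** 4
assert abs(hyvarinen_score(logpdf, y) - exact) < 1e-5
```

Note that H only involves derivatives of log p, so it is unchanged by multiplying p by a constant; this is the reason the score is usable with improper priors.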

C  Consistency of the H-score

Without much loss of generality, we prove the results in the case of continuous univariate observations (d_y = 1 and Y ⊆ R). Thanks to Eq. (3), the proofs can be generalized to multivariate observations by working on each dimension separately. Unless stated otherwise, we assume A1 and A2 hold, so that we may use Eqs. (7) and (10). Appendix C.1 should be read as a proof of concept: we prove Theorem 1 and Eq. (16) by using intermediary results as high-level assumptions (A3 to A12). This allows us to highlight the key steps of the proofs. In Appendix C.2, we present explicit conditions (C1 to C6) that are sufficient for these assumptions to hold. Some of these conditions are strong, which enables intuitive proofs; our simulation studies suggest that the consistency of the H-score is likely to hold under weaker conditions. Detailed proofs of all the intermediary results are provided in the supplementary material. Proofs under weaker conditions or for discrete observations are left for future work.

C.1  Proofs of Theorem 1 and Eq. (16)

The first ingredient is the P?-almost sure concentration of the posterior distribution p(dθ|Y_{1:t}) around some limit value θ? ∈ T, as the number of observations increases (Assumption A3).

Assumption A3. P?-a.s., there exists θ? ∈ T such that, if Θ_t ∼ p(dθ|Y_{1:t}) for all t ∈ N*, then Θ_t → θ? in distribution as t → +∞.

Posterior concentration in i.i.d. settings has been well-studied and can be formally enforced by explicit regularity conditions (e.g. Condition C1 in Appendix C.2.1). In the case of state-space models with dependent observations, we treat posterior concentration as a working assumption. From now on, we assume that Assumption A3 holds, so that we can unambiguously refer to the limit point θ? around which the posterior distribution concentrates. In addition to concentration of the posterior distribution, we also want the posterior moments of specific test functions to converge, P?-almost surely. In particular, as the posterior distribution concentrates to a point mass, we want the posterior expectations and variances appearing in Eq. (10) to converge to a finite limit and to 0, respectively, as the number of observations increases (Assumption A4).

Assumption A4. The following limits hold, P?-a.s., as t → +∞:
(a) E[ H(Y_t, p(dy_t|Y_{1:t−1}, Θ)) | Y_{1:t} ] − H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) → 0.
(b) Var[ ∂ log p(Y_t|Y_{1:t−1}, Θ)/∂y_t | Y_{1:t} ] → 0.

By Stolz-Cesàro's theorem, the P?-a.s. convergence of the posterior moments in Assumption A4 implies the P?-a.s. convergence of their Cesàro means. This leads to the convergence of the prequential quantities, so that, P?-a.s. as T → +∞,

(1/T) Σ_{t=1}^{T} E[ H(Y_t, p(dy_t|Y_{1:t−1}, Θ)) | Y_{1:t} ] − (1/T) Σ_{t=1}^{T} H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) → 0,   (C.1.1)

and

(1/T) Σ_{t=1}^{T} Var[ ∂ log p(Y_t|Y_{1:t−1}, Θ)/∂y_t | Y_{1:t} ] → 0.   (C.1.2)
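The Cesàro-mean step can be illustrated numerically on a toy deterministic sequence of our own choosing (not from the paper): if a_t → 0, the running averages of the a_t also vanish, though possibly more slowly.

```python
import numpy as np

# Cesàro averaging: if a_t -> 0, then (1/T) * sum_{t<=T} a_t -> 0 as well.
# Toy sequence a_t = 1/sqrt(t): the terms vanish, and so do their running means,
# at the slower rate ~ 2/sqrt(T).
t = np.arange(1, 10 ** 6 + 1)
a = 1 / np.sqrt(t)
cesaro = np.cumsum(a) / t
assert a[-1] < 1e-2        # the terms themselves are small
assert cesaro[-1] < 3e-3   # and so is their Cesàro mean
```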

At this stage, the proof starts to differ depending on which setting we consider.

C.1.1  i.i.d. models

For i.i.d. models, we have H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) = H(Y_t, p(dy|θ?)) for all t ∈ N*. If the Y_t's are generated as i.i.d. draws from p? (Assumption A5), then the integrability of H(Y, p(dy|θ?)) with respect to Y ∼ p? (Assumption A6) enables the application of the law of large numbers to the quantity T^{−1} Σ_{t=1}^{T} H(Y_t, p(dy|θ?)).

Assumption A5. The observations (Y_t)_{t∈N*} are i.i.d. draws from p?.

Assumption A6. E?[ |H(Y, p(dy|θ?))| ] < +∞.

Under Assumptions A5 and A6, the law of large numbers reduces Eq. (C.1.1) for i.i.d. models to

(1/T) Σ_{t=1}^{T} E[ H(Y_t, p(dy|Θ)) | Y_{1:t} ] → E?[ H(Y, p(dy|θ?)) ]  P?-a.s. as T → +∞,   (C.1.3)

where the expectation is taken with respect to Y ∼ p?. If M1 and M2 are both i.i.d. models satisfying A1, A3, A4, and A6, then combining Eqs. (10), (C.1.2), and (C.1.3) leads to

(1/T) H_T(M_j) → E?[ H(Y, p_j(dy|θ_j?)) ]  P?-a.s. as T → +∞,

for each j ∈ {1, 2}. Taking the difference of the respective scores proves Eq. (11). In order to interpret the consistency of the H-score in terms of an appropriate divergence, we impose further regularity assumptions on the models and the data-generating process itself (Assumption A7). A7(a) allows us to define the H-score of p?, assumed to be integrable by A7(b). A7(c) ensures the strict propriety of the H-score.
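The limit in Eq. (C.1.3) can be visualized in a toy well-specified Gaussian case (our own example, not from the paper): with p? = N(0,1) and the model density at θ? equal to N(0,1), we have H(y, N(0,1)) = 2·(−1) + y² = y² − 2, so the prequential average of H-scores should approach E?[Y² − 2] = −1.

```python
import numpy as np

# Average of per-observation H-scores at theta* for Y_t iid N(0,1), model N(0,1):
# H(y, N(0,1)) = 2 * d2/dy2 log p(y) + (d/dy log p(y))^2 = -2 + y^2, mean -1 under p*.
rng = np.random.default_rng(42)
T = 10 ** 6
y = rng.normal(size=T)
running_avg = (y ** 2 - 2).mean()
assert abs(running_avg - (-1.0)) < 0.01   # close to the limit E*[H] = -1
```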

Assumption A7. The data-generating process and the model satisfy the following conditions:
(a) y ↦ p?(y) is twice differentiable.
(b) E?[ |H(Y, p?(dy))| ] < +∞.
(c) p?(y) ∂ log p(y|θ?)/∂y → 0 as ‖y‖ → +∞.

Under Assumptions A3, A6, and A7, we can define the divergence ∆(p?, M_j) as in Eq. (12). By adding and subtracting E?[H(Y, p?(dy))] in Eq. (11), we get

(1/T) ( H_T(M_2) − H_T(M_1) ) → ∆(p?, M_2) − ∆(p?, M_1)  P?-a.s. as T → +∞.

Under Assumption A7(c), integration by parts (Hyvärinen, 2005; Dawid and Musio, 2015) leads to

∆(p?, M_j) = ∫ ( ∂ log p?(y)/∂y − ∂ log p_j(y|θ_j?)/∂y )² p?(y) dy.

Therefore, we have ∆(p?, M_j) ≥ 0. If ∆(p?, M_j) = 0, then ∂ log p?(y)/∂y = ∂ log p_j(y|θ_j?)/∂y for all y ∈ Y. Hence, log p?(y) = log p_j(y|θ_j?) + log(c) for all y ∈ Y and some constant c > 0. This leads to p?(y) = c p_j(y|θ_j?) for all y ∈ Y. Since probability densities integrate to 1, we necessarily have c = 1, i.e. p?(y) = p_j(y|θ_j?) for all y ∈ Y. This concludes the proof of Theorem 1.
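This divergence, the integrated squared difference of score functions, has a simple closed form in a Gaussian toy case (our own example, not from the paper) and can be checked by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of Delta = E_{p*}[ (d/dy log p*(Y) - d/dy log q(Y))^2 ],
# with p* = N(0,1) and q = N(mu, s2); closed form: (1/s2 - 1)^2 + mu^2 / s2^2.
rng = np.random.default_rng(0)
mu, s2 = 0.7, 2.0
y = rng.normal(size=10 ** 6)           # draws from p* = N(0,1)
score_pstar = -y                       # d/dy log N(0,1) density
score_q = -(y - mu) / s2               # d/dy log N(mu, s2) density
delta_mc = np.mean((score_pstar - score_q) ** 2)
delta_exact = (1 / s2 - 1) ** 2 + mu ** 2 / s2 ** 2
assert abs(delta_mc - delta_exact) < 0.01
```

The divergence vanishes only when the two score functions agree everywhere, which, as argued above, forces the densities to be equal.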

C.1.2  State-space models

In the case of state-space models and dependent observations, more subtle arguments are needed, since we can no longer apply the standard law of large numbers to the term T^{−1} Σ_{t=1}^{T} H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) in Eq. (C.1.1). Instead, we approximate this term by a stationary analog, to which ergodic theorems will apply. To this end, we assume the process (Y_t)_{t∈N*} is strongly stationary and ergodic (Assumption A8). Under strong stationarity, we can artificially extend the index set to negative integers and consider the two-sided process (Y_t)_{t∈Z}. We also need the dependence of the H-score on the initial distribution of the latent Markov chain to vanish quickly enough. This will be referred to as the forgetting property of the H-score (Assumption A9).

Assumption A8. (Y_t)_{t∈N*} is strongly stationary and ergodic.

Assumption A9. There exist ρ ∈ (0, 1) and γ > 0 such that, for all t ∈ N*, all m ∈ N, and all y_{−m:t} ∈ Y^{m+t+1},

|H(y_t, p(dy_t|y_{−m+1:t−1}, θ?)) − H(y_t, p(dy_t|y_{−m:t−1}, θ?))| ≤ γ ρ^{t+m−1}.

Under Assumptions A8 and A9, we can prove that, P?-a.s., for all t ∈ N*, (H(Y_t, p(dy_t|Y_{−m+1:t−1}, θ?)))_{m∈N} is a real-valued Cauchy sequence, and thus converges to a P?-a.s. limit denoted by H(Y_t, p(dy_t|Y_{−∞:t−1}, θ?)). In other words, P?-almost surely, and for all t ∈ N*, we have

H(Y_t, p(dy_t|Y_{−m+1:t−1}, θ?)) → H(Y_t, p(dy_t|Y_{−∞:t−1}, θ?)) as m → +∞.   (C.1.4)

Using Eq. (C.1.4) and the forgetting property in A9, we can prove that

(1/T) Σ_{t=1}^{T} H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) − (1/T) Σ_{t=1}^{T} H(Y_t, p(dy_t|Y_{−∞:t−1}, θ?)) → 0  P?-a.s. as T → +∞.   (C.1.5)

The proofs of Eqs. (C.1.4) and (C.1.5) are provided in the supplementary material. Equation (C.1.5) implies that, P?-almost surely, the term T^{−1} Σ_{t=1}^{T} H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) in Eq. (C.1.1) can be asymptotically approximated by the stationary quantity T^{−1} Σ_{t=1}^{T} H(Y_t, p(dy_t|Y_{−∞:t−1}, θ?)), to which ergodic theorems can be applied under adequate integrability conditions (Assumption A10).

Assumption A10. E?[ |H(Y_1, p(dy_1|Y_{−∞:0}, θ?))| ] < +∞.

[...]

(a) [...] > 0 for all θ ∈ T.
(b) y ↦ p(y|θ) is measurable for all θ ∈ T, and θ ↦ p(y|θ) is continuous for all y ∈ Y.
(c) ∫_Y sup_{θ∈T} |log p(y|θ)| p?(y) dy < +∞.

Condition C1 can be relaxed to allow for semi-continuity and non-compact parameter spaces, as discussed in Remark 1.3.5 of Ghosh and Ramamoorthi (2003) and its references (e.g. Wald, 1949; Le Cam, 1953; Kiefer and Wolfowitz, 1956; Huber, 1967; Perlman et al., 1972). Posterior concentration for general state-space models with dependent data is less standard, especially when allowing for misspecification. Some concentration results have been proved in specific cases (e.g. Lijoi et al., 2007; De Gunst and Shcherbakova, 2008; Shalizi, 2009; Gassiat et al., 2014; Douc et al., 2016, and references therein). However, as far as we know, a formal proof of posterior concentration with explicit conditions on possibly misspecified state-space models has yet to be derived.

C.2.2  Assumption A4: Convergence of specific posterior moments

Concentration of the posterior distribution does not guarantee convergence of any posterior moments. The latter can be ensured by further imposing equicontinuity (Condition C2) and uniform integrability (Condition C3).

Condition C2. P?-almost surely, the following statements hold:
(a) {θ ↦ H(Y_t, p(dy_t|Y_{1:t−1}, θ)) : t ∈ N*} is equicontinuous at θ?.
(b) {θ ↦ ∂ log p(Y_t|Y_{1:t−1}, θ)/∂y_t : t ∈ N*} is equicontinuous at θ?.

Equicontinuity at θ? of a family of functions {θ ↦ h_t(θ) : t ∈ N*} means that all the functions in the family share a common (i.e. not depending on t) modulus of continuity at θ?. Equicontinuity can be enforced by the stronger but more explicit condition that there exists a neighborhood U_{θ?} of θ?, on which the functions θ ↦ h_t(θ) are differentiable for all t ∈ N*, and such that sup_{(t,θ) ∈ N* × U_{θ?}} ‖∇_θ h_t(θ)‖ = M < +∞. Indeed, by the mean value theorem, such uniform boundedness of the gradients ensures that the functions θ ↦ h_t(θ) are M-Lipschitz on U_{θ?} for all t ∈ N*, where M does not depend on t. Then, for any arbitrary ε > 0, we can find δ_ε > 0 not depending on t (e.g. δ_ε = ε/M if M > 0, or else any δ_ε > 0 if M = 0) such that, for all θ ∈ U_{θ?}, ‖θ − θ?‖ < δ_ε implies sup_{t∈N*} ‖h_t(θ) − h_t(θ?)‖ < ε, which proves the equicontinuity at θ? of the family {θ ↦ h_t(θ) : t ∈ N*}.

Condition C3. P?-almost surely, if Θ_t ∼ p(dθ|Y_{1:t}) for all t ∈ N*, then the following statements hold:
(a) {H(Y_t, p(dy_t|Y_{1:t−1}, Θ_t)) : t ∈ N*} is uniformly integrable given (Y_t)_{t∈N*}.
(b) {(∂ log p(Y_t|Y_{1:t−1}, Θ_t)/∂y_t)² : t ∈ N*} is uniformly integrable given (Y_t)_{t∈N*}.

Uniform integrability of a family of random variables {H_t : t ∈ N*} can be enforced by the stronger but more explicit condition of L^p-boundedness: if there exists p > 1 such that sup_{t∈N*} E[|H_t|^p] < +∞, then {H_t : t ∈ N*} is uniformly integrable (e.g. see Theorem 25.12 and its corollary in Billingsley, 1995).
Convergence of the relevant posterior moments in Assumption A4 can be obtained as a consequence of Assumption A3 combined with Conditions C2 and C3. This is summarized by the following Lemma.

Lemma 1. Assume A3, C2, and C3. Then Assumption A4 holds and we have, P?-a.s. as t → +∞:
(a) E[ H(Y_t, p(dy_t|Y_{1:t−1}, Θ)) | Y_{1:t} ] − H(Y_t, p(dy_t|Y_{1:t−1}, θ?)) → 0.
(b) Var[ ∂ log p(Y_t|Y_{1:t−1}, Θ)/∂y_t | Y_{1:t} ] → 0.

The proof of Lemma 1 is provided in the supplementary material.

C.2.3  Assumption A9: Forgetting property of the H-score

For state-space models, the forgetting property of the H-score can be obtained as a consequence of the forgetting property of the latent Markov chain stated in Eq. (C.2.1) (following from Condition C4) and of appropriate boundedness conditions on the first two derivatives of the observation log-density (Condition C5). Condition C4 corresponds to a simplified version of Assumption A13.1 in Douc et al. (2014).

Condition C4. The model satisfies the following conditions:
(a) There exists a dominating probability measure η on X such that the transition kernel f_{θ?}(dx_{t+1}|x_t) has density ν_{θ?}(x_{t+1}|x_t) = (df_{θ?}(·|x_t)/dη)(x_{t+1}) with respect to η.
(b) There exist positive constants σ− and σ+ such that, for all (x_t, x_{t+1}) ∈ X × X, the transition density ν_{θ?}(x_{t+1}|x_t) satisfies 0 < σ− < ν_{θ?}(x_{t+1}|x_t) < σ+ < +∞.
(c) For all y_t ∈ Y, the integral ∫_X g_{θ?}(y_t|x_t) η(dx_t) is bounded away from 0 and +∞.

Under strong stationarity of the process (Y_t)_{t∈N*}, Lemma 13.2 in Douc et al. (2014) guarantees that for all t ∈ N*, all m ∈ N, and all realizations y_{−m:t} ∈ Y^{m+t+1}, the filtering distributions of the latent states satisfy

d_TV( p(dx_t|y_{−m+1:t}, θ?), p(dx_t|y_{−m:t}, θ?) ) ≤ ρ^{t+m−1},   (C.2.1)

where d_TV stands for the total variation distance and ρ = 1 − (σ−/σ+) ∈ (0, 1). Condition C4 would typically require the latent space X to be finite or compact, and ensures that the transition kernel is geometrically ergodic. Such a condition can generally be weakened to allow for non-finite and non-compact spaces (e.g. Douc, Moulines, Olsson, Van Handel et al., 2011; Douc, Moulines et al., 2012). When Proposition 1 holds, H-scores for a fixed θ? can be written in terms of expectations with respect to the corresponding filtering distributions. Differences of H-scores can then be related to the total variation distance between filtering distributions by assuming the integrands in Eqs. (5)-(6) are bounded (Condition C5).

Condition C5. The model satisfies the following domination conditions:
(a) b = sup_{x∈X, y∈Y} | ∂² log g_{θ?}(y|x)/∂y² + (∂ log g_{θ?}(y|x)/∂y)² | < +∞.
(b) c = sup_{x∈X, y∈Y} | ∂ log g_{θ?}(y|x)/∂y | < +∞.

Condition C5 could be enforced by the stronger conditions that X and Y are compact, and that the first two derivatives of y ↦ log g_{θ?}(y|x) are continuous with respect to (x, y). Compactness conditions may look quite restrictive, since most well-known continuous distributions have non-compact supports. In reality, for all practical purposes, we could always envision a sufficiently large compact space in which all our numerical values would lie. As stated earlier, Conditions C4 and C5 should only be regarded as mere sufficient conditions that allow for more straightforward proofs. The models of our simulation studies do not satisfy these conditions, and yet we observe the consistency of H-scores. This indicates that consistency is likely to hold under weaker conditions. Under Assumptions A2, A3, and A8, Conditions C4 and C5 combined with Proposition 1 guarantee that the forgetting property of the H-score in Assumption A9 holds, as stated by the following Lemma.

Lemma 2. Assume A2, A3, A8, C4, and C5. Then, for all t ∈ N*, all m ∈ N, and all y_{−m:t} ∈ Y^{m+t+1},

|H(y_t, p(dy_t|y_{−m+1:t−1}, θ?)) − H(y_t, p(dy_t|y_{−m:t−1}, θ?))| ≤ 2(b + c²) ρ^{t+m−1},   (C.2.2)

sup_{m∈N} |H(y_t, p(dy_t|y_{−m+1:t−1}, θ?))| ≤ 2b + c²,   (C.2.3)

where ρ = 1 − (σ−/σ+) ∈ (0, 1).

Eq. (C.2.2) in Lemma 2 enforces Assumption A9 with γ = 2(b + c²), while Eq. (C.2.3) directly buys us Assumptions A6 and A10. The proof of Lemma 2 is provided in the supplementary material.
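The geometric forgetting of Eq. (C.2.1), which drives Lemma 2, can be observed numerically on a toy two-state hidden Markov chain (our own illustration, not from the paper): filtering recursions started from two different initial laws approach each other in total variation at rate at most ρ = 1 − σ−/σ+.

```python
import numpy as np

# Two-state hidden Markov chain with a strongly mixing transition matrix P:
# sigma- = min entry, sigma+ = max entry, forgetting rate rho = 1 - sigma-/sigma+.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
rho = 1.0 - P.min() / P.max()

def filter_step(pi, lik):
    """One predict-correct step of the filtering recursion."""
    post = (pi @ P) * lik
    return post / post.sum()

rng = np.random.default_rng(0)
liks = rng.uniform(0.1, 1.0, size=(50, 2))   # arbitrary positive likelihoods g(y_n | x)
pi1 = np.array([1.0, 0.0])                   # two different initial laws
pi2 = np.array([0.0, 1.0])
for n, lik in enumerate(liks, start=1):
    pi1, pi2 = filter_step(pi1, lik), filter_step(pi2, lik)
    tv = 0.5 * np.abs(pi1 - pi2).sum()
    assert tv <= rho ** n + 1e-12            # the initial condition is forgotten geometrically
```

For this chain σ− = 0.3 and σ+ = 0.7, so ρ = 4/7, and the filters coincide to machine precision after a few dozen observations regardless of the observed sequence.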

C.2.4  Assumption A11: H-score of conditional density given the infinite past

Ensuring that we may define y_1 ↦ p(y_1|Y_{−∞:0}, θ?) as an actual probability density function can be done under further domination and integrability conditions on the observation density (Condition C6).

Condition C6. Let ν_{θ?} be the probability measure from Condition C4. The following statements hold:
(a) sup_{x∈X, y∈Y} g_{θ?}(y|x) < +∞.
(b) E?[ | log ∫_X g_{θ?}(Y_1|x) ν_{θ?}(dx) | ] < +∞.

Condition C6 corresponds to a simplified statement of Assumption A13.3 in Douc et al. (2014). Under Assumption A8 with Conditions C4 and C6, Lemma 13.12 and Proposition 13.5 from Douc et al. (2014) show that y_1 ↦ log p(y_1|Y_{−∞:0}, θ?) = lim_{m→+∞} log p(y_1|Y_{−m+1:0}, θ?) exists and defines an actual log-density, P?-almost surely. The P?-almost sure twice differentiability of y_1 ↦ log p(y_1|Y_{−∞:0}, θ?) follows from the uniform convergence of the first two derivatives of y_1 ↦ log p(y_1|Y_{−m+1:0}, θ?) as m → +∞ (e.g. Theorem 7.17 from Rudin, 1964), which can be proved using Proposition 1 and the domination conditions from C5. In other words, under Assumptions A2 and A8, Conditions C4 to C6 ensure that Assumption A11 holds. This is stated by the following Lemma.

Lemma 3. Assume A2, A8, C4, C5, and C6. Then, P?-almost surely, there exists a continuous probability density function x_1 ↦ p(x_1|Y_{−∞:0}, θ?) = lim_{m→+∞} p(x_1|Y_{−m+1:0}, θ?) with respect to ν_{θ?}. Define the function

y_1 ↦ p(y_1|Y_{−∞:0}, θ?) = ∫ g_{θ?}(y_1|x_1) p(x_1|Y_{−∞:0}, θ?) ν_{θ?}(dx_1).

Then, P?-almost surely, p(Y_1|Y_{−∞:0}, θ?) = p(y_1|Y_{−∞:0}, θ?)|_{y_1=Y_1}, and y_1 ↦ p(y_1|Y_{−∞:0}, θ?) is the conditional density with respect to the Lebesgue measure of Y_1 given the σ-algebra generated by (Y_{−m})_{m∈N} under P?. Moreover, P?-a.s., y_1 ↦ log p(y_1|Y_{−∞:0}, θ?) = lim_{m→+∞} log p(y_1|Y_{−m+1:0}, θ?) exists and is twice differentiable on Y, with

∂ log p(y_1|Y_{−∞:0}, θ?)/∂y_1 = lim_{m→+∞} ∂ log p(y_1|Y_{−m+1:0}, θ?)/∂y_1,

∂² log p(y_1|Y_{−∞:0}, θ?)/∂y_1² = lim_{m→+∞} ∂² log p(y_1|Y_{−m+1:0}, θ?)/∂y_1²,

and

H(y_1, p(dy_1|Y_{−∞:0}, θ?)) = 2 ∂² log p(y_1|Y_{−∞:0}, θ?)/∂y_1² + ( ∂ log p(y_1|Y_{−∞:0}, θ?)/∂y_1 )²,

for all y_1 ∈ Y. The proof of Lemma 3 is provided in the supplementary material.
