BAYESIAN MODEL SELECTION

Arto Luoma
Department of Mathematics and Statistics, FI-33014 University of Tampere, FINLAND
[email protected]

ABSTRACT

Traditionally, Bayes factors, posterior odds and posterior model probabilities are used in Bayesian model selection. This approach, however, has the problem that the conclusions are often too sensitive to the prior specifications. Another approach is to use discrepancy measures in model comparison and posterior predictive checks in the assessment of model adequacy. Both approaches are briefly summarized, and an illustrative example is provided.

1. INTRODUCTION

The Bayesian paradigm provides a coherent approach to statistical analysis. In this paradigm all statistical conclusions are based on the posterior distribution, which combines the prior information with the information given by the data. The Bayesian model includes a prior distribution for the parameters and a sampling model represented by the likelihood function. The results of a Bayesian analysis are sometimes sensitive to the prior distribution, and it is therefore beneficial to carry out sensitivity analysis with respect to the priors. However, the choice of a correct sampling model is probably more crucial and is the main focus of this paper.

One approach is to consider a set of competing models and give them prior probabilities. Each of these models has its own likelihood function and prior distribution. Using the Bayes formula, it is possible to compute the posterior probabilities of these models. These probabilities can be used to select the best model, or they can serve as weights when the predictions of the individual models are averaged. Bayesian hypothesis testing, in which two models are compared, is also based on the idea of giving probabilities to models. This approach may be criticized on the grounds that it assumes one of the models to be true. A more realistic view would consider statistical models as approximations to the true data-generating process. A practical problem with this approach is that the posterior model probabilities are very sensitive to the choice of parametric priors, especially when the priors are uninformative.

Another approach is to use discrepancy measures to compare models. In this approach the goal is to choose the model which is closest to the true model. The adequacy of the selected model may then be checked with posterior predictive checks. This approach is preferred by Gelman et al. [1], for example.
In the following section these two approaches are summarized. In Section 3, an illustrative example on interest rate modelling is given. Finally, conclusions are drawn in Section 4.

2. ALTERNATIVE APPROACHES TO BAYESIAN MODELLING

2.1. Preliminaries

The results of a Bayesian statistical analysis can be summarized using the posterior distribution, which combines the prior information and the information provided by the data. The posterior density function of an unknown parameter (vector) θ can be formally calculated using the Bayes formula

    p(θ|y) = p(θ)p(y|θ) / ∫ p(θ)p(y|θ) dθ,   (1)
where y denotes the data, p(y|θ) the likelihood function and p(θ) the prior density. Possible prior information can be incorporated in the prior distribution. In the absence of prior information, or when one wants to let the data speak for themselves, uninformative prior distributions can be used. The choice of an uninformative prior is not unique and is somewhat controversial. However, the role of the prior distribution decreases, and in most cases becomes insignificant, as the data set grows larger.
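As a concrete illustration of (1), the posterior can be computed numerically when θ is low-dimensional. The following sketch (in Python; the data and the N(0,1) prior are illustrative choices for this example, not taken from the paper) normalizes prior times likelihood on a grid:

```python
import numpy as np

# Grid approximation of (1): posterior is proportional to prior * likelihood,
# normalized numerically. Model: y_i ~ N(theta, 1) with a N(0, 1) prior.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=50)              # illustrative data

theta = np.linspace(-3.0, 3.0, 1001)                     # parameter grid
log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)    # N(0, 1), log scale
log_lik = np.array([np.sum(-0.5 * (y - t)**2 - 0.5 * np.log(2 * np.pi))
                    for t in theta])

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())        # stabilize before exponentiating
post /= post.sum() * (theta[1] - theta[0])      # approximate the integral in (1)

print("posterior mean:", np.sum(theta * post) * (theta[1] - theta[0]))
```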
2.2. The full Bayesian probability approach to model comparison

The Bayesian formalism can be extended to model selection. Assume that we have a finite set of competing models, M = {M_k, k = 1, ..., K}, which have prior probabilities Pr(M_k) and parameters (or parameter vectors) θ_k. Let p(y|θ_k, M_k), p(θ_k|M_k) and p(θ_k|y, M_k) denote the likelihood function, prior distribution and posterior distribution, respectively, related to model M_k. Then the posterior probability of M_k is given by the Bayes formula

    Pr(M_k|y) = Pr(M_k)p(y|M_k) / ∑_{j=1}^{K} Pr(M_j)p(y|M_j),   (2)

where p(y|M_k) is the marginal likelihood of model M_k, obtained from the likelihood by averaging over the prior: p(y|M_k) = ∫ p(θ_k|M_k)p(y|θ_k, M_k) dθ_k. Note that in the case of equal prior probabilities for the models, the posterior model probabilities are proportional to the marginal likelihoods.

The standard Bayesian analogue of frequentist hypothesis testing is to calculate the Bayes factor

    B_10 = p(y|M_1) / p(y|M_0)   (3)

when the null hypothesis model M_0 is tested against its alternative M_1. This may be motivated by observing that the Bayes factor is equal to the posterior odds in favour of M_1 when the prior odds are 1:1. The posterior odds are given by

    Pr(M_1|y) / Pr(M_0|y) = [p(y|M_1) / p(y|M_0)] × [Pr(M_1) / Pr(M_0)].   (4)
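The prior sensitivity discussed below is easy to demonstrate in a toy comparison (an illustrative construction, not from the paper): M_0 fixes θ = 0 while M_1 gives θ a N(0, τ²) prior, for data y_i ~ N(θ, 1). The marginal likelihoods are then available in closed form via the sufficient statistic ȳ, and B_10 collapses towards zero as the prior scale τ grows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 100
y = rng.normal(loc=0.2, scale=1.0, size=n)
ybar = y.mean()

# M0: theta = 0 vs M1: theta ~ N(0, tau^2); data are N(theta, 1).
# ybar is sufficient: ybar ~ N(0, 1/n) under M0 and N(0, tau^2 + 1/n) under M1,
# so the Bayes factor (3) reduces to a ratio of two normal densities.
for tau in (0.1, 1.0, 10.0, 100.0):
    b10 = norm.pdf(ybar, 0, np.sqrt(tau**2 + 1/n)) / norm.pdf(ybar, 0, np.sqrt(1/n))
    print(f"tau = {tau:6.1f}:  B10 = {b10:.3g}")
# As tau grows, B10 -> 0 for any fixed data (Lindley's paradox): a vaguer
# prior under M1 automatically favours the point-null model M0.
```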
Jeffreys [2] provides guidelines for the interpretation of Bayes factors, and Raftery [3] reproduces a slightly altered version of them.

There are, however, philosophical and practical problems associated with the full Bayesian probability approach described above. A philosophical problem is that one assumes one of the compared models to be true. If one cannot make this assumption, the results cannot be easily interpreted. The problem is magnified if all the compared models fit clearly poorly. In any case, one should do model checking and try to improve the models if there are obvious discrepancies between them and the data.

A practical difficulty is that the marginal likelihoods, and the corresponding model probabilities, are very sensitive to the choice of priors. This will be illustrated in Section 3. The problem does not disappear as the data set becomes large, as it does in the case of parameter estimation. The problem is especially difficult when noninformative prior distributions are needed: improper priors cannot be used at all, and the results are strongly affected by the dispersion parameters of the priors. To circumvent this problem, several alternatives to Bayes factors have been suggested, such as partial Bayes factors, intrinsic Bayes factors and fractional Bayes factors (for a review, see Kadane and Lazar [4] or Chapter 7 in O'Hagan and Forster [5]). These are based on the idea of using a part of the data as a training set, and are therefore controversial, departing from the coherent Bayesian approach.

The use of information criteria, such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC), provides an alternative and easier approach to model selection. Finding the model minimizing the BIC is approximately equivalent to maximizing the marginal likelihood p(y|M_k). Using the asymptotic multivariate normality of posterior distributions and the Laplace method to approximate integrals, one obtains

    p(y|M_k) ≈ p(θ̂_k|M_k) p(y|θ̂_k, M_k) (2π)^{d_k/2} |n⁻¹V_k|^{1/2},

where θ̂_k is a consistent point estimate of θ_k, d_k the number of parameters in M_k, and n⁻¹V_k the asymptotic posterior dispersion matrix of θ_k (for further details, see [5]). Taking the logarithm, multiplying by −2 and ignoring the terms −2 log p(θ̂_k|M_k) − d_k log(2π) − log|V_k|, which are O(1), one obtains the BIC:

    BIC_k = −2 log p(y|θ̂_k, M_k) + d_k log(n).   (5)

Conversely, one can use exp(−BIC/2) as an approximation to the marginal likelihood, but one should remember that the effect of the prior distribution is then ignored.
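A minimal numerical sketch of (5), reusing the toy normal comparison above (the models, data and point estimates are illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
y = rng.normal(loc=0.2, scale=1.0, size=n)

# BIC_k = -2 log p(y | theta_hat_k, M_k) + d_k log n, cf. (5).
def bic(loglik_max, d, n):
    return -2.0 * loglik_max + d * np.log(n)

# M1: y_i ~ N(theta, 1), theta free (d = 1); the MLE is the sample mean.
ll1 = np.sum(-0.5 * (y - y.mean())**2 - 0.5 * np.log(2 * np.pi))
# M0: theta fixed at 0 (d = 0).
ll0 = np.sum(-0.5 * y**2 - 0.5 * np.log(2 * np.pi))

bic1, bic0 = bic(ll1, 1, n), bic(ll0, 0, n)
# exp(-BIC/2) approximates the marginal likelihood with the prior ignored,
# so exp((BIC_0 - BIC_1) / 2) is a rough stand-in for B10.
print("approx B10 =", np.exp((bic0 - bic1) / 2))
```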
2.3. Bayesian measures of model fit

When one believes that reality is infinitely complex and cannot be perfectly described using simple models, it is logical to use model selection criteria which measure goodness of fit. Model fit can be summarized using the deviance, defined as

    D(y, θ) = −2 log p(y|θ).   (6)
The expected deviance, computed by averaging over the true sampling distribution f(y), is 2 times the Kullback-Leibler (K-L) divergence of p(y|θ) from f(y), up to the constant ∫ f(y) log f(y) dy. Thus, the model with the minimum expected deviance has the minimum K-L divergence from the true model. The 'true' value of θ which minimizes the K-L divergence of p(y|θ) from f(y) is unknown, and replacing θ with a point estimate would give an overly optimistic view of the model. From a Bayesian perspective, it seems reasonable to use the posterior mean of the deviance,

    D_avg = E(D(y, θ)|y),   (7)
as a measure of model fit (see Spiegelhalter et al. [6] and Gelman et al. [1]). This can be estimated as

    D̂_avg(y) = (1/L) ∑_{l=1}^{L} D(y, θ_l),   (8)

where θ_l, l = 1, ..., L, are posterior simulations.

The difference between the posterior mean deviance (8) and the deviance at θ̂,

    p_D = D̂_avg(y) − D(y, θ̂(y)),   (9)

represents the decrease in the deviance expected from estimating θ and can be used as a measure of the effective number of parameters. In nonhierarchical models it is usually approximately equal to the number of free parameters. When the goal is to find a model with optimal out-of-sample performance, one can estimate the expected deviance for replicated data,

    D̂_avg^pred(y) = E[D(y^rep, θ̂(y))],   (10)
where y and y^rep are two independent data sets, θ̂(y) a parameter estimate, and the expectation averages over the unknown true sampling distribution. Note, however, that the use of a point estimate θ̂ in prediction departs from the usual Bayesian practice of averaging over the posterior distribution. It can be shown that the expected predictive deviance (10) can be approximately estimated by the so-called deviance information criterion (DIC), defined as

    DIC = D(y, θ̂(y)) + 2 p_D.   (11)

The expression can be derived for normal models or in the limit of large sample sizes (see Spiegelhalter et al. [6]). In simple models the DIC is asymptotically equal to the AIC, but it is more general and can also be applied to hierarchical models, where the number of parameters is not clearly defined.
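Given posterior simulations, (8), (9) and (11) reduce to a few lines of code. A sketch for an illustrative normal model (an assumption for this example) whose posterior is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
y = rng.normal(loc=0.2, scale=1.0, size=n)

# Deviance D(y, theta) = -2 log p(y | theta) for the N(theta, 1) model, cf. (6).
def deviance(y, theta):
    return -2.0 * np.sum(-0.5 * (y - theta)**2 - 0.5 * np.log(2 * np.pi))

# Closed-form posterior of theta under a N(0, 1) prior: N(m, v).
v = 1.0 / (n + 1.0)
m = n * y.mean() * v
draws = rng.normal(m, np.sqrt(v), size=4000)        # posterior simulations

d_avg = np.mean([deviance(y, t) for t in draws])    # (8): posterior mean deviance
d_hat = deviance(y, draws.mean())                   # deviance at the point estimate
p_d = d_avg - d_hat                                 # (9): effective parameters, ~1
dic = d_hat + 2.0 * p_d                             # (11)
print(f"p_D = {p_d:.2f}, DIC = {dic:.1f}")
```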
2.4. Model checking and model improvement

An important part of Bayesian statistical modelling is to check that the model corresponds to reality in the important aspects. If the model is far from reality, the conclusions of the analysis may be misleading. In such a case one can try to improve the model by adding meaningful parameters. In the following, the approach to model checking explained in Gelman and Meng [7] and in Chapter 6 of Gelman et al. [1] is summarized.

In model checking one can use test quantities or discrepancy measures, which correspond to test statistics in classical analysis. If a test quantity computed from the original data set obtains an extreme value compared to its posterior predictive distribution, this indicates a discrepancy between the model and the data. Bayesian test quantities are functions of the data and the unknown parameters, and are thus more general than classical test statistics, which can depend on the data alone. This allows more direct comparisons between sample and population characteristics.

Model checking can easily be accomplished by simulating posterior predictive distributions. One first simulates a sample θ_1, ..., θ_L from the posterior distribution and then L hypothetical replicated data sets y_1^rep, ..., y_L^rep corresponding to these parameter values. Thus y^rep has the distribution p(y^rep|y) = ∫ p(y^rep|θ)p(θ|y) dθ. If the model fits the data reasonably well, the replications y_l^rep should resemble the observed data y. One can check this by choosing a discrepancy variable T(y, θ) and calculating the proportion of cases in which the discrepancy based on the replications exceeds that based on the observed data:
    estimated p-value = (1/L) ∑_{l=1}^{L} I(T(y_l^rep, θ_l) ≥ T(y, θ_l)),

where I(·) is the indicator function, taking the value 1 when its argument is true and 0 otherwise. When the test quantity depends only on the data, one can plot a histogram of its posterior predictive distribution and compare it to the observed value T(y). When the test quantity also depends on the unknown parameters, one can illustrate the test by plotting a scatterplot of T(y, θ_l) versus T(y_l^rep, θ_l) on the same scale. A good fit is indicated if about half of the points fall above the 45° line and half below.
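The whole check takes only a few lines once posterior draws are available. A sketch, again for the illustrative normal model used above, with the range as the test quantity (as in Section 3):

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 100, 2000
y = rng.normal(loc=0.2, scale=1.0, size=n)

# Posterior draws of theta under the N(0, 1) prior (closed form, as above).
v = 1.0 / (n + 1.0)
m = n * y.mean() * v
thetas = rng.normal(m, np.sqrt(v), size=L)

# One replicated data set per posterior draw: y_rep_l ~ p(y | theta_l).
y_rep = rng.normal(loc=thetas[:, None], scale=1.0, size=(L, n))

T_obs = y.max() - y.min()                        # range of the observed data
T_rep = y_rep.max(axis=1) - y_rep.min(axis=1)    # range of each replication

# Estimated posterior predictive p-value: share of replications with
# T(y_rep) >= T(y); values near 0 or 1 signal a model-data discrepancy.
print("p-value:", np.mean(T_rep >= T_obs))
```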
Figure 1. Eurepo 3-month interest rate.

3. EXAMPLE

The discussion above will be illustrated using a 3-month Eurepo interest rate series consisting of 1471 daily observations from 4 March 2002 until 6 December 2007. Eurepo is the rate at which one prime bank offers funds in euro to another prime bank if, in exchange, the former receives Eurepo GC from the latter as collateral. It is a good benchmark for secured money market transactions in the euro zone. The series is presented in Figure 1.

In this example, five interest rate models will be estimated and compared. These models are a subset of those compared by Chan et al. [8] and are listed below with their stochastic differential equations (SDEs); a simulation sketch follows the list.

1. General: dr = (α + βr) dt + σr^γ dZ
2. CIR:     dr = (α + βr) dt + σr^{1/2} dZ
3. Vasicek: dr = (α + βr) dt + σ dZ
4. GBM:     dr = βr dt + σr dZ
5. BM:      dr = α dt + σ dZ
The Cox-Ingersoll-Ross (CIR) and Vasicek models are typical examples of interest rate models with mean-reverting behaviour, while the geometric Brownian motion (GBM) and Brownian motion (BM) are more often used to model stock prices. The first model is the most general, and the other models are its special cases, except that α is restricted to be positive and β to be negative in Models 1-3 to ensure mean reversion, while in Models 4 and 5 these parameters are unrestricted.

Using the Euler discretization of the SDEs, the likelihood of Model 1 can be written as

    ∏_{i=1}^{N} (2πσ²r_{i-1}^{2γ}δ)^{-1/2} exp{ −[r_i − r_{i-1} − (α + βr_{i-1})δ]² / (2σ²r_{i-1}^{2γ}δ) },

where r_0, ..., r_N are the observations, given in percentages, and δ = 1/255 is the discretization interval (there are around 255 observations per year). The standard exponential distribution Exp(1) is used as the prior distribution for the parameters α, −β, γ and σ, except that the standard normal distribution N(0,1) is used for β in Model 4 and for α in Model 5. These priors are assumed to be independent, and they are fairly uninformative, since the posterior distributions are concentrated close to zero, as Table 1 indicates. The models were estimated using the Metropolis algorithm for Markov chain Monte Carlo (MCMC) simulation.
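A condensed sketch of such an estimation for Model 1, combining the Euler-discretized log-likelihood with the Exp(1) priors in a random-walk Metropolis sampler (the starting values, step sizes and the synthetic stand-in series are illustrative assumptions, not the settings behind the paper's results):

```python
import numpy as np

def log_lik(r, alpha, beta, gamma, sigma, delta=1/255):
    """Log of the Euler-discretized likelihood of Model 1."""
    m = r[:-1] + (alpha + beta * r[:-1]) * delta       # conditional mean
    v = sigma**2 * r[:-1]**(2 * gamma) * delta         # conditional variance
    return np.sum(-0.5 * np.log(2 * np.pi * v) - (r[1:] - m)**2 / (2 * v))

def log_post(r, th):
    alpha, beta, gamma, sigma = th
    if alpha <= 0 or beta >= 0 or gamma <= 0 or sigma <= 0:
        return -np.inf                                 # outside prior support
    # Exp(1) priors on alpha, -beta, gamma and sigma.
    return -(alpha - beta + gamma + sigma) + log_lik(r, alpha, beta, gamma, sigma)

def metropolis(r, th0, step, n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    step = np.asarray(step, float)
    th = np.asarray(th0, float)
    lp = log_post(r, th)
    out = np.empty((n_iter, th.size))
    for i in range(n_iter):
        prop = th + step * rng.standard_normal(th.size)   # symmetric random walk
        lp_prop = log_post(r, prop)
        if np.log(rng.random()) < lp_prop - lp:           # Metropolis acceptance
            th, lp = prop, lp_prop
        out[i] = th
    return out

# r would be the observed Eurepo series in percent; a synthetic stand-in here.
rng = np.random.default_rng(1)
r = np.maximum(3.0 + np.cumsum(0.1 * np.sqrt(1/255) * rng.standard_normal(1471)), 0.1)

draws = metropolis(r, th0=(0.2, -0.05, 0.7, 0.1), step=(0.05, 0.02, 0.05, 0.005))
print("posterior means (alpha, beta, gamma, sigma):", draws[5000:].mean(axis=0))
```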
Table 1. Posterior means of the parameters. Posterior standard deviations are given in parentheses.

Process  | α             | β              | γ             | σ
General  | 0.200 (0.119) | -0.047 (0.040) | 0.692 (0.087) | 0.096 (0.008)
CIR      | 0.205 (0.122) | -0.044 (0.040) | -             | 0.115 (0.002)
Vasicek  | 0.233 (0.132) | -0.042 (0.038) | -             | 0.192 (0.004)
GBM      | -             |  0.035 (0.029) | -             | 0.071 (0.001)
BM       | 0.121 (0.081) | -              | -             | 0.192 (0.004)

Table 2. Measures of fit and posterior probabilities for the compared models.

Process  | D̂_avg   | DIC     | BIC     | Pr(BIC) | Pr
General  | -8889.5 | -8885.5 | -8864.3 | 0.11    | 0.37
CIR      | -8884.9 | -8881.7 | -8866.2 | 0.29    | 0.55
Vasicek  | -8821.2 | -8817.8 | -8802.7 | 0.00    | 0.00
GBM      | -8880.2 | -8878.2 | -8867.6 | 0.60    | 0.08
BM       | -8822.6 | -8820.6 | -8810.1 | 0.00    | 0.00
Table 2 reports the values of the model fit criteria introduced in this paper, as well as the posterior model probabilities, assuming equal prior probabilities for the models. The posterior model probabilities were calculated in two ways: first, using the BIC to approximate the marginal likelihoods (column Pr(BIC)), and second, estimating them with reciprocal importance sampling (column Pr).

As may be seen, the general model fits best and the CIR model second best according to the D̂_avg and DIC criteria. Surprisingly, the Vasicek model performs more poorly than the GBM and BM models, which are not so commonly used for interest rate modelling. The AIC values are not reported, but they are approximately the same as the DIC values. The BIC penalizes the number of parameters more severely, and the general model, with 4 parameters, obtains a poorer BIC value than the CIR model, with 3 parameters. The GBM model, with 2 parameters, turns out to be the best according to the BIC.

There is a big difference between the two methods of calculating the posterior model probabilities. This highlights the role of the prior distributions. The posterior probability of the GBM is 0.60 when calculated using the BIC approximation, but only 0.08 when the accurate value is estimated using simulation. This may be explained by the fact that in the case of the GBM, β has the standard normal prior, whose density is around 1/√(2π) ≈ 0.4 in the neighbourhood of zero, while in the other models α and −β have the Exp(1) prior, whose density is around 1 in this neighbourhood.

Finally, posterior predictive checking is illustrated. In Figure 2, four replications of the interest rate series are shown, simulated from the posterior predictive distribution under Model 1. As may be seen, the original series in Figure 1 appears to be more rigid than the simulated series. The use of test quantities is illustrated using the range T(r) = r_max − r_min as a test statistic. The simulated posterior predictive distribution of the range is shown in Figure 3, and the test statistic computed from the observed data is indicated by a vertical line. As may be seen, the observed range (2.35) is larger than would be expected under a well-fitting model.
Figure 2. Four replicated interest rate series, simulated from their posterior predictive distribution.

4. CONCLUSION

In this paper, two alternative views of Bayesian model selection were presented, and an example of interest rate modelling was provided. In the example, the different criteria led to substantially different conclusions regarding the ranking of the models, even though the number of observations was large. Furthermore, it was demonstrated that the posterior probabilities were very sensitive to the choice of priors. It is suggested that, instead of assigning probabilities to models, one should compare them by measuring their distance to the data. If possible, one should replace a discrete set of models with an expanded continuous family of models.

5. REFERENCES

[1] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, Chapman & Hall/CRC, Boca Raton, 2004.

[2] H. Jeffreys, Theory of Probability, Oxford University Press, Oxford, 1961.
Figure 3. Posterior predictive distribution of the range, and the observed value.

[3] A. E. Raftery, "Hypothesis testing and model selection," in Markov Chain Monte Carlo in Practice, London, UK, 1996, pp. 163–187.

[4] J. B. Kadane and N. A. Lazar, "Methods and criteria for model selection," Journal of the American Statistical Association, vol. 99, pp. 279–290, March 2004.

[5] A. O'Hagan and J. Forster, Bayesian Inference, Arnold, London, 2004.

[6] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde, "Bayesian measures of model complexity and fit (with discussion)," Journal of the Royal Statistical Society: Series B, vol. 64, pp. 583–639, 2002.

[7] A. Gelman and X.-L. Meng, "Model checking and model improvement," in Markov Chain Monte Carlo in Practice, London, UK, 1996, pp. 163–187.

[8] K. C. Chan, G. A. Karolyi, F. A. Longstaff, and A. B. Sanders, "An empirical comparison of alternative models of the short-term interest rate," The Journal of Finance, vol. 47, pp. 1209–1227, 1992.