ISBA 2000, Proceedings, pp. 000–000. © ISBA and Eurostat, 2001

Compatible Prior Distributions

A. PHILIP DAWID and STEFFEN L. LAURITZEN
University College London, UK and Aalborg University, Denmark

SUMMARY

We investigate two approaches to constructing compatible prior laws over alternative models: ‘projection’ and ‘conditioning’. Each of these is shown to require additional inputs. We suggest that these can be chosen in a natural way in each case, leading to ‘Kullback-Leibler projection’ and ‘Jeffreys conditioning’. We recommend the former for the case of coexisting models, and the latter for competing models.

Keywords: BAYES FACTOR; COMPATIBLE PRIORS; CONDITIONING; JEFFREYS CONDITIONING; KULLBACK-LEIBLER PROJECTION; MARGINALISATION; MODEL AVERAGING; MODEL SELECTION; PROJECTION.

1. INTRODUCTION

Suppose we wish to compare two models, M and M0, for the same observable X. Let M be parametrised by θ, with model densities f(x | θ), and M0 by φ, with model densities f0(x | φ). Within each model, we have a prior density, π(θ) and π0(φ) respectively, representing uncertainty about its parameter conditional on the model. If we observe X = x, the impact of this on model uncertainty can be isolated in the corresponding Bayes factor:

\[
B(M : M_0) := \frac{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}{\int f_0(x \mid \phi)\,\pi_0(\phi)\,d\phi}.
\]
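As a minimal numerical illustration of this definition, the sketch below approximates both marginal likelihoods by simple Monte Carlo, averaging the likelihood over draws from each prior. The models, priors and data are toy choices made only for this sketch (assuming Python with numpy and scipy): M places a N(0, 1) prior on a normal mean θ, while M0 is taken, for simplicity, as the point submodel θ = 0, whose marginal likelihood needs no integration.

```python
# Sketch: Monte Carlo approximation of a Bayes factor B(M : M0) for a toy pair
# of models.  Assumptions (ours): under M, X_i ~ N(theta, 1) with prior
# theta ~ N(0, 1); M0 is the point submodel theta = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=20)                 # toy data

theta_draws = rng.normal(loc=0.0, scale=1.0, size=50_000)   # draws from the prior under M
loglik = stats.norm.logpdf(x[:, None], loc=theta_draws, scale=1.0).sum(axis=0)
log_marg_M = np.logaddexp.reduce(loglik) - np.log(len(theta_draws))

log_marg_M0 = stats.norm.logpdf(x, loc=0.0, scale=1.0).sum()  # no integration needed

print("estimated Bayes factor B(M : M0) =", np.exp(log_marg_M - log_marg_M0))
```

More refined Monte Carlo schemes, such as those of McCulloch and Rossi (1992) mentioned in Section 3.1, are needed when neither integral is this tractable.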

In general there is no compelling reason to relate the prior densities over different models, even when one is a submodel of another. However, the task of assessing a prior density for a model parameter is difficult enough when we only have a single model to deal with, and we might hope that, having once conducted this exercise, we can use the resulting distribution to assist in assigning an appropriate prior over a different model. Furthermore, it is known that Bayes factors can be quite sensitive to the specific choice of priors. It would lend some degree of objectivity to the comparison if the priors over the different models were chosen to be, in some sense still to be made precise, as similar as possible; we shall then call them compatible. In this case we can argue that the Bayes factor is truly responding to the data, rather than merely reflecting prior prejudices. Our purpose in this paper is to investigate possible explications of this informal concept of ‘compatibility’. For simplicity, we restrict attention to the case that M0 is a lower-dimensional submodel of M, and examine two methods that have commonly been used


in this case: ‘projection’ and ‘conditioning’. We point out that neither method is uniquely defined, each depending on additional inputs. We suggest that there are natural choices for these, leading to ‘Kullback-Leibler projection’ and ‘Jeffreys conditioning’. And we make tentative recommendations as to when each of these methods might be appropriate.

2. EXAMPLES

We focus attention on two simple examples, each involving a random sample of n observations on a pair of outcome variables. In the first example the outcome variables have a joint Gaussian distribution, while in the second they are binary. Our notation for multivariate and matrix-variate distributions follows Dawid (1981).

Example 1 [Two Gaussian variables]. Our data arise as n independent copies of a pair of continuous variables X = (X0, X1). Under model M, X is assumed to follow a bivariate Gaussian distribution: X ~ N2(0, Σ), with arbitrary dispersion matrix
\[
\Sigma = \begin{pmatrix} \sigma_{00} & \sigma_{01} \\ \sigma_{10} & \sigma_{11} \end{pmatrix}.
\]

The standard conjugate prior for Σ in this model is the inverse Wishart distribution: Σ ~ IW(δ; Φ). That is, the concentration matrix K := Σ⁻¹ is assumed to have the Wishart distribution, K ~ W2(δ + 1; Φ⁻¹), with density, with respect to Lebesgue measure on the set of positive definite matrices,
\[
f(K) \propto \{\det(K)\}^{\frac{1}{2}\delta - 1} \exp\{-\mathrm{tr}(K\Phi)/2\}. \tag{1}
\]
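For readers who wish to experiment with this prior numerically, the following sketch draws Σ from IW(δ; Φ) using scipy. The correspondence assumed here between the IW(δ; Φ) notation above and scipy's parametrisation is df = δ + 1 and scale = Φ for the 2 × 2 case, matching the statement K ~ W2(δ + 1; Φ⁻¹); the particular values of δ and Φ are arbitrary choices for the sketch.

```python
# Sketch: drawing from the inverse Wishart prior of Example 1.
# Assumption: with K := Sigma^{-1} ~ W2(delta+1, Phi^{-1}), the scipy call is
# invwishart(df=delta+1, scale=Phi).
import numpy as np
from scipy.stats import invwishart

delta = 5.0
Phi = np.array([[2.0, 0.5],
                [0.5, 1.0]])

rng = np.random.default_rng(1)
Sigma_draws = invwishart(df=delta + 1, scale=Phi).rvs(size=50_000, random_state=rng)

# Monte Carlo check of the marginal sigma_00 ~ phi_00 / chi^2_delta
# (this marginal reappears in Example 3 below); compare means:
sigma00 = Sigma_draws[:, 0, 0]
print("empirical mean of sigma_00:", sigma00.mean(),
      "  theoretical phi_00/(delta-2):", Phi[0, 0] / (delta - 2))
```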

We take as M0 the model under which X0 and X1 are independent, each still being normal with zero mean. This model may be parametrised by (γ0, γ1), where γi := var(Xi). It can also be described as the submodel of M under which Σ satisfies the restriction σ01 = 0, or, equivalently, K satisfies k01 = 0. The question is: Which prior distribution for (γ0, γ1) in M0 ‘corresponds naturally’ to the inverse Wishart distribution IW(δ; Φ) specified for Σ in M? As we shall see, this question can be answered in a variety of ways. □

Example 2 [Two binary variables]. Consider now n independent copies of a pair of binary variables I and J. Under model M, the associated probabilities
\[
\pi = \begin{pmatrix} \pi_{00} & \pi_{01} \\ \pi_{10} & \pi_{11} \end{pmatrix},
\]
where πij := P(I = i, J = j), are arbitrary, subject only to πij > 0 and
\[
\sum_{ij} \pi_{ij} = 1. \tag{2}
\]

The standard conjugate prior for π in this model is the Dirichlet distribution D(α), where
\[
\alpha = \begin{pmatrix} \alpha_{00} & \alpha_{01} \\ \alpha_{10} & \alpha_{11} \end{pmatrix}.
\]

Then the prior density of π (with respect to Lebesgue measure on the simplex (2)) is
\[
f(\pi) \propto \prod_{ij} \pi_{ij}^{\alpha_{ij} - 1}.
\]
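A corresponding numerical sketch for this example draws the 2 × 2 table π from D(α) with numpy (the value of α below is an arbitrary illustration) and examines the induced prior on the row total π0+ = π00 + π01, which, by the aggregation property of the Dirichlet, is β(α0+, α1+); this induced marginal reappears in Examples 5 and 7 below.

```python
# Sketch: sampling the 2x2 cell-probability table pi from the Dirichlet prior
# D(alpha) of Example 2, and checking the induced prior on the row total
# pi_{0+} against its Beta(alpha_{0+}, alpha_{1+}) aggregation marginal.
import numpy as np
from scipy.stats import beta

alpha = np.array([[1.5, 2.0],
                  [1.0, 3.5]])            # alpha, arranged as the 2x2 table

rng = np.random.default_rng(2)
pi = rng.dirichlet(alpha.ravel(), size=100_000).reshape(-1, 2, 2)

row_total = pi[:, 0, :].sum(axis=1)        # pi_{0+} = pi_00 + pi_01 for each draw
a0, a1 = alpha[0].sum(), alpha[1].sum()    # alpha_{0+} and alpha_{1+}
print("empirical mean of pi_{0+}:", row_total.mean(),
      "  Beta(a0, a1) mean:", beta(a0, a1).mean())
```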

We again take as M0 the model in which I and J are independent. We can parametrise this by λ := P(I = 0) and μ := P(J = 0). We can also describe M0 as the submodel of M for which π satisfies the restriction: for all i and j, πij = πi+ π+j, where πi+ := Σj πij and π+j := Σi πij. Again the question is: Which prior distribution for (λ, μ) in M0 ‘corresponds naturally’ to the Dirichlet distribution D(α) specified for π in M? And again, this question can be answered in a variety of ways. □

3. PROJECTION

A statistical model parameter can be regarded either as an abstract label, identifying one out of many possible distributions in the model, or, in some cases at least, more concretely as a direct measure of some aspect of that distribution: for example, its mean, or correlation. If we take the latter approach then, since M0 is a submodel of M, we might continue to describe any distribution in M0 by the same parameter θ we used for M. In fact, because M0 has lower dimension, only a subvector θ0 of θ will typically be required. We can then use the original prior distribution for θ to induce a marginal distribution for θ0, thus supplying a ‘compatible’ prior for M0. The following example demonstrates that this seemingly ‘natural’ construction is in fact non-unique, being dependent on exactly how we choose to interpret our parametrisation.

Example 3. In Example 1, there are two equally ‘obvious’ ways of identifying the parameters of model M0 with (a subset of) those of model M. The first sets
\[
\gamma_i = \sigma_{ii}, \qquad i = 0, 1,
\]
while the second takes
\[
\gamma_i = 1/k_{ii}, \qquad i = 0, 1.
\]
That is, in the first case we identify the variances in the two models, whereas in the second case we identify the concentrations. In the first case, using the induced joint distribution for (σ00, σ11) under the IW(δ; Φ) prior for Σ leads to
\[
\gamma_i \sim \phi_{ii}/\chi^2_{\delta},
\]
whereas in the second case we obtain
\[
\gamma_0 \sim \phi_{00\cdot 1}/\chi^2_{\delta+1}, \qquad \gamma_1 \sim \phi_{11\cdot 0}/\chi^2_{\delta+1},
\]
where
\[
\phi_{00\cdot 1} := \phi_{00} - \phi_{01}^2/\phi_{11}, \qquad \phi_{11\cdot 0} := \phi_{11} - \phi_{01}^2/\phi_{00}.
\]

In either case γ0 and γ1 are not independent, their joint distribution being complicated. We note in passing that the compatible hyper inverse Wishart priors suggested in Dawid and Lauritzen (1993) would have the same marginal distributions as in the first case, but in addition γ0 and γ1 would be independent. □

A more abstract way of thinking about the above approach is that, explicitly or implicitly, we have to specify a way of associating, with each distribution P in M, a corresponding distribution P0 in M0. Thus in the first method suggested in Example 3, P0 is the (unique) distribution in M0 having the same values for var(X0) and var(X1) as does P; in the second method, P0 has the same values for var(X1 | X0) and var(X0 | X1). Such a specification can be described by an appropriate mapping r : M → M0, such that P0 = r(P); we can think of r as a generalised ‘projection’ function. Given r, the distribution of r(P̃), when P̃ varies according to its assigned prior law in M, will supply the ‘compatible’ prior law for P̃0 over M0. Here we are using the term “law” to indicate a “distribution for a distribution”. However, as there is an almost limitless choice for the function r, the above ‘projection method’ does not, without additional input, yield a unique answer, as Example 3 makes clear.

3.1. Kullback–Leibler projection

One approach to defining a projection map r is in terms of a ‘discrepancy function’ D(P, Q) between distributions P and Q for X. We could then define the ‘minimum discrepancy’ projection map onto M0 by:

\[
r(P) := \arg\min_{Q \in M_0} D(P, Q),
\]
provided a minimiser exists and is uniquely defined. A popular discrepancy function is the Kullback–Leibler divergence,
\[
\mathrm{KL}(P, Q) := E_P\{\log p(X)/q(X)\},
\]
where p(·) denotes the density of P, etc., leading to the projection
\[
r_{\mathrm{KL}}(P) := \arg\min_{Q \in M_0} \mathrm{KL}(P, Q).
\]

This KL-projection approach was used by McCulloch and Rossi (1992), who proposed Monte Carlo methods for calculating the associated Bayes factors. Among other advantages, KL-projection does not depend on the parametrisations of M and M0 , is invariant under one-to-one transformations of the observable X , and scales by sample size for observables modelled as independent and identically distributed under both P and Q, so that, for this common case, the associated projection map does not depend on sample size.
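When no closed form is available, the KL projection can be approximated directly from its definition: estimate KL(P, Q) = E_P{log p(X)/q(X)} by averaging over draws from P, and minimise the estimate over Q ∈ M0 numerically. The sketch below does this for an illustrative correlated bivariate normal P projected onto the independence model, using scipy's general-purpose optimiser; the distributions, sample size and optimiser settings are arbitrary choices for the sketch.

```python
# Sketch: Kullback-Leibler projection by Monte Carlo.  KL(P, Q) is estimated as
# the average of log p(X)/q(X) over draws from P, then minimised over the
# independence submodel M0 (diagonal covariance, zero mean).
import numpy as np
from scipy import stats, optimize

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
P = stats.multivariate_normal(mean=[0.0, 0.0], cov=Sigma)

rng = np.random.default_rng(3)
X = P.rvs(size=50_000, random_state=rng)     # draws from P
logp = P.logpdf(X)

def estimated_kl(log_vars):
    """Monte Carlo estimate of KL(P, Q) for Q = N2(0, diag(exp(log_vars)))."""
    q = stats.multivariate_normal(mean=[0.0, 0.0], cov=np.diag(np.exp(log_vars)))
    return np.mean(logp - q.logpdf(X))

res = optimize.minimize(estimated_kl, x0=np.zeros(2), method="Nelder-Mead")
print("KL-projected variances:", np.exp(res.x), " vs diag(Sigma):", np.diag(Sigma))
```

The recovered variances approach the diagonal of Σ, in line with Example 4 below.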

Example 4. In the multivariate Gaussian case, the KL divergence between P_Σ = N(0, Σ) and P_Δ = N(0, Δ) is
\[
\mathrm{KL}(P_\Sigma, P_\Delta) = \tfrac{1}{2}\left[\mathrm{tr}(\Delta^{-1}\Sigma - I) - \log\det(\Delta^{-1}\Sigma)\right],
\]
and for Δ restricted to being diagonal this is minimised when δii = σii. That is, using KL-projection onto the model of complete independence identifies variances rather than concentrations, thus resolving the ambiguity seen in Example 3. □
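The same identification can be checked in the binary setting of Example 2, anticipating Example 5 below: minimising the KL divergence from a fixed 2 × 2 table π to the independence model numerically recovers the margins λ = π0+ and μ = π+0. The table used in the sketch is an arbitrary illustration.

```python
# Sketch: KL projection of a 2x2 probability table onto the independence model,
# by direct numerical minimisation; the minimiser reproduces the margins
# lambda = pi_{0+} and mu = pi_{+0}.
import numpy as np
from scipy.optimize import minimize

pi = np.array([[0.30, 0.20],
               [0.15, 0.35]])                     # an illustrative table from M

def kl_to_independence(z):
    lam, mu = 1 / (1 + np.exp(-z))                # unconstrained -> (0, 1)
    q = np.array([[lam * mu,       lam * (1 - mu)],
                  [(1 - lam) * mu, (1 - lam) * (1 - mu)]])
    return np.sum(pi * np.log(pi / q))            # KL(P, Q)

res = minimize(kl_to_independence, x0=np.zeros(2), method="Nelder-Mead")
lam, mu = 1 / (1 + np.exp(-res.x))
print("KL projection gives lambda, mu =", lam, mu)
print("margins pi_{0+}, pi_{+0}       =", pi[0].sum(), pi[:, 0].sum())
```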

Example 5. In Example 2, the KL-projection method relates the parameters (λ, μ) of M0 to those of M through
\[
\lambda = \pi_{0+}, \qquad \mu = \pi_{+0}.
\]
Used with the Dirichlet prior D(α) for M, this leads to
\[
\lambda \sim \beta(\alpha_{0+}, \alpha_{1+}), \qquad \mu \sim \beta(\alpha_{+0}, \alpha_{+1}).
\]
Once again, however, λ and μ are not independent in this induced distribution, in contrast to the hyper Markov specifications of Dawid and Lauritzen (1993), who suggest these marginal distributions for λ and μ under model M0, but in addition require their independence. □

4. CONDITIONING

We now consider a different general approach to constructing compatible priors. The submodel M0 can typically be derived from M by imposing a constraint on its parameter θ: say τ = τ0, for an appropriate parameter τ := τ(θ). It might then appear ‘natural’ to derive the compatible prior distribution in M0 by simply conditioning that for θ in M on τ(θ) = τ0. But, once again, this proposal turns out to be subject to ambiguity: there is typically a variety of choices for the function τ, each leading to a different answer. This phenomenon is sometimes termed the Borel–Kolmogorov paradox.
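A crude way to see this phenomenon numerically, before working through the exact calculations of Example 6, is to approximate conditioning on a measure-zero constraint by retaining prior draws that nearly satisfy it. In the sketch below (toy settings of our own; scipy parametrisation of the inverse Wishart as in the earlier sketch), the independence constraint is imposed approximately either as |σ01| < ε or as |ρ| < ε; the two retained subsamples approximate different conditional laws for σ00, matching cases (ii) and (iv) of Example 6.

```python
# Sketch: the Borel-Kolmogorov phenomenon by approximate conditioning.  Draws of
# Sigma from the IW(delta; Phi) prior are retained when they nearly satisfy the
# independence constraint, expressed either through sigma_01 or through the
# correlation rho; the two subsamples approximate different conditional laws.
import numpy as np
from scipy.stats import invwishart

delta, Phi = 5.0, np.array([[2.0, 0.0], [0.0, 1.0]])
rng = np.random.default_rng(4)
S = invwishart(df=delta + 1, scale=Phi).rvs(size=400_000, random_state=rng)

s00, s01, s11 = S[:, 0, 0], S[:, 0, 1], S[:, 1, 1]
rho = s01 / np.sqrt(s00 * s11)
eps = 0.02

mean_cov = s00[np.abs(s01) < eps].mean()   # conditioning via sigma_01 ~ 0
mean_cor = s00[np.abs(rho) < eps].mean()   # conditioning via rho ~ 0
# Example 6 gives sigma_00 ~ phi_00/chi^2_{delta+2} in case (ii) and
# phi_00/chi^2_{delta+1} in case (iv), with means phi_00/delta and phi_00/(delta-1):
print(mean_cov, "vs", Phi[0, 0] / delta, "   ", mean_cor, "vs", Phi[0, 0] / (delta - 1))
```

The approximation is rough: the acceptance regions shrink with ε, so the comparison of the two empirical means is blurred by Monte Carlo noise.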

Example 6. Consider Example 1 from the point of view of conditioning. The independence constraint X0 ⊥⊥ X1 can be expressed in any of the following ways, among others:

(i). k01 = 0
(ii). σ01 = 0
(iii). β1·0 = 0, where β1·0 := σ01/σ00 is the regression coefficient of X1 on X0
(iv). ρ = 0, where ρ := σ01/{σ00 σ11}^{1/2} is the coefficient of correlation between X0 and X1.

We shall see that we obtain different answers, depending on which of these constraints we condition on.

(i). The joint prior density π(k00, k01, k11) is given by (1). To condition on k01 = 0, we substitute this value and renormalise. We obtain
\[
\pi_0(k_{00}, k_{11}) \propto k_{00}^{\frac{1}{2}\delta - 1} \exp(-\phi_{00} k_{00}/2) \times k_{11}^{\frac{1}{2}\delta - 1} \exp(-\phi_{11} k_{11}/2),
\]
i.e., in terms of γ0 = k00⁻¹, γ1 = k11⁻¹,
\[
\gamma_0 \sim \phi_{00}/\chi^2_{\delta}, \qquad \gamma_1 \sim \phi_{11}/\chi^2_{\delta}, \quad \text{independently.}
\]

(ii). Making a change of variables in (1), we find that the joint prior density π(σ00, σ01, σ11) is proportional to
\[
(\det \Sigma)^{-\frac{1}{2}\delta - 2} \exp\{-\mathrm{tr}(\Sigma^{-1}\Phi)/2\}. \tag{3}
\]
Restricting this to σ01 = 0 yields the conditional joint density
\[
\pi_0(\sigma_{00}, \sigma_{11}) \propto \sigma_{00}^{-\frac{1}{2}\delta - 2} \exp(-\phi_{00}\sigma_{00}^{-1}/2) \times \sigma_{11}^{-\frac{1}{2}\delta - 2} \exp(-\phi_{11}\sigma_{11}^{-1}/2),
\]
i.e., in terms of γ0 = σ00, γ1 = σ11,
\[
\gamma_0 \sim \phi_{00}/\chi^2_{\delta+2}, \qquad \gamma_1 \sim \phi_{11}/\chi^2_{\delta+2}, \quad \text{independently.}
\]

(iii). To condition on β1·0 = 0, we note (Dawid, 1988, Lemma 2) that, under the specified IW(δ; Φ) prior distribution for Σ, we have σ00 ⊥⊥ (k11, β1·0), with joint distribution given by
\[
\sigma_{00} \sim \phi_{00}/\chi^2_{\delta}, \qquad k_{11} \sim \phi_{11\cdot 0}^{-1}\,\chi^2_{\delta+1}, \qquad \beta_{1\cdot 0} \mid k_{11} \sim N\bigl(b_{1\cdot 0},\, (\phi_{00} k_{11})^{-1}\bigr),
\]
where b1·0 := φ01/φ00. Consequently, conditioning on β1·0 = 0, we still have σ00 ~ φ00/χ²_δ, independently of k11; while the conditional distribution of k11 is readily found to be φ11⁻¹ χ²_{δ+2}. Thus, in terms of γ0 = σ00, γ1 = k11⁻¹, we have
\[
\gamma_0 \sim \phi_{00}/\chi^2_{\delta}, \qquad \gamma_1 \sim \phi_{11}/\chi^2_{\delta+2}, \quad \text{independently.}
\]

(iv). Using k01 = −ρ(k00 k11)^{1/2} with (1) to evaluate π(k00, k11, ρ), we now find that, conditional on ρ = 0,
\[
\gamma_0 \sim \phi_{00}/\chi^2_{\delta+1}, \qquad \gamma_1 \sim \phi_{11}/\chi^2_{\delta+1}, \quad \text{independently.} \qquad \square
\]

Example 7. Similar instances of the Borel–Kolmogorov paradox arise in Example 2. Introducing ξ := π_{0|0} = P(I = 0 | J = 0) and η := π_{0|1} = P(I = 0 | J = 1), the joint distribution of (I, J) can be parametrised by (μ, ξ, η), where μ = P(J = 0) = π_{+0} as earlier. Note that, under the constraint of independence between I and J imposed by M0, we have λ = ξ = η = P(I = 0). We now define:
\[
\psi_1 := \xi - \eta, \qquad \psi_2 := \xi/\eta, \qquad \psi_3 := \frac{\xi(1-\eta)}{\eta(1-\xi)} = \frac{\pi_{00}\pi_{11}}{\pi_{10}\pi_{01}}.
\]
These parameters are familiar quantities in epidemiology: ψ1 is the excess risk (for having I = 0, due to having J = 0), ψ2 is the risk ratio, and ψ3 is the odds ratio. The logarithm of ψ3 is the interaction between I and J. The independence constraint of M0 can be expressed in any of the forms:

(i). ψ1 = 0
(ii). ψ2 = 1
(iii). ψ3 = 1.

Again we start with the prior distribution π ~ D(α) in M, and condition on the chosen independence constraint. We find that, conditionally on any of (i), (ii) or (iii), λ and μ are independent, with μ ~ β(α+0, α+1); but λ has different distributions in the three cases:

(i). λ ~ β(α0+ − 1, α1+ − 1),
(ii). λ ~ β(α0+, α1+ − 1),
(iii). λ ~ β(α0+, α1+). □

4.1. Jeffreys conditioning

The fact that the method of conditioning depends on the particular parameter-function τ used to define the submodel M0 suggests the need for an alternative formulation of the method. One possible generalisation proceeds by choosing reference measures, ν and ν0, on M and M0 respectively. Given a prior law Π in M, we can form its density function with respect to ν: g(θ) := dΠ/dν. This is a scalar function, with a value for each P ∈ M (we suppose that there is sufficient smoothness in the problem that we are able to define this function everywhere, rather than merely almost everywhere). We can then construct a law Π0 in M0 by requiring that its density with respect to ν0, g0(φ) := dΠ0/dν0, which is defined for P ∈ M0, should be proportional to g there. That is, using the given measures ν and ν0 as the basis for our construction of densities, we are merely restricting the density on M to the submodel M0, and then renormalising. In the special case that ν is a probability measure, and ν0 is obtained from ν by conditioning on a constraint τ = τ0, this ‘density restriction’ method will give the same answer as simply conditioning Π on τ = τ0. The arbitrariness in the form of the function τ is now replaced by the arbitrariness in the choice of the measures ν and ν0. However, this reinterpretation opens up new possibilities for resolving this ambiguity: by choosing measures ν and ν0 that are in some way intrinsic to the models, and independent of specific ways of parametrising them. One such intrinsic measure, for a general smooth model M, is the Jeffreys measure J = J_M, which, in any smooth parametrisation, has density j = j_M with respect to Lebesgue measure given by j(θ) := {det I(θ)}^{1/2}, where I(θ) is the Fisher information matrix, having (r, s)-entry
\[
\{I(\theta)\}_{rs} := -E_\theta\left[\frac{\partial^2 \log f(X \mid \theta)}{\partial \theta_r\, \partial \theta_s}\right].
\]
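As a small computational check of this definition, the sketch below evaluates I(θ), and hence the Jeffreys density j(θ) = {det I(θ)}^{1/2}, for the independence model M0 of Example 2, using the equivalent outer-product form E_θ[∂ log f/∂θ_r · ∂ log f/∂θ_s] and summing over the four cells of the table; the result agrees with the closed form {λ(1−λ)μ(1−μ)}^{−1/2} that reappears as j0 in Example 9.

```python
# Sketch: Fisher information and Jeffreys density for the independence model M0
# of Example 2, computed from the outer-product form of I(theta) by summing the
# score outer products over the four cells (i, j).
import numpy as np

def jeffreys_density_M0(lam, mu):
    info = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            p = (lam if i == 0 else 1 - lam) * (mu if j == 0 else 1 - mu)
            score = np.array([(1 - i) / lam - i / (1 - lam),     # d log p / d lambda
                              (1 - j) / mu - j / (1 - mu)])      # d log p / d mu
            info += p * np.outer(score, score)
    return np.sqrt(np.linalg.det(info))

lam, mu = 0.3, 0.6
print(jeffreys_density_M0(lam, mu))
print(1 / np.sqrt(lam * (1 - lam) * mu * (1 - mu)))   # closed form, cf. Example 9
```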


The Jeffreys measure is the Riemannian uniform measure when M is equipped with its Riemannian information geometry (Amari et al., 1987; Kass and Vos, 1997), and is thus invariant under reparametrisation. The Jeffreys measure is often used as a (typically improper) prior distribution representing ignorance about a parameter. Our usage here is entirely different: as a base measure for defining the density of a proper prior. We may term a density with respect to the Jeffreys measure invariantised. If π denotes the density of Π with respect to Lebesgue measure, the invariantised density π̃ is given by
\[
\tilde{\pi}(\theta) = \pi(\theta)/j(\theta) = \pi(\theta)\{\det I(\theta)\}^{-1/2}.
\]
When the method of density restriction is applied using the invariantised densities, formed with respect to the respective Jeffreys measures ν = J_M, ν0 = J_{M0}, we shall refer to the procedure as Jeffreys conditioning. Then the Jeffreys conditioning of Π, with invariantised density π̃0 ∝ π̃, has density with respect to Lebesgue measure
\[
\pi_0(\phi) \propto \tilde{\pi}(\phi)\, j_0(\phi) = \tilde{\pi}(\phi)\{\det I_0(\phi)\}^{1/2} = \pi(\phi)\left\{\frac{\det I_0(\phi)}{\det I(\phi)}\right\}^{1/2}.
\]

Example 8. In the bivariate Gaussian case of Example 1, we have, for models M and M0 respectively,
\[
j(\Sigma) \propto (\det \Sigma)^{-3/2}, \qquad j_0(\gamma_0, \gamma_1) \propto \gamma_0^{-1}\gamma_1^{-1}.
\]

The invariantised form of (3) is thus
\[
\tilde{\pi}(\Sigma) \propto (\det \Sigma)^{-\frac{1}{2}(\delta+1)} \exp\{-\mathrm{tr}(\Sigma^{-1}\Phi)/2\},
\]
and restricting to M0 (and identifying γi = σii) yields
\[
\tilde{\pi}_0(\gamma_0, \gamma_1) \propto \gamma_0^{-\frac{1}{2}(\delta+1)} e^{-\frac{1}{2}\phi_{00}/\gamma_0} \times \gamma_1^{-\frac{1}{2}(\delta+1)} e^{-\frac{1}{2}\phi_{11}/\gamma_1}.
\]
The corresponding Lebesgue density is then
\[
\pi_0(\gamma_0, \gamma_1) \propto \gamma_0^{-\frac{1}{2}(\delta+1) - 1} e^{-\frac{1}{2}\phi_{00}/\gamma_0} \times \gamma_1^{-\frac{1}{2}(\delta+1) - 1} e^{-\frac{1}{2}\phi_{11}/\gamma_1},
\]
i.e. γ0 ~ φ00/χ²_{δ+1}, γ1 ~ φ11/χ²_{δ+1}, independently. In this case, Jeffreys conditioning yields the same answer as simple conditioning on the constraint ρ = 0. □

Example 9. Consider again the 2 × 2 table of Example 2. Using the notation of Example 7, the joint probability function of (I, J) is
\[
p(i, j \mid \mu, \xi, \eta) = \mu^{1-j}(1-\mu)^{j}\, \xi^{(1-i)(1-j)}(1-\xi)^{i(1-j)}\, \eta^{(1-i)j}(1-\eta)^{ij}.
\]
The density j of the Jeffreys measure J in M is found to be
\[
j(\mu, \xi, \eta) = \left[\{\mu(1-\mu)\}^{-1} \times \mu\{\xi(1-\xi)\}^{-1} \times (1-\mu)\{\eta(1-\eta)\}^{-1}\right]^{1/2} = \{\xi(1-\xi)\,\eta(1-\eta)\}^{-1/2},
\]

leading to the invariantised prior density π̃ in M:
\[
\tilde{\pi}(\mu, \xi, \eta) \propto \mu^{\alpha_{+0}-1}(1-\mu)^{\alpha_{+1}-1}\, \xi^{\alpha_{00}-1/2}(1-\xi)^{\alpha_{10}-1/2}\, \eta^{\alpha_{01}-1/2}(1-\eta)^{\alpha_{11}-1/2}.
\]
For M0 we similarly obtain
\[
j_0(\lambda, \mu) = \{\lambda(1-\lambda)\,\mu(1-\mu)\}^{-1/2},
\]
so that using Jeffreys conditioning we obtain for the prior density in M0 with respect to Lebesgue measure:
\[
\pi_0(\lambda, \mu) = \tilde{\pi}(\mu, \lambda, \lambda)\, j_0(\lambda, \mu) \propto \mu^{\alpha_{+0}-3/2}(1-\mu)^{\alpha_{+1}-3/2}\, \lambda^{\alpha_{0+}-3/2}(1-\lambda)^{\alpha_{1+}-3/2},
\]
i.e. λ ⊥⊥ μ, and
\[
\mu \sim \beta(\alpha_{+0} - 1/2,\, \alpha_{+1} - 1/2), \qquad \lambda \sim \beta(\alpha_{0+} - 1/2,\, \alpha_{1+} - 1/2).
\]
This is distinct from all the solutions found in Example 7. Note that the ‘effective sample size’ is reduced by 1 compared to conditioning with respect to the odds ratio ψ3. We could regard Jeffreys conditioning as compensating for additional prior information arising from simplification of the model. □

5. CONCLUSIONS

There are many problems in which we might wish to entertain two (or more) distinct parametric models for our data. Within these, we can distinguish two somewhat different scenarios:

(i). We have competing models, only one of which can be true.
(ii). We have coexisting models, describing the same reality but at different levels of detail.

For example, M may describe a linear regression of weight on height and age, while M0 omits the regressor age. Under interpretation (i), M0 is valid only if, once height is known, age is of no further value in predicting weight. Under interpretation (ii), we may feel that, even when further information on age might still be relevant, it could nevertheless be adequate for our purposes to use only height to predict weight. A Bayesian analysis of coexisting regression models may be found in Dawid (1988).

In case (i), since we are unsure about the true model, an appropriate Bayesian approach would be to assign probabilities to the truth of the various models (conditional on whatever data we have), and represent our opinion about reality as a mixture over the models, with these probabilities. This is the method of ‘model averaging’. However, in case (ii) all models are true, and we merely wish to work with the most useful; it would then be more appropriate to undertake ‘model selection’. Although further investigations are required, it is tempting to recommend the method of Kullback-Leibler projection to define model compatibility in the case of coexisting models, and Jeffreys conditioning in the case of competing models.


ACKNOWLEDGEMENTS

Steffen Lauritzen’s work has received support from ESPRIT project 29105 (BaKE).

REFERENCES

Amari, S.-I., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L., and Rao, C. R. (1987). Differential Geometry in Statistical Inference. Hayward, CA: IMS.

Dawid, A. P. (1981). Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika 68, 265–74.

Dawid, A. P. (1988). The infinite regress and its conjugate analysis (with discussion). In Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.). Oxford: University Press, 95–110.

Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21, 1272–317.

Kass, R. E. and Vos, P. W. (1997). Geometrical Foundations of Asymptotic Inference. New York: Wiley.

McCulloch, R. E. and Rossi, P. E. (1992). Bayes factors for non-linear hypotheses and likelihood distributions. Biometrika 79, 663–76.
