Bayesian Nonparametric Goodness of Fit Tests

Surya T. Tokdar¹, Arijit Chakrabarti² and Jayanta K. Ghosh³,⁴

1. Department of Statistical Science, Duke University, Durham, NC, USA
2. Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, Kolkata, India
3. Department of Statistics, Purdue University, West Lafayette, IN, USA
4. Division of Theoretical Statistics and Mathematics, Indian Statistical Institute, Kolkata, India
Abstract. We survey in some detail the rather small literature on Bayesian nonparametric testing. We mostly concentrate on Bayesian testing of goodness of fit to a parametric null with nonparametric alternatives. We also survey briefly some related unpublished material. We discuss both methodology and posterior consistency.
1 Introduction
In goodness of fit problems we test a parametric null against a nonparametric alternative. For example, we may wish to test the accuracy of an algorithm that is supposed to generate standard normal variates, from n independent observations X_{1:n} = (X_1, X_2, ..., X_n) produced by it. We will test for goodness of fit to the standard normal against a rather general class of alternatives, which will typically be nonparametric. More generally, it is common to test the normality assumption of a given data set before processing it further. In this case one would test the null hypothesis that the true density is N(µ, σ²) against a rich nonparametric class of densities. The Bayesian will either put a prior on the null and the alternative and calculate the Bayes factor, or put an overarching prior on both null and alternative and compute the posterior probability of the null. The Bayesian literature on these testing problems is still rather meagre, unlike the case of nonparametric estimation, on which elegant theory, algorithms and applications exist; see the review papers by Müller and Quintana (2004) and Choudhuri, Ghosal and Roy (2005); more references are provided in the following sections. In this paper we survey what has been done by way of Bayesian nonparametric testing of goodness of fit problems involving both simple and composite parametric nulls and nonparametric classes of alternatives. We take up the papers in roughly the same sequence in which they were published. We also include some unpublished material being written for submission to a journal.
Our survey begins with a review of Rubin and Sethuraman (1965) in Section 2. Rubin and Sethuraman (1965) use neither a Bayes factor nor the posterior probability of the null. Instead they start with a classical goodness of fit test and evaluate it asymptotically from a Bayesian point of view. We next consider in Section 3 a paper by Dass and Lee (2004) presenting a theoretical treatment of posterior consistency for testing a simple null hypothesis H_0 : f = f_0, where f_0 is a fully specified density like N(0, 1) and the alternative is a nonparametric class of densities. Invoking Doob's theorem on consistency (see for example Ghosh and Ramamoorthi 2003), Dass and Lee (2004) prove that if the prior assigns a positive prior probability to the null and the null is actually true, then the posterior probability of the null will tend to one, and hence the Bayesian will accept the null with probability tending to one as the sample size n tends to infinity. Dass and Lee (2004) also present consistency theorems for the alternative. The use of Doob's theorem reduces a hard problem to one where we have an answer almost without any technical work. If the null is composite, as in the case of H_0 : f = N(µ, σ²), −∞ < µ < ∞, σ² > 0, Doob's theorem cannot be used to settle the question of consistency under the null. This case has to be treated by other methods, due to Ghosal, Lember and van der Vaart (2008) and McVinish, Rousseau and Mengersen (2009). We discuss these in Section 4.

Possibly the most important paper for our survey is Berger and Guglielmi (2001), who consider a composite null as described above and embed it into an alternative model constructed with a Polya tree prior. They argue how such an embedding lets the null parameters carry their identity over to the alternative, and discuss the choice of objective prior distributions for these parameters. They show how to calculate the Bayes factor efficiently. Most importantly, they also examine in detail how the test accepts and rejects the null for different data sets, and then discuss carefully whether such a decision is indeed consistent with our intuitive notions of normality, or the lack of it, for a finite data set. We survey the test and the above facts in Section 5. We also include in this section a brief discussion of earlier attempts at Bayesian nonparametric testing based on Dirichlet processes and a recent, yet unpublished work that uses Dirichlet mixture priors.

We end this introduction by drawing attention to a closely related problem of Bayesian cluster analysis via Dirichlet mixtures, as developed in Bhattacharya, Ghosh and Samanta (2009). We may treat each clustering of parameters as related to a partitioning of the X_i's in such a way that all X_i's in a partition are iid N(µ_i, σ_i²). In particular, the case of a single cluster containing all X_i's would correspond to the null hypothesis. Other clusterings will correspond to other models that are part of the alternative.
2 An early application of Bayesian ideas in goodness of fit problems
One of the earliest applications of Bayesian ideas to a nonparametric goodness of fit problem is due to Rubin and Sethuraman (1965). They do not use a nonparametric prior; in fact their paper predates Ferguson's (1973) seminal contribution by nearly a decade. We include it because it is a pioneering Bayesian contribution to a goodness of fit problem with a nonparametric flavor. We avoid the involved notation of the original paper and simply summarize some relevant facts in our own notation.

Suppose X_1, ..., X_n are iid with a normal distribution. Rubin and Sethuraman (1965, Section 3) test H_0 : the true density is N(0, 1) against H_1 : the true density is N(µ, σ²) where (µ, σ²) is arbitrary, and assume that the critical region is given by large values of the Kolmogorov-Smirnov test statistic. They determine the cut-off point by varying critical regions of this kind and finding the one which minimizes the Bayes risk with respect to their assumed product of loss function and prior probability of (µ, σ²) under the alternative. The "optimal" threshold is of the form $a\sqrt{(\log n)/n}$, where a (possibly dependent on n) is essentially determined by the product of loss function and prior. They also determine the region in the parameter space for high acceptance of the null and observe that when µ ≈ 0 and σ ≈ 1, this region is approximately the region where the Kolmogorov-Smirnov distance between N(0, 1) and N(µ, σ²) is less than the cut-off point. It is unclear if this test has any other good nonparametric property. There is an interesting discussion in the last paragraph of Section 2 of Rubin and Sethuraman (1965) about the effect of an "incorrect" choice of the constant a, instead of the optimal one, for the critical region, which seems to be applicable beyond the example treated in that section. They show that the type I error can be highly sensitive to even a small deviation from the optimal a, while the type II error is not.
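As an illustration, the following is a minimal sketch of a test of this form, with the loss- and prior-dependent constant a treated as a free input (its value below is arbitrary, not the Rubin-Sethuraman optimum):

```python
import numpy as np
from scipy.stats import kstest

def ks_test_with_log_threshold(x, a=1.0):
    """Reject H0: X_i ~ N(0,1) when the Kolmogorov-Smirnov statistic
    exceeds a threshold of the form a * sqrt(log(n)/n). The constant
    `a` stands in for the loss/prior-dependent constant of Rubin and
    Sethuraman (1965); a = 1.0 here is purely illustrative."""
    n = len(x)
    d_n = kstest(x, "norm").statistic      # sup-distance to the N(0,1) cdf
    threshold = a * np.sqrt(np.log(n) / n)
    return d_n > threshold, d_n, threshold

rng = np.random.default_rng(0)
print(ks_test_with_log_threshold(rng.normal(size=1000)))            # null true
print(ks_test_with_log_threshold(rng.normal(0.3, 1.0, size=1000)))  # null false
```

Varying a in this sketch also makes their sensitivity observation easy to reproduce empirically.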
3 Testing a point null versus non-parametric alternatives
In this section we focus on the issue of consistency in Bayesian testing of a point null hypothesis versus a non-parametric alternative. Operationally, a Bayes test is performed through the evaluation of a Bayes factor, and consistency (or lack of it) of the test is determined by the behavior of this factor as the sample size grows.
Our discussion is based on Dass and Lee (2004) and we will usually adhere to their notation. We start with the basic definitions before discussing the main results of their paper. Consider a complete separable metric space X with the corresponding Borel sigma field A. Let µ be a σ-finite measure on (X, A) and F the space of all probability densities with respect to µ with support X. Let P_f be the probability measure having density f. Up to an equivalence class (whereby two densities f and g belong to the same equivalence class if and only if f = g a.e. µ), the correspondence between f and P_f is unique. In what follows X^∞ and P_f^∞ will denote the usual infinite products of X and P_f respectively.

Let X be a random variable taking values in X with a distribution having density f ∈ F. We are interested in testing H_0 : f = f_0 versus H_1 : f ≠ f_0, where f_0 is completely specified. In the Bayesian approach to this testing problem, one first puts prior probabilities p_0 > 0 and p_1 = 1 − p_0 on H_0 and H_1 being true, and also puts a non-parametric prior π_1 on the space H_1. This amounts to defining an overall prior π*(f) = p_0 I_{f_0}(f) + (1 − p_0) π_1(f) on H_0 ∪ H_1. The Bayes factor based on the sample X_{1:n} = (X_1, X_2, ..., X_n), where the X_i's are independent copies of X, is defined as

$$BF_{01}(X_{1:n}) = \frac{\prod_{i=1}^{n} f_0(x_i)}{\int_{H_1} \prod_{i=1}^{n} f(x_i)\, \pi_1(df)}.$$

With the usual choice of p_0 = 1/2, the Bayes factor exactly equals the posterior odds ratio of H_0 with respect to H_1, given by π*(H_0 | X_{1:n}) / π*(H_1 | X_{1:n}). Thus the Bayes factor is a measure of the relative odds of H_0 in comparison to H_1 in the light of the observed data. As the sample size grows, ideally the Bayes factor should be such that either H_0 or H_1 is favored very strongly (through the posterior odds ratio), depending on the true data generating density. Thus the Bayes factor is said to be consistent if

$$\lim_{n\to\infty} BF_{01}(X_{1:n}) = \infty \ \text{ a.s. } P_{f_0}^{\infty}, \qquad \text{and} \qquad \lim_{n\to\infty} BF_{01}(X_{1:n}) = 0 \ \text{ a.s. } P_f^{\infty} \text{ for any } f \neq f_0.$$
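The numerator of BF_{01} is available in closed form, while the denominator is an integral over H_1 that typically needs numerical work. Below is a minimal Monte Carlo sketch under assumed, purely illustrative choices: f_0 = N(0, 1) and a toy π_1 that draws two-component normal location mixtures; it is not the prior of any paper surveyed here.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_bf01(x, n_draws=4000, seed=1):
    """Monte Carlo estimate of log BF01 for H0: f = N(0,1). The
    alternative prior pi_1 here is a toy choice -- random two-component
    normal location mixtures -- used only to make the integral concrete."""
    rng = np.random.default_rng(seed)
    log_num = norm.logpdf(x).sum()                   # log prod_i f0(x_i)
    log_liks = np.empty(n_draws)
    for j in range(n_draws):                         # draws f ~ pi_1
        w = rng.beta(1.0, 1.0)
        m = rng.normal(0.0, 2.0, size=2)
        f_x = w * norm.pdf(x, m[0]) + (1.0 - w) * norm.pdf(x, m[1])
        log_liks[j] = np.log(f_x).sum()
    log_den = logsumexp(log_liks) - np.log(n_draws)  # log of the MC average
    return log_num - log_den                         # log BF01

x = np.random.default_rng(2).normal(size=100)        # data generated under H0
print(log_bf01(x))                                   # typically positive here
```

Working on the log scale with logsumexp avoids the numerical underflow that the raw product of densities would cause for even moderate n.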
Trivially, consistency of the Bayes factor implies and is implied by posterior consistency of the testing procedure. We are now in a position to state the main results of Dass and Lee (2004) about consistency of Bayes factors.

Theorem 1. lim_{n→∞} BF_{01}(X_{1:n}) = ∞ a.s. P_{f_0}^∞.

Theorem 2. Let Θ = {f ∈ H_1 : BF_{01}(X_{1:n}) → 0 a.s. P_f^∞}. Then π_1(Θ) = 1.

Before stating the last result we need a definition. Given f ∈ F, define K_ε(f) = {g ∈ F : K(f, g) < ε} for ε > 0, where K(f, g) is the Kullback-Leibler divergence between f and g. Say that f is in the Kullback-Leibler support of a prior on F if the prior puts positive probability on the neighborhood K_ε(f) of f, for each ε > 0.

Theorem 3. Suppose f ∈ H_1 and f is in the Kullback-Leibler support of π_1. Then lim_{n→∞} BF_{01}(X_{1:n}) = 0 a.s. P_f^∞.

Remark. Theorem 1 establishes consistency of the Bayes factor under the null for an arbitrary nonparametric prior on H_1, while Theorem 2 says that consistency holds under the alternative for a class of densities that has π_1 probability 1. But Theorem 2 is not of much practical use, since given a non-null density one cannot be sure whether consistency holds when the true data come from that density. Theorem 3 specifies an explicit condition that takes care of this lacuna and can tell us definitively whether the Bayes factor is consistent for any given non-null density. For certain non-parametric priors π_1, sufficient conditions ensuring that a density indeed belongs to the Kullback-Leibler support are available in the literature (Ghosal, Ghosh and Samanta 1995, Ghosal, Ghosh and Ramamoorthi 1999a, 1999b, Barron, Schervish and Wasserman 1999, Petrone and Wasserman 2002, Tokdar 2006, Tokdar and Ghosh 2007, Choi and Ramamoorthi 2008, Wu and Ghosal 2008, 2009). See also the next section for a general sketch of a proof of this result.

We will not give detailed proofs of the above results, but just sketch the main idea of the proof of Theorem 1. This is based on a clever application of Doob's theorem (Ghosh and Ramamoorthi, 2003, pp 22-24) on posterior consistency. Doob's theorem says that the posterior is consistent for almost all f under the prior π* on H_0 ∪ H_1. Since the prior probability of the (simple) null is positive, this implies that, conditionally on the null being true, the posterior probability of the null tends to 1 for almost all sequences under the null density f_0. This, in turn, implies that BF_{01}(X_{1:n}) → ∞ a.s. P_{f_0}^∞, since

$$BF_{01}(X_{1:n}) = \frac{\pi^*(\{f_0\} \mid \tilde{x}_n)}{1 - \pi^*(\{f_0\} \mid \tilde{x}_n)}.$$
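To make the Kullback-Leibler neighborhoods of Theorem 3 concrete, the divergence part can at least be checked numerically. A minimal sketch follows; note that verifying g ∈ K_ε(f_0) for a particular g says nothing about the prior mass π_1 puts on such neighborhoods, which is the harder part of the support condition.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_divergence(logf, logg, lo=-12.0, hi=12.0):
    """K(f, g) = int f log(f/g), by quadrature on a truncating range
    chosen to cover the effective support of f."""
    integrand = lambda x: np.exp(logf(x)) * (logf(x) - logg(x))
    value, _ = quad(integrand, lo, hi)
    return value

# Example: is g = N(0.1, 1.1^2) within the eps-KL-neighborhood of f0 = N(0,1)?
logf0 = lambda x: norm.logpdf(x)
logg = lambda x: norm.logpdf(x, loc=0.1, scale=1.1)
eps = 0.05
print(kl_divergence(logf0, logg), kl_divergence(logf0, logg) < eps)
```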
4 Posterior Consistency for a Composite Goodness of Fit Test
When the null hypothesis is composite, i.e., H_0 = {f ∈ F_0} where F_0 is not a singleton, the study of posterior consistency gets a lot more involved than the elegant treatment of Dass and Lee (2004) discussed in the previous section. We shall denote a Bayesian testing procedure for the composite case by the triplet (p_0, π_0, π_1) with the understanding: Pr(H_0) = p_0 = 1 − Pr(H_1), (f | H_0) ∼ π_0 and (f | H_1) ∼ π_1. We call (p_0, π_0, π_1) consistent if Pr(H_0 | X_{1:n}) → 1 when the null is true, and Pr(H_0 | X_{1:n}) → 0 otherwise. As noted in the previous section, consistency of the testing procedure is equivalent to requiring that the Bayes factor

$$BF_{01}(X_{1:n}) = \frac{\int \prod_{i=1}^{n} f(X_i)\, d\pi_0(f)}{\int \prod_{i=1}^{n} f(X_i)\, d\pi_1(f)}$$
of the null hypothesis to the alternative converges to infinity when the null is true, and to zero otherwise.

For testing a parametric null against non-parametric alternatives, it is relatively easy to establish BF_{01}(X_{1:n}) → 0 when the X_i's arise from a density f which is not a member of the null parametric family. A common requirement is that density estimation based on π_1 is consistent at the true f in some topology whereas the closure of F_0, determined by the same topology, leaves out f. For example, if f belongs to the Kullback-Leibler support of π_1, then density estimation based on π_1 is weakly consistent at f, by Schwartz's theorem (see Schwartz 1965 or Ch 4.4 of Ghosh and Ramamoorthi 2003). For any such f that is outside the weak closure of F_0, BF_{01}(X_{1:n}) → 0 asymptotically. This result can be formally proved by arguing that the composite prior π* = 0.5π_0 + 0.5π_1 has f in its Kullback-Leibler support and hence F_0, lying entirely outside a weak neighborhood of f, receives vanishing mass from the posterior distribution π*(· | X_{1:n}). Consequently, BF_{01}(X_{1:n}) = π*(F_0 | X_{1:n}) / [1 − π*(F_0 | X_{1:n})] → 0.

It is much more difficult to prove BF_{01}(X_{1:n}) → ∞ when f ∈ F_0. This is because the usual choices of π_1 contain F_0 in their support and can recover an f ∈ F_0 from data nearly as efficiently as the parametric model itself; see, for example, the almost-parametric rate of convergence of Dirichlet mixtures of normal priors discussed in Ghosal and van der Vaart (2001). The special case of a simple null hypothesis H_0 : f = f_0, as discussed in Section 3, eschews this difficulty because no estimation is required for the null model. For the composite null case, one needs a careful comparison of how π_1 concentrates around an f ∈ F_0 relative to π_0. Both Ghosal, Lember and van der Vaart (2008) and McVinish, Rousseau and Mengersen (2009) discuss formal ways to make this comparison based on neighborhoods of f ∈ F_0 shrinking at the same rate at which π_1 can recover an f in a density estimation setting. Ghosal, Lember and van der Vaart (2008) discuss goodness of fit tests as a special case of the more general problem of adaptive selection of nested models. For this review, we summarize the more direct treatment of McVinish, Rousseau and Mengersen (2009).

To fix notation, let F_0 = {f_θ : θ ∈ Θ} where Θ is a subset of d-dimensional Euclidean space. Fix an f* = f_{θ*} for some θ* ∈ Θ and let P* denote the product measure of X_{1:∞} under this density. Assume that for some universal constants c > 0, C > 0 the following conditions hold.

A1. Density estimation based on π_1 is consistent at f_{θ*} with rate ε_n (see Ghosal, Ghosh and van der Vaart 2000 and Shen and Wasserman 2001). That is, there exists a metric d(·, ·) on the densities and positive numbers ε_n ↓ 0 such that π_1({f : d(f, f_{θ*}) < ε_n} | X_{1:n}) → 1 in P* probability.

A2. π_0(K_n := {θ : K(f*, f_θ) < cn^{-1}, V(f*, f_θ) < cn^{-1}}) ≥ Cn^{-d/2}, where for any two densities p and q, K(p, q) = ∫ p log(p/q) and V(p, q) = ∫ p (log(p/q))². One could replace K_n with a simpler definition: K_n = {θ : ∫ f* (log(f*/f_θ))_+ < cn^{-1}}, where x_+ = max(x, 0).
A3. π_1(A_n := {f : d(f, f_{θ*}) < ε_n}) = o(n^{-d/2}).

A2 is a natural support condition which holds for many standard parametric models. Conditions A1 and A3 maintain a fine balance with respect to the rate ε_n: slowing down this rate favors A1 but may lead to a violation of A3, while speeding it up has the opposite effect. The main result of McVinish, Rousseau and Mengersen (2009, Theorem 1) and the elegant proof offered by these authors are reproduced below.

Theorem 4.1. [BF_{01}(X_{1:n})]^{-1} → 0 in P* probability.

Proof. Express the inverse of the Bayes factor as

$$[BF_{01}(X_{1:n})]^{-1} = \frac{\int_{A_n} \prod_{i=1}^{n} \frac{f(X_i)}{f^*(X_i)}\, \pi_1(df)}{\pi_1(A_n \mid X_{1:n}) \int_{\Theta} \prod_{i=1}^{n} \frac{f_\theta(X_i)}{f^*(X_i)}\, \pi_0(d\theta)} \le \frac{1}{\pi_1(A_n \mid X_{1:n})}\, \frac{n^{d/2} \int_{A_n} \prod_{i=1}^{n} \frac{f(X_i)}{f^*(X_i)}\, \pi_1(df)}{n^{d/2} \int_{K_n} \prod_{i=1}^{n} \frac{f_\theta(X_i)}{f^*(X_i)}\, \pi_0(d\theta)}.$$

From A1, [π_1(A_n | X_{1:n})]^{-1} = O_{P*}(1). Define L_n(θ) = (1/n) Σ_{i=1}^{n} log[f*(X_i)/f_θ(X_i)], θ ∈ Θ, and take π_{0,K_n} to be the restriction of π_0 to K_n. Then for small δ > 0,

$$\begin{aligned} P^*\left( n^{d/2} \int_{K_n} \prod_{i=1}^{n} \frac{f_\theta(X_i)}{f^*(X_i)}\, \pi_0(d\theta) \le \delta \right) &\le P^*\left( \int L_n(\theta)\, \pi_{0,K_n}(d\theta) \ge \frac{1}{n} \log \frac{\pi_0(K_n)}{\delta n^{-d/2}} \right) \\ &\le P^*\left( \int [L_n(\theta) - K(f^*, f_\theta)]\, \pi_{0,K_n}(d\theta) \ge \frac{1}{n} \log \frac{\pi_0(K_n)}{\delta n^{-d/2}} - \frac{c}{n} \right) \\ &\le \frac{n \int V(f^*, f_\theta)\, \pi_{0,K_n}(d\theta)}{(\log \pi_0(K_n) + (d/2) \log n - \log \delta - c)^2} \\ &\le \frac{c}{(\log(C/\delta) - c)^2}, \end{aligned}$$

where the first inequality follows from Jensen's inequality, the third from Chebyshev's inequality and the second and the fourth from A2. Therefore, $[n^{d/2} \int_{K_n} \prod_{i=1}^{n} \frac{f_\theta(X_i)}{f^*(X_i)}\, \pi_0(d\theta)]^{-1} = O_{P^*}(1)$. An application of Markov's inequality shows

$$P^*\left( n^{d/2} \int_{A_n} \prod_{i=1}^{n} \frac{f(X_i)}{f^*(X_i)}\, \pi_1(df) > \delta \right) \le \delta^{-1} n^{d/2} \pi_1(A_n)$$

and hence, by A3, $n^{d/2} \int_{A_n} \prod_{i=1}^{n} \frac{f(X_i)}{f^*(X_i)}\, \pi_1(df) = o_{P^*}(1)$. Combining these we get $BF_{01}(X_{1:n})^{-1} = o_{P^*}(1)$.
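Condition A2 can be verified in closed form for simple families. For the N(θ, 1) location family (d = 1) with f* = N(0, 1), one has K(f*, f_θ) = θ²/2 and V(f*, f_θ) = θ² + θ⁴/4, so the V constraint binds and K_n ≈ {|θ| < √(c/n)} for large n; with π_0 = N(0, 1), the mass π_0(K_n) then scales as n^{-1/2}, exactly as A2 demands. A small numerical confirmation of this scaling (the choice c = 1 is arbitrary):

```python
import numpy as np
from scipy.stats import norm

# For f* = N(0,1) and f_theta = N(theta,1): K(f*, f_theta) = theta^2 / 2
# and V(f*, f_theta) = theta^2 + theta^4/4, so for large n the set K_n is
# approximately (-sqrt(c/n), sqrt(c/n)). With pi_0 = N(0,1), condition A2
# asks that pi_0(K_n) >= C * n^{-1/2}.
c = 1.0
for n in [10**2, 10**4, 10**6]:
    r = np.sqrt(c / n)                     # approximate half-width of K_n
    mass = norm.cdf(r) - norm.cdf(-r)      # pi_0(K_n)
    print(n, mass, mass * np.sqrt(n))      # last column stabilizes near C
```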
A number of recent papers give sharp rates of convergence for popular nonparametric priors on densities, including the Dirichlet mixture priors (Ghosal and van der Vaart 2007) and the logistic Gaussian process priors (van der Vaart and van Zanten 2008). However, a rigorous demonstration of A3 is currently available for only a handful of cases, such as the Dirichlet mixture of Bernstein polynomials and log spline densities (Ghosal, Lember and van der Vaart 2008) and mixtures of triangular densities (McVinish, Rousseau and Mengersen 2009). It is not yet clear whether A3 is a natural condition that will be automatically satisfied by many nonparametric priors, as it requires the prior distribution not to sit too tightly around elements of F_0. Some insight into A3 may be gained by considering a_1 n^{-1/2} ≤ ε_n ≤ a_2 n^{-1/2+β} for some β ∈ (0, 1/2), a_1, a_2 > 0. In this case one can replace A3 with

A3′. π_1({f : d(f, f*) < ε}) = O(ε^D) for some D > d/(1/2 − β).

The above gives a sharp lower bound on the effective dimensionality of the nonparametric prior π_1 around f*. Deeper investigations of the popular choices of π_1 will reveal whether such conditions hold, possibly for well chosen values of hyper-parameters. McVinish, Rousseau and Mengersen (2009), however, argue the necessity of A1-A3 via an example where a violation of these conditions leads to inconsistency of the testing procedure. If it turns out that the use of popular nonparametric priors like Dirichlet mixtures of normals may lead to inconsistency in goodness of fit tests, one should explore the possibility of introducing an indifference zone to separate the null and the alternative. This would make the consistency problem easier to resolve, but specification of the indifference zone cannot be done without some subjective input from the user.
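The exponent D in A3′ can be probed by simulation: draw densities from π_1, compute their distance to f*, and read the decay of the small-ball probability off the ratios of empirical masses. The sketch below does this with a toy two-parameter family standing in for π_1, purely to illustrate the mechanics (the mass should roughly quarter when ε is halved, matching the parameter dimension d = 2), using the L1 distance on a grid as d(·, ·):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
grid = np.linspace(-8.0, 8.0, 400)
dx = grid[1] - grid[0]
f_star = norm.pdf(grid)                           # f* = N(0, 1)

# Toy stand-in for pi_1: f = N(mu, sigma^2) with mu ~ N(0,1) and
# log(sigma) ~ N(0, 0.25); estimate pi_1({f : ||f - f*||_1 < eps}).
dists = np.empty(50_000)
for j in range(dists.size):
    mu, sigma = rng.normal(), np.exp(rng.normal(0.0, 0.5))
    dists[j] = np.abs(norm.pdf(grid, mu, sigma) - f_star).sum() * dx
for eps in [0.4, 0.2, 0.1, 0.05]:
    print(eps, (dists < eps).mean())   # halving eps roughly quarters the mass
```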
5 Bayesian Goodness of Fit Tests
The earliest papers, for example Florens, Richard and Rolin (1996) and Carota and Parmigiani (1996), use a mixture of Dirichlet processes prior (Antoniak 1974, Ghosh and Ramamoorthi 2003, p 113) on the alternative hypothesis. Letting θ denote the parameter underlying the null model, these authors use the following hierarchical formulation of the alternative: the alternative distribution given θ follows a Dirichlet process (DP) prior with precision constant α_θ and base measure ᾱ_θ, and θ is modeled by a parametric prior that may match the prior used on the null model. As the authors of these papers and also Berger and Guglielmi (2001) point out, this approach is flawed: the prior on the alternative sits on discrete distributions, whereas in goodness of fit problems we want it to sit on densities. Both papers run into serious difficulties caused by this. A more promising paper, also mentioned and discussed in more detail by Berger and Guglielmi (2001), is Verdinelli and Wasserman (1998), which assigns a mixture of Gaussian processes as the prior on the alternative. The prior is constructed quite ingeniously and allows calculation of the Bayes factor.
Why are the priors on the alternatives mixtures? This is done so that a composite null like N(µ, σ²) can have a natural embedding, i.e., a nesting in some sense, in the alternative space. For example, Florens, Richard and Rolin (1996) ensure that for every θ, the conditional predictive distribution of a single observation is the same under both hypotheses. In the same vein, but avoiding a prior like the mixture of Dirichlet processes or the ingenious construction used by Verdinelli and Wasserman (1998), Berger and Guglielmi (2001) resort to a mixture of Polya tree priors (Lavine 1992, 1994, Mauldin, Sudderth and Williams 1992). Suppose the parametric null is θ ∈ Θ and the distribution of X corresponding to θ is µ_θ. Berger and Guglielmi (2001) require, for all θ ∈ Θ,

$$E\, P_\theta = \mu_\theta,$$
where P_θ is the random distribution of X under the nonparametric alternative and θ on the left is a hyperparameter of the Polya tree prior under the alternative.

The Polya tree priors have many advantages. A Polya tree process defines a random measure by assigning random conditional probabilities to the sets of a nested dyadic partition of infinite depth. For each dyadic set, its random conditional probability given the parent set in the previous partition is assigned a beta distribution with two shape parameters, namely the α's, which function like the two parameters of the Dirichlet process: they specify the mean and the precision. Unlike the Dirichlet process, the use of infinitely many precision constants offers greater control over the smoothness of the random measure defined by a Polya tree process. Indeed, by choosing a common value r_m for the precision constants at depth m of the partition, such that $\sum_m r_m^{-1/2} < \infty$, one ensures that the Polya tree sits on densities and obtains posterior consistency for estimating a large class of densities (Ghosh and Ramamoorthi 2003, pp 186, 190, Walker 2003, 2004; see also an earlier weaker result in Barron, Schervish and Wasserman 1999).
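A minimal simulation sketch of one such prior draw, assuming the canonical dyadic partition of [0, 1], Beta(r_m, r_m) splitting variables, and the illustrative choice r_m = h·m⁴ (which satisfies the summability condition above); the infinite tree is truncated at a finite depth:

```python
import numpy as np

def polya_tree_draw(depth=10, h=1.0, rng=np.random.default_rng(4)):
    """One draw, truncated at `depth`, from a Polya tree on [0, 1] with
    the canonical dyadic partition and Beta(r_m, r_m) splitting variables.
    The choice r_m = h * m**4 makes sum_m r_m^{-1/2} finite, as required
    in the text; both h and the truncation depth are illustrative."""
    mass = np.ones(1)
    for m in range(1, depth + 1):
        r = h * m**4
        v = rng.beta(r, r, size=mass.size)   # P(left child | parent set)
        mass = np.column_stack((mass * v, mass * (1 - v))).ravel()
    return mass * 2**depth                   # piecewise-constant density

f = polya_tree_draw()
print(f.size, f.sum() / 2**10)               # 1024 leaves; integrates to 1
```

Smaller h, or more slowly growing r_m, produces visibly rougher draws, which is the control over smoothness alluded to above; the draws remain piecewise constant, with jumps at the dyadic boundaries noted at the end of this section.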
Berger and Guglielmi (2001) discuss two constructions of a Polya tree prior, one where only the partition depends on θ and another where only the beta shape parameters depend on θ. Either choice embeds as prior mean the conditional predictive density under the null model with parameter θ. Berger and Guglielmi (2001) use the second approach in their examples, as a fixed partition offers significant computational advantages, as detailed in their Section 5. This flexible embedding of the null model within the alternative appears to be a key motivation for Berger and Guglielmi (2001) to use mixtures of Polya tree priors for the alternative model. We quote:

...as we were striving for a nonsubjective approach, we wished to utilize (typically improper) noninformative prior distributions for θ in the models. Although θ occurs under both H0 and H1, it could have very different meanings under each and could, therefore, require different noninformative priors. We felt that use of the mixture Polya tree process could solve this problem, in the sense that we could use results from Berger and Pericchi (1996) and Berger, Pericchi, and Varshavsky (1998) to show that the usual noninformative priors for θ are appropriately calibrated between H0 and H1 when the alternative is a suitable mixture of Polya tree processes.

We refer the reader to Berger and Guglielmi (2001) for the details of their subtle justification for using the same noninformative prior for θ under both the null and the alternative. Berger and Guglielmi (2001) also introduce an additional scaling factor h for r_m, free of the partition depth m. They use different values of h to assess the robustness of the Bayes factor with respect to h, and they minimize over h to make a conservative choice of the Bayes factor in their Table 1. This is another innovative contribution of Berger and Guglielmi (2001), and the reader should go back to the original paper to appreciate its full significance.

Berger and Guglielmi (2001) examine the results of the test in three examples. We consider only the first one, with data being 100 stress rupture life times of Kevlar vessels, taken from Andrews and Herzberg (1985, p 183). On the log scale the data are claimed to be N(µ, σ²) under the null. Table 1 of Berger and Guglielmi (2001) gives the minimum of the Bayes factor with respect to the constant h for the two different choices of the Polya tree mixture mentioned above. The values of the minimized Bayes factor are very close to each other (in the range 0.0072 to 0.00180) and to the value of the Bayes factor based on the prior of Verdinelli and Wasserman (1998). This is a pleasant fact that lends support to all these Bayes factor tests and shows that the introduction of h, with the minimized Bayes factor reported, seems to ameliorate the problem of the proper choice of r_m.

All these nice properties of the Polya tree priors notwithstanding, the random densities chosen by the prior have discontinuities on the countable dense set formed by the boundaries of the sets in the partition. This is at least one reason why there has been a lot of interest in priors sitting on smooth densities, like the logistic Gaussian process priors, vide Tokdar and Ghosh (2007), Tokdar (2007), Lenk (1988, 1991), and the Dirichlet process mixture of normals, vide Lo (1984), Ferguson (1983), Escobar and West (1995), Müller, Erkanli and West (1996), MacEachern and Müller (1998, 2000) and MacEachern (1998). Recently, for testing H_0 : f = N(µ, σ²) for some µ, σ², Tokdar (2009) has proposed the following alternative model: H_1 : f = ∫ N(φ, ρσ²) P(dφ, dρ), where the mixing distribution P is modeled as (P | µ, σ²) ∼ DP(α, G_0(· | µ, σ²)) with G_0(φ, ρ | µ, σ²) = N(φ | µ, (1 − ρ)σ²) Be(ρ | ω_1, ω_2). This alternative ensures E[f | µ, σ², H_1] = N(µ, σ²) and thus provides an embedding similar to the one in Berger and Guglielmi (2001). Tokdar (2009) also discusses efficient computation of the Bayes factor via a variation of the sequential importance sampling algorithm of Liu (1996).
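A brief sketch of a draw from this alternative, using a truncated stick-breaking representation of the DP and a Monte Carlo check of the embedding E[f | µ, σ², H_1] = N(µ, σ²); the hyperparameter values (α = 1, ω_1 = ω_2 = 2, truncation at K = 50 atoms) are illustrative choices, not taken from Tokdar (2009):

```python
import numpy as np
from scipy.stats import norm

def draw_alternative_density(grid, mu=0.0, sigma=1.0, alpha=1.0,
                             om1=2.0, om2=2.0, K=50,
                             rng=np.random.default_rng(6)):
    """One draw of f under H1: f = int N(phi, rho*sigma^2) P(dphi, drho),
    with P ~ DP(alpha, G0) truncated to K stick-breaking atoms and
    G0(phi, rho) = N(phi | mu, (1-rho)*sigma^2) Be(rho | om1, om2)."""
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # stick weights
    rho = rng.beta(om1, om2, size=K)                           # rho_k from G0
    phi = rng.normal(mu, np.sqrt(1.0 - rho) * sigma)           # phi_k from G0
    comps = norm.pdf(grid[None, :], phi[:, None], np.sqrt(rho)[:, None] * sigma)
    return (w[:, None] * comps).sum(axis=0)

grid = np.linspace(-4.0, 4.0, 200)
avg = np.mean([draw_alternative_density(grid) for _ in range(2000)], axis=0)
print(np.abs(avg - norm.pdf(grid)).max())   # near 0 up to Monte Carlo error
```

The check works because a draw (φ, ρ) from G_0 yields X | (φ, ρ) ∼ N(φ, ρσ²) with φ ∼ N(µ, (1 − ρ)σ²), so marginally X ∼ N(µ, (1 − ρ)σ² + ρσ²) = N(µ, σ²).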
References

[1] Andrews, D. F. and Herzberg, A. M. (1985). Data. New York: Springer-Verlag.
[2] Antoniak, C. (1974). Mixtures of Dirichlet processes with application to Bayesian non-parametric problems. The Annals of Statistics, 2, 1152-1174.

[3] Barron, A., Schervish, M. J. and Wasserman, L. (1999). The Consistency of Posterior Distributions in Nonparametric Problems. The Annals of Statistics, 27, 536-561.

[4] Berger, J. O. and Guglielmi, A. (2001). Bayesian and Conditional Frequentist Testing of a Parametric Model against Nonparametric Alternatives. Journal of the American Statistical Association, 96, 174-184.

[5] Berger, J. O. and Pericchi, L. R. (1996). On the Justification of Default and Intrinsic Bayes Factors. In Modelling and Prediction, eds. J. C. Lee, W. Johnson, and A. Zellner, New York: Springer-Verlag, pp. 276-293.

[6] Berger, J. O., Pericchi, L. R. and Varshavsky, J. A. (1998). Bayes Factors and Marginal Distributions in Invariant Situations. Sankhyā, Ser. A, 60, 307-321.

[7] Bhattacharya, S., Ghosh, J. K. and Samanta, T. (2009). On Bayesian "Central Clustering": Application to Landscape Classification of Western Ghats. Preprint.

[8] Carota, C. and Parmigiani, G. (1996). On Bayes Factors for Nonparametric Alternatives. In Bayesian Statistics (Vol. 5), eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 507-511.

[9] Choi, T. and Ramamoorthi, R. V. (2008). Remarks on consistency of posterior distributions. In IMS Collections 3: Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, eds. S. Ghosal and B. Clarke, pp. 170-186.

[10] Choudhuri, N., Ghosal, S. and Roy, A. (2005). Bayesian Models for Function Estimation. In Handbook of Statistics, Vol. 25, ed. D. Dey, pp. 373-414.

[11] Dass, S. C. and Lee, J. (2004). A note on the consistency of Bayes factors for testing point null versus non-parametric alternatives. Journal of Statistical Planning and Inference, 119, 143-152.

[12] Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577-588.

[13] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209-230.

[14] Ferguson, T. S. (1983). Bayesian density estimation by mixtures of Normal distributions. In Recent Advances in Statistics, eds. M. Rizvi, J. Rustagi and D. Siegmund, pp. 287-302.
[15] Florens, J. P., Richard, J. F. and Rolin, J. M. (1996). Bayesian Encompassing Specification Tests of a Parametric Model Against a Nonparametric Alternative. Technical Report 96.08, Université Catholique de Louvain, Institut de Statistique.

[16] Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999a). Consistency issues in Bayesian nonparametrics. In Asymptotic Nonparametrics and Time Series: A Tribute to Madan Lal Puri, ed. S. Ghosh, New York: Marcel Dekker, pp. 639-667.

[17] Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999b). Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27, 143-158.

[18] Ghosal, S., Ghosh, J. K. and Samanta, T. (1995). On convergence of posterior distributions. The Annals of Statistics, 23, 2145-2152.

[19] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28, 500-531.

[20] Ghosal, S., Lember, J. and van der Vaart, A. W. (2008). Nonparametric Bayesian model selection and averaging. Electronic Journal of Statistics, 2, 63-89.

[21] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29, 1233-1263.

[22] Ghosal, S. and van der Vaart, A. W. (2007). Posterior Convergence Rates of Dirichlet Mixtures of Normal Distributions for Smooth Densities. The Annals of Statistics, 35, 192-223.

[23] Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. New York: Springer-Verlag.

[24] Lavine, M. (1992). Some Aspects of Polya Tree Distributions for Statistical Modelling. The Annals of Statistics, 20, 1222-1235.

[25] Lavine, M. (1994). More Aspects of Polya Tree Distributions for Statistical Modelling. The Annals of Statistics, 22, 1161-1176.

[26] Lenk, P. J. (1988). The logistic normal distribution for Bayesian, nonparametric, predictive densities. Journal of the American Statistical Association, 83, 509-516.

[27] Lenk, P. J. (1991). Towards a practicable Bayesian nonparametric density estimator. Biometrika, 78, 531-543.

[28] Liu, J. S. (1996). Nonparametric hierarchical Bayes via sequential imputations. The Annals of Statistics, 24, 911-930.
[29] Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates I: Density estimates. The Annals of Statistics, 12, 351-357.

[30] Mauldin, R. D., Sudderth, W. D. and Williams, S. C. (1992). Polya Trees and Random Distributions. The Annals of Statistics, 20, 1203-1221.

[31] MacEachern, S. N. and Müller, P. (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7, 223-238.

[32] MacEachern, S. N. and Müller, P. (2000). Efficient MCMC schemes for robust model extensions using encompassing Dirichlet process mixture models. In Robust Bayesian Analysis, Lecture Notes in Statistics, 152, 295-316. New York: Springer.

[33] MacEachern, S. N. (1998). Estimation of Bayes Factors in Mixture of Dirichlet Process Models. Technical Report, Ohio State University.

[34] McVinish, R., Rousseau, J. and Mengersen, K. (2009). Bayesian Goodness of Fit Testing with Mixtures of Triangular Distributions. Scandinavian Journal of Statistics, 36, 337-354.

[35] Müller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83, 67-79.

[36] Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian Data Analysis. Statistical Science, 19, 95-110.

[37] Petrone, S. and Wasserman, L. (2002). Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society, Series B, 64, 79-100.

[38] Rubin, H. and Sethuraman, J. (1965). Bayes Risk Efficiency. Sankhyā, Series A, 27, 347-356.

[39] Schwartz, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 4, 10-26.

[40] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. The Annals of Statistics, 29, 666-686.

[41] Tokdar, S. T. (2006). Posterior Consistency of Dirichlet Location-scale Mixture of Normals in Density Estimation and Regression. Sankhyā, 68, 90-110.

[42] Tokdar, S. T. (2007). Towards a Faster Implementation of Density Estimation with Logistic Gaussian Process Priors. Journal of Computational and Graphical Statistics, 16, 633-655.

[43] Tokdar, S. T. (2009). Testing normality with Dirichlet mixture alternatives. Preprint.

[44] Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137, 34-42.
[45] van der Vaart, A. W. and van Zanten, H. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36, 1435-1463.

[46] Verdinelli, I. and Wasserman, L. (1998). Bayesian Goodness of Fit Testing Using Infinite Dimensional Exponential Families. The Annals of Statistics, 26, 1215-1241.

[47] Walker, S. G. (2003). On sufficient conditions for Bayesian consistency. Biometrika, 90, 482-490.

[48] Walker, S. G. (2004). New approaches to Bayesian consistency. The Annals of Statistics, 32, 2028-2043.

[49] Wu, Y. and Ghosal, S. (2009). Kullback Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics, 2, 298-331.

[50] Wu, Y. and Ghosal, S. (2008). Posterior consistency for some semiparametric problems. Sankhyā, Series A, 70, 0-46.