Bayesian Analysis (2011) 6, Number 3, pp. 359–386
DOI: 10.1214/11-BA614. © 2011 International Society for Bayesian Analysis
An enriched conjugate prior for Bayesian nonparametric inference

Sara Wade, Silvia Mongelluzzo and Sonia Petrone
Department of Decision Sciences, Bocconi University, Milan, Italy
(sara.wade@phd.unibocconi.it, silvia.mongelluzzo@phd.unibocconi.it, sonia.petrone@unibocconi.it)

Abstract. The precision parameter α plays an important role in the Dirichlet Process. When assigning a Dirichlet Process prior to the set of probability measures on R^k, k > 1, this can be restrictive in the sense that the variability is determined by a single parameter. The aim of this paper is to construct an enrichment of the Dirichlet Process that is more flexible with respect to the precision parameter yet still conjugate, starting from the notion of enriched conjugate priors, which have been proposed to address an analogous lack of flexibility of standard conjugate priors in a parametric setting. The resulting enriched conjugate prior allows more flexibility in modelling uncertainty on the marginal and conditionals. We describe an enriched urn scheme which characterizes this process and show that it can also be obtained from the stick-breaking representation of the marginal and conditionals. For non-atomic base measures, this allows global clustering of the marginal variables and local clustering of the conditional variables. Finally, we consider an application to mixture models that allows for uncertainty between homoskedasticity and heteroskedasticity.

Keywords: Bayesian nonparametric inference, conjugate priors, generalized Dirichlet, Dirichlet process, mixture models, Pólya urns, multivariate random distribution functions
1 Motivation
Conjugacy is a desirable property because the posterior distribution remains analytically tractable; this is especially true in nonparametric inference, where the posterior distribution of non-conjugate priors can be very complex. The most popular prior in Bayesian nonparametric inference is the Dirichlet Process, and it is conjugate; if Z_i | P = P are independent and identically distributed (i.i.d.) according to P, and P is a Dirichlet process, DP(αP_0), with precision parameter α and base measure P_0 on the sample space Z, then P | Z_1 = z_1, ..., Z_n = z_n ∼ DP(αP_0 + Σ_{i=1}^n δ_{z_i}). However, when Z is a random vector and P is a random probability measure on R^k, k > 1, as in many applications, the choice of a Dirichlet process prior implies that the variability is determined by a single parameter, α. Indeed, the precision parameter α plays an important role; it not only reflects the strength of belief in the prior guess P_0, but also controls the ties configuration in a random sample from P. Thus, having only one degree of freedom, α,
in the prior can be quite restrictive. In fact, a similar lack of flexibility arises in a parametric setting; standard conjugate priors for the natural exponential family have only one parameter to control variability. To overcome this issue, a general class of enriched conjugate priors (Consonni and Veronese (2001)) has been proposed. A Dirichlet Process, DP(αP_0), is characterized by the fact that the finite dimensional distributions of the probability over any measurable partition, (C_1, ..., C_k), of Z, are Dirichlet with parameters (αP_0(C_1), ..., αP_0(C_k)). The Dirichlet Process inherits conjugacy from the conjugacy of the standard Dirichlet distribution prior for multinomial sampling, but it also inherits inflexibility from the fact that the Dirichlet distribution, like all standard conjugate priors, has only one parameter to control variability.

The question addressed in this paper is whether one can extend the notion of enriched conjugate priors to nonparametric inference and construct a prior on a random probability measure over R^k that is more flexible than the DP in allowing more parameters to control the variability, yet is still conjugate. Actually, Doksum's Neutral to the Right Process (Doksum (1974)) is an extension of the enriched conjugate Generalized Dirichlet distribution to a process, providing a more flexible, conjugate prior for univariate random distribution functions. The Generalized Dirichlet distribution is defined for a specific ordering of the random probabilities; thus, extension to a multivariate random distribution is not obvious, since there is no natural ordering in R^k.

Therefore, we start our analysis by constructing an enriched Dirichlet prior for a multivariate random distribution when the sample space is finite. To convey the main ideas, we will focus on the case when the random vector Z can be partitioned into two groups, Z = (X, Y), and the sample space can be written as the product of two finite spaces (or, in the more general case, the product of two complete separable metric spaces, Z = X × Y). In the finite case, the enriched Dirichlet distribution is obtained based on the reparametrization of the joint probabilities in terms of the marginal and the conditionals. Then, we extend this construction to a process by reparametrizing the joint random probability measure in terms of the marginal and conditionals and assigning independent Dirichlet Process priors to each of these terms. The parameters of the resulting enriched Dirichlet process again include a base measure controlling the location, but there are now many more parameters to control the variability. We show that the Dirichlet Process is in fact a special case, which, consequently, characterizes the distribution of the random conditionals. Although many desirable properties are maintained, some are necessarily weakened, including a clear asymmetry in the two (groups of) variables, which, however, may be reasonable in several applications.

The paper is organized as follows. In Section 2, we give a brief overview of enriched conjugate priors for the natural exponential family. In Section 3, we discuss the enriched Dirichlet distribution in the finite case as a particular enriched conjugate prior for multinomial sampling and provide a Pólya urn characterization. These notions are extended to a process in Section 4. Finally, a simple application to mixture models is illustrated using data on national test scores to compare schools in Section 5. Proofs
are given in the Appendix.
2 Preliminaries: Enriched Conjugate Priors
For a Natural Exponential Family (NEF) F on R^d, where d represents the dimension of the sufficient statistics, the likelihood for the natural parameter θ is given by:

L(θ | s, n) = exp(θ^T s − n M(θ))   for θ ∈ Θ,

where s is a d-dimensional vector of the sufficient statistics, M(θ) = log ∫ exp(θ^T x) η(dx), and η is a σ-finite measure on the Borel sets of R^d. The parameter space Θ is the interior of the set N = {θ ∈ R^d : M(θ) < ∞}. More generally, we have a Standard Exponential Family (SEF) if Θ ⊆ N, and it is non-empty and open. A family of measures on the Borel sets of Θ whose densities with respect to the Lebesgue measure are of the form π(θ | s_0, n_0) ∝ L(θ | s_0, n_0) is called the standard conjugate family of priors of F relative to the parametrization θ, where the sufficient statistics, s, are replaced by parameters, s_0, which control the location of the prior, and the sample size, n, is replaced by a single parameter, n_0, which controls the precision; see Diaconis and Ylvisaker (1979).

Consonni and Veronese (2001) discuss enriched conjugate priors for the NEF, moving from the notion of conditional reducibility. A d-dimensional NEF is called k-conditionally reducible if the density can be decomposed as the product of k standard exponential families, each depending on its own parameters. The notion of enriched conjugate priors involves replacing the sufficient statistics and the sample size with different hyperparameters within each SEF. This means giving independent standard conjugate priors to the parameters of the conditional densities, which induces a prior on the original parameter of the NEF that enriches the standard conjugate prior by allowing for k precision parameters. For a deeper discussion, see Consonni and Veronese (2001).

One important example is given by the Generalized Dirichlet distribution of Connor and Mosimann (1969), which provides an enriched conjugate prior for the parameters of a multinomial distribution; see Consonni and Veronese (2001), Example 4. Briefly, if (N_1, ..., N_k) is multinomial given (p_1 = p_1, ..., p_k = p_k), one can decompose the multinomial probability function as

p(N_1 = n_1, ..., N_k = n_k | p_1, ..., p_k) = p(N_1 = n_1 | V_1) p(N_2 = n_2 | N_1 = n_1, V_2) · · · p(N_k = n_k | N_1 = n_1, ..., N_{k−1} = n_{k−1}, V_k),

where each factor in the product is a NEF (namely, binomial), depending on its own parameter, V_1 = p_1, V_i = p_i / (1 − Σ_{j=1}^{i−1} p_j), i = 2, ..., k − 1, and V_k is degenerate at 1. The standard, Dirichlet(α_1, ..., α_k) conjugate prior corresponds to assuming V_i ∼ beta(α_i, Σ_{j=i+1}^k α_j) independently, i = 1, ..., k − 1. The enriched, or Generalized, Dirichlet conjugate prior allows a more flexible choice of the beta hyperparameters: V_i ∼ beta(α_i, β_i) independently, i = 1, ..., k − 1. It is worth underlining that some properties of the Dirichlet distribution are necessarily weakened. In particular, the Dirichlet prior implies that any permutation
of (p_1, ..., p_k) is completely neutral (the vector (p_1, ..., p_k) is completely neutral iff (p_1, p_2/(1 − p_1), ..., p_k/(1 − Σ_{j=1}^{k−1} p_j)) are independent). The Generalized Dirichlet only assumes that one ordered vector (p_1, ..., p_k) is completely neutral. This makes application to the bivariate case of contingency tables (p_{i,j}) not obvious, since there is no natural ordering in two dimensions. The enriched conjugate prior that we propose in the next section is a simple proposal in this direction.
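To make the completely neutral representation concrete, the following minimal Python sketch (numpy only, with illustrative parameter values that are not from the paper) draws from the Generalized Dirichlet by sampling the independent beta variables V_i and mapping them back to (p_1, ..., p_k).

import numpy as np

def sample_generalized_dirichlet(a, b, rng):
    # V_i ~ beta(a_i, b_i) independently; p_1 = V_1, p_i = V_i * prod_{j<i}(1 - V_j),
    # and the last component absorbs the remaining mass (V_k degenerate at 1).
    v = rng.beta(a, b)
    p = np.empty(len(a) + 1)
    remaining = 1.0
    for i, vi in enumerate(v):
        p[i] = vi * remaining
        remaining *= 1.0 - vi
    p[-1] = remaining
    return p

rng = np.random.default_rng(0)
a = np.array([2.0, 1.0, 3.0])   # illustrative alpha_i's
b = np.array([4.0, 2.0, 1.0])   # illustrative beta_i's
print(sample_generalized_dirichlet(a, b, rng))  # one draw on the 4-simplex

Choosing b_i = α_{i+1} + ... + α_k recovers the standard Dirichlet(α_1, ..., α_k), in line with the correspondence stated above.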
3 Finite case: Enriched Dirichlet distribution
Let {(X_n, Y_n)}_{n∈N} be a sequence of discrete random vectors with values in X × Y = {1, ..., k} × {1, ..., m}, such that (X_i, Y_i) | p = p are i.i.d. according to p, where p is a random probability function with mass p_{i,j} on (i, j), i = 1, ..., k; j = 1, ..., m. Then, given p = p, the vector of counts (N_{1,1}, ..., N_{k,m}), where N_{i,j} is the number of times the pair (i, j) is observed in a sample ((X_1, Y_1), ..., (X_n, Y_n)), has a multinomial probability function

p(n_{1,1}, ..., n_{k,m−1} | p_{1,1}, ..., p_{k,m−1})
= [n! / (n_{1,1}! · · · n_{k,m−1}! (n − Σ_{(i,j)≠(k,m)} n_{i,j})!)] p_{1,1}^{n_{1,1}} · · · p_{k,m−1}^{n_{k,m−1}} (1 − Σ_{(i,j)≠(k,m)} p_{i,j})^{n − Σ_{(i,j)≠(k,m)} n_{i,j}},   (1)

for n_{i,j} ≥ 0; Σ_{i=1}^k Σ_{j=1}^m n_{i,j} = n. The standard conjugate prior for (p_{1,1}, ..., p_{k,m}) is the Dirichlet distribution, which involves replacing the km − 1 sufficient statistics in (1) with hyperparameters, s_0 = (s_{01,1}, ..., s_{0k,m−1}), that control the location of the prior, and the sample size with a single hyperparameter, n_0, that controls the precision of the prior. As discussed in Section 2, a generalized Dirichlet prior is problematic in this case, since there is no natural ordering of the probabilities p_{i,j}. However, a fairly natural and simple enrichment can be obtained by first applying the linear transformation:

N_{i+} = Σ_{j=1}^m N_{i,j}   for i = 1, ..., k − 1,
N_{i,j} = N_{i,j}   for i = 1, ..., k; j = 1, ..., m − 1,

followed by the reparametrization:

p_{i+} = Σ_{j=1}^m p_{i,j}   for i = 1, ..., k − 1,
p_{j|i} = p_{i,j} / p_{i+}   for i = 1, ..., k − 1; j = 1, ..., m − 1,
p_{j|k} = p_{k,j} / (1 − Σ_{i=1}^{k−1} p_{i+})   for j = 1, ..., m − 1.
Define N_+ = (N_{1+}, ..., N_{k−1+}); N_i = (N_{i,1}, ..., N_{i,m−1}); p_+ = (p_{1+}, ..., p_{k−1+}); and p_i = (p_{1|i}, ..., p_{m−1|i}), for i = 1, ..., k. Under this linear transformation and reparametrization, the multinomial is a (k + 1)-conditionally reducible NEF:

p(n_+, n_1, ..., n_k | p_+, p_1, ..., p_k) = p(n_+ | p_+) Π_{i=1}^k p(n_i | p_i, n_+),   (2)

where
(N_{1+}, ..., N_{k+} | p_{1+}, ..., p_{k+}) ∼ Mult(n, p_{1+}, ..., p_{k+}),
(N_{i,1}, ..., N_{i,m} | n_{i+}, p_{1|i}, ..., p_{m|i}) ∼ Mult(n_{i+}, p_{1|i}, ..., p_{m|i})   for i = 1, ..., k.

By replacing the sufficient statistics and sample size with different parameters within each SEF on the right hand side of (2), one can create a more flexible conjugate prior. In particular, letting (s_{0(+)}, s_{0(1)}, ..., s_{0(k)}) denote the km − 1 location parameters and (n_{0+}, n_{01}, ..., n_{0k}) denote the precision parameters, in terms of (p_+, p_1, ..., p_k), the Enriched Dirichlet conjugate prior is:

p_{1+}, ..., p_{k+} ∼ Dir(s_{01+}, ..., s_{0k−1+}, n_{0+} − Σ_{i=1}^{k−1} s_{0i+}),   (3)
p_{1|i}, ..., p_{m|i} ∼ Dir(s_{0i,1}, ..., s_{0i,m−1}, n_{0i} − Σ_{j=1}^{m−1} s_{0i,j}),

where (p_{1+}, ..., p_{k+}), (p_{1|1}, ..., p_{m|1}), ..., (p_{1|k}, ..., p_{m|k}) are independent. We get back to the Dirichlet distribution if n_{0i} = s_{0i+} for i = 1, ..., k − 1 and n_{0+} = Σ_{i=1}^k n_{0i}.

Remark 1. The Dirichlet distribution on the vector p = (p_{1,1}, ..., p_{k,m}) defining the random marginal, p_x, p_y, and conditional, p_{y|x}, p_{x|y}, probability functions is characterized by the properties (i) p_x(·) and p_{y|x}(·|i), i = 1, ..., k, are independent, and (ii) p_y(·) and p_{x|y}(·|j), j = 1, ..., m, are independent; see Geiger and Heckerman (1997). The Enriched Dirichlet relaxes the requirement that the independence properties hold in both directions: we maintain (i) and allow more degrees of freedom in the distributions of p_x and p_{y|x}.

Remark 2. Under the linear transformation discussed here, the multinomial could also be viewed as a (km − 1)-conditionally reducible NEF; it can be written as the product of km − 1 SEFs (namely, binomial), each depending on its own parameters. The resulting enriched conjugate prior has km − 1 parameters to control the precision and can be seen as a nested version of the Generalized Dirichlet distribution of Connor and Mosimann (1969).

In the rest of the paper, we will use the following parametrization of the distributions (3). Let α(·) be a finite measure on X and µ(·, ·) be a mapping from 2^Y × X to R^+ such that for every x ∈ X, µ(·, x) is a finite measure on (Y, 2^Y). Then we assume that the parameters in (3) are chosen in terms of α(·) and µ(·, ·):

p_{1+}, ..., p_{k+} ∼ Dir(α(1), ..., α(k)),
p_{1|i}, ..., p_{m|i} ∼ Dir(µ(1, i), ..., µ(m, i)),   i = 1, ..., k,   (4)
with the convention that if α(i) = 0 then p_{i+} is degenerate at 0, and if µ(j, i) = 0 then p_{j|i} is degenerate at 0. If α(i) > 0 and µ(j, i) > 0 for all i, j, then the enriched Dirichlet conjugate prior induced on (p_{1,1}, ..., p_{k,m}) is:

f(p_{1,1}, ..., p_{k,m−1}) = [Γ(α(X)) / Π_{i=1}^k Γ(α(i))] Π_{i=1}^{k−1} (Σ_{j=1}^m p_{i,j})^{α(i)−µ(Y,i)} (1 − Σ_{i=1}^{k−1} Σ_{j=1}^m p_{i,j})^{α(k)−µ(Y,k)}
× Π_{i=1}^k { [Γ(µ(Y, i)) / Π_{j=1}^m Γ(µ(j, i))] Π_{j=1}^{m−1} p_{i,j}^{µ(j,i)−1} } Π_{i=1}^{k−1} p_{i,m}^{µ(m,i)−1} (1 − Σ_{(i,j)≠(k,m)} p_{i,j})^{µ(m,k)−1}.

Clearly, the prior of the marginal probabilities (p_{+1}, ..., p_{+m}) on Y is no longer a Dirichlet distribution, and in fact, the density may not be available in closed form. But we can give the following representation in terms of G-Meijer variables (Springer and Thompson (1970)). First, remembering the Gamma representation of the Dirichlet distribution and defining independent U_i ∼ Gamma(α(i), 1) and V_{ij} ∼ Gamma(µ(j, i), 1), we have the following G-Meijer representation of the vector (p_{1,1}, ..., p_{k,m}):

(p_{1,1}, ..., p_{k,m}) =_d ( (U_1 / Σ_{i=1}^k U_i)(V_{11} / Σ_{j=1}^m V_{1j}), ..., (U_k / Σ_{i=1}^k U_i)(V_{km} / Σ_{j=1}^m V_{kj}) ),

which is independent of (Σ_{i=1}^k U_i, Σ_{j=1}^m V_{1j}, ..., Σ_{j=1}^m V_{kj}), where the symbol =_d denotes equality in distribution. Therefore, the marginal probabilities over Y can be represented as sums of G-Meijer random variables:

(p_{+1}, ..., p_{+m}) =_d ( Σ_{i=1}^k (U_i / Σ_{h=1}^k U_h)(V_{i1} / Σ_{j=1}^m V_{ij}), ..., Σ_{i=1}^k (U_i / Σ_{h=1}^k U_h)(V_{im} / Σ_{j=1}^m V_{ij}) ).
3.1 Enriched Pólya Urn
An alternative way to define the Enriched Dirichlet distribution is based on a Pólya urn scheme, which will be useful in extending the distribution to a process. In the bivariate setting, the standard Pólya urn scheme describes the predictive distribution of a sequence of random vectors. An urn contains pairs of balls of color (i, j) ∈ X × Y. A pair of balls is drawn from the urn and replaced along with another pair of balls of the same colors. The random vector, (X_n, Y_n), is equal to (i, j) if the n-th pair drawn is of color (i, j). Alternatively, we can consider one urn containing just X-balls and k urns, say Y|i urns, containing only Y-balls. We first draw an X-ball from the X-urn and replace it along with another ball of the same color; then, depending on the color of the X-ball, we draw a Y-ball from the urn associated with the X-ball drawn, and replace it along with another ball of the same color. In this case, the random vector, (X_n, Y_n), is equal to (i, j) if the
n-th X-ball drawn is of color i and the Y-ball associated with it is of color j. If the number of Y-balls in the Y|i urn is equal to the number of balls of color i in the X-urn, the two urn schemes are equivalent.

The Enriched Pólya Urn scheme enriches this urn scheme by relaxing the constraint that the number of Y-balls in the Y|i urn has to equal the number of X-balls of color i in the X-urn for i = 1, ..., k. More precisely, the number of balls in each urn is specified as follows:

α(i) is the number of X-balls of color i;
µ(j, i) is the number of Y-balls of color j in the Y|i urn;

where α(X) = Σ_{i=1}^k α(i) is the total number of balls in the X-urn and µ(Y, i) = Σ_{j=1}^m µ(j, i) is the total number of balls in the Y|i urn, for i = 1, ..., k. This urn scheme implies the following predictive distribution:

Pr(X_1 = i, Y_1 = j) = (α(i) / α(X)) (µ(j, i) / µ(Y, i)),

Pr(X_{n+1} = i, Y_{n+1} = j | X_1 = i_1, Y_1 = j_1, ..., X_n = i_n, Y_n = j_n)
= [(α(i) + Σ_{h=1}^n δ_{i_h}(i)) / (α(X) + n)] · [(µ(j, i) + Σ_{h=1}^n δ_{(j_h,i_h)}(j, i)) / (µ(Y, i) + Σ_{h=1}^n δ_{i_h}(i))].
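A minimal Python simulation of this enriched urn scheme is sketched below (numpy only; the urn compositions are illustrative, not from the paper). Each draw reinforces the X-urn and only the Y|i urn that was actually visited, which is exactly where the extra flexibility over the standard bivariate Pólya urn comes from.

import numpy as np

def enriched_polya_urn(alpha, mu, n, rng):
    # alpha: shape (k,) X-urn composition; mu: shape (k, m), row i is the Y|i urn.
    alpha = alpha.astype(float).copy()
    mu = mu.astype(float).copy()
    draws = []
    for _ in range(n):
        i = rng.choice(len(alpha), p=alpha / alpha.sum())    # draw an X-ball
        j = rng.choice(mu.shape[1], p=mu[i] / mu[i].sum())   # draw from the Y|i urn
        alpha[i] += 1.0    # replace the X-ball plus one more of color i
        mu[i, j] += 1.0    # reinforce only the visited Y|i urn
        draws.append((i, j))
    return draws

rng = np.random.default_rng(2)
print(enriched_polya_urn(np.array([1.0, 1.0]),
                         np.array([[1.0, 1.0], [0.5, 3.0]]), 10, rng))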
Theorem 1. Let {(X_n, Y_n)}_{n∈N} be a sequence of random vectors taking values in {1, ..., k} × {1, ..., m} with predictive distributions characterized by an Enriched Pólya urn scheme with parameters α(·) and µ(·, ·). Then,

1. the sequence of random vectors {(X_n, Y_n)}_{n∈N} is exchangeable, and its de Finetti measure is an Enriched Dirichlet distribution with parameters α(·) and µ(·, ·);

2. as n → ∞, the sequence of the predictive distributions p_n(i, j) = Pr(X_{n+1} = i, Y_{n+1} = j | X_1 = i_1, Y_1 = j_1, ..., X_n = i_n, Y_n = j_n) converges a.s., with respect to the exchangeable law, to a random probability function, p; and p is distributed according to the Enriched Dirichlet de Finetti measure.

The proof is an extension of that used for the standard Pólya urn (see Ghosh and Ramamoorthi (2003), pages 94-95). The first step is to show the sequence of random vectors is exchangeable. Next, computing their finite dimensional distributions and using de Finetti's Representation Theorem, the random vectors are shown to be i.i.d. given the random variables (p_{1+}, ..., p_{k+}, p_{1|1}, ..., p_{m|k}) = (p_{1+}, ..., p_{k+}, p_{1|1}, ..., p_{m|k}), which are distributed according to an Enriched Dirichlet distribution with parameters α and µ. A detailed proof is given in the Appendix.
4 Enriched Dirichlet Process
Assume X and Y are complete and separable metric spaces with Borel σ-algebras BX and BY . Let B be the σ-algebra generated by the product of the σ-algebras of X and Y and
P(B) be the set of probability measures on the measurable product space (X × Y, B), where P(B_X), P(B_Y) are similarly defined. For any P ∈ P(B), let P_X denote the marginal probability measure, P_{Y|X}(·|x) for x ∈ X denote a version of the conditional, and P_{Y|X} denote the entire version of the conditional as an element of P(B_Y)^X. Here, we consider the Borel σ-algebra under weak convergence on P(B), P(B_X), and P(B_Y), and the product σ-algebra on P(B_Y)^X.

We will define a probability measure on P(B) that is more flexible than the Dirichlet Process with respect to the precision parameter and still retains conjugacy, by extending the ideas of the Enriched Dirichlet distribution. Note that trying to enrich the DP by using the Enriched Dirichlet in place of the Dirichlet as the finite dimensional distributions, i.e., defining a random P such that (P(A_1 × B_1), ..., P(A_k × B_m)) has an Enriched Dirichlet distribution, would not succeed, because finite additivity holds only with a specification of the parameters that is equivalent to the Dirichlet distribution. Instead, we use directly the idea of the Enriched Dirichlet distribution, which defines a prior for the joint by first decomposing it in terms of the marginal and conditionals and then assigning independent conjugate priors to them.

If X, Y are general spaces, it is a delicate issue to establish that such an approach induces a prior on the joint. In particular, given a prior on P(B_X) × P(B_Y)^X, the map (P_X, P_{Y|X}) → ∫_{(·)} P_{Y|X}(·|x) dP_X(x) induces a prior on P(B) if it is jointly measurable in (P_X, P_{Y|X}), which is not true in general. Fortunately, if the prior for the marginal concentrates on the set of discrete probability measures and independence assumptions hold, the prior on the marginal and conditionals can be restricted to a subspace of P(B_X) × P(B_Y)^X that has measure one, and on this subspace the mapping is measurable, as shown after the following definition.

Definition 2. Let α be a finite measure on (X, B_X) and µ be a mapping from B_Y × X to R^+ such that, as a function of B ∈ B_Y, it is a finite measure on (Y, B_Y) and, as a function of x ∈ X, it is α-integrable. Assume:

1. Law of Marginal, Q^X: P_X is a random probability measure on (X, B_X) where P_X ∼ DP(α).

2. Law of Conditionals, Q_x^{Y|X}: ∀x ∈ X, P_{Y|X}(·|x) is a random probability measure on (Y, B_Y) where P_{Y|X}(·|x) ∼ DP(µ(·, x)).

3. Joint Law of Conditionals, Q^{Y|X} = Π_{x∈X} Q_x^{Y|X}: P_{Y|X}(·|x), x ∈ X, are independent among themselves.

4. Joint Law of Marginal and Conditionals, Q = Q^X × Q^{Y|X}: P_X is independent of {P_{Y|X}(·|x)}_{x∈X}.

The joint law of the marginal and conditionals, Q, induces the law, Q̃, of the stochastic process {P(C)}_{C∈B} through the following reparametrization:

P(A × B) =_d ∫_A P_{Y|X}(B | x) dP_X(x),   for any set A × B ∈ B_X × B_Y.   (5)
This process is called an Enriched Dirichlet Process (EDP) with parameters α and µ, and is denoted P ∼ EDP(α, µ).

The following arguments verify that the four conditions in Definition 2 induce a distribution for the random joint probability measure. In particular, we define a subspace of P(B_X) × P(B_Y)^X that has measure one, such that on this subspace the mapping (P_X, P_{Y|X}) → ∫_{(·)} P_{Y|X}(·|x) dP_X(x) is measurable. First note that in order for P_{Y|X}(·|x), x ∈ X, to be a set of conditional random probability measures, the following two properties need to be satisfied:

1. ∀x ∈ X, P_{Y|X}(·|x) is a probability measure on (Y, B_Y) a.s. Q_x^{Y|X}.

2. ∀B ∈ B_Y, as a function of x, P_{Y|X}(B|x) is B_X-measurable a.s. Q^{Y|X}.

The first item is satisfied since P_{Y|X}(·|x) ∼ DP(µ(·, x)) implies P_{Y|X}(·|x) ∈ P(B_Y) with probability one. The second property follows from results of Ramamoorthi and Sangalli (2006). In particular, letting ∆ be the subset of P(B_Y)^X such that P_{Y|X} is measurable as a function of x, they show that if P_{Y|X}(·|x) are independent among x ∈ X, then the product measure, Q^{Y|X} = Π_{x∈X} Q_x^{Y|X}, given by Kolmogorov's Extension Theorem, assigns outer measure one to ∆.

Let P_D(B_X) denote the set of discrete probability measures on the measurable space (X, B_X). From properties of the DP, Q^X(P_D(B_X)) = 1. Therefore, by independence of P_X and P_{Y|X}, the set P_D(B_X) × ∆ has Q-measure one. Again, by results of Ramamoorthi and Sangalli (2006), on P_D(B_X) × ∆, for A × B ∈ B_X × B_Y, the function (P_X, P_{Y|X}) → ∫_A P_{Y|X}(B|x) dP_X(x) is jointly measurable in (P_X, P_{Y|X}). These results imply that we can define a prior, Q̃, on P(B), induced from Q restricted to P_D(B_X) × ∆ via the map (P_X, P_{Y|X}) → ∫_{(·)} P_{Y|X}(·|x) dP_X(x).

Obviously, this map is not 1-1. In fact, the definition of the EDP states that the four conditions hold for the joint distribution of (P_X, P_{Y|X}) for a fixed version of the conditional, and this induces a prior on the joint. However, from the induced prior on the random joint probability measure, we can obtain the joint distribution of P_X and P_{Y|X} through the mapping P → (P_X, P_{Y|X}) defined from any version of the conditional. In the next section, we show, through an extension of the enriched Pólya urn scheme to the infinite case, that although the mapping is not 1-1, the joint law of P_X and P_{Y|X} defined from any version of the conditional and the induced law of the joint probability measure still satisfy the conditions in Definition 2.
4.1 Enriched Pólya Sequence
Similar to Blackwell and MacQueen (1973), we define an Enriched Pólya sequence, which extends the enriched Pólya urn scheme to the case when X and Y are complete separable metric spaces.
Definition 3. The sequence of random vectors {(X_n, Y_n)}_{n∈N} taking values in X × Y is an Enriched Pólya sequence with parameters α and µ if:

1. For A ∈ B_X and for all n ≥ 1,

Pr(X_1 ∈ A) = α(A) / α(X),
Pr(X_{n+1} ∈ A | X_1 = x_1, ..., X_n = x_n) = (α(A) + Σ_{i=1}^n δ_{x_i}(A)) / (α(X) + n).

2. For B ∈ B_Y and for all n ≥ 1,

Pr(Y_1 ∈ B | X_1 = x) = µ(B, x) / µ(Y, x),
Pr(Y_{n+1} ∈ B | Y_1 = y_1, ..., Y_n = y_n, X_1 = x_1, ..., X_n = x_n, X_{n+1} = x)
= (µ(B, x) + Σ_{j=1}^{n_x} δ_{y_{x,j}}(B)) / (µ(Y, x) + n_x),

where n_x = Σ_{i=1}^n δ_{x_i}(x) and {y_{x,j}}_{j=1}^{n_x} = {y_i : x_i = x, i = 1, ..., n}.
In words, the predictive distributions characterizing the Enriched Pólya sequence can be interpreted in terms of draws from urns as follows. Initially, there is an X-urn containing α(X) balls of color 0. A ball is first drawn from the X-urn, and once drawn, its true color, x_1, is revealed (where x_1 is the realization of a draw from P_{0X}(·) = α(·)/α(X)). A ball of color x_1 is added to the urn along with a ball of color 0, so that the urn is now composed of α(X) balls of color 0 and one ball of color x_1. Once the true color x_1 of the X-ball is revealed, a Y|x_1-urn is created with µ(Y, x_1) balls of color 0. Next, a ball is drawn from the Y|x_1-urn, and similarly, once drawn, its true color is revealed to be y_1 (where y_1 is the realization of a draw from P_{0Y|X}(·|x_1) = µ(·, x_1)/µ(Y, x_1)). This ball is then added to the Y|x_1-urn along with a ball of color 0, so that the urn contains µ(Y, x_1) balls of color 0 and one ball of color y_1. At the next stage, we again first draw a ball from the X-urn. We can either draw a 0 ball or an x_1 ball. If an x_1 ball is drawn, we replace it along with another ball of the same color and then draw a Y-ball from the Y|x_1 urn. If the X-ball drawn is of color 0, then once drawn, its true color, x_2, is revealed. We add a ball of color x_2 to the X-urn and create a Y|x_2 urn with µ(Y, x_2) balls of color 0. This process is repeated, so that a new Y|x urn is created for each new value of X that is observed.

Note that if P ∼ EDP(α, µ) and the random vectors (X_1, Y_1), ..., (X_n, Y_n) given P = P are i.i.d. and distributed according to P, then {(X_n, Y_n)}_{n∈N} is an Enriched Pólya sequence. Conversely, the following theorem proves that if {(X_n, Y_n)}_{n∈N} is an Enriched Pólya sequence, then given a random probability measure P = P, the random vectors (X_1, Y_1), ..., (X_n, Y_n) are i.i.d. and distributed according to P, where the joint distribution of (P_X, P_{Y|X}) defined from any fixed version of the conditional satisfies
the four conditions in Definition 2. Therefore, in addition to the fact that the de Finetti measure of an Enriched Pólya sequence is an Enriched Dirichlet Process, this theorem also shows that the induced law of the random joint from the four conditions in Definition 2 still maintains those properties even though the mapping is not 1-1.

Theorem 4. If {(X_n, Y_n)}_{n∈N} is an Enriched Pólya sequence with parameters α and µ, then {(X_n, Y_n)}_{n∈N} is an exchangeable sequence and its de Finetti measure is an Enriched Dirichlet Process with parameters (α, µ).

For a quick sketch of the proof, we start by showing that the sequence {(X_n, Y_n)}_{n∈N} is exchangeable, and then apply de Finetti's Theorem. Next, after reparametrizing in terms of the marginal and conditionals, we verify the de Finetti measure satisfies the four conditions in the definition of the EDP. A detailed proof is given in the Appendix.
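The following minimal Python sketch simulates an Enriched Pólya sequence on R × R under illustrative assumptions that are not from the paper: base measures P_{0X} = N(0, 1) and P_{0Y|X}(·|x) = N(x, 1), with precisions a = α(X) and m = µ(Y, x) taken constant in x.

import numpy as np

def enriched_polya_sequence(n, a, m, rng):
    xs, ys = [], []
    y_by_x = {}    # the observed y's attached to each distinct x value
    for _ in range(n):
        i = len(xs)
        # X-step: new atom from P0X w.p. a/(a+i), else reuse an old x uniformly
        if rng.random() < a / (a + i):
            x = rng.normal(0.0, 1.0)
        else:
            x = xs[rng.integers(i)]
        ys_x = y_by_x.setdefault(x, [])
        # Y-step within the x-cluster: new atom from P0Y|X(.|x) w.p. m/(m+n_x)
        if rng.random() < m / (m + len(ys_x)):
            y = rng.normal(x, 1.0)
        else:
            y = ys_x[rng.integers(len(ys_x))]
        xs.append(x)
        ys_x.append(y)
        ys.append(y)
    return xs, ys

rng = np.random.default_rng(3)
xs, ys = enriched_polya_sequence(50, a=1.0, m=5.0, rng=rng)
print(len(set(xs)), len(set(zip(xs, ys))))   # global x-clusters vs (x, y)-clusters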
4.2 Properties
Define P_{0X}(·) = α(·) / α(X) and, for every x ∈ X, P_{0Y|X}(·|x) = µ(·, x) / µ(Y, x). From well-known properties of the Dirichlet distribution, we have:

Proposition 1. If P ∼ EDP(α, µ), for A ∈ B_X, B ∈ B_Y,

E[P_X(A)] = P_{0X}(A);   Var(P_X(A)) = P_{0X}(A)(1 − P_{0X}(A)) / (α(X) + 1);
∀x ∈ X, E[P_{Y|X}(B|x)] = P_{0Y|X}(B|x);   Var(P_{Y|X}(B|x)) = P_{0Y|X}(B|x)(1 − P_{0Y|X}(B|x)) / (µ(Y, x) + 1);
E[P(A × B)] = ∫_A P_{0Y|X}(B|x) dP_{0X}(x) := P_0(A × B).

Therefore, similar to the DP, the location of the EDP is determined by the base measure P_0, but there are now many more parameters to control the precision, namely α(X) and µ(Y, x) for every x ∈ X. The following proposition states that the DP is in fact a special case of the EDP.

Proposition 2. P ∼ EDP(α, µ) with µ(Y, x) = α(x), ∀x ∈ X, is equivalent to P ∼ DP(α(X)P_0).

The proof relies on the urn characterization of both processes; we show that an Enriched Pólya sequence is equivalent to a Pólya sequence with parameter α(X)P_0(·) if µ(Y, x) = α(x), ∀x ∈ X. A more detailed proof is given in the Appendix.

As a by-product of this proposition, if P ∼ DP(α(X)P_0), the law of the random conditionals is P_{Y|X}(·|x) ∼ DP(α(x)P_{0Y|X}(·|x)), where P_{Y|X}(·|x) are independent among x ∈ X. In general, the marginal base measure can assign positive mass to countably many locations. Any random conditional probability measure associated with an x that has positive mass under the marginal base measure will be a DP with precision parameter equal to the mass under the marginal base measure times α(X). Since a
DP with precision parameter 0 is degenerate on a random location with probability one, the random conditional probability measures associated with all other x's will be degenerate at some y ∈ Y with probability one. Thus, in the case when P_0 is non-atomic, a DP implies assuming the conditionals are independent and degenerate a.s., which is consistent with results in Ramamoorthi and Sangalli (2006).

As noted by Ferguson (1973), a prior for nonparametric problems should have large topological support. The following theorem shows that the EDP has full weak support. Here, X = R^{k_1} and Y = R^{k_2}, implying X × Y = R^k where k = k_1 + k_2.

Theorem 5. Let S_0 denote the topological support of P_0. If P ∼ EDP(α, µ), then the topological support of P is M_0 = {P ∈ P(B) : topological support(P) ⊆ S_0}.
4.3 Posterior
Just as the finite dimensional Enriched Dirichlet distribution is conjugate to the multinomial likelihood, the Enriched Dirichlet Process is also conjugate for estimating an unknown distribution from exchangeable data. More precisely,

Proposition 3. If (X_i, Y_i) | P = P are i.i.d. according to P, where P ∼ EDP(α, µ), then P | x_1, y_1, ..., x_n, y_n ∼ EDP(α_n, µ_n), where

α_n = α + Σ_{i=1}^n δ_{x_i},
∀x ∈ X, µ_n(·, x) = µ(·, x) + Σ_{j=1}^{n_x} δ_{y_{x,j}};   n_x = Σ_{i=1}^n δ_{x_i}(x),   {y_{x,j}}_{j=1}^{n_x} = {y_j : x_j = x}.
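For finite spaces, where the measures α and µ(·, x) can be stored as weight vectors, the conjugate update of Proposition 3 is a simple count update; the following minimal Python sketch (with illustrative prior weights) makes this explicit.

import numpy as np

def edp_posterior(alpha, mu, data):
    # alpha: shape (k,); mu: shape (k, m); data: list of (x, y) index pairs.
    alpha_n = alpha.astype(float).copy()
    mu_n = mu.astype(float).copy()
    for x, y in data:
        alpha_n[x] += 1.0    # alpha_n = alpha + sum_i delta_{x_i}
        mu_n[x, y] += 1.0    # mu_n(., x) = mu(., x) + sum_j delta_{y_{x,j}}
    return alpha_n, mu_n

alpha = np.array([1.0, 1.0])
mu = np.array([[0.5, 0.5], [1.0, 1.0]])
print(edp_posterior(alpha, mu, data=[(0, 1), (0, 1), (1, 0)]))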
The proof of conjugacy is straightforward; one simply has to demonstrate that, given the random sample, the four conditions in the definition of the EDP hold with the updated parameters specified above. The first two conditions, the fact that the marginal and conditionals are DPs with updated parameters, follow from conjugacy of the DP. The last two conditions, independence of the marginal and conditionals and independence among the conditionals, follow by combining the fact that a priori independence holds with independence of the random vectors (X_1, ..., X_n) and (Y_1, ..., Y_n | X_1 = x_1, ..., X_n = x_n) and independence of the random vectors {Y_{x,j}}_{j=1}^{n_x} among x ∈ X.

Posterior consistency is a frequentist validation tool that is useful in Bayesian nonparametric inference, where the infinite dimension of the parameter space can make specification of a prior challenging and cause the prior to strongly influence the posterior even with large amounts of data. One of the reasons that makes the Dirichlet Process so appealing is that the posterior is weakly consistent for any probability measure, Π, on the product space, under the assumption that the sequence of random vectors is distributed according to the i.i.d. product measure Π^∞. Another important property that the EDP maintains is posterior consistency. The proof requires that for a set
A × B ∈ B_X × B_Y, the posterior expectation of P(A × B) converges to Π(A × B) a.s. Π^∞ and its posterior variance goes to zero. In the following lemma, the variance of the probability over a set A × B ∈ B_X × B_Y is specified.

Lemma 1. If P ∼ EDP(α, µ), for A × B ∈ B_X × B_Y,

Var(P(A × B)) = [1 / (α(X) + 1)] ∫_A [P_{0Y|X}(B|x)(1 + µ(Y, x)P_{0Y|X}(B|x)) / (µ(Y, x) + 1)] dP_{0X}(x)   (I_1)
+ [α(X) / (α(X) + 1)] ∫_A ∫_{{x}} [P_{0Y|X}(B|x)(1 − P_{0Y|X}(B|x)) / (µ(Y, x) + 1)] dP_{0X}(x_0) dP_{0X}(x)   (I_2)
− [1 / (α(X) + 1)] ∫_A ∫_{{x}} P_{0Y|X}(B|x)^2 dP_{0X}(x_0) dP_{0X}(x)   (I_3)
− [1 / (α(X) + 1)] ∫_A ∫_{A∖{x}} P_{0Y|X}(B|x_0) P_{0Y|X}(B|x) dP_{0X}(x_0) dP_{0X}(x).   (I_4)
Theorem 6. If P ∼ EDP(α, µ), then, for Π ∈ P(B), the posterior distribution, Q_n, of P converges weakly to δ_Π as n → ∞, a.s. Π^∞.

The proofs are given in the Appendix.
4.4 Square-Breaking Construction
The following square-breaking representation of the EDP is a direct result of Sethuraman's stick-breaking representation of the DP (Sethuraman (1994)).

Proposition 4. If P ∼ EDP(α, µ), it has the following square-breaking a.s. representation:

P = Σ_{i=1}^∞ Σ_{j=1}^∞ π_i^X π_{j|i}^Y δ_{(X_i^*, Y_{j|i}^*)},

where π_1^X = V_1^X; π_i^X = V_i^X Π_{h=1}^{i−1}(1 − V_h^X), with V_i^X i.i.d. beta(1, α(X)) and X_i^* i.i.d. P_{0X}; and, for i = 1, 2, ...: π_{1|i}^Y = V_{1|i}^Y; π_{j|i}^Y = V_{j|i}^Y Π_{h=1}^{j−1}(1 − V_{h|i}^Y), with V_{j|i}^Y | X_i^* = x_i^* i.i.d. beta(1, µ(Y, x_i^*)) and Y_{j|i}^* | X_i^* = x_i^* i.i.d. P_{0Y|X}(·|x_i^*); and the sequences {V_i^X}_{i=1}^∞, {X_i^*}_{i=1}^∞, {V_{j|1}^Y | X_1^* = x_1^*}_{j=1}^∞, {V_{j|2}^Y | X_2^* = x_2^*}_{j=1}^∞, ... and {Y_{j|1}^* | X_1^* = x_1^*}_{j=1}^∞, {Y_{j|2}^* | X_2^* = x_2^*}_{j=1}^∞, ... are independent.

For an alternative view of this proposition, consider a square of area one; we break off rectangles of the square defined by a width of π_i^X and a length of π_{j|i}^Y, and we assign the area of that rectangle, π_i^X π_{j|i}^Y, to a random location (X_i^*, Y_{j|i}^*).
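A minimal Python sketch of the square-breaking construction, truncated at N × M atoms, is given below; the bases P_{0X} = N(0, 1) and P_{0Y|X}(·|x) = N(x, 1), with α(X) = a and µ(Y, x) = m constant in x, are illustrative assumptions, not the paper's specification.

import numpy as np

def square_breaking(a, m, N, M, rng):
    # Stick-breaking weights for the marginal: pi_i^X = V_i prod_{h<i}(1 - V_h)
    vX = rng.beta(1.0, a, size=N)
    piX = vX * np.cumprod(np.concatenate(([1.0], 1.0 - vX[:-1])))
    xs = rng.normal(0.0, 1.0, size=N)                          # X_i* ~ P0X
    # Conditional sticks, one independent sequence per X_i*
    vY = rng.beta(1.0, m, size=(N, M))
    piY = vY * np.cumprod(np.hstack((np.ones((N, 1)), 1.0 - vY[:, :-1])), axis=1)
    ys = rng.normal(xs[:, None], 1.0, size=(N, M))             # Y*_{j|i} ~ P0Y|X(.|x_i*)
    return xs, ys, piX[:, None] * piY                          # rectangle areas

rng = np.random.default_rng(4)
xs, ys, w = square_breaking(a=1.0, m=2.0, N=50, M=50, rng=rng)
print(w.sum())   # close to 1; the truncation discards the leftover mass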
Note that while a closed form for the finite dimensional distributions of P_Y may not be available, we can obtain a square-breaking construction for the random marginal probability measure on (Y, B_Y):

P_Y = Σ_{i=1}^∞ Σ_{j=1}^∞ π_i^X π_{j|i}^Y δ_{Y_{j|i}^*},

where the distribution of {π_i^X}_{i=1}^∞, {π_{j|i}^Y}_{i,j=1}^∞, and {Y_{j|i}^*}_{i,j=1}^∞ is specified above.

4.5 Clustering Structure
The clustering structure in a sample from P ∼ EDP is characterized by the predictive rule. In particular, the predictive rule states that if P_0 is non-atomic, for A × B ∈ B_X × B_Y:

Pr(X_{n+1} ∈ A, Y_{n+1} ∈ B | x_1, y_1, ..., x_n, y_n)
= [α(X) / (α(X) + n)] P_0(A × B) + Σ_{x_i^* ∈ A} [n_i / (α(X) + n)] ( (µ(B, x_i^*) + Σ_{j=1}^{n_i} δ_{y_{ij}}(B)) / (µ(Y, x_i^*) + n_i) ).

Thus, the pair (X_{n+1}, Y_{n+1}) is either a "new-new", "old-new", or "old-old" pair, with probabilities obtained by replacing the set A × B with the sets (X \ {x_1, ..., x_n}) × (Y \ {y_1, ..., y_n}), {x_1, ..., x_n} × (Y \ {y_1, ..., y_n}), or {x_1, ..., x_n} × {y_1, ..., y_n}, respectively. Let (x_1^*, ..., x_{d_n}^*) be the unique values of (x_1, ..., x_n), where d_n is the number of unique values, and let (y_{i,1}^*, ..., y_{i,d_{n_i}}^*) be the unique values of (y_{i,1}, ..., y_{i,n_i}), where d_{n_i} is the number of unique values in this set. Succinctly, the clustering structure is described as follows:

new-new: (X_{n+1}, Y_{n+1}) ∼ P_0, w.p. α(X) / (α(X) + n);
old-new: X_{n+1} = x_i^*, i = 1, ..., d_n, with Y_{n+1} ∼ P_{0Y|X}(·|x_i^*), w.p. [n_i / (α(X) + n)] [µ(Y, x_i^*) / (µ(Y, x_i^*) + n_i)];
old-old: (X_{n+1}, Y_{n+1}) = (x_i^*, y_{i,j}^*), i = 1, ..., d_n, j = 1, ..., d_{n_i}, w.p. [n_i / (α(X) + n)] [n_{i,j} / (µ(Y, x_i^*) + n_i)].

This gives a "two-level" clustering, which reduces to the global clustering of the DP if µ(Y, x) = 0 for all x ∈ X.
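The three cases above are easy to compute from the cluster counts. The following minimal Python sketch does so, with a = α(X), m[i] = µ(Y, x_i^*), n[i] the number of observations at x_i^*, and nij[i][j] the number of observations at (x_i^*, y_{i,j}^*); the numbers are illustrative.

def cluster_probs(a, m, n, nij):
    total = a + sum(n)
    p_new_new = a / total
    p_old_new = [n[i] / total * m[i] / (m[i] + n[i]) for i in range(len(n))]
    p_old_old = [[n[i] / total * c / (m[i] + n[i]) for c in nij[i]]
                 for i in range(len(n))]
    return p_new_new, p_old_new, p_old_old   # the three cases sum to one

# Two x-clusters of sizes 3 and 2; the first splits into y-clusters of sizes 2 and 1.
print(cluster_probs(a=1.0, m=[2.0, 2.0], n=[3, 2], nij=[[2, 1], [2]]))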
4.6 Comparison with different approaches
In recent literature, there have been many proposals of generalizations of the Dirichlet process, particularly dependent Dirichlet Processes. Such an approach exploits marginal conditional independence. One considers a collection of random variables {Y_j, j ∈ J} and assumes that they are conditionally independent, that is, for any j_1, ..., j_m ∈ J, Y_{j_1}, ..., Y_{j_m} | F_{j_1}, ..., F_{j_m} ∼ Π_{i=1}^m F_{j_i}(·). Then, a prior is given on the family of random distributions {F_j, j ∈ J}, such that the F_j's are dependent. A first proposal along these lines was given by Cifarelli and Regazzini (1978), who assumed that F_j | λ are i.i.d. DP(αF_0(·; λ)), with λ ∼ H(λ). A very interesting development is the Hierarchical Dirichlet Process proposed by Teh et al. (2006), who model
the random base measure F_0 nonparametrically, assuming F_0 ∼ DP(γH). A further development is the Nested Dirichlet Process (Rodriguez et al. (2006)), where the model is given as F_j | G i.i.d. ∼ G and G ∼ DP(α DP(γH)). The general and clever scheme given by the Dependent Dirichlet Processes (DDP; MacEachern (1999)) induces dependence across the F_j's by exploiting the stick-breaking representation of the Dirichlet process and by assuming dependent weights and atoms along j.

Dependent DPs define the law of a collection of distribution functions {F_j, j ∈ J} indexed by a non-random covariate. If we simply replace J with X, this does not necessarily define the law of the conditionals. In particular, since the covariate is non-random, no σ-algebra on X is considered, and thus, measurability with respect to B_X a.s. is not required. If measurability with respect to B_X a.s. is satisfied, this is a model on the random conditionals and does not induce a prior on the random joint distribution of (X, Y). Instead, our approach gives a prior on the marginal-conditional pair and induces a prior on the joint.

For a Dirichlet Process with non-atomic base measure, the random conditionals are independent and degenerate a.s. We are extending this by allowing for non-degenerate conditionals, but we will assume independence. A further extension would allow for dependence among the random conditionals through a dependent Dirichlet Process, if measurability with respect to B_X a.s. is satisfied. However, some properties would be lost. For example, for a DDP, we would lose conjugacy and the model would become much more complex, and using the Hierarchical DP or the Nested DP would remove dependence on x in the base measures for the conditionals.

Notice that the distribution of the conditional, also as a random function of X, is P_{Y|X}(·|X) ∼ Σ_{i=1}^∞ π_i^X δ_{P_{Y|X}(·|X_i^*)}. This resembles the prior for the Nested Dirichlet Process, but is not directly comparable, since P_{Y|X}(·|X) is a different object than {F_j, j ∈ J}.
5 Example
We provide an illustration of the properties of the EDP prior in an application to mixture models. The problem we consider is comparing different schools based on national test scores. The dataset we analyze contains two different test scores for students in 65 inner-London schools. The first score is based on the London Reading Test (LRT), taken at age 11, and the second is a score derived from the Graduate Certificate of Secondary Education (GCSE) exams in a number of different subjects, taken at age 16. Taking into account earlier LRT scores can give a sense of the "value added" for each school. To answer the question of which schools are most effective, we consider modeling the relationship between LRT and GCSE for all schools. The data are available at http://www.stata-press.com/data/mlmus.html. School number 48 is dropped from the dataset since only 2 students were observed.

Rabe-Hesketh and Skrondal (2005) (Chapter 4) study the following multilevel parametric model, where Y_{ij} and X_{ij} represent, respectively, the GCSE and LRT score for
student i in school j:

Y_{ij} | β_{0j}, β_{1j}, x_{ij} ∼ N(β_{0j} + β_{1j} x_{ij}, σ²) independently,
(β_{0j}, β_{1j})' ∼ N_2((β_0, β_1)', Σ_β) i.i.d.,   (6)

where β_{0j} and β_{1j} are independent of X_{ij}. The interest is in estimating the school-specific coefficients β_j = (β_{0j}, β_{1j}). The intercept is interpreted as the school mean of GCSE scores for the students with the average LRT score of 0. The competitiveness of the school is captured by the school-specific slope. Schools with greater slopes are competitive; more "value" is added for students with higher LRT scores. Schools with a slope of 0 are non-competitive; the performance of students is homogeneous regardless of how the students scored on the LRT. If parents are to choose the best school for their children, both average "value added" and competitiveness are important.

Maximum likelihood estimates of the parameters of the mixing distribution (Rabe-Hesketh and Skrondal (2005)) give β̂_0 = −.115, with standard error SE(β̂_0) = .0199, and β̂_1 = .55, with SE(β̂_1) = .3978, and estimated covariance matrix:

Σ̂_β = ( 9.04   .18
         .18    .0145 ).

Empirical Bayes predictions of the school-specific intercept and slope were then obtained; Figures 1a and 1b show the plots of estimated regression lines for each school and the ranking of schools based on the intercept.
Figure 1: Results of the linear mixed effects model. (a) Empirical Bayes predictions of the school-specific regression lines (GCSE against LRT); (b) ranking of schools based on the intercept.

By visual inspection of the histograms of the empirical Bayes estimates in Figures 2a and 2b, for the intercept and especially the slope, a normal distribution does not fit well. This may be due to the fact that there are only 65 schools, that the normality assumption does not hold, or a combination of the two. To enlarge the class of models, we can consider modelling the mixing distribution of the intercept and slope
nonparametrically.

Figure 2: Assessing the model. (a) Histogram of the predicted random intercepts; (b) histogram of the predicted random slopes.

A pitfall of model (6) is that it assumes the same variability for all schools. In fact, the wide range of the naive OLS estimates of within-school variance (not shown) supports a model which allows for school-specific variance. Bayesian nonparametric extensions of this model would assign a DP prior on the mixing distribution of the (β_{0j}, β_{1j})'s (a DP location mixture), assuming the same variance σ² for each school, or model school-specific variances σ_j², with a DP prior for the latent distribution of (β_{0j}, β_{1j}, σ_j²) (a DP scale-location mixture). The EDP is an intermediate choice. It may model clusters of schools that share the same variance, with different β's inside each cluster. We assume that
Y_{ij} | x_{ij}, β_{0j}, β_{1j}, σ_j² ∼ N(β_{0j} + β_{1j} x_{ij}, σ_j²) independently,
(β_j, σ_j²) | P_{β,σ²} = P_{β,σ²} i.i.d. ∼ P_{β,σ²},

where β_j = (β_{0j}, β_{1j}) and P_{β,σ²} ∼ EDP(α, µ), with α = α_{σ²} P_{0,σ²} and, for all σ² ∈ R^+, µ(·, σ²) = µ_β(σ²) P_{0,β|σ²}(·|σ²).

In the analysis reported below, we fixed the baseline measure P_{0,σ²} as an Inverse-Gamma, with rate and shape parameters, respectively, 8 and 385, and P_{0,β|σ²}(·|σ²) as a bivariate Normal, N_2(µ_0, k_0 σ² Σ_0), with µ_0 = [0, .5]', k_0 = 1/20, and

Σ_0 = ( 9      3/16
        3/16   1/64 ).

Notice that if the precision parameter α_{σ²} ≈ 0, we get back to a DP location mixture, and if the precision parameters µ_β(σ²) ≈ 0 for all σ² ∈ R^+, we get a DP scale-location mixture. Thus, with an EDP prior we can express uncertainty between homoskedasticity and heteroskedasticity. We model uncertainty about α_{σ²} and µ_β(σ²) through Gamma hyperpriors: α_{σ²} ∼ Ga(u_α, v_α), where we choose u_α = 2 and v_α = 1, and, for all σ² ∈ R^+, µ_β(σ²) i.i.d. ∼ Ga(u_{µ_β}, v_{µ_β}), with u_{µ_β} = 2 and v_{µ_β} = 1.

The MCMC scheme to compute posterior distributions is based on algorithm 6 described in Neal (2000), which is a Metropolis-Hastings algorithm with candidates drawn from the prior. Resampling the precision parameters is done by introducing a latent beta-distributed variable, as described in Escobar and West (1995). The number of iterations is set to 20,000, with 10% of burn-in. Looking at the trace and autocorrelation plots, convergence appears to have been reached for the β's in all schools and for the σ²'s in most schools. The results are summarized in Figures 3a and 3b, which display the estimated regression line for each school and the ranking of schools based on average "value added" with empirical quantiles.
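To illustrate the prior just described, the following minimal Python sketch simulates school-level parameters (σ_j², β_j) from the EDP via the enriched Pólya sequence, holding the precisions fixed at α_{σ²} = 2 and µ_β = 2 for simplicity (the analysis above instead places Gamma hyperpriors on them); the inverse-gamma parameters are read here as shape 8 and scale 385, which is one reading of the parameter naming in the text.

import numpy as np

def sample_schools(J, a_sigma, m_beta, rng):
    k0, mu0 = 1.0 / 20.0, np.array([0.0, 0.5])
    Sigma0 = np.array([[9.0, 3.0 / 16.0], [3.0 / 16.0, 1.0 / 64.0]])
    sig2s, betas, beta_by_sig2 = [], [], {}
    for j in range(J):
        # variance cluster: new draw from the inverse-gamma base w.p. a/(a+j)
        if rng.random() < a_sigma / (a_sigma + j):
            s2 = 1.0 / rng.gamma(8.0, 1.0 / 385.0)
        else:
            s2 = sig2s[rng.integers(j)]
        bs = beta_by_sig2.setdefault(s2, [])
        # beta cluster within the variance cluster, base N2(mu0, k0 * s2 * Sigma0)
        if rng.random() < m_beta / (m_beta + len(bs)):
            b = rng.multivariate_normal(mu0, k0 * s2 * Sigma0)
        else:
            b = bs[rng.integers(len(bs))]
        sig2s.append(s2)
        bs.append(b)
        betas.append(b)
    return np.array(sig2s), np.array(betas)

rng = np.random.default_rng(5)
sig2s, betas = sample_schools(J=65, a_sigma=2.0, m_beta=2.0, rng=rng)
print(len(np.unique(sig2s)))   # number of shared-variance clusters among 65 schools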
Figure 3: Results of the EDP model. (a) Estimated regression line for each school (GCSE against LRT); (b) ranking of schools based on average value added, with empirical quantiles.
The MCMC posterior expectation of α_{σ²} is 2.5, and Figure 4 depicts the estimated posterior values of µ_β(σ²) for different values of σ².

Figure 4: Estimated posterior values of µ_β(σ²) for different values of σ².

Neither α_{σ²} ≈ 0 nor µ_β(σ²) ≈ 0 for all σ², and interestingly, the estimated values of µ_β(σ²) are high for values of σ² which are more likely a posteriori, and close to zero
for unlikely values of σ². Thus, the results favor a model which allows for homoskedasticity among some schools, with a more likely value of σ², and some outlying schools with abnormally large or small variances.
6 Final remarks
We have proposed an enrichment of the DP starting from the idea of enriched conjugate priors. The advantages of this process are that it allows for more flexible specification of prior information, includes the DP as a special case, and retains some desirable properties, including conjugacy and the fact that it can be constructed from an enriched urn scheme. The disadvantages include the difficulty in obtaining a closed form for the distribution of the joint probability over a given set and for the distribution of the marginal probability over a measurable subset of Y.

Using an EDP as the prior for the distribution of a random vector, Z, implies one has to determine a partition of Z into two groups and an ordering defining which group comes first. The "two-level" clustering resulting from the EDP introduces a clear asymmetry based on the partition and ordering chosen, and how to choose them depends on the application. There may be a natural ordering or partition and/or computational reasons, including decomposition of the base measure, for choosing the partition and ordering. In our example, we partitioned the random vector (β_0, β_1, σ²) into the two groups, (σ²) and (β_0, β_1), with σ² chosen first due to uncertainty in homoskedasticity and decomposition of the conjugate normal-inverse gamma base measure. One may also examine all plausible and interesting partitions and orderings. We have focused on the partition of the random vector into two groups, but most results could be extended to any finite partition of the random vector, although this would of course imply a further nested structure.

Other future work includes examining the implied clustering structure in regression settings when the joint model is an EDP mixture, and exploring whether other conjugate nonparametric priors whose finite dimensional distributions are standard conjugate priors can be generalized starting from enriched conjugate priors, such as an extension of the enriched distribution mentioned in Remark 2 to an enriched bivariate Neutral to the Right Process. We hope that having explored these features can shed light on potentialities and limitations and encourage further developments in constructing more flexible priors for a random probability measure on R^k.
Appendix

Proof of Theorem 1. From the predictive distribution, it follows that the joint distribution can be expressed as:

Pr(X_1 = i_1, Y_1 = j_1, ..., X_n = i_n, Y_n = j_n)
= Π_{l=1}^n [(α(i_l) + Σ_{h=1}^{l−1} δ_{i_h}(i_l)) / (α(X) + l − 1)] · [(µ(j_l, i_l) + Σ_{h=1}^{l−1} δ_{(j_h,i_h)}(j_l, i_l)) / (µ(Y, i_l) + Σ_{h=1}^{l−1} δ_{i_h}(i_l))],
which can be equivalently expressed as:

[Γ(α(X)) / Π_{i=1}^k Γ(α(i))] [Π_{i=1}^k Γ(α(i) + n_{i+}) / Γ(α(X) + n)] Π_{i=1}^k [Γ(µ(Y, i)) / Π_{j=1}^m Γ(µ(j, i))] [Π_{j=1}^m Γ(µ(j, i) + n_{ij}) / Γ(µ(Y, i) + n_{i+})].   (7)

The joint distribution only depends on the number of unique pairs seen, not on the order in which they are observed. Thus, the pairs {X_n, Y_n}_{n∈N} form an exchangeable sequence. By de Finetti's Representation Theorem, there exists a probability measure Q̃ on the simplex S_{k,m} = {p_{1,1}, ..., p_{k,m} : p_{i,j} ≥ 0 and Σ_{i=1}^k Σ_{j=1}^m p_{i,j} = 1} such that:

Pr(X_1 = i_1, Y_1 = j_1, ..., X_n = i_n, Y_n = j_n) = ∫_{[0,1]^{km}} Π_{i=1}^k Π_{j=1}^m p_{i,j}^{n_{i,j}} Q̃(dp_{1,1}, ..., dp_{k,m}).

Define the simplexes S_k = {p_{1+}, ..., p_{k+} : p_{i+} ≥ 0 and Σ_{i=1}^k p_{i+} = 1} and S_m^{(i)} = {p_{1|i}, ..., p_{m|i} : p_{j|i} ≥ 0 and Σ_{j=1}^m p_{j|i} = 1} for i = 1, ..., k. Let Q be the probability measure on the product of the simplexes S_k × Π_{i=1}^k S_m^{(i)} obtained from Q̃ via a reparametrization in terms of (p_{1+}, ..., p_{k+}, p_{1|1}, ..., p_{m|k}). Then,

Pr(X_1 = i_1, Y_1 = j_1, ..., X_n = i_n, Y_n = j_n) = ∫_{[0,1]^k × [0,1]^{km}} Π_{i=1}^k p_{i+}^{n_{i+}} Π_{j=1}^m p_{j|i}^{n_{ij}} Q(dp_{1+}, ..., dp_{m|k}).   (8)

Since the Dirichlet distribution is determined by its moments, combining equations (7) and (8) implies that

p_{1+}, ..., p_{k+} ∼ Dir(α(1), ..., α(k)),
p_{1|i}, ..., p_{m|i} ∼ Dir(µ(1, i), ..., µ(m, i)),   i = 1, ..., k,

where (p_{1+}, ..., p_{k+}), (p_{1|1}, ..., p_{m|1}), ..., and (p_{1|k}, ..., p_{m|k}) are independent. The second part of the theorem follows from de Finetti's results on the asymptotic behavior of the predictive distributions for exchangeable sequences; see Cifarelli and Regazzini (1996).

Proof of Theorem 4. We start by noting that the sequence {X_n}_{n∈N} is a Pólya sequence with parameter α. Recall that the predictive distribution of a Pólya sequence converges to a discrete random probability measure with positive mass at the countable number of unique values of the sequence, almost surely with respect to the exchangeable law. Therefore, given X_1 = x_1, ..., X_n = x_n and letting U(x_1, ..., x_n) denote the set of the unique values of {x_1, ..., x_n}, we have that for x^* ∈ U(x_1, ..., x_n), n_{x^*} = Σ_{i=1}^n δ_{x^*}(x_i) → ∞ as n → ∞, almost surely with respect to the exchangeable law. This implies that given {X_n = x_n}_{n∈N}, for any x^* ∈ U({x_n}_{n∈N}), the set of random variables {Y_{x^*,j}} = {Y_i : X_i = x^*, i ∈ N | {X_n = x_n}_{n∈N}} is a countable sequence. Furthermore, by assumption, for x_1^* ≠ x_2^* ∈ U({x_n}_{n∈N}), the sequences {Y_{x_1^*,j}}_{j∈N} and {Y_{x_2^*,j}}_{j∈N} are independent Pólya sequences with parameters µ(·, x_1^*) and µ(·, x_2^*) respectively. These observations imply exchangeability of the sequence {X_n, Y_n}_{n∈N}, as
shown in the following argument.

Pr(X_1 ∈ A_1, Y_1 ∈ B_1, ..., X_n ∈ A_n, Y_n ∈ B_n)
= ∫_{A_1×···×A_n} Pr(Y_1 ∈ B_1, ..., Y_n ∈ B_n | x_1, ..., x_n) dPr(x_1, ..., x_n).   (9)

By independence of {Y_{x_1^*,j}}_{j=1}^{n_{x_1^*}} and {Y_{x_2^*,j}}_{j=1}^{n_{x_2^*}} for x_1^* ≠ x_2^* ∈ U(x_1, ..., x_n), we have that (9) is equal to:

∫_{A_1×···×A_n} Π_{x^*∈U(x_1,...,x_n)} Pr(Y_{x^*,1} ∈ B_{x^*,1}, ..., Y_{x^*,n_{x^*}} ∈ B_{x^*,n_{x^*}}) dPr(x_1, ..., x_n).   (10)

A permutation, π, of the sets (x_1 × B_1), ..., (x_n × B_n) is equivalent to the same permutation, π, of (x_1, ..., x_n) and, for x^* ∈ U(x_{π(1)}, ..., x_{π(n)}), a permutation, γ_{x^*}, of (B_{x^*,1}, ..., B_{x^*,n_{x^*}}). To keep notation concise, we let U_{π,n} represent U(x_{π(1)}, ..., x_{π(n)}) (and, similarly, U_n represent U(x_1, ..., x_n)). The term inside the integral is invariant to the permutation, π, of (x_1, ..., x_n), and, due to exchangeability of Pólya sequences, the laws of the random vectors {X_i}_{i=1}^n and {Y_{x^*,j}}_{j=1}^{n_{x^*}} are invariant to the permutations π and γ_{x^*}, respectively. Thus, (10) is equal to:

∫_{A_{π(1)}×···×A_{π(n)}} Π_{x^*∈U_{π,n}} Pr(Y_{x^*,1} ∈ B_{γ_{x^*}(1)}, ..., Y_{x^*,n_{x^*}} ∈ B_{γ_{x^*}(n_{x^*})}) dPr(x_{π(1)}, ..., x_{π(n)})
= ∫_{A_{π(1)}×···×A_{π(n)}} Pr(Y_1 ∈ B_{π(1)}, ..., Y_n ∈ B_{π(n)} | x_{π(1)}, ..., x_{π(n)}) dPr(x_{π(1)}, ..., x_{π(n)})
= Pr(X_1 ∈ A_{π(1)}, Y_1 ∈ B_{π(1)}, ..., X_n ∈ A_{π(n)}, Y_n ∈ B_{π(n)}).
De Finetti’s Representation Theorem states that there exists a random probability ˜ on P(B) such that: measure, P, with distribution Q Z P r(X1 ∈ A1 , Y1 ∈ B1 , ., Xn ∈ An , Yn ∈ Bn ) =
n Y
˜ ), P (Ah × Bh )dQ(P
(11)
P(B) h=1
Pn d and n1 h=1 δA×B (Xh , Yh ) → P(A × B) a.s. with respect to the exchangeable law ˜ The distribution Q ˜ determines the joint distribution, Q, of as n → ∞ where P ∼ Q. the marginal and a fixed version of the conditionals. Reparametrizing in terms of the marginal and conditionals implies: P r(X1 ∈ A1 , Y1 ∈ B1 , ., Xn ∈ An , Yn ∈ Bn ) Z n Z Y Y PY |X (Bh |x)dPX (x)dQ(PX , PY |X (·|x)). = P(BX )×P(BY )X h=1
Ah
(12)
x∈X
A simple application of the results of Blackwell and MacQueen (1973) for P´olya urn sequences, verifies that the first two conditions in the definition of the EDP hold. In particular, for any finite partition A1 , ..., Ak ⊆ BX , define the simple measurable function, φ(x) = i if x ∈ Ai for i = 1, ..., k. Noting that {φ(Xn )}n∈N , is a P´olya sequence with parameter α ◦ (φ)−1 taking values in the finite space {1, ..., k}, implies: PX (φ−1 (1), ..., PX (φ−1 (k))) ∼ Dir(α(φ−1 (1)), ..., α(φ−1 (k))) ⇔ PX (A1 ), .., PX (Ak ) ∼ Dir(α(A1 ), .., α(Ak )).
Similarly, for any finite partition B_1, ..., B_m ⊆ B_Y, define the simple measurable function ϕ(y) = j if y ∈ B_j. For any x^* ∈ U({x_n}_{n∈N}), the sequence {ϕ(Y_{x^*,j})}_{j∈N} is a Pólya sequence taking values in the finite space {1, ..., m} with parameter µ(ϕ^{−1}(·), x^*). Again, it follows that:

(P_{Y|X}(ϕ^{−1}(1)|x^*), ..., P_{Y|X}(ϕ^{−1}(m)|x^*)) ∼ Dir(µ(ϕ^{−1}(1), x^*), ..., µ(ϕ^{−1}(m), x^*))
⇔ (P_{Y|X}(B_1|x^*), ..., P_{Y|X}(B_m|x^*)) ∼ Dir(µ(B_1, x^*), ..., µ(B_m, x^*)).   (13)

The unique values of the Pólya sequence are actually draws from P_{0X}(·) = α(·)/α(X) and can therefore take any value in X. Thus, (13) holds for any x ∈ X. Finally, we need to show the last two conditions in the definition of the EDP hold. Exchangeability of the pairs implies exchangeability of the sequence {Y_i | X_i = x_i}_{i∈N}. Therefore, by de Finetti's theorem:

Pr(Y_1 ∈ B_1, ..., Y_n ∈ B_n | x_1, ..., x_n)   (14)
= ∫_{P(B_Y)^{U_n}} Π_{x^*∈U_n} Π_{j=1}^{n_{x^*}} P_{Y|X}(B_{x^*,j}|x^*) dQ_{U_n}^{Y|X}(Π_{x^*∈U_n} P_{Y|X}(·|x^*)).   (15)

Independence of the exchangeable sequences {Y_{x_1^*,j}}_{j∈N} and {Y_{x_2^*,j}}_{j∈N} for x_1^* ≠ x_2^* implies:

Pr(Y_1 ∈ B_1, ..., Y_n ∈ B_n | x_1, ..., x_n) = Π_{x^*∈U_n} Pr(Y_{x^*,1} ∈ B_{x^*,1}, ..., Y_{x^*,n_{x^*}} ∈ B_{x^*,n_{x^*}})
= Π_{x^*∈U_n} ∫_{P(B_Y)} Π_{j=1}^{n_{x^*}} P_{Y|X}(B_{x^*,j}|x^*) dQ_{x^*}^{Y|X}(P_{Y|X}(·|x^*)).   (16)

Comparing (15) and (16) shows that Q_{U_n}^{Y|X} = Π_{x^*∈U_n} Q_{x^*}^{Y|X}. Since the unique values of {x_1, ..., x_n} are realizations of P_{0X} and can take any value in X, independence of {P_{Y|X}(·|x)}_{x∈X} among x ∈ X follows. Therefore, (14) can be equivalently written as:

Pr(Y_1 ∈ B_1, ..., Y_n ∈ B_n | x_1, ..., x_n) = ∫_{P(B_Y)^X} Π_{h=1}^n P_{Y|X}(B_h|x_h) d(Π_{x∈X} Q_x^{Y|X}(P_{Y|X}(·|x))).

Now combining this result with the fact that {X_n}_{n∈N} is an exchangeable sequence implies:

Pr(X_1 ∈ A_1, Y_1 ∈ B_1, ..., X_n ∈ A_n, Y_n ∈ B_n)
= ∫_{A_1×···×A_n} Pr(Y_1 ∈ B_1, ..., Y_n ∈ B_n | x_1, ..., x_n) dPr(x_1, ..., x_n)
= ∫_{P(B_X)} ∫_{A_1×···×A_n} ∫_{P(B_Y)^X} Π_{h=1}^n P_{Y|X}(B_h|x_h) d(Π_{x∈X} Q_x^{Y|X}(P_{Y|X}(·|x))) d(Π_{h=1}^n P_X(x_h)) dQ^X(P_X)
= ∫_{P(B_X)} ∫_{P(B_Y)^X} Π_{h=1}^n ∫_{A_h} P_{Y|X}(B_h | x_h) dP_X(x_h) d(Π_{x∈X} Q_x^{Y|X}(P_{Y|X}(· | x))) dQ^X(P_X).   (17)
Comparing (12) with (17) implies that $Q = Q_X \times \prod_{x \in \mathcal{X}} Q_x^{Y|X}$, i.e., independence of $P_X$ and $(P_{Y|X}(\cdot|x))_{x \in \mathcal{X}}$.
Proof of Proposition 2 We show that an Enriched Pólya sequence is equivalent to a Pólya sequence with parameter $\alpha(\mathcal{X})P_0(\cdot)$ if $\mu(\mathcal{Y}, x) = \alpha(x)$ for all $x \in \mathcal{X}$. For an Enriched Pólya sequence with parameters $\alpha, \mu$ and for $A \in \mathcal{B}_X$, $B \in \mathcal{B}_Y$, since $\lim_{\mu(\mathcal{Y},x) \to \alpha(x)} \Pr(Y_1 \in B \mid X_1 = x) = P_{0Y|X}(B|x)$, then if $\mu(\mathcal{Y}, x) = \alpha(x)$ for all $x \in \mathcal{X}$, $\Pr(X_1 \in A, Y_1 \in B) = P_0(A \times B)$. The joint predictive distribution is given by:
$$\Pr(X_{n+1} \in A, Y_{n+1} \in B \mid X_1 = x_1, Y_1 = y_1, \ldots, X_n = x_n, Y_n = y_n) = \int_A \frac{\mu(B, x) + \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)}{\mu(\mathcal{Y}, x) + n_x}\, d\bigg(\frac{\alpha(x) + \sum_{i=1}^{n} \delta_{x_i}(x)}{\alpha(\mathcal{X}) + n}\bigg). \tag{18}$$
Rewriting this as the sum of the integrals over the sets $A \setminus \{x_1, \ldots, x_n\}$ and $A \cap \{x_1, \ldots, x_n\}$ and replacing $\mu(\mathcal{Y}, x)$ with $\alpha(x)$, we get that (18) is equal to:
$$\frac{\alpha(\mathcal{X})}{\alpha(\mathcal{X}) + n} P_0\big((A \setminus \{x_1, \ldots, x_n\}) \times B\big) + \sum_{x \in A \cap \{x_1, \ldots, x_n\}} \frac{\alpha(x) P_{0Y|X}(B|x) + \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)}{\alpha(x) + n_x} \cdot \frac{\alpha(x) + n_x}{\alpha(\mathcal{X}) + n}$$
$$= \frac{\alpha(\mathcal{X})}{\alpha(\mathcal{X}) + n} P_0(A \times B) + \frac{n}{\alpha(\mathcal{X}) + n} \sum_{i=1}^{n} \frac{\delta_{(x_i, y_i)}(A \times B)}{n},$$
which is the joint predictive distribution of a Pólya sequence with parameter $\alpha(\mathcal{X})P_0$.
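To make the urn scheme concrete, here is a hedged Python sketch (our own illustration; the base measures and the total masses $\alpha(\mathcal{X}) = 10$ and $\mu(\mathcal{Y}, x) = 0.5$ are invented for the example) that samples an Enriched Pólya sequence by composing the marginal urn for $X$ with a separate urn for $Y$ within each observed $X$-value, following the predictive rule (18).

```python
# Hedged sketch (ours): sampling an Enriched Polya sequence. X_{n+1} follows a
# Blackwell-MacQueen urn with total mass alpha(X); given X_{n+1} = x, Y_{n+1}
# follows its own urn with total mass mu(Y, x), driven only by the past y's
# observed together with that same x.
import numpy as np

rng = np.random.default_rng(1)

def enriched_polya(n, alpha_total, mu_total, base_x, base_y_given_x):
    xs, ys = [], []
    for i in range(n):
        # marginal urn for X
        if rng.random() < alpha_total / (alpha_total + i):
            x = base_x()
        else:
            x = xs[rng.integers(i)]
        # conditional urn for Y; exact float equality is fine because repeated
        # x-values are literal copies produced by the urn itself
        y_past = [ys[h] for h in range(i) if xs[h] == x]
        n_x = len(y_past)
        if rng.random() < mu_total(x) / (mu_total(x) + n_x):
            y = base_y_given_x(x)
        else:
            y = y_past[rng.integers(n_x)]
        xs.append(x)
        ys.append(y)
    return xs, ys

# Invented parameters: diffuse marginal urn (alpha(X) = 10), sticky conditional
# urns (mu(Y, x) = 0.5), normal base measures.
xs, ys = enriched_polya(
    n=50, alpha_total=10.0, mu_total=lambda x: 0.5,
    base_x=lambda: rng.standard_normal(),
    base_y_given_x=lambda x: x + 0.1 * rng.standard_normal())
print("distinct x:", len(set(xs)), "| distinct (x, y) pairs:", len(set(zip(xs, ys))))
```

A large $\alpha(\mathcal{X})$ with a small $\mu(\mathcal{Y}, x)$ yields many distinct $x$-values, each carrying heavily tied $y$-values; this is exactly the ties configuration that a single precision parameter cannot encode.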
Proof of Theorem 5 This proof is based on the proof of Theorem 3.2.4 in Ghosh and Ramamoorthi (2003). To show $M_0$ is the topological support (the smallest closed set of measure one), it is enough to show that $M_0$ is a closed set of measure one such that for every $\Pi \in M_0$, $Q(U) > 0$ for any neighborhood $U$ of $\Pi$.

First, we show $M_0$ is closed. If $P_n \in M_0$, then $P_n(S_0) = 1$ for all $n$, and if $P_n \to P$ weakly, then for any closed set $C \in \mathcal{B}$, $\limsup_n P_n(C) \leq P(C)$. Together these imply $P(S_0) = 1$, or equivalently, $P \in M_0$.

Secondly, the set $M_0$ has measure one. This follows from the square-breaking construction of $P$. Since $(X_i^*, Y_{j|i}^*) \sim P_0$ implies $\delta_{(X_i^*, Y_{j|i}^*)}(S_0) = 1$ a.s., $\sum_{i=1}^{\infty} \pi_i = 1$ a.s., and, for all $i$, $\sum_{j=1}^{\infty} \pi_{j|i}^{Y} = 1$ a.s., then $P(S_0) = 1$ a.s. ($\Leftrightarrow Q(M_0) = 1$).

Lastly, our theorem will be proved if we show that for any $\Pi \in M_0$ and any neighborhood $U$ of $\Pi$, $Q(U) > 0$. By extension of Proposition 2.5.2 in Ghosh and Ramamoorthi (2003), there exist points $q_{1,j} < \cdots < q_{n_j,j}$ in $\mathbb{R}$ for $j = 1, \ldots, k$, and $\delta > 0$, such that
$$U^* = \bigg\{ P \in \mathcal{P}(\mathcal{B}) : \Big|P\Big(\prod_{j=1}^{k} [q_{i_j,j}, q_{i_j+1,j})\Big) - \Pi\Big(\prod_{j=1}^{k} [q_{i_j,j}, q_{i_j+1,j})\Big)\Big| < \delta \text{ and } \Pi\Big(\partial \prod_{j=1}^{k} [q_{i_j,j}, q_{i_j+1,j})\Big) = 0 \text{ for } i_j = 1, \ldots, n_j,\ j = 1, \ldots, k \bigg\} \subseteq U.$$
Define $A_{i_1,\ldots,i_{k_1}} = \prod_{j=1}^{k_1} [q_{i_j,j}, q_{i_j+1,j})$ and $B_{i_{k_1+1},\ldots,i_{k_2}} = \prod_{j=k_1+1}^{k_2} [q_{i_j,j}, q_{i_j+1,j})$, and, without loss of generality, denote these sets as $A_1, \ldots, A_N$ and $B_1, \ldots, B_M$. If $P_0(A_n \times B_m) = 0$, then $\delta_{(X_i^*, Y_{j|i}^*)}(A_n \times B_m) = 0$ a.s. and $P(A_n \times B_m)$ is degenerate at 0. In addition, $P_0(A_n \times B_m) = 0$ combined with the facts that $\Pi(\partial(A_n \times B_m)) = 0$ and $\Pi(S_0) = 1$ implies that $\Pi(A_n \times B_m) = 0$. Therefore, $|P(A_n \times B_m) - \Pi(A_n \times B_m)| = 0$ a.s. If $P_0(A_n \times B_m) > 0$, then $\delta_{(X_i^*, Y_{j|i}^*)}(A_n \times B_m) = 1$ with positive probability. Thus, the square-breaking construction implies that $Q(U^*) > 0$.
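For intuition on the role the square-breaking construction plays here, the following hedged Python sketch (ours; the truncation level and all parameters are arbitrary, $\mu(\mathcal{Y}, x)$ is taken constant in $x$, and both base measures are assumed normal) builds a truncated realization of $P$: stick-breaking weights $\pi_i$ attach to marginal atoms $X_i^* \sim P_{0X}$, and, independently for each $i$, weights $\pi_{j|i}^Y$ attach to conditional atoms $Y_{j|i}^*$.

```python
# Hedged sketch (ours): a truncated square-breaking realization of the EDP.
# All atoms (X_i*, Y*_{j|i}) come from the base measure, so any set S_0 with
# P_0(S_0) = 1 receives all of the mass: P(S_0) = 1 a.s., matching Q(M_0) = 1.
import numpy as np

rng = np.random.default_rng(3)

def stick_breaking(total_mass, trunc):
    v = rng.beta(1.0, total_mass, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w / w.sum()                     # fold the truncation remainder back in

def square_breaking(alpha_total, mu_total, trunc=100):
    pi = stick_breaking(alpha_total, trunc)            # marginal weights pi_i
    x_atoms = rng.standard_normal(trunc)               # X_i* ~ P_0X = N(0, 1), assumed
    pi_y = np.array([stick_breaking(mu_total, trunc)   # independent conditional sticks
                     for _ in range(trunc)])
    y_atoms = x_atoms[:, None] + rng.standard_normal((trunc, trunc))  # Y*_{j|i} ~ N(X_i*, 1)
    return pi, x_atoms, pi_y, y_atoms

pi, xa, pi_y, ya = square_breaking(alpha_total=2.0, mu_total=5.0)
# One realization of P(A x B) with A = [0, inf) and B = [0, inf):
mass = sum(pi[i] * pi_y[i, j]
           for i in range(len(pi)) if xa[i] >= 0
           for j in range(len(pi)) if ya[i, j] >= 0)
print("P([0, inf) x [0, inf)) for this draw:", round(mass, 3))
```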
Proof of Lemma 1
$$\begin{aligned} E[P(A \times B)^2] &= E\Big[\sum_{i=1}^{\infty} \pi_i^2\, P_{Y|X}(B|X_i^*)^2\, \delta_{X_i^*}(A)\Big] && (J_1)\\ &\quad + E\Big[\sum_{i=1}^{\infty} \sum_{j \neq i} \pi_i \pi_j\, P_{Y|X}(B|X_i^*)^2\, \delta_{X_i^*}(A)\, \delta_{X_j^*}(\{X_i^*\})\Big] && (J_2)\\ &\quad + E\Big[\sum_{i=1}^{\infty} \sum_{j \neq i} \pi_i \pi_j\, P_{Y|X}(B|X_i^*)\, P_{Y|X}(B|X_j^*)\, \delta_{X_i^*}(A)\, \delta_{X_j^*}(A \setminus \{X_i^*\})\Big]. && (J_3) \end{aligned}$$
Using the fact that $E_\pi\big[\sum_{i=1}^{\infty} \pi_i^2\big] = \frac{1}{\alpha(\mathcal{X}) + 1}$ and properties of the Dirichlet distribution,
$$(J_1) = E_\pi\Big[\sum_{i=1}^{\infty} \pi_i^2\, E_{X^*}\big[E_{Q^{Y|X}}[P_{Y|X}(B|X_i^*)^2 \mid X_i^*]\, \delta_{X_i^*}(A)\big]\Big] = \frac{1}{\alpha(\mathcal{X}) + 1} \int_A \frac{P_{0Y|X}(B|x)\big(1 + \mu(\mathcal{Y}, x) P_{0Y|X}(B|x)\big)}{\mu(\mathcal{Y}, x) + 1}\, dP_{0X}(x).$$
Now, using the fact that $E_\pi\big[\sum_{i=1}^{\infty} \sum_{j \neq i} \pi_i \pi_j\big] = \frac{\alpha(\mathcal{X})}{\alpha(\mathcal{X}) + 1}$ and, again, properties of the Dirichlet distribution,
$$(J_2) = E_\pi\Big[\sum_{i=1}^{\infty} \sum_{j \neq i} \pi_i \pi_j\, E_{X^*}\big[E_{Q^{Y|X}}[P_{Y|X}(B|X_i^*)^2 \mid X_i^*]\, \delta_{X_i^*}(A)\, \delta_{X_j^*}(\{X_i^*\})\big]\Big] = \frac{\alpha(\mathcal{X})}{\alpha(\mathcal{X}) + 1} \int_A \int_{\{x\}} \frac{P_{0Y|X}(B|x)\big(1 + \mu(\mathcal{Y}, x) P_{0Y|X}(B|x)\big)}{\mu(\mathcal{Y}, x) + 1}\, dP_{0X}(x')\, dP_{0X}(x),$$
$$(J_3) = E_\pi\Big[\sum_{i=1}^{\infty} \sum_{j \neq i} \pi_i \pi_j\, E_{X^*}\big[E_{Q^{Y|X}}[P_{Y|X}(B|X_i^*)\, P_{Y|X}(B|X_j^*) \mid X_i^*, X_j^*]\, \delta_{X_i^*}(A)\, \delta_{X_j^*}(A \setminus \{X_i^*\})\big]\Big] = \frac{\alpha(\mathcal{X})}{\alpha(\mathcal{X}) + 1} \int_A \int_{A \setminus \{x\}} P_{0Y|X}(B|x')\, P_{0Y|X}(B|x)\, dP_{0X}(x')\, dP_{0X}(x).$$
The result is obtained following some algebra.
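The two stick-breaking weight identities invoked above can be checked by simulation; this hedged sketch (ours; $\alpha(\mathcal{X}) = 2$ and the truncation level are arbitrary choices) estimates $E_\pi[\sum_i \pi_i^2]$ and $E_\pi[\sum_{i \neq j} \pi_i \pi_j]$ by Monte Carlo.

```python
# Hedged Monte Carlo check (ours) of the weight identities used above:
# E[sum_i pi_i^2] = 1/(alpha(X)+1) and E[sum_{i!=j} pi_i pi_j] = alpha(X)/(alpha(X)+1).
import numpy as np

rng = np.random.default_rng(4)
alpha, trunc, reps = 2.0, 500, 4000
acc = np.zeros(2)
for _ in range(reps):
    v = rng.beta(1.0, alpha, size=trunc)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))  # stick-breaking weights
    s2 = np.sum(pi ** 2)
    acc += [s2, np.sum(pi) ** 2 - s2]      # second entry is the sum over i != j
acc /= reps
print("E[sum pi_i^2]       ~", acc[0].round(4), "theory:", round(1 / (alpha + 1), 4))
print("E[sum_{i!=j} pi pi] ~", acc[1].round(4), "theory:", round(alpha / (alpha + 1), 4))
```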
Proof of Theorem 6 First, we show that $E[P(A \times B) \mid X_1 = x_1, Y_1 = y_1, \ldots, X_n = x_n, Y_n = y_n] \to \Pi(A \times B)$ a.s. $\Pi^\infty$:
$$\begin{aligned} &E[P(A \times B) \mid X_1 = x_1, Y_1 = y_1, \ldots, X_n = x_n, Y_n = y_n]\\ &= \frac{\alpha(\mathcal{X})}{\alpha(\mathcal{X}) + n} P_0\big((A \setminus \{x_1, \ldots, x_n\}) \times B\big) + \sum_{x \in A \cap \{x_1, \ldots, x_n\}} \frac{\alpha(x) + n_x}{\alpha(\mathcal{X}) + n} \cdot \frac{\mu(B, x) + \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)}{\mu(\mathcal{Y}, x) + n_x}\\ &\sim \frac{1}{n} \sum_{x \in A \cap \{x_1, \ldots, x_n\}} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B) = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}(A \times B) \to \Pi(A \times B) \text{ a.s. } \Pi^\infty. \end{aligned}$$
Using Lemma 1, we show that the posterior variance of $P(A \times B)$ goes to 0 by showing that each of the four terms in (1) goes to 0. Since $\frac{\alpha_n(A)}{\alpha_n(\mathcal{X})} \sim \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(A)$ and, for $x \in \{x_1, \ldots, x_n\}$, $\frac{\mu_n(B, x)}{\mu_n(\mathcal{Y}, x)} \sim \frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)$:
$$(I_1) \sim \frac{1}{n} \int_A \Big(\frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)\Big) \Big(\frac{1}{n_x} + \frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)\Big)\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\Big) \to 0,$$
$$(I_2) \sim \int_A \int_{\{x\}} \frac{1}{n_x} \Big(\frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)\Big) \Big(\frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B^c)\Big)\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x')\Big)\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\Big) \to 0,$$
$$(I_3) \sim -\frac{1}{n} \int_A \int_{\{x\}} \Big(\frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)\Big)^2\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x')\Big)\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\Big) \to 0,$$
$$(I_4) \sim -\frac{1}{n} \int_A \int_{A \setminus \{x\}} \Big(\frac{1}{n_x} \sum_{j=1}^{n_x} \delta_{y_{x,j}}(B)\Big) \Big(\frac{1}{n_{x'}} \sum_{j=1}^{n_{x'}} \delta_{y_{x',j}}(B)\Big)\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x')\Big)\, d\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\Big) \to 0.$$
This holds for any finite collection of sets. By a straightforward extension of Theorem 2.5.2 of Ghosh and Ramamoorthi (2003), this implies weak convergence of $Q_n$ to $\delta_\Pi$ a.s. $\Pi^\infty$.
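To visualize the first step of this proof, the following hedged sketch (ours; the truth $\Pi$, the sets $A$ and $B$, and all prior constants are invented, and we assume a non-atomic product base measure so that $\alpha(x) = 0$ at every observed $x$) evaluates the posterior mean of $P(A \times B)$ on growing samples and watches it approach $\Pi(A \times B)$.

```python
# Hedged numerical check (ours): the EDP posterior mean of P(A x B) from the
# first display above, evaluated on growing samples from a fixed truth Pi,
# approaches the empirical value of Pi(A x B).
import numpy as np

rng = np.random.default_rng(2)
alpha_total, mu_total = 2.0, 3.0     # alpha(X) and mu(Y, x), taken constant in x
P0X_A, P0Y_B = 2.0 / 3.0, 0.5        # assumed base-measure masses of A and B

def posterior_mean(xs, ys, A, B):
    """alpha(X)/(alpha(X)+n) * P0((A \\ {x_1..x_n}) x B)
    + sum over observed x in A of (alpha(x)+n_x)/(alpha(X)+n)
      * (mu(B,x) + #{y_{x,j} in B}) / (mu(Y,x) + n_x),
    with alpha(x) = 0 for a non-atomic alpha and P0 = P0X x P0Y assumed."""
    n = len(xs)
    total = alpha_total / (alpha_total + n) * P0X_A * P0Y_B
    for x in set(xs):
        if x in A:
            y_x = [y for xi, y in zip(xs, ys) if xi == x]
            n_x = len(y_x)
            total += (n_x / (alpha_total + n)
                      * (mu_total * P0Y_B + sum(y in B for y in y_x)) / (mu_total + n_x))
    return total

# Truth Pi: X uniform on {0,1,2}; Y = X with prob 0.8, otherwise uniform on {0,1,2}.
N = 5000
xs_all = rng.integers(0, 3, size=N)
ys_all = np.where(rng.random(N) < 0.8, xs_all, rng.integers(0, 3, size=N))
A, B = {0, 1}, {0}
target = np.mean([(x in A) and (y in B) for x, y in zip(xs_all, ys_all)])
for n in (50, 500, 5000):
    est = posterior_mean(xs_all[:n].tolist(), ys_all[:n].tolist(), A, B)
    print(f"n={n:5d}  posterior mean={est:.3f}  target Pi(AxB)~{target:.3f}")
```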
References

Blackwell, D. and MacQueen, J. B. (1973). "Ferguson Distributions Via Pólya Urn Schemes." Annals of Statistics, 1: 353–355.

Cifarelli, D. and Regazzini, E. (1978). "Problemi statistici nonparametrici in condizioni di scambiabilità parziale e impiego di medie associative." Quaderni Istituto di Matematica Finanziaria, Università di Torino, 12: 1–36. English translation available at www.unibocconi.it/wps/allegatiCTP/CR-Scamb-parz[1].20080528.135739.pdf.
— (1996). "De Finetti's contribution to probability and statistics." Statistical Science, 11: 253–282.

Connor, R. and Mosimann, J. E. (1969). "Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution." Journal of the American Statistical Association, 64: 194–206.

Consonni, G. and Veronese, P. (2001). "Conditionally reducible natural exponential families and enriched conjugate priors." Scandinavian Journal of Statistics, 28: 377–406.

Diaconis, P. and Ylvisaker, D. (1979). "Conjugate priors for exponential families." Annals of Statistics, 7: 269–281.

Doksum, K. A. (1974). "Tailfree and Neutral random probabilities and their posterior distributions." Annals of Probability, 2: 183–201.

Escobar, M. and West, M. (1995). "Bayesian density estimation and inference using mixtures." Journal of the American Statistical Association, 90: 577–588.

Ferguson, T. (1973). "A Bayesian analysis of some nonparametric problems." Annals of Statistics, 1: 209–230.

Geiger, D. and Heckerman, D. (1997). "A characterization of the Dirichlet distribution through global and local parameter independence." Annals of Statistics, 25: 1344–1369.

Ghosh, J. and Ramamoorthi, R. (2003). Bayesian nonparametrics. New York: Springer-Verlag, Springer Series in Statistics.

MacEachern, S. (1999). "Dependent nonparametric processes." In ASA Proceedings of the Section on Bayesian Statistical Science, 50–55. Alexandria, VA: American Statistical Association.

Neal, R. (2000). "Markov Chain Sampling Methods for Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 9: 249–265.

Rabe-Hesketh, S. and Skrondal, A. (2005). Multilevel and longitudinal modeling using stata. College Station, Texas: Stata Press.

Ramamoorthi, R. and Sangalli, L. (2006). "On a Characterization of Dirichlet Distribution." In Upadhyay, S., Singh, U., and Dey, D. (eds.), Proceedings of the International Conference on Bayesian Statistics and its Applications, Jan. 6–8, 2005, 385–397. Varanasi, India: Banaras Hindu University.

Rodriguez, A., Dunson, D., and Gelfand, A. (2006). "The nested Dirichlet Process." Journal of the American Statistical Association, 103: 1131–1154.

Sethuraman, J. (1994). "A Constructive Definition of Dirichlet Priors." Statistica Sinica, 4: 639–650.
Springer, M. and Thompson, W. (1970). "The distribution of Products of Beta, Gamma and Gaussian Random Variables." SIAM Journal on Applied Mathematics, 18: 721–737.

Teh, Y., Jordan, M., Beal, M., and Blei, D. (2006). "Hierarchical Dirichlet Processes." Journal of the American Statistical Association, 101: 1566–1581.
Acknowledgments

The authors are grateful to R.V. Ramamoorthi for his guidance, and to the two referees and the Associate Editor for their very careful and helpful comments. This work has been partially funded by grants from Bocconi University and by PRIN from the Italian Ministry of Education, University, and Research.