Submitted to the Annals of Statistics arXiv: 1110.1265
BAYESIAN MULTIVARIATE MIXED-SCALE DENSITY ESTIMATION By Antonio Canale and David B. Dunson
Università degli Studi di Padova and Duke University

Although univariate continuous density estimation has received abundant attention in the Bayesian nonparametrics literature, there is essentially no theory on multivariate mixed-scale density estimation. In this article, we consider a general framework to jointly model continuous, count and categorical variables under a nonparametric prior, which is induced through rounding latent variables having an unknown density with respect to Lebesgue measure. For the proposed class of priors, we provide sufficient conditions for large support, strong consistency and rates of posterior contraction. These conditions, which primarily relate to the prior on the latent variable density and the heaviness of the tails of the observed continuous variables, allow one to convert sufficient conditions obtained in the setting of multivariate continuous density estimation to the mixed-scale case. We provide new results for the multivariate continuous density estimation case, showing the Kullback-Leibler property and strong consistency for different mixture priors, including priors that parsimoniously model the covariance in a multivariate Gaussian mixture via a sparse factor model. In particular, the results hold for Dirichlet process location and location-scale mixtures of multivariate Gaussians with various prior specifications on the covariance matrix.

Keywords and phrases: Convergence rate, Factor model, Large support, Mixed discrete and continuous, Multivariate density estimation, Nonparametric Bayes, Strong posterior consistency.
1. Introduction. It is routine in many application domains to collect multivariate mixed-scale data consisting of binary, categorical, continuous and count measurements, motivating an increasingly rich literature on methods for the analysis of such data. Perhaps the most common approach in practice is to link each observed variable to one or more underlying Gaussian variables. Relationships among the underlying Gaussian variables are typically characterized through latent factor or structural equation models as in Muthén [23]. Although the underlying Gaussian class is convenient computationally and in terms of providing an interpretable framework for studying dependence among mixed-scale variables, its flexibility is limited in implying Gaussian distributions for continuous variables, probit models for categorical variables and a restrictive dependence structure. In addition, issues
arise in modeling counts and categorical variables having very many levels, due to the need to introduce, and do computation for, very many threshold parameters.

An alternative class of joint models for mixed-scale data can be induced by defining a separate generalized linear model for each variable, with shared latent variables included within these models to induce dependence [26; 22; 5; 6]. This framework assumes that, conditionally on latent variables, the observed variables are independently drawn from distributions in the exponential family. In marginalizing out the latent variables, one obtains a multivariate distribution with essentially unknown properties, and computation can be quite challenging. In certain cases, pitfalls can arise due to the dual role of the latent factors in controlling the dependence structure and the shape of the marginal distributions.

Given these issues, it is quite appealing to consider nonparametric models for flexibly and parsimoniously estimating unknown joint distributions for mixed-scale data. Somewhat surprisingly, given the considerable applied interest, the literature on nonparametric estimation for mixed-scale data is very small. From a frequentist kernel-smoothing perspective, Li, Racine and co-authors [18; 17; 24; 19] proposed mixed kernel methodology and considered its properties under somewhat restrictive conditions. Efromovich [7] recently relaxed these conditions and proposed a data-driven estimator designed to combat the curse of dimensionality. However, his work still assumed compact support for continuous variables and bounded support for discrete variables. To our knowledge, there has been no work on general nonparametric Bayes mixed-data density estimation.

We note that one can potentially combine two existing models to obtain a seemingly flexible and computationally tractable model for mixed-scale densities, which also addresses the curse-of-dimensionality problem. Namely, we can combine the class of underlying Gaussian models mentioned above with mixtures of factor analyzers, which characterize multivariate continuous densities as mixtures of Gaussians with a factor-analytic factorization of the covariance [11; 4; 21]. A conceptually related approach was proposed by Yang and Dunson [34], but instead of mixing Gaussian factor models, they used a nonparametric Bayes approach to allow unknown latent variable distributions in structural equation models that accommodate relationships among the latent variables. To define a nonparametric model for counts, Canale and Dunson [3] proposed to round an underlying variable having an unknown density given a Dirichlet process mixture (DPM) of Gaussians prior [20; 8].

Our focus is on defining classes of Bayesian models for mixed-scale densities which are computationally convenient and can be shown to have
appealing theoretical properties, such as large support, posterior consistency and near-optimal rates of convergence. Instead of developing fundamentally new theoretical tools for the study of mixed-scale densities, our goal is to provide theorems that allow leveraging results obtained for multivariate continuous densities. With this goal in mind, we focus on a multivariate mixed-scale generalization of the rounding framework of [3]. In particular, we propose to induce a prior on a mixed-scale density by defining a prior on a multivariate continuous density for underlying variables, which are rounded to induce categorical or count measurements. This framework has the appealing feature of avoiding computation for thresholds, greatly facilitating handling of discrete variables having many levels. Theoretical properties depend crucially on the prior for the underlying continuous density. For modeling continuous densities, standard nonparametric Bayes methods rely on mixture models of the form
$$f(y) = \int K(y; \Theta)\,dP(\Theta), \qquad P \sim \Pi,$$
where $K(\cdot\,; \Theta)$ is a known probability kernel having parameters $\Theta$ (e.g., Gaussian with $\Theta$ including the mean and covariance) and $P$ is an unknown mixing measure assigned a prior $\Pi$. A common choice for $\Pi$ is the Dirichlet process of Ferguson [9; 10]. Ghosal, Ghosh and Ramamoorthi [12] derive sufficient conditions on the prior and the true distribution $f_0$ in order to achieve strong posterior consistency under a DPM of univariate Gaussians. Tokdar [30] relaxed their conditions, assuming a Dirichlet process location-scale mixture of univariate Gaussians. Ghosal and van der Vaart [14; 15] give the rate of convergence for Bayesian univariate density estimation using a DPM of Gaussians. The only results (to our knowledge) on asymptotic properties of Bayesian procedures for multivariate continuous density estimation are presented by Ghosal and co-authors [33; 29]. In both papers the models considered are quite limited in scope, focusing on DP location mixtures of Gaussian kernels. Posterior consistency is studied in [33] assuming a truncated inverse-Wishart prior for the Gaussian covariance. In [29] near minimax-optimal rates of posterior contraction are shown, under some conditions on the true density, assuming a diagonal covariance in the Gaussian kernel with independent truncated inverse-gamma priors on the diagonal elements. In practice, it is well known that using a diagonal covariance may lead to less efficient results in small to moderate samples. In addition, it is preferable to avoid arbitrary truncations and allow broader priors than inverse gammas and inverse Wisharts. For example, for high-dimensional data it is well known
that inverse Wisharts provide a poor choice, and alternatives based on factor-analytic and other factorizations are commonly used.

In the next section we describe the space of mixed-scale densities and introduce priors on this space via priors on a latent space and mapping functions. The section presents theorems on the KL support of the prior, strong posterior consistency and rates of posterior contraction. Given the lack of results in the multivariate continuous density estimation context, we also generalize the results of [33] using a modification of the sieve suggested by Pati, Dunson and Tokdar [25]. Such a construction avoids the explosion of the $L_1$-metric entropy noted by [33] and allows us to obtain strong posterior consistency in multivariate density estimation also under a location-scale mixture model. Proofs not given in the text are reported in the Appendix.

2. Mixed-scale densities.

2.1. Preliminaries. Our focus is on modeling of joint probability distributions of mixed-scale data $y = (y_1^T, y_2^T)^T$, where $y_1 = (y_{1,1}, \ldots, y_{1,p_1}) \in \Re^{p_1}$ is a $p_1 \times 1$ vector of continuous observations and $y_2 = (y_{2,p_1+1}, \ldots, y_{2,p}) \in \mathcal{Q}$, with $\mathcal{Q} = \prod_{j=1}^{p_2} \{0, 1, \ldots, q_j - 1\}$, is a $p_2 \times 1$ vector of discrete variables having $q = (q_1, \ldots, q_{p_2})^T$ as the respective numbers of levels and $p_2 = p - p_1$. Clearly $y_2$ can include binary variables ($q_j = 2$), categorical variables ($q_j > 2$) or counts ($q_j = \infty$). Hence, $y$ is a $p \times 1$ vector of variables having mixed measurement scales. We let $y \sim f$, with $f$ denoting the joint density with respect to an appropriate dominating measure $\mu$ to be defined below. The set of all possible such joint densities is denoted $\mathcal{F}$. Following a Bayesian nonparametric approach, we propose to specify a prior $f \sim \Pi$ for the joint density having large support over $\mathcal{F}$.

For the continuous variables, we let $(\Omega_1, \mathcal{S}_1, \mu_1)$ denote the $\sigma$-finite measure space having $\Omega_1 = \Re^{p_1}$, $\mathcal{S}_1$ the Borel $\sigma$-algebra of subsets of $\Omega_1$, and $\mu_1$ the Lebesgue measure. Similarly, for the discrete variables we let $(\Omega_2, \mathcal{S}_2, \mu_2)$ denote the $\sigma$-finite measure space having $\Omega_2 \subset \mathbb{N}^{p_2}$, a subset of the $p_2$-dimensional set of natural numbers, $\mathcal{S}_2$ containing all non-empty subsets of $\Omega_2$, and $\mu_2$ the counting measure. Then, we let $\mu = \mu_1 \times \mu_2$ be the product measure on the product space $(\Omega, \mathcal{S}) = (\Omega_1, \mathcal{S}_1) \times (\Omega_2, \mathcal{S}_2)$. To formally define the joint density $f$, first let $\nu$ denote a $\sigma$-finite measure on $(\Omega, \mathcal{S})$ that is absolutely continuous with respect to $\mu$. Then, by the Radon-Nikodym theorem there exists a function $f$ such that $\nu(A) = \int_A f\,d\mu$.

In studying properties of a prior $\Pi$ for the unknown density $f$, such as large support and posterior consistency, it is necessary to define notions of distance and neighborhoods within the space of densities $\mathcal{F}$. Letting $f_0 \in \mathcal{F}$
denote an arbitrary density, such as the true density that generated the data, the Kullback-Leibler divergence of $f$ from $f_0$ can be defined as
$$d_{KL}(f_0, f) = \int_{\Omega} f_0 \log(f_0/f)\,d\mu = \int_{\Omega_1} \int_{\Omega_2} f_0 \log(f_0/f)\,d\mu_2\,d\mu_1 = \int_{\Re^{p_1}} \sum_{y_2 \in \mathcal{Q}} f_0(y_1, y_2) \log \frac{f_0(y_1, y_2)}{f(y_1, y_2)}\,d\mu_1(y_1),$$
with the integrals taken in any order by Fubini's theorem. Another topology is induced by the $L_1$ metric. If $f$ and $f_0$ are probability densities with respect to the product measure $\mu$, their $L_1$ distance is defined as
$$d_1(f_0, f) = \int_{\Omega} |f_0 - f|\,d\mu = \int_{\Omega_1} \int_{\Omega_2} |f_0 - f|\,d\mu_2\,d\mu_1 = \int_{\Re^{p_1}} \sum_{y_2 \in \mathcal{Q}} |f_0(y_1, y_2) - f(y_1, y_2)|\,d\mu_1(y_1).$$
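For a simple illustration of how these mixed sum-integral distances behave (an example added here for concreteness, not part of the original development), take $p_1 = p_2 = 1$ and densities of the form $f(y_1, y_2) = \phi(y_1; \theta, 1)\,\pi^{y_2}(1 - \pi)^{1 - y_2}$ with $y_2 \in \{0, 1\}$. If two such densities share the continuous part ($\theta = \theta_0$), the integral over $y_1$ factors out and
$$d_1(f_0, f) = \sum_{y_2 = 0}^{1} \left|\pi_0^{y_2}(1 - \pi_0)^{1 - y_2} - \pi^{y_2}(1 - \pi)^{1 - y_2}\right| = 2|\pi_0 - \pi|,$$
while if they share the Bernoulli part ($\pi = \pi_0$), the sum over $y_2$ collapses and $d_1(f_0, f) = \int |\phi(y_1; \theta_0, 1) - \phi(y_1; \theta, 1)|\,dy_1$, the familiar continuous $L_1$ distance.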
2.2. Rounding prior. In order to induce a prior $f \sim \Pi$ for the density of the mixed-scale variables, we let
$$y = h(y^*), \qquad y^* \sim f^*, \qquad f^* \sim \Pi^*, \tag{1}$$
where $h : \Re^p \to \Omega$, $y^* = (y_1^*, \ldots, y_p^*)^T \in \Re^p$, $f^* \in \mathcal{F}^*$, $\mathcal{F}^*$ is the set of densities with respect to Lebesgue measure over $\Re^p$, and $\Pi^*$ is a prior over $\mathcal{F}^*$. To introduce an appropriate mapping $h$, we let
$$h(y^*) = \left( h_1(y_1^*)^T, h_2(y_2^*)^T \right)^T, \tag{2}$$
where $h_1(y_1^*) = y_1^*$ is the identity function and $h_2$ is a thresholding function that replaces the real-valued inputs with non-negative integer outputs by thresholding the different inputs separately. Let $A^{(j)} = \{A^{(j)}_1, \ldots, A^{(j)}_{q_j}\}$ denote a prespecified partition of $\Re$ into $q_j$ mutually exclusive subsets, for $j = 1, \ldots, p_2$, with the subsets ordered so that $A^{(j)}_h$ is placed before $A^{(j)}_l$ for all $h < l$. Then, letting $A_{y_2} = \{y_2^* : y_{2,j}^* \in A^{(j)}_{y_{2j}}, j = 1, \ldots, p_2\}$, the mixed-scale density $f$ is defined as
$$f(y) = g(f^*) = \int_{A_{y_2}} f^*(y^*)\,dy^*. \tag{3}$$
The function $g : \mathcal{F}^* \to \mathcal{F}$ defined in (3) is a mapping from the space of densities with respect to Lebesgue measure on $\Re^p$ to the space of mixed-scale densities $\mathcal{F}$.
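To make the construction concrete, the short sketch below (our illustration; the helper `make_h` and the particular partitions are hypothetical choices, not prescribed above) implements the mapping $h$ of (2) with one continuous coordinate, one binary coordinate obtained by thresholding at zero, and one count coordinate obtained by thresholding at the non-negative integers, and applies it to latent Gaussian draws.

```python
import numpy as np

def make_h(p1, thresholds):
    """Build the mapping h of (2): identity h1 on the first p1 coordinates,
    coordinate-wise thresholding h2 on the rest. thresholds[j] holds the
    increasing cut points defining the partition A^(j) of the real line."""
    def h(y_star):
        y_star = np.atleast_2d(y_star)
        y1 = y_star[:, :p1]  # h1: continuous coordinates pass through
        # searchsorted(..., side='right') returns the index of the partition
        # set A^(j) containing each latent value.
        y2 = np.column_stack([
            np.searchsorted(cuts, y_star[:, p1 + j], side="right")
            for j, cuts in enumerate(thresholds)
        ])
        return np.column_stack([y1, y2])
    return h

# Example: p = 3 with p1 = 1 continuous coordinate, one binary variable
# (q = 2, cut at zero) and one count variable (cuts at 0, 1, 2, ...;
# a finite grid is used here purely for illustration).
h = make_h(p1=1, thresholds=[np.array([0.0]), np.arange(0, 100)])
rng = np.random.default_rng(0)
y_star = rng.multivariate_normal([0.0, 0.5, 2.0], np.eye(3), size=5)  # y* ~ f*
print(h(y_star))  # mixed-scale observations y = h(y*)
```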
This framework generalizes [3], which focused only on count variables. The theory is substantially more challenging in the mixed-scale case when there are continuous variables involved. Clearly, the properties of the induced prior $f \sim \Pi$ will be driven largely by the properties of $f^* \sim \Pi^*$. Lemma 1 shows that the mapping $g : \mathcal{F}^* \to \mathcal{F}$ maintains Kullback-Leibler (KL) neighborhoods. The proof is omitted, being a straightforward modification of that for Lemma 1 in [3].

Lemma 1. Choose any $f_0^*$ such that $f_0 = g(f_0^*)$ for any fixed $f_0 \in \mathcal{F}$. Let $K_\epsilon(f_0^*) = \{f^* : d_{KL}(f_0^*, f^*) < \epsilon\}$ be a Kullback-Leibler neighborhood of size $\epsilon$ around $f_0^*$. Then the image $g(K_\epsilon(f_0^*))$ contains only values $f \in \mathcal{F}$ in a Kullback-Leibler neighborhood of $f_0$ of size at most $\epsilon$.

Large support of the prior plays a crucial role in posterior consistency. Under the theory of Schwartz [27], given $f_0$ in the KL support of the prior, to obtain posterior consistency we need to ensure the existence of an exponentially consistent sequence of tests for the hypothesis $H_0 : f = f_0$ versus $H_1 : f \in U^C(f_0)$, where $U(f_0)$ is a neighborhood of $f_0$. Ghosal et al. [12] show that the existence of such a sequence of tests is guaranteed by balancing the size of a sieve and the prior probability assigned to its complement. We now provide sufficient conditions for $L_1$ posterior consistency for priors in the class proposed in expression (1). Our Theorem 1 builds on Theorem 8 of [12]. The main differences are that we define the sieve $\mathcal{F}_n$ as $g(\mathcal{F}_n^*)$, where $\mathcal{F}_n^*$ is a sieve on $\mathcal{F}^*$, and that we state the conditions on the prior probability in terms of the underlying $\Pi^*$. The proof follows the same steps as [12], given Lemmas 4 and 5 (reported in the Appendix), which provide an upper bound for the $L_1$ metric entropy $J(\delta, \mathcal{F}_n^*)$, defined as the logarithm of the minimum number of $\delta$-sized $L_1$ balls needed to cover $\mathcal{F}_n^*$.

Theorem 1. Let $\Pi$ be a prior on $\mathcal{F}$ induced by $\Pi^*$ as described in expression (1). Suppose $f_0$ is in the KL support of $\Pi$ and let $U_\epsilon = \{f \in \mathcal{F} : \|f - f_0\|_1 < \epsilon\}$. If for each $\epsilon > 0$ there are $\delta < \epsilon$, $c_1, c_2 > 0$, $\beta < \epsilon^2/8$ and sets $\mathcal{F}_n^* \subset \mathcal{F}^*$ such that, for $n$ large,

(i) $\Pi^*(\mathcal{F}_n^{*C}) \leq c_1 e^{-n c_2}$;
(ii) $J(\delta, \mathcal{F}_n^*) < n\beta$,

then $\Pi(U_\epsilon \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0}$.
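Lemma 1 can also be checked numerically. The sketch below (ours; the Gaussian parameters and the rounding grid are arbitrary illustrations) compares the KL divergence between two latent univariate Gaussians with the divergence between the count densities they induce when rounding at the non-negative integers; by Lemma 1 the induced divergence never exceeds the latent one.

```python
import numpy as np
from scipy.stats import norm

def kl_gaussians(m0, s0, m1, s1):
    # Closed-form d_KL(f0*, f*) between the latent Gaussian densities.
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

def kl_rounded(m0, s0, m1, s1, kmax=200):
    # d_KL(g(f0*), g(f*)) between the induced count densities: cell
    # probabilities are normal CDF increments over A_0 = (-inf, 0),
    # A_k = [k - 1, k), with the upper tail lumped into a final cell.
    cuts = np.concatenate([[-np.inf], np.arange(0, kmax), [np.inf]])
    p0 = np.diff(norm.cdf(cuts, loc=m0, scale=s0))
    p1 = np.diff(norm.cdf(cuts, loc=m1, scale=s1))
    keep = p0 > 0  # cells carrying f0-mass; p1 > 0 on these as well
    return np.sum(p0[keep] * np.log(p0[keep] / p1[keep]))

m0, s0, m1, s1 = 2.0, 1.0, 2.5, 1.3
print(kl_gaussians(m0, s0, m1, s1))  # latent divergence
print(kl_rounded(m0, s0, m1, s1))    # induced divergence, never larger
```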
We now state a theorem on the rate of convergence (contraction) of the posterior distribution. The theorem gives conditions on the prior $\Pi^*$ similar to those directly required by Theorem 2.1 of [13]. The proof is reported in the Appendix.

Theorem 2. Let $\Pi$ be the prior on $\mathcal{F}$ induced by $\Pi^*$ as described in expression (1) and $U = \{f : d(f, f_0) \leq M\epsilon_n\}$, with $d$ the $L_1$ or Hellinger distance. Suppose that for a sequence $\epsilon_n$, with $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$, a constant $C > 0$, sets $\mathcal{F}_n^* \subset \mathcal{F}^*$ and $B_n^* = \{f^* : \int f_0^* \log(f_0^*/f^*)\,d\mu \leq \epsilon_n^2, \int f_0^* (\log(f_0^*/f^*))^2\,d\mu \leq \epsilon_n^2\}$, defined for a given $f_0^* \in g^{-1}(f_0)$, we have

(iii) $J(\epsilon_n, \mathcal{F}_n^*) < C n \epsilon_n^2$;
(iv) $\Pi^*(\mathcal{F}_n^{*C}) \leq \exp\{-n\epsilon_n^2 (C + 4)\}$;
(v) $\Pi^*(B_n^*) \geq \exp\{-C n \epsilon_n^2\}$,

then for sufficiently large $M$ we have that $\Pi(U^C \mid y_1, \ldots, y_n) \to 0$ in $P_{f_0}$-probability.

3. Consistency in multivariate continuous density estimation. From Section 2.2 it is clear that the properties of the induced prior $f = g(f^*) \sim \Pi$ depend heavily on the choice of prior $f^* \sim \Pi^*$. Our hope is to leverage the rich literature on models and theory for continuous density estimation in developing associated models and theory for the mixed-scale case. A first step in utilizing the methodology and theory developed in Sections 2.1-2.2 is to define priors for unknown multivariate densities $f^*$ with respect to Lebesgue measure on $\Re^p$ and to verify that these priors have appealing properties in terms of large support and posterior consistency in the simple case in which $p_1 = p$, so that $y = y^*$ and all underlying variables are observed directly.

Dirichlet process mixtures (DPMs) form the most widely applied class of models for Bayesian density estimation, with a rich theoretical literature available in the univariate continuous case on posterior consistency [12; 1; 30; 32] and rates of posterior contraction [13; 14; 15; 31; 28]. DPMs of Gaussian kernels have proven successful for multivariate density estimation in challenging cases involving high-dimensional data [4]. However, to our knowledge the only results currently available on $L_1$ consistency for multivariate density estimation rely on DPMs of multivariate Gaussian kernels [33; 29]. Our focus is also on DPMs of Gaussians, but we generalize the results in [33; 29] to more flexible mixtures that enable scaling to higher dimensions using sparse models for the kernel covariance and other modifications.

3.1. Dirichlet process location mixtures. We initially consider DP location mixtures of multivariate Gaussian kernels, which let
$$f^*_{P,\Sigma}(y) = \int \phi(y; \theta, \Sigma)\,dP(\theta), \qquad P \sim DP(\alpha P_0), \qquad \Sigma \sim \tilde{\Pi}, \tag{4}$$
with $\phi(\cdot\,; \theta, \Sigma)$ denoting the multivariate Gaussian density with location $\theta \in \Re^p$ and covariance $\Sigma \in M_p$, $\alpha > 0$ a concentration parameter, $P_0$ a base probability measure, and $\tilde{\Pi}$ a prior on the space $M_p$ of $p \times p$ positive semi-definite matrices. In marginalizing out $P$, one can write model (4) as
$$f^*_{P,\Sigma}(y) = \sum_{k=1}^{\infty} \pi_k \phi(y; \theta_k, \Sigma), \qquad \theta_k \sim P_0, \qquad \pi_k = V_k \prod_{l<k} (1 - V_l), \qquad V_k \sim \mathrm{Beta}(1, \alpha). \tag{5}$$

The next theorem provides sufficient conditions on the true density for the KL property.

Theorem 3. Let $\Pi^*$ be the prior defined in (4), and suppose $f_0^*$ satisfies

1. $0 < f_0^*(y) < M$ for some constant $M$;
2. $|\int f_0^*(y) \log f_0^*(y)\,dy| < \infty$;
3. for some $\delta > 0$, $\int f_0^*(y) \log\{f_0^*(y)/\phi_\delta(y)\}\,dy < \infty$, where $\phi_\delta(y) = \inf_{\|y' - y\| < \delta} f_0^*(y')$;
4. for some $\eta > 0$, $\int \|y\|^{2p(1+\eta)} f_0^*(y)\,dy < \infty$.
Then $f_0^*$ is in the KL support of $\Pi^*$.

In proving strong posterior consistency on non-compact spaces, such as $\mathcal{F}^*$, a critical step is to introduce a compact subset $\mathcal{F}_n^*$ that is indexed by the sample size $n$ and grows to fill the entire space as $n \to \infty$. This sequence of subsets is typically referred to as a sieve, and the choice of the sieve for multivariate density estimation is quite important, as standard sufficient conditions for posterior consistency require the $L_1$ metric entropy of $\mathcal{F}_n^*$ to grow more slowly than linearly in $n$, while also requiring the prior probability assigned outside of $\mathcal{F}_n^*$ (to $\mathcal{F}_n^{*C}$) to decrease exponentially fast in $n$. If the sieve is not chosen very carefully, $\mathcal{F}_n^*$ may be quite small, so that the condition on the prior becomes very restrictive and the prior may need to have very light tails or even compact support. The choice of the sieve is particularly crucial in multivariate density estimation, since naive choices may lead the $L_1$ metric entropy to blow up with the dimension $p$. Lemma 2 proposes a sieve, which modifies the formulation of [25], and provides a bound on its $L_1$ metric entropy. This sieve is then utilized in Theorem 4, which gives
sufficient conditions on the prior under which the posterior $\Pi^*(\cdot \mid y_1, \ldots, y_n)$ is $L_1$ consistent. This result is of significant general interest in multivariate density estimation, in relaxing conditions of [33].

Lemma 2. Let $a, l, m$ be positive constants, $\lambda_i(\Sigma)$ the $i$th eigenvalue of $\Sigma$, ordered from smallest to largest, and $F_{a,l,m} = \{f^*_{P,\Sigma} : \|\theta_k\| \leq a, k = 1, \ldots, m, \sqrt{\lambda_1(\Sigma)} > l, \sum_{k > m} \pi_k < \epsilon\}$. Then
$$J(4\epsilon, F_{a,l,m}) \leq m \log\left\{ d_1 \left(\frac{a}{l}\right)^p + d_2 \right\} + d_3 m \log(d_4 m),$$
where $d_1, d_2, d_3$ and $d_4$ are constants depending on $\epsilon$.

In [33] it is stressed that the usual method of constructing a sieve does not lead to a consistency result in the multivariate case because of the explosion of the $L_1$-metric entropy; they propose to include a fixed bound on the highest eigenvalue of $\Sigma$ through the prior. Lemma 2 introduces a novel sieve designed to bypass this constraint, modifying the sieve of [25] so that it is formulated through a lower bound on the smallest eigenvalue of $\Sigma$.

Theorem 4. Assume we observe an iid sample $y = y_1, \ldots, y_n$ from $f_0^*$ satisfying conditions 1-4 of Theorem 3. Consider the prior $\Pi^*$ defined in (4). For any $\epsilon > 0$ and $\beta < \epsilon^2/8$, if there exist sequences $a_n = O(\sqrt{n})$ and $l_n = O(1/\sqrt{n})$ and $\beta_1 > 0$ such that the following conditions hold:

5. $\tilde{\Pi}\{\Sigma : \sqrt{\lambda_1(\Sigma)} \leq l_n\} \leq e^{-n\beta_1}$;
6. $P_0\{\|\theta\| > a_n\} \leq e^{-n\beta_1}$;
7. $(a_n/l_n)^p \leq n\beta$,

then $\Pi^*(\{f^* : \|f^* - f_0^*\|_1 < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.

The proof relies on verifying the conditions of Theorem 8 of [12]. In particular, the first two conditions ensure that the prior probability of the complement of the sieve is exponentially small, while condition 7 determines the upper bound for the $L_1$-metric entropy of the sieve described in Lemma 2 with $m = O(n/\log(n))$.

3.2. Dirichlet process mean-covariance mixtures. A second mixture model generalizes model (4) by also mixing over the covariance matrix:
$$f^*_P(y) = \int \phi(y; \theta, \Sigma)\,dP(\theta, \Sigma), \qquad P \sim DP(\alpha P_0), \tag{6}$$
with $P_0$ now being a measure on $\Re^p \times M_p$. In marginalizing out $P$, one can write model (6) as
$$f^*_P(y) = \sum_{k=1}^{\infty} \pi_k \phi(y; \theta_k, \Sigma_k), \qquad (\theta_k, \Sigma_k) \sim P_0, \qquad \pi_k = V_k \prod_{l<k} (1 - V_l), \qquad V_k \sim \mathrm{Beta}(1, \alpha). \tag{7}$$

The KL property for this prior (Theorem 5) is established along the same lines as Theorem 3: for any $\epsilon > 0$ one obtains an open set $\mathcal{P}$ of mixing measures such that, for any $P \in \mathcal{P}$,
$$\int f_0(y) \log \frac{f_0(y)}{\int \phi(y; \theta, \omega I_p)\,dP(\theta, \omega)}\,dy \leq \frac{\epsilon}{2}.$$
The rest of the proof follows from [33].

The next lemma describes the sieve and its size in terms of $L_1$ metric entropy. It is a generalization of the sieve defined in Lemma 2. The proof is reported in the Appendix.

Lemma 3. Let $a, h, l, m$ be positive constants and $F_{a,h,l,m} = \{f^*_P : \|\theta_k\| \leq a, \sqrt{\lambda_1(\Sigma_k)} > l, \sqrt{\lambda_p(\Sigma_k)} < h, k = 1, \ldots, m, \sum_{k>m} \pi_k < \epsilon\}$. Then
$$J(4\epsilon, F_{a,h,l,m}) \leq m \log\left\{ d_1 \left(\frac{a}{l}\right)^p + d_2 \log\frac{h}{l} + d_3 \right\} + d_4 m \log(d_5 m),$$
where $d_1, d_2, d_3, d_4$ and $d_5$ are constants depending on $\epsilon$.

The next theorem gives conditions for $L_1$ posterior consistency under the Dirichlet process mixture model (6).

Theorem 6. Assume we observe an iid sample $y = y_1, \ldots, y_n$ from $f_0^*$ satisfying conditions 1-4 of Theorem 3. Consider the prior $\Pi^*$ defined in (6). For any $\epsilon > 0$ and $\beta < \epsilon^2/8$, if there exist sequences $a_n = O(\sqrt{n})$, $h_n = O(\exp(n))$ and $l_n = O(1/\sqrt{n})$ and $\beta_1 > 0$ such that the following conditions hold:
8. $P_0\{\|\theta\| > a_n \text{ or } \sqrt{\lambda_1(\Sigma)} \leq l_n \text{ or } \sqrt{\lambda_p(\Sigma)} \geq h_n\} \leq e^{-n\beta_1}$;
9. $(a_n/l_n)^p \leq n\beta$, $\log(h_n/l_n) \leq n\beta$,

then $\Pi^*(\{f^* : \|f^* - f_0^*\|_1 < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.

Here too the proof consists in verifying that the conditions of Theorem 8 of [12] are satisfied, using the sieve construction of Lemma 3 with $m = O(n/\log(n))$.

3.3. Examples. The results of the previous subsections rely on Dirichlet process mixtures of multivariate Gaussian kernels, but they are nonetheless quite broad, allowing general priors for the covariance $\Sigma \sim \tilde{\Pi}$ in the location mixture case and general choices of base measure $P_0$ in the location-covariance mixture case. These choices make a substantial practical difference in applications, particularly in high-dimensional settings. In what follows, we show that posterior consistency can be obtained for some particular, potentially useful choices of $\tilde{\Pi}$ and $P_0$.

A common and convenient prior for the covariance matrix is $\Sigma^{-1} \sim W(\Sigma_0, r)$. Corollary 2 of [33] shows that a truncated Wishart has exponentially small probability on the complement of the sieve introduced therein. This artificial truncation of the Wishart, which may lead to hurdles in implementation, is no longer necessary under our sieve construction, and we can let $\tilde{\Pi} = IW(\Sigma_0, r)$ in the setting of Section 3.1 and $P_0(\theta, \Sigma) = N(\theta; \theta_0, \Omega_0) \times IW(\Sigma; \Sigma_0, r)$ in the setting of Section 3.2. The next corollary formalizes the consistency of the two above-mentioned priors, showing that they satisfy the conditions of Theorems 4 and 6. The proofs are reported in the Appendix.

Corollary 1. Assume we observe an iid sample $y_1, \ldots, y_n$ from $f_0^*$ satisfying conditions 1-4 of Theorem 3. Consider a prior $\Pi^*$ induced by (4) with $P_0 = N(\theta_0, \Omega_0)$ and $\Sigma^{-1} \sim W(\Sigma_0, r)$ (by (6) with $P_0 = N(\theta_0, \Omega_0) \times IW(\Sigma_0, r)$), with $\Sigma_0 = \sigma_0 I_p$. Then for any $\epsilon > 0$, $\Pi^*(\{f^* : \|f^* - f_0^*\|_1 < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.

Remark 1. The statement of Corollary 1 remains true if we consider the prior induced by (6) with $P_0(\theta, \Sigma) = N(\theta_0, \tau_0\Sigma) \times IW(\Sigma_0, r)$, since $P_0\{\|\theta\| > a_n\}$ is exponentially small.
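To see priors (6)-(7) operationally, the sketch below (our illustration; function names and hyperparameter values are assumptions, not from the text above) draws an approximate realization from the DP mean-covariance mixture by truncating the stick-breaking representation (7) at $m$ atoms, using the normal/inverse-Wishart base measure $P_0(\theta, \Sigma) = N(\theta_0, \tau_0\Sigma) \times IW(\Sigma_0, r)$ of Remark 1 with $\theta_0 = 0$.

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

def sample_dp_mix(p=2, alpha=1.0, m=50, sigma0=1.0, tau0=1.0, r=None, seed=0):
    """Truncated stick-breaking draw from (6)-(7) with base measure
    P0(theta, Sigma) = N(0, tau0 * Sigma) x IW(sigma0 * I_p, r) (Remark 1)."""
    rng = np.random.default_rng(seed)
    r = p + 2 if r is None else r                    # IW degrees of freedom
    V = rng.beta(1.0, alpha, size=m)                 # V_k ~ Beta(1, alpha)
    V[-1] = 1.0                                      # leftover mass to last atom
    pi = V * np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])  # pi_k = V_k prod_{l<k}(1 - V_l)
    Sigma = invwishart.rvs(df=r, scale=sigma0 * np.eye(p), size=m, random_state=rng)
    theta = np.array([rng.multivariate_normal(np.zeros(p), tau0 * S) for S in Sigma])
    return pi, theta, Sigma

def mix_density(y, pi, theta, Sigma):
    # Evaluate f*_P(y) = sum_k pi_k phi(y; theta_k, Sigma_k).
    return sum(w * multivariate_normal.pdf(y, t, S) for w, t, S in zip(pi, theta, Sigma))

pi, theta, Sigma = sample_dp_mix()
print(mix_density(np.zeros(2), pi, theta, Sigma))  # density of one draw at the origin
```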
It is well known that an inverse-Wishart prior for the covariance tends to be a poor choice when the dimension $p$ of the data is large, even in the simpler parametric setting involving a single multivariate Gaussian kernel. To address this problem, there is a rich literature on shrinkage priors for covariance matrices, a commonly used and successful approach relying on factor-analytic factorizations in which
$$\Sigma = \Gamma\Gamma^T + \Omega, \qquad \Gamma \sim \Pi_\Gamma, \qquad \Omega \sim \Pi_\Omega, \tag{8}$$
where $\Gamma$ is a $p \times r$ matrix and $\Omega$ is a $p \times p$ diagonal matrix. For example, [2] induces a prior for a covariance matrix through (8) with $\Pi_\Gamma$ and $\Pi_\Omega$ induced by letting
$$\gamma_{jh} \mid \phi_{jh}, \tau_h \sim N(0, \phi_{jh}^{-1}\tau_h^{-1}), \qquad \phi_{jh} \sim \mathrm{Ga}(3/2, 3/2), \qquad \tau_h = \prod_{l=1}^{h}\delta_l,$$
$$\delta_1 \sim \mathrm{Ga}(a_1, 1), \qquad \delta_l \sim \mathrm{Ga}(a_2, 1),\ l > 1, \qquad \sigma_j^{-2} \sim \mathrm{Ga}(a, b),\ j = 1, \ldots, p, \tag{9}$$
where $\gamma_{jh}$ is the element in row $j$ and column $h$ of $\Gamma$ and $\sigma_j^2$ is the $j$th diagonal element of $\Omega$. The next corollary provides conditions on $\Pi_\Gamma$ and $\Pi_\Omega$ for $L_1$ posterior consistency in DP mixtures of multivariate normal factor models.

Corollary 2. Assume we observe an iid sample $y_1, \ldots, y_n$ from $f_0^*$ satisfying conditions 1-4 of Theorem 3. Consider a prior $\Pi^*$ defined in (4) with $P_0 = N(\theta_0, \Omega_0)$ and $\Sigma \sim \tilde{\Pi}$ (in (6) with $P_0 = N(\theta_0, \Omega_0) \times \tilde{\Pi}$), where $\tilde{\Pi}$ follows the factor model representation of (8). Further assume that

10. $E\{\mathrm{tr}(\Gamma\Gamma^T)\} < \infty$, $E\{\mathrm{tr}(\Omega)\} < \infty$;
11. $\Pi_\Omega\{\sigma_j^2 \leq l_n^2\} \leq e^{-cn}$ with $l_n = O(1/\sqrt{n})$.

Then for any $\epsilon > 0$, $\Pi^*(\{f^* : \|f^* - f_0^*\|_1 < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.

Remark 2. The particular sparse factor model of [2] shown in (9) satisfies conditions 10 and 11. In fact, $\mathrm{tr}(\Omega)$ and $\mathrm{tr}(\Gamma\Gamma^T)$ have expectations equal to $pb/(a+1)$ and $p\sum_{h=1}^{r} a_1 a_2^{h-1}$, respectively, since they are distributed according to
$$\mathrm{tr}(\Gamma\Gamma^T) \sim \sum_{h=1}^{r} \sum_{j=1}^{p} \phi_{jh}^{-1}\tau_h^{-1}\chi^2_1, \qquad \mathrm{tr}(\Omega) = \sum_{j=1}^{p} \sigma_j^2, \qquad \sigma_j^2 \sim \mathrm{IGa}(a, b).$$
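Prior (8)-(9) is straightforward to simulate; the sketch below (ours; hyperparameter values are illustrative) draws a covariance $\Sigma = \Gamma\Gamma^T + \Omega$ from the multiplicative gamma process shrinkage prior of [2]. Note that the smallest eigenvalue of $\Sigma$ is bounded below by $\min_j \sigma_j^2$, which is what condition 11 controls.

```python
import numpy as np

def sample_sigma_mgp(p=10, r=5, a1=2.0, a2=3.0, a=1.0, b=1.0, seed=0):
    """One draw of Sigma = Gamma Gamma^T + Omega under (8)-(9)."""
    rng = np.random.default_rng(seed)
    phi = rng.gamma(1.5, 1.0 / 1.5, size=(p, r))              # phi_jh ~ Ga(3/2, rate 3/2)
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1),       # delta_1 ~ Ga(a1, 1)
                            rng.gamma(a2, 1.0, size=r - 1)])  # delta_l ~ Ga(a2, 1), l > 1
    tau = np.cumprod(delta)                                   # tau_h = prod_{l<=h} delta_l
    # gamma_jh | phi_jh, tau_h ~ N(0, phi_jh^{-1} tau_h^{-1}); growing tau_h
    # stochastically shrinks later columns of the loadings toward zero.
    Gamma = rng.normal(size=(p, r)) / np.sqrt(phi * tau)
    omega = 1.0 / rng.gamma(a, 1.0 / b, size=p)               # sigma_j^{-2} ~ Ga(a, rate b)
    return Gamma @ Gamma.T + np.diag(omega)

Sigma = sample_sigma_mgp()
print(np.linalg.eigvalsh(Sigma)[0])  # smallest eigenvalue, >= min_j sigma_j^2
```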
Acknowledgement. This research was partially supported by grant R01 ES017240-01 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH) and grant CPDA097208/09 from the University of Padua, Italy.
4. Appendix.

Proof of Theorem 1. The next two lemmas are useful to determine the size of the parameter space $\mathcal{F}$, measured in terms of $L_1$ metric entropy. The first shows that the $L_1$ topology is maintained under the mapping $g$, and the second bounds the $L_1$ metric entropy of a sieve.

Lemma 4. Assume that the true data-generating density is $f_0 \in \mathcal{F}$. Choose any $f_0^*$ such that $f_0 = g(f_0^*)$. Let $U_\epsilon(f_0^*) = \{f^* : \|f_0^* - f^*\|_1 < \epsilon\}$ be an $L_1$ neighborhood of size $\epsilon$ around $f_0^*$. Then the image $g(U_\epsilon(f_0^*))$ contains only values $f \in \mathcal{F}$ in an $L_1$ neighborhood of $f_0$ of size at most $\epsilon$.

The proof is omitted since it follows directly from the definition of $L_1$ neighborhood and from Fubini's theorem.

Lemma 5. Let $\mathcal{F}_n^* \subset \mathcal{F}^*$ denote a compact subset of $\mathcal{F}^*$, with $J(\delta, \mathcal{F}_n^*)$ the $L_1$ metric entropy corresponding to the logarithm of the minimum number of $\delta$-sized $L_1$ balls needed to cover $\mathcal{F}_n^*$. Letting $\mathcal{F}_n = g(\mathcal{F}_n^*)$, we have $J(\delta, \mathcal{F}_n) \leq J(\delta, \mathcal{F}_n^*)$.

Proof of Lemma 5. Let $k = \exp\{J(\delta, \mathcal{F}_n^*)\}$ be the number of $\delta$-balls needed to cover $\mathcal{F}_n^*$, with $f_1^*, \ldots, f_k^*$ denoting the centers of these balls, so that $\mathcal{F}_n^* \subset \bigcup_{i=1}^{k} \mathcal{F}_{n,i}^*$, where $\mathcal{F}_{n,i}^* = \{f^* : \|f^* - f_i^*\|_1 < \delta\}$. From Lemma 4, it is clear we can define $\mathcal{F}_n \subset \bigcup_{i=1}^{k} \mathcal{F}_{n,i}$, where $\mathcal{F}_{n,i} = g(\mathcal{F}_{n,i}^*)$ is an $L_1$ neighborhood around $f_i = g(f_i^*)$ of size at most $\delta$. This defines a covering of $\mathcal{F}_n$ using $k$ $\delta$-sized $L_1$ balls; this is not necessarily the minimal covering possible, and hence $J(\delta, \mathcal{F}_n^*)$ provides an upper bound on $J(\delta, \mathcal{F}_n)$.

The rest of the proof of Theorem 1 follows along almost the same lines as [12], in showing that the sets $\mathcal{F}_n \cap \{f : \|f - f_0\|_1 < \epsilon\}$ and $\mathcal{F}_n^C$ satisfy the conditions of an unpublished result of Barron (see Theorem 4.4.3 of [16]).

Proof of Theorem 2. Let $\mathcal{F}_n = g(\mathcal{F}_n^*)$. From Lemma 5 we have $J(\delta, \mathcal{F}_n) \leq J(\delta, \mathcal{F}_n^*)$. Let $D(\epsilon, \mathcal{F}_n)$ be the $\epsilon$-packing number of $\mathcal{F}_n$, i.e., the maximal number of points in $\mathcal{F}_n$ such that the distance between every pair is at least $\epsilon$. For every $\epsilon > \epsilon_n$, using (iii) we have $\log D(\epsilon/2, \mathcal{F}_n) < \log D(\epsilon_n, \mathcal{F}_n) < C n \epsilon_n^2$. Therefore, applying Theorem 7.1 of [13] with $j = 1$, $D(\epsilon) = \exp(C n \epsilon_n^2)$ and $\epsilon = M\epsilon_n$ with $M > 2$, there exists a sequence of tests $\{\Phi_n\}$ that satisfies
$$E_{f_0}\{\Phi_n\} \leq \exp\{-(KM^2 - 1)n\epsilon_n^2\}, \qquad \sup_{f \in U^C \cap \mathcal{F}_n} E_f\{1 - \Phi_n\} \leq \exp\{-KnM^2\epsilon_n^2\}. \tag{10}$$
The posterior probability assigned to $U^C$ can be written as
$$\Pi(U^C \mid y_1, \ldots, y_n) = \frac{\int_{U^C \cap \mathcal{F}_n} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f) + \int_{U^C \cap \mathcal{F}_n^C} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f)}{\int \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f)} \leq \Phi_n + \frac{(1 - \Phi_n)\int_{U^C \cap \mathcal{F}_n} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f) + \int_{U^C \cap \mathcal{F}_n^C} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f)}{\int \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f)}.$$
Taking $KM^2 > K + 1$, the first summand satisfies $E_{f_0}\{\Phi_n\} \leq 2\exp\{-Kn\epsilon_n^2\}$ by (10). The rest of the proof consists in showing that the remaining term goes to zero in $P_{f_0}$-probability. By Fubini's theorem and (10) we have
$$E_{f_0}\left\{(1 - \Phi_n)\int_{U^C \cap \mathcal{F}_n} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f)\right\} \leq \sup_{f \in U^C \cap \mathcal{F}_n} E_f\{1 - \Phi_n\} \leq \exp\{-KnM^2\epsilon_n^2\},$$
while by (iv) we have
$$E_{f_0}\left\{\int_{U^C \cap \mathcal{F}_n^C} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f)\right\} \leq \Pi(\mathcal{F}_n^C) = \Pi^*(\mathcal{F}_n^{*C}) \leq \exp\{-n\epsilon_n^2(C + 4)\}.$$
The numerator of the second summand is hence exponentially small for $M > \sqrt{(C + 4)/K}$. Finally, we need to lower bound the denominator. Clearly
$$g(B_n^*) \subseteq B_n = \left\{f : \int f_0 \log(f_0/f)\,d\mu \leq \epsilon_n^2,\ \int f_0 (\log(f_0/f))^2\,d\mu \leq \epsilon_n^2\right\},$$
and then $\Pi(B_n) \geq \Pi(g(B_n^*)) = \Pi^*(B_n^*)$; moreover,
$$\int_{B_n} \int f_0 \log(f_0/f)\,d\mu\,d\Pi(f) \leq \int_{B_n} \epsilon_n^2\,d\Pi(f), \qquad \int_{B_n} \int f_0 (\log(f_0/f))^2\,d\mu\,d\Pi(f) \leq \int_{B_n} \epsilon_n^2\,d\Pi(f).$$
Then, using condition (v) on $\Pi^*(B_n^*)$ and Lemma 8.1 of [13], we obtain that, with $P_{f_0}$-probability tending to one,
$$\int \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\,d\Pi(f) \geq \exp\{-n\epsilon_n^2(C + 4)\},$$
which concludes the proof.
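The single-kernel $L_1$ bound used at the start of the proof of Lemma 3 below can be checked numerically; the sketch below (ours; the parameter values are arbitrary illustrations satisfying $\det(\Sigma_1) < \det(\Sigma_2)$) Monte Carlo estimates $\|\phi_{\theta_1,\Sigma_1} - \phi_{\theta_2,\Sigma_2}\|_1$ and compares it with that bound.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
p = 2
theta1, Sigma1 = np.zeros(p), np.eye(p)
theta2, Sigma2 = np.array([0.3, -0.2]), 1.21 * np.eye(p)  # det(Sigma1) < det(Sigma2)

# Monte Carlo estimate of the L1 distance: E_{phi1} |1 - phi2/phi1|.
x = mvn.rvs(theta1, Sigma1, size=200_000, random_state=rng)
d1 = np.mean(np.abs(1.0 - mvn.pdf(x, theta2, Sigma2) / mvn.pdf(x, theta1, Sigma1)))

# Bound from the proof: sqrt(2/pi) * ||theta1 - theta2|| / sqrt(lambda_1(Sigma1))
#                       + (det(Sigma2)^{1/2} - det(Sigma1)^{1/2}) / det(Sigma1)^{1/2}.
lam1 = np.linalg.eigvalsh(Sigma1)[0]  # smallest eigenvalue
bound = (np.sqrt(2 / np.pi) * np.linalg.norm(theta1 - theta2) / np.sqrt(lam1)
         + (np.sqrt(np.linalg.det(Sigma2)) - np.sqrt(np.linalg.det(Sigma1)))
         / np.sqrt(np.linalg.det(Sigma1)))
print(d1, bound)  # the estimate should fall below the bound
```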
Proof of Lemma 3. The proof is similar to that of Theorem 7.5 of [25], while dealing with the generalization of the mixture model introduced in (6). For any $f_1, f_2 \in F_{a,h,l,m}$ we have
$$\|f_1 - f_2\|_1 \leq \sum_{l=1}^{m} \pi_l^{(1)} \int \left|\phi_{\theta_l^{(1)}, \Sigma_l^{(1)}} - \phi_{\theta_l^{(2)}, \Sigma_l^{(2)}}\right| + \sum_{l=1}^{m} \left|\pi_l^{(1)} - \pi_l^{(2)}\right| + 2\epsilon.$$
We obtain the upper bound for the sieve via the usual steps, as in [12], [30] and [33]. We start by showing that two single multivariate normal kernels with suitable parameters have $L_1$ distance smaller than $\epsilon$. It can be shown that for any two multivariate normals with the same mean vector and with $\det(\Sigma_1) < \det(\Sigma_2)$ we have
$$\|\phi_{\Sigma_1} - \phi_{\Sigma_2}\|_1 < \int \phi_{\Sigma_2}(x)\left(\frac{\det(\Sigma_2)^{1/2}}{\det(\Sigma_1)^{1/2}} - 1\right)dx = \frac{\det(\Sigma_2)^{1/2} - \det(\Sigma_1)^{1/2}}{\det(\Sigma_1)^{1/2}}.$$
Hence
$$\|\phi_{\theta_1,\Sigma_1} - \phi_{\theta_2,\Sigma_2}\|_1 \leq \|\phi_{\theta_1,\Sigma_2} - \phi_{\theta_2,\Sigma_2}\|_1 + \|\phi_{\theta_1,\Sigma_1} - \phi_{\theta_1,\Sigma_2}\|_1 \leq \sqrt{\frac{2}{\pi}}\,\frac{\|\theta_1 - \theta_2\|}{\sqrt{\lambda_1(\Sigma_1)}} + \frac{|\det(\Sigma_2)^{1/2} - \det(\Sigma_1)^{1/2}|}{\det(\Sigma_1)^{1/2}},$$
where the first summand can be obtained following the first part of the proof of Lemma 5 in [33] and the second follows from above.

Let $\zeta = \min(\epsilon/2, 1)$ and define $\Delta_m = l^p(1 + \zeta)^m$, $m \geq 0$. Let $M$ be the smallest integer such that $l^p(1 + \zeta)^M \geq h^p$; this clearly implies $M \leq p\{\log(1 + \zeta)\}^{-1}\log(h/l) + 1$. For $1 \leq j \leq M$, let $N_j = \lceil \sqrt{32p/\pi}\, a/\Delta_{j-1}^{1/p} \rceil$. For $1 \leq i \leq N_j$, $1 \leq j \leq M$, define
$$E_{ij} = \left(-a + \frac{2a(i-1)}{N_j},\ -a + \frac{2ai}{N_j}\right]^p \times (\Delta_{j-1}, \Delta_j].$$
Then for $(\theta_1, \Sigma_1)$ and $(\theta_2, \Sigma_2) \in \{(\theta, \Sigma) : (\theta, \det(\Sigma)) \in E_{ij}\}$ we have that $\|\phi_{\theta_1,\Sigma_1} - \phi_{\theta_2,\Sigma_2}\|_1 < \epsilon$. Let $N$ be the minimum number of balls needed to cover
$\Theta_{a,l,h} = \{\phi_{\theta,\Sigma} : \|\theta\| \leq a, \sqrt{\lambda_1(\Sigma)} > l, \sqrt{\lambda_p(\Sigma)} < h\}$. Clearly
$$N \leq \sum_{j=1}^{M}\left(\sqrt{\frac{32p}{\pi}}\,\frac{a}{\Delta_{j-1}^{1/p}} + 1\right)^{p} \leq D\left\{\left(\frac{a}{l}\right)^{p}\left(\frac{32p}{\pi}\right)^{p/2}\sum_{j=1}^{M}\frac{1}{(1+\zeta)^{j-1}} + M\right\} \leq d_1\left(\frac{a}{l}\right)^{p} + d_2\log\frac{h}{l} + d_3.$$
Let $\Theta_\pi = \{\pi^m = (\pi_1, \ldots, \pi_m)\}$. Fix $\pi_1^m$ and $\pi_2^m \in \Theta_\pi$ and, for $k = 1, 2$, let $V_h^{(k)} = \pi_h^{(k)}\big(1 - \sum_{l<h}\pi_l^{(k)}\big)^{-1}$; covering $\Theta_\pi$ as in [25] then contributes the remaining $d_4 m\log(d_5 m)$ term in the bound.

Proof of Corollary 1. Let $a_n = C_1\sqrt{n}$ and $l_n = C_2/\sqrt{n}$, and consider $P_0\{\|\theta\| > a_n\}$, $P_0\{\sqrt{\lambda_1(\Sigma)} \leq l_n\}$ and $P_0\{\sqrt{\lambda_p(\Sigma)} \geq h_n\}$. Note that, to shorten the notation, we write $P_0\{\|\theta\| > a\}$ and $P_0\{\sqrt{\lambda_j(\Sigma)} \leq a\}$ for $P_0\{(\theta, \Sigma) : \|\theta\| > a\}$ and $P_0\{(\theta, \Sigma) : \sqrt{\lambda_j(\Sigma)} \leq a\}$. Since $\theta \sim N(\theta_0, \Omega_0)$ we have, for some constant $d > 0$, $P_0\{\|\theta\| > a_n\} \leq d\exp(-a_n^2) \leq d\exp(-C_1^2 n)$. Then
$$P_0\left\{\sqrt{\lambda_1(\Sigma)} \leq l_n\right\} = P_0\left\{\lambda_p(\Sigma^{-1}) \geq l_n^{-2}\right\} \leq P_0\left\{\mathrm{tr}(\Sigma^{-1}) \geq l_n^{-2}\right\}.$$
By the definition of the Wishart distribution we have that $\mathrm{tr}(\Sigma^{-1}) \sim \sigma_0 \chi^2_{pr}$ and, since the $\chi^2$ has an exponential tail, we have
$$P_0\left\{\sqrt{\lambda_1(\Sigma)} \leq l_n\right\} \leq P_0\left\{\mathrm{tr}(\Sigma^{-1}) \geq \sigma_0^{-1} C_2^{-2} n\right\} \lesssim e^{-cn}.$$
Finally,
$$P_0\left\{\sqrt{\lambda_p(\Sigma)} \geq h_n\right\} \leq P_0\left\{\mathrm{tr}(\Sigma) \geq h_n^2\right\} \leq h_n^{-2}\,E_{P_0}\{\mathrm{tr}(\Sigma)\} = h_n^{-2}\,\mathrm{tr}(E_{P_0}\{\Sigma\})$$
by Markov's inequality. Since $\Sigma$ is inverse-Wishart distributed, its expectation is a matrix with finite entries and hence its trace is finite. It follows that $P_0\{\sqrt{\lambda_p(\Sigma)} \geq h_n\} \lesssim e^{-cn}$, which concludes the proof.

Proof of Corollary 2. We prove the result as stated in the within-brackets version. Let $\{a_n\}$, $\{h_n\}$ and $\{l_n\}$ be as in the proof of Corollary 1, so that condition 9 is satisfied, and recall that $P_0\{\|\theta\| > a_n\}$ is exponentially small. Then
$$P_0\left\{\sqrt{\lambda_1(\Sigma)} \leq l_n,\ \sqrt{\lambda_p(\Sigma)} \geq h_n\right\} \leq P_0\left\{\sqrt{\lambda_1(\Sigma)} \leq l_n\right\} + P_0\left\{\sqrt{\lambda_p(\Sigma)} \geq h_n\right\}.$$
We start with the condition on the smallest eigenvalue. We have
$$P_0\left\{\sqrt{\lambda_1(\Sigma)} \leq l_n\right\} = P_0\left\{\lambda_1(\Omega) + \lambda_1(\Gamma\Gamma^T) \leq l_n^2\right\} \leq \Pi_\Omega\left\{\lambda_1(\Omega) \leq l_n^2\right\}.$$
Then
$$\Pi_\Omega\left\{\lambda_1(\Omega) \leq l_n^2\right\} = \Pi_\Omega\left\{\min_{j=1,\ldots,p}\sigma_j^2 \leq l_n^2\right\} = 1 - \Pi_\Omega\left\{\sigma_j^2 \geq l_n^2\right\}^p \leq p\,\Pi_\Omega\left\{\sigma_j^2 \leq l_n^2\right\}.$$