Document not found! Please try again

Bernstein-von Mises Theorems for Functionals of Covariance Matrix

3 downloads 44 Views 414KB Size Report
Dec 1, 2014 - ST] 1 Dec 2014. Bernstein-von Mises Theorems for Functionals of Covariance. Matrix ∗. Chao Gao and Harrison H. Zhou. Yale University.
arXiv:1412.0313v1 [math.ST] 1 Dec 2014

Bernstein-von Mises Theorems for Functionals of Covariance Matrix ∗ Chao Gao and Harrison H. Zhou Yale University November 30, 2014

Abstract We provide a general theoretical framework to derive Bernstein-von Mises theorems for matrix functionals. The conditions on functionals and priors are explicit and easy to check. Results are obtained for various functionals including entries of covariance matrix, entries of precision matrix, quadratic forms, log-determinant, eigenvalues in the Bayesian Gaussian covariance/precision matrix estimation setting, as well as for Bayesian linear and quadratic discriminant analysis. Keywords. Bernstein-von Mises Theorem, Bayes Nonparametrics, Covariance Matrix.

1

Introduction

The celebrated Bernstein-von Mises (BvM) theorem [20, 3, 29, 21, 27] justifies Bayesian methods from a frequentist point of view. It bridges the gap between Bayesians and frequentists.  Consider a parametric model Pθ : θ ∈ Θ , and a prior distribution θ ∼ Π. Suppose we have i.i.d. observations X n = (X1 , ..., Xn ) from the product measure Pθn∗ . Under some weak assumptions, Bernstein-von Mises theorem shows that the conditional distribution of √

ˆ n n(θ − θ)|X

is asymptotically N (0, V 2 ) under the distribution Pθn∗ with some centering θˆ and covariance V 2 when n → ∞. In a local asymptotic normal (LAN) family, the centering θˆ can be taken as the maximum likelihood estimator (MLE) and V 2 as the inverse of the Fisher information matrix. An immediate consequence of the Bernstein-von Mises theorem is that the distributions √ ∗

ˆ n n(θ − θ)|X

and



n(θˆ − θ)|θ = θ ∗

The research of Chao Gao and Harrison H. Zhou is supported in part by NSF Grant DMS-1209191.

1

are asymptotically the same under the sampling distribution Pθn∗ . Note that the first one, known as the posterior, is of interest to Bayesians, and the second one is of interest to frequentists in the large sample theory. Applications of Bernstein-von Mises theorem include constructing confidence sets from Bayesian methods with frequentist coverage guarantees. Despite the success of BvM results in the classical parametric setting, little is known about the high-dimensional case, where the unknown parameter is of increasing or even infinite dimensions. The pioneering works of [11] and [13] (see also [17]) showed that generally BvM may not be true in non-classical cases. Despite the negative results, further works on some notions of nonparametric BvM provide some positive answers. See, for example, [22, 8, 9, 24]. In this paper, we consider the question whether it is possible to have BvM results for matrix functionals, such as matrix entries and eigenvalues, when the dimension of the matrix p grows with the sample size n. This paper provides some positive answers to this question. To be specific, we consider a multivariate Gaussian likelihood and put a prior on the covariance matrix. We prove that the posterior distribution has a BvM behavior for various matrix functionals including entries of covariance matrix, entries of precision matrix, quadratic forms, log-determinant, and eigenvalues. All of these conclusions are obtained from a general theoretical framework we provide in Section 2, where we propose explicit easy-to-check conditions on both functionals and priors. We illustrate the theory by both conjugate and non-conjugate priors. A slight extension of the general framework leads to BvM results for discriminant analysis. Both linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are considered. This work is inspired by a growing interest in studying the BvM phenomena on a lowdimensional functional of the whole parameter. That is, the asymptotic distribution of √ n(f (θ) − fˆ)|X n , with f being a map from Θ to Rd , where d does not grow with n. A special case is the semiparametric setting, where θ = (µ, η) contains both a parametric part µ and a nonparametric part η. The functional f takes the form of f (µ, η) = µ. The works in this field are pioneered by [19] in a right-censoring model and [26] for a general theory in the semiparametric setting. However, the conditions provided by [26] for BvM to hold are hard to check when specific examples are considered. To the best of our knowledge, the first general framework for semiparametric BvM with conditions cleanly stated and easy to check is the beautiful work by [7], in which the recent advancement in Bayes nonparametrics such as [2] and [15] are nicely absorbed. [25] proves BvM for linear functionals for which the distribution of √ n(f (θ) − fˆ)|X n converges to a mixture of normal instead of a normal. At the point when this paper is drafted, the most updated theory is due to [10], which provides conditions for BvM to hold for general functionals. The general framework we provide for matrix functional BvM is greatly inspired by the framework developed in [10] for functionals in nonparametrics. However, the theory in this paper is different from theirs since we can take advantage of the structure in the Gaussian likelihood and avoid unnecessary expansion and approximation. Hence, in the covariance matrix functional case, our assumptions can be significantly weaker. 2

The paper is organized as follows. In Section 2, we state the general theoretical framework of our results. It is illustrated with two priors, one conjugate prior and one non-conjugate prior. Section 3 considers specific examples of matrix functionals and the associated BvM results. The extension to discriminant analysis is developed in Section 4. Finally, we devote Section 5 to some discussions on the assumptions and possible generalizations. Most of the proofs are gathered in Section 6.

1.1

Notation

Given a matrix A, we use ||A|| to denote its spectral norm, and ||A||F to denote its Frobenius norm. The norm || · ||, when applied to a vector, is understood to be the usual vector norm. Let S p−1 be the unit sphere in Rp . For any a, b ∈ R, we use notation a ∨ b = max(a, b) and a ∧ b = min(a, b). The probability PΣ stands for N (0, Σ) and P(µ,Ω) is for N (µ, Ω−1 ). In most cases, we use Σ to denote the covariance matrix, and Ω to denote the precision matrix (including those with superscripts or subscripts). The notation P is for a generic probability, whenever the distribution is clear in the context. We use OP (·) and oP (·) to denote stochastic orders under the sampling distribution of the data. We use C to indicate constants throughout the paper. They may be different from line to line.

2

A General Framework

Consider i.i.d. samples X n = (X1 , ..., Xn ) drawn from N (0, Σ∗ ), where Σ∗ is a p×p covariance matrix with inverse Ω∗ . A Bayes method puts a prior Π on the precision matrix Ω, and the posterior distribution is defined as   R exp l (Ω) dΠ(Ω) n B   , Π(B|X n ) = R exp ln (Ω) dΠ(Ω) where ln (Ω) is the log-likelihood of N (0, Ω−1 ) defined as ln (Ω) =

n n ˆ log det(Ω) − tr(ΩΣ), 2 2

n

X ˆ= 1 where Σ Xi XiT . n i=1

We deliberately omit the logarithmic normalizing constant in ln (Ω) for simplicity and it will not affect the definition of the posterior distribution. Note that specifying a prior on the precision matrix Ω is equivalent to specifying a prior on the covariance matrix Ω−1 . The goal of this work is to show that the asymptotic distribution of the functional f (Ω) under the posterior distribution is approximately normal, i.e.,  √  Π nV −1 f (Ω) − fˆ ≤ t|X n → P(Z ≤ t),

where Z ∼ N (0, 1), as (n, p) → ∞ jointly with some appropriate centering fˆ and variance V 2 . In this paper, we choose the centering fˆ to be the sample version of f (Ω) = f (Σ−1 ), 3

ˆ and compare the BvM results with the where Σ is replaced by the sample covariance Σ, classical asymptotical normality for fˆ in the frequentist sense. Other centering fˆ, including bias correction on the sample version, will be considered in the future work. We first provide a framework for approximately linear functionals, and then use the general theory to derive results for specific examples of priors and functionals. For clarity of presentation, we consider the cases of functionals of Σ and functionals of Ω separately. Though a functional of Σ is also a functional of Ω, we treat them separately, since some functional may be “more linear” in Σ than in Ω, or the other way around.

2.1

Functional of Covariance Matrix

Let us first consider a functional of Σ, f = φ(Σ). The functional is approximately linear in a neighborhood of the truth. We assume there is a set An satisfying An ⊂ {||Σ − Σ∗ || ≤ δn } ,

(1)

for any sequence δn = o(1), on which φ(Σ) is approximately linear in the sense that there exists a symmetric matrix Φ such that

−1   √

ˆ − tr (Σ − Σ)Φ ˆ (2) sup n Σ∗1/2 ΦΣ∗1/2 φ(Σ) − φ(Σ) = oP (1). F

An

The main result is stated in the following theorem.

Theorem 2.1. Under the assumptions of (2) and ||Σ∗ || ∨ ||Ω∗ || = O(1), if for a given prior Π, the following two conditions are satisfied: 1. Π(An |X n ) = 1 − oP (1), 2. For any fixed t ∈ R,

then

R

exp An

R



An exp



ln (Ωt ) dΠ(Ω)





= 1 + oP (1) for the perturbed precision matrix

ln (Ω) dΠ(Ω)



2t

Φ, Ωt = Ω + √ ∗1/2

n Σ ΦΣ∗1/2 F

!  √ ˆ  n φ(Σ) − φ(Σ)

≤ t X n − P Z ≤ t = oP (1), sup Π √ 2 Σ∗1/2 ΦΣ∗1/2 F t∈R

where Z ∼ N (0, 1).

The theorem gives explicit conditions on both prior and functional. The first condition says that the posterior distribution concentrates on a neighborhood of the truth under the spectral norm, on which the functional is approximately linear. The second condition says that the bias caused by the shifted parameter can be absorbed by the posterior distribution. Under both conditions, Theorem 2.1 shows that the asymptotic posterior distribution of φ(Σ) is 

2 

1/2 1/2 −1 ∗ ∗ ˆ 2n Σ N φ(Σ), ΦΣ .

F

4

2.2

Functional of Precision Matrix

We state a corresponding theorem for functionals of precision matrix in this section. The condition for linear approximation is slightly different. Consider the functional f = ψ(Ω). Let An be a set satisfying √ An ⊂ { rp||Σ − Σ∗ || ≤ δn } , (3) for some integer r > 0 and any sequence δn = o(1). We assume the functional ψ(Ω) is approximately linear on An in the sense that there exists a symmetric matrix Ψ satisfying rank(Ψ) ≤ r, such that

−1   √

∗1/2 ∗1/2 ˆ −1 ) − tr (Ω − Σ ˆ −1 )Ψ = oP (1). ΨΩ sup n Ω

ψ(Ω) − ψ(Σ F

An

(4)

The main result is stated in the following theorem.

Theorem 2.2. Under the assumptions of (4), rp2 /n = o(1) and ||Σ∗ || ∨ ||Ω∗ || = O(1), if for a given prior Π, the following conditions are satisfied: 1. Π(An |X n ) = 1 − oP (1), 2. For any fixed t ∈ R,

R

An exp

R

An



exp



ln (Ωt ) dΠ(Ω)





= 1 + oP (1) for the perturbed precision matrix

ln (Ω) dΠ(Ω)

√ 2t

Ω∗ ΨΩ∗ , Ωt = Ω − √ ∗1/2 n Ω ΨΩ∗1/2 F

then

!  √ ˆ −1 ))  n ψ(Ω) − ψ(Σ

≤ t X n − P Z ≤ t = oP (1), sup Π √ 2 Ω∗1/2 ΨΩ∗1/2 F t∈R

where Z ∼ N (0, 1).

Remark 2.1. The extra condition rp2 /n = o(1) does not appear in Theorem 2.1. We show that this condition is indeed sharp for Theorem 2.2 in Section 5.3 in comparison with the asymptotics of MLE.

2.3

Priors

In this section, we provide examples of priors. In particular, we consider both a conjugate prior and a non-conjugate prior. Note that the result of a conjugate prior can be derived by directly exploring the posterior form without applying our general theory. However, the general framework provided in this paper can handle both conjugate and non-conjugate priors in a unified way.

5

2.3.1

Wishart Prior

Consider the Wishart prior Wp (I, p + b − 1) on Ω with density function ! dΠ(Ω) 1 b−2 ∝ exp log det(Ω) − tr(Ω) , dΩ 2 2

(5)

supported on the set of symmetric positive semi-definite matrices. Lemma 2.1. Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and p/n = o(1). Then, for any integer b = O(1), the prior Π = Wp (I, p + b − 1) satisfies the two conditions in Theorem 2.1 for some An . If the extra assumption rp2 /n = o(1) is made, the two conditions in Theorem 2.2 are also satisfied for some An . Remark 2.2. In the proof of Lemma 2.1 (Section 6.2), we set r   p ∗ An = ||Σ − Σ || ≤ M , n for some M > 0. 2.3.2

Gaussian Prior

Consider Gaussian prior on Ω with density function

supported on the following set

for some constant Λ > 0.



  1 dΠ(Ω) ∝ exp − ||Ω||2F , dΩ 2

(6)

Ω = ΩT , ||Ω|| < 2Λ, ||Σ|| ≤ 2Λ , 2

n Lemma 2.2. Assume ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1) and p log = o(1). The Gaussian prior Π n defined above satisfies the two conditions in Theorem 2.1 for some appropriate An . If the 3 extra assumption rp nlog n = o(1) is made, the two conditions in Theorem 2.2 are also satisfied for some appropriate An .

Remark 2.3. In the proof of Lemma 2.2 (Section 6.3), we set ) ( r p2 log n ∗ , An = ||Σ − Σ ||F ≤ M n for some constant M > 0.

6

3

Examples of Matrix Functionals

We consider various examples of functionals in this section. The two conditions of Theorem 2.1 and Theorem 2.2 are satisfied by Wishart prior and Gaussian prior, as is shown in Lemma 2.1 and Lemma 2.2 respectively. Hence, it is sufficient to check the approximate linearity of the functional with respect to Σ or Ω for the BvM result to hold. Among the four examples we consider, the first two are exactly linear and the last two are approximately linear. In the below examples, Z is always a random variable distributed as N (0, 1).

3.1

Entry-wise Functional

We consider the elementwise functional σij = φij (Σ) and ωij = ψij (Ω). Note that these two functionals are linear with respect to Σ and Ω respectively. For σij , we write  1  1 σij = tr Σ Eij + Eji , 2 2 p×p where the matrix Eij is the (i, j)-th basis in R with 1 on its (i, j)-the element and 0 elsewhere. For ωij , we write  1  1 ωij = tr Ω Eij + Eji . 2 2   Note that rank 12 Eij + 21 Eji ≤ 2. Hence, the corresponding matrices Φ and Ψ in the linear expansion of φ and ψ are 21 Eij + 12 Eji . In view of Theorem 2.1 and Theorem 2.2, the  √ ˆ is asymptotic variance for n φ(Σ) − φ(Σ)

2

∗ ∗2 + σij . 2 Σ∗1/2 ΦΣ∗1/2 = σii∗ σjj F  √ ˆ −1 ) is The asymptotic variance for n ψ(Ω) − ψ(Σ

2

∗1/2 ∗ ∗2 ∗1/2 + ωij . 2 Ω ΨΩ

= ωii∗ ωjj F

Plugging these quantities in Theorem 2.1, Theorem 2.2, Lemma 2.1, and Lemma 2.2, we have the following Bernstein-von Mises results. Corollary 3.1. Consider the Wishart prior Π = Wp (I, p+b−1) in (5) with integer b = O(1). Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and p/n = o(1), then we have ! √  n(σ − σ ˆ ) ij ij n n PΣ∗ sup Π q − P Z ≤ t → 0, ≤ t X ∗ + σ ∗2 t∈R σii∗ σjj ij

ˆ If we additionally assume where σ ˆij is the (i, j)-th element of the sample covariance Σ. 2 p /n = o(1), then ! √  n(ωij − ω ˆ ij ) n n PΣ∗ sup Π q − P Z ≤ t → 0, ≤ t X ∗ ∗ ∗2 t∈R ωii ωjj + ωij

ˆ −1 . where ω ˆ ij is the (i, j)-th element of Σ

7

Corollary 3.2. Consider the Gaussian prior Π in (6). Assume ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1) 2 n and p log = o(1), then we have n ! √  n(σij − σ ˆij ) n n − P Z ≤ t → 0. ≤ t X PΣ∗ sup Π q ∗ + σ ∗2 t∈R σii∗ σjj ij If we additionally assume

p3 log n n

= o(1), then ! √  n(ωij − ω ˆ ij ) n n PΣ∗ sup Π q − P Z ≤ t → 0, ≤ t X ∗ ∗ ∗2 t∈R ωii ωjj + ωij

where σ ˆij and ω ˆ ij are defined in Corollary 3.1.

3.2

Quadratic Form

Consider the functional φv (Σ) = v T Σv = tr(Σvv T ) and ψv (Ω) = vΩv T = tr(Ωvv T ) for some v ∈ Rp . Therefore, the corresponding matrices Φ and Ψ are vv T . It is easy to see that rank(vv T ) = 1. The asymptotic variances are

2

2



2 Σ∗1/2 ΦΣ∗1/2 = 2|v T Σ∗ v|2 , 2 Ω∗1/2 ΨΩ∗1/2 = 2|v T Ω∗ v|2 . F

F

Plugging these representations in Theorem 2.1, Theorem 2.2, Lemma 2.1 and Lemma 2.2, we have the following Bernstein-von Mises results. Corollary 3.3. Consider the Wishart prior Π = Wp (I, p+b−1) in (5) with integer b = O(1). Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and p/n = o(1), then we have ! √ T ˆ  n(v Σv − v T Σv) n n √ PΣ∗ sup Π ≤ t X − P Z ≤ t → 0. 2|v T Σ∗ v| t∈R If we additionally assume p2 /n = o(1), then ! √ T ˆ −1 v)  n(v Ωv − v T Σ n n √ PΣ∗ sup Π − P Z ≤ t → 0. ≤ t X 2|v T Ω∗ v| t∈R

Corollary 3.4. Consider the Gaussian prior Π in (6). Assume ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1) 2 n and p log = o(1), then we have n ! √ T ˆ  n(v Σv − v T Σv) n n √ ≤ t X − P Z ≤ t → 0. PΣ∗ sup Π 2|v T Σ∗ v| t∈R 3

n = o(1), then If we additionally assume p log n ! √ T ˆ −1 v)  n(v Ωv − v T Σ n n √ − P Z ≤ t → 0. PΣ∗ sup Π ≤ t X 2|v T Ω∗ v| t∈R

8

Remark 3.1. The entry-wise functional and the quadratic form are both special cases of the functional uT Σv for some u, v ∈ Rp . It is direct to apply the general framework to this functional and obtain the result ! √ T Σv − uT Σv) ˆ  n(u ≤ t X n − P Z ≤ t → 0. PΣn∗ sup Π p |uT Σ∗ v|2 + |uT Σ∗ u||v T Σ∗ v| t∈R

Similarly, for the functional uT Ωv for some u, v ∈ Rp , we have ! √ T Ωv − uT Σ ˆ −1 v)  n(u PΣn∗ sup Π p ≤ t X n − P Z ≤ t → 0, T ∗ 2 T ∗ T ∗ |u Ω v| + |u Ω u||v Ω v| t∈R

Both results can be derived under the same conditions of Corollary 3.3 and Corollary 3.4.

3.3

Log Determinant

In this section, we consider the log-determinant functional. That is φ(Σ) = log det(Σ). Different from entry-wise functional and quadratic form, we do not need to consider log det(Ω) because of the simple observation log det(Ω) = − log det(Σ). The following lemma establishes the approximate linearity of log det(Σ). Lemma 3.1. Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and p3 /n = o(1), then for any δn = o(1), we have r   n ˆ − tr (Σ − Σ)Ω ˆ ∗ = oP (1). log det(Σ) − log det( Σ) sup √ √ p { n/p||Σ−Σ∗ ||2 ∨ p||Σ−Σ∗ || ≤δ } F

F

n

By Lemma 3.1, the corresponding matrix Φ is Ω∗ . The asymptotic variance of  ˆ is φ(Σ)

2

2 Σ∗1/2 ΦΣ∗1/2 = 2p.

√  n φ(Σ)−

F

Corollary 3.5. Consider the Wishart prior Π = Wp (I, p+b−1) in (5) with integer b = O(1). Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and p3 /n = o(1), then we have ! r    n ˆ ≤ t X n − P Z ≤ t → 0, log det(Σ) − log det(Σ) PΣn∗ sup Π 2p t∈R

ˆ is the sample covariance matrix. where Σ

Proof. By Theorem 2.1 and Lemma 2.1, we only need to check the approximate linearity of the functional. According to the proof of Lemma 2.1, the choice of An such that Π(An |X n ) = 1 − oP (1) is r   p ∗ An = ||Σ − Σ || ≤ M , n 9

q

2

for some M > 0. This implies ||Σ − ≤ M pn . Therefore, p √ An ⊂ { n/p||Σ − Σ∗ ||2F ∨ p||Σ − Σ∗ ||F ≤ δn }, Σ∗ ||F

for some δn = o(1). By Lemma 3.1, we have r   n ∗ ˆ ˆ log det(Σ) − log det( Σ) − tr (Σ − Σ)Ω sup = oP (1), p An and the approximate linearity holds.

Corollary 3.6. Consider the Gaussian prior Π in (6). Assume ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1) 3 n)2 = o(1), then we have and p (log n ! r    n ˆ ≤ t X n − P Z ≤ t → 0, PΣn∗ sup Π log det(Σ) − log det(Σ) 2p t∈R

ˆ is the sample covariance matrix. where Σ

Proof. The proof of this corollary is the same as the proof of the last one using Wishart prior. The only difference is that the choice of An , according to the proof of Lemma 2.2, is ) ( r 2 log n p , An = ||Σ − Σ∗ ||F ≤ M n for some M > 0. Therefore, p √ An ⊂ { n/p||Σ − Σ∗ ||2F ∨ p||Σ − Σ∗ ||F ≤ δn },

for some δn = o(1) under the assumption, and the approximate linearity holds. One immediate consequence of the result is the Bernstein-von Mises result for the entropy functional, defined as p p log(2π) log det(Σ) + . H(Σ) = + 2 2 2 Then it is direct that r   2n ˆ X n ≈ N (0, 1). H(Σ) − H(Σ) p

3.4

Eigenvalues

In this section, we consider the eigenvalue functional. In particular, let {λm (Σ)}pm=1 be eigenvalues of the matrix Σ with decreasing order. We investigate the posterior distribution of λm (Σ) for each m = 1, ..., p. Define the eigen-gap  ∗ ∗  m = 1,  |λ1 (Σ ) − λ2 (Σ )| δ = min{|λm (Σ∗ ) − λm−1 (Σ∗ )|, |λm (Σ∗ ) − λm+1 (Σ∗ )|} m = 2, 3, ..., p − 1,   |λ (Σ∗ ) − λ (Σ∗ )| m = p. m−1

m

The asymptotic order of δ plays an important role in the theory. The following lemma characterizes the approximate linearity of λm (Σ). 10

Lemma 3.2. Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and sup

√ √ {δ−1 n||Σ−Σ∗ ||2 ∨(δ−1 + p)||Σ−Σ∗ ||≤δn }

p √ δ n

= o(1), then for any δn = o(1), we have

  √ ˆ − tr (Σ − Σ)u ˆ ∗m u∗T n |λm (Σ∗ )|−1 λm (Σ) − λm (Σ) m = oP (1),

where u∗m is the m-th eigenvector of Σ∗ .

Lemma 3.2 implies that the corresponding Φ in the linear expansion of φ(Σ) is u∗m u∗T m , and the asymptotic variance is

2

∗1/2 ∗1/2 ΦΣ 2 Σ

= 2|λm (Σ∗ )|2 . F

We also consider eigenvalues of the precision matrix. With slight abuse of notation, we define the eigengap of λm (Ω∗ ) to be  ∗ ∗  m = 1,  |λ1 (Ω ) − λ2 (Ω )| ∗ ∗ ∗ ∗ δ = min{|λm (Ω ) − λm−1 (Ω )|, |λm (Ω ) − λm+1 (Ω )|} m = 2, 3, ..., p − 1,   |λ (Ω∗ ) − λ (Ω∗ )| m = p. m−1

m

The approximate linearity of λm (Ω) is established in the following lemma.

Lemma 3.3. Assume ||Σ∗ || ∨ ||Ω∗ || = O(1), then for any δn = o(1), we have   √ ∗ −1 ˆ −1 ) − tr (Ω − Σ ˆ −1 )u∗ u∗T = o(1), sup n|λ (Ω )| λ (Ω) − λ ( Σ m m m m m √ √ {δ−1 n||Σ−Σ∗ ||2 ∨(δ−1 + p)||Σ−Σ∗ ||≤δn }

where u∗m is the m-th eigenvector of Ω∗ .

Similarly, Lemma 3.3 implies that the corresponding Ψ in the linear expansion of ψ(Ω) is and the asymptotic variance is

u∗m u∗T m ,

2

2 Ω∗1/2 ΨΩ∗1/2 = 2|λm (Ω∗ )|2 . F

Plugging the above lemmas into our general framework, we get the following corollaries. Corollary 3.7. Consider the Wishart prior Π = Wp (I, p+b−1) in (5) with integer b = O(1). Assume ||Σ∗ || ∨ ||Ω∗ || = O(1) and δ√p n = o(1), then we have ! r    n ˆ ≤ t|X n − P Z ≤ t → 0, √ PΣn∗ sup Π λm (Σ) − λm (Σ) ∗ 2λm (Σ ) t∈R

ˆ is the sample covariance matrix. If we instead assume √p = o(1) with δ being the where Σ δ n eigengap of λm (Ω∗ ), then r     n −1 n n ˆ √ λm (Ω) − λm (Σ ) ≤ t|X − P Z ≤ t → 0. PΣ∗ sup Π 2λm (Ω∗ ) t∈R 11

Proof. We only need to check the approximate linearity. According to Lemma 2.1, the choice of An is r   p ∗ An = ||Σ − Σ || ≤ M , n for some M > 0. The assumption

p √ δ n

= o(1) implies

√ √ δ−1 n||Σ − Σ∗ ||2 ∨ (δ−1 + p)||Σ − Σ∗ || = o(1), on the set An . By Lemma 3.2 and Lemma 3.3, we have   √ ˆ − tr (Σ − Σ)u ˆ ∗ u∗T = oP (1), sup n|λm (Σ∗ )|−1 λm (Σ) − λm (Σ) m m An

and

  √ ˆ −1 ) − tr (Ω − Σ ˆ −1 )u∗m u∗T sup n|λm (Ω∗ )|−1 λm (Ω) − λm (Σ m = oP (1). An

Corollary 3.8. Consider the Gaussian prior Π in (6). Assume ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1) 2 log n and p δ√ = o(1), then we have n PΣn∗

! r    n n ˆ ≤ t|X √ sup Π − P Z ≤ t → 0, λm (Σ) − λm (Σ) ∗ 2λ (Σ ) t∈R m 2

log n ˆ is the sample covariance matrix. If we instead assume p √ where Σ = o(1) with δ being δ n ∗ the eigengap of λm (Ω ), then r     n −1 n n ˆ ) ≤ t|X − P Z ≤ t → 0. √ λm (Ω) − λm (Σ PΣ∗ sup Π 2λm (Ω∗ ) t∈R

Proof. We only need to check the approximate linearity. According to Lemma 2.2, the choice of An is ) ( r 2 log n p , An = ||Σ − Σ∗ ||F ≤ M n for some M > 0. The assumption

p2 √ log n δ n

= o(1) implies

√ √ δ−1 n||Σ − Σ∗ ||2 ∨ (δ−1 + p)||Σ − Σ∗ || = o(1), on the set An . By Lemma 3.2 and Lemma 3.3, we have   √ ˆ − tr (Σ − Σ)u ˆ ∗m u∗T sup n|λm (Σ∗ )|−1 λm (Σ) − λm (Σ) m = oP (1), An

and

  √ ˆ −1 ) − tr (Ω − Σ ˆ −1 )u∗ u∗T = oP (1). sup n|λm (Ω∗ )|−1 λm (Ω) − λm (Σ m m An

12

4

Discriminant Analysis

In this section, we generalize the theory in Section 2 to handle the BvM theorem in discriminant analysis. Let X n = (X1 , ..., Xn ) and Y n = (Y1 , ..., Yn ) be n i.i.d. training samples, where Yi ∼ N (µ∗Y , Ω∗−1 Xi ∼ N (µ∗X , Ω∗−1 Y ). X ), The discriminant analysis problem is to predict whether an independent new sample z is from the X-class or Y -class. For a given (µX , µY , ΩX , ΩY ), Fisher’s QDA rule can be written as ∆(µX , µY , ΩX , ΩY ) = −(z − µX )T ΩX (z − µX ) + (z − µY )T ΩY (z − µY ) + log

det(ΩX ) . det(ΩY )

In this section, we are going to find the asymptotic posterior distribution  √ −1  ˆ X n , Y n , z, ∆(µX , µY , ΩX , ΩY ) − ∆ nV

with some appropriate variance V 2 and some prior distribution. Since the result is conditional on the new observation z, we treat it as a fixed (non-random) vector in this section without loss of generality. Note that when ΩX = ΩY is assumed, the QDA rule can be reduced to the LDA rule. We give general results for Bernstein-von Mises theorem to hold in both cases respectively.

4.1

Linear Discriminant Analysis

Assume Ω∗X = Ω∗Y . For a given prior Π, the posterior distribution for LDA is defined as   R exp l (µ , µ , Ω) dΠ(µX , µY , Ω) n X Y B   , Π(B|X n , Y n ) = R exp ln (µX , µY , Ω) dΠ(µX , µY , Ω) where ln (µX , µY , Ω) is the log-likelihood function decomposed as

ln (µX , µY , Ω) = lX (µX , Ω) + lY (µY , Ω), where lX (µX , Ω) =

n n ˜ X ), log det(Ω) − tr(ΩΣ 2 2

˜ X = 1 Pn (Xi − µX )(Xi − µX )T , and lY (µY , Ω) is defined in the similar way. with Σ i=1 n Consider the LDA functional ∆(µX , µY , Ω) = −(z − µX )T Ω(z − µX ) + (z − µY )T Ω(z − µY ). Define the following quantities Φ=

 1 ∗ Ω (z − µ∗X )(z − µ∗X )T − (z − µ∗Y )(z − µ∗Y )T Ω∗ , 2 13

ξX = 2(z − µ∗X ),

t µY,t = µY + √ ξY , n ! n n 1X 1X T T ¯ ¯ ¯ ¯ , (Xi − X)(Xi − X) + (Yi − Y )(Yi − Y ) n n i=1 i=1

2

T ∗ Ω ξX + ξYT Ω∗ ξY . V 2 = 4 Σ∗1/2 ΦΣ∗1/2 + ξX

2t Ωt = Ω + √ Φ, n ˆ=1 Σ 2

ξY = 2(µ∗Y − z),

t µX,t = µX + √ ξX , n

F

(7)

Assume An is a set satisfying  n√  n ||µX − µ∗X ||2 + ||µY − µ∗Y ||2 + ||Σ − Σ∗ ||2 An ⊂ o  √  ∨ p ||µX − µ∗X || + ||µY − µ∗Y || + ||Σ − Σ∗ || ≤ δn ,

for some δn = o(1),

The main result for LDA is the following theorem.

Theorem 4.1. Assume that ||Σ∗ || ∨ ||Ω∗ || = O(1), p2 /n = o(1), and V −1 = O(1). If for a given prior Π, the following two conditions are satisfied: 1. Π(An |X n , Y n ) = 1 − oP (1), 

2. For any fixed t ∈ R, then

R

An

R

exp

An



ln (µX,t ,µY,t ,Ωt ) dΠ(µX ,µY ,Ω)





= 1 + oP (1),

exp ln (µX ,µY ,Ω) dΠ(µX ,µY ,Ω)

√     ˆ ≤ t|X n , Y n − P Z ≤ t = oP (1), sup Π nV −1 ∆(µX , µY , Ω) − ∆ t∈R

ˆ = ∆(X, ¯ Y¯ , Σ ˆ −1 ). where Z ∼ N (0, 1) and the centering is ∆

A curious condition in the above theorem is V −1 = O(1). The following proposition shows it is implied by the separation of the two classes. Proposition 4.1. Under the setting of Theorem 4.1, if ||µ∗X − µ∗Y || ≥ c for some constant c > 0, then we have V −1 = O(1). Proof. By the definition of V 2 , we have   T ∗ V 2 ≥ ξX Ω ξX + ξYT Ω∗ ξY ≥ C ||ξX ||2 + ||ξY ||2   = 4C ||z − µ∗X ||2 + ||z − µ∗Y ||2 ≥ 2C||µ∗X − µ∗Y ||2 ,

which is greater than a constant under the separation assumption.

14

Now we give examples of priors for LDA. Let us use independent priors. That is Ω ∼ ΠΩ ,

µX ∼ ΠX ,

µY ∼ ΠY ,

independently. The prior for the whole parameter (Ω, µX , µY ) is a product measure defined as Π = ΠΩ × ΠX × ΠY . Let ΠΩ be the Gaussian prior defined in (6). Let both ΠX and ΠY be N (0, Ip×p ). √  Theorem 4.2. Assume ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1), V −1 = O(1), and p2 = o lognn . The prior defined above satisfies the two conditions in Theorem 4.1 for some appropriate An . Thus, the Bernstein-von Mises result holds.

4.2

Quadratic Discriminant Analysis

For the general case that Ω∗X = Ω∗Y may not be true, the posterior distribution for QDA is defined as   R exp l (µ , µ , Ω , Ω ) dΠ(µX , µY , ΩX , ΩY ) n X Y X Y B   , Π(B|X n ) = R exp ln (µX , µY , ΩX , ΩY ) dΠ(µX , µY , ΩX , ΩY ) where ln (µX , µX , ΩX , ΩY ) has decomposition

ln (µX , µX , ΩX , ΩY ) = lX (µX , ΩX ) + lY (µY , ΩY ). We define the following quantities,   ΦX = −Ω∗X Σ∗X − (z − µ∗X )(z − µ∗X )T Ω∗X ,   ΦY = Ω∗Y Σ∗Y − (z − µ∗Y )(z − µ∗Y )T Ω∗Y ,

ΩX,t

ξX = 2(z − µ∗X ), ξY = 2(µ∗Y − z), 2t 2t t t = ΩX + √ ΦX , ΩY,t = ΩY + √ ΦY , µX,t = µX + √ ξX , µY,t = µY + √ ξY , n n n n n n X X ˆX = 1 ¯ ¯ T ˆY = 1 Σ (Xi − X)(X Σ (Yi − Y¯ )(Yi − Y¯ )T , i − X) , n n i=1 i=1

2

2



∗1/2 ∗1/2 ∗1/2 ∗1/2 T ∗ Ω ξX + ξYT Ω∗ ξY . (8) V 2 = 2 ΣX ΦX ΣX + 2 ΣY ΦY ΣY + ξX F

F

Assume An is a set satisfying n√   An ⊂ n ||µX − µ∗X ||2 + ||µY − µ∗Y ||2 + ||Σ − Σ∗ ||2  √  ∨ p ||µX − µ∗X || + ||µY − µ∗Y || + ||Σ − Σ∗ ||   p ∨ n/p ||ΣX − Σ∗X ||2F + ||ΣY − Σ∗Y ||2F o  √  ∨ p ||ΣX − Σ∗X ||F + ||ΣY − ΣY ||F ≤ δn ,

with some δn = o(1). The main result for QDA is the following theorem. 15

Theorem 4.3. Assume ||Σ∗ || ∨ ||Ω∗ | = O(1), V −1 = O(1), and p3 /n = o(1). If for a given prior Π, the following two conditions are satisfied: 1. Π(An |X n , Y n ) = 1 − oP (1), 

2. For any fixed t ∈ R,

R

An exp

R

An



ln (µX,t ,µY,t ,ΩX,t ,ΩY,t ) dΠ(µX ,µY ,ΩX ,ΩY )

exp

then





= 1 + oP (1),

ln (µX ,µY ,ΩX ,ΩY ) dΠ(µX ,µY ,ΩX ,ΩY )

√     ˆ ≤ t|X n , Y n − P Z ≤ t = oP (1), sup Π nV −1 ∆(µX , µY , ΩX , ΩY ) − ∆ t∈R

ˆ −1 ). ˆ = ∆(X, ¯ Y¯ , Σ ˆ −1 , Σ where Z ∼ N (0, 1) and the centering is ∆ Y X

Remark 4.1. With the new definition of V in QDA, the assumption V −1 = O(1) is also implied by the separation condition ||µX −µY || > c by applying the same argument in Proposition 4.1. Remark 4.2. For independent prior in the sense that dΠ(µX , µY , ΩX , ΩY ) = dΠ(µX , ΩX ) × dΠ(µY , ΩY ), the posterior is also independent because of the decomposition of the likelihood. In this case, we have Π(An |X n , Y n ) = ΠX (AX,n |X n ) × ΠY (AY,n |Y n ), with AX,n and AY,n being versions of An involving only (µX , ΩX ) and (µY , ΩY ). In the same way, we also have   R exp l (µ , µ , Ω , Ω ) dΠ(µX , µY , ΩX , ΩY ) n X,t Y,t X,t Y,t An   R An exp ln (µX , µY , ΩX , ΩY ) dΠ(µX , µY , ΩX , ΩY )     R R exp l (Ω , µ ) exp l (Ω , µ ) dΠ (µ , Ω ) n Y,t Y,t dΠY (µY , ΩY ) n X,t X,t X X X AY,n AX,n     × R . = R exp l (Ω , µ ) dΠ (µ , Ω ) exp l (Ω , µ ) dΠ (µ , Ω ) n X X X X X n Y Y Y Y Y AX,n AY,n Hence, for the two conditions in Theorem 4.3, it is sufficient to check 1. Π(AX,n |X n ) = 1 − oP (1), 2. For any fixed t ∈ R,

R

AX,n

R

exp

AX,n



exp



ln (ΩX,t ,µX,t ) dΠX (µX ,ΩX )





= 1 + oP (1),

ln (ΩX ,µX ) dΠX (µX ,ΩX )

and the corresponding conditions for Y , when the prior has an independent structure.

16

The example of prior we specify for QDA is similar to the one for LDA. Let us use independent priors. That is ΩX ∼ ΠΩX ,

ΩY ∼ ΠΩY ,

µX ∼ ΠX ,

µY ∼ ΠY ,

independently. The prior for the whole parameter (ΩX , ΩY , µX , µY ) is a product measure defined as Π = ΠΩX × ΠΩY × ΠX × ΠY . Let ΠΩX and ΠΩY be the Gaussian prior defined in Section 2.3.2. Let both ΠX and ΠY be N (0, Ip×p ). Theorem Assume ||Σ∗X || ∨ ||Ω∗X || ∨ ||Σ∗Y || ∨ ||Ω∗Y || ≤ Λ = O(1), V −1 = O(1) and  √ 4.4.  n p2 = o log n . The prior defined above satisfies the two conditions in Theorem 4.1 for some appropriate An . Thus, the Bernstein-von Mises result holds.

5 5.1

Discussion ˆ and ψ(Σ ˆ −1 ) Comparison: Asymptotic Normality of φ(Σ)

ˆ In this section, we present the classical results for asymptotic normality of the estimators φ(Σ) −1 ˆ ). Note that in many cases, they coincide with MLE. The purpose is to compare and ψ(Σ them with the BvM results obtained in this paper. We first review and define some notation. ˆ and ω ˆ −1 . We let Remember σ ˆij is the (i, j)-th element of Σ ˆ ij is the (i, j)-th element of Σ ∆L and ∆Q be the LDA and QDA functionals respectively. The corresponding asymptotic variances are denoted by VL2 and VQ2 , defined in (7) and (8) respectively. As p, n → ∞ jointly, ˆ or ψ(Σ ˆ −1 ) holds under different asymptotic regimes for the asymptotic normality of φ(Σ) different functionals. For comparison, we assume that VL , VQ and the eigengap δ are at constant levels. Theorem 5.1. Let p, n → ∞ jointly, then for any asymptotic regime of (p, n), √ ∗) n(ˆ σij − σij q N (0, 1), ∗2 + σ ∗ σ ∗ σij ii jj √

ˆ − v T Σ∗ v) n(v T Σv √ 2|v T Σ∗ v|

N (0, 1).

Assume p2 /n = o(1), we have



√ ∗) n(ˆ ωij − ωij q ∗2 + ω ∗ ω ∗ ωij ii jj

N (0, 1),

ˆ −1 v − v T Ω∗ v) n(v T Σ √ 2|v T Ω∗ v| 17

N (0, 1),

(9)

(10)

√ √ √

 ˆ − λm (Σ∗ ) n λm (Σ) √ 2λm (Σ∗ )

 ˆ −1 ) − λm (Ω∗ ) n λm (Σ √ 2λm (Ω∗ )

N (0, 1),

(11)

N (0, 1),

  ¯ Y¯ , Σ ˆ −1 ) − ∆L (µ∗ , µ∗ , Ω∗ ) nVL−1 ∆L (X, X Y

(12) N (0, 1).

Assume p3 /n = o(1), we have r   n ˆ − log det(Σ∗ ) log det(Σ) N (0, 1), 2p  √ −1  ˆ −1 ) − ∆Q (µ∗X , µ∗Y , Ω∗X , Ω∗Y ) ¯ Y¯ , Σ ˆ −1 , Σ nVQ ∆Q (X, Y X

(13)

(14) N (0, 1).

(15)

Since the above results are more or less scattered in the literature, we do not present their proofs in this paper. Readers who are interested can derive these results using delta method. We remark that the condition p2 /n = o(1) is sharp for (9)-(13). For (9) and (10), a common example is ω11 = eT1 Ωe1 , where eT1 = (1, 0, ..., 0). By distributional facts of inverse √ ∗ ) is not asymptotically normal if p2 /n = o(1) does not hold. Since Wishart, n(ˆ ω11 − ω11 the functional ∆L is harder than v T Ωv (the latter is a special case of the former if µ∗X and µ∗Y are known), p2 /n = o(1) is also sharp for (13). For (11) and (12), we have the following proposition to show that p2 /n = o(1) is necessary. ∗ − σ ∗ be at constant level Proposition 5.1. Consider a diagonal Σ∗ . Let the eigengap σ11 22 when p, n → ∞ jointly. Assume ||Σ∗ || ∨ ||Ω∗ || = O(1), n1/2 = o(p) and p = o(n2/3 ). Then ˆ is not √n-consistent. As a consequence, λp (Σ ˆ −1 ) = λ−1 (Σ) ˆ is not √n-consistent. λ1 (Σ) 1

The condition p3 /n = o(1) is sharp for (14) and (15). If p3 /n = o(1) does not hold, a bias correction is necessary for (14) to hold (see [4]). That the condition p3 /n = o(1) is necessary for (15) is because the functional ∆Q contains the part log det(Σ). In the next section, we are going to discuss the asymptotic regime of (p, n) for BvM and compare them with the frequentist results listed in this section.

5.2

The Asymptotic Regime of (p, n)

For all the BvM results we obtain in this paper, they assume different asymptotic regime of the sample size n and the dimension p. Ignoring the log n factor and assume constant eigengap δ and asymptotic variances for LDA and QDA, the asymptotic regime for (p, n) is summarized in the following table.

18

functional σij ωij v T Σv v T Ωv log det(Σ) λm (Σ) λm (Ω) LDA QDA

ˆ or ψ(Σ ˆ −1 ) φ(Σ) *** p2 ≪ n *** p2 ≪ n p3 ≪ n p2 ≪ n p2 ≪ n p2 ≪ n p3 ≪ n

conjugate p≪n p2 ≪ n p≪n p2 ≪ n p3 ≪ n p2 ≪ n p2 ≪ n p2 ≪ n p3 ≪ n

non-conjugate p2 ≪ n p3 ≪ n p2 ≪ n p3 ≪ n p3 ≪ n p4 ≪ n p4 ≪ n p4 ≪ n p4 ≪ n

ˆ and ψ(Σ ˆ −1 ) and for The table has three columns for the asymptotic normality of φ(Σ) BvM with conjugate and non-conjugate priors respectively. The purpose is to compare our BvM result with the classical frequentist asymptotic normality. The priors are the Wishart prior and Gaussian prior we consider in this paper. For discriminant analysis, we did not consider conjugate prior because of limit of space. The conjugate prior in the LDA and QDA settings is the normal-Wishart prior. Its posterior distribution can be decomposed as a marginal Wishart times a conditional normal. The analysis of the BvM result for this case is direct, and we claim the asymptotic regimes for LDA and QDA are p2 ≪ n and p3 ≪ n respectively without giving a formal proof. Comparing the first and the second columns, the condition for p and n we need for the BvM results with conjugate prior matches the conditions for the frequentist results. The two exceptions are σij and v T Σv, where for the frequentist asymptotic normality to hold, there is no assumption on p, n. Our technique of proof requires p ≪ n. This is because our theory requires a set An ⊂ {||Σ − Σ∗ || ≤ δn } for some δn = o(1) to satisfy Π(An |X n ) = 1 − oP (1). p The best rate of convergence for ||Σ − Σ∗ || is p/n, which leads to p ≪ n. Such assumption may be weaken if a different theory than ours can be developed (or through direct calculation by taking advantage of the conjugacy). The comparison of the second and the third columns suggests that using of non-conjugate prior requires stronger assumptions. We believe these stronger assumptions can all be weakened. The current stronger assumptions on p and n are caused the technique we use in this paper to prove posterior contraction, which is Condition 1 in Theorem 2.1 and Theorem 2.2. The current way of proving posterior contraction in nonparametric Bayes theory only allows loss functions which are at the same order of the Kullback-Leibler divergence. In the covariance matrix estimation setting, we can only deal with Frobenius loss. We choose   p2 log n ∗ 2 . An = ||Σ − Σ ||F ≤ M n For functionals of covariance such as σij and v T Σv, we need An ⊂ {||Σ − Σ∗ || ≤ δn } for some δn . We have to bound ||Σ − Σ∗ || as r p2 log n ∗ ∗ , ||Σ − Σ || ≤ ||Σ − Σ ||F ≤ M n 19

q 2 n and require M p log ≤ δn = o(1). This leads to p2 ≪ n. For functionals of precision n √ matrix, we need An ⊂ { p||Σ − Σ∗ || ≤ δn }. Again, we have bound r p3 log n √ √ ∗ ∗ p||Σ − Σ || ≤ p||Σ − Σ ||F ≤ M , n q 3 n and require M p log = o(1). This leads to p3 ≪ n. It would be great if we can prove a n p posterior contraction on {||Σ − Σ∗ || ≤ M p/n} directly without referring to the Frobenius loss. However, under the current technique of Bayes nonparemtrics [15], this is impossible. See a lower bound argument in [16].

5.3

Sharpness of The Condition rp2 /n = o(1) in Theorem 2.2

It is curious whether the condition rp2 /n = o(1) is sharp in Theorem 2.2. Let us consider the funcitonal ψ(Ω) = log det(Ω). In this case, the corresponding matrix Ψ in the linear expansion of ψ(Ω) is Ψ = Σ∗ and r = rank(Σ∗ ) = p. Then, the condition rp2 = o(1) becomes p3 /n = o(1). Since log det(Ω) = − log det(Σ) and p3 /n = o(1) is sharp for BvM to hold for log det(Σ), it is also sharp for log det(Ω).

5.4

Covariance Priors

The general framework in Section 2 only considers prior defined on precision matrix Ω. However, sometimes it is more natural to use prior defined on covariance matrix Σ, for example, Gaussian prior on Σ. Then, the first conditions in Theorem 2.1 and Theorem 2.2 are hard to check. We propose a slight variation of this condition, so that our theory can also be user-friendly for covariance priors. We first consider approximate linear functionals of Σ satisfying (2). Then, the first condition of Theorem 2.1 can be replaced by   R −1 ) dΠ(Σ) exp l (Σ n t An   = 1 + oP (1), for each fixed t ∈ R, R −1 ) dΠ(Σ) exp l (Σ n An

where Σt = Σ − √n||Σ∗1/22tΦΣ∗1/2 || Σ∗ ΦΣ∗ . Then we consider approximate linear functionals F of Ω satisfying (4). The first condition of Theorem 2.2 can be replaced by   R −1 exp l (Σ ) dΠ(Σ) n t An   = 1 + oP (1), for each fixed t ∈ R, R −1 ) dΠ(Σ) exp l (Σ n An

2t where Σt = Σ + √n||Ω∗ 1/2ΨΩ ∗1/2 || Ψ. F With the new conditions, it is direct to check them for covariance priors by change of variable, as is done in the proof of Lemma 2.1 and Lemma 2.2. In particular, for the Gaussian prior on covariance matrix, we claim the conclusion of Lemma 2.2 holds. We avoid expanding the technical details for the covariance priors in this paper due to the limit of space.

20

5.5

Relation to Matrix Estimation under Non-Frobenius Loss

As we have mentioned in the end of Section 5.2, the current Bayes nonparametric technique for proving posterior contraction rate only covers losses which are at the same order of Kullback-Leiber divergence. It cannot handle other non-intrinsic loss [16]. In the Bayes matrix estimation setting, whether we can show the following conclusion r   p n ∗ Π ||Σ − Σ || ≤ M (16) X = 1 − oP (1), n

for a general non-conjugate prior still remains open. This explains why there is so little literature in this field compared to the growing research using frequentist methods. See, for example, [5] and [6]. However, we observe that for the spectral norm loss, ||Σ − Σ∗ || ≤ 2 sup |v T (Σ − Σ∗ )v|, v∈N

where N is a subset of S p−1 with cardinality bound log |N | ≤ cp for some c > 0. The BvM result we establish for the functional v T Σv indicates that for each v, the posterior distribution of |v T (Σ − Σ∗ )v| is at the√order of n−1/2 . Therefore, heuristically, 2 supv∈N |v T (Σ − Σ∗ )v| p log |N | should be at the order of √n , which is p/n. We will use this intuition as a key idea in our future research project on the topic of Bayes matrix estimation. Once (16) is established for a non-conjugate prior (e.g. Gaussian prior in this paper), then we may use (16) to weaken the conditions in the third column of the table in Section 5.2. In fact, most entries of that column can be weakened to match the conditions in the second column for a conjugate prior. As argued in Section 5.2, (16) directly implies the concentration Π(An |X n ) = 1 − oP (1), which is Condition 1 in both Theorem 2.1 and Theorem 2.2.

6

Proofs

6.1

Proof of Theorem 2.1 & Theorem 2.2

Before stating the proofs, we first display some lemmas. The following lemma is Lemma 2 in [10]. It allows us to prove BvM results through convergence of moment generating functions. Lemma 6.1. Consider the random probability measure Pn and a fixed probability measure P . R R Suppose for any real t, the Laplace transformation etx dP (x) is finite, and etx dPn (x) → R tx e dP (x) in probability. Then, it holds that     sup Pn (−∞, t] − P (−∞, t] = oP (1). t∈R

The next lemma is an expansion of the Gaussian likelihood.

21

Lemma 6.2. Assume ||Σ∗ || ∨ ||Ω∗ || = O(1). For any symmetric matrix Φ and the perturbed precision matrix √ 2t

Φ, Ωt = Ω + √ ∗1/2 n Σ ΦΣ∗1/2 F

the following equation holds for all Ω ∈ An with An satisfying (1) or (3).

√ p   1 Σ1/2 ΦΣ1/2 2 t n n X (hj − s)2 2 F ˆ

tr (Σ− Σ)Φ ln (Ωt )−ln (Ω) = √ − − t ds,

2 Σ∗1/2 ΦΣ∗1/2 2 2 (1 − s)3 2 Σ∗1/2 ΦΣ∗1/2 F j=1 F (17) where {hj }pj=1 are eigenvalues of Σ1/2 (Ω − Ωt )Σ1/2 . The following lemma is Proposition D.1 in the supplementary material of [23], which is rooted in [12]. Lemma 6.3. Let Yl ∼ N (0, Ip×p ). Then, for any t > 0,

n

!

X

 r p 2 r p 2

n 1 T PI +t + +t ≥ 1 − 2e−nt /2 . Yl Yl − I ≤ 2

n

n n l=1

Proof of Theorem 2.1. We are going to use Lemma 6.1 and establish the convergence of moment generating function. We claim that √   t n ˆ − 1 t2 + oP (1),

φ(Σ) − φ(Σ) (18) ln (Ωt ) − ln (Ω) = √ 2 2 Σ∗1/2 ΦΣ∗1/2 F

uniformly over An . The derivation of (18) will be given at the end of the proof. Define the posterior distribution conditioning on An by ΠAn (B|X n ) = It is easy to see

Π(An ∩ B|X n ) , Π(An |X n )

for any B.

sup ΠAn (B|X n ) − Π(B|X n ) = oP (1),

(19)

B

by thefirst condition of Theorem 2.1. Now we calculation the moment generating function  √

of



ˆ n φ(Σ)−φ(Σ)

2kΣ∗1/2 ΦΣ∗1/2 kF

under the distribution ΠAn (·|X n ), which is  √  Z ˆ ! t n φ(Σ) − φ(Σ)

exp √ dΠAn (Ω|X n ) 2 Σ∗1/2 ΦΣ∗1/2 F   √ ˆ   n φ(Σ)−φ( Σ) t R √ exp + l (Ω) dΠ(Ω) n An 2kΣ∗1/2 ΦΣ∗1/2 kF   = R An exp ln (Ω) dΠ(Ω)   R   exp l (Ω ) dΠ(Ω) n t  An   = 1 + oP (1) exp t2 /2 R An exp ln (Ω) dΠ(Ω)    = 1 + oP (1) exp t2 /2 , 22

where the second equality is because of (18) and the last inequality is because of the second conditionof Theorem 2.1. We have shown that the moment generating function of √ ˆ n φ(Σ)−φ(Σ) √ 2kΣ∗1/2 ΦΣ∗1/2 kF

under the distribution ΠAn (·|X n ) converges to the moment generating func-

 tion of N 0, 1 in probability. By Lemma 6.1 and (19), we have established the desired result. To finish the proof, let us derive (18). Using the result of the likelihood expansion in Lemma 6.2, we will first show √   t2 t n ˆ

tr (Σ − Σ)Φ ln (Ωt ) − ln (Ω) = √ − + o(1), (20) 2 2 Σ∗1/2 ΦΣ∗1/2 F

where the o(1) above is uniform on An . Compare (20) with (17) in Lemma 6.2, it is sufficient to bound X 2 p Σ1/2 ΦΣ1/2 2

n (hj − s) F R1 = ds .

2 − 1 and R2 = 3 Σ∗1/2 ΦΣ∗1/2 (1 − s) 2 j=1

F

We use the following argument to bound R1 on An . 1/2 1/2 2 1/2∗ ∗1/2 2 ΦΣ || − ||Σ ΦΣ || ||Σ F F

= |tr(ΣΦΣΦ) − tr(Σ∗ ΦΣ∗ Φ)|     ≤ tr ΣΦ(Σ − Σ∗ )Φ + tr (Σ − Σ∗ )ΦΣ∗ Φ   = tr Σ1/2 ΦΣΦΣ1/2 (I − Σ−1/2 Σ∗ Σ−1/2 )   + tr Σ∗1/2 ΦΣ∗ ΦΣ∗1/2 (Σ∗−1/2 ΣΣ∗−1/2 − I)   ≤ tr Σ1/2 ΦΣΦΣ1/2 ||I − Σ−1/2 Σ∗ Σ−1/2 ||   +tr Σ∗1/2 ΦΣ∗ ΦΣ∗1/2 ||I − Σ∗−1/2 ΣΣ∗−1/2 ||

= ||Σ1/2 ΦΣ1/2 ||2F ||I − Σ−1/2 Σ∗ Σ−1/2 || + ||Σ1/2∗ ΦΣ∗1/2 ||2F ||I − Σ∗−1/2 ΣΣ∗−1/2 ||

≤ o(1)||Σ1/2 ΦΣ1/2 ||2F + o(1)||Σ1/2∗ ΦΣ∗1/2 ||2F ,

(21)

(22)

where the inequality (21) is by von Neumann’s trace inequality and the inequality (22) is due to the fact that ||Σ − Σ∗ || = o(1) on An . Rearranging the above argument, we get R1 = o(1) uniformly on An . To bound R2 , we first use Weyl’s theorem to get

√ 2t Σ1/2 ΦΣ1/2

= O(n−1/2 ), max |hj | ≤ √ 1≤j≤p n Σ∗1/2 ΦΣ∗1/2 F

23

on An . Thus, on An , we have R2

p Z hj X 2 (hj − s) ds ≤ Cn j=1 0 ≤ Cn

p X j=1

|hj |3

≤ Cn max |hj | 1≤j≤p

≤ Cn × O(n

p X j=1

−1/2

= O(n−1/2 ).

|hj |2

!

1/2

Σ ΦΣ1/2 2 F )×O

2 n Σ∗1/2 ΦΣ∗1/2 F

Hence, (20) is proved. Together with the approximate linearity condition (2) of the functional φ(Σ), (18) is proved. Thus, the proof is complete. Proof of Theorem 2.2. We follow the reasoning in the proof of Theorem 2.1 and omit some similar steps. Define Φ = −Ω∗ ΨΩ∗ . It is easy to see that



∗1/2

∗1/2 ∗1/2 ∗1/2 ΦΣ ΨΩ



.

= Σ F

F

Then by Lemma 6.2 and the similar arguments in the proof of Theorem 2.1, we obtain  √  ˆ t ntr (Σ − Σ)Φ 1

− t2 + o(1), ln (Ωt ) − ln (Ω) = √ 2 2 Ω∗1/2 ΨΩ∗1/2 F  √  ˆ uniformly on An , which is analogous to (20). We are going to approximate ntr (Σ − Σ)Φ  √  ˆ −1 ) on An . Define Ω ˆ =Σ ˆ −1 . The assumption rp2 /n = o(1) implies that by n ψ(Ω) − ψ(Σ ˆ is well defined. By Lemma 6.3, p/n = o(1). Thus, Ω ˆ − Σ∗ || = O ||Σ

r p  n

ˆ − Ω∗ || = O ||Ω

,

r p  . n

2 Using notation V = 2 Ω∗1/2 ΨΩ∗1/2 F , the approximation error on An is   √ −1/2 ˆ − tr (Σ − Σ)Φ ˆ nV ψ(Ω) − ψ(Ω)     √ −1/2 ˆ ˆ ∗ ΨΩ∗ = nV + tr (Σ − Σ)Ω tr (Ω − Ω)Ψ  √ −1/2  ∗ ˆ ˆ nV ΨΩ∗ − ΩΨΩ) = tr (Σ − Σ)(Ω   √  √ −1/2  −1/2 ∗ ∗ ∗ ˆ ˆ ˆ ˆ nV ≤ tr (Σ − Σ)Ω Ψ(Ω − Ω) + nV tr (Σ − Σ)(Ω − Ω)ΨΩ . 24

(23)

P Let the singular value decomposition of Ψ be Ψ = rl=1 dl ql qlT . Then,   ˆ ∗ Ψ(Ω∗ − Ω) ˆ tr (Σ − Σ)Ω ≤



r X

l=1 r X l=1

  ˆ ˆ ∗ ql q T (Ω∗ − Ω) |dl | tr (Σ − Σ)Ω l

 ˆ || ˆ ∗ ql ||||q T (Ω∗ − Ω) |dl |||(Σ − Σ)Ω l

ˆ − Σ||||Σ ˆ − Σ || ≤ OP ||Σ ∗

r X l=1

!

|dl | .

Similarly,   ∗ ˆ ˆ − Ω)ΨΩ tr (Σ − Σ)(Ω

ˆ − Σ||||Σ ˆ − Σ∗ || ≤ OP ||Σ Since

r X l=1

C V −1/2 ≤ C||Ψ||−1 F = qPr

|dl | .

2 l=1 dl

we have

!

,

  √ −1/2 ˆ − tr (Σ − Σ)Φ ˆ nV ψ(Ω) − ψ(Ω) √  ˆ − Σ||||Σ ˆ − Σ∗ || ≤ OP nr||Σ √  √ ˆ − Σ∗ ||2 + nr||Σ ˆ − Σ∗ ||||Σ − Σ∗ || ≤ OP nr||Σ ! r rp2 √ + rp||Σ − Σ∗ || ≤ OP n = oP (1) uniformly on An , where we have used (23) in the second last inequality above. Hence,  √  ˆ −1 ) t n ψ(Ω) − ψ(Σ 1

− t2 + oP (1), (24) ln (Ωt ) − ln (Ω) = √ 2 2 Ω∗1/2 ΨΩ∗1/2 F

uniformly on An . The remaining part of the proof are the same as the corresponding steps in the proof of Theorem 2.1. Thus, the proof is complete.

6.2

Proof of Lemma 2.1

Proof of Lemma 2.1. The proof has two parts. In the first part, we establish the first condition of the two theorems by proving a posterior contraction rate. In the second part, we 25

establish the second condition of the two theorems by showing that a change of variable is negligible under Wishart density.   ˆ + I)−1 , n + p + b − 1 . Conditioning Part I. The posterior distribution Ω|X n is Wp (nΣ on X n , let Zl |X n ∼ P(nΣ+I) −1 i.i.d. for each l = 1, 2, ..., n + p + b − 1. Then the posterior ˆ Pn+p+b−1 distribution of Ω is identical to the distribution of l=1 Zl ZlT X n . Define the set

r   p ∗1/2 ˆ ∗1/2 , Gn = ||Ω ΣΩ − I|| ≤ C n  and we have PΣn∗ (Gcn ) ≤ exp − cp by Lemma 6.3, for some c, C > 0. The event Gn implies q ˆ − Σ∗ || ≤ C||Σ∗ || p , by which we can deduce ||Σ n

 −1

1

Σ ˆ+ I

=

n



1   ˆ + 1I λmin Σ n 1

λmin (Σ∗ ) − n1 ∗

≤ 2||Ω ||.

ˆ − Σ∗ || − ||Σ

Using the obtained results, we can bound the deviation of the sample covariance by

ˆ + I)−1 − Ω∗

(n + p + b − 1)(nΣ

!

b + p − 1 1 n

ˆ − Σ∗ ) − ˆ + I)−1 (Σ Σ∗ + I Ω∗ ≤ (n + p + b − 1)(nΣ

n+b+p−1 n+b+p−1 n+p+b−1



−1

||(Σ ˆ − Σ∗ )Ω∗ || + ||(p + b − 1)(nΣ ˆ + I)−1 || + ||(nΣ ˆ + I)−1 Ω∗ || ˆ + 1I ≤ Σ

n r 2 p 2(p + b − 1) ∗ ∗ 1/2 ∗ 3/2 + ||Ω || + ||Ω∗ ||2 ≤ 2C||Σ || ||Ω || n n n r p , ≤ C ′ ||Σ∗ ||1/2 ||Ω∗ ||3/2 n

26

and the posterior deviation can be bounded by r  p n ∗ ′ ∗ 1/2 ∗ 3/2 n |X PΣ∗ Π ||Ω − Ω || > 2C ||Σ || ||Ω || n r  p n ≤ PΣn∗ Π ||Ω − Ω∗ || > 2C ′ ||Σ∗ ||1/2 ||Ω∗ ||3/2 |X IGn + PΣn∗ (Gcn ) n r

 p n

n −1 ′ ∗ 1/2 ∗ 3/2 ˆ ≤ PΣ∗ Π Ω − (n + p + b − 1)(nΣ + I) > C ||Σ || ||Ω || |X IGn + PΣn∗ (Gcn ) n

n+p+b−1

r !

X

p n

n T −1 ′ ∗ 1/2 ∗ 3/2 ˆ + I) > C ||Σ || ||Ω || = PΣ∗ P Zl Zl − (n + p + b − 1)(nΣ X

n l=1

+PΣn∗ (Gcn )

r ! n+p+b−1

1 X 1 p n

≤ PΣn∗ P (n + p + b − 1)−1 C ′ ||Σ∗ ||1/2 ||Ω∗ ||3/2 Wl WlT − I > X ˆ + n−1 I)−1 ||

2 ||(Σ n l=1

+PΣn∗ (Gcn )

r ! n+p+b−1

1

X p

′ ∗ 1/2 ∗ 1/2 T −1 + PΣn∗ (Gcn ) Wl Wl − I > C ||Σ || ||Ω || ≤ P (n + p + b − 1)

4

n l=1  ′ ≤ exp − c p ,

where we use Wl ∼ N (0, I) in the above equations. In summary, we have proved r ! p n ∗ ′ ∗ 1/2 ∗ 3/2 n PΣ∗ Π ||Ω − Ω || ≤ 2C ||Σ || ||Ω || → 1, X n

(25)

which implies

r   p n ∗ ||Σ − Σ || ≤ M X → 1, n

PΣn∗ Π

with some sufficiently large M > 0. We choose r   p ∗ An = ||Σ − Σ || ≤ M , n |X n )

= 1 − oP (1) is true. For Theorem 2.1, let δn = M

q

p n.

Then δn = o(1) by q 2 assumption, and An ⊂ {||Σ − Σ∗ || ≤ δn }. For Theorem 2.2, let δn = M rnp , then we have √ δn = o(1) and An ⊂ { rp||Σ − Σ∗ || ≤ δn }. Part II. Note that the proof for this part is the same for both Theorem 2.1 and Theorem 2.2 by letting Φ = −Ω∗ ΨΩ∗ . We introduce the notation

−1 √

˜= Φ Φ. 2 Σ∗1/2 ΦΣ∗1/2 so that Π(An

F

  R Now we study the integral An exp ln (Ωt ) dΠ(Ω). Let N (p, b) be the normalizing constant 27

of Wp (I, p + b − 1). We have Z   exp ln (Ωt ) dΠ(Ω) An Z   −1 ˜ + b − 2 log det(Ω) − 1 tr(Ω) dΩ exp ln (Ω + 2tn−1/2 Φ) = N (p, b) 2 2 Z An   b−2 ˜ − 1 tr(Γ − 2tn−1/2 Φ) ˜ dΓ log det(Γ − 2tn−1/2 Φ) exp ln (Γ) + = N −1 (p, b) 2 2 ˜ An +2tn−1/2 Φ Z    b − 2 ˜ + 1 tr(2tn−1/2 Φ) ˜ dΠ(Ω). log det(I − 2tn−1/2 Ω−1 Φ) exp ln (Ω) exp = 2 2 ˜ An +2tn−1/2 Φ  ˜ ⊂ {Ω : Ω > 0, Ω = ΩT }. The above integrals are meaningful because An ∪ An + 2tn−1/2 Φ Note that r  

p

−1/2 ˜ −1 ∗ −1/2 ˜ Φ) − Σ ≤ M An + 2tn Φ = (Ω − 2tn . n q  p ′ ′′ ′ ˜ =o Since ||2tn−1/2 Φ|| n , there exist M , M arbitrarily close to M such that M < M < M ′′ and

˜ ⊂ A′′ A′n ⊂ An + 2tn−1/2 Φ n q o n q o n for A′n = ||Σ − Σ∗ || ≤ M ′ np and A′′n = ||Σ − Σ∗ || ≤ M ′′ np . The result (25) implies

Π(A′n |X n ) = 1 − oP (1) and Π(A′′n |X n ) = 1 − oP (1) are also true when M ′ , M, M ′′ are large ˜ N be the nuclear norm of Φ, ˜ defined as the sum of its absolute eigenvalues. enough. Let ||Φ|| ′′ Note that on An , ˜ N ≤ C ||Φ||N ≤ C √p. ||Φ|| ||Φ||F Since b − 2 1 −1/2 −1 ˜ −1/2 ˜ sup log det(I − 2tn Ω Φ) + tr(2tn Φ) 2 2 A′′ n ˜ −1/2 ||N + ||Φ|| ˜ N ≤ tn−1/2 sup |b − 2|||Ω−1/2 ΦΩ A′′ n

p ≤ O( p/n) = o(1),

we have

and

The facts that

Z Z

An

Z   exp ln (Ωt ) dΠ(Ω) ≤ 1 + o(1) 

Z   exp ln (Ωt ) dΠ(Ω) ≤ 1 − o(1) 

An Π(A′n |X n )

Π(A′′n |X n )

A′′ n

A′n

  exp ln (Ω) dΠ(Ω),   exp ln (Ω) dΠ(Ω).

= 1 − oP (1) and = 1 − oP (1) lead to   R An exp ln (Ωt ) dΠ(Ω)   = 1 + oP (1). R An exp ln (Ω) dΠ(Ω) 28

6.3

Proof of Lemma 2.2

Now we are going to prove Lemma 2.2. Like the proof of Lemma 2.1, it has two parts. The first part is to show posterior contraction on some appropriate set An . Note that Wishart prior is a conjugate prior. The posterior contraction can be directly calculated. For the Gaussian prior, its non-conjugacy requires to apply some general result from nonparametric Bayes theory. To be specific, we follow the testing approach in [2] and [15]. The outline of using testing approach to prove posterior contraction for Bayesian matrix estimation is referred to Section 5 in [14]. We first state some lemmas. Lemma 6.4. Assume p2 = o(n/ log n) and ||Σ∗ || ∨ ||Ω∗ || ≤ Λ = O(1). For the Gaussian prior Π, we have    p2 log n  ≥ exp − Cp2 log n , Π ||Ω||2 ||Σ − Σ∗ ||2F ≤ n

for some constant C > 0.

The next lemma is Lemma 5.1 in [14]. n Lemma 6.5. Let Kn = ||Ω||2 ||Σ − Σ∗ ||2F ≤ PΣn∗

Z

p2 log n n

o

. Then for any b > 0, we have

   exp ln (Ω) − ln (Ω∗ ) dΠ(Ω) ≤ Π(Kn ) exp − (b + 1)p2 log n

!

  ≤ exp − Cb2 p2 log n ,

for some constant C > 0.

The next lemma is Lemma 5.9 in [14]. Lemma 6.6. For ||Σ∗ ||∨||Ω∗ || ≤ Λ = O(1) and ||Σ1 ||∨||Ω1 || ≤ 2Λ, there exist small δ, δ′ > 0 only depending on Λ, and a testing function φ such that   PΣn∗ φ ≤ 2 exp − Cδ′ ||Σ∗ − Σ1 ||2F , sup

{Σ∈supp(Π):||Σ−Σ1 ||F ≤δ||Σ∗ −Σ1 ||F }

for some constant C > 0.

  PΣn (1 − φ) ≤ 2 exp − Cδ′ ||Σ∗ − Σ1 ||2F ,

Proof of Lemma 2.2. Like what we have done in the Wishart case, the proof has two parts. In the first part, we establish the first condition of the two theorems by proving a posterior contraction rate. In the second part, we establish the second condition of the two theorems by showing that a change of variable is negligible under Gaussian density. Part I. Define ) ( r 2 log n p , An = ||Σ − Σ∗ ||F ≤ M n 29

for some M sufficiently large. Then, we may write   R ∗ ) dΠ(Ω) exp l (Ω) − l (Ω n n Acn Nn   = . Π(Acn |X n ) = R Dn exp ln (Ω) − ln (Ω∗ ) dΠ(Ω)

Let us establish a testing between the following hypotheses: H0 : Σ = Σ∗

vs

H1 : Σ ∈ Acn ∩ supp(Π).

c There exists {Σj }N j=1 ⊂ An ∩ supp(Π), such that (

Acn

∩ supp(Π) ⊂ supp(Π) ∩

∪N j=1

||Σ − Σj ||F ≤

r

p2 log n n

)!

.

We choose the smallest N , which is determined by the covering number. Since Acn ∩supp(Π) ⊂ √ {||Ω||F ≤ 2Λ p}, we have √ ! 2Λ n ′ 2 ≤ Cp2 log n. log N ≤ C p log √ p log n By Lemma 6.6, there exists φj such that PΣn∗ φj sup {Σ∈supp(Π):||Σ−Σj ||F ≤





 ≤ 2 exp − CM p log n ,   PΣn (1 − φj ) ≤ 2 exp − CM 2 p2 log n . 2 2

p2 log n/n}

Define φ = max1≤j≤N φj . Using union bound to control the testing error, we have   PΣn∗ φ ≤ exp − C1 M 2 p2 log n ,   n 2 2 sup PΣ (1 − φ) ≤ exp − C1 M p log n , {Σ∈Acn ∩supp(Π)}

for sufficiently large M . We bound Π(Acn |X n ) by

 PΣn∗ Π(Acn |X n ) ≤ PΣn∗ Π(Acn |X n )(1 − φ)I Dn > exp(−2p2 log n)   +PΣn∗ φ + PΣn∗ Dn ≤ exp(−2p2 log n) Z    n 2 exp ln (Ω) − ln (Ω∗ ) (1 − φ)dΠ(Ω) ≤ exp 2p log n PΣ∗ Acn   +PΣn∗ φ + PΣn∗ Dn ≤ exp(−2p2 log n) Z  2 PΣn (1 − φ)dΠ(Ω) ≤ exp 2p log n Acn   +PΣn∗ φ + PΣn∗ Dn ≤ exp(−2p2 log n)  ≤ exp 2p2 log n sup PΣn (1 − φ) c Σ∈An ∩supp(Π)   2 n n +PΣ∗ φ + PΣ∗ Dn ≤ exp(−2p log n) . 30

In the upper bound above, the first two terms are bounded by the testing error we have established. The last term can be bounded by combining the results of Lemma 6.4 and Lemma 6.5. Hence, we have proved that q

Π(Acn |X n ) = 1 − oP (1). p2 log n . n

Then δn = o(1) by assumption, and An ⊂ {||Σ − q 3 Σ∗ || ≤ δn }. For Theorem 2.2, let δn = M rp nlog n , then we have δn = o(1) and An ⊂ √ { rp||Σ − Σ∗ || ≤ δn }. Part II. Let ΠG induce a prior distribution on symmetric Ω with each of the upper triangular element independently following N (0, 1). The density of ΠG is   1 dΠG (Ω) ¯ 2 , = ξp−1 exp − ||Ω|| F dΩ 2 ¯ to zero out the lower triangular elements of Ω except the diagonal part and where we use Ω ξp is the normalizing constant. Write Z   exp ln (Ωt ) dΠ(Ω) An Z  1 ¯ 2 ¯ −1 = ξp exp ln (Ωt ) − ||Ω|| F dΩ. 2 An For Theorem 2.1, let δn = M

˜ defined in the proof of Lemma 2.1, we have Remembering the notation Φ Z  1 ¯ 2 ¯ exp ln (Ωt ) − ||Ω|| F dΩ 2 An Z   1 ¯ ¯˜ 2 dΓ ¯ exp ln (Γ) − ||Γ = − 2tn−1/2 Φ|| F 2 ˜ An +2tn−1/2 Φ Z   1 ¯ 2 ¯˜  − 2t2 n−1 ||Φ|| ¯˜ 2 dΓ. −1/2 ¯ ¯ + 2tn tr Γ Φ exp ln (Γ) − ||Γ|| = F F 2 ˜ An +2tn−1/2 Φ

We may choose M ′ , M ′′ arbitrarily close to M such that M ′ < M < M ′′ and A′n ⊂ An + ˜ ⊂ A′′n for 2tn−1/2 Φ ( ) ( ) r r p2 log n p2 log n ′ ∗ ′ ′′ ∗ ′′ An = ||Σ − Σ ||F ≤ M , An = ||Σ − Σ ||F ≤ M . n n ! q  2 p log n ˜ F = O n−1/2 = o This can always be done because ||2tn−1/2 Φ|| . Moreover, we n

have Π(A′n |X n ) = 1 − oP (1), Π(A′′n |X n ) = 1 − oP (1) and ¯˜  − 2t2 n−1 ||Φ|| ¯˜ 2 ¯Φ sup 2tn−1/2 tr Γ F A′′ n

−1/2 −1 ˜ 2 ˜ ≤ C sup n ||Γ||F ||Φ||F + n ||Φ||F A′′ n

= O

r

1 p + n n

!

= o(1). 31

Therefore, using the same argument in the proof of Lemma 2.1, we have   R exp l (Ω ) n t dΠ(Ω) An   = 1 + oP (1). R exp l (Ω) dΠ(Ω) n An

This completes the proof.

6.4

Proof of Technical Lemmas

Proof of Lemma 6.2. First, we show Ωt is a valid precision matrix under the event An , i.e., Ωt > 0. Using Weyl’s theorem, we have |λmin (Ωt ) − λmin (Ω∗ )| ≤ ||Ωt − Ω|| + ||Ω − Ω∗ ||, where the first term is bounded by

Hence,

√ ||Φ|| 2t

||Ωt − Ω|| ≤ √ n Σ∗1/2 ΦΣ∗1/2

= O(n−1/2 ). F

|λmin (Ωt ) − λmin (Ω∗ )| ≤ O(n−1/2 ) + ||Ω − Ω∗ ||.   Under the current assumption, O(n−1/2 ) + ||Ω − Ω∗ || = o λmin (Ω∗ ) . Hence, λmin (Ωt ) > 0. Knowing the fact that ln (Ωt ) is well-defined, we study ln (Ωt ) − ln (Ω),    n n ˆ ln (Ωt ) − ln (Ω) = tr Σ(Ω − Ωt ) + log det I − (Ω − Ωt )Σ 2 2  n  ˆ n  = tr (Σ − Σ)(Ω − Ωt ) + tr Σ1/2 (Ω − Ωt )Σ1/2 2 2   n 1/2 + log det I − Σ (Ω − Ωt )Σ1/2 . 2 Let {hj }pj=1 be eigenvalues of Σ1/2 (Ω − Ωt )Σ1/2 . Then, we have

 n   n  1/2 tr Σ (Ω − Ωt )Σ1/2 + log det I − Σ1/2 (Ω − Ωt )Σ1/2 2 2 p   X n = hj + log(1 − hj ) 2 j=1 p p Z n X 2 n X hj (hj − s)2 hj − ds = − 4 2 (1 − s)3 j=1 j=1 0 p Z n 1/2 n X hj (hj − s)2 1/2 2 = − ||Σ (Ω − Ωt )Σ ||F − ds, 4 2 (1 − s)3 0 j=1

32

R h (hj −s)2 where 0 j (1−s) 3 ds is the remainder of the Taylor expansion. Therefore, we have obtained the expansion ln (Ωt ) − ln (Ω) p Z  n n X hj (hj − s)2 n  ˆ 1/2 1/2 2 tr (Σ − Σ)(Ω − Ωt ) − ||Σ (Ω − Ωt )Σ ||F − ds = 2 4 2 (1 − s)3 j=1 0 √ p Z   t2 ||Σ1/2 ΦΣ1/2 ||2 t n n X hj (hj − s)2 F ˆ = √ ds. tr (Σ − Σ)Φ − − 2 ||Σ∗1/2 ΦΣ∗1/2 ||2F 2 (1 − s)3 2||Σ∗1/2 ΦΣ∗1/2 ||F j=1 0 The proof is complete. Proof of Lemma 6.4. Define ΠG to be the distribution which specifies i.i.d. N (0, 1) on the upper triangular part of Ω and then take the lower triangular part to satisfy ΩT = Ω. Define D = {||Ω|| ∨ ||Σ|| ≤ 2Λ}. Then according to the definition of Π, we have ΠG (B ∩ D) , ΠG (D)

Since Π_G(D) ≤ 1, we have

Π(B) ≥ Π_G(B ∩ D), for any B.

In particular, we have

Π( ||Ω||²||Σ − Σ*||²_F ≤ p² log n/n ) ≥ Π_G( ||Ω||²||Σ − Σ*||²_F ≤ p² log n/n, ||Ω|| ∨ ||Σ|| ≤ 2Λ ).

Since p²/n = o(1), we have

{ ||Ω − Ω*||_F ≤ p√(log n) / ((2Λ)³√n), ||Ω|| ∨ ||Σ|| ≤ 2Λ } ⊂ { ||Ω||²||Σ − Σ*||²_F ≤ p² log n/n }.

Thus,

Π( ||Ω||²||Σ − Σ*||²_F ≤ p² log n/n ) ≥ Π_G( ||Ω − Ω*||_F ≤ p√(log n) / ((2Λ)³√n) ).

Calculating with the Gaussian density directly, for example, according to Lemma E.1 in [14], we have

Π_G( ||Ω − Ω*||_F ≤ p√(log n) / ((2Λ)³√n) )
≥ e^{−||Ω*||²_F} [ P( |Z| ≤ √(log n/(cn)) ) ]^{p(p+1)/2}
≥ exp( −||Ω*||²_F − Cp² log n ),

where Z ∼ N(0, 1). The proof is complete by observing that ||Ω*||²_F = o(p² log n) under the assumption.

A Proof of Theorem 4.1 & Theorem 4.3

Lemma A.1. Under the setting of Theorem 4.1, assume p²/n = o(1) and ||Σ*|| ∨ ||Ω*|| = O(1). Then we have

l_n(µ_{X,t}, µ_{Y,t}, Ω_t) − l_n(µ_X, µ_Y, Ω)
= (2t√n/V) tr( (Σ − Σ̂)Φ ) − (t√n/V)(X̄ − µ_X)^T Ω* ξ_X − (t√n/V)(Ȳ − µ_Y)^T Ω* ξ_Y − t²/2 + o_P(1),

uniformly on A_n.

Proof. Since

l_n(µ_{X,t}, µ_{Y,t}, Ω_t) − l_n(µ_X, µ_Y, Ω) = [ l_X(µ_{X,t}, Ω_t) − l_X(µ_X, Ω) ] + [ l_Y(µ_{Y,t}, Ω_t) − l_Y(µ_Y, Ω) ],

we expand both quantities in the brackets using the general notation l(µ_t, Ω_t) − l(µ, Ω). Using a Taylor expansion as in the proof of Lemma 6.2 and the notation Σ̃ = (1/n) Σ_{i=1}^n (X_i − µ)(X_i − µ)^T, we have

l(µ_t, Ω_t) − l(µ, Ω)
= (n/2) tr( (Σ̃ − Σ)(Ω − Ω_t) ) − n(µ − µ_t)^T Ω_t (X̄ − µ) − (n/4) ||Σ^{1/2}(Ω − Ω_t)Σ^{1/2}||²_F − (n/2)(µ − µ_t)^T Ω_t (µ − µ_t) − (n/2) Σ_{j=1}^p ∫_0^{h_j} (h_j − s)²/(1 − s)³ ds
= (t√n/V) tr( (Σ̃ − Σ)Φ ) + (t√n/V) ξ^T Ω_t (X̄ − µ) − (t²/V²) ||Σ^{1/2}ΦΣ^{1/2}||²_F − (t²/2V²) ξ^T Ω_t ξ − (n/2) Σ_{j=1}^p ∫_0^{h_j} (h_j − s)²/(1 − s)³ ds,

where {h_j}_{j=1}^p are the eigenvalues of Σ^{1/2}(Ω − Ω_t)Σ^{1/2}. The same proof as in Lemma 6.2 implies

(n/2) Σ_{j=1}^p ∫_0^{h_j} (h_j − s)²/(1 − s)³ ds = o(1), on A_n.

Therefore,

l_n(µ_{X,t}, µ_{Y,t}, Ω_t) − l_n(µ_X, µ_Y, Ω)
= (2t√n/V) tr( (Σ − (Σ̃_X + Σ̃_Y)/2) Φ ) + (t√n/V)(X̄ − µ_X)^T Ω_t ξ_X + (t√n/V)(Ȳ − µ_Y)^T Ω_t ξ_Y
  − (t²/2V²) [ 4||Σ^{1/2}ΦΣ^{1/2}||²_F + ξ_X^T Ω_t ξ_X + ξ_Y^T Ω_t ξ_Y ] + o(1).

We approximate (2t√n/V) tr( (Σ − (Σ̃_X + Σ̃_Y)/2) Φ ) by (2t√n/V) tr( (Σ − Σ̂)Φ ), and the approximation error is bounded by

(C√n/V) |tr( (Σ̂_X − Σ̃_X)Φ )| + (C√n/V) |tr( (Σ̂_Y − Σ̃_Y)Φ )|
≤ (C√n/V) [ (X̄ − µ_X)^T Φ (X̄ − µ_X) + (Ȳ − µ_Y)^T Φ (Ȳ − µ_Y) ]
≤ C||Φ|| V^{-1} √n [ ||X̄ − µ_X||² + ||Ȳ − µ_Y||² ]
≤ C||Φ|| V^{-1} √n [ ||µ_X − µ*_X||² + ||µ_Y − µ*_Y||² + O_P(p/n) ]
= o_P(1),

under A_n and the assumption p²/n = o(1), where we have used the fact that ||Φ||/V ≤ C. We approximate (t√n/V)(X̄ − µ_X)^T Ω_t ξ_X by (t√n/V)(X̄ − µ_X)^T Ω* ξ_X, and the difference is bounded by

C√n V^{-1} |ξ_X^T (Ω_t − Ω)(X̄ − µ_X)| + C√n V^{-1} |ξ_X^T (Ω − Ω*)(X̄ − µ_X)|
≤ CV^{-2} ||ξ_X|| ||X̄ − µ_X|| ||Φ|| + C√n V^{-1} ||ξ_X|| ||Ω − Ω*|| ||X̄ − µ_X||
≤ C||X̄ − µ_X|| + C√n ||Ω − Ω*|| ||X̄ − µ_X||
≤ C [ ||µ_X − µ*_X|| + √n ||µ_X − µ*_X|| ||Ω − Ω*|| + √p ||Ω − Ω*|| + O_P(√(p/n)) ]
= o_P(1),

under A_n and the fact that ||ξ_X||/V ≤ C. Using the same argument, we can also approximate (t√n/V)(Ȳ − µ_Y)^T Ω_t ξ_Y by (t√n/V)(Ȳ − µ_Y)^T Ω* ξ_Y. Now we approximate the quadratic terms. Using the same argument as in the proof of Lemma 6.2, we have

| ||Σ^{1/2}ΦΣ^{1/2}||²_F − ||Σ*^{1/2}ΦΣ*^{1/2}||²_F | / V² = o(1).

We also have

|ξ_X^T (Ω_t − Ω*) ξ_X| / V² ≤ C ( ||Ω − Ω*|| + ||Ω_t − Ω|| ) = o(1),

and the same bound for |ξ_Y^T (Ω_t − Ω*) ξ_Y| / V². Therefore,

(t²/2V²) [ 4||Σ^{1/2}ΦΣ^{1/2}||²_F + ξ_X^T Ω_t ξ_X + ξ_Y^T Ω_t ξ_Y ] = t²/2 + o(1),

on A_n. The proof is complete by considering all the approximations above.

Lemma A.2. Under the same setting as Lemma A.1, and further assuming V^{-1} = O(1), we have

(t√n/V) [ ∆(µ_X, µ_Y, Ω) − ∆(X̄, Ȳ, Σ̂^{-1}) ]
= (2t√n/V) tr( (Σ − Σ̂)Φ ) − (t√n/V)(X̄ − µ_X)^T Ω* ξ_X − (t√n/V)(Ȳ − µ_Y)^T Ω* ξ_Y + o_P(1),

uniformly on A_n.

Proof of Theorem 4.1. Combining Lemma A.1 and Lemma A.2, we have

l_n(µ_{X,t}, µ_{Y,t}, Ω_t) − l_n(µ_X, µ_Y, Ω) = (t√n/V) [ ∆(µ_X, µ_Y, Ω) − ∆(X̄, Ȳ, Σ̂^{-1}) ] − t²/2 + o_P(1),

uniformly on A_n. The remainder of the proof is the same as the proof of Theorem 2.1.

The proof of Theorem 4.3 is very similar to the proof of Theorem 4.1. We simply state the technical steps in the following lemmas and omit the details of the proof.

Lemma A.3. Under the setting of Theorem 4.3, assume p²/n = o(1) and ||Σ*|| ∨ ||Ω*|| = O(1). Then we have

l_n(µ_{X,t}, µ_{Y,t}, Ω_{X,t}, Ω_{Y,t}) − l_n(µ_X, µ_Y, Ω_X, Ω_Y)
= (t√n/V) tr( (Σ_X − Σ̂_X)Φ_X ) + (t√n/V) tr( (Σ_Y − Σ̂_Y)Φ_Y ) − (t√n/V)(X̄ − µ_X)^T Ω* ξ_X − (t√n/V)(Ȳ − µ_Y)^T Ω* ξ_Y − t²/2 + o_P(1),

uniformly on A_n.

Lemma A.4. Under the same setting as Lemma A.3, and further assuming V^{-1} = O(1) and p³/n = o(1), we have

(t√n/V) [ ∆(µ_X, µ_Y, Ω_X, Ω_Y) − ∆(X̄, Ȳ, Σ̂_X^{-1}, Σ̂_Y^{-1}) ]
= (t√n/V) tr( (Σ_X − Σ̂_X)Φ_X ) + (t√n/V) tr( (Σ_Y − Σ̂_Y)Φ_Y ) − (t√n/V)(X̄ − µ_X)^T Ω* ξ_X − (t√n/V)(Ȳ − µ_Y)^T Ω* ξ_Y + o_P(1),

uniformly on A_n.

Proof of Theorem 4.3. Combining Lemma A.3 and Lemma A.4, we have

l_n(µ_{X,t}, µ_{Y,t}, Ω_{X,t}, Ω_{Y,t}) − l_n(µ_X, µ_Y, Ω_X, Ω_Y) = (t√n/V) [ ∆(µ_X, µ_Y, Ω_X, Ω_Y) − ∆(X̄, Ȳ, Σ̂_X^{-1}, Σ̂_Y^{-1}) ] − t²/2 + o_P(1),

uniformly on A_n. The remainder of the proof is the same as the proof of Theorem 2.1.

B Proof of Theorem 4.2 & Theorem 4.4

In this section, we prove Theorem 4.2 and Theorem 4.4. Due to the similarity of the two theorems, we only present the details of the proof of Theorem 4.4; the proof of Theorem 4.2 will be outlined. By the remark after Theorem 4.4, it is sufficient to check the two conditions in Theorem 4.4 for X and Y separately. Therefore, we only prove the X part and omit the subscript X from now on. Denote the prior for (Ω, µ) as Π = Π_Ω × Π_µ. The following lemma is a generalization of Lemma 6.5 to the nonzero mean case.

Lemma B.1. Let ǫ be any sequence such that ǫ → 0. Define

K_n = { ||Ω||²||Σ − Σ*||²_F + 2||Ω||||µ − µ*||² ≤ ǫ² }.

Then for any b > 0, we have

P_{Σ*}^n ( ∫ exp( l_n(Ω) − l_n(Ω*) ) dΠ(Ω) ≤ Π(K_n) exp(−(b + 1)nǫ²) ) ≤ exp(−Cb²nǫ²),

for some constant C > 0.

Proof. We renormalize the restriction of Π to K_n as Π̃ = Π(K_n)^{-1} Π|_{K_n}, so that Π̃ is a distribution with support within K_n. Write E_Π̃ for the expectation under Π̃. Define the random variables

Y_i = ∫ log[ (dP_Σ/dP_{Σ*})(X_i) ] dΠ̃(Ω) = c + (1/2)(X_i − µ*)^T (Ω* − E_Π̃ Ω)(X_i − µ*) + (X_i − µ*)^T E_Π̃[ Ω(µ − µ*) ],

for i = 1, ..., n, where c is a constant independent of X_1, ..., X_n. Then Y_i is a sub-exponential random variable with mean

−P_{Σ*} Y_i = ∫ D(P_{Σ*}||P_Σ) dΠ̃(Ω) ≤ ∫ [ (1/4)||Ω||²||Σ − Σ*||²_F + (1/2)||Ω||||µ − µ*||² ] dΠ̃(Ω) ≤ ǫ²/4.
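The bound on the mean uses the explicit Kullback–Leibler divergence between two Gaussians; for reference, the standard identity reads

\[
D\big(N(\mu^*,\Sigma^*)\,\|\,N(\mu,\Sigma)\big)
= \frac12\Big[\operatorname{tr}(\Omega\Sigma^*) - p - \log\det(\Omega\Sigma^*)\Big]
+ \frac12(\mu-\mu^*)^T\Omega(\mu-\mu^*),
\]

with Ω = Σ^{-1}. The first bracket is second order in Σ − Σ* and is bounded by a multiple of ||Ω||²||Σ − Σ*||²_F when the perturbation is small, while the mean part is at most (1/2)||Ω||||µ − µ*||²; both are controlled on K_n.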

Thus, by Jensen’s inequality, we have ! Z   n dP Σ ˜ (X n )dΠ(Ω) ≤ exp − (b + 1)nǫ2 PΣn∗ dPΣn∗ ! n 1X n 2 ≤ PΣ∗ Yi ≤ −(b + 1)ǫ n i=1 ! n X 1 n 2 ≤ PΣ∗ (Yi − PΣ∗ Yi ) ≤ −bǫ n i=1

= PΣn

! n n 1X 1X (Y1i − PΣ∗ Y1i ) + (Y2i − PΣ∗ Y2i ) ≤ −bǫ2 , n n i=1

i=1

where in the last equality we defined

Y_{1i} = (1/2)(X_i − µ*)^T (Ω* − E_Π̃ Ω)(X_i − µ*),  Y_{2i} = (X_i − µ*)^T E_Π̃[ Ω(µ − µ*) ],

for i = 1, ..., n. By the union bound, we have

P_{Σ*}^n ( (1/n) Σ_{i=1}^n (Y_{1i} − P_{Σ*} Y_{1i}) + (1/n) Σ_{i=1}^n (Y_{2i} − P_{Σ*} Y_{2i}) ≤ −bǫ² )
≤ P_{Σ*}^n ( (1/n) Σ_{i=1}^n (Y_{1i} − P_{Σ*} Y_{1i}) ≤ −bǫ²/2 ) + P_{Σ*}^n ( (1/n) Σ_{i=1}^n (Y_{2i} − P_{Σ*} Y_{2i}) ≤ −bǫ²/2 ).

In the proof of Lemma 5.1 of [14], we have shown that

P_{Σ*}^n ( (1/n) Σ_{i=1}^n (Y_{1i} − P_{Σ*} Y_{1i}) ≤ −bǫ²/2 ) ≤ exp(−Cb²nǫ²).

Hence, it is sufficient to bound the second term. Define Z_i = Ω*^{1/2}(X_i − µ*), and then we have

Y_{2i} − P_{Σ*} Y_{2i} = Z_i^T a, with Z_i ∼ N(0, I_{p×p}) and a = Σ*^{1/2} E_Π̃[ Ω(µ − µ*) ].

By Bernstein's inequality (see, for example, Proposition 5.16 of [28]), we have

P( (1/n) Σ_{i=1}^n Z_i^T a ≤ −bǫ²/2 ) ≤ exp( −C min( (nbǫ²)²/(n||a||²), nbǫ²/||a||_∞ ) ).
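For the reader's convenience, the Bernstein-type inequality invoked here (in the form given in [28]; constants are generic) states that for independent centered sub-exponential variables W_k with sub-exponential norm at most K and a fixed coefficient vector b,

\[
P\Big(\Big|\sum_k b_k W_k\Big| \ge t\Big)
\le 2\exp\Big[-c\min\Big(\frac{t^2}{K^2\|b\|_2^2},\ \frac{t}{K\|b\|_\infty}\Big)\Big].
\]

Here Σ_i Z_i^T a = Σ_{i,j} a_j Z_{ij} is such a linear combination of np i.i.d. standard normals, whose coefficient vector has squared Euclidean norm n||a||² and sup-norm ||a||_∞; taking t = nbǫ²/2 yields the display above.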

Since

||a||² ≤ ||Σ*|| || E_Π̃[ Ω(µ − µ*) ] ||² ≤ ||Σ*|| E_Π̃ ||Ω(µ − µ*)||² ≤ C′ǫ²,

and ||a||_∞ ≤ ||a|| ≤ √(C′) ǫ, we then have

P( (1/n) Σ_{i=1}^n Z_i^T a ≤ −bǫ²/2 ) ≤ exp( −C min( b²nǫ², bnǫ ) ) = exp(−Cb²nǫ²),

because ǫ → 0. The conclusion follows from the fact that

P_{Σ*}^n ( ∫ (dP_Σ^n/dP_{Σ*}^n)(X^n) dΠ(Ω) ≤ Π(K_n) exp(−(b + 1)nǫ²) )
≤ P_{Σ*}^n ( ∫ (dP_Σ^n/dP_{Σ*}^n)(X^n) dΠ̃(Ω) ≤ exp(−(b + 1)nǫ²) ).

The following lemma proves prior concentration.

Lemma B.2. Assume p² = o(n/log n), ||µ*|| = O(1) and ||Σ*|| ∨ ||Ω*|| ≤ Λ = O(1). For the prior Π = Π_Ω × Π_µ, we have

Π( ||Ω||²||Σ − Σ*||²_F + 2||Ω||||µ − µ*||² ≤ p² log n/n ) ≥ exp(−Cp² log n),

for some constant C > 0.


Proof. We have

Π( ||Ω||²||Σ − Σ*||²_F + 2||Ω||||µ − µ*||² ≤ p² log n/n )
≥ Π( ||Ω||²||Σ − Σ*||²_F + 4Λ||µ − µ*||² ≤ p² log n/n )
≥ Π( ||Ω||²||Σ − Σ*||²_F ≤ p² log n/(2n), 4Λ||µ − µ*||² ≤ p² log n/(2n) )
= Π_Ω( ||Ω||²||Σ − Σ*||²_F ≤ p² log n/(2n) ) Π_µ( 4Λ||µ − µ*||² ≤ p² log n/(2n) ),

where the first term is lower bounded in Lemma 6.4. It is sufficient to lower bound Π_µ( 4Λ||µ − µ*||² ≤ p² log n/(2n) ). By the definition of the Gaussian density,

Π_µ( 4Λ||µ − µ*||² ≤ p² log n/(2n) )
≥ e^{−||µ*||²/2} [ P( |Z|² ≤ p log n/(cn) ) ]^p
≥ exp( −||µ*||²/2 − Cp log n ).

The proof is complete by noticing ||µ*|| = O(1).
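Both this bound and the one in Lemma 6.4 rest on a standard Gaussian shift inequality of the following type (a sketch; the precise constants are as in Lemma E.1 of [14]): for Z ∼ N(0, 1), a ∈ R and ε > 0,

\[
P(|Z - a| \le \varepsilon)
= \int_{-\varepsilon}^{\varepsilon}\frac{1}{\sqrt{2\pi}}e^{-(a+u)^2/2}\,du
\ge e^{-a^2-\varepsilon^2/2}\,P(|Z| \le \varepsilon),
\]

using (a + u)² ≤ 2a² + 2u². Multiplying over independent coordinates turns the squared norm of the centering into the exponential prefactor, and P(|Z| ≤ ε) ≥ cε for small ε contributes the Cp log n (or Cp² log n) term.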

Lemma B.3. Assume ||Σ*|| ∨ ||Ω*|| ≤ Λ = O(1). Then for any constant M > 0, there exists a testing function φ such that

P_{(µ*,Ω*)}^n φ ≤ exp(−CM²p² log n),
sup_{ (µ,Ω) ∈ supp(Π): ||µ−µ*|| > M√(p² log n/n) } P_{(µ,Ω)}^n (1 − φ) ≤ exp(−CM²p² log n),

for some constant C > 0.

Proof. Use the notation ǫ² = p² log n/n. Consider the testing function

φ = I{ ||X̄ − µ*|| > Mǫ/2 }.

Then we have

P_{(µ*,Ω*)}^n φ = P( (1/√n)||Σ*^{1/2} Z|| > Mǫ/2 ) ≤ P( ||Z||² ≥ CM²nǫ² ),

where Z ∼ N(0, I_{p×p}). We also have, for any (µ, Ω) in the alternative set,

P_{(µ,Ω)}^n (1 − φ) ≤ P_{(µ,Ω)}^n ( ||µ − µ*|| − ||X̄ − µ|| ≤ Mǫ/2 )
≤ P_{(µ,Ω)}^n ( ||X̄ − µ|| > Mǫ/2 )
= P( (1/√n)||Σ^{1/2} Z|| > Mǫ/2 )
≤ P( ||Z||² ≥ CM²nǫ² ).

Finally, it is sufficient to bound P( ||Z||² ≥ CM²nǫ² ). We have

P( ||Z||² ≥ CM²nǫ² ) = P( Σ_{j=1}^p (Z_j² − 1) ≥ CM²nǫ² − p )
≤ P( Σ_{j=1}^p (Z_j² − 1) ≥ CM²nǫ²/2 )
≤ exp( −C min( (M²nǫ²)²/p, M²nǫ² ) )
= exp( −CM²nǫ² ),

where we have used Bernstein's inequality. The proof is complete.

Lemma B.4. Assume ||Σ*|| ∨ ||Ω*|| ≤ Λ = O(1) and ||Σ_1|| ∨ ||Ω_1|| ≤ 2Λ. There exist small δ, δ′, δ̄ > 0, only depending on Λ, such that for any M > 0 there exists a testing function φ with

P_{(µ*,Ω*)}^n φ ≤ 2 exp(−Cδ′||Σ* − Σ_1||²_F),
sup_{ (µ,Ω) ∈ supp(Π): ||Σ−Σ_1||_F ≤ δ||Σ*−Σ_1||_F, ||µ−µ*|| ≤ Mǫ } P_{(µ,Ω)}^n (1 − φ) ≤ 2 exp(−Cδ′||Σ* − Σ_1||²_F),

for some constant C > 0, whenever 6ΛM²ǫ² ≤ δ̄||Σ_1 − Σ*||²_F.

Proof. Since the lemma is a slight variation of Lemma 5.9 in [14], we do not write out the proof in full detail. We highlight the part where the current form differs from that in [14], and omit the similar parts, whose full details the reader may find in the proof of Lemma 5.9 of Gao and Zhou [14]. We use the testing function

φ = I{ (1/n) Σ_{i=1}^n (X_i − µ*)^T (Ω* − Ω_1)(X_i − µ*) > log det(Ω*Σ_1) }.

We immediately have

P_{(µ*,Ω*)}^n φ ≤ 2 exp( −Cδ′||Σ_1 − Σ*||²_F ),

as is proved in [14]. Now we are going to bound P_{(µ,Ω)}^n (1 − φ) for every (µ, Ω) in the alternative set. Note that we have

1 − φ = I{ (1/n) Σ_{i=1}^n (X_i − µ)^T (Ω* − Ω_1)(X_i − µ) + 2(X̄ − µ)^T (Ω* − Ω_1)(µ − µ*) + (µ − µ*)^T (Ω* − Ω_1)(µ − µ*) < log det(Ω*Σ_1) }
= I{ (1/n) Σ_{i=1}^n [ P_{(µ,Ω)} (X_i − µ)^T (Ω* − Ω_1)(X_i − µ) − (X_i − µ)^T (Ω* − Ω_1)(X_i − µ) ] + 2(X̄ − µ)^T (Ω_1 − Ω*)(µ − µ*) + (µ − µ*)^T (Ω_1 − Ω*)(µ − µ*) > ρ̄ },

where ρ̄ = P_{(µ,Ω)} (X_1 − µ)^T (Ω* − Ω_1)(X_1 − µ) − log det(Ω*Σ_1), and we have proved in [14] that

ρ̄ ≥ δ̄||Σ_1 − Σ*||²_F,

for some δ̄ only depending on Λ. Using the union bound, we have

P_{(µ,Ω)}^n (1 − φ)
≤ P_{(µ,Ω)}^n ( | (1/n) Σ_{i=1}^n [ (X_i − µ)^T (Ω* − Ω_1)(X_i − µ) − P_{(µ,Ω)} (X_i − µ)^T (Ω* − Ω_1)(X_i − µ) ] | > ρ̄/2 )
+ P_{(µ,Ω)}^n ( 2(X̄ − µ)^T (Ω_1 − Ω*)(µ − µ*) + (µ − µ*)^T (Ω_1 − Ω*)(µ − µ*) > ρ̄/2 ).

[14] showed that the first term above is bounded by 2 exp( −Cδ′||Σ_1 − Σ*||²_F ). It is sufficient to bound the second term to close the proof; indeed, this is the only difference between this proof and the one in [14]. Note that

| (µ − µ*)^T (Ω_1 − Ω*)(µ − µ*) | ≤ 3Λ||µ − µ*||² ≤ 3ΛM²ǫ².

By assumption,

3ΛM²ǫ² ≤ (1/2)δ̄||Σ_1 − Σ*||²_F ≤ ρ̄/4.

Hence,

P_{(µ,Ω)}^n ( 2(X̄ − µ)^T (Ω_1 − Ω*)(µ − µ*) + (µ − µ*)^T (Ω_1 − Ω*)(µ − µ*) > ρ̄/2 )
≤ P_{(µ,Ω)}^n ( | 2(X̄ − µ)^T (Ω* − Ω_1)(µ − µ*) | > ρ̄/4 )
= P( |Z^T a| > √n ρ̄/8 ),

where Z ∼ N(0, I_{p×p}) and a = Σ^{1/2}(Ω* − Ω_1)(µ − µ*). Using Hoeffding's inequality (see, for example, Proposition 5.10 of [28]), we have

P( |Z^T a| > √n ρ̄/8 ) ≤ exp( −Cnρ̄²/||a||² ),

where

||a||² ≤ 9Λ³||µ − µ*||² ≤ 9Λ³M²ǫ² ≤ (3Λ²/4) ρ̄,

according to the assumption. Thus,

P( |Z^T a| > √n ρ̄/8 ) ≤ exp( −Cδ′||Σ_1 − Σ*||²_F ),

for some δ′ only depending on Λ. Therefore, P_{(µ,Ω)}^n (1 − φ) ≤ 2 exp( −Cδ′||Σ_1 − Σ*||²_F ) for all (µ, Ω) in the alternative set, and the proof is complete.

Proof of Theorem 4.2 and Theorem 4.4. According to the remark after Theorem 4.3,

Π(A_n|X^n, Y^n) = Π_X(A_{X,n}|X^n) Π_Y(A_{Y,n}|Y^n).

Thus, it is sufficient to show that both Π_X(A_{X,n}|X^n) and Π_Y(A_{Y,n}|Y^n) converge to 1 in probability. Since they have the same form, we treat them together by omitting the subscripts X and Y. The posterior distribution is defined as

Π(A_n^c|X^n) = ∫_{A_n^c} exp( l_n(µ, Ω) − l_n(µ*, Ω*) ) dΠ(µ, Ω) / ∫ exp( l_n(µ, Ω) − l_n(µ*, Ω*) ) dΠ(µ, Ω) = N_n/D_n,

where we consider

A_n = { ||µ − µ*|| ≤ M√(p² log n/n), ||Σ − Σ*||_F ≤ M̄√(p² log n/n) },

for some M and M̄ sufficiently large. We are going to establish a test between the following hypotheses:

H_0: (µ, Ω) = (µ*, Ω*) vs H_1: (µ, Ω) ∈ A_n^c ∩ supp(Π).

Decompose A_n^c as

A_n^c = B_{1n} ∪ B_{2n},

where

B_{1n} = { ||µ − µ*|| > M√(p² log n/n) },

and

B_{2n} = { ||µ − µ*|| ≤ M√(p² log n/n), ||Σ − Σ*||_F > M̄√(p² log n/n) }.

By Lemma B.3, there exists φ_1 such that

P_{(µ*,Ω*)}^n φ_1 ∨ sup_{supp(Π) ∩ B_{1n}} P_{(µ,Ω)}^n (1 − φ_1) ≤ exp(−CM²p² log n).

For B_{2n}, we pick a covering set {Σ_j}_{j=1}^N ⊂ B_{2n} ∩ supp(Π) such that B_{2n} ⊂ ∪_{j=1}^N B_{2nj}, where

B_{2nj} = { ||µ − µ*|| ≤ M√(p² log n/n), ||Σ − Σ_j||_F ≤ √(p² log n/n) },

and the covering number N can be chosen to satisfy

log N ≤ Cp² log n,

as is shown in detail in the proof of Lemma 2.2. We may choose M̄ large enough so that the assumption of Lemma B.4 is satisfied, which implies the existence of φ_{2j} such that

P_{(µ*,Ω*)}^n φ_{2j} ∨ sup_{supp(Π) ∩ B_{2nj}} P_{(µ,Ω)}^n (1 − φ_{2j}) ≤ exp(−CM̄²p² log n).

Define the final test as φ = max{ φ_1, ∨_{j=1}^N φ_{2j} }. Then, using the union bound, we have

P_{(µ*,Ω*)}^n φ ∨ sup_{supp(Π) ∩ A_n^c} P_{(µ,Ω)}^n (1 − φ) ≤ exp(−C(M̄² ∧ M²)p² log n),

for large M and M̄. Combining the testing result and the conclusions of Lemma B.1 and Lemma B.2, we have

Π(A_n^c|X^n) = o_P(1) under P_{(µ*,Ω*)}^n,

by using the same argument as in the proof of Lemma 2.2. For QDA, as long as p² = o(√n/log n), this A_n satisfies the requirement. For LDA, we use an A_n defined as

A_n = { ||µ_X − µ*_X|| ∨ ||µ_Y − µ*_Y|| ≤ M√(p² log n/n), ||Σ − Σ*||_F ≤ M̄√(p² log n/n) }.

The proof needs some slight modifications (including in the previous lemmas), which are not essential and which we choose to omit here. When p² = o(√n/log n) holds, this A_n also satisfies the requirement there.

Now we are going to check the second conditions of Theorem 4.1 and Theorem 4.3. We mainly sketch the QDA case. Using the notation of Lemma 2.2,

∫_{A_n} exp( l_n(Ω_t, µ_t) ) dΠ(Ω, µ) = ξ_p^{-1} ∫_{A_n} exp( l_n(Ω_t, µ_t) − ||Ω̄||²_F/2 − ||µ||²/2 ) dΩ̄ dµ,

where ξ_p is a normalizing constant and

∫_{A_n} exp( l_n(Ω_t, µ_t) − ||Ω̄||²_F/2 − ||µ||²/2 ) dΩ̄ dµ
= ∫_{A_n + (2tn^{-1/2}Φ, tn^{-1/2}ξ)} exp( l_n(Γ, θ) − ||Γ̄ − 2tn^{-1/2}Φ̄||²_F/2 − ||θ − tn^{-1/2}ξ||²/2 ) dΓ̄ dθ.

Proceeding as in Lemma 2.2, the result is proved.
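The covering bound log N ≤ Cp² log n quoted above comes from a standard volumetric estimate; a sketch of the count, under the assumption that supp(Π) confines ||Σ|| ≤ 2Λ so that ||Σ − Σ*||_F ≤ C√p on the support:

\[
N \le \Big(\frac{3R}{\varepsilon}\Big)^{d}
\quad\Longrightarrow\quad
\log N \le \frac{p(p+1)}{2}\,\log\!\Big(C\sqrt{\frac{n}{p\log n}}\Big) \le Cp^2\log n,
\]

since a Euclidean ball of radius R in dimension d admits an ε-net of cardinality at most (1 + 2R/ε)^d ≤ (3R/ε)^d; here d = p(p + 1)/2 is the number of free entries of a symmetric matrix, R ≤ C√p, and ε = √(p² log n/n).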

C Proof of Lemma 3.1

Let Ω̂ = Σ̂^{-1}. Note that

| log det(Σ) − log det(Σ̂) − tr( (Σ − Σ̂)Ω* ) |
≤ | log det( I + (Ω̂^{1/2}ΣΩ̂^{1/2} − I) ) − tr( Ω̂^{1/2}ΣΩ̂^{1/2} − I ) | + | tr( (Σ − Σ̂)(Ω̂ − Ω*) ) |.

For the second term,

√(n/p) | tr( (Σ − Σ̂)(Ω̂ − Ω*) ) |
≤ C√(n/p) ||Σ − Σ̂||_F ||Σ̂ − Σ*||_F
≤ C√(n/p) ( ||Σ̂ − Σ*||²_F + ||Σ − Σ*||_F ||Σ̂ − Σ*||_F )
≤ O_P( √(p³/n) + √p ||Σ − Σ*||_F ),

which converges to zero whenever √p ||Σ − Σ*||_F ≤ δ_n = o(1). For the first term,

√(n/p) | log det( I + (Ω̂^{1/2}ΣΩ̂^{1/2} − I) ) − tr( Ω̂^{1/2}ΣΩ̂^{1/2} − I ) |
≤ C√(n/p) || Ω̂^{1/2}ΣΩ̂^{1/2} − I ||²_F
≤ C√(n/p) ( ||Σ − Σ*||²_F + ||Σ̂ − Σ*||²_F )
≤ O_P( √(p³/n) ) + C√(n/p) ||Σ − Σ*||²_F,

which converges to zero whenever √(n/p) ||Σ − Σ*||²_F ≤ δ_n = o(1). Thus, the proof is complete.
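The first inequality above uses the elementary second-order bound on the log-determinant; a short justification, valid when the eigenvalues of the perturbation are at most 1/2 in absolute value (which holds on A_n for large n):

\[
|\log\det(I+\Delta) - \operatorname{tr}(\Delta)|
= \Big|\sum_{j=1}^p \big(\log(1+d_j) - d_j\big)\Big|
\le \sum_{j=1}^p d_j^2 = \|\Delta\|_F^2,
\]

where the d_j are the eigenvalues of the symmetric matrix Δ = Ω̂^{1/2}ΣΩ̂^{1/2} − I, and we used |log(1 + x) − x| ≤ x² for |x| ≤ 1/2.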

D Proof of Lemma 3.2, Lemma 3.3 & Proposition 5.1

Due to the similarity between Lemma 3.2 and Lemma 3.3, we only give the proof of Lemma 3.2. Let us study the linear approximation of the eigenvalue perturbation. In particular, we are going to find the first-order Taylor expansion of λ_m(Σ) − λ_m(Σ̂) and control the error term on some set A_n. We have the following spectral decompositions for the three covariance matrices Σ, Σ̂, Σ*:

Σ = UDU^T,  Σ̂ = ÛD̂Û^T,  Σ* = U*D*U*^T.

Denote the m-th columns of U, Û, U* by u_m, û_m, u*_m. Then,

λ_m(Σ) − λ_m(Σ̂) = λ_m(Û^T Σ Û) − λ_m(Û^T Σ̂ Û) = λ_m( D̂ + Û^T(Σ − Σ̂)Û ) − λ_m(D̂).

Write A = D̂ and ∆ = Û^T(Σ − Σ̂)Û. The problem is reduced to the eigenvalue perturbation of a diagonal matrix. According to the expansion formula in [18] and [1], we have

λ_m(A + ∆) − λ_m(A) = λ′_m(A, ∆) + Σ_{k=2}^∞ λ^{(k)}_m(A, ∆),   (26)

where the first-order term is

λ′_m(A, ∆) = ∆_mm = tr( Û^T(Σ − Σ̂)Û E_mm ) = tr( (Σ − Σ̂)Û E_mm Û^T ) = tr( (Σ − Σ̂) û_m û_m^T ).
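Before turning to the higher-order terms, it may help to see the expansion in the simplest case; the following 2×2 computation is an illustration added for the reader, not part of the original argument. With A = diag(a_1, a_2) and symmetric ∆,

\[
\lambda_1(A+\Delta)
= \frac{a_1+a_2+\Delta_{11}+\Delta_{22}}{2}
+ \frac12\sqrt{(a_1-a_2+\Delta_{11}-\Delta_{22})^2 + 4\Delta_{12}^2}
= a_1 + \Delta_{11} + \frac{\Delta_{12}^2}{a_1-a_2} + O(\|\Delta\|^3),
\]

so the first-order term is exactly λ′_1(A, ∆) = ∆_11, while the quadratic correction ∆_12²/(a_1 − a_2) is the k = 2 term of the series defined next; it blows up as the eigengap a_1 − a_2 shrinks, which is why the bounds below are phrased in terms of ||Ã||.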

In the remainder term, we have

λ^{(k+1)}_m(A, ∆) = −(1/(k+1)) Σ_{ v_1+...+v_{k+1}=k, v_1,...,v_{k+1} ≥ 0 } tr( ∆Ã^{v_1} ... ∆Ã^{v_{k+1}} ),

where Ã^v is the matrix power when v ≥ 1, with the exception that Ã^0 = −e_m e_m^T, where e_m is the m-th vector of the canonical basis of R^p. The matrix à is defined as

à = Σ_{1≤j≤p, j≠m} e_j e_j^T / (a_m − a_j),

where a_j = λ_j(Σ̂) is the (j, j)-th entry of A. Therefore, for any integer v ≥ 1,

||Ã^v|| ≤ ||Ã||^v, with ||Ã|| = max{ 1/|a_m − a_{m−1}|, 1/|a_m − a_{m+1}| }.

We are going to show that the first term in (26) is a good enough approximation of λ_m(A + ∆) − λ_m(A) by bounding the higher-order terms. Let us provide a bound for |λ^{(k+1)}_m(A, ∆)|. Let N = {0, 1, 2, ...} and consider the set

{ (v_1, ..., v_{k+1}) ∈ N^{k+1} : v_1 + ... + v_{k+1} = k }.

From its definition, there must be some l such that v_l = 0 in order to satisfy v_1 + ... + v_{k+1} = k. Thus, the set can be decomposed into a union of subsets as follows:

{ (v_1, ..., v_{k+1}) ∈ N^{k+1} : v_1 + ... + v_{k+1} = k }
= ∪_{l=1}^{k+1} { (v_1, ..., v_{k+1}) ∈ N^{k+1} : v_l = 0, v_1 + ... + v_{l−1} + v_{l+1} + ... + v_{k+1} = k }.

Clearly, the cardinality of each subset is at most

(2k−1 choose k−1) ≤ (3e)^k.

We name the sets above V_{k+1} = ∪_{l=1}^{k+1} V_{k+1,l}. For l = 1, we have

| Σ_{V_{k+1,1}} tr( ∆Ã^{v_1} ... ∆Ã^{v_{k+1}} ) |
≤ Σ_{V_{k+1,1}} | tr( ∆Ã^0 ∆Ã^{v_2} ... ∆Ã^{v_{k+1}} ) |
= Σ_{V_{k+1,1}} | tr( ∆ e_m e_m^T ∆Ã^{v_2} ... ∆Ã^{v_{k+1}} ) |
≤ Σ_{V_{k+1,1}} ||∆e_m|| || e_m^T ∆Ã^{v_2} ... ∆Ã^{v_{k+1}} ||
≤ Σ_{V_{k+1,1}} ||∆||^{k+1} ||Ã||^{v_2+...+v_{k+1}}
= Σ_{V_{k+1,1}} ||∆||^{k+1} ||Ã||^k
≤ ||∆|| ( 3e||∆||||Ã|| )^k.

In the same way, the bound also holds for the other l. Therefore,

|λ^{(k+1)}_m(A, ∆)| = | (1/(k+1)) Σ_{l=1}^{k+1} Σ_{V_{k+1,l}} tr( ∆Ã^{v_1} ... ∆Ã^{v_{k+1}} ) |
≤ (1/(k+1)) Σ_{l=1}^{k+1} ||∆|| ( 3e||∆||||Ã|| )^k
= ||∆|| ( 3e||∆||||Ã|| )^k.

When 3e||∆||||Ã|| < 1, we may sum over k and obtain

Σ_{k=2}^∞ |λ^{(k)}_m(A, ∆)| ≤ ||∆|| Σ_{k=1}^∞ ( 3e||∆||||Ã|| )^k = 3e||∆||²||Ã|| / ( 1 − 3e||∆||||Ã|| ).

Note that

||∆|| = ||Σ̂ − Σ|| ≤ ||Σ̂ − Σ*|| + ||Σ − Σ*|| ≤ O_P(√(p/n)) + ||Σ − Σ*||,

and

||Ã|| ≤ C min{ max{ |λ_m − λ_{m−1}|^{-1}, |λ_m − λ_{m+1}|^{-1} }, ||Σ − Σ̂||^{-1} } = O_P( min{ δ^{-1}, √(n/p) } ).

Therefore,

3e||∆||||Ã|| = O_P( ( √(p/n) + ||Σ − Σ*|| ) / δ ) = o_P(1),

which holds under the assumption δ^{-1}√(p/n) = o(1) and when δ^{-1}||Σ − Σ*|| = o_P(1). The remainders are controlled by

√n Σ_{k=2}^∞ |λ^{(k)}_m(A, ∆)| = O_P( p/(δ√n) + √n||Σ − Σ*||²/δ ) = o_P(1),

under the assumption p/(δ√n) = o(1) and when δ^{-1}√n||Σ − Σ*||² = o(1). Hence, by (26), we have proved

sup_{ δ^{-1}||Σ−Σ*|| ∨ δ^{-1}√n||Σ−Σ*||² ≤ δ_n } √n | λ_m(A + ∆) − λ_m(A) − λ′_m(A, ∆) | = o_P(1),

for any δ_n = o(1).

Finally, for the first-order term λ′_m(A, ∆), we approximate it by tr( (Σ − Σ̂) u*_m u*^T_m ), and the approximation error is

| tr( (Σ − Σ̂)( û_m û_m^T − u*_m u*^T_m ) ) | = || û_m û_m^T − u*_m u*^T_m ||_F | tr( (Σ − Σ̂)K ) |,

where K is a rank-two unit Frobenius norm matrix. It has SVD K = c_1 d_1 d_1^T + c_2 d_2 d_2^T with c_1 ∨ c_2 ≤ 1. Therefore,

|| û_m û_m^T − u*_m u*^T_m ||_F | tr( (Σ − Σ̂)K ) | ≤ C ||Σ̂ − Σ|| ||Σ̂ − Σ*|| ≤ O_P(p/n) + O_P(√(p/n)) ||Σ − Σ*||.

Under the assumption p² = o(n), when √p||Σ − Σ*|| = o(1), we have

√n tr( (Σ − Σ̂)( û_m û_m^T − u*_m u*^T_m ) ) = o_P(1).

Therefore, the proof of Lemma 3.2 is complete.

Now we prove Proposition 5.1. We redefine A = D* and ∆ = U*^T(Σ̂ − Σ*)U*, and correspondingly Ã^v. In the case where δ is a constant, we have

||Ã|| ≤ C,  ||∆|| = O_P(√(p/n)).

Similarly to (26), we have

λ_1(Σ̂) − λ_1(Σ*) = λ′_1(A, ∆) + Σ_{k=2}^∞ λ^{(k)}_1(A, ∆),

where λ′_1(A, ∆) = tr( (Σ̂ − Σ*) u*_1 u*^T_1 ), and thus √n λ′_1(A, ∆) is asymptotically normal. For the remainder term, we decompose it as

Σ_{k=2}^∞ λ^{(k)}_1(A, ∆) = λ^{(2)}_1(A, ∆) + Σ_{k=3}^∞ λ^{(k)}_1(A, ∆).

Using techniques similar to those in the proof of Lemma 3.2, we have

√n Σ_{k=3}^∞ |λ^{(k)}_1(A, ∆)| ≤ √n Σ_{k=2}^∞ ||∆|| ( 3e||∆||||Ã|| )^k ≤ C√n ||∆||³ = O_P( √(p³/n²) ),

which is o_P(1) under the assumption. Therefore,

√n ( λ_1(Σ̂) − λ_1(Σ*) ) = √n tr( (Σ̂ − Σ*) u*_1 u*^T_1 ) + √n λ^{(2)}_1(A, ∆) + o_P(1).

It remains to show that √n λ^{(2)}_1(A, ∆) is not o_P(1). Note that à = Σ_{j≥2} e_j e_j^T / (a_1 − a_j), with a_j = λ_j(Σ*). Hence,

√n λ^{(2)}_1(A, ∆) = √n tr( ∆Ã∆ e_1 e_1^T )
= √n Σ_{j≥2} ( 1/(a_1 − a_j) ) | e_1^T U*^T(Σ* − Σ̂)U* e_j |²
≥ ( √n/(a_1 − a_2) ) Σ_{j≥2} | e_1^T U*^T(Σ* − Σ̂)U* e_j |².

For a fixed eigengap, a_1 − a_2 is a constant. When Σ* is diagonal, U* = I, and we have |e_1^T U*^T(Σ* − Σ̂)U* e_j|² = σ̂²_{1j}. Moreover, the fact that Σ* is diagonal implies the σ̂²_{1j} are independent for j = 2, ..., p. Hence,

√n λ^{(2)}_1(A, ∆) ≥ C√n Σ_{j=2}^p σ̂²_{1j},

where √n Σ_{j=2}^p σ̂²_{1j} is at the level of p/√n, which diverges to ∞ under the assumption. The proof is complete.
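The claimed p/√n level can be seen from a direct moment computation (a sketch; here σ̂_{1j} denotes the (1, j) sample covariance entry, and for simplicity we assume known zero means):

\[
\hat\sigma_{1j} = \frac1n\sum_{i=1}^n X_{i1}X_{ij},\qquad
E\,\hat\sigma_{1j}^2 = \frac{\sigma^*_{11}\sigma^*_{jj}}{n}
\quad\text{when }\sigma^*_{1j}=0,
\]

so that

\[
\sqrt n \sum_{j=2}^p E\,\hat\sigma_{1j}^2
= \frac{1}{\sqrt n}\sum_{j=2}^p \sigma^*_{11}\sigma^*_{jj}
\asymp \frac{p}{\sqrt n},
\]

which diverges when p²/n → ∞; concentration of the sum around its mean then upgrades this to divergence in probability.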

References

[1] Ames, B. P. & Sendov, H. S. (2012). A new derivation of a formula by Kato. Linear Algebra and its Applications 436, 722–730.

[2] Barron, A., Schervish, M. J. & Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. The Annals of Statistics 27, 536–561.

[3] Bernstein, S. (1927). Theory of Probability.

[4] Cai, T. T., Liang, T. & Zhou, H. H. (2013). Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. arXiv preprint arXiv:1309.0482.

[5] Cai, T. T., Zhang, C.-H. & Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics 38, 2118–2144.

[6] Cai, T. T. & Zhou, H. H. (2012). Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics 40, 2389–2420.

[7] Castillo, I. (2012). A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probability Theory and Related Fields 152, 53–99.

[8] Castillo, I. & Nickl, R. (2013). Nonparametric Bernstein–von Mises theorems in Gaussian white noise. The Annals of Statistics 41, 1999–2028.

[9] Castillo, I. & Nickl, R. (2014). On the Bernstein–von Mises phenomenon for nonparametric Bayes procedures. The Annals of Statistics 42, 1941–1969.

[10] Castillo, I. & Rousseau, J. (2013). A general Bernstein–von Mises theorem in semiparametric models. arXiv preprint arXiv:1305.4482.

[11] Cox, D. D. (1993). An analysis of Bayesian inference for nonparametric regression. The Annals of Statistics, 903–923.

[12] Davidson, K. R. & Szarek, S. J. (2001). Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces 1, 317–366.

[13] Freedman, D. (1999). Wald lecture: On the Bernstein–von Mises theorem with infinite-dimensional parameters. The Annals of Statistics 27, 1119–1141.

[14] Gao, C. & Zhou, H. H. (2014). Rate-optimal posterior contraction for sparse PCA. The Annals of Statistics (to appear).

[15] Ghosal, S., Ghosh, J. K. & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics 28, 500–531.

[16] Hoffmann, M., Rousseau, J. & Schmidt-Hieber, J. (2013). On adaptive posterior concentration rates. arXiv preprint arXiv:1305.5270.

[17] Johnstone, I. M. (2010). High dimensional Bernstein–von Mises: simple examples. Institute of Mathematical Statistics Collections 6, 87.

[18] Kato, T. (1995). Perturbation Theory for Linear Operators, vol. 132. Springer.

[19] Kim, Y. & Lee, J. (2004). A Bernstein–von Mises theorem in the nonparametric right-censoring model. The Annals of Statistics 32, 1492–1512.

[20] Laplace, P. (1774). Mémoire sur la probabilité des causes par les événements. English translation in Statistical Science 1, 359–378.

[21] Le Cam, L. & Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts. Springer.

[22] Leahu, H. (2011). On the Bernstein–von Mises phenomenon in the Gaussian white noise model. Electronic Journal of Statistics 5, 373–404.

[23] Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics 41, 772–801.

[24] Ray, K. (2014). Bernstein–von Mises theorems for adaptive Bayesian nonparametric procedures. arXiv preprint arXiv:1407.3397.

[25] Rivoirard, V. & Rousseau, J. (2012). Bernstein–von Mises theorem for linear functionals of the density. The Annals of Statistics 40, 1489–1523.

[26] Shen, X. (2002). Asymptotic normality of semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association 97.

[27] van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.

[28] Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

[29] von Mises, R. (1928). Wahrscheinlichkeit, Statistik und Wahrheit. Berlin.
