arXiv:1712.03522v1 [math.ST] 10 Dec 2017

Finite sample Bernstein – von Mises theorems for functionals and spectral projectors of covariance matrix

Igor Silin∗
Moscow Institute of Physics and Technology, Skolkovo Institute of Science and Technology, Institute for Information Transmission Problems RAS, Moscow, Russia
e-mail: [email protected]

Abstract: We demonstrate that the influence of a prior on the posterior distribution of a covariance matrix vanishes as the sample size grows. The assumptions on the prior are explicit and mild. The results are valid for a finite sample and admit the dimension $p$ growing with the sample size $n$. We exploit this fact to derive a finite sample Bernstein – von Mises theorem for functionals of the covariance matrix (e.g. eigenvalues) and to find the posterior distribution of the Frobenius distance between a spectral projector and the empirical spectral projector. This can be useful for constructing sharp confidence sets for the true value of the functional or for the true spectral projector.

MSC 2010 subject classifications: Primary 62F15; secondary 62H25.
Keywords and phrases: Bayesian nonparametrics, Bernstein – von Mises theorem, covariance matrix, spectral projectors.

Contents

1 Introduction
2 Setup and main result
   2.1 Notations
   2.2 Setup
   2.3 Bayesian framework and main result
3 Applications
   3.1 Functionals of covariance matrix
   3.2 Spectral projectors of covariance matrix
4 Main proofs
   4.1 Proof of Theorem 2.3
   4.2 Proof of Theorem 3.2
A Auxiliary results
B Auxiliary proofs
   B.1 Proof of Lemma 2.1
   B.2 Proof of Lemma 2.2
References

∗ Supported by the Russian Science Foundation grant (project 14-50-00150)

1. Introduction

The prominent Bernstein – von Mises (BvM) phenomenon states a pivotal behaviour of the posterior distribution. It specifies conditions on a prior under which the influence of the prior vanishes as the number of observations grows and the posterior is asymptotically Gaussian. The main application of BvM is the use of Bayesian credible sets as frequentist confidence sets. It helps in situations where frequentist uncertainty quantification does not allow one to build confidence sets directly because of unknown parameters of the asymptotic distribution. Classical BvM results for the standard parametric setup are formulated in [21, 30]. More general semiparametric models were studied by [3]. In modern statistics the main focus is on growing parameter dimension, so the classical results should be reconsidered; see, e.g., [12, 16] for some examples in high dimensions. Moreover, many statisticians work with samples of limited size; however, only a few finite sample BvM results are available, e.g. [24]. We also mention the BvM for linear functionals of the density derived in [26] and the general theory for smooth functionals of the target parameter presented in [7], among other important works in the field.

This paper aims at deriving similar results for the following specific model. Let the data $X^n = (X_1, \dots, X_n)$ be independent identically distributed zero-mean random vectors in $\mathbb{R}^p$. Their covariance matrix is given by
\[
\Sigma^* \stackrel{\mathrm{def}}{=} \mathbb{E}\bigl[ X_j X_j^\top \bigr].
\]
A natural estimate of the true unknown covariance is the sample covariance matrix, defined as
\[
\widehat{\Sigma} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{j=1}^n X_j X_j^\top.
\]
The spectral norm $\|\widehat{\Sigma} - \Sigma^*\|_\infty$ arises in numerous problems and is well examined; see, for instance, [19, 25, 28, 31, 1]. Functionals and spectral projectors of the covariance matrix also appear frequently in applications, so this model is of special interest nowadays.

In this work we show that the posterior distribution of the covariance matrix $\Sigma$ stays approximately the same for different choices of the prior distribution as soon as we have enough observations. In particular, we demonstrate that the posterior computed from an arbitrary prior deviates little from the posterior corresponding to one special class of priors – the Inverse Wishart priors. We do not impose any conditions on the prior; the error of approximation of the corresponding posterior by the Inverse Wishart posterior is described in terms of two crucial concepts. One of these concepts is posterior contraction, and the other is the flatness of the prior. So, our results make sense if the prior is such that:

• the posterior concentrates in a relatively small vicinity of the true covariance $\Sigma^*$;
• the prior is "flat" enough in this vicinity, that is, can be well approximated by a constant.

The described "posterior independence" enables the following strategy, which drastically reduces the complexity of the problem at hand. Instead of working with some complicated prior, one can consider the Inverse Wishart prior. Since this prior is conjugate to the multivariate Gaussian distribution, the posterior is again Inverse Wishart, so we can study it directly. Moreover, in a wide range of situations nice properties of the Inverse Wishart distribution simplify the analysis significantly.

We apply the proposed strategy to the following important objects that are of interest in modern applications. First, we derive a BvM theorem for approximately linear functionals of the covariance matrix. The main focus here is on eigenvalues of the covariance matrix, which are an extensively studied object, see [22, 8, 15] and references therein. The asymptotic normality of the posterior measure of the functional was already shown in [32]. However, not only is their result for an infinite sample, it also imposes a non-trivial condition on the prior instead of our simple "flatness" assumption.

One more important object under consideration is a spectral projector of the covariance matrix. Let $P^*_J$ be the projector onto some set $J$ of eigenspaces of $\Sigma^*$. Its sample version $\widehat{P}_J$ is based on the sample covariance $\widehat{\Sigma}$. For some recent results on the distribution of $\|\widehat{P}_J - P^*_J\|_2^2$ we refer to [20, 23]. These objects are closely related to Principal Component Analysis (PCA), probably the most famous dimension reduction method. Nowadays PCA-based methods are actively used in deep learning architectures [13] and finance [10], along with other applications. Recent developments in theoretical guarantees for sparse PCA in high dimensions draw attention to such methods, see [14, 4, 2, 5, 11]. Our approach is applied to the squared Frobenius distance $\|P_J - \widehat{P}_J\|_2^2$ between the projector of $\Sigma$ and the empirical projector from $\widehat{\Sigma}$. One remarkable fact is that while the posterior distribution of a functional is approximated by a Gaussian, which is more or less usual for BvM, in the case of spectral projectors the limiting distribution is the distribution of a Gaussian quadratic form.

Even though the assumptions that the eigenvalues of $\Sigma^*$ are bounded from above and separated from zero, or that the spectral gaps (differences between consecutive eigenvalues) are separated from zero, are pretty common, we avoid them. Our results admit growing spectral norm $\|\Sigma^*\|_\infty$ and vanishing smallest eigenvalue and spectral gaps. The provided error bounds are explicit and allow one to track which regimes of $\Sigma^*$ still ensure convergence to the limiting distribution. It is also worth mentioning that the presented approach does not rely on Gaussianity of the data. Even though we work with a Gaussian likelihood, we allow model misspecification and formulate our results for the quasi-posterior. As to the distribution of the data, we require only one property: the concentration of the sample covariance $\widehat{\Sigma}$ around the true covariance $\Sigma^*$.

The main contributions of this paper are as follows.

• We establish a result stating that the prior influence on the posterior distribution of the covariance matrix disappears as the sample size grows. The assumptions on the prior are mild and easy to verify. The data distribution can also be pretty general.
• We propose a novel strategy for analysing the posterior distribution for an arbitrary prior. The strategy includes: first, approximation of the posterior at hand by the posterior based on the conjugate prior, and second, study of the latter posterior, which has nice properties.
• The described strategy is applied to derive finite sample BvM theorems for functionals and spectral projectors of the covariance matrix.

The rest of the paper is structured as follows. Some notations are introduced in Section 2.1. Section 2.2 explains our setup. The Bayesian framework is described in Section 2.3. Section 3 is dedicated to the applications of the proposed strategy: functionals of the covariance matrix are examined in Subsection 3.1, and spectral projectors are considered in Subsection 3.2. The proofs of the main theorems are collected in Section 4. Appendix A and Appendix B gather some auxiliary results from the literature and the rest of the proofs, respectively.

2. Setup and main result

This section explains our setup and states the main results.

2.1. Notations

We will use the following notations throughout the paper. The space of real-valued $p \times p$ matrices is denoted by $\mathbb{R}^{p \times p}$, while $\mathbb{S}^p_+$ means the set of positive-semidefinite matrices. We write $I_d$ for the identity matrix of size $d \times d$; $\mathrm{r}(A)$ and $\mathrm{Tr}(B)$ stand for the rank of a matrix $A$ and the trace of a square matrix $B$. Further, $\|A\|_\infty$ stands for the spectral norm of a matrix $A$, while $\|A\|_1$ means the nuclear norm. The Frobenius scalar product of two matrices $A$ and $B$ of the same size is $\langle A, B \rangle_2 \stackrel{\mathrm{def}}{=} \mathrm{Tr}(A^\top B)$, while the Frobenius norm is denoted by $\|A\|_2$. When applied to a vector, $\|\cdot\|$ means just its Euclidean norm. The effective rank of a square matrix $B$ is defined by $\mathrm{er}(B) \stackrel{\mathrm{def}}{=} \frac{\mathrm{Tr}(B)}{\|B\|_\infty}$. The relation $a \lesssim b$ means that there exists an absolute constant $C$, different from line to line, such that $a \le C b$, while $a \asymp b$ means that $a \lesssim b$ and $b \lesssim a$. By $a \vee b$ and $a \wedge b$ we mean the maximum and minimum of $a$ and $b$, respectively. In the sequel we will often be considering intersections of events of probability greater than $1 - \frac{1}{n}$. Without loss of generality, we will write that the probability measure of such an intersection is $1 - \frac{1}{n}$, since this can be easily achieved by adjusting constants. We write $\eta_n \xrightarrow{P} \eta$ for convergence in $P$-probability of random elements $\eta_n$ to some random element $\eta$. If $\eta_n \xrightarrow{P} 0$, we will also use the notation $\eta_n = o_P(1)$. Throughout the paper we will assume that $p < n$.

2.2. Setup

Without loss of generality, we can assume that $\Sigma^* \in \mathbb{S}^p_+$ is invertible (otherwise one can easily transform the data in such a way that the covariance matrix of the transformed data is invertible). Therefore, when necessary, we can work in terms of the precision matrix $\Omega$, which is the inverse of a covariance matrix $\Sigma$. We do not need the assumption of Gaussianity of the data. The only condition that our main result requires from the underlying distribution of the independent random vectors $X^n = (X_1, \dots, X_n)$ is the concentration of the sample covariance matrix $\widehat{\Sigma}$ around the true covariance $\Sigma^*$:
\[
\|\widehat{\Sigma} - \Sigma^*\|_\infty \le \widehat{\delta}_n \|\Sigma^*\|_\infty
\tag{2.1}
\]
with probability $1 - 1/n$. Clearly, the bound $\widehat{\delta}_n$ from the condition can vary for different distributions of the data, but it allows one to work with much wider classes of probability measures than just Gaussian or sub-Gaussian ones. For instance, in the Gaussian case one may take
\[
\widehat{\delta}_n \asymp \sqrt{\frac{\mathrm{er}(\Sigma^*)}{n}} \vee \sqrt{\frac{\log(n)}{n}}.
\]
Theorem A.1 from Appendix A provides a few more examples of possible distributions and the corresponding $\widehat{\delta}_n$ for them. So, throughout the rest of the paper we assume that the data satisfy condition (2.1).
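For illustration, the following minimal sketch (not part of the paper; the spectrum, dimension and sample size below are arbitrary assumptions) computes the sample covariance, the effective rank $\mathrm{er}(\Sigma^*)$, and the Gaussian-case bound $\widehat{\delta}_n$, and compares the two sides of (2.1) on simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 20                        # assumed sample size and dimension
evals = 1.0 / np.arange(1, p + 1)      # assumed spectrum of the true covariance
Sigma_star = np.diag(evals)

X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sigma_hat = X.T @ X / n                # sample covariance (zero-mean model)

def spectral_norm(A):
    return np.linalg.norm(A, 2)        # largest singular value

er = np.trace(Sigma_star) / spectral_norm(Sigma_star)       # effective rank er(Sigma*)
delta_n = max(np.sqrt(er / n), np.sqrt(np.log(n) / n))      # Gaussian-case rate, up to a constant

lhs = spectral_norm(Sigma_hat - Sigma_star)
rhs = delta_n * spectral_norm(Sigma_star)
print(f"er(Sigma*) = {er:.2f}, ||Sigma_hat - Sigma*|| = {lhs:.4f}, delta_n * ||Sigma*|| = {rhs:.4f}")
```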

2.3. Bayesian framework and main result

In the Bayesian framework one imposes a prior distribution $\Pi$ on the covariance matrix $\Sigma$. Even though our data are not necessarily Gaussian, we can consider the Gaussian log-likelihood:
\[
l_n(\Sigma) = -\frac{n}{2} \log \det(\Sigma) - \frac{n}{2} \mathrm{Tr}\bigl( \Sigma^{-1} \widehat{\Sigma} \bigr) - \frac{np}{2} \log(2\pi),
\]
where further we omit the additive constant term that does not affect the analysis. The posterior measure of a set $A \subset \mathbb{S}^p_+$ can be expressed as
\[
\Pi\bigl( A \mid X^n \bigr) = \frac{\int_A \exp(l_n(\Sigma))\, d\Pi(\Sigma)}{\int_{\mathbb{S}^p_+} \exp(l_n(\Sigma))\, d\Pi(\Sigma)}.
\]
As the Gaussian log-likelihood $l_n(\Sigma)$ does not necessarily correspond to the true distribution of our data, we call the random measure $\Pi(\cdot \mid X^n)$ a quasi-posterior. Once a prior is fixed, we can easily sample matrices $\Sigma$ from this quasi-posterior distribution.
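To make the conjugacy argument concrete, here is a minimal sketch of sampling from the quasi-posterior when the prior is Inverse Wishart: as used in Section 4.2, the quasi-posterior is then $IW_p(G + n\widehat{\Sigma},\ n + p + b - 1)$. The sketch assumes that scipy's inverse-Wishart parametrization matches the degrees-of-freedom/scale convention used here; the choices of $G$, $b$ and the placeholder data are arbitrary.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
n, p, b = 2000, 20, 2                  # assumed sample size, dimension, prior parameter
G = np.eye(p)                          # assumed prior scale matrix

X = rng.standard_normal((n, p))        # placeholder zero-mean data
Sigma_hat = X.T @ X / n

# Conjugacy: prior IW_p(G, p+b-1) combined with the Gaussian quasi-likelihood
# gives the quasi-posterior IW_p(G + n*Sigma_hat, n + p + b - 1).
posterior = invwishart(df=n + p + b - 1, scale=G + n * Sigma_hat)

Sigma_samples = posterior.rvs(size=1000, random_state=rng)   # draws from the quasi-posterior
print(Sigma_samples.shape)             # (1000, p, p)
```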

Our technique relies on two crucial concepts. The first one is posterior contraction. Define the following $\delta$-vicinity of $\Sigma^*$:
\[
B(\delta) \stackrel{\mathrm{def}}{=} \bigl\{ \Sigma \in \mathbb{S}^p_+ : \|\Sigma - \Sigma^*\|_\infty \le \delta \|\Sigma^*\|_\infty \bigr\}.
\]
Then we can find a radius $\delta_n$ such that the following posterior contraction condition is fulfilled:
\[
\Pi\bigl( B(\delta_n) \mid X^n \bigr) \ge 1 - \frac{1}{n}
\tag{2.2}
\]
with probability $1 - 1/n$. The second concept that we introduce is the "flatness" of the prior, defined as
\[
\rho(\delta) \stackrel{\mathrm{def}}{=} \sup_{\Sigma \in B(\delta)} \left| \frac{\Pi(\Sigma)}{\Pi(\Sigma^*)} - 1 \right|.
\tag{2.3}
\]
We will extensively use the conjugate prior to the multivariate Gaussian distribution, that is, the Inverse Wishart distribution $IW_p(G, p + b - 1)$ with $G \in \mathbb{S}^p_+$ and $0 < b \lesssim p$. Its density is given by
\[
\frac{d\Pi^W(\Sigma)}{d\Sigma} \propto \exp\left( -\frac{2p + b}{2} \log \det(\Sigma) - \frac{1}{2} \mathrm{Tr}(G \Sigma^{-1}) \right).
\]
The following two lemmas state the posterior contraction and a bound for the flatness of this special prior, respectively.

Lemma 2.1. Let the prior $\Pi^W$ be given by the Inverse Wishart distribution $IW_p(G, p + b - 1)$. Define
\[
\delta_n^W \asymp \sqrt{\frac{\log(n) + p}{n}} + \widehat{\delta}_n + \frac{\|G\|_\infty}{n \|\Sigma^*\|_\infty}.
\tag{2.4}
\]
Then
\[
\Pi^W\bigl( B(\delta_n^W) \mid X^n \bigr) \ge 1 - \frac{1}{n}
\]
with probability $1 - 1/n$.

Lemma 2.2. Let $\rho^W(\delta)$ be the flatness parameter of the prior $\Pi^W$ given by the Inverse Wishart distribution $IW_p(G, p + b - 1)$. Then the following bound holds:
\[
\rho^W(\delta) \lesssim \delta \bigl( p \|\Omega^*\|_1 + \|G\|_1 \|\Omega^*\|_\infty^2 \bigr).
\]
The proofs are postponed to Appendix B.
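As a quick numerical illustration (not from the paper; the spectrum, $G$, $b$ and sample size below are arbitrary assumptions), one can plug a configuration into the contraction radius (2.4) and into the flatness bound of Lemma 2.2 to gauge how small the prior influence is, up to absolute constants.

```python
import numpy as np

n, p, b = 1_000_000, 5, 2
evals = np.linspace(2.0, 1.0, p)           # assumed well-conditioned spectrum of Sigma*
Sigma_star = np.diag(evals)
Omega_star = np.diag(1.0 / evals)           # precision matrix
G = np.eye(p)                               # assumed prior scale

spec = lambda A: np.linalg.norm(A, 2)                       # spectral norm
nuclear = lambda A: np.abs(np.linalg.eigvalsh(A)).sum()     # nuclear norm (symmetric A)

# Gaussian-case rate from (2.1), up to constants
delta_hat = max(np.sqrt(np.trace(Sigma_star) / spec(Sigma_star) / n), np.sqrt(np.log(n) / n))

# Contraction radius (2.4), up to constants
delta_W = np.sqrt((np.log(n) + p) / n) + delta_hat + spec(G) / (n * spec(Sigma_star))

# Flatness bound of Lemma 2.2, up to constants
rho_W = delta_W * (p * nuclear(Omega_star) + nuclear(G) * spec(Omega_star) ** 2)
print(f"delta_n^W ~ {delta_W:.4f},  flatness bound ~ {rho_W:.4f}")
```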

The following result shows that the posterior distribution is approximately the same for sufficiently flat priors satisfying the posterior contraction condition.

Theorem 2.3. Assume the distribution of the data $X^n = (X_1, \dots, X_n)$ fulfills the sample covariance concentration property (2.1). Consider an arbitrary prior $\Pi$ and the prior $\Pi^W$ given by the Inverse Wishart distribution $IW_p(G, p + b - 1)$. Let $\rho, \rho^W$ be their flatnesses defined by (2.3). Define also
\[
\bar{\delta}_n = \delta_n \vee \delta_n^W
\]
with $\delta_n$ from (2.2) and $\delta_n^W$ from (2.4). Then the following holds with probability $1 - 1/n$:
\[
\sup_{A \subset \mathbb{S}^p_+} \bigl| \Pi\bigl( A \mid X^n \bigr) - \Pi^W\bigl( A \mid X^n \bigr) \bigr| \lesssim \Diamond^*,
\]
where
\[
\Diamond^* \stackrel{\mathrm{def}}{=} \rho(\bar{\delta}_n) + \rho^W(\bar{\delta}_n) + \frac{1}{n}.
\tag{2.5}
\]

3. Applications

Before considering particular examples, let us introduce some additional notations concerning the true covariance $\Sigma^*$.

Let $\sigma_1^* \ge \dots \ge \sigma_p^*$ be the ordered eigenvalues of $\Sigma^*$. Suppose that among them there are $q$ distinct eigenvalues $\mu_1^* > \dots > \mu_q^*$. Introduce the groups of indices $\Delta_r^* = \{ j : \mu_r^* = \sigma_j^* \}$ and denote by $m_r^*$ the multiplicity factor (dimension) $|\Delta_r^*|$ for all $r \in \overline{1, q}$. The corresponding eigenvectors are denoted as $u_1^*, \dots, u_p^*$. We will use the projector on the $r$-th eigenspace of dimension $m_r^*$:
\[
P_r^* = \sum_{j \in \Delta_r^*} u_j^* {u_j^*}^\top
\]
and the eigendecomposition
\[
\Sigma^* = \sum_{j=1}^p \sigma_j^* u_j^* {u_j^*}^\top = \sum_{r=1}^q \mu_r^* \Bigl( \sum_{j \in \Delta_r^*} u_j^* {u_j^*}^\top \Bigr) = \sum_{r=1}^q \mu_r^* P_r^*.
\]
We also introduce the spectral gaps $g_r^*$:
\[
g_r^* =
\begin{cases}
\mu_1^* - \mu_2^*, & r = 1, \\
(\mu_{r-1}^* - \mu_r^*) \wedge (\mu_r^* - \mu_{r+1}^*), & r \in \overline{2, q-1}, \\
\mu_{q-1}^* - \mu_q^*, & r = q.
\end{cases}
\]
Similarly, suppose that $\widehat{\Sigma}$ has $p$ (distinct with probability one) eigenvalues $\widehat{\sigma}_1 > \dots > \widehat{\sigma}_p$. The corresponding eigenvectors are denoted as $\widehat{u}_1, \dots, \widehat{u}_p$. Suppose that $\|\widehat{\Sigma} - \Sigma^*\|_\infty \le \frac{1}{4} \min_{r \in \overline{1, q}} g_r^*$. Then, as shown in [18], we can identify the clusters of the eigenvalues of $\widehat{\Sigma}$ corresponding to each eigenvalue of $\Sigma^*$ and therefore determine $\Delta_r^*$ and $m_r^*$ for all $r \in \overline{1, q}$, so further we assume that they are known.
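A minimal sketch of this identification step (an illustration that assumes the number $q$ of distinct eigenvalues is known; the gap-based grouping rule below is one simple heuristic, not the procedure of [18]):

```python
import numpy as np

def eigen_clusters(Sigma_hat, q):
    """Group the eigenvalues of Sigma_hat into q clusters by cutting
    at the q-1 largest gaps between consecutive sorted eigenvalues."""
    evals, evecs = np.linalg.eigh(Sigma_hat)          # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order
    gaps = evals[:-1] - evals[1:]
    cuts = np.sort(np.argsort(gaps)[-(q - 1):]) + 1   # positions where a new cluster starts
    blocks = np.split(np.arange(len(evals)), cuts)    # index sets Delta_r
    return evals, evecs, blocks

# Example: a sample covariance whose population spectrum has 3 distinct levels
rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(6), np.diag([5, 5, 2, 2, 2, 1]), size=5000)
evals, evecs, blocks = eigen_clusters(X.T @ X / 5000, q=3)
print([list(b) for b in blocks])       # expected: [[0, 1], [2, 3, 4], [5]]
```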

3.1. Functionals of covariance matrix

Represent a linear functional $\phi(\cdot)$ as
\[
\phi(\widetilde{\Sigma}) - \phi(\Sigma^*) = \mathrm{Tr}\bigl[ (\widetilde{\Sigma} - \Sigma^*) \Phi \bigr] + \varepsilon(\widetilde{\Sigma}, \Sigma^*),
\tag{3.1}
\]
where we suppose there exists a symmetric matrix $\Phi \in \mathbb{R}^{p \times p}$ such that the residual $\varepsilon(\widetilde{\Sigma}, \Sigma^*)$ is bounded in the following way:
\[
|\varepsilon(\widetilde{\Sigma}, \Sigma^*)| \le C_\phi(\Sigma^*)\, \|\widetilde{\Sigma} - \Sigma^*\|_\infty^2,
\tag{3.2}
\]
where the constant $C_\phi(\Sigma^*)$ is different for different functionals and depends only on $\Sigma^*$. So, in some sense $\phi(\cdot)$ is an "approximately linear" functional. While entries $\Sigma_{ij}$ or quadratic forms $v^\top \Sigma v$ of the covariance matrix are examples of linear functionals with $C_\phi(\Sigma^*) = 0$, one particular example of an approximately linear functional is an eigenvalue of the covariance matrix. Define the functional as
\[
\phi(\widetilde{\Sigma}) = \frac{1}{m_r^*} \sum_{j \in \Delta_r^*} \widetilde{\sigma}_j.
\]
Clearly, $\phi(\Sigma^*) = \mu_r^*$ and its natural estimate is
\[
\phi(\widehat{\Sigma}) \stackrel{\mathrm{def}}{=} \widehat{\mu}_r = \frac{1}{m_r^*} \sum_{j \in \Delta_r^*} \widehat{\sigma}_j.
\]

The next lemma shows that the functional defined in this way is an approximately linear functional of the covariance matrix.

Lemma 3.1. Assume $\|\widetilde{\Sigma} - \Sigma^*\|_\infty \le \frac{g_r^*}{2e^2}$. Then the following bound for the first-order approximation takes place:
\[
\Bigl| \widetilde{\mu}_r - \mu_r^* - \mathrm{Tr}\Bigl[ (\widetilde{\Sigma} - \Sigma^*) \frac{P_r^*}{m_r^*} \Bigr] \Bigr| \le \frac{2e^2}{g_r^*} \|\widetilde{\Sigma} - \Sigma^*\|_\infty^2,
\]
or, in other terms, introducing $\Phi = \frac{P_r^*}{m_r^*}$,
\[
|\varepsilon(\widetilde{\Sigma}, \Sigma^*)| = \Bigl| \phi(\widetilde{\Sigma}) - \phi(\Sigma^*) - \mathrm{Tr}\bigl[ (\widetilde{\Sigma} - \Sigma^*) \Phi \bigr] \Bigr| \le \frac{2e^2}{g_r^*} \|\widetilde{\Sigma} - \Sigma^*\|_\infty^2.
\]
So, for eigenvalues the assumption (3.2) is fulfilled with $C_\phi(\Sigma^*) = 2e^2/g_r^*$. We omit the proof of this result and refer to [17] for the details on perturbation theory for eigenvalues.

Let us continue with an arbitrary functional $\phi(\cdot)$ satisfying (3.1), (3.2), but keeping this example with eigenvalues in mind. To derive the finite sample BvM theorem for functionals of the covariance matrix, we apply our general strategy described in the previous section. We first state the result for the Inverse Wishart prior.

Theorem 3.2. Assume the distribution of the data $X^n = (X_1, \dots, X_n)$ fulfills the sample covariance concentration property (2.1). Consider the prior $\Pi^W$ given by the Inverse Wishart distribution $IW_p(G, p + b - 1)$. Let $\zeta \sim N(0, 1)$. Then with probability $1 - \frac{1}{n}$
\[
\sup_{x \in \mathbb{R}} \left| \Pi^W\left( \frac{\sqrt{n}\,\bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr)}{\sqrt{2}\,\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2} \le x \,\Big|\, X^n \right) - P(\zeta \le x) \right| \lesssim \Diamond_\phi,
\]
where
\[
\Diamond_\phi \stackrel{\mathrm{def}}{=} \Diamond_1 + \Diamond_2 + \Diamond_3 + \Diamond_4.
\tag{3.3}
\]
The terms $\Diamond_1$ through $\Diamond_4$ can be described as
\[
\Diamond_1 \stackrel{\mathrm{def}}{=} \frac{C_\phi(\Sigma^*)\,\|\Sigma^*\|_\infty^2}{\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2}\, \sqrt{n}\,\bigl( (\delta_n^W)^2 + \widehat{\delta}_n^{\,2} \bigr),
\qquad
\Diamond_2 \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{n}} \cdot \frac{\|\Phi\|_1 \bigl\{ (\log(n) + p)\|\Sigma^*\|_\infty + \|G\|_\infty \bigr\}}{\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2},
\]
\[
\Diamond_3 \stackrel{\mathrm{def}}{=} \frac{\mathrm{r}_3(\Phi)}{\sqrt{n}},
\qquad
\Diamond_4 \stackrel{\mathrm{def}}{=} \widehat{\delta}_n\, \frac{\|\Sigma^*\|_\infty \cdot \|\Phi \Sigma^* \Phi\|_1}{\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2^2} + \frac{p}{n}
\]
with $\widehat{\delta}_n$ and $\delta_n^W$ from (2.1) and (2.4), respectively.

Then we just exploit our main result, Theorem 2.3, to derive the extended version of Theorem 3.2 for an arbitrary prior.

Theorem 3.3. Assume the distribution of the data $X^n = (X_1, \dots, X_n)$ fulfills the sample covariance concentration property (2.1). Let $\zeta \sim N(0, 1)$. Then with probability $1 - \frac{1}{n}$
\[
\sup_{x \in \mathbb{R}} \left| \Pi\left( \frac{\sqrt{n}\,\bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr)}{\sqrt{2}\,\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2} \le x \,\Big|\, X^n \right) - P(\zeta \le x) \right| \lesssim \Diamond_\phi + \Diamond^*,
\]
where $\Diamond_\phi$ and $\Diamond^*$ are defined in (3.3) and (2.5), respectively.
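A hedged numerical sketch of what Theorem 3.3 predicts for the eigenvalue-group functional of Lemma 3.1 (the spectrum, prior, and sample size are an assumed toy configuration, not from the paper): posterior draws of $\sqrt{n}(\phi(\Sigma) - \phi(\widehat{\Sigma}))$ are compared with the Gaussian limit $N(0,\ 2\|\Sigma^{*1/2}\Phi\Sigma^{*1/2}\|_2^2)$.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(3)
n, p, b = 5000, 6, 2
mu = np.array([5.0, 5.0, 2.0, 2.0, 2.0, 1.0])      # assumed spectrum; target group: the two top eigenvalues
Sigma_star = np.diag(mu)
idx = [0, 1]                                        # Delta_r*, with m_r* = 2
Phi = np.zeros((p, p)); Phi[idx, idx] = 1.0 / len(idx)   # Phi = P_r* / m_r*

X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sigma_hat = X.T @ X / n
phi_hat = np.linalg.eigvalsh(Sigma_hat)[::-1][idx].mean()

# Inverse Wishart quasi-posterior (conjugacy), as in Section 4.2
posterior = invwishart(df=n + p + b - 1, scale=np.eye(p) + n * Sigma_hat)
draws = posterior.rvs(size=2000, random_state=rng)
phi_post = np.array([np.linalg.eigvalsh(S)[::-1][idx].mean() for S in draws])

stat = np.sqrt(n) * (phi_post - phi_hat)
# sqrt(2)*||Sigma*^{1/2} Phi Sigma*^{1/2}||_2; here Phi and Sigma* commute, so this equals sqrt(2)*||Sigma* Phi||_2
sd_limit = np.sqrt(2) * np.linalg.norm(Sigma_star @ Phi, 'fro')
print("posterior sd:", stat.std(), " BvM limit sd:", sd_limit)
```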

Together with classical results on the approximate normality of functionals of the covariance matrix, these theorems provide a procedure for construction of credible sets with frequentist coverage guarantees for the true value of the functional at $\Sigma^*$.

Next, we compare our theorem with the asymptotic functional BvM presented in [32]. In that work, a prior distribution is imposed on the precision matrix $\Omega = \Sigma^{-1}$. Assume that there exist a set $A_n$ and a small value $\delta_n = o(1)$ such that $A_n \subset \{ \|\Sigma - \Sigma^*\|_\infty \le \delta_n \|\Sigma^*\|_\infty \}$, and let the functional $\phi(\cdot)$ be approximately linear on this set, that is, there exists a symmetric matrix $\Phi \in \mathbb{R}^{p \times p}$ such that
\[
\sup_{\Sigma \in A_n} \sqrt{n}\, \frac{\Bigl| \phi(\Sigma) - \phi(\widehat{\Sigma}) - \mathrm{Tr}\bigl[ (\Sigma - \widehat{\Sigma}) \Phi \bigr] \Bigr|}{\|\Sigma^{*1/2} \Phi\, \Sigma^{*1/2}\|_2} = o_P(1).
\tag{3.4}
\]
The following result takes place.

Theorem 3.4 ([32], Theorem 2.1). Under the assumption (3.4) and $\|\Sigma^*\| \vee \|\Omega^*\| = O(1)$, suppose that for a given prior $\Pi$ the following two conditions are satisfied:

1. $\Pi\bigl( A_n \mid X^n \bigr) = 1 - o_P(1)$.
2. For any fixed $t \in \mathbb{R}$,
\[
\frac{\int_{A_n} \exp(l_n(\Omega_t))\, d\Pi(\Omega)}{\int_{A_n} \exp(l_n(\Omega))\, d\Pi(\Omega)} = 1 + o_P(1)
\]
holds for the perturbed precision matrix
\[
\Omega_t = \Omega + \frac{\sqrt{2}\, t}{\sqrt{n}\, \|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2}\, \Phi.
\]
Then
\[
\sup_{x \in \mathbb{R}} \left| \Pi\left( \frac{\sqrt{n}\,\bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr)}{\sqrt{2}\,\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2} \le x \,\Big|\, X^n \right) - P(\zeta \le x) \right| = o_P(1),
\]
where $\zeta \sim N(0, 1)$.

The proof of this result is based on the Laplace transform approach. Even though [32] assumes the data $X_1, \dots, X_n$ to be Gaussian, the result can be extended in a straightforward way to non-Gaussian data. Nevertheless, this technique does not allow one to obtain a finite sample result: even if we know the rate of convergence of the exponential moments, it cannot provide any guarantee on the rate of convergence of distributions. Moreover, the second condition on the prior may not be easy to verify.

3.2. Spectral projectors of covariance matrix

First, define the sample projector on the $r$-th eigenspace of dimension $m_r^*$:
\[
\widehat{P}_r = \sum_{j \in \Delta_r^*} \widehat{u}_j \widehat{u}_j^\top.
\]
More generally, pick a block of eigenspaces corresponding to an interval $J$ in $\{1, \dots, q\}$ from $r_-$ to $r_+$:
\[
J = \{ r_-, r_- + 1, \dots, r_+ \}.
\]

Define also the subset of indices
\[
I_J \stackrel{\mathrm{def}}{=} \bigl\{ k : k \in \Delta_r^*,\ r \in J \bigr\},
\]
and introduce the projector onto the direct sum of the eigenspaces associated with $P_r^*$ for all $r \in J$:
\[
P_J^* \stackrel{\mathrm{def}}{=} \sum_{r \in J} P_r^* = \sum_{k \in I_J} u_k^* {u_k^*}^\top.
\]
Its empirical counterpart is given by
\[
\widehat{P}_J \stackrel{\mathrm{def}}{=} \sum_{r \in J} \widehat{P}_r = \sum_{k \in I_J} \widehat{u}_k \widehat{u}_k^\top.
\]
For instance, when $J = \{1, \dots, q_{\mathrm{eff}}\}$ for some $q_{\mathrm{eff}} < q$, then $\widehat{P}_J$ is exactly what is recovered by PCA.

The projector dimension for $J$ is given by $m_J^* = \sum_{r \in J} m_r^*$. Its spectral gap can be defined as
\[
g_J^* \stackrel{\mathrm{def}}{=}
\begin{cases}
\mu_{r_+}^* - \mu_{r_+ + 1}^*, & \text{if } r_- = 1; \\
\mu_{r_- - 1}^* - \mu_{r_-}^*, & \text{if } r_+ = q; \\
\bigl( \mu_{r_- - 1}^* - \mu_{r_-}^* \bigr) \wedge \bigl( \mu_{r_+}^* - \mu_{r_+ + 1}^* \bigr), & \text{otherwise.}
\end{cases}
\]
Define also
\[
l_J^* = \mu_{r_-}^* - \mu_{r_+}^*.
\]
For $\Sigma$ generated by $\Pi$ we can introduce $P_J$ similarly. To describe the posterior distribution of $\|P_J - \widehat{P}_J\|_2^2$, introduce the following matrix $\Gamma_J^*$ of size $m_J^*(p - m_J^*) \times m_J^*(p - m_J^*)$:
\[
\Gamma_J^* \stackrel{\mathrm{def}}{=} \mathrm{diag}\bigl( \Gamma_r^J \bigr)_{r \in J},
\qquad
\Gamma_r^J \stackrel{\mathrm{def}}{=} \mathrm{diag}\bigl( \Gamma_{r,s} \bigr)_{s \notin J},
\qquad
\Gamma_{r,s} \stackrel{\mathrm{def}}{=} \frac{2 \mu_r^* \mu_s^*}{(\mu_r^* - \mu_s^*)^2} \cdot I_{m_r^* m_s^*},
\quad r \in J,\ s \notin J.
\tag{3.5}
\]
Before formulating the result for an arbitrary prior, we consider the Inverse Wishart prior.


Theorem 3.5 ([27], Theorem 2.1). Assume that the distribution of the data $X^n = (X_1, \dots, X_n)$ fulfills the sample covariance concentration property (2.1). Consider the prior $\Pi^W$ given by the Inverse Wishart distribution $IW_p(G, p + b - 1)$. Let $\xi \sim N(0, \Gamma_J^*)$ with $\Gamma_J^*$ defined by (3.5). Then with probability $1 - \frac{1}{n}$
\[
\sup_{x \in \mathbb{R}} \Bigl| \Pi^W\bigl( n \|P_J - \widehat{P}_J\|_2^2 \le x \mid X^n \bigr) - P\bigl( \|\xi\|^2 \le x \bigr) \Bigr| \lesssim \Diamond_P,
\]
where
\[
\Diamond_P \stackrel{\mathrm{def}}{=} \frac{\Diamond_1 + \Diamond_2 + \Diamond_3}{\sqrt{\lambda_1(\Gamma_J^*)\, \lambda_2(\Gamma_J^*)}} + \frac{1}{n}.
\tag{3.6}
\]

The terms $\Diamond_1$ through $\Diamond_3$ can be described as
\[
\Diamond_1 \stackrel{\mathrm{def}}{=} (\log(n) + p)\, \frac{\sqrt{m_J^*}\, \|\Sigma^*\|_\infty^2}{g_J^*} \left\{ 1 + \frac{l_J^* \vee \bigl( \|\Sigma^*\|_2 + \|G\|_2 \bigr)}{g_J^*} \right\} \times \sqrt{\frac{\mathrm{Tr}(\Sigma^*)}{g_J^{*\,2}}\, \frac{\log(n) + p}{n}},
\]
\[
\Diamond_2 \stackrel{\mathrm{def}}{=} \sqrt{m_J^*}\, \widehat{\delta}_n\, \frac{\|\Sigma^*\|_\infty^2 \bigl\{ m_J^* \|\Sigma^*\|_\infty \wedge \mathrm{Tr}(\Sigma^*) \bigr\}}{g_J^{*\,3}} + \frac{p}{n},
\qquad
\Diamond_3 \stackrel{\mathrm{def}}{=} \frac{m_J^{*\,3/2}\, \|\Sigma^*\|_\infty\, \mathrm{Tr}(\Sigma^*)}{g_J^{*\,2}} \sqrt{\frac{\log(n)}{n}}
\]
with $\widehat{\delta}_n$ from (2.1).

Then, from Theorem 3.5 and Theorem 2.3 we derive the following result.

Theorem 3.6. Assume the distribution of the data $X^n = (X_1, \dots, X_n)$ fulfills the sample covariance concentration property (2.1). Let $\xi \sim N(0, \Gamma_J^*)$ with $\Gamma_J^*$ defined by (3.5). Then with probability $1 - \frac{1}{n}$
\[
\sup_{x \in \mathbb{R}} \Bigl| \Pi\bigl( n \|P_J - \widehat{P}_J\|_2^2 \le x \mid X^n \bigr) - P\bigl( \|\xi\|^2 \le x \bigr) \Bigr| \lesssim \Diamond_P + \Diamond^*,
\]
where $\Diamond_P$ and $\Diamond^*$ are defined in (3.6) and (2.5), respectively.

As well as for functionals of the covariance matrix, this result can be applied for building sharp elliptic confidence sets for the true projector $P_J^*$. See again [27], Corollary 2.3 for the detailed description of the procedure.
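The following sketch (a toy illustration with an assumed spectrum and Inverse Wishart prior; it is not the calibrated procedure of [27]) shows how posterior samples of $n\|P_J - \widehat{P}_J\|_2^2$ could be used to read off a credible bound for the squared Frobenius distance.

```python
import numpy as np
from scipy.stats import invwishart

def top_projector(S, m):
    """Projector onto the span of the m leading eigenvectors of S."""
    _, vecs = np.linalg.eigh(S)
    U = vecs[:, -m:]
    return U @ U.T

rng = np.random.default_rng(4)
n, p, b, m_J = 5000, 6, 2, 2                      # J: the two-dimensional top eigenspace
Sigma_star = np.diag([5.0, 5.0, 2.0, 2.0, 2.0, 1.0])

X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sigma_hat = X.T @ X / n
P_hat = top_projector(Sigma_hat, m_J)

posterior = invwishart(df=n + p + b - 1, scale=np.eye(p) + n * Sigma_hat)
draws = posterior.rvs(size=2000, random_state=rng)
stats = np.array([n * np.linalg.norm(top_projector(S, m_J) - P_hat, 'fro') ** 2 for S in draws])

q95 = np.quantile(stats, 0.95)                    # 95% credible bound for n * ||P_J - P_hat_J||_2^2
covered = n * np.linalg.norm(top_projector(Sigma_star, m_J) - P_hat, 'fro') ** 2 <= q95
print(f"95% credible bound: {q95:.2f}, true projector covered: {covered}")
```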


4. Main proofs

4.1. Proof of Theorem 2.3

Step 1: "Localization".

Fix an arbitrary prior $\Pi$ (in particular, we may take $\Pi^W$). The posterior measure of a set $A \subset \mathbb{S}^p_+$ is
\[
\Pi(A \mid X^n) = \frac{\int_A \exp(l_n(\Sigma))\, \Pi(\Sigma)\, d\Sigma}{\int_{\mathbb{S}^p_+} \exp(l_n(\Sigma))\, \Pi(\Sigma)\, d\Sigma} = \frac{\tau(A)}{\tau(\mathbb{S}^p_+)},
\]
where for shortness we introduce
\[
\tau(A) \stackrel{\mathrm{def}}{=} \int_A \exp(l_n(\Sigma))\, \Pi(\Sigma)\, d\Sigma.
\]
Observe that due to (2.2), since $\bar{\delta}_n \ge \delta_n$, we have
\[
\frac{\tau(B(\bar{\delta}_n))}{\tau(\mathbb{S}^p_+)} \ge 1 - \frac{1}{n}.
\tag{4.1}
\]
Consider the "localized" posterior defined by
\[
\Pi_{\mathrm{loc}}(A \mid X^n) \stackrel{\mathrm{def}}{=} \frac{\tau(A \cap B(\bar{\delta}_n))}{\tau(B(\bar{\delta}_n))} = \frac{\int_{A \cap B(\bar{\delta}_n)} \exp(l_n(\Sigma))\, \Pi(\Sigma)\, d\Sigma}{\int_{B(\bar{\delta}_n)} \exp(l_n(\Sigma))\, \Pi(\Sigma)\, d\Sigma}.
\]
It straightforwardly follows from (4.1) that
\[
\sup_{A \subset \mathbb{S}^p_+} \bigl| \Pi_{\mathrm{loc}}(A \mid X^n) - \Pi(A \mid X^n) \bigr| \le \frac{2}{n}
\tag{4.2}
\]
with probability $1 - 1/n$. In particular, Lemma 2.1 and the fact that $\bar{\delta}_n \ge \delta_n^W$ imply a similar bound for the Inverse Wishart prior:
\[
\sup_{A \subset \mathbb{S}^p_+} \bigl| \Pi^W_{\mathrm{loc}}(A \mid X^n) - \Pi^W(A \mid X^n) \bigr| \le \frac{2}{n}.
\tag{4.3}
\]
Step 2: "Flatness".

Consider the uniform prior $\Pi^U$ over $B(\bar{\delta}_n)$, which serves as a bridge between the priors $\Pi$ and $\Pi^W$. The corresponding posterior is given by
\[
\Pi^U(A \mid X^n) = \frac{\int_{A \cap B(\bar{\delta}_n)} \exp(l_n(\Sigma))\, d\Sigma}{\int_{B(\bar{\delta}_n)} \exp(l_n(\Sigma))\, d\Sigma}.
\]
Recalling the definition of flatness (2.3), it is easy to show that
\[
\sup_{A \subset \mathbb{S}^p_+} \bigl| \Pi_{\mathrm{loc}}(A \mid X^n) - \Pi^U(A \mid X^n) \bigr| \le 4\, \rho(\bar{\delta}_n)
\tag{4.4}
\]
with probability one. The same applies to the Inverse Wishart prior:
\[
\sup_{A \subset \mathbb{S}^p_+} \bigl| \Pi^W_{\mathrm{loc}}(A \mid X^n) - \Pi^U(A \mid X^n) \bigr| \le 4\, \rho^W(\bar{\delta}_n),
\tag{4.5}
\]

where the flatness of the Inverse Wishart prior is bounded in Lemma 2.2. Applying the triangle inequality to (4.2), (4.3), (4.4) and (4.5), we derive the desired result.

4.2. Proof of Theorem 3.2

Write the representation (3.1) in two ways:
\[
\phi(\Sigma) - \phi(\Sigma^*) = \mathrm{Tr}[\Phi(\Sigma - \Sigma^*)] + \varepsilon(\Sigma, \Sigma^*)
\]
and
\[
\phi(\widehat{\Sigma}) - \phi(\Sigma^*) = \mathrm{Tr}[\Phi(\widehat{\Sigma} - \Sigma^*)] + \varepsilon(\widehat{\Sigma}, \Sigma^*).
\]
Thus, subtracting the latter equality from the former,
\[
\phi(\Sigma) - \phi(\widehat{\Sigma}) = \mathrm{Tr}[\Phi(\Sigma - \widehat{\Sigma})] + \varepsilon(\Sigma, \Sigma^*) - \varepsilon(\widehat{\Sigma}, \Sigma^*).
\]
Define
\[
\Delta_1 \stackrel{\mathrm{def}}{=} \sqrt{n}\, (\delta_n^W)^2\, C_\phi(\Sigma^*)\, \|\Sigma^*\|_\infty^2,
\tag{4.6}
\]
\[
\Delta_2 \stackrel{\mathrm{def}}{=} \sqrt{n}\, \widehat{\delta}_n^{\,2}\, C_\phi(\Sigma^*)\, \|\Sigma^*\|_\infty^2
\tag{4.7}
\]
with $\widehat{\delta}_n$ and $\delta_n^W$ from (2.1) and (2.4), respectively. Then the assumption (3.2) yields
\[
P\bigl( \sqrt{n}\, |\varepsilon(\widehat{\Sigma}, \Sigma^*)| \ge \Delta_2 \bigr) \le \frac{1}{n}
\tag{4.8}
\]
and, with probability $1 - \frac{1}{n}$,
\[
\Pi^W\bigl( \sqrt{n}\, |\varepsilon(\Sigma, \Sigma^*)| \ge \Delta_1 \mid X^n \bigr) \le \frac{1}{n}.
\tag{4.9}
\]
Further, in order to work with the main linear part of the above display, we elaborate on $\Sigma$. Since $\Sigma \mid X^n \sim IW\bigl( G + n\widehat{\Sigma},\ n + p + b - 1 \bigr)$, we make use of the following property of the Inverse Wishart distribution:
\[
\Sigma^{-1} \mid X^n \stackrel{d}{=} \sum_{j=1}^{n + p + b - 1} W_j W_j^\top,
\]
where $W_j \mid X^n \stackrel{\mathrm{i.i.d.}}{\sim} N\bigl( 0,\ (G + n\widehat{\Sigma})^{-1} \bigr)$. Hence, introducing $n_p \stackrel{\mathrm{def}}{=} n + p + b - 1$ and
\[
\Sigma_{n,p} = \frac{G + n\widehat{\Sigma}}{n_p},
\]
we represent
\[
\Sigma^{-1} \mid X^n \stackrel{d}{=} \Sigma_{n,p}^{-1/2} \Bigl( \frac{1}{n_p} \sum_{j=1}^{n_p} Z_j Z_j^\top \Bigr) \Sigma_{n,p}^{-1/2} = \Sigma_{n,p}^{-1/2} (I_p + E_{n,p}) \Sigma_{n,p}^{-1/2},
\]
where $Z_j \mid X^n \stackrel{\mathrm{i.i.d.}}{\sim} N(0, I_p)$ and we also defined
\[
E_{n,p} \stackrel{\mathrm{def}}{=} \frac{1}{n_p} \sum_{j=1}^{n_p} Z_j Z_j^\top - I_p.
\tag{4.10}
\]
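A quick sanity check of this distributional identity (an illustrative simulation with arbitrary small parameters): the inverse of an Inverse Wishart draw with scale $S$ and $\nu$ degrees of freedom matches, in distribution, a sum of $\nu$ outer products of $N(0, S^{-1})$ vectors. The code compares the two routes through the Monte Carlo mean of $\Sigma^{-1}$, which should be close to $\nu S^{-1}$ in both cases (assuming scipy's parametrization matches the convention used here).

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(5)
p, nu = 4, 50                            # small dimension and degrees of freedom for the check
S = np.eye(p) + 0.3 * np.ones((p, p))    # assumed posterior scale, playing the role of G + n*Sigma_hat

# Route 1: invert Inverse Wishart draws.
direct = np.mean([np.linalg.inv(Sig)
                  for Sig in invwishart(df=nu, scale=S).rvs(20000, random_state=rng)], axis=0)

# Route 2: sum of nu outer products of N(0, S^{-1}) vectors.
S_inv = np.linalg.inv(S)
W = rng.multivariate_normal(np.zeros(p), S_inv, size=(20000, nu))
represented = np.mean(np.einsum('kji,kjl->kil', W, W), axis=0)

print(np.max(np.abs(direct - represented)))   # both Monte Carlo means should be close to nu * S^{-1}
```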

We may think that in the posterior world all randomness comes from $E_{n,p}$. Moreover, due to Theorem A.1, (i), there is a random set $\Upsilon$ such that on this set
\[
\|E_{n,p}\|_\infty \lesssim \sqrt{\frac{\log(n_p) + p}{n_p}} \le \sqrt{\frac{\log(n) + p}{n}}
\tag{4.11}
\]
and its posterior measure satisfies
\[
\Pi\bigl( \Upsilon \mid X^n \bigr) \ge 1 - \frac{1}{n}.
\tag{4.12}
\]
Lemma 4.1. The following decomposition holds:
\[
\mathrm{Tr}[\Phi(\Sigma - \widehat{\Sigma})] = -\mathrm{Tr}\bigl( \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} \bigr) + R,
\]
and the remainder $R$ satisfies, on the random set $\Upsilon$,
\[
\sqrt{n}\, |R| \lesssim \widehat{\Delta}_3 \stackrel{\mathrm{def}}{=} \frac{\|\Phi\|_1}{\sqrt{n}} \cdot \bigl\{ (\log(n) + p)\|\widehat{\Sigma}\|_\infty + \|G\|_\infty \bigr\}.
\]

Proof. Define $R_{n,p}$ by
\[
R_{n,p} \stackrel{\mathrm{def}}{=} (I_p + E_{n,p})^{-1} - I_p + E_{n,p}.
\]
Its spectral norm can be bounded as
\[
\|R_{n,p}\|_\infty = \Bigl\| \sum_{s=2}^\infty (-E_{n,p})^s \Bigr\|_\infty \le \sum_{s=2}^\infty \|E_{n,p}\|_\infty^s = \frac{\|E_{n,p}\|_\infty^2}{1 - \|E_{n,p}\|_\infty} \lesssim \|E_{n,p}\|_\infty^2.
\]
So
\[
\Sigma = \Sigma_{n,p}^{1/2} (E_{n,p} + I_p)^{-1} \Sigma_{n,p}^{1/2} = \Sigma_{n,p}^{1/2} (I_p - E_{n,p} + R_{n,p}) \Sigma_{n,p}^{1/2}.
\]
Therefore for $\Sigma - \widehat{\Sigma}$ we have
\[
\Sigma - \widehat{\Sigma} = \Sigma_{n,p}^{1/2} (I_p - E_{n,p} + R_{n,p}) \Sigma_{n,p}^{1/2} - \widehat{\Sigma}
= -\Sigma_{n,p}^{1/2} E_{n,p} \Sigma_{n,p}^{1/2} + \Sigma_{n,p}^{1/2} R_{n,p} \Sigma_{n,p}^{1/2} + \Sigma_{n,p} - \widehat{\Sigma}.
\]
From $\Sigma_{n,p}^{1/2} E_{n,p} \Sigma_{n,p}^{1/2}$ we pass to $\widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2}$:
\[
\Sigma - \widehat{\Sigma} = -\widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} + \bigl( \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} - \Sigma_{n,p}^{1/2} E_{n,p} \Sigma_{n,p}^{1/2} \bigr) + \Sigma_{n,p}^{1/2} R_{n,p} \Sigma_{n,p}^{1/2} + \Sigma_{n,p} - \widehat{\Sigma}
= -\widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} + R_1 + R_2 + R_3,
\]
where we introduce the remainder terms
\[
R_1 \stackrel{\mathrm{def}}{=} \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} - \Sigma_{n,p}^{1/2} E_{n,p} \Sigma_{n,p}^{1/2},
\qquad
R_2 \stackrel{\mathrm{def}}{=} \Sigma_{n,p}^{1/2} R_{n,p} \Sigma_{n,p}^{1/2},
\qquad
R_3 \stackrel{\mathrm{def}}{=} \Sigma_{n,p} - \widehat{\Sigma}.
\]
They can be bounded as follows:
\[
\|R_1\|_\infty \le \|E_{n,p}\|_\infty\, \|\widehat{\Sigma} - \Sigma_{n,p}\|_\infty^{1/2} \bigl( \|\Sigma_{n,p}\|_\infty^{1/2} + \|\widehat{\Sigma}\|_\infty^{1/2} \bigr),
\]
\[
\|R_2\|_\infty \le \|R_{n,p}\|_\infty \|\Sigma_{n,p}\|_\infty \lesssim \|E_{n,p}\|_\infty^2 \|\Sigma_{n,p}\|_\infty,
\qquad
\|R_3\|_\infty \le \frac{\|G\|_\infty + (n_p - n)\|\widehat{\Sigma}\|_\infty}{n_p}.
\]
Hence, omitting higher order terms, on $\Upsilon$ we have
\[
\|R_1\|_\infty \lesssim \|\widehat{\Sigma}\|_\infty^{1/2}\, \frac{\sqrt{\log(n) + p}}{n}\, \bigl( \|G\|_\infty + p\|\widehat{\Sigma}\|_\infty \bigr)^{1/2},
\qquad
\|R_2\|_\infty \lesssim \|\widehat{\Sigma}\|_\infty\, \frac{\log(n) + p}{n},
\qquad
\|R_3\|_\infty \lesssim \frac{\|G\|_\infty + p\, \|\widehat{\Sigma}\|_\infty}{n}.
\]
Therefore,
\[
\mathrm{Tr}[\Phi(\Sigma - \widehat{\Sigma})] = -\mathrm{Tr}\bigl( \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} \bigr) + \mathcal{R}_1 + \mathcal{R}_2 + \mathcal{R}_3,
\]
where $\mathcal{R}_i = \mathrm{Tr}(\Phi R_i)$ and $|\mathcal{R}_i| \le \|\Phi\|_1 \|R_i\|_\infty$ for all $i = 1, 2, 3$. Putting all the bounds together, we obtain the desired statement.

The structure of $E_{n,p}$ allows us to apply a Gaussian approximation and derive the following lemma.

Lemma 4.2. Let $\widehat{\zeta} \sim N\bigl( 0,\ \frac{2n}{n_p} \|\widehat{\Sigma}^{1/2} \Phi\, \widehat{\Sigma}^{1/2}\|_2^2 \bigr)$. Then
\[
\sup_{x \in \mathbb{R}} \Bigl| \Pi^W\bigl( -\sqrt{n}\, \mathrm{Tr}\bigl( \widehat{\Sigma}^{1/2} \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \bigr) \le x \mid X^n \bigr) - \Pi\bigl( \widehat{\zeta} \le x \mid X^n \bigr) \Bigr| \lesssim \Delta_4 \stackrel{\mathrm{def}}{=} \frac{\mathrm{r}_3(\Phi)}{\sqrt{n}}
\]
with probability one.

Proof. Recalling the definition of $E_{n,p}$ from (4.10), we can represent
\[
-\sqrt{n}\, \mathrm{Tr}\bigl( \widehat{\Sigma}^{1/2} \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \bigr) = \frac{1}{\sqrt{n_p}} \sum_{j=1}^{n_p} (\eta_j - \mathbb{E}\eta_j),
\]
where
\[
\eta_j \stackrel{\mathrm{def}}{=} -\sqrt{\frac{n}{n_p}}\, Z_j^\top \widehat{\Sigma}^{1/2} \Phi\, \widehat{\Sigma}^{1/2} Z_j
\]
and $Z_j \stackrel{\mathrm{i.i.d.}}{\sim} N(0, I_p)$. It is easy to verify that
\[
\mathrm{Var}(\eta_j) = \frac{2n}{n_p} \|\widehat{\Sigma}^{1/2} \Phi\, \widehat{\Sigma}^{1/2}\|_2^2.
\]

2n b 1/2 np kΣ

2

2

Lemma 4.3. Let

ζb ∼ N

and

  2n b 1/2 b 1/2 2 0, kΣ Φ Σ k2 np

  1/2 1/2 ζ ∼ N 0, 2kΣ ∗ Φ Σ ∗ k22 . Then, with probability one   b5 , sup Π ζb ≤ x X n − P(ζ ≤ x) . ∆

x∈

R

∗ b b − Σ ∗ k∞ kΦ(Σ + Σ )Φk1 + p . b5 def = kΣ ∆ 1/2 1/2 n 2 kΣ ∗ ΦΣ ∗ k22

Proof. For shortness define

def

σ12 =

def

2n b 1/2 b 1/2 2 kΣ Φ Σ k2 , np

σ22 = 2kΣ ∗

1/2

Φ Σ∗

1/2 2 k2 .

and let Pσ corresponds to one-dimensional normal distribution with zero mean and variance σ 2 . By definition of total variation distance ( dT V ) distance, we have   sup Π ζb ≤ x X n − P(ζ ≤ x) ≤ dT V (Pσ1 , Pσ2 )

x∈

R

in the posterior world for each X n . Due to the Pinsker’s lemma (see, e.g. [29], Lemma 2.5 (i)), p dKL (Pσ1 , Pσ2 )/2, p dT V (Pσ1 , Pσ2 ) = dT V (Pσ2 , Pσ1 ) ≤ dKL (Pσ2 , Pσ1 )/2, dT V (Pσ1 , Pσ2 ) ≤

Igor Silin /BvM for functionals and projectors of covariance matrix

21

where dKL is Kullback-Leibler divergence. Therefore, r dKL (Pσ1 , Pσ2 ) ∧ dKL (Pσ2 , Pσ1 ) . dT V (Pσ1 , Pσ2 ) ≤ 2 Simple calculations show that   σ2 σ2 − σ2 dKL (Pσ1 , Pσ2 ) = log + 1 2 2, σ1 2σ2   2 σ1 σ − σ2 dKL (Pσ2 , Pσ1 ) = log + 2 2 1, σ2 2σ1 dKL (Pσ1 , Pσ2 ) + dKL (Pσ2 , Pσ1 ) dKL (Pσ1 , Pσ2 ) ∧ dKL (Pσ2 , Pσ1 ) ≤ 2 (σ22 − σ12 )2 = , 4 σ12 σ22 which implies dT V (Pσ1 , Pσ2 ) ≤

|σ22 − σ12 | . 2 σ1 σ2

Clearly, one has |σ22 − σ12 | np − n 1/2 1/2 kΣ ∗ ΦΣ ∗ k22 ≤ 2 np n b 1/2 b 1/2 2 ∗ 1/2 ∗ 1/2 2 ΦΣ k2 . + kΣ ΦΣ k2 − kΣ np

For the second term we have h i 1/2 kΣ b ΦΣ b 1/2 k2 − kΣ ∗ 1/2 ΦΣ ∗ 1/2 k2 = Tr (Σ b − Σ ∗ )Φ(Σ b + Σ ∗ )Φ 2 2

b − Σ ∗ k∞ kΦ(Σ b + Σ ∗ )Φk1 . ≤ kΣ

Then, assuming the right-hand side is small enough, we get

∗ b |σ 2 − σ 2 | |σ22 − σ12 | b − Σ ∗ k∞ kΦ(Σ + Σ )Φk1 + p , . 2 2 1 . kΣ 1/2 1/2 2 σ1 σ2 2 σ2 n 2 kΣ ∗ ΦΣ ∗ k2 2

that concludes the proof of the lemma.

b3 , ∆ b5 from Finally, we unite all the obtained bounds. Let us notice that for ∆

Lemmas 4.1, 4.3 due to (2.1) holds

kΦk1 b3 . ∆3 def = √ · {(log(n) + p)kΣ ∗ k∞ + kGk∞ } , ∆ n

p kΣ ∗ k∞ · kΦΣ ∗ Φk1 b5 . ∆5 def + = δbn ∆ ∗ 1/2 ∗ 1/2 2 n kΣ ΦΣ k2

(4.13) (4.14)

Igor Silin /BvM for functionals and projectors of covariance matrix

22

with probability $1 - 1/n$. So, for $\Delta_1, \Delta_2, \Delta_3$ from (4.6), (4.7), (4.13), respectively, we write
\[
\Pi^W\bigl( \sqrt{n}\, \bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr) \le x \mid X^n \bigr)
\le \Pi^W\bigl( -\sqrt{n}\, \mathrm{Tr}\bigl( \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} \bigr) \le x + \Delta_1 + \Delta_2 + \Delta_3 \mid X^n \bigr)
+ \Pi^W\bigl( \sqrt{n}\, \varepsilon(\Sigma, \Sigma^*) \le -\Delta_1 \mid X^n \bigr)
+ \Pi^W\bigl( -\sqrt{n}\, \varepsilon(\widehat{\Sigma}, \Sigma^*) \le -\Delta_2 \mid X^n \bigr)
+ \Pi^W\bigl( \sqrt{n}\, R \le -\Delta_3 \mid X^n \bigr).
\]
The second term on the right-hand side is at most $1/n$ with probability $1 - 1/n$ due to (4.9). The third term is zero with probability $1 - 1/n$ according to (4.8). Lemma 4.1 and (4.13) imply that the fourth term does not exceed $1/n$ with probability $1 - 1/n$. Thus, with probability $1 - 3/n$ we have
\[
\Pi^W\bigl( \sqrt{n}\, \bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr) \le x \mid X^n \bigr)
\le \Pi^W\bigl( -\sqrt{n}\, \mathrm{Tr}\bigl( \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} \bigr) \le x + \Delta_1 + \Delta_2 + \Delta_3 \mid X^n \bigr) + \frac{2}{n}.
\]
Subtracting $P(\zeta \le x)$ with $\zeta \sim N(0,\ 2\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2^2)$ from both sides and taking the supremum over $x \in \mathbb{R}$, we obtain
\[
\sup_{x \in \mathbb{R}} \Bigl[ \Pi^W\bigl( \sqrt{n}\, \bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr) \le x \mid X^n \bigr) - P(\zeta \le x) \Bigr]
\le \sup_{x \in \mathbb{R}} \Bigl[ \Pi^W\bigl( -\sqrt{n}\, \mathrm{Tr}\bigl( \Phi\, \widehat{\Sigma}^{1/2} E_{n,p} \widehat{\Sigma}^{1/2} \bigr) \le x + \Delta_1 + \Delta_2 + \Delta_3 \mid X^n \bigr) - P(\zeta \le x + \Delta_1 + \Delta_2 + \Delta_3) \Bigr]
+ \sup_{x \in \mathbb{R}} \bigl[ P(\zeta \le x + \Delta_1 + \Delta_2 + \Delta_3) - P(\zeta \le x) \bigr] + \frac{2}{n}.
\]
The first term on the right-hand side is bounded by $\Delta_4 + \Delta_5$ due to Lemma 4.2, Lemma 4.3 and (4.14). The second term is at most $\frac{\Delta_1 + \Delta_2 + \Delta_3}{\sqrt{2\pi}\, \sqrt{2}\, \|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2}$.


Therefore,
\[
\sup_{x \in \mathbb{R}} \Bigl| \Pi^W\bigl( \sqrt{n}\, \bigl( \phi(\Sigma) - \phi(\widehat{\Sigma}) \bigr) \le x \mid X^n \bigr) - P(\zeta \le x) \Bigr|
\lesssim \frac{\Delta_1 + \Delta_2 + \Delta_3}{\|\Sigma^{*1/2} \Phi \Sigma^{*1/2}\|_2} + \Delta_4 + \Delta_5 + \frac{1}{n}
\]

with probability $1 - 3/n$, which coincides with the claim of the theorem.

Appendix A: Auxiliary results

The following theorem gathers several crucial results on the concentration of the sample covariance.

Theorem A.1. Let $X_1, \dots, X_n$ be i.i.d. zero-mean random vectors in $\mathbb{R}^p$. Denote the true covariance matrix as $\Sigma^* \stackrel{\mathrm{def}}{=} \mathbb{E}\bigl[ X_j X_j^\top \bigr]$ and the sample covariance as $\widehat{\Sigma} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{j=1}^n X_j X_j^\top$. Suppose the data are obtained from:

(i) a Gaussian distribution $N(0, \Sigma^*)$. In this case, define $\widehat{\delta}_n$ as
\[
\widehat{\delta}_n \asymp \sqrt{\frac{\mathrm{er}(\Sigma^*)}{n}} \vee \sqrt{\frac{\log(n)}{n}};
\]
(ii) a sub-Gaussian distribution. In this case, define $\widehat{\delta}_n$ as
\[
\widehat{\delta}_n \asymp \sqrt{\frac{p}{n}} \vee \sqrt{\frac{\log(n)}{n}};
\]
(iii) a distribution supported in some centered Euclidean ball of radius $R$. In this case, define $\widehat{\delta}_n$ as
\[
\widehat{\delta}_n \asymp \frac{R}{\sqrt{\|\Sigma^*\|}} \sqrt{\frac{\log(n)}{n}};
\]
(iv) a log-concave probability measure. In this case, define $\widehat{\delta}_n$ as
\[
\widehat{\delta}_n \asymp \sqrt{\frac{p \log^6(n)}{n}}.
\]
Then in all the cases above the following concentration result for $\widehat{\Sigma}$ holds with the corresponding $\widehat{\delta}_n$:
\[
\|\widehat{\Sigma} - \Sigma^*\|_\infty \le \widehat{\delta}_n \|\Sigma^*\|_\infty
\]
with probability at least $1 - \frac{1}{n}$.


Proof. (i) See [19], Corollary 2. (ii) This is a well-known simple result presented in a range of papers and lecture notes; see, e.g., [25], Theorem 4.6. (iii) See [31], Corollary 5.52. Usually the radius $R$ is taken such that $\frac{R}{\sqrt{\|\Sigma^*\|}} \asymp \frac{\sqrt{\mathrm{Tr}(\Sigma^*)}}{\sqrt{\|\Sigma^*\|}} = \sqrt{\mathrm{er}(\Sigma^*)}$. (iv) See [1], Theorem 4.1.

Appendix B: Auxiliary proofs

B.1. Proof of Lemma 2.1

Using the definitions of $n_p$, $\Sigma_{n,p}$, $E_{n,p}$ from the beginning of the proof of Theorem 3.2, we write
\[
\Sigma - \Sigma_{n,p} = \Sigma_{n,p}^{1/2} \bigl( (I_p + E_{n,p})^{-1} - I_p \bigr) \Sigma_{n,p}^{1/2}.
\]
Since
\[
\| (I_p + E_{n,p})^{-1} - I_p \|_\infty = \Bigl\| \sum_{s=1}^{+\infty} (-1)^s E_{n,p}^s \Bigr\|_\infty \le \frac{\|E_{n,p}\|_\infty}{1 - \|E_{n,p}\|_\infty} \lesssim \|E_{n,p}\|_\infty,
\]
we obtain
\[
\|\Sigma - \Sigma_{n,p}\|_\infty \lesssim \|\Sigma_{n,p}\|_\infty \|E_{n,p}\|_\infty.
\]
Moreover,
\[
\|\Sigma_{n,p} - \widehat{\Sigma}\|_\infty \le \frac{\|G\|_\infty + (n_p - n)\|\widehat{\Sigma}\|_\infty}{n_p}.
\]
Besides, the covariance concentration condition (2.1) provides
\[
\|\widehat{\Sigma} - \Sigma^*\|_\infty \le \widehat{\delta}_n \|\Sigma^*\|_\infty
\]
with probability $1 - 1/n$. Applying the triangle inequality, omitting higher-order terms and recalling (4.11) and (4.12), we deduce that the posterior measure of the event
\[
\Bigl\{ \|\Sigma - \Sigma^*\|_\infty \lesssim \|\Sigma^*\|_\infty \Bigl( \sqrt{\frac{\log(n) + p}{n}} + \widehat{\delta}_n + \frac{\|G\|_\infty}{n \|\Sigma^*\|_\infty} \Bigr) \Bigr\}
\]
is at least $1 - 1/n$ with $X^n$-probability $1 - 1/n$, which concludes the proof of the lemma.

B.2. Proof of Lemma 2.2

The log-density of the Inverse Wishart prior is
\[
\log \Pi^W(\Sigma) = -\frac{2p + b}{2} \log \det(\Sigma) - \frac{1}{2} \mathrm{Tr}(G \Sigma^{-1}) + C,
\]
where $C$ is some universal constant related to normalization. For arbitrary $\Sigma$ we have
\[
\Bigl| \frac{\Pi^W(\Sigma)}{\Pi^W(\Sigma^*)} - 1 \Bigr|
= \Bigl| \exp\bigl( \log \Pi^W(\Sigma) - \log \Pi^W(\Sigma^*) \bigr) - 1 \Bigr|
= \Bigl| \exp\Bigl( -\frac{2p + b}{2} \bigl( \log\det(\Sigma) - \log\det(\Sigma^*) \bigr) - \frac{1}{2} \mathrm{Tr}\bigl[ G\bigl( \Sigma^{-1} - \Sigma^{*-1} \bigr) \bigr] \Bigr) - 1 \Bigr|
\lesssim \exp\Bigl\{ p\, \bigl| \log\det(\Sigma) - \log\det(\Sigma^*) \bigr| + \bigl| \mathrm{Tr}\bigl[ G\bigl( \Sigma^{-1} - \Sigma^{*-1} \bigr) \bigr] \bigr| \Bigr\} - 1.
\]
Notice that
\[
|\log\det(\Sigma) - \log\det(\Sigma^*)| = \bigl| \log\det\bigl( \Sigma^{*-1/2} \Sigma\, \Sigma^{*-1/2} \bigr) \bigr| = |\log\det(I_p + \Delta)|,
\]
where $\Delta \stackrel{\mathrm{def}}{=} \Omega^{*1/2} (\Sigma - \Sigma^*) \Omega^{*1/2}$. Then
\[
\log\det(I_p + \Delta) = \sum_{j=1}^p \log\bigl( \lambda_j(\Delta) + 1 \bigr) = -\sum_{j=1}^p \sum_{s=1}^\infty \frac{(-\lambda_j(\Delta))^s}{s} = -\sum_{s=1}^\infty \frac{\mathrm{Tr}\bigl( (-\Delta)^s \bigr)}{s}.
\]
Hence, we get the bound
\[
|\log\det(I_p + \Delta)| \le \sum_{s=1}^\infty \|\Omega^*\|_1^s\, \|\Sigma - \Sigma^*\|_\infty^s = \frac{\|\Omega^*\|_1 \|\Sigma - \Sigma^*\|_\infty}{1 - \|\Omega^*\|_1 \|\Sigma - \Sigma^*\|_\infty}.
\]
Assuming $\delta \|\Omega^*\|_1$ is small enough, we get $|\log\det(I_p + \Delta)| \lesssim \delta \|\Omega^*\|_1$ whenever $\Sigma \in B(\delta)$.

Besides, for $\Sigma \in B(\delta)$ we have
\[
\|\Sigma^{-1} - \Sigma^{*-1}\|_\infty = \|\Sigma^{*-1} (\Sigma - \Sigma^*) \Sigma^{-1}\|_\infty
\le \delta \|\Sigma^{*-1}\|_\infty \|\Sigma^{*-1}\|_\infty + \delta \|\Sigma^{*-1}\|_\infty \|\Sigma^{-1} - \Sigma^{*-1}\|_\infty.
\]
Therefore,
\[
\|\Sigma^{-1} - \Sigma^{*-1}\|_\infty \le \frac{\delta \|\Omega^*\|_\infty^2}{1 - \delta \|\Omega^*\|_\infty} \lesssim \delta \|\Omega^*\|_\infty^2
\]
whenever $\delta \|\Omega^*\|_\infty$ is small enough. Hence,
\[
\bigl| \mathrm{Tr}\bigl[ G\bigl( \Sigma^{-1} - \Sigma^{*-1} \bigr) \bigr] \bigr| \lesssim \delta \|G\|_1 \|\Omega^*\|_\infty^2.
\]
Finally,
\[
\rho^W(\delta) = \sup_{\Sigma \in B(\delta)} \Bigl| \frac{\Pi^W(\Sigma)}{\Pi^W(\Sigma^*)} - 1 \Bigr| \lesssim \delta \bigl( p \|\Omega^*\|_1 + \|G\|_1 \|\Omega^*\|_\infty^2 \bigr).
\]

W Π (Σ)  ρW (δ) = sup W ∗ − 1 . δ p kΩ ∗ k1 + kGk1 kΩ ∗ k2∞ . Σ∈Bδ Π (Σ )

References

[1] Adamczak, R., Litvak, A. E., Pajor, A. and Tomczak-Jaegermann, N. (2010). Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. J. Amer. Math. Soc., 23, 535–561.
[2] Berthet, Q. and Rigollet, P. (2013). Optimal detection of sparse principal components in high dimension. Ann. Statist., 41, 4, 1780–1815.
[3] Bickel, P. J. and Kleijn, B. J. K. (2012). The semiparametric Bernstein – von Mises theorem. Ann. Statist., 40, 206–237.
[4] Birnbaum, A., Johnstone, I. M., Nadler, B. and Paul, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist., 41, 3, 1055–1084.
[5] Cai, T. T., Ma, Z. and Wu, Y. (2013). Sparse PCA: optimal rates and adaptive estimation. Ann. Statist., 41, 6, 3074–3110.
[6] Castillo, I. and Nickl, R. (2013). Nonparametric Bernstein – von Mises theorems in Gaussian white noise. Ann. Statist., 41, 4, 1999–2028.
[7] Castillo, I. and Rousseau, J. (2015). A Bernstein – von Mises theorem for smooth functionals in semiparametric models. Ann. Statist., 43, 6, 2353–2383.
[8] El Karoui, N. (2007). Tracy – Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices. Ann. Probab., 35, 2, 663–714.
[9] Fan, J., Rigollet, P. and Wang, W. (2015). Estimation of functionals of sparse covariance matrices. Ann. Statist., 43, 6, 2706–2737.
[10] Holtz, M. (2010). Sparse grid quadrature in high dimensions with applications in finance and insurance. Lecture Notes in Computational Science and Engineering, 77, Springer, Berlin.
[11] Gao, C. and Zhou, H. H. (2015). Rate-optimal posterior contraction for sparse PCA. Ann. Statist., 43, 2, 785–818.
[12] Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal., 74, 1, 49–68.
[13] Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press.
[14] Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc., 104, 682–693.
[15] Johnstone, I. M. (2010). High dimensional statistical inference and random matrices. In International Congress of Mathematicians, I, 307–333, Eur. Math. Soc., Zurich.
[16] Johnstone, I. M. (2007). High dimensional Bernstein – von Mises: simple examples. Inst. Math. Stat. Collect., 6, 87–98.
[17] Kato, T. (1995). Perturbation Theory for Linear Operators. 132, Springer.
[18] Koltchinskii, V. and Lounici, K. (2016). Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance. Ann. Inst. H. Poincaré Probab. Statist., 52, 4, 1976–2013.
[19] Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23, 1, 110–133.
[20] Koltchinskii, V. and Lounici, K. (2017). Normal approximation and concentration of spectral projectors of sample covariance. Ann. Statist., 45, 1, 121–157.
[21] Le Cam, L. and Yang, G. L. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer, New York.
[22] Marchenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.), 72 (114), 4, 507–536.
[23] Naumov, A., Spokoiny, V. and Ulyanov, V. (2017). Bootstrap confidence sets for spectral projectors of sample covariance. ArXiv:1703.00871.
[24] Panov, M. and Spokoiny, V. (2015). Finite sample Bernstein – von Mises theorem for semiparametric problems. Bayesian Analysis, 10, 3, 665–710.
[25] Rigollet, P. (2015). Lecture notes on high-dimensional statistics.
[26] Rivoirard, V. and Rousseau, J. (2012). Bernstein – von Mises theorem for linear functionals of the density. Ann. Statist., 40, 3, 1489–1523.
[27] Silin, I. and Spokoiny, V. (2017). Bayesian inference for spectral projectors of covariance matrix. ArXiv:1711.11532.
[28] Tropp, J. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12, 4, 389–434.
[29] Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer.
[30] Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3, Cambridge University Press, Cambridge.
[31] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 210–268, Cambridge University Press, Cambridge.
[32] Zhou, H. H. and Gao, C. (2016). Bernstein – von Mises theorems for functionals of the covariance matrix. Electron. J. Statist., 10, 2, 1751–1806.
