arXiv:1311.5000v2 [math.ST] 22 Nov 2013
The Annals of Statistics 2013, Vol. 41, No. 5, 2572–2607
DOI: 10.1214/13-AOS1154
© Institute of Mathematical Statistics, 2013

CONVERGENCE RATES OF EIGENVECTOR EMPIRICAL SPECTRAL DISTRIBUTION OF LARGE DIMENSIONAL SAMPLE COVARIANCE MATRIX

By Ningning Xia¹, Yingli Qin² and Zhidong Bai³

Northeast Normal University and National University of Singapore, University of Waterloo, and Northeast Normal University and National University of Singapore

The eigenvector Empirical Spectral Distribution (VESD) is adopted to investigate the limiting behavior of eigenvectors and eigenvalues of covariance matrices. In this paper, we show that the Kolmogorov distance between the expected VESD of the sample covariance matrix and the Marčenko–Pastur distribution function is of order $O(N^{-1/2})$. Given that the ratio of the data dimension $n$ to the sample size $N$ is bounded between 0 and 1, this convergence rate is established under a finite 10th moment condition on the underlying distribution. It is also shown that, for any fixed $\eta > 0$, the convergence rates of the VESD are $O(N^{-1/4})$ in probability and $O(N^{-1/4+\eta})$ almost surely, requiring a finite 8th moment of the underlying distribution.
1. Introduction and main results. Let $X_i = (X_{1i}, X_{2i}, \ldots, X_{ni})^T$ and $X = (X_1, \ldots, X_N)$ be an $n \times N$ matrix of i.i.d. (independent and identically distributed) complex random variables with mean 0 and variance 1. We consider a class of sample covariance matrices
$$S_n = \frac{1}{N}\sum_{k=1}^{N} X_k X_k^* = \frac{1}{N} XX^*,$$
Received February 2013; revised June 2013.
¹Supported by the Fundamental Research Funds for the Central Universities 11SSXT131.
²Supported by University of Waterloo Start-up Grant.
³Supported by NSF China Grant 11171057, PCSIRT Fundamental Research Funds for the Central Universities, and NUS Grant R-155-000-141-112.
AMS 2000 subject classifications. Primary 15A52, 60F15, 62E20; secondary 60F17, 62H99.
Key words and phrases. Eigenvector empirical spectral distribution, empirical spectral distribution, Marčenko–Pastur distribution, sample covariance matrix, Stieltjes transform.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2013, Vol. 41, No. 5, 2572–2607. This reprint differs from the original in pagination and typographic detail.
where $X^*$ denotes the conjugate transpose of the data matrix $X$. The Empirical Spectral Distribution (ESD) $F^{S_n}(x)$ of $S_n$ is then defined as
$$F^{S_n}(x) = \frac{1}{n}\sum_{i=1}^{n} I(\lambda_i \le x), \tag{1.1}$$
where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of $S_n$ in ascending order and $I(\cdot)$ is the conventional indicator function. Marčenko and Pastur in [16] proved that, with probability 1, $F^{S_n}(x)$ converges weakly to the standard Marčenko–Pastur distribution $F_y(x)$ with density function
$$p_y(x) = \frac{dF_y(x)}{dx} = \frac{1}{2\pi x y}\sqrt{(x-a)(b-x)}\, I(a \le x \le b), \tag{1.2}$$
where $a = (1-\sqrt{y})^2$ and $b = (1+\sqrt{y})^2$. Here the positive constant $y$ is the limit of the dimension to sample size ratio as both $n$ and $N$ tend to infinity. In applications of asymptotic theorems in the spectral analysis of large dimensional random matrices, one of the important problems is the convergence rate of the ESD. The Kolmogorov distance between the expected ESD of $S_n$ and the Marčenko–Pastur distribution $F_y(x)$ is defined as
$$\Delta = \|EF^{S_n} - F_y\| = \sup_x |EF^{S_n}(x) - F_y(x)|,$$
as well as the distance between the two distributions $F^{S_n}(x)$ and $F_y(x)$,
$$\Delta_p = \|F^{S_n} - F_y\| = \sup_x |F^{S_n}(x) - F_y(x)|.$$
Notice that, for any constant $C > 0$,
$$P(\Delta_p \ge C) = P\Big(\sup_x |F^{S_n}(x) - F_y(x)| \ge C\Big) \le C^{-1} E\Delta_p.$$
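To make the quantities above concrete, the following is a minimal Monte Carlo sketch (an illustration, not part of the paper's argument): it samples $S_n = XX^*/N$ with i.i.d. standard real entries and evaluates the Kolmogorov-type gap between the ESD $F^{S_n}$ and $F_y$ on a grid; the sizes, seed and grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 400, 800                          # y_n = n / N = 0.5
X = rng.standard_normal((n, N))
eigs = np.linalg.eigvalsh(X @ X.T / N)   # eigenvalues of S_n

y = n / N
a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2

def mp_cdf(x, num=20000):
    """F_y(x): numerically integrate the Marcenko-Pastur density (1.2)."""
    t = np.linspace(a, b, num)
    dens = np.sqrt(np.maximum((t - a) * (b - t), 0)) / (2 * np.pi * y * t)
    cdf = np.cumsum(dens) * (t[1] - t[0])
    return np.interp(x, t, cdf, left=0.0, right=1.0)

xs = np.linspace(a, b, 200)
esd = (eigs[None, :] <= xs[:, None]).mean(axis=1)   # F^{S_n} on the grid
print(np.max(np.abs(esd - mp_cdf(xs))))             # approximates Delta_p
```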
Thus, $\Delta_p$ measures the rate of convergence in probability. Bai in [2, 3] first tackled the problem of convergence rate and established three Berry–Esseen type inequalities for the difference of two distributions in terms of their Stieltjes transforms. Götze and Tikhomirov in [11] further improved the Berry–Esseen type inequality and showed that the convergence rate of $F^{S_n}(x)$ is $O(N^{-1/2})$ in probability under a finite 8th moment condition. More recently, a sharper bound was obtained by Pillai and Yin in [18] under a stronger condition, namely the sub-exponential decay assumption: the difference between the eigenvalues of $S_n$ and the Marčenko–Pastur distribution is of order $O(N^{-1}(\log N)^{O(\log\log N)})$ in probability. In the literature, research on the limiting properties of eigenvectors of large dimensional sample covariance matrices is much less developed than that of eigenvalues, due to the cumbersome formulation of the eigenvectors. Some great achievements have been made in proving the properties of eigenvectors
for large dimensional sample covariance matrices, such as [4, 19–22], and for Wigner matrices, such as [10, 13, 25]. Nevertheless, the eigenvectors of large sample covariance matrices play an important role in high-dimensional statistical analysis. In particular, due to the increasing availability of high-dimensional data, principal component analysis (PCA) has been favorably recognized as a powerful technique to reduce dimensionality. The eigenvectors corresponding to the leading eigenvalues are the directions of the principal components. Johnstone [12] proposed the spiked eigenvalue model to test the existence of principal components. Paul [17] discussed the length of the eigenvector corresponding to the spiked eigenvalue. In PCA, the eigenvectors $(\nu_1^0, \ldots, \nu_n^0)$ of the population covariance matrix $\Sigma$ determine the directions onto which we project the observed data, and the corresponding eigenvalues $(\lambda_1^0, \ldots, \lambda_n^0)$ determine the proportion of total variability loaded on each direction of projection. In practice, the (sample) eigenvalues $(\lambda_1, \ldots, \lambda_n)$ and eigenvectors $(\nu_1, \ldots, \nu_n)$ of the sample covariance matrix $S_n$ are used in PCA. In [1], Anderson showed the following asymptotic distribution for the sample eigenvectors $\nu_1, \ldots, \nu_n$ when the observations are from a multivariate normal distribution with covariance matrix $\Sigma$ having distinct eigenvalues:
$$\sqrt{N}(\nu_i - \nu_i^0) \xrightarrow{d} N_n(0, D_i), \qquad \text{where } D_i = \lambda_i^0 \sum_{k=1,\,k\ne i}^{n} \frac{\lambda_k^0}{(\lambda_k^0 - \lambda_i^0)^2}\, \nu_k^0 \nu_k^{0T}.$$
However, this is a large-sample result for fixed and low dimension $n$. In particular, if $\Sigma = \sigma^2 I_n$, then the eigenmatrix (the matrix of eigenvectors) should be asymptotically isotropic when the sample size is large; that is, the eigenmatrix should be asymptotically Haar distributed, under some minor moment conditions. However, when the dimension is large (increasing), the Haar property is not easy to formulate. Motivated by the orthogonal iteration method, [15] proposed an iterative thresholding method to estimate sparse principal subspaces (spanned by the leading eigenvectors of $\Sigma$) in a high-dimensional and spiked covariance matrix setting, and provided the convergence rates of the proposed estimators. By reducing the sparse PCA problem to a high-dimensional regression problem, [9] established the optimal rates of convergence for estimating the principal subspace with respect to a large collection of spiked covariance matrices. See the references therein for more literature on sparse PCA and spiked covariance matrices. To perform the test of existence of spiked eigenvalues, one has to investigate the null properties of the eigenmatrices, that is, when $\Sigma = \sigma^2 I_n$ (i.e.,
nonspiked). Then the eigenmatrix should be asymptotically isotropic when the sample size is large; that is, asymptotically Haar. However, when the dimension is large, the Haar property is not easy to formulate. Recent developments in random matrix theory can help us investigate the large-dimension and large-sample properties of eigenvectors. We will adopt the VESD, defined later in this paper, to characterize the asymptotic Haar property: if the eigenmatrix is Haar, then the process defined by the VESD tends to a Brownian bridge; conversely, if the process defined by the VESD tends to a Brownian bridge, then it indicates a similarity between the Haar distribution and that of the eigenmatrix. Therefore, studying the large-sample and large-dimensional behavior of the VESD can assist us in better examining the spiked covariance matrices assumed by [15] and [9], among many others. Let $U_n \Lambda_n U_n^*$ denote the spectral decomposition of $S_n$, where $\Lambda_n = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $U_n = (u_{ij})_{n\times n}$ is a unitary matrix consisting of the corresponding orthonormal eigenvectors of $S_n$. For each $n$, let $x_n \in \mathbb{C}^n$, $\|x_n\| = 1$, be nonrandom, and let $d_n = U_n^* x_n = (d_1, \ldots, d_n)^*$, where $\|x_n\|$ denotes the Euclidean norm of $x_n$. Define a stochastic process $X_n(t)$ by
$$X_n(t) = \sqrt{n/2}\sum_{j=1}^{[nt]}\Big(|d_j|^2 - \frac{1}{n}\Big),$$
where $[a]$ denotes the greatest integer $\le a$.
If $U_n$ is Haar distributed over the orthogonal matrices, then $d_n$ is uniformly distributed over the unit sphere in $\mathbb{R}^n$, and the limiting process of $X_n(t)$ is a Brownian bridge $B(t)$ as $n$ tends to infinity. In this paper, we use the behavior of $X_n(t)$ over all $x_n$ to reflect the uniformity of $U_n$. The process $X_n(t)$ is considerably important for understanding the behavior of the eigenvectors of $S_n$. Motivated by Silverstein's ideas in [19–22], we examine the limiting properties of $U_n$ through the stochastic process $X_n(t)$, and we say that $U_n$ is "asymptotically Haar distributed" if $X_n(t)$ converges to a Brownian bridge $B(t)$. In [21], it is shown that the weak convergence of $X_n(t)$ to a Brownian bridge $B(t)$ is equivalent to that of $X_n(F^{S_n}(x))$ to $B(F_y(x))$.
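The Brownian bridge limit under the Haar hypothesis can be checked numerically. The sketch below is an illustration under the stated assumption (not the paper's construction): for Haar orthogonal $U_n$ and fixed unit $x_n$, $d_n = U_n^* x_n$ is uniform on the unit sphere, so it can be sampled directly as a normalized Gaussian vector; the Monte Carlo variance of $X_n(1/2)$ should then be close to $\operatorname{Var}B(1/2) = 1/4$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 4000
vals = np.empty(reps)
for r in range(reps):
    g = rng.standard_normal(n)
    d2 = g ** 2 / np.sum(g ** 2)         # |d_j|^2 for d uniform on the sphere
    vals[r] = np.sqrt(n / 2) * (np.sum(d2[: n // 2]) - 0.5)   # X_n(1/2)
print(vals.var())                        # should be near 1/4 = Var B(1/2)
```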
We therefore consider transforming $X_n(t)$ into $X_n(F^{S_n}(x))$, where $F^{S_n}(x)$ is the ESD of $S_n$. We define the eigenvector Empirical Spectral Distribution (VESD) $H^{S_n}(x)$ of $S_n$ as follows:
$$H^{S_n}(x) = \sum_{i=1}^{n} |d_i|^2 I(\lambda_i \le x). \tag{1.3}$$
Between $H^{S_n}(x)$ in (1.3) and $F^{S_n}(x)$ in (1.1), we notice that there is no difference except the coefficient associated with each indicator function, so that
$$X_n(F^{S_n}(x)) = \sqrt{n/2}\,\big(H^{S_n}(x) - F^{S_n}(x)\big). \tag{1.4}$$
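As a quick illustration of (1.3) and (1.4) (a sketch with arbitrary sizes and $x_n = e_1$; not part of the proofs), one can compute $H^{S_n}$ and $F^{S_n}$ from an eigen-decomposition and evaluate $\sup_x |X_n(F^{S_n}(x))|$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 300, 600
X = rng.standard_normal((n, N))
lam, U = np.linalg.eigh(X @ X.T / N)   # ascending eigenvalues, orthonormal U
d2 = U[0] ** 2                         # |d_i|^2 = |e_1^T u_i|^2 for x_n = e_1

# Evaluate both distributions at the eigenvalue jump points.
H = np.cumsum(d2)                      # H^{S_n}(lambda_i), equation (1.3)
F = np.arange(1, n + 1) / n            # F^{S_n}(lambda_i), equation (1.1)
print(np.sqrt(n / 2) * np.max(np.abs(H - F)))   # sup_x |X_n(F^{S_n}(x))|
```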
Henceforth, the investigation of $X_n(t)$ is converted to that of the difference between the two empirical distributions $H^{S_n}(x)$ and $F^{S_n}(x)$. The authors in [4] proved that $H^{S_n}(x)$ and $F^{S_n}(x)$ have the same limiting distribution, the Marčenko–Pastur distribution $F_y(x)$, where $y_n = n/N$ and $y = \lim_{n,N\to\infty} y_n \in (0,1)$. Before we present the main theorems, let us introduce the following notation:
$$\Delta^H = \|EH^{S_n} - F_{y_n}\| = \sup_x |EH^{S_n}(x) - F_{y_n}(x)|$$
and
$$\Delta_p^H = \|H^{S_n} - F_{y_n}\| = \sup_x |H^{S_n}(x) - F_{y_n}(x)|.$$
We denote $\xi_n = O_p(a_n)$ and $\eta_n = O_{a.s.}(b_n)$ if, for any $\varepsilon > 0$, there exist a large positive constant $c_1$ and a positive random variable $c_2$, such that
$$P(\xi_n/a_n \ge c_1) \le \varepsilon \qquad \text{and} \qquad P(\eta_n/b_n \le c_2) = 1,$$
respectively. In this paper, we follow the work in [4] and establish three types of convergence rates of $H^{S_n}(x)$ to $F_{y_n}(x)$ in the following theorems.

Theorem 1.1. Suppose that $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, N$, are i.i.d. complex random variables with $EX_{11} = 0$, $E|X_{11}|^2 = 1$ and $E|X_{11}|^{10} < \infty$. For any fixed unit vector $x_n \in \mathbb{C}_1^n = \{x \in \mathbb{C}^n : \|x\| = 1\}$ and $y_n = n/N \le 1$, it then follows that
$$\Delta^H = \|EH^{S_n} - F_{y_n}\| = \begin{cases} O(N^{-1/2}a^{-3/4}), & \text{if } N^{-1/2} \le a < 1,\\ O(N^{-1/8}), & \text{if } a < N^{-1/2}, \end{cases}$$
where $a = (1-\sqrt{y_n})^2$ as defined in (1.2) and $F_{y_n}$ denotes the Marčenko–Pastur distribution function with index $y_n$.

Remark 1.2. From the proof of Theorem 1.1, it is clear that the condition $E|X_{11}|^{10} < \infty$ is required only in the truncation step in the next section. We therefore believe that the condition $E|X_{11}|^{10} < \infty$ can be replaced by $E|X_{11}|^8 < \infty$ in Theorems 1.6 and 1.8.

Remark 1.3. Because the convergence rate of $\|EH^{S_n} - F_y\|$ depends on the convergence rate of $|y_n - y|$, we only consider the convergence rate of $\|EH^{S_n} - F_{y_n}\|$.

Remark 1.4. As $a = (1-\sqrt{y_n})^2$, we can characterize the closeness between $y_n$ and 1 through $a$. In particular, when $y_n$ is away from 1 (or
$a \ge N^{-1/2}$), the convergence rate of $\|EH^{S_n} - F_{y_n}\|$ is $O(N^{-1/2})$, which we believe is the optimal convergence rate. This is because we observe in [4] that, for an analytic function $f$,
$$Y_n(f) = \sqrt{n}\int f(x)\, d\big(H^{S_n}(x) - F_{y_n}(x)\big) \tag{1.5}$$
converges to a Gaussian distribution, while in [6], Bai and Silverstein proved that the limiting distribution of
$$n\int f(x)\, d\big(F^{S_n}(x) - F_{y_n}(x)\big)$$
is also a Gaussian distribution. We therefore conjecture that the optimal rate is $O(N^{-1/2})$ for $H^{S_n}(x)$ and $O(N^{-1})$ for $F^{S_n}(x)$. Although $F^{S_n}(x)$ and $H^{S_n}(x)$ converge to the same limiting distribution, there exists a substantial difference between $F^{S_n}(x)$ and $H^{S_n}(x)$.

Remark 1.5. Notice that the two matrices $XX^*$ and $X^*X$ share the same set of nonzero eigenvalues. However, these two matrices do not always share the same set of eigenvectors. In particular, when $y_n > 1$, the eigenvectors of $S_n$ corresponding to zero eigenvalues can be arbitrary. As a result, the limit of $H^{S_n}$ may not exist or may depend heavily on the choice of the unit vector $x_n$. Therefore, we only consider the case of $y_n \le 1$ in this paper and leave the case of $y_n > 1$ as a future research problem.

The rates of convergence in probability and almost sure convergence of the VESD are provided in the next two theorems.

Theorem 1.6. Under the assumptions in Theorem 1.1, except that we now only require $E|X_{11}|^8 < \infty$, we have
$$\Delta_p^H = \|H^{S_n} - F_{y_n}\| = \begin{cases} O_p(N^{-1/4}a^{-1/2}), & \text{if } N^{-1/4} \le a < 1,\\ O_p(N^{-1/8}), & \text{if } a < N^{-1/4}. \end{cases}$$

Remark 1.7. As an application of Theorem 1.6, in [8] we extended the CLT of the linear spectral statistics $Y_n(f)$ established in [4] to the case where the kernel function $f$ is twice continuously differentiable, provided that the sample covariance matrix $S_n$ satisfies the assumptions of Theorem 1.6. This result is useful in testing Johnstone's hypothesis when normality is not assumed.

Theorem 1.8. Under the assumptions in Theorem 1.6, for any $\eta > 0$, we have
$$\Delta_p^H = \|H^{S_n} - F_{y_n}\| = \begin{cases} O_{a.s.}(N^{-1/4+\eta}a^{-1/2}), & \text{if } N^{-1/4} \le a < 1,\\ O_{a.s.}(N^{-1/8+\eta}), & \text{if } a < N^{-1/4}. \end{cases}$$
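The rates in Theorems 1.6 and 1.8 can be probed numerically. The following rough experiment sketch (illustrative only, with Gaussian entries, $x_n = e_1$ and $y_n = 1/2$) estimates $\Delta_p^H$ for several $N$ and prints $N^{-1/4}$ alongside for comparison; the sizes and the crude sup taken at eigenvalue jump points are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def mp_cdf(x, y, num=40000):
    """Marcenko-Pastur CDF F_y by numerical integration of (1.2)."""
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    t = np.linspace(a, b, num)
    dens = np.sqrt(np.maximum((t - a) * (b - t), 0)) / (2 * np.pi * y * t)
    cdf = np.cumsum(dens) * (t[1] - t[0])
    return np.interp(x, t, cdf, left=0.0, right=1.0)

for N in (200, 800, 3200):
    n = N // 2
    X = rng.standard_normal((n, N))
    lam, U = np.linalg.eigh(X @ X.T / N)    # ascending eigenvalues
    H = np.cumsum(U[0] ** 2)                # VESD at its jump points, x_n = e_1
    gap = np.max(np.abs(H - mp_cdf(lam, n / N)))   # crude estimate of Delta_p^H
    print(N, round(gap, 4), round(N ** -0.25, 4))
```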
Remark 1.9. In this paper, we will use the following notation:

• $X^*$ denotes the conjugate transpose of a matrix (or vector) $X$;
• $X^T$ denotes the (ordinary) transpose of a matrix (or vector) $X$;
• $\|x\|$ denotes the Euclidean norm for any vector $x$;
• $\|A\| = \sqrt{\lambda_{\max}(AA^*)}$, the spectral norm;
• $\|F\| = \sup_x |F(x)|$ for any function $F$;
• $\bar z$ denotes the conjugate of a complex number $z$.
The rest of the paper is organized as follows. In Section 2, we introduce the main tools used to prove Theorems 1.1, 1.6 and 1.8, including the Stieltjes transform and a Berry–Esseen type inequality. The proofs of these three theorems are presented in Sections 3–6. Several important results which are repeatedly employed throughout Sections 3–6 are proved in Appendix A. Appendix B contains some existing results in the literature. Finally, preliminaries on truncation, centralization and rescaling are postponed to the last section.

2. Main tools.

2.1. Stieltjes transform. The Stieltjes transform is an essential tool in random matrix theory and in our paper. Let us now briefly review the Stieltjes transform and some important and relevant results. For a cumulative distribution function $G(x)$, its Stieltjes transform $m_G(z)$ is defined as
$$m_G(z) = \int \frac{1}{\lambda - z}\, dG(\lambda), \qquad z \in \mathbb{C}^+ = \{z \in \mathbb{C} : \Im(z) > 0\},$$
where $\Im(\cdot)$ denotes the imaginary part of a complex number. The Stieltjes transforms of the ESD $F^{S_n}(x)$ and the VESD $H^{S_n}(x)$ are
$$m_{F^{S_n}}(z) = \frac{1}{n}\operatorname{tr}(S_n - zI_n)^{-1} \qquad \text{and} \qquad m_{H^{S_n}}(z) = x_n^*(S_n - zI_n)^{-1}x_n,$$
respectively. Here $I_n$ denotes the $n \times n$ identity matrix. For simplicity of notation, we use $m_n(z)$ and $m_n^H(z)$ to denote $m_{F^{S_n}}(z)$ and $m_{H^{S_n}}(z)$, respectively.

Remark 2.1. Notice that, although the eigenmatrix $U_n$ may not be unique, the Stieltjes transform $m_n^H(z)$ of $H^{S_n}$ depends, for any $x_n$, on $S_n$ rather than on $U_n$.
Let $\underline{S}_n = X^*X/N$ denote the companion matrix of $S_n$. As $S_n$ and $\underline{S}_n$ share the same set of nonzero eigenvalues, it can be shown that the Stieltjes transforms of $F^{S_n}(x)$ and $F^{\underline{S}_n}(x)$ satisfy the following equality:
$$\underline{m}_n(z) = -\frac{1-y_n}{z} + y_n m_n(z), \tag{2.1}$$
where $\underline{m}_n(z)$ denotes the Stieltjes transform of $F^{\underline{S}_n}(x)$. Moreover, [5] and [24] claimed that $F^{\underline{S}_n}$ converges, almost surely, to a nonrandom distribution function $\underline{F}_y(x)$ with Stieltjes transform $\underline{m}(z)$ such that
$$\underline{m}(z) = -\frac{1-y}{z} + y m_y(z), \tag{2.2}$$
where $m_y(z)$ denotes the Stieltjes transform of the Marčenko–Pastur distribution with index $y$. Using (6.1.4) in [7], we also obtain the relationship between the two limits $m_y(z)$ and $\underline{m}(z)$ as follows:
$$m_y(z) = \frac{1}{-z(1 + \underline{m}(z))}. \tag{2.3}$$
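These relations are easy to verify numerically. The sketch below (illustrative; parameters arbitrary) compares $m_n(z)$ and $m_n^H(z)$ computed from the resolvent with the closed-form Marčenko–Pastur transform, whose square-root branch is fixed by requiring $\Im m_y(z) > 0$ for $z \in \mathbb{C}^+$; the closed form used here is the one quoted later in the proof of Lemma A.1.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 400, 800
X = rng.standard_normal((n, N))
S = X @ X.T / N
y = n / N

def m_mp(z):
    """Marcenko-Pastur Stieltjes transform; pick the root with Im > 0."""
    s = np.sqrt((1 + y - z) ** 2 - 4 * y + 0j)
    m = (-(y + z - 1) + s) / (2 * y * z)
    return m if m.imag > 0 else (-(y + z - 1) - s) / (2 * y * z)

z = 1.0 + 0.1j
R = np.linalg.inv(S - z * np.eye(n))    # resolvent (S_n - z I_n)^{-1}
xn = np.zeros(n); xn[0] = 1.0           # a fixed unit vector
print(np.trace(R) / n)                  # m_n(z)
print(xn @ R @ xn)                      # m_n^H(z)
print(m_mp(z))                          # m_y(z); all three should be close
```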
2.2. A Berry–Esseen type inequality.

Lemma 2.2. Let $H^{S_n}(x)$ and $F_{y_n}(x)$ be the VESD of $S_n$ and the Marčenko–Pastur distribution with index $y_n$, respectively. Denote their corresponding Stieltjes transforms by $m_n^H(z)$ and $m_{y_n}(z)$, respectively. Then there exist large positive constants $A$, $B$, $K_1$, $K_2$ and $K_3$, such that for $A > B > 5$,
$$\begin{aligned}
\Delta^H = \|EH^{S_n}(x) - F_{y_n}(x)\| &\le K_1 \int_{-A}^{A} |Em_n^H(z) - m_{y_n}(z)|\, du + K_2 v^{-1} \int_{|x|>B} |EH^{S_n}(x) - F_{y_n}(x)|\, dx\\
&\qquad + K_3 v^{-1} \sup_x \int_{|t| \le 2v\tau} |F_{y_n}(x+t) - F_{y_n}(x)|\, dt,
\end{aligned}$$
where $z = u + iv$ is a complex number with positive imaginary part (i.e., $v > 0$) and $\tau$ is the constant appearing in Lemma B.1.

Remark 2.3. Lemma 2.2 can be proved using Lemma B.1. To prove Theorem 1.1, we apply Lemma 2.2. In addition, we prove Theorems 1.6 and 1.8 by replacing $EH^{S_n}(x)$ and $Em_n^H(z)$ with $H^{S_n}(x)$ and $m_n^H(z)$, respectively.

3. Proof of Theorem 1.1. Under the condition $E|X_{11}|^{10} < \infty$, we can choose a sequence $\eta_N$ with $\eta_N \downarrow 0$ and $\eta_N N^{1/4} \uparrow \infty$ as $N \to \infty$, such that
$$\lim_{N\to\infty} \frac{1}{\eta_N^{10}}\, E\big(|X_{11}|^{10} I(|X_{11}| > \eta_N N^{1/4})\big) = 0. \tag{3.1}$$
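Such a sequence always exists; the following is a sketch of the standard construction (the thresholds $k^{-11}$ and the block choice are illustrative, not taken from the paper).

```latex
% Since E|X_{11}|^{10} < \infty, dominated convergence gives, for each fixed k >= 1,
\[
  E\bigl(|X_{11}|^{10}\, I(|X_{11}| > k^{-1}N^{1/4})\bigr) \longrightarrow 0
  \qquad (N \to \infty),
\]
% so one may choose integers N_k increasing to infinity (enlarging N_k if
% necessary so that N_k^{1/4}/k increases to infinity) with this expectation
% at most k^{-11} for all N >= N_k. Setting \eta_N = k^{-1} for
% N_k <= N < N_{k+1} gives \eta_N \downarrow 0, \eta_N N^{1/4} \uparrow \infty, and
\[
  \eta_N^{-10}\, E\bigl(|X_{11}|^{10}\, I(|X_{11}| > \eta_N N^{1/4})\bigr)
  \le k^{10} \cdot k^{-11} = k^{-1} \to 0 .
\]
```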
Furthermore, without loss of generality, we can assume that every $|X_{ij}|$ is bounded by $\eta_N N^{1/4}$ and has mean 0 and variance 1. See Appendix C for details on truncation, centralization and rescaling. We introduce some notation before starting the proof of Theorem 1.1. Throughout the paper, we use $C$ and $C_i$, $i = 0, 1, 2, \ldots$, to denote positive constants which are independent of $N$ and may take different values at different appearances. Let $X_j$ denote the $j$th column of the data matrix $X$. Let $r_j = X_j/\sqrt{N}$, so that $S_n = \sum_{j=1}^{N} r_j r_j^*$, and let
$$v_y = \sqrt{a} + \sqrt{v} = 1 - \sqrt{y_n} + \sqrt{v}, \qquad B_j = S_n - r_j r_j^*,$$
$$A(z) = S_n - zI_n, \qquad A_j(z) = B_j - zI_n,$$
$$\alpha_j(z) = r_j^* A_j^{-1}(z) x_n x_n^* r_j - \frac{1}{N} x_n^* A_j^{-1}(z) x_n,$$
$$\xi_j(z) = r_j^* A_j^{-1}(z) r_j - \frac{1}{N} E\operatorname{tr} A_j^{-1}(z), \qquad \hat\xi_j(z) = r_j^* A_j^{-1}(z) r_j - \frac{1}{N}\operatorname{tr} A_j^{-1}(z),$$
$$b(z) = \frac{1}{1 + (1/N)E\operatorname{tr} A^{-1}(z)}, \qquad b_1(z) = \frac{1}{1 + (1/N)E\operatorname{tr} A_1^{-1}(z)},$$
$$\beta_j(z) = \frac{1}{1 + r_j^* A_j^{-1}(z) r_j}.$$
It is easy to show that
$$\beta_j(z) - b_1(z) = -b_1(z)\beta_j(z)\xi_j(z). \tag{3.2}$$
For any $j = 1, 2, \ldots, N$, we can also show that
$$r_j^* A^{-1}(z) = \beta_j(z) r_j^* A_j^{-1}(z), \tag{3.3}$$
due to the fact that
$$(B_j - zI_n)^{-1} - (B_j + r_j r_j^* - zI_n)^{-1} = (B_j - zI_n)^{-1} r_j r_j^* (B_j + r_j r_j^* - zI_n)^{-1}.$$
From (2.2) in [23], we can write $\underline{m}_n(z)$ in terms of $\beta_j(z)$ as follows:
$$\underline{m}_n(z) = -\frac{1}{zN}\sum_{j=1}^{N} \beta_j(z). \tag{3.4}$$
We proceed with the proof of Theorem 1.1:
$$\begin{aligned}
\delta &:= Em_n^H(z) - m_y(z)\\
&= x_n^*\big[EA^{-1}(z) - (-z\underline{m}(z)I_n - zI_n)^{-1}\big]x_n\\
&= (z\underline{m}(z) + z)^{-1} x_n^* E\big[(z\underline{m}(z) + z)A^{-1}(z) + I_n\big]x_n\\
&= (z\underline{m}(z) + z)^{-1} x_n^* E\big[(zI_n + A(z))A^{-1}(z) + z\underline{m}(z)A^{-1}(z)\big]x_n\\
&= (z\underline{m}(z) + z)^{-1} x_n^* E\Bigg[\sum_{j=1}^{N} r_j r_j^* A^{-1}(z) + z(E\underline{m}_n(z))A^{-1}(z) - z(E\underline{m}_n(z))A^{-1}(z) + z\underline{m}(z)A^{-1}(z)\Bigg]x_n\\
&= (z\underline{m}(z) + z)^{-1} x_n^* E\Bigg[\sum_{j=1}^{N} \beta_j(z) r_j r_j^* A_j^{-1}(z) - (-zE\underline{m}_n(z))A^{-1}(z) - (zE\underline{m}_n(z) - z\underline{m}(z))A^{-1}(z)\Bigg]x_n\\
&= (z\underline{m}(z) + z)^{-1} x_n^* E\Bigg[\sum_{j=1}^{N} \beta_j(z) r_j r_j^* A_j^{-1}(z) - \Bigg(\frac{1}{N}\sum_{j=1}^{N} E\beta_j(z)\Bigg)A^{-1}(z)\Bigg]x_n\\
&\qquad - (z\underline{m}(z) + z)^{-1} x_n^*\big(zE\underline{m}_n(z) - z\underline{m}(z)\big)\big(EA^{-1}(z)\big)x_n\\
&= (z\underline{m}(z) + z)^{-1} x_n^* E\Bigg[\sum_{j=1}^{N}\Bigg(\beta_j(z) r_j r_j^* A_j^{-1}(z) - \frac{1}{N} E\beta_j(z)\, A^{-1}(z)\Bigg)\Bigg]x_n\\
&\qquad + m_y(z)\big(zE\underline{m}_n(z) - z\underline{m}(z)\big)Em_n^H(z)\\
&=: \delta_1 + \delta_2,
\end{aligned}$$
where
$$\delta_1 = (z\underline{m}(z) + z)^{-1} x_n^* E\Bigg[\sum_{j=1}^{N}\Bigg(\beta_j(z) r_j r_j^* A_j^{-1}(z) - \frac{1}{N} E\beta_j(z)\, A^{-1}(z)\Bigg)\Bigg]x_n,$$
$$\delta_2 = m_y(z)\big(zE\underline{m}_n(z) - z\underline{m}(z)\big)Em_n^H(z).$$

Lemma 3.1. Suppose that
$$|\delta_1| \le \frac{C_1}{Nvv_y}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)$$
holds for some constants $C_0$ and $C_1$ whenever $v^2 v_y \ge C_0 N^{-1}$. Then, under the conditions of Theorem 1.1, there exists a constant $C$ such that $\Delta^H \le Cv/v_y$.

Proof. According to Lemma 2.2,
$$\begin{aligned}
\Delta^H &\le K_1 \int_{-A}^{A} |Em_n^H(z) - m_y(z)|\, du + K_2 v^{-1} \int_{|x|>B} |EH^{S_n}(x) - F_{y_n}(x)|\, dx\\
&\qquad + K_3 v^{-1} \sup_x \int_{|t| \le 2v\tau} |F_{y_n}(x+t) - F_{y_n}(x)|\, dt.
\end{aligned} \tag{3.5}$$

Note that $r_1$ is independent of $A_1^{-1}(z)$ and $E(\alpha_1(z) \mid A_1^{-1}(z)) = 0$, so for any fixed $p > 0$ we have
$$E\big(\operatorname{tr} A_1^{-1}(z) - E\operatorname{tr} A_1^{-1}(z)\big)\alpha_1(z) = 0$$
and
$$E\big(\operatorname{tr} A_1^{-1}(z)\big)^p \alpha_1(z) = 0.$$
Denote $A_1^{-1}(z) = (a_{ij})_{n\times n}$, $A_1^{-1}(z)x_n x_n^* = (b_{ij})_{n\times n}$, and let $e_i$ be the $i$th canonical basis vector, that is, the $n$-vector whose coordinates are all 0 except
that the $i$th coordinate is 1. Then Lemmas B.3, A.5, A.6 and the inequality $\|A_1^{-1}(z)\| \le 1/v$ imply that
$$\begin{aligned}
|E\xi_1(z)\alpha_1(z)| = |E\hat\xi_1(z)\alpha_1(z)| &\le \frac{C}{N^2}\left\{E\operatorname{tr}\big(A_1^{-1}(z)A_1^{-1}(\bar z)x_n x_n^*\big) + E\left|\sum_{i=1}^{n} a_{ii}b_{ii}\right|\right\}\\
&\le \frac{C}{N^2}\left\{v^{-1}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right) + \left(\sum_{i=1}^{n} E|a_{ii}|^2|x_n^* e_i|^2 \sum_{i=1}^{n} E|x_n^* A_1^{-1}(\bar z)e_i|^2\right)^{1/2}\right\}\\
&\le \frac{C}{N^2 v}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).
\end{aligned}$$
In the above, we use the following two results, which can be proved by applying Lemmas A.5 and A.6:
$$E|a_{11}| = E|e_1^* A_1^{-1}(z)e_1| \le C\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right), \qquad E|a_{11}|^2 \le E|e_1^* A_1^{-1}(z)A_1^{-1}(\bar z)e_1| \le \frac{C}{v}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).$$
Hence, we have shown that
$$|E\xi_1(z)\alpha_1(z)| \le \frac{C}{N^2 v}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.3}$$
Let $X_1^*$ denote the conjugate transpose of $X_1$, that is, $X_1^* = (\bar X_{11}, \bar X_{21}, \ldots, \bar X_{n1})$. Then we can rewrite the second term in the upper bound of $|\delta_{11}|$ as
$$\begin{aligned}
E\xi_1^2(z)\alpha_1(z) &= \frac{1}{N^3} E\left\{\left(\sum_{i\ne j} a_{ij}\bar X_{i1}X_{j1} + \sum_i \big(a_{ii}|X_{i1}|^2 - Ea_{ii}\big)\right)^2 \sum_{i,j} b_{ij}\big(\bar X_{i1}X_{j1} - \delta_{ij}\big)\right\}\\
&= \frac{1}{N^3} E\left\{\left[\left(\sum_{i\ne j} a_{ij}\bar X_{i1}X_{j1}\right)^2 + 2\left(\sum_{i\ne j} a_{ij}\bar X_{i1}X_{j1}\right)\sum_i\big(a_{ii}|X_{i1}|^2 - Ea_{ii}\big) + \left(\sum_i\big(a_{ii}|X_{i1}|^2 - Ea_{ii}\big)\right)^2\right]\right.\\
&\qquad\qquad\left.\times\left(\sum_{i\ne j} b_{ij}\bar X_{i1}X_{j1} + \sum_i b_{ii}\big(|X_{i1}|^2 - 1\big)\right)\right\}\\
&\le \frac{C}{N^3} E\left(\sum_i |a_{ii}^2 b_{ii}| + \sum_i |Ea_{ii}|^2|b_{ii}| + \sum_{i\ne j}|a_{ij}^2 b_{ii}| + \sum_{i\ne j}|a_{ij}^2 b_{ij}| + \sum_{i\ne j}|a_{ij}a_{ii}b_{jj}| + \sum_{i\ne j}|a_{ii}a_{ij}b_{ij}|\right)\\
&\qquad + \frac{C}{N^3}\sum_{\iota,\tau}\left|E\sum_{i\ne j\ne k} a_{ik}^{\iota} a_{ij}^{\tau} b_{jk}\right|,
\end{aligned}$$
where $a_{ij}^{\iota}$ and $a_{ij}^{\tau}$ denote $a_{ij}$ or $\bar a_{ij}$. By following proofs similar to those establishing (4.3), we are able to show that
$$E\sum_i |a_{ii}^2 b_{ii}| \le \frac{1}{v}E\sum_i |a_{ii}b_{ii}| \le \frac{1}{v}\big(E|a_{11}^2|\, Ex_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n\big)^{1/2} \le \frac{C}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right),$$
$$E\sum_i |Ea_{ii}|^2|b_{ii}| \le |Ea_{11}|^2\big(Ex_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n\big)^{1/2} \le \frac{C}{v}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^{3/2}.$$
The Cauchy–Schwarz inequality implies that
$$\begin{aligned}
E\sum_{i\ne j}|a_{ij}^2 b_{ii}| &\le E\sum_i\left(\sum_j |a_{ij}^2|\right)|b_{ii}| \le \left(\sum_i E|x_n^* A_1^{-1}(z)e_i|^2\right)^{1/2}\left(\sum_i |x_n^* e_i|^2 E\left(\sum_j |a_{ij}^2|\right)^2\right)^{1/2}\\
&= \big(Ex_n^*\big(A_1^{-1}(z)A_1^{-1}(\bar z)\big)x_n\big)^{1/2}\big(E\big|\big(\big(A_1^{-1}(z)A_1^{-1}(\bar z)\big)_{11}\big)^2\big|\big)^{1/2} \le \frac{C}{v^{3/2}}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^{3/2},
\end{aligned}$$
$$\begin{aligned}
E\left|\sum_{i\ne j} a_{ij}^2 b_{ij}\right| &\le \left(\sum_{ij} E|a_{ij}x_n^* e_j|^2 \sum_{ij} E|a_{ij}x_n^* A_1^{-1}(z)e_i|^2\right)^{1/2}\\
&\le \left(\sum_i E\big|\big(A_1^{-1}(z)A_1^{-1}(\bar z)\big)_{ii}\big(x_n^* A_1^{-1}(z)e_i\big)^2\big| \times \sum_j E\big|\big(A_1^{-1}(\bar z)A_1^{-1}(z)\big)_{jj}\big(x_n^* e_j\big)^2\big|\right)^{1/2}\\
&\le \frac{C}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^{3/2},
\end{aligned}$$
$$\begin{aligned}
E\sum_{i\ne j}|a_{ii}a_{ij}b_{jj}| &\le \left(\sum_{ij} E|a_{ij}x_n^* e_j|^2 \sum_{ij} E|a_{ii}x_n^* A_1^{-1}(z)e_j|^2\right)^{1/2}\\
&\le \left(\sum_i E|a_{ii}^2|\, x_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n \times \sum_j E\big(A_1^{-1}(z)A_1^{-1}(\bar z)\big)_{jj}|x_n^* e_j|^2\right)^{1/2}\\
&\le C\frac{\sqrt{N}}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right),
\end{aligned}$$
$$E\sum_{i\ne j}|a_{ii}a_{ij}b_{ij}| \le \left(\sum_{ij} E|a_{ij}x_n^* e_i|^2 \sum_{ij} E|a_{ii}x_n^* A_1^{-1}e_j|^2\right)^{1/2} \le C\frac{\sqrt{N}}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).$$
Finally, we establish
$$\sum_{\iota,\tau}\left|E\sum_{i\ne j\ne k} a_{ij}^{\iota}a_{jk}^{\tau}b_{ik}\right| \le C\frac{\sqrt{N}}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).$$
By the inclusion–exclusion principle and what we have just proved, it remains to show that
$$\sum_{\iota,\tau}\left|E\sum_{i,j,k} a_{ij}^{\iota}a_{jk}^{\tau}b_{ik}\right| \le C\frac{\sqrt{N}}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.4}$$
Notice that $\gamma_{ik} = \sum_j a_{ij}a_{jk}$ is the $(i,k)$-element of $A_1^{-2}(z)$. We obtain
$$\begin{aligned}
\left|E\sum_{i,j,k} a_{ij}a_{jk}b_{ik}\right| &= \left|E\sum_{i,k}\gamma_{ik}b_{ik}\right| \le \left(\sum_{i,k} E|\gamma_{ik}|^2 \sum_{i,k} E|b_{ik}|^2\right)^{1/2}\\
&= \big(E\operatorname{tr}\big(A_1^{-2}(z)A_1^{-2}(\bar z)\big)\, Ex_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n\big)^{1/2} \le C\frac{\sqrt{N}}{v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).
\end{aligned}$$
Similarly, one can prove that the other terms of (4.4) share this common bound. In summary, when $v^2 v_y \ge C_0 N^{-1}$ it holds that
$$|E\xi_1^2(z)\alpha_1(z)| \le \frac{C}{N^2 v}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.5}$$
For the last term in the upper bound of $|\delta_{11}|$, we apply Lemma A.4 and the Cauchy–Schwarz inequality again. In particular, for any fixed $t > 0$, we have that
$$\begin{aligned}
|E\beta_1(z)\xi_1^3(z)\alpha_1(z)| &\le CE|\xi_1^3(z)\alpha_1(z)| + o(N^{-t})\\
&\le C\big(E|\xi_1(z)|^6\big)^{1/2}\big(E|\alpha_1(z)|^2\big)^{1/2} + o(N^{-t})\\
&\le \frac{C}{N^{5/2}v^2 v_y^{3/2}}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^{1/2}. \tag{4.6}
\end{aligned}$$
The last inequality in (4.6) is due to Lemmas A.2 and A.7. Therefore, for any $v^2 v_y \ge C_0 N^{-1}$, (4.3), (4.5) and (4.6) lead us to
$$|\delta_{11}| \le \frac{C}{Nvv_y}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.7}$$
To establish the upper bound for $|\delta_{12}|$, we will make use of the following equality:
$$A_1^{-1}(z) - A^{-1}(z) = \beta_1(z)A_1^{-1}(z)r_1 r_1^* A_1^{-1}(z). \tag{4.8}$$
Note that (4.1) implies that
$$\begin{aligned}
|\delta_{12}| &\le \frac{C}{v_y}\big|E\beta_1(z)x_n^*\big(A_1^{-1}(z) - A^{-1}(z)\big)x_n\big|\\
&\le \frac{C}{v_y}\big|E\beta_1^2(z)x_n^* A_1^{-1}(z)r_1 r_1^* A_1^{-1}(z)x_n\big|\\
&\le \frac{C}{Nv_y} E\big|X_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)X_1\big| + o(N^{-t}) \qquad \text{(see Lemma A.4)}\\
&\le \frac{C}{Nv_y}\big|E\operatorname{tr}\big(A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)\big)\big| \qquad \text{(see Lemma B.5)}\\
&= \frac{C}{Nv_y}\big|Ex_n^* A_1^{-2}(z)x_n\big| \le \frac{C}{Nvv_y}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.9}
\end{aligned}$$
At last, we establish the upper bound for $\delta_{13}$. By (4.1), Lemma A.3, and the fact that
$$\beta_1(z) = b_1(z) - b_1(z)\beta_1(z)\xi_1(z), \tag{4.10}$$
we obtain
$$\begin{aligned}
|\delta_{13}| &\le \frac{C}{v_y}\big|E\big\{\beta_1(z)x_n^*\big(A^{-1}(z) - EA^{-1}(z)\big)x_n\big\}\big|\\
&= \frac{C}{v_y}\big|E\big\{b(z)\beta_1(z)\xi_1(z)x_n^*\big(A^{-1}(z) - EA^{-1}(z)\big)x_n\big\}\big|\\
&\le \frac{C}{v_y}\big|E\big\{\xi_1(z)x_n^*\big(A^{-1}(z) - EA^{-1}(z)\big)x_n\big\}\big| + \frac{C}{v_y}\big|E\big\{\beta_1(z)\xi_1^2(z)x_n^*\big(A^{-1}(z) - EA^{-1}(z)\big)x_n\big\}\big|.
\end{aligned}$$
From the $C_r$ inequality [see Loève [14]], we have $|\delta_{13}| \le \frac{C}{v_y}(|II_1| + |II_2| + |II_3|)$, where
$$II_1 = E\big\{\xi_1(z)x_n^*\big(A_1^{-1}(z) - EA_1^{-1}(z)\big)x_n\big\},$$
$$II_2 = E\big\{\xi_1(z)x_n^*\big(A^{-1}(z) - A_1^{-1}(z)\big)x_n\big\},$$
$$II_3 = E\big\{\beta_1(z)\xi_1(z)x_n^* E\big(A^{-1}(z) - A_1^{-1}(z)\big)x_n\big\}.$$
It should be noted that $E(\xi_1(z) \mid A_1^{-1}(z)) = 0$, and that $r_1$ and $A_1^{-1}(z)$ are independent. Then we have
$$II_1 = E\big[E\big\{\xi_1(z)x_n^*\big(A_1^{-1}(z) - EA_1^{-1}(z)\big)x_n \mid A_1^{-1}(z)\big\}\big] = 0.$$
By the results in (4.8) and (4.10), we have
$$\begin{aligned}
|II_2| &= \big|E\xi_1(z)\beta_1(z)x_n^* A_1^{-1}(z)r_1 r_1^* A_1^{-1}(z)x_n\big|\\
&\le |b_1(z)|\big|E\xi_1(z)r_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)r_1\big| + |b_1(z)|\big|E\beta_1(z)\xi_1^2(z)r_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)r_1\big|\\
&=: III_1 + III_2,
\end{aligned}$$
where
$$\begin{aligned}
III_1 &= |b_1(z)|\left|E\xi_1(z)\left(r_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)r_1 - \frac{1}{N}x_n^* A_1^{-2}(z)x_n\right)\right|\\
&\le \frac{C}{N^2}\left(E\big|\operatorname{tr}\big(x_n^* A_1^{-3}(z)x_n\big)\big| + E\left|\sum_i \big(A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)\big)_{ii} a_{ii}\right|\right)\\
&= \frac{C}{N^2 v^2}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).
\end{aligned}$$
The above inequality follows from Lemmas B.3 and A.3. By Lemma A.4 and the Cauchy–Schwarz inequality, it holds that
$$\begin{aligned}
III_2 &\le CE\big|\xi_1^2(z)r_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)r_1\big|\\
&\le C\big(E|\xi_1(z)|^4\big)^{1/2}\big(E\big|r_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)r_1\big|^2\big)^{1/2}\\
&\le C\left(\frac{1}{N^2 v^2 v_y^2}\right)^{1/2}\left(\frac{1}{N^2}E\big|x_n^* A_1^{-2}(z)x_n\big|^2\right)^{1/2} \qquad \text{(see Lemmas A.2 and B.5)}\\
&\le \frac{C}{N^2 v^2 v_y}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).
\end{aligned}$$
Hence, we have shown that
$$|II_2| \le \frac{C}{N^2 v^2 v_y}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).$$
Moreover, Lemmas A.2 and A.5, and the Cauchy–Schwarz inequality, lead us to the following:
$$|II_3| \le C\big(E|\xi_1(z)|^2\, E\big|x_n^*\big(A^{-1}(z) - A_1^{-1}(z)\big)x_n\big|^2\big)^{1/2} \le \frac{C}{N^{1/2}v^{1/2}v_y^{1/2}}\cdot\frac{C}{Nv}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right) = \frac{C}{N^{3/2}v^{3/2}v_y^{1/2}}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).$$
Therefore, it follows that
$$|\delta_{13}| \le \frac{C}{v_y}(|II_1| + |II_2| + |II_3|) \le \frac{C}{N^{3/2}v^{3/2}v_y^{3/2}}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.11}$$
As has been shown in (4.7), (4.9) and (4.11), we conclude that
$$|\delta_1| \le \frac{C}{Nvv_y}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right). \tag{4.12}$$

5. Proof of Theorem 1.6. From Lemma 2.2, and by replacing $EH^{S_n}(x)$ and $Em_n^H(z)$ with $H^{S_n}(x)$ and $m_n^H(z)$, respectively, we have
$$\begin{aligned}
E\Delta_p^H &:= E\|H^{S_n}(x) - F_{y_n}(x)\|\\
&\le K_1\int_{-A}^{A} E|m_n^H(z) - Em_n^H(z)|\, du + K_1\int_{-A}^{A}|Em_n^H(z) - m_y(z)|\, du\\
&\qquad + K_2 v^{-1}\int_{|x|>B}|EH^{S_n}(x) - F_{y_n}(x)|\, dx + K_3 v^{-1}\sup_x\int_{|t|\le 2v\tau}|F_{y_n}(x+t) - F_{y_n}(x)|\, dt.
\end{aligned}$$

6. Proof of Theorem 1.8. One can show that, for any $l > 1$,
$$\int_{-A}^{A}|m_n^H(z) - Em_n^H(z)|\, du = o_{a.s.}(N^{-1/8+\eta}).$$
When $a \ge N^{-1/4}$, by choosing $v = O(N^{-1/4})$, we have
$$a^l N^{2l(1/4-\eta)} E|m_n^H(z) - Em_n^H(z)|^{2l} \le CN^{-2l\eta}.$$
Theorem 1.8 then follows by setting $l > \frac{1}{2\eta}$. This completes the proof of Theorem 1.8.
APPENDIX A

In this section, we establish some lemmas which are used in the proofs of the main theorems.

Lemma A.1. Under the conditions of Theorem 1.6, for all $|z| < A$ and $v^2 v_y \ge C_0 N^{-1}$, we have
$$|Em_n(z) - m_y(z)| \le \frac{C}{Nv^{3/2}v_y^2},$$
where $C_0$ is a constant and $v_y = 1 - \sqrt{y_n} + \sqrt{v}$.
Proof. Since
$$\begin{aligned}
Em_n(z) &= \frac{1}{n}E\operatorname{tr}(S_n - zI_n)^{-1}\\
&= \frac{1}{n}\sum_{k=1}^{n} E\frac{1}{s_{kk} - z - N^{-2}\alpha_k^*(S_{nk} - zI_{n-1})^{-1}\alpha_k}\\
&= \frac{1}{n}\sum_{k=1}^{n} E\frac{1}{\epsilon_k + 1 - y_n - z - y_n z Em_n(z)}\\
&= -\frac{1}{z + y_n - 1 + y_n z Em_n(z)} + \delta_n,
\end{aligned} \tag{A.1}$$
where
$$s_{kk} = \frac{1}{N}\sum_{j=1}^{N}|X_{kj}|^2, \qquad S_{nk} = \frac{1}{N}X_{(k)}X_{(k)}^*, \qquad \alpha_k = X_{(k)}\bar X_k,$$
$$\epsilon_k = (s_{kk} - 1) + y_n + y_n z Em_n(z) - \frac{1}{N^2}\alpha_k^*(S_{nk} - zI_{n-1})^{-1}\alpha_k,$$
$$\delta_n = -\frac{1}{n}\sum_{k=1}^{n} b_n E\beta_k\epsilon_k, \qquad b_n = b_n(z) = \frac{1}{z + y_n - 1 + y_n z Em_n(z)},$$
$$\beta_k = \beta_k(z) = \frac{1}{z + y_n - 1 + y_n z Em_n(z) - \epsilon_k},$$
where $X_{(k)}$ is the $(n-1)\times N$ matrix obtained from $X$ with its $k$th row removed and $X_k^*$ is the $k$th row of $X$. It has been proved that one of the roots of equation (A.1) is (see (3.1.7) in [3])
$$Em_n(z) = -\frac{1}{2y_n z}\Big(z + y_n - 1 - y_n z\delta_n - \sqrt{(z + y_n - 1 + y_n z\delta_n)^2 - 4y_n z}\Big).$$
The Stieltjes transform of the Marčenko–Pastur distribution with index $y_n$ is given by (see (2.3) in [3])
$$m_y(z) = -\frac{y_n + z - 1 - \sqrt{(1 + y_n - z)^2 - 4y_n}}{2y_n z}.$$
Thus,
$$|Em_n(z) - m_y(z)| \le \frac{|\delta_n|}{2}\left(1 + \frac{|2(z + y_n - 1) - y_n z\delta_n|}{\big|\sqrt{(z + y_n - 1)^2 - 4y_n z} + \sqrt{(z + y_n - 1 + y_n z\delta_n)^2 - 4y_n z}\big|}\right).$$
Let us define by convention
$$\Re(\sqrt{z}) = \frac{\Im(z)}{\sqrt{2(|z| - \Re(z))}}, \qquad \Im(\sqrt{z}) = \frac{|\Im(z)|}{\sqrt{2(|z| + \Re(z))}}.$$
If $|u - y_n - 1| \ge \frac{1}{5(A+1)}$, then the real parts of $\sqrt{(z + y_n - 1)^2 - 4y_n z}$ and $\sqrt{(z + y_n - 1 + y_n z\delta_n)^2 - 4y_n z}$ have the same sign. Since they both have positive imaginary parts, it follows that
$$\big|\sqrt{(z + y_n - 1)^2 - 4y_n z} + \sqrt{(z + y_n - 1 + y_n z\delta_n)^2 - 4y_n z}\big| \ge \sqrt{\big|\Im\big((z + y_n - 1)^2 - 4y_n z\big)\big|} = \sqrt{2v|u - y_n - 1|} \ge \left(\frac{2v}{5(A+1)}\right)^{1/2}.$$
Thus,
$$|Em_n(z) - m_y(z)| \le C|\delta_n|\left(1 + \frac{C}{\sqrt{v}}\right) \le \frac{C|\delta_n|}{\sqrt{v}}. \tag{A.2}$$
If $|u - y_n - 1| < \frac{1}{5(A+1)}$, we have $|Em_n(z) - m_y(z)| \le C|\delta_n|$. In [7] [see the inequality above (8.3.16)], we have
$$|\delta_n| \le \frac{C}{Nv}\left(\Delta + \frac{v^2}{v_y^3}\right) \le \frac{C}{Nvv_y^2}.$$
Combined with (A.2), we get
$$|Em_n(z) - m_y(z)| \le \frac{C}{Nv^{3/2}v_y^2}.$$
The proof of the lemma is complete.
In addition, the following relevant result, which is involved in the proof of Lemma 3.1, is presented here:
$$\begin{aligned}
\int_{-A}^{A}|Em_n(z) - m_y(z)|\, du &= \int_{|u-y_n-1|\ge 1/(5(A+1)),\, |u|\le A}|Em_n(z) - m_y(z)|\, du\\
&\qquad + \int_{|u-y_n-1| < 1/(5(A+1))}|Em_n(z) - m_y(z)|\, du.
\end{aligned} \tag{A.3}$$

Lemma A.4. For any fixed $t > 0$,
$$P(|\beta_1(z)| > 2C) = o(N^{-t}).$$
Proof. Note that if $|b_1(z)\xi_1(z)| \le 1/2$, by Lemma A.3 we get
$$|\beta_1(z)| = \frac{|b_1(z)|}{|1 + b_1(z)\xi_1(z)|} \le \frac{|b_1(z)|}{1 - |b_1(z)\xi_1(z)|} \le 2|b_1(z)| \le 2C.$$
As a result,
$$P(|\beta_1(z)| > 2C) \le P\left(|b_1(z)\xi_1(z)| > \frac{1}{2}\right) \le P\left(|\xi_1(z)| > \frac{1}{2C}\right) \le (2C)^p E|\xi_1(z)|^p \qquad \text{(see Lemma A.3)}.$$
By the $C_r$-inequality and Lemmas B.8 and B.9, for $\eta = \eta_N N^{-1/4}$ and $p \ge \log N$, we have
$$\begin{aligned}
E|\xi_1(z)|^p &\le E|\hat\xi_1(z)|^p + E\left|\frac{1}{N}\operatorname{tr}A_1^{-1}(z) - \frac{1}{N}E\operatorname{tr}A_1^{-1}(z)\right|^p\\
&\le C\big(N\eta_N^4 N^{-1}\big)^{-1}\big(v^{-1}\eta_N^2 N^{-1/2}\big)^p + \frac{C}{N^p v^{3p/2}v_y^{p/2}}\\
&\le C\eta_N^{2p-4} \le C\eta_N^p.
\end{aligned}$$
For any fixed $t > 0$, when $N$ is large enough so that $\log\eta_N^{-1} > t + 1$, it can be shown that
$$E|\xi_1(z)|^p \le Ce^{-p\log\eta_N^{-1}} \le Ce^{-p(t+1)} \le Ce^{-(t+1)\log N} = CN^{-t-1} = o(N^{-t}).$$
We finish the proof.

Lemma A.5. Suppose $v^2 v_y \ge C_0 N^{-1}$, where $C_0$ is a large constant. For $l \ge 1$, it holds that
$$E|m_{H^{B_1}}(z)|^{2l} \le CE|m_n^H(z)|^{2l}.$$

Proof. Recall that
$$A_j^{-1}(z) - A^{-1}(z) = \beta_j(z)A_j^{-1}(z)r_j r_j^* A_j^{-1}(z).$$
By Lemmas A.4 and B.5, it holds that
$$\begin{aligned}
E|m_{H^{B_1}}(z) - m_n^H(z)|^{2l} &= E\big|x_n^*\big(A_1^{-1}(z) - A^{-1}(z)\big)x_n\big|^{2l}\\
&= E\big|x_n^*\beta_1(z)A_1^{-1}(z)r_1 r_1^* A_1^{-1}(z)x_n\big|^{2l}I(|\beta_1(z)| \le C)\\
&\qquad + E\big|x_n^*\beta_1(z)A_1^{-1}(z)r_1 r_1^* A_1^{-1}(z)x_n\big|^{2l}I(|\beta_1(z)| > C)\\
&\le \frac{C^{2l}}{N^{2l}}E\big|X_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)X_1\big|^{2l} + o(N^{-t})\\
&\le \frac{C^{2l}}{N^{2l}}E\big|X_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)X_1 - x_n^* A_1^{-2}(z)x_n\big|^{2l} + \frac{C^{2l}}{N^{2l}}E\big|x_n^* A_1^{-2}(z)x_n\big|^{2l}\\
&\le \frac{C}{N^{2l}}\Big[E\big(\operatorname{tr}A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n x_n^* A_1^{-1}(\bar z)\big)^l + N^{l-2}E\operatorname{tr}\big(A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n x_n^* A_1^{-1}(\bar z)\big)^l\Big]\\
&\qquad + \frac{C}{N^{2l}}E\big|x_n^* A_1^{-2}(z)x_n\big|^{2l}\\
&\le \frac{C}{N^{l+2}v^{2l}}E|m_{H^{B_1}}(z)|^{2l}.
\end{aligned}$$
The last step follows from the fact that
$$x_n^* A_1^{-1}(z)A_1^{-1}(\bar z)x_n = v^{-1}\Im\big(x_n^* A_1^{-1}(z)x_n\big) = v^{-1}\Im\big(m_{H^{B_1}}(z)\big).$$
Now $v^2 v_y \ge C_0 N^{-1}$ implies that $v \ge CN^{-1/2}$. Choose $C_0$ and $N$ large enough that $\frac{C}{N^{l+2}v^{2l}} \le \frac{1}{2}$. Further, by the $C_r$-inequality, we obtain
$$E|m_{H^{B_1}}(z)|^{2l} \le CE|m_{H^{B_1}}(z) - m_n^H(z)|^{2l} + CE|m_n^H(z)|^{2l} \le \tfrac{1}{2}E|m_{H^{B_1}}(z)|^{2l} + CE|m_n^H(z)|^{2l}.$$
That is, $E|m_{H^{B_1}}(z)|^{2l} \le CE|m_n^H(z)|^{2l}$ for some constant $C$. This finishes the proof.
Lemma A.6. Under the conditions of Theorem 1.6, for $v^2 v_y \ge C_0 N^{-1}$, we have
$$E|m_n^H(z) - Em_n^H(z)|^{2l} \le \frac{C}{N^l v^{2l}}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^{2l}.$$

Proof. Write $E_j(\cdot)$ for the conditional expectation given $\{r_1, \ldots, r_j\}$. It can then be shown that $m_n^H(z) - Em_n^H(z) = \sum_{j=1}^{N}\gamma_j$, where
$$\begin{aligned}
\gamma_j &:= E_j\big(x_n^* A^{-1}(z)x_n\big) - E_{j-1}\big(x_n^* A^{-1}(z)x_n\big)\\
&= (E_j - E_{j-1})\big\{x_n^*\big(A^{-1}(z) - A_j^{-1}(z)\big)x_n\big\}\\
&= -(E_j - E_{j-1})\big\{\beta_j(z)x_n^* A_j^{-1}(z)r_j r_j^* A_j^{-1}(z)x_n\big\}.
\end{aligned}$$
Therefore, by Lemma B.4(b), we have
$$E|m_n^H(z) - Em_n^H(z)|^{2l} \le CE\left(\sum_{j=1}^{N}E_{j-1}|\gamma_j|^2\right)^l + C\sum_{j=1}^{N}E|\gamma_j|^{2l}.$$
Using Lemmas A.4 and B.5, we have
$$\begin{aligned}
E_{j-1}|\gamma_j|^2 &= E_{j-1}\big|(E_j - E_{j-1})\beta_j(z)x_n^* A_j^{-1}(z)r_j r_j^* A_j^{-1}(z)x_n\big|^2\\
&\le \frac{C}{N^2}E_{j-1}\big|X_j^* A_j^{-1}(z)x_n x_n^* A_j^{-1}(z)X_j - x_n^* A_j^{-2}(z)x_n\big|^2 + \frac{C}{N^2}E_{j-1}\big|x_n^* A_j^{-2}(z)x_n\big|^2\\
&\le \frac{C}{N^2}E_{j-1}\operatorname{tr}\big(A_j^{-1}(z)x_n x_n^* A_j^{-1}(z)A_j^{-1}(\bar z)x_n x_n^* A_j^{-1}(\bar z)\big) + \frac{C}{N^2}E_{j-1}\big|x_n^* A_j^{-2}(z)x_n\big|^2.
\end{aligned}$$
By the facts that $x_n^* A_j^{-1}(z)A_j^{-1}(\bar z)x_n = v^{-1}\Im(x_n^* A_j^{-1}(z)x_n)$ and $\|A_j^{-1}(z)\| \le v^{-1}$, we have
$$E_{j-1}|\gamma_j|^2 \le \frac{C}{N^2 v^2}E_{j-1}\big|m_{H^{B_j}}(z)\big|^2.$$
On the other side,
$$\begin{aligned}
E|\gamma_j|^{2l} &= \frac{1}{N^{2l}}E\big|X_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)X_1\big|^{2l}\\
&\le \frac{C^{2l}}{N^{2l}}E\big|X_1^* A_1^{-1}(z)x_n x_n^* A_1^{-1}(z)X_1 - x_n^* A_1^{-2}(z)x_n\big|^{2l} + \frac{C^{2l}}{N^{2l}}E\big|x_n^* A_1^{-2}(z)x_n\big|^{2l}\\
&\le \frac{CN^{l-2}}{N^{2l}v^{2l}}E\big(\Im(x_n^* A_1^{-1}(z)x_n)\big)^{2l} + \frac{C}{N^{2l}}E\big|x_n^* A_1^{-1}(z)x_n\big|^{2l}\\
&\le \frac{C}{N^{l+2}v^{2l}}E\big|m_{H^{B_1}}(z)\big|^{2l}.
\end{aligned}$$
Thus, we obtain
$$E|m_n^H(z) - Em_n^H(z)|^{2l} \le \frac{C}{N^l v^{2l}}E\big|m_{H^{B_1}}(z)\big|^{2l} + \frac{C}{N^{l+1}v^{2l}}E\big|m_{H^{B_1}}(z)\big|^{2l} \le \frac{C}{N^l v^{2l}}E|m_n^H(z)|^{2l} \qquad \text{(by Lemma A.5)}.$$
Further,
$$E|m_n^H(z)|^{2l} \le E|m_n^H(z) - Em_n^H(z)|^{2l} + |Em_n^H(z) - m_y(z)|^{2l} + |m_y(z)|^{2l}.$$
For $v \ge C_0 N^{-1/2}$, choose $C_0$ large enough that $\frac{C}{N^l v^{2l}} \le \frac{1}{2}$. Using integration by parts, it is easy to find that
$$|Em_n^H(z) - m_y(z)| \le \frac{C\Delta^H}{v}, \tag{A.6}$$
where $\Delta^H = \|EH^{S_n} - F_{y_n}\|$. Besides, from Lemma B.7, we know that $|m_y(z)| \le \frac{C}{v_y}$. Therefore, we obtain
$$E|m_n^H(z) - Em_n^H(z)|^{2l} \le \frac{C}{N^l v^{2l}}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^{2l}.$$
The proof is then complete.

Lemma A.7. It holds that
$$E|\alpha_1(z)|^2 \le \frac{C}{N^2 v}\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right).$$

Proof. Lemma B.5 implies that
$$\begin{aligned}
E|\alpha_1(z)|^2 &= \frac{1}{N^2}E\big|X_1^* A_1^{-1}(z)x_n x_n^* X_1 - x_n^* A_1^{-1}(z)x_n\big|^2\\
&\le \frac{C}{N^2}E\big(x_n^* A_1^{-1}(\bar z)A_1^{-1}(z)x_n\big)\\
&\le \frac{C}{N^2 v}\big|Em_{H^{B_1}}(z)\big|.
\end{aligned}$$
Using Lemmas A.5 and A.6 and integration by parts, we have
$$E\big|m_{H^{B_1}}(z)\big|^2 \le E|m_n^H(z)|^2 \le E|m_n^H(z) - Em_n^H(z)|^2 + |Em_n^H(z) - m_y(z)|^2 + |m_y(z)|^2 \le C\left(\frac{\Delta^H}{v_y} + \frac{1}{v}\right)^2.$$
This finishes the proof.

Lemma A.8.
Under the conditions of Theorem 1.1, for any fixed $t > 0$,
$$\int_{B}^{\infty}|EH^{S_n}(x) - F_{y_n}(x)|\, dx = o(N^{-t}).$$

Proof. For any fixed $t > 0$, by Lemma B.12 it follows that
$$P(\lambda_{\max}(S_n) \ge B + x) \le CN^{-t-1}(B + x - \varepsilon)^{-2}$$
and
$$\begin{aligned}
\int_{B}^{\infty}|EH^{S_n}(x) - F_{y_n}(x)|\, dx &\le \int_{B}^{\infty}\big(1 - EH^{S_n}(x)\big)\, dx\\
&= \int_{B}^{\infty}\left(1 - \sum_{i=1}^{N}|y_i|^2 P(\lambda_i \le x)\right)dx\\
&\le \int_{B}^{\infty}\left(\sum_{i=1}^{N}|y_i|^2 - \sum_{i=1}^{N}|y_i|^2 P(\lambda_{\max} \le x)\right)dx\\
&\le \int_{B}^{\infty} N^{-t-1}(B + x - \varepsilon)^{-2}\, dx = o(N^{-t}).
\end{aligned}$$
The proof is complete.

APPENDIX B

In what follows, we present some existing results which are of substantial importance in proving the main theorems.

Lemma B.1 (Theorem 2.2 in [2]). Let $F$ be a distribution function and let $G$ be a function of bounded variation satisfying $\int|F(x) - G(x)|\, dx < \infty$. Denote their Stieltjes transforms by $f(z)$ and $g(z)$, respectively. Then
$$\begin{aligned}
\|F - G\| = \sup_x|F(x) - G(x)| &\le \frac{1}{\pi(1-\kappa)(2\gamma - 1)}\bigg[\int_{-A}^{A}|f(z) - g(z)|\, du + \frac{2\pi}{v}\int_{|x|>B}|F(x) - G(x)|\, dx\\
&\qquad\qquad + \frac{1}{v}\sup_x\int_{|y|\le 2v\tau}|G(x+y) - G(x)|\, dy\bigg],
\end{aligned}$$
where $z = u + iv$ is a complex variable and $\gamma$, $\kappa$, $\tau$, $A$ and $B$ are positive constants such that $A > B$,
$$\kappa = \frac{4B}{\pi(A-B)(2\gamma - 1)} < 1 \qquad \text{and} \qquad \gamma = \frac{1}{\pi}\int_{|u|<\tau}\frac{1}{u^2+1}\, du > \frac{1}{2}.$$

Lemma B.2. For any $v > 0$, we have
$$\sup_x\int_{|u|<v}|F_{y_n}(x+u) - F_{y_n}(x)|\, du < \frac{11\sqrt{2(1+y_n)}}{3\pi y_n}\, v^2/v_y.$$

Lemma B.4 (Burkholder inequality). Let $\{X_k\}$ be a complex martingale difference sequence with respect to the increasing $\sigma$-fields $\{\mathcal{F}_k\}$, and let $E_{k-1}$ denote conditional expectation given $\mathcal{F}_{k-1}$. Then

(a) for $p > 1$,
$$E\left|\sum_{k=1}^{n}X_k\right|^p \le K_p E\left(\sum_{k=1}^{n}|X_k|^2\right)^{p/2};$$

(b) for $p \ge 2$,
$$E\left|\sum_{k=1}^{n}X_k\right|^p \le K_p\left(E\left(\sum_{k=1}^{n}E_{k-1}|X_k|^2\right)^{p/2} + \sum_{k=1}^{n}E|X_k|^p\right),$$

where $K_p$ is a constant which depends on $p$ only.
Lemma B.5 (Lemma 2.7 in [5]). Let $A = (a_{ij})$ be an $n\times n$ nonrandom matrix and $X = (X_1, \ldots, X_n)^*$ a random vector of independent complex entries. Assume that $EX_i = 0$, $E|X_i|^2 = 1$ and $E|X_i|^l \le V_l$. Then for any $p \ge 2$,
$$E|X^* A X - \operatorname{tr}A|^p \le K_p\big(\big(V_4\operatorname{tr}(AA^*)\big)^{p/2} + V_{2p}\operatorname{tr}(AA^*)^{p/2}\big),$$
where $K_p$ is a constant depending on $p$ only.
Lemma B.6 (Lemma 2.6 in [24]). Let $z \in \mathbb{C}^+$ with $v = \Im(z)$, let $A$ and $B$ be $n\times n$ matrices with $B$ Hermitian, and let $\tau \in \mathbb{R}$ and $q \in \mathbb{C}^n$. Then
$$\big|\operatorname{tr}\big((B - zI_n)^{-1} - (B + \tau qq^* - zI_n)^{-1}\big)A\big| \le \frac{\|A\|}{v},$$
where $\|A\|$ denotes the spectral norm of $A$.

Lemma B.7 ((8.4.9) in [7]). For the Stieltjes transform of the Marčenko–Pastur distribution, we have
$$|m_y(z)| \le \frac{2}{\sqrt{y}\,v_y},$$
where $v_y = \sqrt{a} + \sqrt{v} = 1 - \sqrt{y_n} + \sqrt{v}$.
Lemma B.8 (Lemma 8.20 in [7]). If $|z| < A$, $v^2 v_y \ge C_0 N^{-1}$ and $l \ge 1$, then
$$E|m_n(z) - Em_n(z)|^{2l} \le \frac{C}{N^{2l}v^{4l}y_n^{2l}}\left(\Delta + \frac{v}{v_y}\right)^{l},$$
where $A$ is a positive constant, $v_y = 1 - \sqrt{y_n} + \sqrt{v}$ and $\Delta := \|EF^{S_n} - F_{y_n}\|$.

Lemma B.9 (Lemma 9.1 in [7]). Suppose that $X_i$, $i = 1, \ldots, n$, are independent, with $EX_i = 0$, $E|X_i|^2 = 1$, $\sup E|X_i|^4 = \nu < \infty$ and $|X_i| \le \eta\sqrt{n}$ with $\eta > 0$. Assume that $A$ is a complex matrix. Then for any given $p$ such that $2 \le p \le b\log(n\nu^{-1}\eta^4)$ and $b > 1$, we have
$$E|\alpha^* A\alpha - \operatorname{tr}(A)|^p \le \nu n^p (n\eta^4)^{-1}\big(40b^2\|A\|\eta^2\big)^p,$$
where $\alpha = (X_1, \ldots, X_n)^T$.
Lemma B.10 (Theorem 8.10 in [7]). Let $S_n = XX^*/N$, where $X = (X_{ij}(n))_{n\times N}$. Assume that the following conditions hold:
(1) for each $n$, the $X_{ij}(n)$ are independent;
(2) $EX_{ij}(n) = 0$, $E|X_{ij}(n)|^2 = 1$ for all $i, j$;
(3) $\sup_n\sup_{i,j}E|X_{ij}(n)|^6 < \infty$.
Then we have
$$\Delta := \|EF^{S_n} - F_{y_n}\| = \begin{cases} O(N^{-1/2}a^{-1}), & \text{if } a > N^{-1/3},\\ O(N^{-1/6}), & \text{otherwise}, \end{cases}$$
where $y_n = n/N \le 1$ and $a$ is as defined in the Marčenko–Pastur distribution.

Lemma B.11 (Theorem 5.11 in [7]). Assume that the entries of $\{X_{ij}\}$ form a double array of i.i.d. complex random variables with mean zero, variance $\sigma^2$ and finite 4th moment. Let $X = (X_{ij})_{n\times N}$ be the $n\times N$ matrix at the upper-left corner of the double array. If $n/N \to y \in (0,1)$, then, with probability one, we have
$$\lim_{n\to\infty}\lambda_{\min}(S_n) = \sigma^2(1-\sqrt{y})^2 \qquad \text{and} \qquad \lim_{n\to\infty}\lambda_{\max}(S_n) = \sigma^2(1+\sqrt{y})^2.$$
Lemma B.12 (Theorem 5.9 in [7]). Suppose that the entries of the matrix $X = (X_{ij})_{n\times N}$ are independent (not necessarily identically distributed) and satisfy:
(1) $EX_{ij} = 0$;
(2) $|X_{ij}| \le \sqrt{N}\delta_N$;
(3) $\max_{ij}\big|E|X_{ij}|^2 - \sigma^2\big| \to 0$ as $N \to \infty$; and
(4) $E|X_{ij}|^l \le b(\sqrt{N}\delta_N)^{l-3}$ for all $l \ge 3$,
where $\delta_N \to 0$ and $b > 0$. Let $S_n = XX^*/N$. Then, for any $x > \varepsilon > 0$, $n/N \to y$, and fixed integer $\ell \ge 2$, we have
$$P\big(\lambda_{\max}(S_n) \ge \sigma^2(1+\sqrt{y})^2 + x\big) \le CN^{-\ell}\big(\sigma^2(1+\sqrt{y})^2 + x - \varepsilon\big)^{-\ell}$$
for some constant $C > 0$.
APPENDIX C

Note that the data matrix $X = (X_{ij})_{n\times N}$ consists of i.i.d. complex random variables with mean 0 and variance 1. In what follows, we will further assume that every $|X_{ij}|$ is bounded by $\eta_N N^{1/4}$ for some carefully selected $\eta_N$. The proofs presented in the following three steps jointly justify this convenient assumption.

C.1. Truncation for Theorem 1.1. Choose $\eta_N \downarrow 0$ and $\eta_N N^{1/4} \uparrow \infty$ as $N \to \infty$ such that
$$\lim_{N\to\infty}\eta_N^{-10}E|X_{11}|^{10}I(|X_{11}| > \eta_N N^{1/4}) = 0.$$
Let $\widehat X_n$ denote the truncated data matrix whose entry in the $i$th row and $j$th column is $X_{ij}I(|X_{ij}| \le \eta_N N^{1/4})$, $i = 1, \ldots, n$, $j = 1, \ldots, N$. Define $\widehat S_n = \widehat X_n\widehat X_n^*/N$. Then
$$P(S_n \ne \widehat S_n) \le nNP(|X_{ij}| > \eta_N N^{1/4}) \le nN^{-3/2}\eta_N^{-10}E|X_{11}|^{10}I(|X_{11}| > \eta_N N^{1/4}) = o(N^{-1/2}).$$
C.2. Truncation for Theorems 1.6 and 1.8. Choose $\eta_N \downarrow 0$ and $\eta_N N^{1/4} \uparrow \infty$ as $N \to \infty$ such that
$$\lim_{N\to\infty}\eta_N^{-8}E|X_{11}|^8 I(|X_{11}| > \eta_N N^{1/4}) = 0. \tag{C.1}$$
Let $\widehat X_n$ denote the truncated data matrix whose entry in the $i$th row and $j$th column is $X_{ij}I(|X_{ij}| \le \eta_N N^{1/4})$, $i = 1, \ldots, n$, $j = 1, \ldots, N$. Define $\widehat S_n = \widehat X_n\widehat X_n^*/N$. Then
$$\begin{aligned}
P(S_n \ne \widehat S_n, \text{ i.o.}) &= \lim_{k\to\infty}P\left(\bigcup_{N=k}^{\infty}\bigcup_{i=1}^{n}\bigcup_{j=1}^{N}\big\{|X_{ij}| > \eta_N N^{1/4}\big\}\right)\\
&= \lim_{k\to\infty}P\left(\bigcup_{t=k}^{\infty}\bigcup_{N\in[2^t,2^{t+1})}\bigcup_{i=1}^{n}\bigcup_{j=1}^{N}\big\{|X_{ij}| > \eta_N N^{1/4}\big\}\right)\\
&\le \lim_{k\to\infty}\sum_{t=k}^{\infty}P\left(\bigcup_{i=1}^{(y_n+1)2^{t+1}}\bigcup_{j=1}^{2^{t+1}}\big\{|X_{ij}| > \eta_{2^t}2^{t/4}\big\}\right)\\
&\le C\lim_{k\to\infty}\sum_{t=k}^{\infty}(2^{t+1})^2 P\big(|X_{11}| > \eta_{2^t}2^{t/4}\big)\\
&\le C\lim_{k\to\infty}\sum_{t=k}^{\infty}\sum_{l=t}^{\infty}4^t P\big(\eta_{2^l}2^{l/4} < |X_{11}| \le \eta_{2^{l+1}}2^{(l+1)/4}\big)\\
&= C\lim_{k\to\infty}\sum_{l=k}^{\infty}\sum_{t=k}^{l}4^t P\big(\eta_{2^l}2^{l/4} < |X_{11}| \le \eta_{2^{l+1}}2^{(l+1)/4}\big)\\
&\le \lim_{k\to\infty}\sum_{l=k}^{\infty}C\eta_{2^l}^{-8}E|X_{11}|^8 I\big(\eta_{2^l}2^{l/4} < |X_{11}| \le \eta_{2^{l+1}}2^{(l+1)/4}\big)\\
&= 0.
\end{aligned}$$
The last equality is due to (C.1).

C.3. Centralization. The centralization procedures for the three theorems are identical; only the 8th moment is required, and thus we treat them uniformly. Let $\widetilde X_n$ denote the centralized version of $\widehat X_n$. More explicitly, the entry in the $i$th row and $j$th column of $\widetilde X_n$ is
$$X_{ij}I(|X_{ij}| \le \eta_N N^{1/4}) - E\big(X_{ij}I(|X_{ij}| \le \eta_N N^{1/4})\big).$$
Notice that, according to Theorem 3.1 of [26], $\|(S_n - zI_n)^{-1}\|$ is bounded by $1/v$, where $\|\cdot\|$ denotes the spectral norm of a matrix. Define $\widetilde S_n = \widetilde X_n\widetilde X_n^*/N$. Suppose that $v \ge C_0 N^{-1/2}$; we obtain
$$\begin{aligned}
|m_{H^{\widehat S_n}}(z) - m_{H^{\widetilde S_n}}(z)| &= \big|x_n^*(\widehat S_n - zI_n)^{-1}x_n - x_n^*(\widetilde S_n - zI_n)^{-1}x_n\big|\\
&\le \big\|(\widehat S_n - zI_n)^{-1}\big\|\,\big\|\widehat S_n - \widetilde S_n\big\|\,\big\|(\widetilde S_n - zI_n)^{-1}\big\|\\
&\le \frac{1}{v^2}\big\|\widehat S_n - \widetilde S_n\big\|\\
&\le \frac{1}{Nv^2}\big(\|\widehat X_n\|\|\widehat X_n^* - \widetilde X_n^*\| + \|\widehat X_n - \widetilde X_n\|\|\widetilde X_n^*\|\big)\\
&\le \frac{C}{\sqrt{N}v^2}\|\widehat X_n - \widetilde X_n\| \qquad \text{a.s., by Lemma B.12}\\
&= \frac{C}{\sqrt{N}v^2}\big|E\big\{X_{11}I(|X_{11}| \le \eta_N N^{1/4})\big\}\big|\,\|1_{n\times 1}\|\,\|1'_{N\times 1}\|\\
&\le C\sqrt{N}v^{-2}\eta_N^{-7}N^{-7/4}E\big(|X_{11}|^8 I(|X_{11}| > \eta_N N^{1/4})\big)\\
&= o(N^{-1/4}).
\end{aligned}$$
To establish both the weak and the strong convergence rates of the VESD to the Marčenko–Pastur distribution, this $o(N^{-1/4})$ suffices. Moreover, for the convergence rate presented in Theorem 1.1, we shall prove the following. Let $m_y(z)$ denote the Stieltjes transform of the Marčenko–Pastur distribution; thus $m_y(z)$ is bounded by $\frac{2}{\sqrt{y_n}(1-\sqrt{y_n}+\sqrt{v})} \le \frac{C}{\sqrt{v}}$ by Lemma B.7. Then
$$|Em_n^H(z)| \le |Em_n^H(z) - m_y(z)| + |m_y(z)| \le C|m_y(z)| \le \frac{C}{\sqrt{v}}.$$
Besides, $x_n^*(\widehat S_n - zI_n)^{-1}x_n$ can be considered as the Stieltjes transform of some VESD function. So we have
$$E\big\|x_n^*(\widehat S_n - zI_n)^{-1}\big\|^2 = v^{-1}E\Im\big(x_n^*(\widehat S_n - zI_n)^{-1}x_n\big) \le \frac{C}{v^{3/2}}.$$
Thus,
$$\begin{aligned}
E|m_{H^{\widehat S_n}}(z) - m_{H^{\widetilde S_n}}(z)| &\le E\big\|\widehat S_n - \widetilde S_n\big\|\,\big\|x_n^*(\widehat S_n - zI_n)^{-1}\big\|\,\big\|(\widetilde S_n - zI_n)^{-1}x_n\big\|\\
&\le C\sqrt{N}\eta_N^{-7}N^{-7/4}E\big(|X_{11}|^8 I(|X_{11}| \ge \eta_N N^{1/4})\big)\big(E\big\|x_n^*(\widehat S_n - zI)^{-1}\big\|^2\big)^{1/2}\big(E\big\|(\widetilde S_n - zI)^{-1}x_n\big\|^2\big)^{1/2}\\
&\le C\sqrt{N}v^{-3/2}\eta_N^{-7}N^{-7/4}E\big(|X_{11}|^8 I(|X_{11}| \ge \eta_N N^{1/4})\big)\\
&= o(N^{-1/2}).
\end{aligned}$$
C.4. Rescaling. The rescaling procedures for the three theorems are exactly the same, and only the 8th moment is required; thus we treat them uniformly. Write $Y_n = \widetilde X_n/\sigma_1$, where
$$\sigma_1^2 = E\big|X_{11}I(|X_{11}| \le \eta_N N^{1/4}) - E\big(X_{11}I(|X_{11}| \le \eta_N N^{1/4})\big)\big|^2.$$
Notice that $\sigma_1$ tends to 1 as $N$ goes to $\infty$. Define $G_n = Y_n Y_n^*/N$, which is the sample covariance matrix of $Y_n$. We shall show that $G_n$ and $S_n$ are asymptotically equivalent, that is, the VESDs of $G_n$ and $S_n$ have the same limit if either one exists. For $v \ge C_0 N^{-1/2}$,
$$\begin{aligned}
|m_{H^{G_n}}(z) - m_{H^{\widetilde S_n}}(z)| &= \big|x_n^*(\widetilde S_n - zI_n)^{-1}(\widetilde S_n - G_n)(G_n - zI_n)^{-1}x_n\big|\\
&\le \frac{1}{v^2}\big\|(1 - \sigma_1^{-2})\widetilde S_n\big\|\\
&\le \frac{C}{v^2}(1 - \sigma_1^2) \qquad \text{a.s. (see Lemma B.12)}\\
&\le Cv^{-2}\eta_N^{-6}N^{-3/2}E\big(|X_{11}|^8 I(|X_{11}| > \eta_N N^{1/4})\big)\\
&= o(N^{-1/2}) \qquad \text{a.s.}
\end{aligned}$$
Hence, we shall without loss of generality assume that every $|X_{ij}|$ is bounded by $\eta_N N^{1/4}$, and that every $X_{ij}$ has mean 0 and variance 1.

REFERENCES

[1] Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Ann. Math. Statist. 34 122–148. MR0145620
[2] Bai, Z. D. (1993). Convergence rate of expected spectral distributions of large random matrices. I. Wigner matrices. Ann. Probab. 21 625–648. MR1217559
[3] Bai, Z. D. (1993). Convergence rate of expected spectral distributions of large random matrices. II. Sample covariance matrices. Ann. Probab. 21 649–672. MR1217560
[4] Bai, Z. D., Miao, B. Q. and Pan, G. M. (2007). On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab. 35 1532–1572. MR2330979
[5] Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26 316–345. MR1617051
[6] Bai, Z. D. and Silverstein, J. W. (2004). CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32 553–605. MR2040792
[7] Bai, Z. D. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, 2nd ed. Springer, New York. MR2567175
[8] Bai, Z. D. and Xia, N. (2013). Functional CLT of eigenvectors for large sample covariance matrices. Statist. Papers. To appear.
[9] Cai, T., Ma, Z. M. and Wu, Y. H. (2013). Sparse PCA: Optimal rates and adaptive estimation. Available at arXiv:1211.1309.
[10] Erdős, L., Schlein, B. and Yau, H.-T. (2009). Semicircle law on short scales and delocalization of eigenvectors for Wigner random matrices. Ann. Probab. 37 815–852. MR2537522
[11] Götze, F. and Tikhomirov, A. (2004). Rate of convergence in probability to the Marchenko–Pastur law. Bernoulli 10 503–548. MR2061442
[12] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327. MR1863961
[13] Knowles, A. and Yin, J. (2013). Eigenvector distribution of Wigner matrices. Probab. Theory Related Fields 155 543–582. MR3034787
[14] Loève, M. (1977). Probability Theory. I, 4th ed. Graduate Texts in Mathematics 45. Springer, New York. MR0651017
[15] Ma, Z. M. (2013). Sparse principal component analysis and iterative thresholding. Ann. Statist. 41 772–801.
[16] Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sb. 1 457–483.
[17] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642. MR2399865
[18] Pillai, N. S. and Yin, J. (2013). Universality of covariance matrices. Available at arXiv:1110.2501v6.
[19] Silverstein, J. W. (1981). Describing the behavior of eigenvectors of random matrices using sequences of measures on orthogonal groups. SIAM J. Math. Anal. 12 274–281. MR0605435
[20] Silverstein, J. W. (1984). Some limit theorems on the eigenvectors of large-dimensional sample covariance matrices. J. Multivariate Anal. 15 295–324. MR0768500
[21] Silverstein, J. W. (1989). On the eigenvectors of large-dimensional sample covariance matrices. J. Multivariate Anal. 30 1–16. MR1003705
[22] Silverstein, J. W. (1990). Weak convergence of random functions defined by the eigenvectors of sample covariance matrices. Ann. Probab. 18 1174–1194. MR1062064
[23] Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large-dimensional random matrices. J. Multivariate Anal. 55 331–339. MR1370408
[24] Silverstein, J. W. and Bai, Z. D. (1995). On the empirical distribution of eigenvalues of a class of large-dimensional random matrices. J. Multivariate Anal. 54 175–192. MR1345534
[25] Tao, T. and Vu, V. (2011). Universal properties of eigenvectors. Available at arXiv:1103.2801v2.
[26] Yin, Y. Q., Bai, Z. D. and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix. Probab. Theory Related Fields 78 509–521. MR0950344

N. Xia, Z. D. Bai
KLAS and School of Mathematics and Statistics
Northeast Normal University
Changchun 130024, China
and
Department of Statistics and Applied Probability
National University of Singapore
Singapore 117546, Singapore
E-mail: [email protected]; [email protected]

Y. Qin
Department of Statistics and Actuarial Science
University of Waterloo
Waterloo, ON N2L 3G1, Canada
E-mail: [email protected]