Rate-Optimal Perturbation Bounds for Singular Subspaces
with Applications to High-Dimensional Statistics
T. Tony Cai∗ and Anru Zhang

May 3, 2016
Abstract

Perturbation bounds for singular spaces, in particular Wedin's sin Θ theorem, are a fundamental tool in many fields including high-dimensional statistics, machine learning, and applied mathematics. In this paper, we establish separate perturbation bounds, measured in both spectral and Frobenius sin Θ distances, for the left and right singular subspaces. Lower bounds, which show that the individual perturbation bounds are rate-optimal, are also given. The new perturbation bounds are applicable to a wide range of problems. In this paper, we consider in detail applications to low-rank matrix denoising and singular space estimation, high-dimensional clustering, and canonical correlation analysis (CCA). In particular, separate matching upper and lower bounds are obtained for estimating the left and right singular spaces. To the best of our knowledge, this is the first result that gives different optimal rates for the left and right singular spaces under the same perturbation. In addition to these problems, applications to other high-dimensional problems such as community detection in bipartite networks, multidimensional scaling, and cross-covariance matrix estimation are also discussed.
Keywords: canonical correlation analysis, clustering, high-dimensional statistics, low-rank matrix denoising, perturbation bound, singular value decomposition, sin Θ distances, spectral method.
∗T. Tony Cai is Dorothy Silberberg Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA (E-mail: [email protected]); Anru Zhang is Assistant Professor, University of Wisconsin-Madison, Madison, WI (E-mail: [email protected]). The research of Tony Cai was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334.
1 Introduction
Singular value decomposition (SVD) and spectral methods have been widely used in statistics, probability, machine learning, and applied mathematics, as well as in many applications. Examples include low-rank matrix denoising (Shabalin and Nobel, 2013; Yang et al., 2014; Donoho and Gavish, 2014), matrix completion (Candès and Recht, 2009; Candès and Tao, 2010; Keshavan et al., 2010; Gross, 2011; Chatterjee, 2014), principal component analysis (Anderson, 2003; Johnstone and Lu, 2009; Cai et al., 2013), canonical correlation analysis (Hotelling, 1936; Hardoon et al., 2004; Gao et al., 2014, 2015), and community detection (von Luxburg et al., 2008; Rohe et al., 2011; Balakrishnan et al., 2011; Lei and Rinaldo, 2015). Specific applications include collaborative filtering (the Netflix problem) (Goldberg et al., 1992), multi-task learning (Argyriou et al., 2008), system identification (Liu and Vandenberghe, 2009), and sensor localization (Singer and Cucuringu, 2010; Candès and Plan, 2010), among many others. In addition, the SVD is often used to find a "warm start" for more delicate iterative algorithms; see, e.g., Cai et al. (2015b); Sun and Luo (2015).

Perturbation bounds, which concern how the spectrum changes after a small perturbation to a matrix, often play a critical role in the analysis of the SVD and spectral methods. To be more specific, for an approximately low-rank matrix X and a perturbation matrix Z, it is crucial in many applications to understand how much the left or right singular spaces of X and X + Z differ from each other. This problem has been widely studied in the literature (Davis and Kahan, 1970; Wedin, 1972; Weyl, 1912; Stewart, 1991, 2006; Yu et al., 2015). Among these results, the sin Θ theorems, established by Davis and Kahan (1970) and Wedin (1972), have become fundamental tools and are commonly used in applications. While Davis and Kahan (1970) focused on eigenvectors of symmetric matrices, Wedin's sin Θ theorem studies the more general singular vectors of asymmetric matrices and provides a uniform perturbation bound for both the left and right singular spaces in terms of the singular value gap and the perturbation level. Several generalizations and extensions have been made in different settings after the seminal work of Wedin (1972). For example, Vu (2011), Shabalin and Nobel (2013), O'Rourke et al. (2013), and Wang (2015) considered the rotations of singular vectors after random perturbations; Fan et al. (2016) gave an ℓ∞ eigenvector perturbation bound and used the result for robust covariance estimation. See also Dopico (2000); Stewart (2006).

Despite its wide applicability, Wedin's perturbation bound is not sufficiently precise for some analyses, as the bound is uniform over both the left and right singular spaces. It clearly leads to sub-optimal bounds if the left and right singular spaces change by different orders of magnitude after the perturbation. In a range of applications, especially when the row and column dimensions of the matrix differ significantly, it is even possible that one side of the singular space can be accurately recovered while the other side cannot. The numerical
experiment given in Section 2.3 provides a good illustration of this point: the left and right singular perturbation bounds behave distinctly when the row and column dimensions are significantly different. Furthermore, for a range of applications, the primary interest lies in only one of the singular spaces. For example, in the analysis of bipartite network data, such as the Facebook user-public-page-subscription network, the interest is often focused on grouping the public pages (or grouping the users). This is the case for many clustering problems. See Section 4 for further discussions.

In this paper, we establish separate perturbation bounds for the left and right singular subspaces. The bounds are measured in both the spectral and Frobenius sin Θ distances, which are equivalent to several widely used losses in the literature. We also derive lower bounds that are within a constant factor of the corresponding upper bounds. These results together show that the obtained perturbation bounds are rate-optimal. The newly established perturbation bounds are applicable to a wide range of problems in high-dimensional statistics. In this paper, we discuss in detail the applications of the perturbation bounds to the following high-dimensional statistical problems.

1. Low-rank matrix denoising and singular space estimation: Suppose one observes a low-rank matrix with random noise and wishes to estimate the mean matrix or its left or right singular spaces. Such a problem arises in many applications. We apply the obtained perturbation bounds to study this problem. Separate matching upper and lower bounds are given for estimating the left and right singular spaces. These results together establish the optimal rates of convergence. Our analysis reveals an interesting phenomenon: in some settings it is possible to accurately estimate the left singular space but not the right one, and vice versa. To the best of our knowledge, this is the first result that gives different optimal rates for the left and right singular spaces under the same perturbation. Another fact we observe is that, for a certain class of low-rank matrices, one can stably recover the original matrix if and only if one can accurately recover both its left and right singular spaces.

2. High-dimensional clustering: Unsupervised learning is an important problem in statistics and machine learning with a wide range of applications. We apply the perturbation bounds to the analysis of clustering high-dimensional Gaussian mixtures. In particular, in a high-dimensional two-class clustering setting, we propose a simple PCA-based clustering method and use the obtained perturbation bounds to prove matching upper and lower bounds for the misclassification rate.

3. Canonical correlation analysis (CCA): CCA is a commonly used tool in multivariate analysis to identify and measure the associations between two sets of random variables. The perturbation bounds are also applied to analyze CCA. Specifically, we
develop sharper upper bounds for estimating the left and right canonical correlation directions. To the best of our knowledge, this is the first result that captures the phenomenon that in some settings it is possible to accurately estimate one side of canonical correlation directions but not the other side. In addition to these applications, the perturbation bounds can also be applied to the analysis of community detection in bipartite networks, multidimensional scaling, cross-covariance matrix estimation, and singular space estimation for matrix completion, and other problems to yield better results than what are known in the literature. These applications demonstrate the usefulness of the newly established perturbation bounds. The rest of the paper is organized as follows. In Section 2, after basic notation and definitions are introduced, the perturbation bounds are presented separately for the left and right singular subspaces. Both the upper bounds and lower bounds are provided. We then apply the newly established perturbation bounds to low-rank matrix denoising and singular space estimation, high-dimensional clustering, and canonical correlation analysis in Sections 3–5. Other potential applications are briefly discussed in Section 6. The main theorems are proved in Section 7 and the proofs of some additional technical results are given in the Appendix.
2 Rate-Optimal Perturbation Bounds for Singular Subspaces
We establish in this section rate-optimal perturbation bounds for singular subspaces. We begin with basic notation and definitions that will be used in the rest of the paper.
2.1 Notation and Definitions
For a, b ∈ R, let a ∧ b = min(a, b) and a ∨ b = max(a, b). Let O_{p,r} = {V ∈ R^{p×r} : V^⊤V = I_r} be the set of all p × r matrices with orthonormal columns, and write O_p for O_{p,p}, the set of p-dimensional orthogonal matrices. For a matrix A ∈ R^{p1×p2}, write the SVD as A = UΣV^⊤, where Σ = diag{σ1(A), σ2(A), · · · } with the singular values σ1(A) ≥ σ2(A) ≥ · · · ≥ 0 in descending order. In particular, we use σ_min(A) = σ_{min(p1,p2)}(A) and σ_max(A) = σ1(A) for the smallest and largest non-trivial singular values of A. Several matrix norms will be used in the paper: ‖A‖ = σ1(A) is the spectral norm, ‖A‖_F = (∑_i σ_i²(A))^{1/2} is the Frobenius norm, and ‖A‖_* = ∑_i σ_i(A) is the nuclear norm. We denote by P_A ∈ R^{p1×p1} the projection operator onto the column space of A, which can be written as P_A = A(A^⊤A)†A^⊤, where (·)† represents the Moore–Penrose pseudo-inverse. Given the SVD A = UΣV^⊤ with Σ non-singular, a simpler form is P_A = UU^⊤. We adopt the R convention to denote submatrices: A_{[a:b, c:d]} represents the a-to-b-th rows and c-to-d-th columns of the matrix A; we also use A_{[a:b, :]} and A_{[:, c:d]} to represent the a-to-b-th full rows and the c-to-d-th full columns of A, respectively. We
use C, C0, c, c0, · · · to denote generic constants, whose actual values may vary from time to time. We use the sin Θ distance to measure the difference between two p × r matrices with orthonormal columns, V and V̂. Suppose the singular values of V^⊤V̂ are σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0. We call Θ(V, V̂) = diag(cos⁻¹(σ1), cos⁻¹(σ2), · · · , cos⁻¹(σr)) the matrix of principal angles. A quantitative measure of the distance between the column spaces of V and V̂ is then ‖sin Θ(V̂, V)‖ or ‖sin Θ(V̂, V)‖_F. Some more convenient characterizations and properties of the sin Θ distances are given in Lemma 1 in Section 7.1.
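To make the definition concrete, the following short Python sketch (an illustration only, not part of the paper; numpy is assumed) computes both sin Θ distances from the cosines of the principal angles, i.e., the singular values of V^⊤V̂.

```python
import numpy as np

def sin_theta_distances(V, V_hat):
    """Spectral and Frobenius sin-Theta distances between the column
    spaces of V and V_hat (both p x r with orthonormal columns)."""
    # Singular values of V^T V_hat are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(V.T @ V_hat, compute_uv=False), 0.0, 1.0)
    sines = np.sqrt(1.0 - cosines ** 2)
    return sines.max(), np.sqrt((sines ** 2).sum())  # (spectral, Frobenius)

# Small usage example: two random 3-dimensional subspaces of R^50.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((50, 3)))
V_hat, _ = np.linalg.qr(rng.standard_normal((50, 3)))
print(sin_theta_distances(V, V_hat))
```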
2.2 Perturbation Upper Bounds and Lower Bounds
We are now ready to present the perturbation bounds for the singular subspaces. Let X ∈ R^{p1×p2} be an approximately low-rank matrix and let Z ∈ R^{p1×p2} be a "small" perturbation matrix. Our goal is to provide separate and rate-sharp bounds for the sin Θ distances between the left singular subspaces of X and X + Z and between the right singular subspaces of X and X + Z. Suppose X is approximately rank-r with the SVD X = UΣV^⊤, where a significant gap exists between σr(X) and σ_{r+1}(X). The leading r left and right singular vectors of X are of particular interest. We decompose X as follows,
\[
X = \begin{bmatrix} U & U_\perp \end{bmatrix} \cdot \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \cdot \begin{bmatrix} V^\top \\ V_\perp^\top \end{bmatrix}, \qquad (1)
\]
where U ∈ O_{p1,r}, V ∈ O_{p2,r}, Σ1 = diag(σ1(X), · · · , σr(X)) ∈ R^{r×r}, Σ2 = diag(σ_{r+1}(X), · · · ) ∈ R^{(p1−r)×(p2−r)}, and [U U⊥] ∈ O_{p1}, [V V⊥] ∈ O_{p2} are orthogonal matrices.

Let Z be a perturbation matrix and let X̂ = X + Z. Partition the SVD of X̂ in the same way as in (1),
\[
\hat X = X + Z = \begin{bmatrix} \hat U & \hat U_\perp \end{bmatrix} \cdot \begin{bmatrix} \hat\Sigma_1 & 0 \\ 0 & \hat\Sigma_2 \end{bmatrix} \cdot \begin{bmatrix} \hat V^\top \\ \hat V_\perp^\top \end{bmatrix}, \qquad (2)
\]
where Û, Û⊥, Σ̂1, Σ̂2, V̂, and V̂⊥ have the same structures as U, U⊥, Σ1, Σ2, V, and V⊥. Decompose the perturbation Z into four blocks,
\[
Z = Z_{11} + Z_{12} + Z_{21} + Z_{22}, \qquad (3)
\]
where
\[
Z_{11} = P_U Z P_V, \qquad Z_{12} = P_U Z P_{V_\perp}, \qquad Z_{21} = P_{U_\perp} Z P_V, \qquad Z_{22} = P_{U_\perp} Z P_{V_\perp}.
\]
Define z_{ij} := ‖Z_{ij}‖ for i, j = 1, 2.
Theorem 1 below provides separate perturbation bounds for the left and right singular subspaces in terms of both spectral and Frobenius sin Θ distances.

Theorem 1 (Perturbation bounds for singular subspaces). Let X, X̂, Z be given as in (1)-(3). Denote
\[
\alpha := \sigma_{\min}(U^\top \hat X V) \qquad \text{and} \qquad \beta := \|U_\perp^\top \hat X V_\perp\|.
\]
If α² > β² + z12² ∧ z21², then
\[
\|\sin\Theta(V,\hat V)\| \le \frac{\alpha z_{12} + \beta z_{21}}{\alpha^2 - \beta^2 - z_{12}^2 \wedge z_{21}^2} \wedge 1,
\qquad
\|\sin\Theta(V,\hat V)\|_F \le \frac{\alpha \|Z_{12}\|_F + \beta \|Z_{21}\|_F}{\alpha^2 - \beta^2 - z_{12}^2 \wedge z_{21}^2} \wedge \sqrt{r}, \qquad (4)
\]
\[
\|\sin\Theta(U,\hat U)\| \le \frac{\alpha z_{21} + \beta z_{12}}{\alpha^2 - \beta^2 - z_{12}^2 \wedge z_{21}^2} \wedge 1,
\qquad
\|\sin\Theta(U,\hat U)\|_F \le \frac{\alpha \|Z_{21}\|_F + \beta \|Z_{12}\|_F}{\alpha^2 - \beta^2 - z_{12}^2 \wedge z_{21}^2} \wedge \sqrt{r}. \qquad (5)
\]
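All the quantities appearing in Theorem 1 are easy to evaluate numerically. The sketch below (illustrative only; numpy assumed, and the instance is a random example chosen here, not one from the paper) draws a rank-r signal plus Gaussian noise, computes α, β, z12, z21, and compares the spectral bounds in (4)-(5) with the actual sin Θ errors.

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, r, t = 400, 40, 3, 60.0

def orth_complement(M):
    # Orthonormal basis of the orthogonal complement of col(M) (M has full column rank).
    return np.linalg.qr(M, mode='complete')[0][:, M.shape[1]:]

def sin_theta(A, B):
    # Spectral sin-Theta distance between the column spaces of A and B.
    return np.sqrt(1 - np.linalg.svd(A.T @ B, compute_uv=False).min() ** 2)

# Rank-r signal X = t * U V^T plus an i.i.d. Gaussian perturbation Z.
U = np.linalg.qr(rng.standard_normal((p1, r)))[0]
V = np.linalg.qr(rng.standard_normal((p2, r)))[0]
X = t * U @ V.T
Z = rng.standard_normal((p1, p2))
X_hat = X + Z
U_perp, V_perp = orth_complement(U), orth_complement(V)

# Quantities appearing in Theorem 1.
alpha = np.linalg.svd(U.T @ X_hat @ V, compute_uv=False).min()
beta = np.linalg.norm(U_perp.T @ X_hat @ V_perp, 2)
z12 = np.linalg.norm(U.T @ Z @ V_perp, 2)
z21 = np.linalg.norm(U_perp.T @ Z @ V, 2)
denom = alpha ** 2 - beta ** 2 - min(z12, z21) ** 2

# Top-r singular subspaces of the perturbed matrix.
Uh, _, Vht = np.linalg.svd(X_hat, full_matrices=False)
U_hat, V_hat = Uh[:, :r], Vht[:r, :].T

print("bound (4) on V:", min((alpha * z12 + beta * z21) / denom, 1.0),
      " actual:", sin_theta(V, V_hat))
print("bound (5) on U:", min((alpha * z21 + beta * z12) / denom, 1.0),
      " actual:", sin_theta(U, U_hat))
```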
One can see the respective effects of the perturbation on the left and right singular spaces. In particular, if z12 ≥ z21 (which is typically the case when p2 ≫ p1), then Theorem 1 gives a sharper bound for ‖sin Θ(U, Û)‖ than for ‖sin Θ(V, V̂)‖.

Remark 1. Consider the setting where X ∈ R^{p1×p2} is a fixed rank-r matrix with r ≤ p1 ∧ p2, and Z ∈ R^{p1×p2} is a random matrix with i.i.d. standard normal entries. In this case, Z11, Z12, Z21, and Z22 are all i.i.d. standard normal matrices of dimensions r × r, r × (p2 − r), (p1 − r) × r, and (p1 − r) × (p2 − r), respectively. By random matrix theory, α ≥ σr(X) − ‖Z11‖ ≥ σr(X) − C(√p1 + √p2), β ≤ C(√p1 + √p2), z12 ≤ C√p2, and z21 ≤ C√p1 with high probability. When σr(X) ≥ Cgap(√p1 ∨ √p2) for some large constant Cgap, Theorem 1 immediately implies
\[
\|\sin\Theta(V,\hat V)\| \le \frac{C\sqrt{p_2}}{\sigma_r(X)}, \qquad \|\sin\Theta(U,\hat U)\| \le \frac{C\sqrt{p_1}}{\sigma_r(X)}.
\]
Further discussions of perturbation bounds for general i.i.d. sub-Gaussian perturbation matrices, together with matching lower bounds, will be given in Section 3.

Theorem 1 gives upper bounds for the perturbation effects. We now establish lower bounds for the differences as measured by the sin Θ distances. The lower bounds given in Theorem 2 match the upper bounds in (4) and (5), proving that the results in Theorem 1 cannot be essentially improved. For the lower bound, it is sufficient to consider the class of exact rank-r matrices, as approximately low-rank matrices can be handled similarly.

Theorem 2 (Perturbation Lower Bound). We define the following class of (X, Z) pairs of p1 × p2 rank-r matrices and perturbations,
\[
F_{r,\alpha,\beta,z_{21},z_{12}} = \left\{ (X,Z) : \mathrm{rank}(X) = r,\ \sigma_{\min}(U^\top \hat X V) \ge \alpha,\ \|Z_{22}\| \le \beta,\ \|Z_{12}\| \le z_{12},\ \|Z_{21}\| \le z_{21} \right\}. \qquad (6)
\]
Provided that α² > β² + z12² + z21² and r ≤ (p1 ∧ p2)/2, we have the following lower bound for any estimator Ṽ ∈ O_{p2,r} based on the observation X̂:
\[
\inf_{\tilde V} \sup_{(X,Z)\in F_{r,\alpha,\beta,z_{21},z_{12}}} \|\sin\Theta(V,\tilde V)\| \ge \frac{1}{2\sqrt{10}} \left( \frac{\alpha z_{12} + \beta z_{21}}{\alpha^2 - \beta^2 - z_{12}^2 \wedge z_{21}^2} \wedge 1 \right). \qquad (7)
\]
Now we further define
\[
G_{r,\alpha,\beta,z_{21},z_{12},\tilde z_{21},\tilde z_{12}} = \left\{ (X,Z) : \|Z_{21}\|_F \le \tilde z_{21},\ \|Z_{12}\|_F \le \tilde z_{12},\ (X,Z) \in F_{r,\alpha,\beta,z_{21},z_{12}} \right\}. \qquad (8)
\]
Provided that α² > β² + z12² + z21², z̃21² ≤ r z21², z̃12² ≤ r z12², and r ≤ (p1 ∧ p2)/2, we have the following lower bound for any estimator Ṽ ∈ O_{p2,r} based on the observation X̂:
\[
\inf_{\tilde V} \sup_{(X,Z)\in G_{r,\alpha,\beta,z_{21},z_{12},\tilde z_{21},\tilde z_{12}}} \|\sin\Theta(V,\tilde V)\|_F \ge \frac{1}{2\sqrt{10}} \left( \frac{\alpha \tilde z_{12} + \beta \tilde z_{21}}{\alpha^2 - \beta^2 - z_{12}^2 \wedge z_{21}^2} \wedge \sqrt{r} \right). \qquad (9)
\]

The following Proposition 1, which provides upper bounds for the sin Θ distances between the leading right singular vectors of a matrix A and an arbitrary set of orthonormal columns W, can be viewed as another version of Theorem 1. For some applications, applying Proposition 1 may be more convenient than using Theorem 1 directly.

Proposition 1. Suppose A ∈ R^{p1×p2} and Ṽ = [V V⊥] ∈ O_{p2} contains the right singular vectors of A, where V ∈ O_{p2,r} and V⊥ ∈ O_{p2,p2−r} correspond to the first r and the last (p2 − r) right singular vectors, respectively. Let W̃ = [W W⊥] ∈ O_{p2} be any orthogonal matrix with W ∈ O_{p2,r} and W⊥ ∈ O_{p2,p2−r}. Given that σr(AW) > σ_{r+1}(A), we have
\[
\|\sin\Theta(V,W)\| \le \frac{\sigma_{r+1}(A)\,\|P_{(AW)} A W_\perp\|}{\sigma_r^2(AW) - \sigma_{r+1}^2(A)} \wedge 1, \qquad (10)
\]
\[
\|\sin\Theta(V,W)\|_F \le \frac{\sigma_{r+1}(A)\,\|P_{(AW)} A W_\perp\|_F}{\sigma_r^2(AW) - \sigma_{r+1}^2(A)} \wedge \sqrt{r}. \qquad (11)
\]

2.3 Comparisons with Wedin's sin Θ Theorem
Theorems 1 and 2 together establish separate rate-optimal perturbation bounds for the left and right singular subspaces. We now compare the results with the well-known Wedin sin Θ Theorem, which gives a uniform upper bound for the singular subspaces on both sides. Specifically, using the same notation as in Section 2.2, Wedin's sin Θ Theorem states that if σmin(Σ̂1) − σmax(Σ2) = δ > 0, then
\[
\max\left\{ \|\sin\Theta(V,\hat V)\|, \|\sin\Theta(U,\hat U)\| \right\} \le \frac{\max\left\{ \|Z\hat V\|, \|\hat U^\top Z\| \right\}}{\delta},
\]
\[
\max\left\{ \|\sin\Theta(V,\hat V)\|_F, \|\sin\Theta(U,\hat U)\|_F \right\} \le \frac{\max\left\{ \|Z\hat V\|_F, \|\hat U^\top Z\|_F \right\}}{\delta}.
\]
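As a concrete illustration of how the uniform Wedin bound can be loose on one side, the following Python sketch (not from the paper; numpy assumed, instance chosen for illustration) evaluates both sides of the spectral-norm version of Wedin's inequality for a tall random instance.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2, r, t = 1000, 20, 5, 30.0

U = np.linalg.qr(rng.standard_normal((p1, r)))[0]
V = np.linalg.qr(rng.standard_normal((p2, r)))[0]
X = t * U @ V.T
Z = rng.standard_normal((p1, p2))

Uh, s, Vht = np.linalg.svd(X + Z, full_matrices=False)
U_hat, V_hat = Uh[:, :r], Vht[:r, :].T

def sin_theta(A, B):
    # Spectral sin-Theta distance between the column spaces of A and B.
    return np.sqrt(1 - np.linalg.svd(A.T @ B, compute_uv=False).min() ** 2)

delta = s[r - 1]        # sigma_min(Sigma_hat_1) - sigma_max(Sigma_2); Sigma_2 = 0 since rank(X) = r
wedin = max(np.linalg.norm(Z @ V_hat, 2), np.linalg.norm(U_hat.T @ Z, 2)) / delta
print("Wedin uniform bound :", min(wedin, 1.0))
print("actual sinTheta(V)  :", sin_theta(V, V_hat))   # much smaller: the right space is easy
print("actual sinTheta(U)  :", sin_theta(U, U_hat))   # close to the uniform bound when p1 >> p2
```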
As mentioned in the introduction, the uniform bound on both the left and right singular subspaces in Wedin's sin Θ Theorem might be sub-optimal in some cases. For example, in the setting discussed in Remark 1, applying Wedin's theorem leads to
\[
\max\left\{ \|\sin\Theta(V,\hat V)\|, \|\sin\Theta(U,\hat U)\| \right\} \le \frac{C\max\{\sqrt{p_1}, \sqrt{p_2}\}}{\sigma_r(X)},
\]
which is sub-optimal for ‖sin Θ(U, Û)‖ if p2 ≫ p1.

We now further illustrate the advantages of having separate bounds for the left and right singular subspaces over uniform bounds through a numerical experiment. In a range of cases, especially when the numbers of rows and columns of the matrix differ significantly, it is even possible that one side of the singular space can be stably recovered while the other side cannot. The following simple numerical experiment illustrates this point. We randomly generate X = tUV^⊤ ∈ R^{p1×p2} and a perturbation Z, where t ∈ R, U and V are p1 × r and p2 × r random matrices with orthonormal columns drawn uniformly with respect to the Haar measure, and Z = (Zij)_{p1×p2} with Zij i.i.d. N(0, 1). We calculate the SVD of X + Z and form the first r left and right singular vectors as Û and V̂. The average losses in Frobenius and spectral sin Θ distances for both the left and right singular space estimates over 1,000 repetitions are given in Table 1 for various values of (p1, p2, r, t). It can be easily seen from this experiment that the left and right singular perturbation bounds behave very distinctly when p1 ≫ p2.

(p1, p2, r, t)         ‖sinΘ(Û, U)‖²   ‖sinΘ(V̂, V)‖²   ‖sinΘ(Û, U)‖²_F   ‖sinΘ(V̂, V)‖²_F
(100, 10, 2, 15)        0.3512          0.0669           0.6252             0.0934
(100, 10, 2, 30)        0.1120          0.0139           0.1984             0.0196
(100, 20, 5, 20)        0.2711          0.0930           0.9993             0.2347
(100, 20, 5, 40)        0.0770          0.0195           0.2835             0.0508
(1000, 20, 5, 30)       0.5838          0.0699           2.6693             0.1786
(1000, 20, 10, 100)     0.1060          0.0036           0.9007             0.0109
(1000, 200, 10, 50)     0.3456          0.0797           2.9430             0.4863
(1000, 200, 50, 100)    0.1289          0.0205           4.3614             0.2731

Table 1: Average losses in Frobenius and spectral sin Θ distances for both the left and right singular space changes after Gaussian noise perturbations.
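A minimal sketch of the Monte Carlo experiment behind Table 1 is given below (illustrative only; numpy assumed, and fewer repetitions than the 1,000 used in the paper, for speed). The returned entries correspond, in order, to the four columns of Table 1.

```python
import numpy as np

def haar_orthonormal(rng, p, r):
    # p x r matrix with (approximately Haar-distributed) orthonormal columns via QR.
    return np.linalg.qr(rng.standard_normal((p, r)))[0]

def sin_theta_sq(A, B):
    # (spectral^2, Frobenius^2) sin-Theta losses between the column spaces of A and B.
    s = np.clip(np.linalg.svd(A.T @ B, compute_uv=False), 0, 1)
    return 1 - s.min() ** 2, np.sum(1 - s ** 2)

def average_losses(p1, p2, r, t, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    out = np.zeros(4)
    for _ in range(reps):
        U, V = haar_orthonormal(rng, p1, r), haar_orthonormal(rng, p2, r)
        Y = t * U @ V.T + rng.standard_normal((p1, p2))
        Uh, _, Vht = np.linalg.svd(Y, full_matrices=False)
        lu = sin_theta_sq(U, Uh[:, :r])
        lv = sin_theta_sq(V, Vht[:r, :].T)
        out += np.array([lu[0], lv[0], lu[1], lv[1]])
    return out / reps

print(average_losses(1000, 20, 5, 30))
```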
3 Low-rank Matrix Denoising and Singular-Space Estimation
In this section, we apply the perturbation bounds given in Theorem 1 to low-rank matrix denoising. It can be seen that the new perturbation bounds are particularly powerful when the matrix dimensions differ significantly. We also establish a matching lower bound for low-rank matrix denoising which shows that the results are rate-optimal.

As mentioned in the introduction, accurate recovery of a low-rank matrix based on noisy observations has a wide range of applications, including magnetic resonance imaging (MRI) and relaxometry. See, e.g., Candes et al. (2013); Shabalin and Nobel (2013) and the references therein. This problem is also important in the context of dimension reduction. Suppose one observes a low-rank matrix with additive noise,
\[
Y = X + Z,
\]
where X = UΣV^⊤ ∈ R^{p1×p2} is a low-rank matrix with U ∈ O_{p1,r}, V ∈ O_{p2,r}, and Σ = diag{σ1(X), . . . , σr(X)} ∈ R^{r×r}, and Z ∈ R^{p1×p2} is an i.i.d. mean-zero sub-Gaussian matrix. The goal is to estimate the underlying low-rank matrix X or its singular values or singular vectors.

This problem has been actively studied. For example, Bura and Pfeiffer (2008), Capitaine et al. (2009), Benaych-Georges and Nadakuditi (2012), and Shabalin and Nobel (2013) focused on the asymptotic distributions of individual singular values and vectors when p1, p2 and the singular values grow proportionally. Vu (2011) discussed square matrices perturbed by i.i.d. Bernoulli matrices and derived an upper bound on the rotation angle of the singular vectors. O'Rourke et al. (2013) further generalized the results in Vu (2011) and proposed a trio-concentrated random matrix perturbation setting. Recently, Wang (2015) provided ℓ∞ distance bounds under relatively complicated settings when the matrix is perturbed by i.i.d. Gaussian noise. Donoho and Gavish (2014); Gavish and Donoho (2014); Candes et al. (2013) studied algorithms for recovering X, where singular value thresholding (SVT) and hard singular value thresholding (HSVT), defined as
\[
SVT(Y)_\lambda = \arg\min_X \left\{ \frac{1}{2}\|Y - X\|_F^2 + \lambda \|X\|_* \right\},
\qquad
HSVT(Y)_\lambda = \arg\min_X \left\{ \frac{1}{2}\|Y - X\|_F^2 + \lambda\,\mathrm{rank}(X) \right\},
\qquad (12)
\]
were proposed. The optimal choice of the thresholding level λ∗ was further discussed in Donoho and Gavish (2014) and Gavish and Donoho (2014). In particular, Donoho and Gavish (2014) prove that
\[
\inf_{\hat X} \sup_{\substack{X \in \mathbb{R}^{p_1\times p_2} \\ \mathrm{rank}(X)\le r}} \mathbb{E}\big\| \hat X - X \big\|_F^2 \asymp r(p_1 + p_2),
\]
when Z is an i.i.d. standard normal random matrix. If one defines the class of rank-r matrices Fr,t = {X ∈ R^{p1×p2} : σr(X) ≥ t}, we can immediately show the following upper bound on the relative error,
\[
\sup_{X\in F_{r,t}} \mathbb{E}\, \frac{\|\hat X - X\|_F^2}{\|X\|_F^2} \le \frac{C(p_1 + p_2)}{t^2} \wedge 1, \qquad (13)
\]
where
\[
\hat X = \begin{cases} SVT(Y)_{\lambda^*}, & \text{if } t^2 \ge C(p_1 + p_2), \\ 0, & \text{if } t^2 < C(p_1 + p_2). \end{cases}
\]
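Both estimators in (12) have closed forms in terms of the SVD of Y: soft- and hard-thresholding of the singular values, respectively. The following Python sketch (illustrative; numpy assumed, and the choice of λ in the usage example is only a rough value, not the λ∗ of the cited papers) implements them.

```python
import numpy as np

def svt(Y, lam):
    """argmin_X 0.5*||Y - X||_F^2 + lam*||X||_*  (soft-threshold the singular values)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def hsvt(Y, lam):
    """argmin_X 0.5*||Y - X||_F^2 + lam*rank(X): keep singular values with s_i > sqrt(2*lam)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    keep = s > np.sqrt(2.0 * lam)
    return (U[:, keep] * s[keep]) @ Vt[keep, :]

# Example usage on a synthetic rank-r signal plus noise.
rng = np.random.default_rng(0)
p1, p2, r, t = 200, 50, 3, 40.0
X = t * np.linalg.qr(rng.standard_normal((p1, r)))[0] @ np.linalg.qr(rng.standard_normal((p2, r)))[0].T
Y = X + rng.standard_normal((p1, p2))
X_svt = svt(Y, np.sqrt(p1) + np.sqrt(p2))   # lam of the order of the noise level (rough choice)
print(np.linalg.norm(X_svt - X) / np.linalg.norm(X))
```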
In the following discussion, we assume that the entries of Z = (Zij) have unit variance (which can simply be achieved by normalization). To be more precise, we define the class of distributions Gτ for some τ > 0 as follows: Z ∼ Gτ if
\[
\mathbb{E} Z = 0, \qquad \mathrm{Var}(Z) = 1, \qquad \mathbb{E}\exp(tZ) \le \exp(\tau t^2) \ \ \text{for all } t \in \mathbb{R}. \qquad (14)
\]
The entries Zij of Z are assumed to satisfy
\[
Z_{ij} \overset{iid}{\sim} G_\tau, \qquad 1 \le i \le p_1, \ 1 \le j \le p_2.
\]
Suppose Û and V̂ are respectively the first r left and right singular vectors of Y. We use Û and V̂ as the estimators of U and V, respectively. Then the perturbation bounds for singular spaces yield the following results.

Theorem 3 (Upper Bound). Suppose X = UΣV^⊤ ∈ R^{p1×p2} is of rank r. There exists a constant C > 0 that only depends on τ such that
\[
\mathbb{E}\|\sin\Theta(V,\hat V)\|^2 \le \frac{Cp_2(\sigma_r^2(X) + p_1)}{\sigma_r^4(X)} \wedge 1,
\qquad
\mathbb{E}\|\sin\Theta(V,\hat V)\|_F^2 \le \frac{Cp_2 r(\sigma_r^2(X) + p_1)}{\sigma_r^4(X)} \wedge r,
\]
\[
\mathbb{E}\|\sin\Theta(U,\hat U)\|^2 \le \frac{Cp_1(\sigma_r^2(X) + p_2)}{\sigma_r^4(X)} \wedge 1,
\qquad
\mathbb{E}\|\sin\Theta(U,\hat U)\|_F^2 \le \frac{Cp_1 r(\sigma_r^2(X) + p_2)}{\sigma_r^4(X)} \wedge r.
\]

Theorem 3 provides a non-trivial perturbation upper bound for sin Θ(V, V̂) (or sin Θ(U, Û)) if there exists a constant Cgap > 0 such that σr²(X) ≥ Cgap((p1p2)^{1/2} + p2) (or σr²(X) ≥ Cgap((p1p2)^{1/2} + p1)). In contrast, Wedin's sin Θ Theorem requires the singular value gap σr²(X) ≥ Cgap(p1 + p2), which shows the power of the proposed unilateral perturbation bounds. Furthermore, the upper bounds in Theorem 3 are rate-sharp in the sense that the following matching lower bounds hold. To the best of our knowledge, this is the first result that gives different optimal rates for the left and right singular spaces under the same perturbation.
Theorem 4 (Lower Bound). Define the following class of low-rank matrices,
\[
F_{r,t} = \left\{ X \in \mathbb{R}^{p_1\times p_2} : \sigma_r(X) \ge t \right\}. \qquad (15)
\]
If r ≤ (p1/16) ∧ (p2/2), then
\[
\inf_{\tilde V} \sup_{X\in F_{r,t}} \mathbb{E}\|\sin\Theta(V,\tilde V)\|^2 \ge c\left( \frac{p_2(t^2 + p_1)}{t^4} \wedge 1 \right),
\qquad
\inf_{\tilde V} \sup_{X\in F_{r,t}} \mathbb{E}\|\sin\Theta(V,\tilde V)\|_F^2 \ge c\left( \frac{p_2 r(t^2 + p_1)}{t^4} \wedge r \right),
\]
\[
\inf_{\tilde U} \sup_{X\in F_{r,t}} \mathbb{E}\|\sin\Theta(U,\tilde U)\|^2 \ge c\left( \frac{p_1(t^2 + p_2)}{t^4} \wedge 1 \right),
\qquad
\inf_{\tilde U} \sup_{X\in F_{r,t}} \mathbb{E}\|\sin\Theta(U,\tilde U)\|_F^2 \ge c\left( \frac{p_1 r(t^2 + p_2)}{t^4} \wedge r \right).
\]
Remark 2. Using similar technical arguments, we can also obtain the following lower bound for estimating the low-rank matrix X over Fr,t under a relative error loss,
\[
\inf_{\tilde X} \sup_{X\in F_{r,t}} \mathbb{E}\, \frac{\|\tilde X - X\|_F^2}{\|X\|_F^2} \ge c\left( \frac{p_1 + p_2}{t^2} \wedge 1 \right). \qquad (16)
\]
Combining (13) and (16) yields the minimax optimal rate for the relative error in matrix denoising,
\[
\inf_{\tilde X} \sup_{X\in F_{r,t}} \mathbb{E}\, \frac{\|\tilde X - X\|_F^2}{\|X\|_F^2} \asymp \frac{p_1 + p_2}{t^2} \wedge 1.
\]
An interesting fact is that
\[
\frac{p_1 + p_2}{t^2} \wedge 1 \asymp \left( \frac{p_2(t^2 + p_1)}{t^4} \wedge 1 \right) + \left( \frac{p_1(t^2 + p_2)}{t^4} \wedge 1 \right),
\]
which directly yields
\[
\inf_{\tilde X} \sup_{X\in F_{r,t}} \mathbb{E}\, \frac{\|\tilde X - X\|_F^2}{\|X\|_F^2}
\asymp \inf_{\tilde U} \sup_{X\in F_{r,t}} \mathbb{E}\|\sin\Theta(U,\tilde U)\|^2 + \inf_{\tilde V} \sup_{X\in F_{r,t}} \mathbb{E}\|\sin\Theta(V,\tilde V)\|^2.
\]
Hence, for the class Fr,t, one can stably recover X in relative Frobenius norm loss if and only if one can stably recover both U and V in spectral sin Θ norm.
Another interesting aspect of Theorems 3 and 4 is that, when p1 ≫ p2 and (p1p2)^{1/2} ≪ t² ≪ p1, there is no stable algorithm for recovery of either the left singular space U or the whole matrix X, in the sense that there exists a uniform constant c > 0 such that
\[
\inf_{\tilde U} \sup_{X\in F_{r,t}} \mathbb{E}\big\| \sin\Theta(U,\tilde U) \big\|^2 \ge c,
\qquad
\inf_{\tilde X} \sup_{X\in F_{r,t}} \mathbb{E}\, \frac{\|\tilde X - X\|_F^2}{\|X\|_F^2} \ge c.
\]
In fact, for X = tUV^⊤ ∈ Fr,t, if we simply apply the SVT or HSVT algorithms with the optimal choice of λ proposed in Donoho and Gavish (2014) and Gavish and Donoho (2014), then with high probability SVT_λ(Y) = HSVT_λ(Y) = 0. On the other hand, the spectral method does provide a consistent recovery of the right singular space according to Theorem 3:
\[
\mathbb{E}\big\| \sin\Theta(V,\hat V) \big\|^2 \to 0.
\]
This phenomenon is well demonstrated by the simulation results (Table 1) provided in Section 2.3.
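The regime described above is easy to visit numerically. The sketch below (an illustration under parameters chosen here, not one of the paper's experiments; numpy assumed) picks p1 ≫ p2 with (p1p2)^{1/2} ≪ t² ≪ p1, and shows that the best rank-r approximation of Y has relative error larger than one (so X cannot be recovered better than by the zero matrix), while the right singular space V is still well estimated.

```python
import numpy as np

rng = np.random.default_rng(3)
p1, p2, r, t = 200_000, 10, 2, 140.0     # sqrt(p1*p2) ~ 1414 << t^2 = 19600 << p1

U = np.linalg.qr(rng.standard_normal((p1, r)))[0]
V = np.linalg.qr(rng.standard_normal((p2, r)))[0]
X = t * U @ V.T
Y = X + rng.standard_normal((p1, p2))

Uh, s, Vht = np.linalg.svd(Y, full_matrices=False)
Y_r = (Uh[:, :r] * s[:r]) @ Vht[:r, :]      # best rank-r approximation of Y

rel_err = np.linalg.norm(Y_r - X) / np.linalg.norm(X)
print("relative error of rank-r truncation:", rel_err)   # > 1: worse than estimating X by 0

sines_sq = 1 - np.linalg.svd(V.T @ Vht[:r, :].T, compute_uv=False) ** 2
print("sin^2 Theta(V, V_hat):", sines_sq.max())          # small: V remains estimable
```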
4 High-dimensional Clustering
Unsupervised learning, or clustering, is a ubiquitous problem in statistics and machine learning (Hastie et al., 2009). The perturbation bounds given in Theorem 1, as well as the results in Theorems 3 and 4, have a direct implication for high-dimensional clustering. Suppose the locations of n points, X = [X1 · · · Xn] ∈ R^{p×n}, which lie in a certain r-dimensional subspace S of R^p, are observed with noise,
\[
Y_i = X_i + \varepsilon_i, \qquad i = 1, \cdots, n.
\]
Here the Xi ∈ S ⊆ R^p are fixed coordinates and the εi ∈ R^p are random noise vectors. The goal is to cluster the observations Y. Let the SVD of X be given by X = UΣV^⊤, where U ∈ O_{p,r}, V ∈ O_{n,r}, and Σ ∈ R^{r×r}. When p ≫ n ≫ r, applying standard algorithms (e.g. k-means) directly to the coordinates Y may lead to sub-optimal results with expensive computational costs due to the high dimensionality. A better approach is to first perform dimension reduction by computing the SVD of Y directly or on its random projections, and then carry out clustering based on the first r right singular vectors V̂ ∈ O_{n,r}. See, e.g., Feldman et al. (2013) and Boutsidis et al. (2015), and the references therein. It is important to note that the left singular space U is not directly used in the clustering procedure. Thus Theorem 3 is better suited than Wedin's sin Θ theorem for the analysis of this clustering method, as the method mainly depends on the accuracy of V̂ as an estimate of V.

Let us consider the following two-class clustering problem in more detail (see Hastie et al. (2009); Azizyan et al. (2013); Jin and Wang (2014); Jin et al. (2015)). Suppose li ∈ {−1, 1}, i = 1, ..., n, are indicators representing the class label of the i-th node, and let µ ∈ R^p be a fixed vector. Suppose one observes Y = [Y1, · · · , Yn], where
\[
Y_i = l_i \mu + Z_i, \qquad Z_i \overset{iid}{\sim} N(0, I_p), \qquad 1 \le i \le n,
\]
and neither the labels li nor the mean vector µ are observable. The goal is to cluster the data into two groups. The accuracy of any clustering algorithm is measured by the
misclassification rate
\[
\mathcal{M}(l, \hat l) := \frac{1}{n} \min_{\pi} \left| \left\{ i : l_i \ne \pi(\hat l_i) \right\} \right|. \qquad (17)
\]
Here π ranges over the permutations of {−1, 1}, as any permutation of the labels {−1, 1} does not change the clustering outcome. In this case, EYi is either µ or −µ, which lies on a straight line. A simple PCA-based clustering method is to set
\[
\hat l = \mathrm{sgn}(\hat v), \qquad (18)
\]
where v̂ ∈ R^n is the first right singular vector of Y. We now apply the sin Θ upper bound in Theorem 3 to analyze the performance guarantees of this clustering method. We are particularly interested in the high-dimensional case where p ≥ n. The case where p < n can be handled similarly.

Theorem 5. Suppose p ≥ n and π is any permutation on {−1, 1}. When ‖µ‖2 ≥ Cgap(p/n)^{1/4} for some large constant Cgap > 0, then for some other constant C > 0 the misclassification rate of the PCA-based clustering method l̂ given in (18) satisfies
\[
\mathbb{E}\,\mathcal{M}(\hat l, l) \le C\, \frac{n\|\mu\|_2^2 + p}{n\|\mu\|_2^4}.
\]

It is intuitively clear that the clustering accuracy depends on the signal strength ‖µ‖2: the stronger the signal, the easier the clustering. In particular, Theorem 5 requires the minimum signal strength condition ‖µ‖2 ≥ Cgap(p/n)^{1/4}. The following lower bound result shows that this condition is necessary for consistent clustering: when the condition ‖µ‖2 ≥ Cgap(p/n)^{1/4} does not hold, it is not possible to do essentially better than random guessing.

Theorem 6. Suppose p ≥ n. There exist cgap, Cn > 0 such that if n ≥ Cn,
\[
\inf_{\tilde l}\ \sup_{\substack{\mu : \|\mu\|_2 \le c_{gap}(p/n)^{1/4} \\ l \in \{-1,1\}^n}} \mathbb{E}\,\mathcal{M}(\tilde l, l) \ge \frac{1}{4}.
\]
13
Our setting is close to the "less sparse/weak signal" case in Jin et al. (2015). In that case, they introduced a simple aggregation method with l̂^{(sa)} = sgn(Xµ̂), where µ̂ = arg max_{µ∈{−1,0,1}^p} ‖Xµ‖_q for some q > 0. The statistical limit, i.e. the necessary condition for obtaining correct labels for most of the points, is ‖µ‖2 > C in their setting, which is smaller than the boundary ‖µ‖2 > C(p/n)^{1/4} in Theorem 5. As shown in Theorem 6, the bound ‖µ‖2 > C(p/n)^{1/4} is necessary in our setting. The reason for this difference is that they focused on a highly structured contrast mean vector µ which only takes two values {0, ν}. In contrast, we consider a general µ ∈ R^p, which leads to a stronger condition and a larger statistical limit. Moreover, the simple aggregation algorithm is computationally difficult for a general signal µ; thus the PCA-based method considered in this paper is preferred in the general dense µ setting.
5 Canonical Correlation Analysis
In this section, we consider an application of the perturbation bounds given in Theorem 1 to canonical correlation analysis (CCA), which is one of the most important tools in multivariate analysis for exploring the relationship between two sets of variables (Hotelling, 1936; Anderson, 2003; Gao et al., 2014, 2015). Given two random vectors X and Y with a certain joint distribution, CCA first looks for the pair of vectors α^{(1)} ∈ R^{p1}, β^{(1)} ∈ R^{p2} that maximize Corr((α^{(1)})^⊤X, (β^{(1)})^⊤Y). After obtaining the first pair of canonical directions, one can further obtain the second pair α^{(2)} ∈ R^{p1}, β^{(2)} ∈ R^{p2} such that Cov((α^{(1)})^⊤X, (α^{(2)})^⊤X) = Cov((β^{(1)})^⊤Y, (β^{(2)})^⊤Y) = 0 and Corr((α^{(2)})^⊤X, (β^{(2)})^⊤Y) is maximized. The higher order canonical directions can be obtained by repeating this process. If (X, Y) is further assumed to have joint covariance, say
\[
\mathrm{Cov}\begin{pmatrix} X \\ Y \end{pmatrix} = \Sigma = \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix},
\]
the population canonical correlation directions can be defined inductively by the following optimization problem. For k = 1, 2, · · · ,
\[
(\alpha^{(k)}, \beta^{(k)}) = \mathop{\arg\max}_{a\in\mathbb{R}^{p_1},\, b\in\mathbb{R}^{p_2}} a^\top \Sigma_{XY} b,
\quad \text{subject to } a^\top\Sigma_X a = b^\top\Sigma_Y b = 1,\ \ a^\top\Sigma_X \alpha^{(l)} = b^\top\Sigma_Y \beta^{(l)} = 0,\ \forall 1\le l\le k-1.
\]
A more explicit form of the canonical correlation directions is given in Hotelling (1936): (Σ_X^{1/2}α^{(k)}, Σ_Y^{1/2}β^{(k)}) is the k-th pair of singular vectors of Σ_X^{-1/2}Σ_{XY}Σ_Y^{-1/2}. We combine the leading r population canonical correlation directions and write
\[
A = \left[ \alpha^{(1)} \cdots \alpha^{(r)} \right], \qquad B = \left[ \beta^{(1)} \cdots \beta^{(r)} \right].
\]
Suppose one observes i.i.d. samples (X_i^⊤, Y_i^⊤)^⊤ ∼ N(0, Σ). Then the sample covariance and cross-covariance matrices of X and Y can be calculated as
\[
\hat\Sigma_X = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top, \qquad \hat\Sigma_Y = \frac{1}{n}\sum_{i=1}^n Y_i Y_i^\top, \qquad \hat\Sigma_{XY} = \frac{1}{n}\sum_{i=1}^n X_i Y_i^\top.
\]
The standard approach to estimating the canonical correlation directions α^{(k)}, β^{(k)} is via the SVD of Σ̂_X^{-1/2}Σ̂_{XY}Σ̂_Y^{-1/2},
\[
\hat\Sigma_X^{-1/2}\hat\Sigma_{XY}\hat\Sigma_Y^{-1/2} = \hat U \hat S \hat V^\top = \sum_{k=1}^{p_1\wedge p_2} \hat S_{kk}\, \hat U_{[:,k]} \hat V_{[:,k]}^\top.
\]
Then the leading r sample canonical correlation directions can be calculated as
\[
\hat A = \hat\Sigma_X^{-1/2}\hat U_{[:,1:r]} = [\hat\alpha^{(1)}, \hat\alpha^{(2)}, \cdots, \hat\alpha^{(r)}],
\qquad
\hat B = \hat\Sigma_Y^{-1/2}\hat V_{[:,1:r]} = [\hat\beta^{(1)}, \hat\beta^{(2)}, \cdots, \hat\beta^{(r)}].
\qquad (19)
\]
Â and B̂ are consistent estimators of the first r left and right canonical directions in the classical fixed-dimension case. Let X∗ ∈ R^{p1} be an independent copy of the original sample X. We define the following two losses to measure the accuracy of an estimator of the canonical correlation directions,
\[
L_{sp}(\hat A, A) = \max_{O\in\mathbb{O}_r}\ \max_{v\in\mathbb{R}^r,\, \|v\|_2=1} \mathbb{E}_{X^*}\left| (\hat A O v)^\top X^* - (Av)^\top X^* \right|^2,
\qquad
L_F(\hat A, A) = \max_{O\in\mathbb{O}_r} \mathbb{E}_{X^*}\left\| (\hat A O)^\top X^* - A^\top X^* \right\|_2^2.
\]
These two losses quantify how well the estimator (ÂO)^⊤X∗ can predict the values of the canonical variables A^⊤X∗, where O ∈ O_r is a rotation matrix, as the objects of interest here are the directions. The following theorem gives the upper bound for one side of the canonical correlation directions. The main technical tool is the perturbation bounds proposed in Section 2.

Theorem 7. Suppose (Xi, Yi) ∼ N(0, Σ), i = 1, · · · , n, where S = Σ_X^{-1/2}Σ_{XY}Σ_Y^{-1/2} is of rank r. Suppose Â ∈ R^{p1×r} is given by (19). Then there exist uniform constants Cgap, C, c > 0 such that whenever σr(S)² ≥ Cgap((p1p2)^{1/2} + p1 + p2^{3/2}n^{-1/2})/n,
\[
\mathbb{P}\left( L_{sp}(\hat A, A) \le \frac{Cp_1(n\sigma_r^2(S) + p_2)}{n^2\sigma_r^4(S)} \right) \ge 1 - C\exp(-c\, p_1\wedge p_2),
\]
\[
\mathbb{P}\left( L_F(\hat A, A) \le \frac{Cp_1 r(n\sigma_r^2(S) + p_2)}{n^2\sigma_r^4(S)} \right) \ge 1 - C\exp(-c\, p_1\wedge p_2).
\]
The results for B̂ can be written down similarly.
Remark 4. Chen et al. (2013) and Gao et al. (2014, 2015) considered sparse CCA, where the canonical correlation directions A and B are assumed to be jointly sparse. In particular, Chen et al. (2013) and Gao et al. (2015) proposed estimators under different settings and provided a unified rate-optimal bound for jointly estimating the left and right canonical correlation directions. Gao et al. (2014) proposed other computationally feasible estimators Â∗ and B̂∗ and provided a minimax rate-optimal bound for L_F(Â∗, A) under regularity conditions that in fact enable one to also prove the consistency of B̂∗.

Now let us consider the setting where p2 ≫ p1 and p2/n ≫ σr²(S) = t² ≫ (p1p2)^{1/2}/n. The lower bound result in Theorem 3.3 of Gao et al. (2014) implies that there is no consistent estimator for the right canonical correlation directions B, while Theorem 7 above shows that the left sample canonical correlation directions Â are a consistent estimator of A. This interesting phenomenon again shows the merit of our proposed unilateral perturbation bound.
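For completeness, here is a small Python sketch of the sample CCA directions in (19) (an illustration only, assuming numpy; it uses symmetric inverse square roots of the sample covariance matrices and therefore requires n to be larger than p1 and p2). In the asymmetric regime discussed above one would expect Â to stabilize while B̂ does not, which can be checked by evaluating the losses above on simulated data.

```python
import numpy as np

def inv_sqrt(M):
    # Symmetric inverse square root of a positive definite matrix.
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def sample_cca(X, Y, r):
    """Leading r sample canonical correlation directions, cf. (19).
    X: n x p1, Y: n x p2 (rows are observations, assumed centered)."""
    n = X.shape[0]
    Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Sx_isqrt, Sy_isqrt = inv_sqrt(Sx), inv_sqrt(Sy)
    U, s, Vt = np.linalg.svd(Sx_isqrt @ Sxy @ Sy_isqrt)
    A_hat = Sx_isqrt @ U[:, :r]        # estimated left canonical directions
    B_hat = Sy_isqrt @ Vt[:r, :].T     # estimated right canonical directions
    return A_hat, B_hat, s[:r]         # s[:r] are the sample canonical correlations
```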
6 Discussions
We have established in the present paper new and rate-optimal perturbation bounds, measured in both spectral and Frobenius sin Θ distances, for the left and right singular subspaces separately. These perturbation bounds are widely applicable to the analysis of many high-dimensional problems. In particular, we applied the perturbation bounds to study three important problems in high-dimensional statistics: low-rank matrix denoising and singular space estimation, high-dimensional clustering, and CCA. As mentioned in the introduction, in addition to these problems and the possible extensions discussed in the previous sections, the obtained perturbation bounds can be used in a range of other applications, including community detection in bipartite networks, multidimensional scaling, cross-covariance matrix estimation, and singular space estimation for matrix completion. We briefly discuss these problems here.

An interesting application of the perturbation bounds given in Section 2 is community detection in bipartite graphs. Community detection in networks has attracted much recent attention. The focus of the current community detection literature has been mainly on unipartite graphs (i.e., there is only one type of node). However, in some applications, the nodes can be divided into different types and only the interactions between the different types of nodes are available or of interest, such as people vs. committees, or Facebook users vs. public pages (see Melamed (2014); Alzahrani and Horadam (2016)). The observations on the connectivity of the network between two types of nodes can be described by an adjacency matrix A, where Aij = 1 if the i-th Type 1 node and the j-th Type 2 node are connected, and Aij = 0 otherwise. The spectral method is one of the most commonly
used approaches in the literature with theoretical guarantees (Rohe et al., 2011; Lei and Rinaldo, 2015). In a bipartite network, the left and right singular subspaces could behave very differently from each other. Our perturbation bounds can be used for community detection in bipartite graphs and potentially lead to sharper results in some settings.

Another possible application lies in multidimensional scaling (MDS) with a distance matrix between two sets of points. MDS is a popular method for visualizing data points embedded in a low-dimensional space based on distance matrices (Borg and Groenen, 2005). Traditionally MDS deals with a unipartite distance matrix, where all distances between any pair of points are observed. In some applications, the data points come from two groups and one is only able to observe the bipartite distance matrix formed by the pairwise distances between points from different groups. As the SVD is a commonly used technique for dimension reduction in MDS, the perturbation bounds developed in this paper can potentially be used for the analysis of MDS with a bipartite distance matrix.

In some applications, the cross-covariance matrix, rather than the overall covariance matrix, is of particular interest. Cai et al. (2015a) considered multiple testing of cross-covariances in the context of phenome-wide association studies (PheWAS). Suppose X ∈ R^{p1} and Y ∈ R^{p2} are jointly distributed with covariance matrix Σ. Given n i.i.d. samples (Xi, Yi), i = 1, · · · , n, from the joint distribution, one wishes to make statistical inference for the cross-covariance matrix ΣXY. If ΣXY has a low-rank structure, the perturbation bounds established in Section 2 could potentially be applied to make statistical inference for ΣXY.

Matrix completion, whose central goal is to recover a large low-rank matrix based on a limited number of observable entries, has been widely studied in the last decade. Among the various methods for matrix completion, the spectral method is fast, easy to implement, and achieves good performance (Keshavan et al., 2010; Chatterjee, 2014; Cho et al., 2015). The new perturbation bounds can potentially be used for singular space estimation under the matrix completion setting to yield better results.
7 Proofs
We prove the main results in Sections 2, 3 and 4 in this section. The proofs for CCA and the additional technical results are given in the Appendix.
7.1 Proofs of General Unilateral Perturbation Bounds
Some technical tools are needed to prove Theorems 1, 2 and Proposition 1. In particular, we need a few useful properties of sin Θ distances given below in Lemma 1. Specifically, Lemma 1 provides some more convenient expressions than the definitions for the sin Θ distances. It also shows that they are indeed distances as they satisfy triangular inequality.
Some other widely used metrics for orthogonal spaces, including Dsp (Vˆ , V ) = inf kVˆ − V Ok, O∈Or
DF (Vˆ , V ) = inf kVˆ − V OkF ,
(20)
ˆ ˆ| | V V − V V
.
(21)
O∈Or
ˆ ˆ| | V V − V V
,
F
are shown to be equivalent to the sin Θ distances. Lemma 1 (Properties of the sin Θ Distances). The following properties hold for the sin Θ distances. 1. (Equivalent Expressions) Suppose V, Vˆ ∈ Op,r . If V⊥ is an orthogonal extension of V , namely [V V⊥ ] ∈ Op , we have the following equivalent forms for k sin Θ(Vˆ , V )k and k sin Θ(Vˆ , V )kF , k sin Θ(Vˆ , V )k =
q
k sin Θ(Vˆ , V )kF =
2 (V ˆ | V ) = kVˆ | V⊥ k, 1 − σmin
(22)
q r − kV | Vˆ k2F = kVˆ | V⊥ kF .
(23)
2. (Triangular Inequality) For any V1 , V2 , V3 ∈ Op,r , k sin Θ(V2 , V3 )k ≤ k sin Θ(V1 , V2 )k + k sin Θ(V1 , V3 )k,
(24)
k sin Θ(V2 , V3 )kF ≤ k sin Θ(V1 , V2 )kF + k sin Θ(V1 , V3 )kF .
(25)
3. (Equivalence with Other Metrics) The metrics defined as (20) and (21) are equivalent to sin Θ distances as the following inequalities hold √ k sin Θ(Vˆ , V )k ≤ Dsp (Vˆ , V ) ≤ 2k sin Θ(Vˆ , V )k, √ k sin Θ(Vˆ , V )kF ≤ DF (Vˆ , V ) ≤ 2k sin Θ(Vˆ , V )kF ,
√
ˆ ˆ|
| ˆ ˆ
sin Θ(V , V ) ≤ V V − V V ≤ 2 sin Θ(V , V ) ,
√
ˆ ˆ|
V V − V V | = 2 sin Θ(Vˆ , V ) . F
F
Proof of Proposition 1. First, we can rotate the right singular space by right multiplying the whole matrices A, V | , W | by [W W⊥ ] without changing the singular values and left singular vectors. Thus without loss of generality, we assume that [W W⊥ ] = Ip2 . ¯Σ ¯ V¯ | , where U ¯ ∈ Op ,r , Σ ¯ ∈ Rr×r , Next, we further calculate the SVD: AW = A[:,1:r] := U 1 V¯ ∈ Or , and rotate the left singular space by left multiplying the whole matrix A by 18
¯ U ¯⊥ ]| , then rotate the right singular space by right multiplying A[:,1:r] by V¯ . After this [U rotation, the singular structure of A, AW are unchanged. Again without loss of generality, ¯U ¯⊥ ]| = Ip , V¯ = Ir . After these two steps of rotations, the formation we can assume that [U 1 of A is much simplified, p2 − r
r σ1 (AW )
r
A=
..
. σr (AW )
p1 − r
¯ | AW⊥ , U
(26)
¯ | AW⊥ U ⊥
0
while the problem we are considering is still without loss of generality. For convenience, denote ¯ | AW⊥ | = [y (1) y (2) · · · y (r) ], y (1) , · · · , y (r) ∈ Rp2 −r . U (27) We can further compute that p2 − r σ1 (AW )y (1)| .. . . (r)| σr (AW )y (AW⊥ )| AW⊥
r r
A| A =
p2 − r
σ12 (AW ) ..
σ1 (AW )y (1)
.
···
σr2 (AW ) σr (AW )y (r)
(28)
By basic theory in algebra, the i-th eigenvalue of A| A is equal to σi2 (A), and the i-th eigenvector of A| A is equal to the i-th right singular vector of A (up-to-sign). Suppose the singular vectors of A are V˜ = [v (1) , v (2) , · · · , v (p2 ) ], where the singular values can be further decomposed into two parts as v (k)
r = p2 − r
(k) α β (k) ,
or equivalently,
α(k) = W | v (k) , β (k) = W⊥| v (k) .
(29)
By observing the i-th entry of A| Av (k) = σk2 (A)v (k) , we know for 1 ≤ i ≤ r, r + 1 ≤ k ≤ p2 , (k) σi2 (AW ) − σk2 (A) αi + σi (AW )y (i)| β (k) = 0,
⇒
(k)
αi
=
−σi (AW ) y (i)| β (k) . 2 σi (AW ) − σk2 (A) (30)
Recall the assumption that σ1 (AW ) ≥ · · · ≥ σr (AW ) > σr+1 (A) ≥ · · · σp2 (A) ≥ 0.
19
(31)
x Also x2 −y 2 = x > y ≥ 0, so
1 x−y 2 /x
is a decreasing function for x and a increasing function for y when
σi (AW ) 2 σi (AW ) − σk2 (A)
≤
σr (AW ) 2 (A) , 2 σr (AW ) − σr+1
1 ≤ i ≤ r, r + 1 ≤ k ≤ p2 .
Since [β (r+1) · · · β (p2 ) ] is the submatrix of the orthogonal matrix V ,
(r+1) (p2 ) [β · · · β ]
≤ 1.
(32)
(33)
Now we can give an upper bound for the Frobenius norm of [α(r+1) · · · α(p2 ) ] p2 r
h i 2 (32) X X
(r+1) (k) 2 (p2 ) ··· α αi ≤
α
= F
σr2 (AW ) ≤ 2 (A) 2 σr2 (AW ) − σr+1 (27)(33) σr2 (AW ) ≤ 2 (A) 2 σr2 (AW ) − σr+1
i=1 k=r+1
k[y1 · · · yr ]| k2F [β (r+1)
σr2 (AW ) 2 (A) 2 σr2 (AW ) − σr+1
2
· · · β (p2 ) ]
p2 r X X
y (i)| β (k)
i=1 k=r+1
| ¯ AW⊥ 2 .
U F
It is more complicated to give a upper bound for the spectral norm of [α(r+1) · · · α(p2 ) ]. Suppose s = (sr+1 , · · · , sp2 ) ∈ Rp2 −r is any vector with ksk2 = 1. Based on (30), p2 X k=r+1
p2 p2 X X −sk −sk σi (AW )y (i)| β (k) 1 = y (i)| β (k) 2 2 2 σi (AW ) 1 − σk (A)/σi (AW )2 σi (AW ) − σk (A) k=r+1 k=r+1 ! p2 X p2 ∞ ∞ 2l X X −sk σk (A) (i)| (k) X −y (i)| (31) = y β = sk σk2l (A)β (k) . 2l+1 2l+1 σ (AW ) σ (AW ) k=r+1 l=0 i l=0 i k=r+1
(k)
sk αi
=
Hence,
y (1)| /σ 2l+1 (AW )
p
!
1 p2 ∞ 2
X
X X
. (k) 2l (k) ·
.. sk α ≤ sk σk (A)β
k=r+1 k=r+1 l=0 (r)| 2l+1 2
y /σr (AW ) 2
(1) (2)
∞
(r)
X [y y · · · y ] (r+1) (r+2) (p2 ) ≤ · [β β · · · β ]
σr2l+1 (AW ) l=0
2l · sr+1 σr+1 (A), · · · , sp2 σp2l2 (A) 2
∞ (27)(33)(31) X
≤
=
˜ | AW⊥ k kU 2l · σr+1 (A)ksk2 σr2l+1 (AW )
l=0 ˜ | AW kU
⊥ kσr (AW ) 2 (A) , 2 σr (AW ) − σr+1
20
2
which implies ˜ | AW⊥ kσr (AW ) kU 2 (A) . σr2 (AW ) − σr+1
k[α(r+1) · · · α(p2 ) ]k ≤
Note the definition of α(i) in (29), we know [α(r+1) α(r+2) · · · α(p2 ) ] = V˜[1:r,(r+1):p2 ] = (V⊥ )[1:r,:] . Thus,
˜ | AW⊥ kσr (AW ) kU (22)
ksin Θ(V, W )k = kW | V⊥ k = [α(r+1) · · · α(p2 ) ] ≤ 2 2 (A) , σr (AW ) − σr+1 (23)
ksin Θ(V, W )k2F = kW | V⊥ k2F = k[α(r+1) · · · α(p2 ) ]k2F ≤
˜ | AW⊥ k2 σr2 (AW ) kU F . 2 (A) 2 2 σr (AW ) − σr+1
(34)
¯ is the left singular vectors of AW , Finally, since U ¯ | AW⊥ k = kP(AW ) AW⊥ k, kU The upper bounds 1 in (10) and proof of Proposition 1.
√
¯ | AW⊥ kF = kP(AW ) AW⊥ k. kU
(35)
r on (11) are trivial. Therefore, we have finished the
Proof of Theorem 1. Before proving this theorem, we introduce the following lemma on the inequalities of the singular values in the perturbed matrix. Lemma 2. Suppose X ∈ Rp×n , Y ∈ Rp×n , rank(X) = a, rank(Y ) = b, 1. σa+b+1−r (X + Y ) ≤ min(σa+1−r (X), σb+1−r (Y )) for r ≥ 1; 2. if we further have X | Y = 0 or XY | = 0, we must have a + b ≤ n ∧ p, and σr2 (X + Y ) ≥ max(σr2 (X), σr2 (Y )) for any r ≥ 1. Also, σ12 (X + Y ) ≤ σ12 (X) + σ12 (Y ). The proof of Lemma 2 is provided in the Appendix. Applying Lemma 2, we get 2 ˆ ) = σr2 (XV ˆ ) = σr2 (U | XV ˆ + U | XV ˆ ) ≥ σr2 (U | XV ˆ ) = α2 , σmin (XV ⊥
(by Lemma 2 Part 2.) (36) | ˆ | | | | | ˆ ˆ ˆ ˆ ˆ ˆ Since X = U⊥ U⊥ X + U U X = XV⊥ V⊥ + XV V , and rank(XV V ), rank(U U X) ≤ r, we have n o 2 ˆ ≤ min σ1 (U⊥ U | X), ˆ σ1 (XV ˆ ⊥V |) σr+1 (X) (by Lemma 2 Part 1.) ⊥ ⊥ n o ˆ ⊥ ), σ1 (Z12 + U | XV ˆ ⊥) = min σ1 (Z21 + U⊥| XV ⊥ 2 2 ≤(β 2 + z12 ) ∧ (β 2 + z21 ) 2 2 =β 2 + z12 ∧ z21 .
21
(by Lemma 2 Part 2.)
We shall also note the fact that for any matrix A ∈ Rp×r with r ≤ p, denote the SVD as A = UA ΣA VA| , then
| | † † | † −1 (A). (37)
A (A| A) = UA ΣA VA VA Σ2A VA = UA ΣA VA ≤ σmin Thus, | ˆ| ˆ † ˆ | ˆ ˆ ˆ kP(XV ˆ ) XV⊥ k = kXV (V X XV ) (XV ) XV⊥ k (37)
ˆ ⊥k ˆ ⊥ k ≤ σ −1 (XV ˆ | XV ˆ )† k · k(XV ˆ )| XV ˆ )kV | X ˆ | XV ˆ | (V | X ≤kV | X min
" # i | ˆ | (36) 1 h
ˆ |U V |X ˆ | U⊥ · U XV⊥ ≤ V |X ˆ |
α U⊥| XV ⊥
αz + βz 1 | ˆ| 12 21 | | ˆ | . ≤
V X U k · kZ12 k + kZ21 k · kU⊥ XV ⊥ ≤ α α Similarly, −1 ˆ | ˆ| ˆ ˆ kP(XV ˆ ) XV⊥ kF ≤ σmin (XV )kV X XV⊥ kF ≤
αkZ12 kF + βkZ21 kF . α
ˆ W ˜ = [V V⊥ ], V˜ = [Vˆ Vˆ⊥ ], we could obtain Next, applying Proposition 1 by setting A = X, (4). Proof of Theorem 2. The construction of lower bound relies on the following design of 2-by-2 blocks. Lemma 3 (SVD of 2-by-2 matrices). Suppose 2-by-2 matrix A satisfies " # a b A= , a, b, c, d ≥ 0, a2 > d2 + b2 + c2 . c d " # v11 v12 Suppose V = is the right singular vectors of A, then v21 v22 ab + cd 1 |v12 | = |v21 | ≥ √ ∧1 . 10 a2 − d2 − b2 ∧ c2
(38)
(39)
The proof of Lemma 3 is provided in the Appendix. Suppose we have the following singular value decomposition for 2-by-2 matrix " # " # " # " #| α z12 u11 u12 σ1 0 v11 v12 = · · , z21 β u21 u22 0 σ2 v21 v22 by Lemma 3, we have 1 |v12 | = |v21 | ≥ √ 10
αz12 + βz21 2 ∧ z2 ∧ 1 . α2 − β 2 − z12 21 22
(40)
We construct the following matrices r r σ1 u11 v11 Ir X1 = σ1 u21 v11 Ir r p1 − 2r 0
r σ1 u11 v21 Ir σ1 u21 v21 Ir 0
p2 − 2r 0 0 ,
r r σ2 u12 v12 Ir Z1 = σ2 u22 v12 Ir r p1 − 2r 0
r σ2 u12 v22 Ir σ2 u22 v22 Ir 0
p2 − 2r 0 0 ;
r r αIr X2 = r 0 p1 − 2r 0
r 0 0 0
p2 − 2r 0 0 , 0
0
0
r r 0 Z2 = r z12 Ir p1 − 2r 0
r z12 Ir βIr 0
p2 − 2r 0 0 . 0
(41)
We can see rank(X1 ) = rank(X2 ) = r, r r αIr ˆ= X1 + Z1 = X2 + Z2 = X r z21 Ir p1 − 2r 0
r z12 Ir βIr 0
p2 − 2r 0 0 . 0
It is easy to check that both (X1 , Z1 ) and (X2 , Z2 ) are both in Fr,α,β,z12 ,z21 . Assume V1 , V2 are the first r singular vectors of X1 and X2 , respectively. Based on the structure of X1 , X2 , we know r v11 Ir r Ir r v21 Ir , V = V1 = 2 p2 − r 0 . p2 − 2r 0 ˆ for any estimator V˜ for the right singular space, we have Now based on the observations X, n o max k sin Θ(V˜ , V1 )k, k sin Θ(V˜ , V2 )k 1 ≥ k sin Θ(V˜ , V1 )k + k sin Θ(V˜ , V2 )k (42) 2 (40) Lemma 1 1 1 αz12 + βz21 ≥ ksin Θ(V1 , V2 )k ≥ √ 2 ∧ z2 ∧ 1 , 2 2 2 10 α − β 2 − z12 21
23
which gives us (7). For the proof of the Frobenius norm loss lower bound (8), the construction of pairs (X1 , Z1 ), (X2 , Z2 ) is essentially the same. We just need to construct similar √ √ pairs of (X1 , Z1 ) and (X2 , Z2 ) by replacing z12 , z21 by z˜12 / r, z˜21 / r. Based on the similar calculation, we can finish the proof of Theorem 2.
7.2 Proofs for Matrix Denoising
Proof of Theorem 3. We need some technical results for the proof of this theorem. Specifically, the following lemma relating to random matrix theory plays an important part. Lemma 4 (Properties related to Random Matrix Theory). Suppose X ∈ Rp1 ×p2 is a rank-r iid matrix with right singular space as V ∈ On,r , Z ∈ Rp1 ×p2 , Z ∼ Gτ is an i.i.d. sub-Gaussian random matrix. Y = X + Z. Then there exists constants C, c only depending on τ such that for any x > 0, P σr2 (Y V ) ≥ (σr2 (X) + p1 )(1 − x) ≤ C exp Cr − c σr2 (X) + p1 x2 ∧ x , 2 P σr+1 (Y ) ≤ p1 (1 + x) ≤ C exp Cp2 − cp1 · x2 ∧ x .
(43) (44)
Moreover, there exists Cgap , C, c which only depends on τ , such that whenever σr2 (X) ≥ Cgap p2 , for any x > 0 we have P (kPY V Y V⊥ k ≤ x) o n p ≥1 − C exp Cp2 − c min(x2 , σr2 (X) + p1 x) − C exp −c(σr2 (X) + p1 ) .
(45)
The following lemma provides an upper bound for matrix spectral norm based on an ε-net argument. Lemma 5 (ε-net Argument for Unit Ball). For any p ≥ 1, denote Bp = {x : x ∈ Rp , kxk2 ≤ 1} as the p-dimensional unit ball. Suppose K ∈ Rp1 ×p2 is a random matrix. Then we have for t > 0, P (kKk ≥ 3t) ≤ 7p1 +p2 · max P (|u| Kv| ≥ t) . (46) p p u∈B
1 ,v∈B 2
Now we start to prove Theorem 3. We only need to focus on the losses of Vˆ since the ˆ are symmetric. Besides, we only need to prove the spectral norm loss, as results for U sin Θ(Vˆ , V ) is a r × r matrix k sin Θ(V, Vˆ )k2F ≤ rk sin Θ(V, Vˆ )k2 . Throughout the proof, we use C and c to represent generic “large” and “small” constants, respectively. These constants C, c are uniform and only relying on τ , while the actual values may vary in different formulas.
24
1
Next, we focus on the scenario that σr2 (X) ≥ Cgap ((p1 p2 ) 2 + p2 ) for some large constant Cgap > 0 only relying on τ . The other case will be considered later. By Lemma 4, there exists constants C, c only depending on τ such that σ 4 (X) σ 2 (X) ≤ C exp Cr − c min σr2 (X), 2 r , P σr2 (Y V ) ≥ σr2 (X) + p1 − r 3 σr (X) + p1 1 2 σr4 (X) 2 2 P σr+1 (Y ) ≤ p1 + σr (X) ≤ C exp Cp2 − c min σr (X), , 3 p1 n o p P {kPY V Y V⊥ k ≥ x} ≤ C exp Cp2 − c min x2 , x σr2 (X) + p1 + C exp −c(σr2 (X) + p1 ) . (47) When Cgap is large enough, it holds that σr4 (X)/(σr2 (X) + p1 ) ≥ Cp2 . Then σr4 (X) σr4 (X) c σr4 (X) 2 c min ∧ σ (X) − Cr ≥ − Cr ≥ ; r σr2 (X) + p1 σr2 (X) + p1 2 σr2 (X) + p1 σr4 (x) σr4 (X) σ 4 (X) c 2 c min σr (X), − Cp2 ≥ c 2 r − Cp2 ≥ . p1 σr (X) + p1 2 σr2 (X) + p1 p c 2 σr (X) + p1 , c min(x2 , σr2 (X) + p1 x) − Cp2 = c(σr2 (X) + p1 ) − Cp2 ≥ 2 p c σr4 (X) ≥ 2 if x = σr2 (X) + p1 . 2 σr (X) + p1 To sum up, if we denote the event Q as p σr2 (X) 2 1 2 2 2 2 Q = σr (Y V ) ≥ σr (X) + p1 − , σr+1 (Y ) ≤ p1 + σr (X), kPY V Y V⊥ k ≤ σr (X) + p1 , 3 3 1
when σr2 (X) ≥ Cgap ((p1 p2 ) 2 + p2 ) for some large constant Cgap > 0, σr4 (X) c P (Q ) ≤ C exp −c 2 . σr (X) + p1
(48)
Under event Q, we can apply Proposition 1 and obtain k sin Θ(Vˆ , V )k2 ≤
σr2 (Y V )kPY V Y V⊥ k2 (σ 2 (X) + p1 )kPY V Y V⊥ k2 ≤C r . 2 2 (σr (Y V ) − σr+1 (Y )) σr4 (X) 2
x Here we used the fact that (x2 −y 2 )2 is a decreasing function of x and increasing function of y when x > y ≥ 0. Next we shall note that k sin Θ(Vˆ , V )k ≤ 1 for any Vˆ , V ∈ Op2 ,r . Therefore,
2
2
2
E sin Θ(Vˆ , V ) = E sin Θ(Vˆ , V ) 1Q + E sin Θ(Vˆ , V ) 1Qc (49) C(σr2 (X) + p1 ) 2 c ≤ EkPY V Y V⊥ k 1Q + P(Q ). σr4 (X)
25
By basic property of exponential function, (48) σ 2 (X) + p1 σr4 (X) p2 (σr2 (X) + p1 ) c ≤C r 4 P(Q ) ≤ exp −c 2 ≤C . σr (X) + p1 σr (X) σr4 (X)
(50)
It remains to consider E kPY V Y V⊥ k2 1Q . Denote T = kPY V Y V⊥ k. Applying Lemma 4 again, we have for some constant Cx to be determined a little while later that Z ∞ 2 2 P T 2 1{T 2 ≤σr2 (X)+p1 } ≥ t dt ET 1Q ≤ET 1{T 2 ≤σr2 (X)+p1 } = 0
Z
σr2 (X)+p1
≤Cx p2 + Cx p2 (47)
Z
σr2 (X)+p1
≤ Cx p2 +
P T 2 1{T 2 ≤σr2 (X)+p1 } ≥ t dt C exp {Cp2 − ct} dt + exp −c(σr2 (X) + p1 ) dt
Cx p2
1 ≤Cx p2 + C(σr2 (X) + p1 ) exp −c(σr2 (X) + p1 ) + C exp(Cp2 ) · exp (−cCx p2 ) c C ≤Cx p2 + C + exp((C − cCx )p2 ). c As we could see we can choose Cx large enough, but only relying on other constants C, c in the inequalities above, to ensure that ET 2 1Q ≤ Cp2
(51)
for large constant C > 0 as long as p2 ≥ 1. Now, combining (49), (50), (51) as well as the trivial upper bound 1, we obtain
2 Cp (σ 2 (X) + p )
2 r 1 ∧ 1. E sin Θ(Vˆ , V ) ≤ 4 σr (X) 1 as long as σr2 (X) ≥ Cgap (p1 p2 ) 2 + p2 for some large enough Cgap . 1 Finally when σr2 (X) < Cgap (p1 p2 ) 2 + p2 , we have 1
p2 (p1 + Cgap (p1 p2 ) 2 + Cgap p2 ) p2 (σr2 (X) + p1 ) ≥ 1 4 σr (X) 2 (p p + p2 + 2p 2 p3/2 ) Cgap 1 2 2 1 2 1 p1 + Cgap (p1 p2 ) 2 + Cgap p2 1 ≥ min 1, = , 1 Cgap 2 Cgap p1 + 2(p1 p2 ) 2 + p2 then
(52)
2 Cp2 (σr2 (X) + p1 )
E sin θ(Vˆ , V ) ≤ 1 ≤ ∧1 σr4 (X)
when C ≥ min−1 (1, 1/Cgap ). In summary, no matter what value σr2 (X) takes, we always have
2 Cp (σ 2 (X) + p )
2 r 1 E sin Θ(Vˆ , V ) ≤ ∧ 1. 4 σr (X) 26
Proof of Theorem 4. Since k sin Θ(Vˆ , V )k ≥ √1r k sin Θ(Vˆ , V )kF , we only need to prove the Frobenius norm lower bound. Particularly we focus on the case with Gaussian noise iid with mean 0 and variance 1, i.e. Zij ∼ N (0, 1). The technical tool we use to develop such lower bound include the generalized Fano’s Lemma. One interesting aspect of this singular space estimation problem is that, multiple sampling distribution PX correspond to one target parameter V . In order to proceed, we select one “representative”, either a distribution PV or a mixture distribution P¯V,t for each V , and consider the estimation with samples from “representative” distribution. In order to bridge the representative estimation and the original estimation problem, we introduce the following lower bound lemma. Lemma 6. Let {Pθ : θ ∈ Θ} be a set of probability measures in Euclidean space Rp , let T : Θ → Φ be a function which maps θ into another metric space φ ∈ (Φ, d). For each φ ∈ Φ, denote co(Pφ ) as the convex hull of the set of probability measures Pφ = {Pθ : T (θ) = φ}. If we choose a representative Pφ in each co(Pφ ), then for any estimator φˆ based on the sample generated from Pθ with φ = T (θ), and any estimator φˆ0 based on the samples generated from Pθφ , we have ˆ ≥ inf sup EP [d2 (φ, φ)]. ˆ inf sup EPθ [d2 (T (θ), φ)] φ φˆ θ∈Θ
φˆ φ∈Φ
(53)
We prove the theorem separately in two cases according to $t$: $t^2\le p_1/4$ or $t^2 > p_1/4$.

• First we consider the case $t^2\le p_1/4$. For each $V\in\mathbb{O}_{p_2,r}$, we define the following class of densities $P_Y$, where $Y=X+Z$ and the right singular space of $X$ is $V$:
$$\mathcal P_{V,t} = \Big\{P_Y : Y\in\mathbb{R}^{p_1\times p_2},\; Y = X+Z,\; X \text{ fixed},\; Z\overset{iid}{\sim}N(0,1),\; X\in\mathcal F_{r,t},\; \text{the right singular vectors of } X \text{ are } VO \text{ for some } O\in\mathbb{O}_r\Big\}. \tag{54}$$
For each $V\in\mathbb{O}_{p_2,r}$, we construct the following Gaussian mixture measure $\bar P_{V,t}$ on $\mathbb{R}^{p_1\times p_2}$:
$$\bar P_{V,t}(Y) = C_{V,t}\int_{W\in\mathbb{R}^{p_1\times r}:\,\sigma_{\min}(W)\ge 1/2} \frac{1}{(2\pi)^{p_1p_2/2}}\exp\big(-\|Y - 2tWV^\intercal\|_F^2/2\big)\cdot\Big(\frac{p_1}{2\pi}\Big)^{p_1r/2}\exp\big(-p_1\|W\|_F^2/2\big)\,dW. \tag{55}$$
Here $C_{V,t}$ is the constant that normalizes the integral and makes $\bar P_{V,t}$ a valid probability density; explicitly,
$$C_{V,t}^{-1} = \mathbb{P}\big(\sigma_{\min}(W)\ge 1/2\big),\qquad W\in\mathbb{R}^{p_1\times r},\; W_{ij}\overset{iid}{\sim}N(0,1/p_1).$$
Moreover, since $2tWV^\intercal$ has rank $r$ with least singular value no less than $t$ on the event $\{\sigma_{\min}(W)\ge 1/2\}$, $\bar P_{V,t}$ is a mixture density over $\mathcal P_{V,t}$, i.e. $\bar P_{V,t}\in\mathrm{co}(\mathcal P_{V,t})$. The following lemma, whose proof is provided in the Appendix, gives an upper bound for the KL divergence between $\bar P_{V,t}$ and $\bar P_{V',t}$.

Lemma 7. Under the assumptions of Theorem 4 and $t^2\le p_1/4$, for any $V, V'\in\mathbb{O}_{p_2,r}$ we have
$$D(\bar P_{V,t}\|\bar P_{V',t}) \le \frac{16t^4}{2(4t^2+p_1)}\big\|\sin\Theta(V,V')\big\|_F^2 + C_{KL}, \tag{56}$$
where $C_{KL}$ is a uniform constant.

We then consider the metric space $(\mathbb{O}_{p_2,r}, \|\sin\Theta(\cdot,\cdot)\|_F)$. In particular, consider the ball with radius $0<\varepsilon<\sqrt{2r}$ and center $V\in\mathbb{O}_{p_2,r}$,
$$B(V,\varepsilon) = \big\{V'\in\mathbb{O}_{p_2,r} : \|\sin\Theta(V',V)\|_F\le\varepsilon\big\}.$$
Since $r\le p_2/2$ and $\|\sin\Theta(V',V)\|_F = \frac{1}{\sqrt 2}\|V'V'^\intercal - VV^\intercal\|_F$, Lemma 1 in Cai et al. (2013) implies that for any $\alpha\in(0,1)$ and $\varepsilon\in(0,\sqrt{2r})$ there exist $V_1,\dots,V_m\subseteq B(V,\varepsilon)$ such that
$$m\ge\Big(\frac{c}{\alpha}\Big)^{r(p_2-r)},\qquad \min_{1\le i\ne j\le m}\big\|\sin\Theta(V_i,V_j)\big\|_F\ge\alpha\varepsilon.$$
Since $\{V_1,\dots,V_m\}\subseteq B(V,\varepsilon)$, we have $\|\sin\Theta(V_i,V_j)\|_F\le 2\varepsilon$, which together with (56) gives
$$D\big(\bar P_{V_i,t}\|\bar P_{V_j,t}\big) \le \frac{32\varepsilon^2t^4}{4t^2+p_1} + C_{KL}.$$
Fano's lemma (Yu, 1997) then leads to the lower bound
$$\inf_{\hat V}\sup_{V\in\mathbb{O}_{p_2,r}} \mathbb{E}_{\bar P_{V,t}}\big\|\sin\Theta(\hat V, V)\big\|_F^2 \ge \alpha^2\varepsilon^2\left(1 - \frac{\frac{32\varepsilon^2t^4}{4t^2+p_1} + C_{KL} + \log 2}{r(p_2-r)\log\frac{c}{\alpha}}\right).$$
We particularly select
$$\alpha = c\exp\big(-(1+C_{KL}+\log 2)\big),\qquad \varepsilon = \sqrt{\frac{r(p_2-r)(4t^2+p_1)}{32t^4}}\wedge\sqrt{2r}. \tag{57}$$
Since $r\le p_2/2$, we further have
$$\inf_{\hat V}\sup_{V\in\mathbb{O}_{p_2,r}} \mathbb{E}_{\bar P_{V,t}}\big\|\sin\Theta(\hat V, V)\big\|_F^2 \ge c\left(\frac{rp_2(t^2+p_1)}{t^4}\wedge r\right). \tag{58}$$
Finally, note that $\bar P_{V,t}$ is a mixture distribution built from $\mathcal P_{V,t}$ defined in (54). Lemma 6 implies
$$\inf_{\hat V}\sup_{X\in\mathcal F_{r,t}} \mathbb{E}\big\|\sin\Theta(\hat V, V)\big\|_F^2 \ge \inf_{\hat V}\sup_{V\in\mathbb{O}_{p_2,r}} \mathbb{E}_{\bar P_{V,t}}\big\|\sin\Theta(\hat V, V)\big\|_F^2. \tag{59}$$
The two inequalities above together imply the desired lower bound.
• Next we consider the case $t^2 > p_1/4$. This case is simpler than the previous one, as we do not need to mix the multivariate Gaussian measures. Let
$$U_0 = \begin{bmatrix} I_r \\ 0\end{bmatrix}\in\mathbb{O}_{p_1,r}.$$
We introduce
$$X_V = tU_0V^\intercal,\qquad V\in\mathbb{O}_{p_2,r}, \tag{60}$$
and denote by $P_V$ the probability measure of $Y$ when $Y = X_V + Z$ with $Z\in\mathbb{R}^{p_1\times p_2}$, $Z\overset{iid}{\sim}N(0,1)$. Based on (114), we have
$$D(P_V\|P_{V'}) = \frac{1}{2}\big\|tU_0V^\intercal - tU_0(V')^\intercal\big\|_F^2 = \frac{t^2}{2}\|V - V'\|_F^2. \tag{61}$$
By the same construction as in the case $t^2\le p_1/4$, one can consider the ball of radius $\varepsilon$ centered at $V_0\in\mathbb{O}_{p_2,r}$,
$$B(V_0,\varepsilon) = \big\{V' : \|\sin\Theta(V', V_0)\|_F\le\varepsilon\big\},$$
and, for any $0<\alpha<1$, find $\{V_1',\dots,V_m'\}\subseteq B(V_0,\varepsilon)$ such that
$$m\ge\Big(\frac{c}{\alpha}\Big)^{r(p_2-r)},\qquad \min_{1\le i\ne j\le m}\big\|\sin\Theta(V_i', V_j')\big\|_F\ge\alpha\varepsilon.$$
By the basic properties of $\sin\Theta$ distances (Lemma 1 in this paper), we can find $O_i\in\mathbb{O}_r$ such that
$$\big\|V_0 - V_i'O_i\big\|_F \le \sqrt 2\,\big\|\sin\Theta(V_0, V_i')\big\|_F \le \sqrt 2\,\varepsilon.$$
Denote $V_i = V_i'O_i\in\mathbb{O}_{p_2,r}$; then
$$\max_{1\le i\ne j\le m} D(P_{V_i}\|P_{V_j}) \overset{(61)}{=} \max_{1\le i\ne j\le m}\frac{t^2}{2}\|V_i - V_j\|_F^2 \le \max_{1\le i\ne j\le m} t^2\big(\|V_0 - V_i\|_F^2 + \|V_0 - V_j\|_F^2\big) \le 4t^2\varepsilon^2.$$
Following the same Fano argument as in the case $t^2\le p_1/4$, we obtain
$$\inf_{\hat V}\sup_{V\in\mathbb{O}_{p_2,r}}\mathbb{E}_{P_V}\big\|\sin\Theta(\hat V, V)\big\|_F^2 \ge \alpha^2\varepsilon^2\left(1 - \frac{4\varepsilon^2t^2 + \log 2}{r(p_2-r)\log\frac{c}{\alpha}}\right)$$
for any $\alpha\in(0,1)$ and $\varepsilon\in(0,\sqrt{2r})$. Choosing $\alpha$ and $\varepsilon$ as before yields a lower bound of order $c(p_2r/t^2 \wedge r)$; moreover, since $t^2 > p_1/4$,
$$\frac{p_2r}{t^2} \ge \frac{p_2r(t^2/2 + p_1/8)}{t^4} \ge \frac{1}{8}\cdot\frac{p_2r(t^2+p_1)}{t^4}.$$
Similarly, by Lemma 6 we have
$$\inf_{\hat V}\sup_{X\in\mathcal F_{r,t}}\mathbb{E}\big\|\sin\Theta(\hat V, V)\big\|_F^2 \ge c\left(\frac{p_2r(t^2+p_1)}{t^4}\wedge r\right).$$
Finally, combining the two cases $t^2\le p_1/4$ and $t^2>p_1/4$ finishes the proof of Theorem 4.
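As an aside (ours, not part of the proof), the Gaussian KL identity (61) used above is straightforward to verify numerically: for two mean-shifted i.i.d. $N(0,1)$ models the KL divergence is half the squared Frobenius distance between the means. The snippet below is our own check, with all names our own.

```python
# Numerical check (ours) of the KL identity (61): for P_V : Y = t*U0*V^T + Z and
# P_V' : Y = t*U0*V'^T + Z with Z_ij ~ N(0,1), the KL divergence equals
# ||t*U0*V^T - t*U0*V'^T||_F^2 / 2 = t^2 * ||V - V'||_F^2 / 2  (since U0^T U0 = I_r).
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r, t = 8, 12, 2, 3.0
U0 = np.eye(p1)[:, :r]
V1 = np.linalg.qr(rng.standard_normal((p2, r)))[0]
V2 = np.linalg.qr(rng.standard_normal((p2, r)))[0]

kl = 0.5 * np.linalg.norm(t * U0 @ V1.T - t * U0 @ V2.T, 'fro') ** 2
print(kl, 0.5 * t ** 2 * np.linalg.norm(V1 - V2, 'fro') ** 2)   # the two values agree
```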
7.3 Proofs in High-dimensional Clustering
Proof of Theorem 5. Since $\mathbb{E}Y = \mu l^\intercal$ has right singular vector $l/\sqrt n$ and nonzero singular value $\sqrt n\|\mu\|_2$, Theorem 3 yields
$$\mathbb{E}\left\|\sin\Theta\Big(\frac{l}{\sqrt n}, \hat v\Big)\right\|^2 \le \frac{Cn\big((\sqrt n\|\mu\|_2)^2 + p\big)}{(\sqrt n\|\mu\|_2)^4} \le \frac{C(n\|\mu\|_2^2 + p)}{n\|\mu\|_2^4}.$$
In addition, the third part of Lemma 1 implies
$$\mathbb{E}\min_{o\in\{-1,1\}}\big\|\hat v - o\cdot l/\sqrt n\big\|_2^2 \le 2\,\mathbb{E}\left\|\sin\Theta\Big(\frac{l}{\sqrt n}, \hat v\Big)\right\|^2 \le \frac{C(n\|\mu\|_2^2 + p)}{n\|\mu\|_2^4}.$$
Finally, since $\hat l = \mathrm{sgn}(\hat v)$ and $l_i/\sqrt n = \pm 1/\sqrt n$, we obtain
$$\mathbb{E}\min_\pi \frac{1}{n}\Big|\big\{i : l_i\ne\pi(\hat l_i)\big\}\Big| = \mathbb{E}\min_\pi \frac{1}{n}\Big|\big\{i : l_i/\sqrt n\ne\pi(\mathrm{sgn}(\hat v_i))/\sqrt n\big\}\Big| \le \mathbb{E}\min_{o\in\{-1,1\}}\frac{1}{n}\Big|\big\{i : |l_i/\sqrt n - o\hat v_i|\ge 1/\sqrt n\big\}\Big|$$
$$\le \mathbb{E}\min_{o\in\{-1,1\}}\frac{1}{n}\sum_{i=1}^n n\,\big|l_i/\sqrt n - o\hat v_i\big|^2 = \mathbb{E}\min_{o\in\{-1,1\}}\big\|\hat v - o\cdot l/\sqrt n\big\|_2^2 \le \frac{C(n\|\mu\|_2^2 + p)}{n\|\mu\|_2^4},$$
which finishes the proof of this theorem.
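The estimator analyzed in this proof, taking the entrywise signs of the leading right singular vector of $Y$, is simple to reproduce. The sketch below (our own illustration, with all names ours) simulates the model $Y = \mu l^\intercal + Z$ and reports the misclustering rate next to the rate $C(n\|\mu\|_2^2+p)/(n\|\mu\|_2^4)$ from the bound above.

```python
# Minimal sketch (ours, not from the paper) of the clustering procedure analyzed
# above: labels are recovered as the signs of the leading right singular vector
# of Y = mu l^T + Z, and the misclustering rate is measured up to a global sign.
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 500
mu = 0.8 * rng.standard_normal(p)            # class mean difference vector
l = rng.choice([-1.0, 1.0], size=n)          # true +/-1 labels
Y = np.outer(mu, l) + rng.standard_normal((p, n))

v_hat = np.linalg.svd(Y)[2][0]               # leading right singular vector of Y
l_hat = np.sign(v_hat)

# misclustering proportion up to a global label swap
err = min(np.mean(l_hat != l), np.mean(-l_hat != l))
rate = (n * mu @ mu + p) / (n * (mu @ mu) ** 2)   # rate from the Theorem 5 bound
print(f"misclustering rate = {err:.3f}, bound rate C(n||mu||^2+p)/(n||mu||^4) ~ {rate:.3f}")
```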
Proof of Theorem 6. We first consider the metric space $\{-1,1\}^n$, where each pair $l$ and $-l$ are identified as the same element. For any $l_1, l_2\in\{-1,1\}^n$, it is easy to see that
$$M(l_1, l_2) = \frac{1}{2n}\min\big\{\|l_1 - l_2\|_1,\; \|l_1 + l_2\|_1\big\},$$
where $M$ is defined in (17). For any three elements $l_1, l_2, l_3\in\{-1,1\}^n$, since
$$\|l_1 - l_2\|_1 \le \min\big\{\|l_1 - l_3\|_1 + \|l_3 - l_2\|_1,\; \|l_1 + l_3\|_1 + \|l_3 + l_2\|_1\big\},$$
$$\|l_1 + l_2\|_1 \le \min\big\{\|l_1 - l_3\|_1 + \|l_3 + l_2\|_1,\; \|l_1 + l_3\|_1 + \|l_3 - l_2\|_1\big\},$$
we have
$$M(l_1,l_2) = \frac{1}{2n}\min\big\{\|l_1-l_2\|_1, \|l_1+l_2\|_1\big\} \le \frac{1}{2n}\min\big\{\|l_1-l_3\|_1 + \|l_3-l_2\|_1,\; \|l_1+l_3\|_1 + \|l_3+l_2\|_1,\; \|l_1-l_3\|_1 + \|l_3+l_2\|_1,\; \|l_1+l_3\|_1 + \|l_3-l_2\|_1\big\} = M(l_1,l_3) + M(l_2,l_3).$$
In other words, $M$ is a metric on $\{-1,1\}^n$. Similarly to Lemma 4 in Yu (1997), one can show that there exist universal positive constants $c_0, C_n$ such that when $n\ge C_n$, there exists a subset $A\subseteq\{-1,+1\}^n$ satisfying $|A|\ge\exp(c_0n)$ and
$$M(l,l')\ge 1/3 \qquad \text{for all } l, l'\in A,\; l\ne l'. \tag{63}$$
Denote $t = c_{\rm gap}(p/n)^{1/4}$. Next, we need to discuss separately according to the value of $t^2$.

• When $nt^2\le p/4$: for each $l\in\{-1,1\}^n$, similarly to the proof of Theorem 4, we define the following class of densities $P_Y$, where $Y=X+Z$ and the right singular space of $X$ is $l/\sqrt n$:
$$\mathcal P_{l,t} = \Big\{P_Y : Y\in\mathbb{R}^{p\times n},\; Y = X+Z,\; X\text{ fixed},\; Z\overset{iid}{\sim}N(0,1),\; X=\mu l^\intercal,\; \|\mu\|_2\ge t\Big\}.$$
We construct the following Gaussian mixture measure $\bar P_{l,t}$:
$$\bar P_{l,t}(Y) = C_{l,t}\int_{\mu_0\in\mathbb{R}^p:\,\|\mu_0\|_2\ge 1/2} \frac{1}{(2\pi)^{pn/2}}\exp\big(-\|Y - 2t\mu_0l^\intercal\|_F^2/2\big)\cdot\Big(\frac{p}{2\pi}\Big)^{p/2}\exp\big(-p\|\mu_0\|_2^2/2\big)\,d\mu_0. \tag{64}$$
Here $C_{l,t}$ is the constant that normalizes the integral and makes $\bar P_{l,t}$ a valid probability density. Moreover, since $\|2t\mu_0\|_2\ge t$ on the event $\{\|\mu_0\|_2\ge 1/2\}$, $\bar P_{l,t}$ is a mixture density over $\mathcal P_{l,t}$, i.e. $\bar P_{l,t}\in\mathrm{co}(\mathcal P_{l,t})$. By Lemma 7, for any $l, l'\in\{-1,1\}^n$, the KL divergence between $\bar P_{l,t}$ and $\bar P_{l',t}$ satisfies
$$D(\bar P_{l,t}\|\bar P_{l',t}) \le \frac{16n^2t^4}{2(4nt^2+p)}\big\|\sin\Theta(l/\sqrt n, l'/\sqrt n)\big\|_F^2 + C_{KL} \le C\Big(\frac{n^2t^4}{nt^2+p} + 1\Big) \le C\Big(\frac{n^2c_{\rm gap}^4(p/n)}{p} + 1\Big) = C\big(nc_{\rm gap}^4 + 1\big)$$
for some uniform constant $C>0$. Hence we can choose $c_{\rm gap}$ small enough and $C_n>0$ large enough such that whenever $t\le c_{\rm gap}(p/n)^{1/4}$ and $n\ge C_n$,
$$D(\bar P_{l,t}\|\bar P_{l',t}) \le c_0n/5. \tag{65}$$
Finally, combining (63) and (65), the generalized Fano lemma (Lemma 3 in Yu (1997)) together with Lemma 6 leads to
$$\inf_{\hat l}\max_{\substack{l\in\{-1,1\}^n\\ \|\mu\|_2\le c_{\rm gap}(p/n)^{1/4}}}\mathbb{E}\,M(\hat l, l) \ge \frac{1}{3}\Big(1 - \frac{c_0n/5 + \log 2}{c_0n}\Big) \ge \frac{1}{4}, \tag{66}$$
where the last inequality holds when $n\ge C_n$ for large enough $C_n$.

• When $nt^2 > p/4$: we fix $\mu = (t, 0, \dots, 0)^\intercal\in\mathbb{R}^p$, introduce, for $l\in\{-1,1\}^n$,
$$X_l = \mu l^\intercal,$$
and denote by $P_l$ the probability measure of $Y$ when $Y = X_l + Z$ with $Z\overset{iid}{\sim}N(0,1)$. Based on the calculation in the proof of Theorem 4, the KL divergence between $P_l$ and $P_{l'}$ satisfies
$$D(P_l\|P_{l'}) = \frac{t^2\|l-l'\|_2^2}{2} \le 2t^2n.$$
Applying the generalized Fano lemma on $\{-1,1\}^n$, we have
$$\inf_{\hat l}\max_{l\in\{-1,1\}^n}\mathbb{E}\,M(\hat l, l) \ge \frac{1}{3}\Big(1 - \frac{2t^2n + \log 2}{c_0n}\Big).$$
When $nt^2 > p/4$ and $t\le c_{\rm gap}(p/n)^{1/4}$, we know
$$\frac{1}{2}(p/n)^{1/2}\le t\le c_{\rm gap}(p/n)^{1/4},$$
thus
$$t \le \frac{\big(c_{\rm gap}(p/n)^{1/4}\big)^2}{\frac{1}{2}(p/n)^{1/2}} = 2c_{\rm gap}^2.$$
Finally, if $n\ge C_n$ for some large constant $C_n>0$ and $c_{\rm gap}$ is small enough (so that, in particular, $t\le 1$ and $t^2\le t\le 2c_{\rm gap}^2$), we have
$$\frac{2t^2n + \log 2}{c_0n} \le \frac{4c_{\rm gap}^2n + \log 2}{c_0n}\le\frac{1}{4},$$
which implies
$$\inf_{\hat l}\max_{\substack{l\in\{-1,1\}^n\\ \|\mu\|_2\le c_{\rm gap}(p/n)^{1/4}}}\mathbb{E}\,M(\hat l, l)\ge\frac{1}{4}.$$
To sum up, combining the two cases finishes the proof of this theorem.
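As a small side check (ours), the identity used at the start of this proof, that the label-swap-minimized misclustering proportion equals $\frac{1}{2n}\min\{\|l_1-l_2\|_1, \|l_1+l_2\|_1\}$ for $\pm1$ label vectors, can be verified directly; the snippet below is our own illustration.

```python
# Quick check (ours): for +/-1 label vectors, the misclustering proportion
# minimized over the global label swap equals (1/(2n)) * min(||l1-l2||_1, ||l1+l2||_1).
import numpy as np

rng = np.random.default_rng(4)
n = 12
l1 = rng.choice([-1.0, 1.0], size=n)
l2 = rng.choice([-1.0, 1.0], size=n)

M_def = min(np.mean(l1 != l2), np.mean(l1 != -l2))                  # definition of M
M_norm = min(np.abs(l1 - l2).sum(), np.abs(l1 + l2).sum()) / (2 * n)
print(M_def, M_norm)   # identical values
```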
References

Alzahrani, T. and Horadam, K. (2016). Community detection in bipartite networks: Algorithms and case studies. In Complex Systems and Networks, pages 25-50. Springer.
Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis. Wiley, Hoboken, NJ, 3rd edition.
Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3):243-272.
Azizyan, M., Singh, A., and Wasserman, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems, pages 2139-2147.
Balakrishnan, S., Xu, M., Krishnamurthy, A., and Singh, A. (2011). Noise thresholds for spectral clustering. In Advances in Neural Information Processing Systems, pages 954-962.
Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120-135.
Borg, I. and Groenen, P. J. (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.
Boutsidis, C., Zouzias, A., Mahoney, M. W., and Drineas, P. (2015). Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045-1062.
Bura, E. and Pfeiffer, R. (2008). On the distribution of the left singular vectors of a random matrix and its applications. Statistics & Probability Letters, 78(15):2275-2280.
Cai, T., Cai, T. T., Liao, K., and Liu, W. (2015a). Large-scale simultaneous testing of cross-covariance matrix with applications to PheWAS. Technical report.
Cai, T. T., Li, X., and Ma, Z. (2015b). Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics.
Cai, T. T., Ma, Z., and Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074-3110.
Candes, E., Sing-Long, C. A., and Trzasko, J. D. (2013). Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19):4643-4657.
Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936.
Candes, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772.
Candes, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080.
Capitaine, M., Donati-Martin, C., and Feral, D. (2009). The largest eigenvalues of finite rank deformation of large Wigner matrices: convergence and nonuniversality of the fluctuations. The Annals of Probability, pages 1-47.
Chatterjee, S. (2014). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177-214.
Chen, M., Gao, C., Ren, Z., and Zhou, H. H. (2013). Sparse CCA via precision adjusted iterative thresholding. arXiv preprint arXiv:1311.6186.
Cho, J., Kim, D., and Rohe, K. (2015). Asymptotic theory for estimating the singular vectors and values of a partially-observed low rank matrix with noise. arXiv preprint arXiv:1508.05431.
Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7:1-46.
Donoho, D. and Gavish, M. (2014). Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413-2440.
Dopico, F. M. (2000). A note on sin θ theorems for singular subspace variations. BIT Numerical Mathematics, 40(2):395-403.
Fan, J., Wang, W., and Zhong, Y. (2016). An ℓ∞ eigenvector perturbation bound and its application to robust covariance estimation. arXiv preprint arXiv:1603.03516.
Feldman, D., Schmidt, M., and Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434-1453. SIAM.
Gao, C., Ma, Z., Ren, Z., and Zhou, H. H. (2015). Minimax estimation in sparse canonical correlation analysis. The Annals of Statistics, 43(5):2168-2197.
Gao, C., Ma, Z., and Zhou, H. H. (2014). Sparse CCA: Adaptive estimation and computational barriers. arXiv preprint arXiv:1409.8565.
Gavish, M. and Donoho, D. L. (2014). The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040-5053.
Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70.
Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548-1566.
Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639-2664.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4):321-377.
Jin, J., Ke, Z. T., and Wang, W. (2015). Phase transitions for high dimensional clustering and related problems. arXiv preprint arXiv:1502.06952.
Jin, J. and Wang, W. (2014). Important feature PCA for high dimensional clustering. arXiv preprint arXiv:1407.5241.
Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682-693.
Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy entries. Journal of Machine Learning Research, 11(1):2057-2078.
Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215-237.
Liu, Z. and Vandenberghe, L. (2009). Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235-1256.
Melamed, D. (2014). Community structures in bipartite networks: A dual-projection approach. PloS ONE, 9(5):e97823.
O'Rourke, S., Vu, V., and Wang, K. (2013). Random perturbation of low rank matrices: Improving classical bounds. arXiv preprint arXiv:1311.2657.
Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39:1878-1915.
Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18.
Shabalin, A. A. and Nobel, A. B. (2013). Reconstruction of a low-rank matrix in the presence of Gaussian noise. Journal of Multivariate Analysis, 118:67-76.
Singer, A. and Cucuringu, M. (2010). Uniqueness of low-rank matrix completion by rigidity theory. SIAM Journal on Matrix Analysis and Applications, 31(4):1621-1641.
Stewart, G. W. (1991). Perturbation theory for the singular value decomposition. SVD and Signal Processing, II: Algorithms, Analysis and Applications, pages 99-109.
Stewart, M. (2006). Perturbation of the SVD in the presence of small singular values. Linear Algebra and its Applications, 419(1):53-77.
Sun, R. and Luo, Z.-Q. (2015). Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270-289. IEEE.
Vershynin, R. (2011). Spectral norm of products of random and deterministic matrices. Probability Theory and Related Fields, 150(3-4):471-509.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pages 210-268. Cambridge University Press, Cambridge.
von Luxburg, U., Belkin, M., and Bousquet, O. (2008). Consistency of spectral clustering. The Annals of Statistics, pages 555-586.
Vu, V. (2011). Singular vectors under random perturbation. Random Structures & Algorithms, 39(4):526-538.
Wang, R. (2015). Singular vector perturbation under Gaussian noise. SIAM Journal on Matrix Analysis and Applications, 36(1):158-177.
Wedin, P. (1972). Perturbation bounds in connection with singular value decomposition. BIT, 12:99-111.
Weyl, H. (1912). Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441-479.
Yang, D., Ma, Z., and Buja, A. (2014). Rate optimal denoising of simultaneously sparse and low rank matrices. arXiv preprint arXiv:1405.0338.
Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423-435. Springer.
Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102:315-323.
Appendix: Additional Proofs

In this Appendix, we provide the proofs of Theorem 7 (canonical correlation analysis) and of all the technical lemmas.
Proofs in Canonical Correlation Analysis

Proof of Theorem 7. The proof of Theorem 7 is relatively involved, so we divide it into steps.

1. (Analysis of the loss) Suppose the SVDs of $\hat S$ and $S$ are
$$\hat S = \hat U_S\hat\Sigma_S\hat V_S^\intercal,\qquad \hat U_S\in\mathbb{O}_{p_1},\; \hat\Sigma_S\in\mathbb{R}^{p_1\times p_2},\; \hat V_S\in\mathbb{O}_{p_2},$$
$$S = U_S\Sigma_SV_S^\intercal,\qquad U_S\in\mathbb{O}_{p_1,r},\; \Sigma_S\in\mathbb{R}^{r\times r},\; V_S\in\mathbb{O}_{p_2,r} \tag{67}$$
(note that $\mathrm{rank}(S)=r$). Recall that $A = \Sigma_X^{-1/2}U_S$ and $\hat A = \hat\Sigma_X^{-1/2}\hat U_{S,[:,1:r]}$. Applying invertible transformations to all $X$'s or $Y$'s does not change the loss of the procedure, so without loss of generality we may assume $\Sigma_X = I_{p_1}$. Under this assumption, we have the following expansions for the losses $L_F$ and $L_{\rm sp}$:
$$\begin{aligned}
L_F(\hat A, A) &= \min_{O\in\mathbb{O}_r}\mathbb{E}_{X^*}\big\|(\hat AO - A)^\intercal X^*\big\|_2^2 = \min_{O\in\mathbb{O}_r}\mathbb{E}_{X^*}\,\mathrm{tr}\big((\hat AO-A)^\intercal X^*(X^*)^\intercal(\hat AO-A)\big)\\
&= \min_{O\in\mathbb{O}_r}\mathrm{tr}\big((\hat\Sigma_X^{-1/2}\hat U_{S,[:,1:r]}O - U_S)^\intercal I_{p_1}(\hat\Sigma_X^{-1/2}\hat U_{S,[:,1:r]}O - U_S)\big) = \min_{O\in\mathbb{O}_r}\big\|\hat\Sigma_X^{-1/2}\hat U_{S,[:,1:r]}O - U_S\big\|_F^2\\
&\le 2\min_{O\in\mathbb{O}_r}\Big\{\big\|\hat U_{S,[:,1:r]}O - U_S\big\|_F^2 + \big\|(\hat\Sigma_X^{-1/2}-I)\hat U_{S,[:,1:r]}O\big\|_F^2\Big\}\\
&\le 4\big\|\sin\Theta(\hat U_{S,[:,1:r]}, U_S)\big\|_F^2 + 2\big\|(\hat\Sigma_X^{-1/2}-I)\hat U_{S,[:,1:r]}O\big\|_F^2 \qquad\text{(by Lemma 1)}.
\end{aligned}$$
Similarly,
$$L_{\rm sp}(\hat A, A) \le 4\big\|\sin\Theta(\hat U_{S,[:,1:r]}, U_S)\big\|^2 + 2\big\|\hat\Sigma_X^{-1/2}-I\big\|^2. \tag{68}$$
We use the bold symbols $\mathbf{X}\in\mathbb{R}^{p_1\times n}$, $\mathbf{Y}\in\mathbb{R}^{p_2\times n}$ to denote the $n$ samples. Since $\hat\Sigma_X = \frac{1}{n}\mathbf{X}\mathbf{X}^\intercal$, where $\mathbf{X}$ is an i.i.d. Gaussian matrix, random matrix theory (Corollary 5.35 in Vershynin (2012)) gives constants $C, c$ such that
$$1 - C\sqrt{p_1/n} \le \sigma_{\min}(\hat\Sigma_X) \le \sigma_{\max}(\hat\Sigma_X) \le 1 + C\sqrt{p_1/n} \tag{69}$$
with probability at least $1 - C\exp(-cp_1)$. Since, by the property of the correlation matrix and the assumption of the theorem,
$$1 \ge \sigma_r^2(S) \ge \frac{C_{\rm gap}p_1}{n},$$
we can take $C_{\rm gap}>0$ large enough to ensure $n\ge Cp_1$ for any given large $C>0$. In addition,
$$\mathbb{P}\left(\big\|\hat\Sigma_X^{-1/2} - I\big\|^2 \le Cp_1/n \le \frac{Cp_1}{n\sigma_r^2(S)}\right) \ge 1 - C\exp(-cp_1). \tag{70}$$
Moreover, since $\hat U_{S,[:,1:r]}O\in\mathbb{O}_{p_1,r}$, we have $r\big\|\hat\Sigma_X^{-1/2}-I\big\|^2 \ge \big\|(\hat\Sigma_X^{-1/2}-I)\hat U_{S,[:,1:r]}O\big\|_F^2$, thus
$$\mathbb{P}\left(\big\|(\hat\Sigma_X^{-1/2}-I)\hat U_{S,[:,1:r]}O\big\|_F^2 \le Cp_1r/n \le \frac{Cp_1r}{n\sigma_r^2(S)}\right) \ge 1 - C\exp(-cp_1). \tag{71}$$
The central goal of the analysis is now to bound
$$\big\|\sin\Theta(\hat U_{S,[:,1:r]}, U_S)\big\|^2 \quad\text{and}\quad \big\|\sin\Theta(\hat U_{S,[:,1:r]}, U_S)\big\|_F^2,$$
namely to compare the singular spectra of the population $S = \Sigma_X^{-1/2}\Sigma_{XY}\Sigma_Y^{-1/2}$ and the sample version $\hat S = \hat\Sigma_X^{-1/2}\hat\Sigma_{XY}\hat\Sigma_Y^{-1/2}$ in $\sin\Theta$ distance. Since $\sin\Theta(\hat U_{S,[:,1:r]}, U_S)$ is $r\times r$, $\|\sin\Theta(\hat U_{S,[:,1:r]}, U_S)\|_F^2 \le r\|\sin\Theta(\hat U_{S,[:,1:r]}, U_S)\|^2$, so we only need to consider the $\sin\Theta$ spectral distance below.

2. (Reformulation of the model set-up) In this step we transform $X, Y$ into a more convenient form to simplify the subsequent analysis. First, any invertible affine transformation applied to $X$ and $Y$ separately does not essentially change the problem. We apply the following transformations:
$$X \to [U_S\;U_{S\perp}]^\intercal\Sigma_X^{-1/2}X,\qquad Y \to (I_{p_2}-\Sigma_S^2)^{-1/2}[V_S\;V_{S\perp}]^\intercal\Sigma_Y^{-1/2}Y.$$
A simple calculation shows that after the transformation, $\mathrm{Var}(X)$, $\mathrm{Var}(Y)$ and $\mathrm{Cov}(X,Y)$ become $I_{p_1}$, $(I_{p_2}-\Sigma_S^2)^{-1}$ and $\Sigma_S(I_{p_2}-\Sigma_S^2)^{-1/2}$, respectively. Therefore, without loss of generality we can assume that
$$\Sigma_X = I_{p_1},\qquad \Sigma_Y = (I_{p_2}-\Sigma_S^2)^{-1}, \tag{72}$$
$$S\in\mathbb{R}^{p_1\times p_2}\ \text{is diagonal, with } S_{ij} = \begin{cases}\sigma_i(S), & i=j=1,\dots,r,\\ 0, & \text{otherwise},\end{cases} \tag{73}$$
throughout the proof. (It will be explained shortly why we transform $\Sigma_Y$ into this form.)

Since $X, Y$ are jointly Gaussian, we can relate them as $Y = W^\intercal X + Z$, where $W\in\mathbb{R}^{p_1\times p_2}$ is a fixed matrix and $X\in\mathbb{R}^{p_1}$, $Z\in\mathbb{R}^{p_2}$ are independent random vectors. A direct calculation gives $\Sigma_{XY} = \mathbb{E}XY^\intercal = \Sigma_XW$ and $\Sigma_Y = W^\intercal\Sigma_XW + \mathrm{Var}(Z)$. Combining with (72), we obtain
$$W = \Sigma_X^{-1}\Sigma_{XY} = S\Sigma_Y^{1/2}\ \text{is diagonal},\qquad W_{ij} = \begin{cases}S_{ii}(1-S_{ii}^2)^{-1/2}, & i=j,\\ 0, & \text{otherwise},\end{cases} \tag{74}$$
$$\mathrm{Var}(Z) = \Sigma_Y - \Sigma_Y^{1/2}S^\intercal S\Sigma_Y^{1/2} = \Sigma_Y^{1/2}(I_{p_2} - S^\intercal S)\Sigma_Y^{1/2} = I_{p_2}. \tag{75}$$
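As a side check (ours, not part of the proof), the reparametrization in (74)-(75) can be verified numerically: under the model $Y = W^\intercal X + Z$ with diagonal $W$ and standard normal $X, Z$, the population canonical correlations are $W_{ii}/\sqrt{1+W_{ii}^2}$, which matches the diagonal of $S$. The snippet below is our own illustration, with all names ours.

```python
# Numerical check (ours) of the reparametrization in Step 2: with X ~ N(0, I_p1),
# Z ~ N(0, I_p2) independent, W diagonal, and Y = W^T X + Z, the population matrix
# S = Sigma_X^{-1/2} Sigma_XY Sigma_Y^{-1/2} is diagonal with entries
# W_ii / sqrt(1 + W_ii^2).
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

p1, p2, r = 4, 6, 3
W = np.zeros((p1, p2))
W[range(r), range(r)] = [2.0, 1.0, 0.5]         # diagonal W with r nonzero entries

Sigma_X = np.eye(p1)
Sigma_XY = Sigma_X @ W                           # Cov(X, Y) = Sigma_X W
Sigma_Y = W.T @ Sigma_X @ W + np.eye(p2)         # Var(Y) = W^T Sigma_X W + Var(Z)

S = inv_sqrt(Sigma_X) @ Sigma_XY @ inv_sqrt(Sigma_Y)
d = W[range(r), range(r)]
print(np.round(np.linalg.svd(S, compute_uv=False)[:r], 4))  # singular values of S
print(np.round(d / np.sqrt(1 + d ** 2), 4))                 # W_ii / sqrt(1 + W_ii^2)
```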
In other words, $Z$ is i.i.d. standard normal and $W$ is diagonal. This rescaling makes the subsequent analysis considerably simpler.

3. (Expression for $\hat S$) In this step we move from the population to the samples and derive a useful expression for $\hat S$. We use bold symbols $\mathbf{X}\in\mathbb{R}^{p_1\times n}$, $\mathbf{Y}\in\mathbb{R}^{p_2\times n}$, $\mathbf{Z}\in\mathbb{R}^{p_2\times n}$ to denote the compiled data, so that $\mathbf{Y} = W^\intercal\mathbf{X} + \mathbf{Z}$. Denote the SVDs of $\mathbf{X}$ and $\mathbf{Y}$ by
$$\mathbf{X} = \hat U_X\hat\Sigma_X\hat V_X^\intercal,\qquad \mathbf{Y} = \hat U_Y\hat\Sigma_Y\hat V_Y^\intercal,$$
where $\hat U_X\in\mathbb{O}_{p_1}$, $\hat\Sigma_X\in\mathbb{R}^{p_1\times p_1}$, $\hat V_X\in\mathbb{O}_{n,p_1}$, $\hat U_Y\in\mathbb{O}_{p_2}$, $\hat\Sigma_Y\in\mathbb{R}^{p_2\times p_2}$, $\hat V_Y\in\mathbb{O}_{n,p_2}$. Thus (with the $1/n$ normalizations of the sample covariances canceling in $\hat S$),
$$\hat\Sigma_X = \hat U_X\hat\Sigma_X^2\hat U_X^\intercal,\qquad \hat\Sigma_Y = \hat U_Y\hat\Sigma_Y^2\hat U_Y^\intercal,\qquad \hat\Sigma_{XY} = \hat U_X\hat\Sigma_X\hat V_X^\intercal\hat V_Y\hat\Sigma_Y\hat U_Y^\intercal.$$
Additionally,
$$\hat S = \hat\Sigma_X^{-1/2}\hat\Sigma_{XY}\hat\Sigma_Y^{-1/2} = \hat U_X\hat V_X^\intercal\hat V_Y\hat U_Y^\intercal.$$

4. (A useful characterization of the left singular space of $\hat S$) Since $\mathbf{X}$ is an i.i.d. Gaussian matrix, random matrix theory tells us that $\hat U_X$ and $\hat V_X$ are distributed according to the Haar measure on $\mathbb{O}_{p_1}$ and $\mathbb{O}_{n,p_1}$ respectively, and that $\hat U_X$, $\hat\Sigma_X$, $\hat V_X$, $\mathbf{Z}$ are independent. We can extend $\hat V_X\hat U_X^\intercal\in\mathbb{O}_{n,p_1}$ to an orthogonal matrix $\tilde R_X = [\hat V_X\hat U_X^\intercal,\; \hat R_{X\perp}]\in\mathbb{O}_n$ with $\hat R_{X\perp}\in\mathbb{O}_{n,n-p_1}$. Introduce $\tilde{\mathbf{Z}} = \mathbf{Z}\tilde R_X$, $\tilde{\mathbf{X}} = \mathbf{X}\tilde R_X$, $\tilde{\mathbf{Y}} = \mathbf{Y}\tilde R_X$. Then $\tilde{\mathbf{Y}}^\intercal$ can be written in the following clean form:
$$\tilde{\mathbf{Y}}^\intercal = \tilde{\mathbf{X}}^\intercal W + \tilde{\mathbf{Z}}^\intercal = \begin{bmatrix}\hat U_X\hat V_X^\intercal\mathbf{X}^\intercal W\\[2pt] \hat R_{X\perp}^\intercal\mathbf{X}^\intercal W\end{bmatrix} + \tilde{\mathbf{Z}}^\intercal = \begin{bmatrix}\hat U_X\hat\Sigma_X\hat U_X^\intercal W\\[2pt] 0\end{bmatrix} + \tilde{\mathbf{Z}}^\intercal := G + \tilde{\mathbf{Z}}^\intercal,\qquad \tilde{\mathbf{Z}}\overset{iid}{\sim}N(0,1), \tag{76}$$
where the upper block has $p_1$ rows, the lower block has $n-p_1$ rows, and $G$ has $p_2$ columns; $\tilde{\mathbf{Z}}$, $\hat U_X$, $\hat\Sigma_X$ are independent, and $G$, $\tilde{\mathbf{Z}}$ are independent.

Meanwhile, since $\hat V_Y$ contains the left singular vectors of $\mathbf{Y}^\intercal$,
$$\tilde R_X^\intercal\hat V_Y = [\hat V_X\hat U_X^\intercal,\; \hat R_{X\perp}]^\intercal\hat V_Y = \begin{bmatrix}\hat U_X\hat V_X^\intercal\hat V_Y\\[2pt] \hat R_{X\perp}^\intercal\hat V_Y\end{bmatrix}$$
contains the left singular vectors of $\tilde{\mathbf{Y}}^\intercal = [\hat V_X\hat U_X^\intercal,\; \hat R_{X\perp}]^\intercal\mathbf{Y}^\intercal$, which yields the following important characterization of $\hat S$:
$$\hat S\hat U_Y = \hat U_X\hat V_X^\intercal\hat V_Y \ \text{consists of the first } p_1 \text{ rows of the left singular vectors of } \tilde{\mathbf{Y}}^\intercal. \tag{77}$$
Suppose the SVD of $\tilde{\mathbf{Y}}^\intercal$ is
$$\tilde{\mathbf{Y}}^\intercal = \tilde U_Y\tilde\Sigma_Y\tilde V_Y^\intercal,\qquad \tilde U_Y\in\mathbb{O}_{n,p_2},\; \tilde\Sigma_Y\in\mathbb{R}^{p_2\times p_2},\; \tilde V_Y\in\mathbb{O}_{p_2}.$$
Further suppose the SVD of the first $p_1$ rows of $\tilde U_Y$ is
$$\tilde U_{Y,[1:p_1,:]} = \tilde U_{Y2}\tilde\Sigma_{Y2}\tilde V_{Y2}^\intercal,\qquad \tilde U_{Y2}\in\mathbb{O}_{p_1},\; \tilde\Sigma_{Y2}\in\mathbb{R}^{p_1\times p_2},\; \tilde V_{Y2}\in\mathbb{O}_{p_2}.$$
By the characterization (77) and the fact that right multiplication by $\hat U_Y\in\mathbb{O}_{p_2}$ does not change left singular vectors, we have
$$\hat U_S = \tilde U_{Y2},\quad\text{where }\hat U_S\text{ contains the left singular vectors of }\hat S\text{ and }\tilde U_{Y2}\text{ is the left singular space of }\tilde U_{Y,[1:p_1,:]}. \tag{78}$$
The characterization above is the baseline we use below to compare the spectra of $\hat S$ and $S$.

5. (Split of the $\sin\Theta$ distance) Recall from Step 1 that the central goal is now to bound the $\sin\Theta$ distance between the leading $r$ left singular vectors of $\hat S$ and of $S$. It is easy to see that the left singular space of $S$ is
$$U_S = \begin{bmatrix} I_r \\ 0 \end{bmatrix}\in\mathbb{O}_{p_1,r}. \tag{79}$$
By the triangle inequality for the $\sin\Theta$ distance (Lemma 1),
$$\big\|\sin\Theta\big(\hat U_{S,[:,1:r]}, U_S\big)\big\| \le \big\|\sin\Theta(U_G, U_S)\big\| + \big\|\sin\Theta\big(U_G, \hat U_{S,[:,1:r]}\big)\big\|, \tag{80}$$
where $U_G$ is defined below in (81). In Steps 6 and 7 we bound the two $\sin\Theta$ distances respectively.

6. (Left singular space of $G_{[1:p_1,1:r]}$) Recall the definition of $G$ in (76). Since only the first $r$ diagonal entries of $W$ are non-zero, only the submatrix $G_{[1:p_1,1:r]}$ of $G$ is non-zero. Suppose
$$G_{[1:p_1,1:r]} = U_G\Sigma_GV_G^\intercal,\qquad U_G\in\mathbb{O}_{p_1,r},\; \Sigma_G\in\mathbb{R}^{r\times r},\; V_G\in\mathbb{O}_r. \tag{81}$$
In this step we aim to show the following bound on the $\sin\Theta$ distance from $U_G$ to $U_S$: there exists $C_0>0$ such that whenever $n>Cp_1$,
$$\mathbb{P}\Big(\|\sin\Theta(U_G, U_S)\| \ge C\sqrt{p_1/n}\Big) \le C\exp(-cp_1), \tag{82}$$
and also
$$\mathbb{P}\left(\sigma_{\min}^2(G_{[1:p_1,1:r]}) \ge \frac{cn\sigma_r^2(S)}{1-\sigma_r^2(S)}\right) \ge 1 - C\exp(-cp_1). \tag{83}$$
Recall
$$G_{[1:p_1,1:r]} = \hat U_X\hat\Sigma_X\hat U_X^\intercal W_{[:,1:r]},\qquad W_{[:,1:r]} = \begin{bmatrix} W_{11} & & \\ & \ddots & \\ & & W_{rr}\\ & 0 & \end{bmatrix}. \tag{84}$$
Thus, if we split $\hat U_X$ by rows as
$$\hat U_X = \begin{bmatrix}\hat U_{X1}\\ \hat U_{X2}\end{bmatrix},\qquad \hat U_{X1}\in\mathbb{R}^{r\times p_1},\; \hat U_{X2}\in\mathbb{R}^{(p_1-r)\times p_1},$$
then
$$\big(\hat U_X\hat\Sigma_X\hat U_X^\intercal\big)_{[:,1:r]} = \begin{bmatrix}\hat U_{X1}\hat\Sigma_X\hat U_{X1}^\intercal\\ \hat U_{X2}\hat\Sigma_X\hat U_{X1}^\intercal\end{bmatrix}.$$
Furthermore, due to the diagonal structure of $W$, $G_{[1:p_1,1:r]}$ has the same left singular space as $(\hat U_X\hat\Sigma_X\hat U_X^\intercal)_{[:,1:r]}$. Since $\hat\Sigma_X$ consists of the singular values of the i.i.d. Gaussian matrix $\mathbf{X}\in\mathbb{R}^{p_1\times n}$, random matrix theory (Vershynin, 2012) gives
$$\mathbb{P}\Big(\sqrt n + C\sqrt{p_1} \ge \|\hat\Sigma_X\| \ge \sigma_{\min}(\hat\Sigma_X) \ge \sqrt n - C\sqrt{p_1}\Big) \ge 1 - C\exp(-cp_1). \tag{85}$$
On the event $\big\{\sqrt n + C\sqrt{p_1} \ge \|\hat\Sigma_X\| \ge \sigma_{\min}(\hat\Sigma_X) \ge \sqrt n - C\sqrt{p_1}\big\}$, and when $n\ge C_{\rm gap}p_1$ for some large $C_{\rm gap}$, we have
$$\sigma_{\min}\big(\hat U_{X1}\hat\Sigma_X\hat U_{X1}^\intercal\big) \ge \sigma_{\min}(\hat\Sigma_X) \ge \sqrt n - C\sqrt{p_1}\quad(\text{since }\hat U_{X1}\text{ has orthonormal rows}),$$
$$\big\|\hat U_{X2}\hat\Sigma_X\hat U_{X1}^\intercal\big\| = \big\|\hat U_{X2}\big(\hat\Sigma_X - \sqrt nI\big)\hat U_{X1}^\intercal\big\| \le C\sqrt{p_1}.$$
By Lemma 9, the left singular space of $(\hat U_X\hat\Sigma_X\hat U_X^\intercal)_{[:,1:r]}$, which is also the left singular space of $G_{[1:p_1,1:r]}$, satisfies
$$\sigma_{\max}^2\big(U_{G,[(r+1):p_1,1:r]}\big) \le \frac{\sigma_{\max}^2\big(\hat U_{X2}\hat\Sigma_X\hat U_{X1}^\intercal\big)}{\sigma_{\max}^2\big(\hat U_{X2}\hat\Sigma_X\hat U_{X1}^\intercal\big) + \sigma_{\min}^2\big(\hat U_{X1}\hat\Sigma_X\hat U_{X1}^\intercal\big)} \le \frac{Cp_1}{n}$$
when $n\ge C_{\rm gap}p_1$ for some large $C_{\rm gap}>0$. By the characterization of the $\sin\Theta$ distance in Lemma 1, this proves statement (82).

Since
$$\sigma_{\min}(G_{[1:p_1,1:r]}) = \sigma_r\big(\hat U_X\hat\Sigma_X\hat U_X^\intercal W_{[:,1:r]}\big) \ge \sigma_r\big(\hat U_X\hat\Sigma_X\hat U_{X1}^\intercal\big)\cdot\sigma_{\min}\big(W_{[1:r,1:r]}\big) \ge \sigma_{\min}(\hat\Sigma_X)\cdot\frac{\sigma_r(S)}{\sqrt{1-\sigma_r^2(S)}}, \tag{86}$$
statement (83) follows from (85).
7. In this step we prove the following statement: there exists a constant $C_{\rm gap}$ such that whenever
$$\sigma_r^2(S) \ge C_{\rm gap}\,\frac{(p_1p_2)^{1/2} + p_1 + p_2^{3/2}n^{-1/2}}{n}, \tag{87}$$
we have
$$\mathbb{P}\left(\big\|\sin\Theta\big(\hat U_{S,[:,1:r]}, U_G\big)\big\|^2 \le \frac{Cp_1\big(n\sigma_r^2(S) + p_2\big)}{n^2\sigma_r^4(S)}\right) \ge 1 - C\exp(-c\,p_1\wedge p_2). \tag{88}$$
We note that, since $1\ge\sigma_r^2(S)$, condition (87) also implies $n\ge C_{\rm gap}p_1$. Recalling (83), with probability at least $1-C\exp(-cp_1)$,
$$\sigma_r^2(G) \ge \frac{cn\sigma_r^2(S)}{1-\sigma_r^2(S)} \ge cn\sigma_r^2(S) \ge cC_{\rm gap}\big((p_1p_2)^{1/2} + p_1 + p_2^{3/2}n^{-1/2}\big). \tag{89}$$
Conditioning on $G$ with $\sigma_r^2(G)$ satisfying (89), and applying Lemma 8 with $L = \tilde{\mathbf{Y}}^\intercal$, $n=n$, $p=p_2$, $d=p_1$, $r=r$, we obtain
$$\mathbb{P}\left(\big\|\sin\Theta\big(\hat U_{S,[:,1:r]}, U_G\big)\big\|^2 \le \frac{Cp_1\big(\sigma_r^2(G)+p_2\big)\big(1+\sigma_r^2(G)/n\big)}{\sigma_r^4(G)}\;\Big|\;G\right) \ge 1 - C\exp(-c\,p_1\wedge p_2)$$
whenever $\sigma_r^2(G) \ge C_{\rm gap}\big((p_1p_2)^{1/2} + p_1 + p_2^{3/2}n^{-1/2}\big)$ for a uniform constant $C_{\rm gap}$ large enough. Moreover, since $\|S\|\le 1$ and hence $\sigma_r^2(S)\le 1$, and since (89) shows that $\sigma_r^2(G)$ is at least a constant multiple of $n\sigma_r^2(S)$, we have, up to an adjustment of the constant $C$,
$$\frac{Cp_1\big(\sigma_r^2(G)+p_2\big)\big(1+\sigma_r^2(G)/n\big)}{\sigma_r^4(G)} \le \frac{Cp_1\big(n\sigma_r^2(S)+p_2\big)}{n^2\sigma_r^4(S)}.$$
The two displays above together establish (88). Finally, combining (80), (82), (88), (70), (71) and (68) completes the proof of Theorem 7.

The following Lemmas 8, 9 and 10 are used in the proof of Theorem 7. Specifically, Lemma 8 provides a $\sin\Theta$ upper bound for the singular vectors of a sub-matrix of a noisy low-rank matrix; Lemma 9 gives both upper and lower bounds for the singular values of a row-submatrix of a singular vector matrix; Lemma 10 gives a spectral norm upper bound for a matrix after a random projection.
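Outside the proof, the whitened-SVD construction of $\hat S$ from Step 3 is easy to reproduce numerically. The sketch below is our own illustration (all names ours): it estimates the leading canonical directions from simulated data in the canonical form of Step 2 and reports the spectral $\sin\Theta$ error of $\hat U_{S,[:,1:r]}$ against the population $U_S$ of Step 5.

```python
# Minimal sketch (ours) of the sample CCA construction analyzed in Theorem 7:
# whiten the sample cross-covariance, S_hat = Sigma_X^{-1/2} Sigma_XY Sigma_Y^{-1/2},
# and take its leading r left singular vectors as estimates of U_S.
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(2)
p1, p2, r, n = 10, 15, 2, 5000

# population model in the canonical form of Step 2: X ~ N(0, I), Y = W^T X + Z
W = np.zeros((p1, p2))
W[range(r), range(r)] = [1.5, 1.0]
X = rng.standard_normal((n, p1))
Y = X @ W + rng.standard_normal((n, p2))

Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
S_hat = inv_sqrt(Sx) @ Sxy @ inv_sqrt(Sy)
U_hat = np.linalg.svd(S_hat)[0][:, :r]            # estimated left singular space

U_S_perp = np.eye(p1)[:, r:]                      # complement of U_S = [I_r; 0]
sin_theta = np.linalg.norm(U_hat.T @ U_S_perp, 2) # = ||U_S_perp^T U_hat||
print(f"spectral sin-Theta error of U_hat vs U_S: {sin_theta:.3f}")
```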
Lemma 8. Suppose $L\in\mathbb{R}^{n\times p}$ ($n>p$) is a non-central random matrix such that $L = G+Z$, $Z\overset{iid}{\sim}N(0,1)$, where $G$ is all zero except for its top-left $d\times r$ block ($d\ge r$). Suppose the SVDs of $G_{[1:d,1:r]}$ and $L$ are
$$G_{[1:d,1:r]} = U_G\Sigma_GV_G^\intercal,\qquad U_G\in\mathbb{O}_{d,r},\; \Sigma_G\in\mathbb{R}^{r\times r},\; V_G\in\mathbb{O}_r;$$
$$L = \hat U\hat\Sigma\hat V^\intercal,\qquad \hat U\in\mathbb{O}_{n,p},\; \hat\Sigma\in\mathbb{R}^{p\times p},\; \hat V\in\mathbb{O}_p.$$
In addition, suppose $r<d<n$ and that the SVD of $\hat U_{[1:d,:]}$ is $\hat U_{[1:d,:]} = \hat U_2\hat\Sigma_2\hat V_2^\intercal$. There exist $C_{\rm gap}, C_0>0$ such that whenever
$$\sigma_r^2(G) = t^2 > C_{\rm gap}\big((pd)^{1/2} + d + p^{3/2}n^{-1/2}\big),\qquad n\ge C_0p, \tag{90}$$
we have
$$\big\|\sin\Theta\big(\hat U_{2,[:,1:r]}, U_G\big)\big\| \le \frac{C\sqrt{d(t^2+p)(1+t^2/n)}}{t^2} \tag{91}$$
with probability at least $1 - C\exp(-c\,d\wedge p)$.

Lemma 9 (Spectral bounds for partial singular vectors). Suppose $L\in\mathbb{R}^{n\times p}$ with $n>p$, and let $L = U\Sigma V^\intercal$ be its SVD with $U\in\mathbb{O}_{n,p}$, $\Sigma\in\mathbb{R}^{p\times p}$, $V\in\mathbb{O}_p$. Then for any subset $\Omega\subseteq\{1,\dots,n\}$,
$$\sigma_{\max}^2\big(U_{[\Omega,:]}\big) \le \frac{\sigma_{\max}^2\big(L_{[\Omega,:]}\big)}{\sigma_{\max}^2\big(L_{[\Omega,:]}\big) + \sigma_{\min}^2\big(L_{[\Omega^c,:]}\big)}, \tag{92}$$
$$\sigma_{\min}^2\big(U_{[\Omega,:]}\big) \ge \frac{\sigma_{\min}^2\big(L_{[\Omega,:]}\big)}{\sigma_{\min}^2\big(L_{[\Omega,:]}\big) + \sigma_{\max}^2\big(L_{[\Omega^c,:]}\big)}. \tag{93}$$
Lemma 10 (Spectral norm bound under a random projection). Suppose $X\in\mathbb{R}^{n\times m}$ is a fixed matrix with $\mathrm{rank}(X)=p$, and suppose $R\in\mathbb{O}_{n,d}$ ($n>d$) has orthonormal columns distributed according to the Haar measure. There exists a uniform constant $C_0>0$ such that whenever $n\ge C_0d$, the following bound holds for uniform constants $C, c>0$:
$$\mathbb{P}\left(\|R^\intercal X\|^2 \ge \frac{p + C\sqrt{pd} + Cd}{n}\,\|X\|^2\right) \le C\exp(-cd). \tag{94}$$
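As an informal numerical illustration (ours, not part of the paper), the scaling in Lemma 10 can be probed by Monte Carlo: draw a Haar-distributed $R$ via the QR factorization of a Gaussian matrix and compare $\|R^\intercal X\|^2$ with $(p+\sqrt{pd}+d)/n\cdot\|X\|^2$. All names below are ours.

```python
# Monte Carlo sanity check (ours) of the Lemma 10 scaling: for a fixed rank-p X
# and a Haar-distributed R in O_{n,d}, ||R^T X||^2 should be on the order of
# (p + sqrt(p*d) + d)/n * ||X||^2 with high probability.
import numpy as np

rng = np.random.default_rng(3)
n, d, p, m = 2000, 20, 10, 50

# fixed matrix X of rank p
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, m))

norms = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # Haar-distributed R in O_{n,d}
    norms.append(np.linalg.norm(Q.T @ X, 2) ** 2)

scale = (p + np.sqrt(p * d) + d) / n * np.linalg.norm(X, 2) ** 2
print(f"max ||R^T X||^2 over trials = {max(norms):.2f}, reference scale = {scale:.2f}")
```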
Proofs of Technical Lemmas Proof of Lemma 1. • (Expressions) Suppose V | Vˆ = AΣB | is the SVD, where Σ = diag(σ1 , · · · , σr ), A, B ∈ Or . By definition of sin Θ distances, k sin Θ(Vˆ , V )k = sin(cos−1 (σr )) =
44
q p 2 (V | V ˆ ), 1 − σr2 = 1 − σmin
(95)
k sin Θ(Vˆ , V )k2F =
r X
2
sin (cos
−1
(σr )) =
i=1
r X
(1 −
σr2 )
i=1
=r−
r X
σi2 = r − kVˆ | V k2F .
i=1
(96) On the other hand, since Vˆ , V , V⊥ are orthonormal, kV xk22 − kVˆ⊥| V xk22 kVˆ⊥ V xk22 kVˆ | V xk22 = min = 1−max = 1−kVˆ⊥| V k2 , ) = minr 2 2 2 r r x∈R x∈R x∈R kxk2 kxk2 kxk2 (97) | | | 2 | |ˆ | | | | ˆ ˆ ˆ ˆ ˆ ˆ ˆ kV V kF = tr V V V V = tr V V V V = tr V V − V⊥ V⊥ V V = r −kV⊥ V k2F , (98) | | ˆ ˆ ˆ ˆ we conclude that k sin Θ(V , V )k = kV⊥ V k, k sin Θ(V , V )kF = kV⊥ V kF .
2 σmin (Vˆ | V
• (Triangular Inequality) Next we consider the triangular inequality under spectral norm (24). Denote x = k sin Θ(V1 , V2 )k,
y = k sin Θ(V1 , V3 )k.
For each i = 1, 2, 3, we can expand Vi to the full orthogonal matrix as [Vi Vi⊥ ] ∈ Op . Thus, (97) (95) p | 2 2 V2 ) = 1 − σmin (V1| V2 ) = x2 . 1 − x2 , σmax (V1⊥ σmin (V1| V2 ) = Similarly σmin (V1| V3 ) =
p
1 − y2,
| V3 ) = y 2 . σ12 (V1⊥
Thus, | | V3 ) V3 ) ≥ σmin (V2| V1 V1| V3 ) − σmax (V2| V1⊥ V1⊥ σmin (V2| V3 ) =σmin (V2| V1 V1| V3 + V2| V1⊥ V1⊥ | V3 ) ≥σmin (V2| V1 )σmin (V1| V3 ) − σmax (V2| V1⊥ ) · σmax (V1⊥ p ≥ (1 − x2 )(1 − y 2 ) − xy.
Therefore, r k sin Θ(V2 , V3 )k ≤
1−
p 2 (1 − x2 )(1 − y 2 ) − xy . +
(99)
Now, we discuss under two situations. p 1. If (1 − x2 )(1 − y 2 ) ≥ xy, we have p 2 p 1− (1 − x2 )(1 − y 2 ) − xy = x2 +y 2 −x2 y 2 +2xy (1 − x2 )(1 − y 2 )−x2 y 2 ≤ (x+y)2 . +
Thus, k sin Θ(V2 , V3 )k ≤ x + y. p 2. If (1 − x2 )(1 − y 2 ) > xy, we have x2 + y 2 > 1. Provided that 0 ≤ x, y ≤ 1, this implies x + y > 1. Thus, k sin Θ(V2 , V3 )k ≤ 1 ≤ x + y. 45
To sum up, we always have (24). The proof for triangular inequality under Frobenius norm is slightly simpler. k sin Θ(V2 , V3 )kF
(96),(98)
=
| | kV2⊥ V3 kF = kV2⊥ (PV1 + PV1⊥ )V3 kF
| | | ≤kV2⊥ V1 V1| V3 kF + kV2⊥ V1⊥ V1⊥ V 3 kF | | | ≤kV2⊥ V1 kF · kV1| V3 k + kV2⊥ V1⊥ k · kV1⊥ V3 kF | | ≤kV2⊥ V1 kF + kV1⊥ V3 kF ≤ k sin Θ(V1 , V3 )kF + k sin Θ(V1 , V2 )kF .
(100) • (Equivalence with Other Metrics) Since all metrics mentioned in Lemma 1 is rotation invariant, i.e. for any J ∈ Op , sin Θ(Vˆ , V ) = sin Θ(J Vˆ , JV ), so does the other metrics. Without loss of generality, we can assume that " # Ir V = 0(p−r)×r p×r
In this case, Dsp (Vˆ , V ) = inf kVˆ − V Ok ≥ kVˆ[(r+1):p,:] k = kV⊥| Vˆ k = k sin Θ(Vˆ , V )k. O∈Or
Recall V | Vˆ = AΣB | is the singular decomposition. Dsp (Vˆ , V ) = inf kVˆ − V Ok ≤ kVˆ − V AB | k O∈Or r 2 2 q
≤ V | Vˆ − V AB | + V⊥| Vˆ − V AB | = kA(Σ − Ir )B | k2 + kVˆ⊥| Vˆ k2 q q (95) √ 2 2 ˆ ≤ (1 − σr ) + k sin Θ(V , V )k ≤ 1 − σr2 + k sin Θ(Vˆ , V )k2 ≤ 2k sin Θ(Vˆ , V )k. Similarly, we can show k sin Θ(Vˆ , V )kF ≤ DF (Vˆ , V ) ≤
√
2k sin Θ(Vˆ , V )kF .
For kVˆ Vˆ | − V V | k, one can show
ˆ ˆ|
|
|
V V − V V | ≥ V⊥ Vˆ Vˆ | − V⊥ V V | =k(V⊥| Vˆ )Vˆ | k = kV⊥| Vˆ k = k sin Θ(Vˆ , V )k since V is orthonormal. Besides, Vˆ Vˆ | − V V | = (PV + PV⊥ ) Vˆ Vˆ | (PV + PV⊥ ) − V V | =V (V | Vˆ )(V | Vˆ )| − Ir V | + V⊥ V⊥| Vˆ Vˆ | V V | + V V | Vˆ Vˆ | V⊥ V⊥| + V⊥ V⊥| Vˆ Vˆ | V⊥ V⊥| . For any vector x ∈ Rp , we denote x1 = V | x, x2 = V⊥| x. We also denote t = kV⊥| V k. 2 (V | V ˆ ) = 1 − σmax (V⊥ Vˆ ), then 1 ≥ Recall that we have proved in part 1 that σmin 46
√ σmax (V | Vˆ ) ≥ σmin (V | Vˆ ) ≥ 1 − t2 . Thus, | ˆ ˆ| | |ˆ ˆ| |ˆ | | |ˆ V ) − I x + x V V x1 V )(V V V V V − V V x = x (V x 1 r 1 2 ⊥ | | |ˆ |ˆ ˆ| | + x1 V V V V ⊥ x2 + x2 V ⊥ V Vˆ V⊥ x2 p ≤t2 kx1 k22 + 2t 1 − t2 kx2 k2 kx1 k2 + t2 kx2 k22 p p ≤ t2 + t 1 − t2 kx1 k22 + kx2 k22 = t 1 + 1 − t2 kxk22 . By Cauchy-Schwarz inequality, √ p p 2 (t2 + 1 − t2 ) = 2t, t t + 1 − t2 ≤ t which implies
√
√
ˆ ˆ|
V V − V V | ≤ 2t = 2 sin Θ(Vˆ , V ) .
This has proved the equivalence between k sin θ(Vˆ , V )k and kVˆ Vˆ | − V V | k. Finally for kVˆ Vˆ | − V V | kF , one has kVˆ Vˆ | − V V | k2F =tr (Vˆ Vˆ | − V V | )2 = tr Vˆ Vˆ | Vˆ Vˆ | + V V | V V | − Vˆ Vˆ | V V | − V V | Vˆ Vˆ | =r + r − 2kVˆ | V k2F = 2k sin Θ(Vˆ , V )k2F . Proof of Lemma 2. Before starting the proof, we introduce some useful notations. For Pp1 ∧p2 any matrix M ∈ Rp1 ×p2 with the SVD M = i=1 ui σi (M )vi| we use Mmax(r) to denote Pr its leading r principle components, i.e. Mmax(r) = i=1 ui σi (M )vi| ; M− max(r) denotes the P 1 ∧p2 remainder, i.e. M− max(r) = pi=r+1 ui σi (M )vi| . 1. First, by a well-known fact about best low-rank matrix approximation, σa+b+1−r (X + Y ) =
min
M ∈Rp×n ,rank(M )≤a+b−r
kX + Y − M k.
Hence, σa+b+1−r (X + Y ) ≤ kX + Y − (Xmax(a−r) + Y )k = kX− max(a−r) k = σa+1−r (X); similarly σa+b+1−r (X + Y ) ≤ σb+1−r (Y ). 2. When we further have X | Y = 0 or XY | = 0, without loss of generality we can assume X | Y = 0. Then the column space of X and Y are orthogonal, and rank(X + Y ) = rank(X) + rank(Y ) = a + b, which means a + b ≤ n. Next, note that (X + Y )| (X + Y ) = X | X + Y | Y + X | Y + Y | X = X | X + Y | Y, 47
if we note λi (·) as the r-th largest eigenvalue of the matrix, then we have σi2 (X + Y ) =λi ((X + Y )| (X + Y )) = λi (X | X + Y | Y ) ≥ max(λi (X | X), λi (Y | Y ))
(since X | X, Y | Y are semi-positive definite)
= max(σi2 (X), σi2 (Y )). σ12 (X + Y ) =λ1 ((X + Y )| (X + Y )) = kX | X + Y | Y k ≤kX | Xk + kY | Y k = σ12 (X) + σ12 (Y ). Proof of Lemma 3. Since "
# 2 + c2 ab + cd a , A| A = ab + cd b2 + d2 If its eigenvalues are λ1 ≥ λ2 , clearly λ1 + λ2 = a2 + b2 + c2 + d2 , λ1 λ2 = (ad − bc)2 . We can solve that its singular values λ1 , λ2 are p a2 + b2 + c2 + d2 ± (a2 + b2 + c2 + d2 )2 − 4(ad − bc)2 {λ1 , λ2 } = 2 (101) p 2 2 2 2 2 a + b + c + d ± (a + c2 − b2 − d2 )2 + 4(ab − cd)2 = 2 √ 2 2 2 2 2 +c2 −b2 −d2 ) a2 +b2 +c2 +d2 − (a2 +c2 −b2 −d2 )2 +4(ab−cd)2 ≤ a +b +c +d −(a = b2 + d2 . Thus, λ2 = 2 2 Also, λ2 ≥
a2 + b2 + c2 + d2 − (a2 + c2 − b2 − d2 ) − 2|ab − cd| ≥ b2 + d2 − |ab − cd|. 2
Thus, a2 + c2 − λ2 ≤ a2 + c2 − b2 − d2 + |ab − cd| ≤2(a2 − b2 − d2 ) − (a2 − b2 − c2 − d2 ) + |ab + cd| ≤2(a2 − d2 − b2 ∧ c2 ) + |ab + cd| "
# v 12 By the definition of singular vectors, we have (A| A − λ2 I2 ) = 0. Thus, v22 (a2 + c2 − λ2 )v12 + (ab + cd)v22 = 0
48
2 + v 2 = 1, we have Given v12 22
ab + cd ab + cd ≥p |v12 | = p (a2 + c2 − λ2 )2 + (ab + cd)2 (2(a2 − d2 − b2 ∧ c2 ) + |ab + cd|)2 + (ab + cd)2 ab + cd ≥p 4(a2 − d2 − b2 ∧ c2 )2 + 4(a2 − d2 − b2 ∧ c2 )|ab + cd| + 2(ab + cd)2 ab + cd ≥p 2 10 max {(a − d2 − b2 ∧ c2 )2 , (ab + cd)2 } 1 ab + cd ≥√ ∧1 . 10 a2 − d2 − b2 ∧ c2 This finished the proof of Lemma 3.
Proof of Lemma 4. Suppose the SVD of X is X = U ΣV | , where U ∈ Op1 ,r , V ∈ Op2 ,r , Σ = diag(σ1 , · · · , σr ) ∈ Rr×r . We can extend V ∈ Op2 ,r into the full p2 ×p2 orthogonal matrix [V V⊥ ] = V0 ∈ Op2 , V⊥ ∈ Op2 ,p2 −r . For convenience, denote Y V , Y1 . Since, EY | Y = X | X + p1 Ip2 = V Σ2 V | + p1 Ip2 ,
EV | Y | Y V = Σ2 + p1 Ip2 ,
(102)
we introduce fixed normalization matrix M ∈ Rp2 ×r as 1 (σ12 (X) + p1 )− 2 .. . M = . 1 (σr2 (X) + p1 )− 2 r×r By (102), this design yields M | EY1| Y1 M = Ir . In other words, by right multiplying M to Y1 , we can normalize its second moment. Now we are ready to show (43), (44) and (45). 1. We target on (43) in this step. Note that the maximum diagonal entry of M is 1 (σr2 (X) + p1 )− 2 , thus σr2 (Y1 ) ≥(σr2 (X) + p1 )σr2 (Y1 M ) ≥ (σr2 (X) + p1 )σr (M | Y1| Y1 M ) ≥(σr2 (X) + p1 ) − (σr2 (X) + p1 ) kM | Y1| Y1 M − Ir k , (43) could be implied by P (kM | Y1| Y1 M − Ir k ≤ x) ≥ 1 − C exp Cr − c(σr2 (X) + p1 )x ∧ x2 . 49
(103)
The main idea to proceed is to use ε-net to split the spectral norm deviation control to single random variable deviation control. Then use Hanson-Wright inequality to control the single random variable. To be specific, for any unit vector u ∈ Rr , by expansion Y1 = Y V = XV + ZV , we have u| M | Y1| Y1 M u − u| Ir u = u| M | Y1| Y1 M u − Eu| M | Y1| Y1 M u = (u| M | V | X | XV M u − Eu| M | V | X | XV M u) + (2u| M | V | X | ZV M u − E2u| M | V | X | ZV M u)
(104)
+ (u| M | V | Z | ZV M u − Eu| M | V | Z | ZV M u) =2(XV M u)| ZV (M u) + (V M u)| (Z | Z − p1 Ip2 )(V M u). We shall emphasize that the only random variable in the equation above is Z ∈ Rp1 ×p2 . Our plan is to bound the two terms in (104) separately as follows. • For fixed unit vector u ∈ Rr , we vectorize Z ∈ Rp1 ×p2 into ~z ∈ Rp1 p2 as follows, ~z = (z11 , z12 , · · · , z1p2 , z21 , · · · z2p2 , · · · , zp1 1 , · · · , zp1 p2 )| . We also repeat the (V M u)(V M u)| block for p1 times and introduce (V M u)(V M u)| .. ~ = ∈ R(p1 p2 )×(p1 p2 ) . D . (M u)(M u)| ~ z − E~z| D~ ~ z. Besides, It is obvious that (V M u)| (Z | Z − Ip1 )(V M u) = ~z| D~ ~ = k(V M u)(V M u)| k = kM uk2 ≤ kM k2 kuk2 = (σ 2 (X) + p1 )−1 , kDk 2 2 r ~ 2 = p1 k(V M u)(V M u)| k2 = p1 kM uk4 ≤ p1 (σ 2 (X) + p1 )−2 . kDk F F 2 r By Hanson-Wright Inequality (Theorem 1 in Rudelson and Vershynin (2013)), n o ~ ~ z > x z − E~z| D~ P {|(V M u)| (Z | Z − p1 Ip2 )(V M n)| > x} = P ~z| D~ 2 2 x (σr (X) + p1 )2 2 , x(σr (X) + p1 ) , ≤2 exp −c min p1 where c only depends on τ . • Next, we bound (XV M u)| Z(V M u) = tr(Z(V M u)(XV M u)| ) = ~z| vec(V M u(XV M u)| ),
50
where ~z and vec(V M u(XV M u)| ) are the vectorized Z and (V M u(XV M u)| ). Since 1 σ1 (X)(σ12 (X) + p1 )− 2 .. , XV M = U . − 21 2 σr (X)(σr (X) + p1 ) we know kXV M k ≤ 1, and kvec(V M u(XV M u)| )k22 = k(M V u)(XV M u)| k2F =kM uk22 · kXV M uk22 ≤ kM k2 ≤ (σr2 (X) + p)−1 . By the basic property of i.i.d. sub-Gaussian random variables, we have P (|(XV M u)| Z(V M u)| > x) ≤ C exp −cx2 (σr2 (X) + p1 ) , where C, c only depends on τ . The two bullet points above and (104) implies for any fixed unit vector u ∈ Rr , P (|u| M | Y1| Y1 M u − u| Ir u| > x) ≤ C exp −c σr2 (X) + p1 x2 ∧ x ,
(105)
for all x > 0. Here the C, c above only depends on τ . Next, the ε-net argument (Lemma 5) leads to P (kM | Y1| Y1 M − Ir k > x) ≤ C exp Cr − c σr2 (X) + p1 x2 ∧ x .
(106)
In other words, (103) holds, which implies (43). 2. In order to prove (44), we use the following fact about best rank-r approximation of Y, σr (Y ) = max kY − Bk ≥ kY − Y · [V 0]k = σmax (Y V⊥ ). rank(B)≤r
to switch our focus from σr (Y ) to σmax (Y V⊥ ). Next,
2 σmax (Y V⊥ ) = σmax (V⊥| Y | Y V⊥ ) ≤ p1 + V⊥| Y | Y V⊥ − EV⊥| Y | Y V⊥ . Note that EV⊥| Y | Y V⊥ = p1 Ip2 −r , based on essentially the same procedure as the proof for (43), one can show that
−1 | | 2
P p−1 . 1 V⊥ Y Y V⊥ − Ep1 V⊥ Y Y V⊥ ≥ x ≤ C exp Cp2 − cp1 x ∧ x Then we obtain (44) by combining the two inequalities above. 51
3. Finally, we consider kPY V Y V⊥ k. Since kPY V Y V⊥ k =kPY V M Y V⊥ k = k(Y V M )((Y V M )| (Y V M ))−1 (Y V M )| Y V⊥ k (37)
≤
−1 σmin (Y
(107) |
|
|
V M )kM V Y Y V⊥ k
We analyze σmin (Y V M ) and kM | V | Y | V⊥ k separately below. Since 2 σmin (Y V M ) = σmin (M | V | Y | Y V M ) ≥ 1 − kM | V | Y | Y V M − Ir k,
by (103), we know there exists C, c only depending on τ such that 2 (Y V M ) ≥ 1 − x ≥ 1 − exp Cr − c(σr2 (X) + p1 )x ∧ x2 . P σmin Set x = 1/2, we could choose Cgap large enough, but only depends on τ , such that whenever σr2 (X) ≥ Cgap p2 ≥ Cgap r, Cr − c(σr2 (X) + p1 )x ∧ x2 ≤ − 8c (σr2 (X) + p1 ). Under such setting, 1 2 P σmin (Y V M ) ≥ ≥ 1 − C exp −c(σr2 (X) + p1 ) . (108) 2 For kM | V | Y | Y V⊥ k, since XV⊥ = 0, we have the following decomposition, M | V | Y | Y V⊥ = M | V | (X + Z)| (X + Z)V⊥ = M | V | X | ZV⊥ + M | V | Z | ZV⊥ . Follow the similar idea of the proof for (43), we can show for any unit vectors u ∈ Rr , v ∈ Rp2 −r , P (|u| M | V | X | ZV⊥ v| ≥ x) ≤ C exp −cx2 /k(V⊥ v)(u| M | V | X | )k2F ≤ C exp(−cx2 ). P (|u| M | V | Z | ZV⊥ v| ≥ x) = P (|u| M | V | (Z | Z − p1 Ip2 )V⊥ v| ≥ x) p ≤C exp(−c min(x2 , σr2 + p1 x)). By the -net argument again (Lemma 4), we have n o p P (kM | V | Y | Y V⊥ k ≥ x) ≤ C exp Cp2 − c min(x2 , σr2 + p1 x) . Combining (107), (108) and (109), we obtain (45).
(109)
Proof of Lemma 5. First, based on Lemma 2.5 in Vershynin (2011), there exists ε-nets WL in Bp1 , WR in Bp2 , namely for any u ∈ Bp1 , v ∈ Bp2 , there exists u0 ∈ WL , v0 ∈ WR such
52
that ku0 − uk2 ≤ ε, kv0 − vk2 ≤ ε and |WL | ≤ (1 + 2/ε)p1 , |WR | ≤ (1 + 2/ε)p2 . Especially we choose ε = 1/3. Under the event that Q = {|u| Kv| ≥ t, ∀u ∈ WL , v ∈ WR } , denote (u∗ , v ∗ ) = arg maxu∈Bp1 ,v∈Bp2 |u| Kv|, α = maxu∈Bp1 ,v∈Bp2 |u| Kv|, then α = kKk. According to the definition of 13 -net, there exists u∗0 ∈ WL , v0∗ ∈ WR such that ku∗ − u∗0 k2 ≤ 1/3, kv ∗ − v0∗ k2 ≤ 1/3. Then, α = |(u∗ )| Kv ∗ | ≤ |(u∗0 )| Kv0∗ | + |(u∗ − u0 )| Kv0∗ | + |(u∗ )| K(v ∗ − v0 )| 2 ≤t + ku∗ − u0 k2 · kKk · kv0∗ k2 + ku∗ k2 · kKk · kv ∗ − v0 k2 ≤ t + α 3 Thus, α ≤ 3t when Q happens. Finally, since X X P(Q) ≤ P (|u| Kv| ≥ t) ≤ 7p1 · 7p2 · maxn P (|u| Ku| ≥ t) , u∈B
u∈WL v∈WR
which finished the proof of this lemma.
Proof of Lemma 6. For each representative Pφ ∈ co(Pφ ), we suppose m(φ) is a measure on Pφ such that Z Pφ = Pθ dm(φ) . (110) Pφ
Thus, sup EPθ θ∈Θ
Z
h
Z i h i 2 ˆ ˆ d (T (θ), φ) = sup sup EPθ d (φ, φ) ≥ sup 2
= sup φ∈Φ Pφ (110)
Z
= sup φ∈Φ Rp
φ∈Φ Pφ
φ∈Φ θ∈Pφ
Z
h
Z i 2 (φ) ˆ d (φ, φ) dPθ dm = sup
Rp
h
φ∈Φ Rp
ˆ d (φ, φ) 2
iZ
h i ˆ dm(φ) EPθ d2 (φ, φ)
dPθ dm(φ)
Pφ
h i ˆ dPφ = sup EP [d2 (φ, φ)]. ˆ d2 (φ, φ) φ φ∈Φ
Proof of Lemma 7. The direct calculation for D(P¯V,t ||P¯V 0 ,t ) is relatively difficult, thus we detour by introducing the similar density to P¯V,t as follows, Z 1 p1 P˜V,t (Y ) = exp(−kY − 2tW V | k2F /2) · ( )p1 r/2 exp(−p1 kW k2F /2)dW. p1 p2 /2 p ×r 2π (2π) R 2 (111) We can see P˜V,t is another mixture of Gaussian distributions, thus it is indeed a density which sums up to 1. Since V ∈ On,r , V | V = Ir , 53
(112)
Denote Yi· as the i-th row of Y . Note that P˜V,t can be simplified as p r/2
p1 1 (2π)p1 (p2 +r)/2
Z
1 n exp − tr Y Y | − 2tY V W | − 2tW V | Y | 2 Rn×r o + 4t2 W W | + p1 W W | dW Z p r/2 1 n 4t2 p1 1 − V V | )Y | = exp − tr Y (I p 2 2 4t2 + p1 (2π)p1 (p2 +r)/2 Rn×r o 2t 2t + (4t2 + p1 )(W − 2 Y V )(W − 2 Y V )| dW. 4t + p1 4t + p1 ! p1 r/2 p 1 2 p1 /(4t + p1 ) 1X 4t2 | | = exp − Yi· (I − 2 V V )Yi· . 2 4t + p1 (2π)p1 p2 /2
(112) P˜V,t (Y ) =
(113)
i=1
From the calculation above we can see P˜V,t is actually joint normal, i.e. when Y ∼ P˜V,t , −1 ! 4t2 4t2 iid | | Yi· ∼ N 0, Ip2 − 2 VV = N 0, Ip2 + VV , i = 1 · · · , p1 . 4t + p1 p1 It is widely known that the KL-divergence between two p-dimensional multivariate Gaussians is D(N (µ0 , Σ0 )||N (µ1 , Σ1 )) det Σ1 1 | −1 Σ (µ − µ ) − p + log = tr Σ−1 + (µ − µ ) Σ . 1 1 0 1 0 0 1 2 det Σ0
(114)
We can calculate that for any two V, V 0 ∈ Op1 ,r , p1 4t2 4t2 0 0 | | ˜ ˜ D(PV,t ||PV 0 ,t ) = tr Ip2 − 2 VV Ip2 + V (V ) − p2 2 4t + p1 p1 4t2 4t2 16t4 p1 | 0 0 | | 0 0 | − 2 tr(V V ) + tr(V (V ) ) − tr(V V (V )(V ) ) = 2 4t + p1 p1 p1 (4t2 + p1 ) 16t4 (112) = r − kV | V 0 k2F 2 2(4t + p1 ) 16t4 Lemma 1 = k sin Θ(V, V 0 )k2F 2(4t2 + p1 ) Next, we show that P¯V,t (Y ) and P˜V,t (Y ) are very close in terms of calculating KLdivergence. To be specific, we show when Cr is large enough but with a uniform choice, there exists a uniform constant c such that 1 − 2 exp(−cp1 ) ≤
P¯V,t (Y ) ≤ 1 + 2 exp(−cp1 ), P˜V,t (Y )
54
∀Y ∈ Rp1 ×p2 .
(115)
According to (55) and (113), we know for any fixed Y , Z P¯V,t (Y ) exp −tr(Y Y | − 2tY V W | − 2tW V | Y | + (4t2 + p1 )W W | )/2 =CV,t P˜V,t (Y ) σmin (W )≥ 21 2 p r/2 4t + p1 1 4t2 | | V V )Y )/2 · · exp tr(Y (I − 2 dW 4t + p1 2π
2 !
2 p r/2 Z
4t + p1 1 2t 2
/2 dW =CV,t Y V exp −(4t + p1 ) W −
2π 4t2 + p1 σmin (W )≥ 12 F ! I(p r) 2t ˜ ) ≥ 1 W ˜ ∈ Rp1 ×r , W ˜ ∼N =CV,t P σmin (W Y V, 2 1 . 2 2 4t + p1 4t + p1 (116) For fixed Y ∈ Rp1 ×p2 , Y V ∈ Rp1 ×r , we can find Q ∈ Op1 ,p1 −r which is orthogonal to Y V , ˜ ∈ R(p1 −r)×r and Q| W ˜ are i.i.d. normal distributed with mean i.e. Q| Y V = 0. Then Q| W 2 0 and variance 1/(4t + p1 ). By standard result in random matrix (e.g. Corollary 5.35 in Vershynin (2012)), we have p ˜ ˜ ) = σr (W ˜ ) ≥ σr (Q| W ˜)= p 1 4t2 + p1 Q| W σr σmin (W 4t2 + p1 √ √ 1 p1 − r − r − x ≥p 4t2 + p1 with probability at least 1 − 2 exp(−x2 /2). Since t2 ≤ p1 /4, p1 ≥ 16r, we further set √ x = 0.078 p1 , the inequality above further yields p p √ √ √ 15p1 /16 − p1 /16 − 0.078 p1 1 1 ˜ √ ≥ σmin (W ) ≥ p ( p1 − r − r − x) ≥ 2 2 2p1 4t + p1 with probability at least 1 − exp(−cp1 ). Thus, for fixed Y , p1 ≥ 16r, ! I 2t 1 (p r) 1 p ×r ˜ ) ≥ W ˜ ∈ R 1 ,W ˜ ∼N Y V, 2 ≥ 1 − exp(−cp1 ). (117) P σmin (W 2 4t2 + p1 4t + p1 Recall the definition of CV,t , we have iid −1 CV,t = P σmin (W ) ≥ 1/2 W ∈ Rp1 ×r , W ∼ N (0, 1/p1 ) . Also recall the assumption that r ≤ 16p1 , Corollary 5.35 in Vershynin (2012) yields P(σmin (W ) < 1/2) ≤ exp(−cp1 ). Thus, 1 < CV,t < 1 + 2 exp(−cp1 ) 55
(118)
Combining (116), (118) and (117) we have proved (115). Finally, we can bound the KL divergence for P¯V,t and P¯V 0 ,t based on the previous steps. D(P¯V,t ||P¯V 0 ,t ) ¯ Z P (Y ) V,t dY = P¯V,t (Y ) log ¯ PV 0 ,t (Y ) Y ∈Rp1 ×p2 " ! ! Z ¯V,t (Y ) ˜V,t (Y ) P P = + log + log P¯V,t (Y ) log P˜V,t (Y ) P˜V 0 ,t (Y ) Y ∈Rp1 ×p2 ! ! Z ˜V,t (Y ) P + 2 log(1 + exp(−cp1 )) dY ≤ P¯V,t (Y ) log P˜V 0 ,t (Y ) Y ! Z ˜V,t (Y ) P ≤C exp(−cp1 ) + P˜V,t (Y ) log dY P˜V 0 ,t (Y ) Y ! Z ˜V,t (Y ) P ˜ + (PV,t (Y ) − P¯V,t (Y )) log dY P˜V 0 ,t (Y ) Y ≤
P˜V 0 ,t (Y ) P¯V 0 ,t (Y )
!# dY
16t4 k sin Θ(V, V 0 )k2F + C exp(−cp1 ) 2(4t2 + p1 ) ! Z P˜V,t (Y ) ˜ + exp(−cp1 ) PV,t (Y ) log dY P˜V 0 ,t (Y ) Y
For the last term in the formula above, we can calculate accordingly as ! Z P˜V,t (Y ) ˜ dY PV,t (Y ) log P˜V 0 ,t (Y ) Y Z p 4t2 X | = P˜V,t (Y ) · 2 Yi· V V | − V 0 (V 0 )| Yi· /2 dY 4t + p1 Y i=1 ! p1 2 4t2 X 1 4t ≤E Yi· V V | + V 0 (V 0 )| Yi·| Yi· ∼ N 0, Ip2 + VV| 4t2 + p1 2 p1 i=1 2 4t 4t2 | 0 0 | | =p1 · tr V V + V (V ) · Ip2 + VV 2(4t2 + p1 ) p1 4t2 4t2 | | ≤p1 · tr 2V I + V V V p2 2(4t2 + p1 ) p1 =4t2 r ≤ p21 . Since exp(−cp1 )p21 is upper bounded by some constant for all p1 ≥ 1. To sum up there exists uniform constant CKL such that for all V, V 0 ∈ Op2 ,r , D(P¯V,t ||P¯V 0 ,t ) ≤
16t4 k sin Θ(V, V 0 )k2F + CKL , 2(4t2 + p1 )
which has finished the proof of this lemma.
56
Proof of Lemma 8. First of all, since left and right multiplication for G[1:r,:] does not change the essence of the problem, without loss of generality we can assume that 1 . .. ∈ Od,r . UG = 1 0 ··· 0 In such case, all non-zero entries of G are zero except the top left r × r block. Based on the random matrix theory (Lemma 4), we know 1 2 P σmin (L[1:r,:] ) ≥ t2 + p − C (dp) 2 + d ≥ 1 − C exp(−cd),
2 1 P L[(r+1):n,:] ≤ n + C (pn) 2 + p ≥ 1 − C exp(−cp),
2 P L[(r+1):d,:] ≤ Cd ≥ 1 − C exp(−cd).
(119) (120) (121)
ˆ[1:r,:] ). By Lemma 9 and (119), (120), we know with proba1. First we consider σmin (U bility at least 1 − C exp(−cd ∧ p). 2 (L σmin [1:r,:] ) 2 2 σmin (L[1:r,:] ) + σmax (L[(r+1):p,:] ) 1 t2 + p − C (dp) 2 + d . ≥ 1 1 t2 + p − C (dp) 2 + d + n + C (pn) 2 + p
2 ˆ[1:r,:] ) ≥ σmin (U
We target on showing under the statement given in the lemma, 1 t2 + p − C (dp) 2 + d p + t2 /2 ≥ . 1 1 t2 + n t2 + p − C (dp) 2 + d + n + C (pn) 2 + p
(122)
The inequality above is implied by 1 1 t2 + p − C((dp) 2 + d) (t2 + n) ≥ t2 + n + C (pn) 2 + p p + t2 /2 2 1 1 1 1 2 n+t 2 2 ⇐t − C (pn) + p + (dp) + d − C p3/2 n 2 + p2 + (dp) 2 n + dn ≥ 0. 2 1 1 In fact, whenever t2 > Cgap (dp) 2 + d + p3/2 n− 2 , n > C0 p, we have 1 1 1 1 n + t2 t − C (pn) 2 + p + (dp) 2 + d − C p3/2 n 2 + p2 + (dp) 2 n + dn 2 1 1 C C C 2 2 − −√ − nt2 ≥t (Cgap /2 − C) (dp) + d + n 2 C0 Cgap C0 2
57
From the inequality above we can see, there exists large but uniform choices of constants C0 , Cgap > 0 such that the term above is non-negative, additionally implies (122). In another word, there exists C0 , Cgap > 0, such that whenever (90) holds, then 2 2 ˆ[1:r,:] ≥ p + t /2 ≥ 1 − C exp(−cd ∧ p). (123) U P σmin n + t2 ˆ[(r+1):p,:] ). Suppose we randomly generate R ˜ ∈ On−r as a 2. Next we consider σmax (U unitary matrix of (n − r) dimension, which is independent of L. Also, R ∈ On−r,p−r ˜ Clearly, as the first p − r columns of R. # " Ir · L and L have the same distribution. ˜| R This implies "
#
Ir ˜| R
ˆ, ·U
and
ˆ U
have the same distribution.
When we focus on the [(r + 1) : d]-th rows, we get ˆ[(r+1):n,:] R| U
ˆ[(r+1):d,:] and U
have the same distribution.
ˆ[(r+1):n,:] rather than U ˆ[(r+1):p,:] . Conditioning on Thus, we can turn to consider R| U ˆ , i.e. U ˆ[1:r,:] , the rest part of U ˆ , i.e. U ˆ[(r+1):n,:] is a random matrix first r rows of U ˆ[1:r,:] with spectral norm no more than 1. Applying Lemma 10, we get for any given U we have the following conditioning probability when n ≥ C0 d for some large C0 , !
2 p + C p(d − r)p + C(d − r)
|ˆ
ˆ P R U[(r+1):n,:] ≥ U[1:r,:] ≤ C exp(−cp). n−r Therefore,
2 p + C p(d − r)p + C(d − r)
ˆ
P U[(r+1):d,:] ≥ n−r
! ≤ C exp(−cp).
(124)
Next, under essentially the same argument as the proof in Part 1, one can show that there exists Cgap , C0 > 0 such that whenever (90) holds, we have p p + C (d − r)p + C(d − r) p + t2 /4 ≤ , n−r n + t2 additionally we also have
2 p + t2 /4
ˆ
P U[(r+1):d,:] ≥ ≤ C exp(−cp). n + t2 58
(125)
ˆ[(r+1):p,:] P ˆ 3. In this step we consider kU U[1:r,:] k. The idea to proceed is similar to the part ˆ[(r+1):p,:] k. Conditioning on U ˆ[1:r,:] , for kU ˆ[(r+1):p,:] P ˆ U U
[1:r,:]
ˆ[(r+1):n,:] P ˆ and R| U U
[1:r,:]
have the same distribution.
ˆ[r+1:n,:] P ˆ ˆ Also, kU ˆ[1:r,:] ) ≤ r. By Lemma 10, there exists U[1:r,:] k ≤ 1, rank(U[r+1:n,:] PU uniform constants C, c > 0 such that
1 1
|ˆ
ˆ
2 2 U P > C(d/n) R = P ≤ C exp(−cd). P U P > C(d/n)
ˆ[1:r,:] ˆ[1:r,:] [(r+1):n,:] U [(r+1):p,:] U (126) 4. Combine (123), (125), (126), we know there exists Cgap , C0 > 0 such that whenever (90) holds, then with probability at least 1 − C exp(p ∧ d),
2 2 2 ˆ[1:r,:] ) > ˆ[(r+1):d,:] ˆ[1:d,:] ). σmin (U (U
U
≥ σr+1
q 1
p+t2 /4 ˆ[1:r,:] ) · ˆ[(r+1):p,:] P ˆ σmin (U
U
· C(d/n) 2 U[1:r,:] n+t2
2 ≤ p+t2 /2 p+t2 /4 2 (U − n+t2 ˆ[1:r,:] ) − ˆ[(r+1):d,:] σmin
U
n+t2 p C d(p + t2 )(1 + t2 /n) . ≤ t2 By Proposition 1, we have finished the proof of Lemma 9.
(127)
Proof of Lemma 9. Based on the setting, U = LV Σ−1 . Thus,
U[Ω,:] v 2
L[Ω,:] V Σ−1 v 2 2 2 2 σmax (U[Ω,:] ) = maxp ≤ maxp
−1 v 2 + L c V Σ−1 v 2 v∈R v∈R L kU vk22 V Σ [Ω,:] [Ω ,:] 2 2
2 −1 2 2 σmax (L[Ω,:] ) V Σ v 2 σmax (L[Ω,:] ) ≤ maxp = 2 2 2 (L c ) . 2 (L v∈R σ 2 (L[Ω,:] ) kV Σ−1 vk + σ 2 (L[Ωc ,:] ) kV Σ−1 vk σ ) + σ [Ω ,:] [Ω,:] max min max 2 2 min The other inequality in the lemma can be proved in the same way.
Proof of Lemma 10. Suppose α = kXk. Since left and right multiply orthogonal matrix to X does not essentially change the problem, without loss of generality we can assume that X ∈ Rn×m is diagonal, such that X = diag(σ1 (X), σ2 (X), · · · , σp (X), 0, · · · ). Clearly σi (X) ≤ α. Now
σ (X)R
11 · · · σ1 (X)R1d
1
.. .. ≤ α · R[1:p,:] . kX | Rk = . .
σp (X)Rp1 · · · σp (X)Rpd 59
In order to finish the proof, we only need to bound the spectral norm for kR[1:p,:] k. For any unit vector v ∈ Rd , Rv is randomly distributed on On,1 with Haar measure. Thus kR[1:p,:] vk22 has the same distribution as
x21 + · · · + x2p , x21 + · · · + x2n
iid
x1 , · · · , xn ∼ N (0, 1).
(128)
By the tail bound for χ2 -distribution, ! p X p p 2 P p − 2pt ≤ xk ≤ p + 2pt + 2t ≤ 2 exp(−t), k=1
! n X √ √ P n − 2nt ≤ x2k ≤ n + 2nt + 2t ≤ 2 exp(−t), k=1
which means √ p + 2pt + 2t p p | | √ , − 2 exp(−t) ≥P v R[1:p,:] R[1:p,:] − Ir v ≥ n n n − 2nt √ p p − 2pt p | | √ 2 exp(−t) ≥P v R[1:p,:] R[1:p,:] − Ir v ≤ − . n n + 2nt + 2t n We set t = Cd for large enough C > 0 and apply ε-net method (Lemma 5), the following result hold for true.
√ √
| d p − C pd p + C pd + Cd p p
√ √ C exp(−cd) ≥ P R1:d.: R1:d,: − Ir ≥ 3 max − , − . n n n n + C nd + Cd n − C nd Note that √ √ p − C pd p + C pd + Cd p p √ √ max − , − n n n + C nd + Cd n − C nd p √ p √ C pd + d + p d/n C p d/n + pd/n + pd √ √ ≤ max , . n − C nd n + C nd + Cd ( ) √ √ C 2 pd + d C 3 pd √ √ ≤ max , n − C nd n + C nd + Cd
√ Thus there exists C0 > 0 such that when n > C0 r, n − C nr > n/2, and additionally, 1 ( ) √ √ 2 + d C (pd) d + C dr + Cr d d d − C dr √ √ max − , − ≤ . n n n + C nr + Cr n n − C nr To sum up, we have finished the proof of Lemma 10.
60