A Simple Approach to Optimal CUR Decomposition

Haishan Ye        [email protected]
Yujun Li          [email protected]
Zhihua Zhang      [email protected]
Department of Computer Science and Engineering
Shanghai Jiao Tong University
800 Dong Chuan Road, Shanghai, China 200240

arXiv:1511.01598v2 [cs.NA] 9 Dec 2015

November 5, 2015

Abstract

Prior optimal CUR decomposition and near-optimal column reconstruction methods have been established by combining BSS sampling and adaptive sampling. In this paper, we propose a new approach to optimal CUR decomposition, and to approximate matrix multiplication with respect to the spectral norm, based on leverage score sampling. Moreover, our approach yields an O(nnz(A))-time optimal CUR algorithm, where A is the data matrix in question. We also extend our approach to the Nyström method, obtaining a fast algorithm that runs in O(nnz(A)) time.

Keywords: Column Selection, CUR, Leverage Score Sampling, the Nyström method

1. Introduction

The CUR matrix decomposition approximates an arbitrary matrix by selecting a subset of its columns and a subset of its rows to form a low-rank approximation (Drineas et al., 2008). It can overcome some drawbacks of the SVD. For example, the standard SVD or QR decomposition of a sparse matrix does not in general preserve sparsity, which makes it challenging to compute or even store such decompositions when the sparse matrix is large. Besides, the SVD lacks interpretability for the data matrix, which makes it difficult to understand and interpret the data in question. The CUR decomposition, in contrast, easily preserves sparsity and interpretability. Owing to these benefits, CUR decomposition has wide applications in image processing, bioinformatics, document classification, securities trading, and web graphs (Wang and Zhang, 2013, Yang et al., 2015a, Mahoney and Drineas, 2009, Thurau et al., 2012).

The Nyström method is closely related to CUR decomposition. More specifically, Wang and Zhang (2013) proposed a modified Nyström method. This modified method is a special case of CUR when the data matrix is symmetric positive semidefinite (SPSD).

Table 1: Comparison of our bound on CUR

Method         c         r         rank(U)    Time if m = Θ(n)
W&Z (2013)     O(k/ǫ)    O(k/ǫ²)   c          O(n²k/ǫ + nk³/ǫ^(2/3) + nk²/ǫ⁴)
B&W (2014)     O(k/ǫ)    O(k/ǫ)    k          O(nnz(A) log n + n · poly(log n, k, 1/ǫ))
Algorithm 1    O(k/ǫ)    O(k/ǫ)    k          O(nnz(A) + n · poly(k, 1/ǫ))

Typically, existing CUR decomposition methods and column reconstruction methods are based on BSS sampling (Boutsidis and Woodruff, 2014) and adaptive sampling (Boutsidis and Woodruff, 2014, Boutsidis et al., 2014, Wang et al., 2014, Wang and Zhang, 2013, Anderson et al., 2015). Since the work of Boutsidis et al. (2014) and Wang and Zhang (2013), BSS sampling and adaptive sampling have become the cornerstone of CUR decomposition and column reconstruction methods. However, both are computationally expensive. BSS sampling is a sequential procedure, which is slow on modern computer architectures. Adaptive sampling is the key step costing O(nnz(A) log n) in (Boutsidis and Woodruff, 2014), leading to an O(nnz(A) log n) optimal CUR algorithm. Naturally, we wonder whether BSS sampling and adaptive sampling are necessary for optimal CUR decomposition and for practical near-optimal column reconstruction.

In this paper, we resort to leverage score sampling, which has already been used for CUR decomposition in several works (Gittens and Mahoney, 2013, Drineas et al., 2008). We prove that leverage score sampling yields an optimal CUR decomposition, without BSS sampling or adaptive sampling, whenever O(k/ǫ) is larger than k log k. In practice it is common that O(k/ǫ) is larger than k log k. For example, 1620k/ǫ columns suffice to guarantee an optimal CUR decomposition with constant probability (Boutsidis and Woodruff, 2014); taking ǫ = 0.1, one would need k > exp(16200) for k log k to exceed 1620k/ǫ, which rarely happens in practice. Hence, for convenience we simply write O(k/ǫ) instead of max(O(k/ǫ), Θ(k log k)) and claim that our CUR decomposition is optimal. Moreover, the resulting algorithm (see Algorithm 1) is effective and efficient: in Algorithm 1, BSS sampling and adaptive sampling are no longer necessary.

In the big data era, algorithms whose cost is linear in the input size are preferred. For an optimal CUR decomposition, an O(nnz(A)) algorithm is ideal for a sparse input matrix A ∈ R^{m×n}. However, the fastest existing CUR decomposition, proposed by Boutsidis and Woodruff (2014), has O(nnz(A) log n) computational complexity. In this paper, we give the first O(nnz(A)) optimal CUR decomposition, using leverage score sampling. Even when Θ(k log k) > O(k/ǫ), Algorithm 1 still runs in input-sparsity time, which is faster than previous CUR decompositions. To the best of our knowledge, Algorithm 1 is so far the fastest CUR decomposition method with relative error. Furthermore, the fast randomized CUR of Algorithm 1, based on leverage score sampling, has better computational performance than the algorithms using BSS sampling and adaptive sampling (Boutsidis and Woodruff, 2014). In Table 1 we compare our work with previous work. In the general case, our work has comparable computational complexity, and in the sparse setting it performs better. This improvement is due to the fact that our method only needs leverage-score sketching, whereas adaptive sampling increases the computational burden in the work of Boutsidis and Woodruff (2014).

Table 2: Comparison of our bound on the Nyström method

Method                c                        ||A − CUC^T||_F                        Time
Kumar et al. (2012)   Ω(1)                     ||A − Ak||_F + n(k/c)^(1/4) ||A||_2    —
G&M (2013)            Ω(k log(k/β)/(βǫ²))      ||A − Ak||_F + ǫ||A − Ak||_*           Õ(n²k/ǫ²)
Theorem 4             O(k/ǫ)                   (1 + ǫ)||A − Ak||_F                    O(nnz(A))

We extend our result to SPSD low-rank approximation and devise a fast Nyström method. Our Nyström method has a tighter bound than the leverage score sampling method proposed by Gittens and Mahoney (2013). Besides, our method is efficient: it can be computed in O(nnz(A)) time, which is linear in the number of nonzeros of A. Although the traditional Nyström method based on uniform sampling has good computational performance, it suffers from some drawbacks, as pointed out by Wang and Zhang (2013). Besides, empirical evaluation also shows that the traditional Nyström method is not as effective as the leverage score sampling method (Gittens and Mahoney, 2013). We list these results in Table 2. As we can see, our method has a tighter bound than those of Kumar et al. (2012) and Gittens and Mahoney (2013). Gittens and Mahoney (2013) also used leverage score sampling to implement the Nyström method. However, our algorithm, given in Algorithm 2, adds some additional steps, obtaining a tighter bound while sampling fewer columns.

Finally, in Theorem 5, we prove that leverage scores can be used to achieve approximate matrix multiplication with respect to the spectral norm when one of the matrices is orthonormal. Cohen et al. (2015) showed that oblivious subspace embedding matrices can be used for approximate multiplication with respect to the spectral norm; however, it was unknown whether leverage score sketching matrices can achieve the same. The condition for applying leverage score sketching matrices to approximate matrix multiplication with respect to the spectral norm is similar to the one for the Frobenius norm, i.e., one of the matrices has to be orthonormal, and the resulting upper bound is similar to the ones obtained with oblivious subspace embedding matrices. Using this approximate matrix multiplication result, we prove a tighter bound for the column selection method using leverage score sampling with respect to the spectral norm. We list some recent results in Table 3, where ρ is the rank of A − Ak. The bound of Yang et al. (2015b) depends on c(s) and q(s), which are the maximum ratios between the leverage scores s*_i of Ak and the sampling probabilities s_i. Yang et al. (2015b) proposed an sqL sampling method to get a Õ(√(ρ/ℓ)) upper bound in an ideal setting. Our bound is based on the stable rank of A − Ak. It is common that the stable rank of A − Ak is small, whereas the rank of A − Ak is very large due to numerical computation. Hence, our bound is better than the bounds of (Boutsidis et al., 2014) and (Yang et al., 2015b). Based on column reconstruction, we get a Õ(√((k + sr(A − Ak))/ℓ) + (k + sr(A − Ak))/ℓ)-error CUR decomposition with respect to the spectral norm.

Table 3: Comparison of our bound on column reconstruction

Method                                  c    ||A − CC†A||_2 / ||A − Ak||_2
BSS sampling (Boutsidis et al., 2014)   ℓ    O(ρ/ℓ)
Yang et al. (2015b)                     ℓ    Õ(√(c(s)kρ/ℓ) + q(s)k/ℓ)
Theorem 6                               ℓ    Õ(√((k + sr(A − Ak))/ℓ) + (k + sr(A − Ak))/ℓ)

2. Notation and Definition

First of all, we present the notation and notions that are used here and later. We let I_m denote the m × m identity matrix; for convenience, we sometimes simply write I, and 0 denotes a zero vector or matrix of appropriate size. Let ρ = rank(A) ≤ min{m, n} and k ≤ ρ. The singular value decomposition (SVD) of A can be written as

    A = Σ_{i=1}^{ρ} σ_i u_i v_i^T = U Σ V^T = [U_k, U_{k⊥}] [Σ_k, 0; 0, Σ_{k⊥}] [V_k^T; V_{k⊥}^T],

where U_k (m×k), Σ_k (k×k), and V_k (n×k) correspond to the top k singular values. We denote by A_k = U_k Σ_k V_k^T the best (or closest) rank-k approximation to A. We also use σ_i = σ_i(A) to denote the i-th largest singular value and σ_min(A) to denote the smallest singular value of A. When A is SPSD, the SVD is identical to the eigenvalue decomposition, in which case U_A = V_A. Besides, ||A|| ≜ σ_1 is the spectral norm.

Given a matrix A ∈ R^{m×n} and a matrix C ∈ R^{m×r} with r > k, we formally define the matrix Π^ζ_{C,k}(A) as the best approximation to A within the column space of C that has rank at most k; that is, Π^ζ_{C,k}(A) minimizes the residual ||A − Â||_ζ over all Â in the column space of C with rank at most k.

Additionally, let A† = V_ρ Σ_ρ^{-1} U_ρ^T be the Moore-Penrose inverse of A. When A is nonsingular, the Moore-Penrose inverse is identical to the matrix inverse.

We define the matrix norms as follows. Let ||A||_1 = Σ_{i,j} |a_{ij}| be the ℓ1-norm, ||A||_F = (Σ_{i,j} a_{ij}²)^(1/2) = (Σ_i σ_i²)^(1/2) be the Frobenius norm, ||A||_2 = max_{x ∈ R^n, ||x||_2 = 1} ||Ax||_2 = σ_1 be the spectral norm, and ||A||_* = Σ_i σ_i be the nuclear norm. Using the spectral and Frobenius norms, we define the stable rank sr(A) = ||A||_F² / ||A||_2².

Given matrices A ∈ R^{m×n}, X ∈ R^{m×p}, and Y ∈ R^{q×n}, XX†A = U_X U_X^T A ∈ R^{m×n} is the projection of A onto the column space of X, and AY†Y = A V_Y V_Y^T ∈ R^{m×n} is the projection of A onto the row space of Y.

We discuss the computational costs of the matrix operations mentioned above. For an m×n general matrix A (assume m ≥ n), it takes O(mn²) flops to compute the full SVD and O(mnk) flops to compute the truncated SVD of rank k (< n). The computation of A† also takes O(mn²) flops. For the Hadamard-Walsh transform, given the m × m Hadamard-Walsh transform matrix H, computing HA costs Õ(mn), which is much faster than a typical matrix multiplication that needs O(m²n) arithmetic operations. We use nnz(A) to denote the number of non-zero entries in A.
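To make these definitions concrete, here is a minimal numpy sketch (our own illustration, not part of the paper; the function names are ours) computing the best rank-k approximation A_k, the stable rank sr(A), and the projection of A onto the column space of a matrix X.

```python
import numpy as np

def stable_rank(A):
    # sr(A) = ||A||_F^2 / ||A||_2^2
    return np.linalg.norm(A, 'fro')**2 / np.linalg.norm(A, 2)**2

def best_rank_k(A, k):
    # A_k = U_k Sigma_k V_k^T, the closest rank-k approximation to A
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

def project_onto_columns(A, X):
    # X X^+ A = U_X U_X^T A, the projection of A onto the column space of X
    return X @ np.linalg.pinv(X) @ A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 80))
    print(stable_rank(A), np.linalg.norm(A - best_rank_k(A, 10), 'fro'))
```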

Finally, we give the definition of leverage score sampling, which is the main tool used to construct our simple optimal CUR algorithm.

Definition 1 (Leverage score sampling) (Boutsidis and Woodruff, 2014) Let V ∈ R^{n×k} be column orthonormal with n > k, and let v_{i,*} denote the i-th row of V. Let ℓ_i = ||v_{i,*}||_2² / k, where the ℓ_i are the (normalized) leverage scores, and let r be an integer with 1 ≤ r ≤ n. Construct a sampling matrix Ω ∈ R^{n×r} and a rescaling matrix D ∈ R^{r×r} as follows. For every column j = 1, . . . , r of Ω and D, independently and with replacement, pick an index i from the set {1, 2, . . . , n} with probability ℓ_i and set Ω_{ij} = 1 and D_{jj} = 1/√(ℓ_i r). This procedure needs O(nk + n) numerical operations. We denote this procedure as [Ω, D] = LeverscoreSampling(V, r).
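The sampling procedure of Definition 1 is straightforward to implement. The following Python sketch (our own code; the function and variable names are not from the paper) builds the sampling matrix Ω and the rescaling matrix D for a column-orthonormal V.

```python
import numpy as np

def leverage_score_sampling(V, r, rng=None):
    """Definition 1: sample r indices with probabilities l_i = ||v_{i,*}||_2^2 / k
    and return the n x r sampling matrix Omega and r x r rescaling matrix D."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = V.shape
    probs = np.sum(V**2, axis=1) / k               # leverage-score probabilities, sum to 1
    idx = rng.choice(n, size=r, replace=True, p=probs)
    Omega = np.zeros((n, r))
    Omega[idx, np.arange(r)] = 1.0
    D = np.diag(1.0 / np.sqrt(probs[idx] * r))
    return Omega, D
```

In the remainder of this note's illustrations, a call such as leverage_score_sampling(V, r) plays the role of [Ω, D] = LeverscoreSampling(V, r).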

3. Simple Sparse CUR

In this section, we propose simple sparse CUR, whose procedure is given in Algorithm 1. Algorithm 1 runs in input-sparsity time, i.e., linear in nnz(A). We start with the algorithm description, then give a detailed analysis of the computational complexity of the algorithm, and finally analyze its approximation error.

3.1 Algorithm description

Algorithm 1 takes an m × n matrix A, a target rank k and an error parameter 0 < ǫ < 1 as input. It returns a matrix C ∈ R^{m×c} with c = O(k/ǫ) columns of A, a matrix R ∈ R^{r×n} with r = O(k/ǫ) rows of A, and a matrix U ∈ R^{c×r} with rank at most k. The algorithm runs mainly in three steps: (i) in the first step, sample an optimal number of columns of A to get C using leverage scores; (ii) in the second step, sample an optimal number of rows of A to get R using leverage scores; (iii) in the third step, construct the intersection matrix U with rank at most k.

Algorithm 1 Simple Sparse CUR.
1: Input: a real matrix A ∈ R^{m×n}, target rank k and error parameter ǫ;
2: Z = SparseSVD(A, k, ǫ);
3: [Ω, Γ] = LeverscoreSampling(Z, O(k/ǫ)) and construct C = AΩ;
4: [Q, Y, Δ] = ApproxSubspaceSVD(A, C, k, ǫ);
5: [T, D] = LeverscoreSampling(QΔ, O(k/ǫ)) and construct R = T^T A;
6: Output: C, R, U = Y†Δ(T^T QΔ)†.

3.2 Running time analysis

A detailed analysis of the arithmetic operations of Algorithm 1 is given as follows.

1. Algorithm 1 needs O(nnz(A) + nk) + Õ(nk²/ǫ⁴ + k³/ǫ⁵) to find O(k/ǫ) columns of A to construct C.
   (a) It needs O(nnz(A)) + Õ(nk²/ǫ⁴ + k³/ǫ⁵) to get Z ∈ R^{n×k} from Lemma 8.
   (b) It needs O(nk) to compute the leverage scores and O(n) to sample C.
2. Algorithm 1 needs O(nnz(A) + mk³/ǫ⁵ + mk) to find O(k/ǫ) rows of A to construct R.
   (a) It needs O(nnz(A) + mk³/ǫ⁵) to get Q, Y and Δ from Lemma 9.
   (b) It needs O(mk) to compute the leverage scores and O(m) to sample R.
3. It needs O(k³/ǫ³ + k³/ǫ² + k³/ǫ) to construct the intersection matrix U.
   (a) It needs O(k³/ǫ³) and O(k³/ǫ) to compute Y† and (T^T QΔ)†, respectively.
   (b) It needs O(k³/ǫ²) to compute the matrix multiplications.

The total asymptotic arithmetic cost of the algorithm is

    O(nnz(A) + mk³/ǫ⁵ + k³/ǫ³ + k³/ǫ² + k³/ǫ + mk + nk) + Õ(nk²/ǫ⁴ + k³/ǫ⁵).
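For illustration, the following numpy sketch follows the three steps of Algorithm 1, with exact SVD computations standing in for the fast SparseSVD and ApproxSubspaceSVD subroutines; it therefore does not achieve the O(nnz(A)) running time above, and the helper names and test matrix are our own assumptions.

```python
import numpy as np

def simple_cur(A, k, c, r, rng=None):
    """Illustrative sketch of Algorithm 1. Exact SVDs stand in for the fast
    SparseSVD and ApproxSubspaceSVD subroutines, so this version costs O(mn^2);
    c and r play the role of the O(k/eps) sample sizes."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape

    # Step 1: leverage scores of the top-k right singular space, column sampling C = A*Omega.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Z = Vt[:k].T                                   # n x k, plays the role of SparseSVD output
    p_col = np.sum(Z**2, axis=1) / k
    cols = rng.choice(n, size=c, replace=True, p=p_col)
    C = A[:, cols]

    # Step 2: a rank-k orthonormal factor Q*Delta inside range(C)
    # (plays the role of ApproxSubspaceSVD), then row sampling R = T^T * A.
    Qc, sc, _ = np.linalg.svd(C, full_matrices=False)
    Qc = Qc[:, sc > 1e-12 * sc[0]]                 # orthonormal basis of range(C)
    W, _, _ = np.linalg.svd(Qc.T @ A, full_matrices=False)
    QDelta = Qc @ W[:, :k]                         # m x k, orthonormal columns
    p_row = np.sum(QDelta**2, axis=1) / k
    rows = rng.choice(m, size=r, replace=True, p=p_row)
    R = A[rows, :]

    # Step 3: intersection matrix of rank at most k; this corresponds to the paper's
    # U = Y^+ Delta (T^T Q Delta)^+, written via pseudoinverses of C and the sampled rows.
    U = np.linalg.pinv(C) @ QDelta @ np.linalg.pinv(QDelta[rows, :])
    return C, U, R

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((300, 30)) @ rng.standard_normal((30, 200))
    A += 0.01 * rng.standard_normal(A.shape)       # nearly rank-30 test matrix
    C, U, R = simple_cur(A, k=30, c=120, r=120, rng=rng)
    print("relative CUR error:",
          np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro'))
```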

3.3 Error Bound

The following theorem gives our main result regarding Algorithm 1.

Theorem 2 Given a matrix A ∈ R^{m×n}, a target rank k and an error parameter ǫ, run Algorithm 1. Then

    ||A − CUR||_F ≤ (1 + ǫ)||A − Ak||_F

holds with high probability.

4. SPSD Low-rank Approximation and Modified Nyström

CUR decomposition has a close relationship with SPSD low-rank approximation. Wang and Zhang (2013) proposed a modified Nyström method which has better approximation precision than the traditional Nyström method but is less computationally efficient. Wang et al. (2015) gave a more efficient modified Nyström method using a random projection matrix. In this section, we give a fast and simple modified Nyström method. The Nyström method is mainly used for kernel matrix approximation, and kernel matrices are usually dense; for example, the widely used Gaussian RBF kernel matrix is dense. The fast subsampled Hadamard-Walsh transform is efficient when the data matrix is dense. Hence, we use a subsampled Hadamard-Walsh transform matrix to construct a fast randomized SVD.

4.1 Algorithm description

Algorithm 2 takes an n × n symmetric matrix A, a target rank k and an error parameter 0 < ǫ < 1 as input. It returns a matrix C ∈ R^{n×c} with c = O(k/ǫ) columns of A and a matrix U ∈ R^{c×c}. Algorithm 2 runs mainly in three steps: (i) in the first step, sample a number of columns of A to get C using leverage scores; (ii) in the second step, compute the leverage scores of C using the method of (Drineas et al., 2012); (iii) in the third step, construct the intersection matrix U. We use the method proposed in Wang et al. (2015) to improve the efficiency of constructing U.

Algorithm 2 Simple Modified Nyström.
1: Input: a symmetric matrix A ∈ R^{n×n}, target rank k and error parameter ǫ;
2: Z = SparseSVD(A, k, ǫ);
3: [Ω, Γ] = LeverscoreSampling(Z, O(k/ǫ)) and construct C1 = AΩ;
4: [Q, Y, Δ] = ApproxSubspaceSVD(A, C1, k, ǫ);
5: [T, D] = LeverscoreSampling(QΔ, O(k/ǫ)) and construct C2 = AT;
6: Construct C = [C1, C2] ∈ R^{n×O(k/ǫ)};
7: Compute approximate leverage scores of C using the method of (Drineas et al., 2012) and construct the leverage-score sketching matrix S ∈ R^{n×s}, where s = O(k/ǫ³);
8: Output: C, U = (S^T C)†(S^T A S)(C^T S)†.
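Once C has been formed, the main work in Algorithm 2 is the sketched intersection matrix U = (S^T C)†(S^T A S)(C^T S)†. The sketch below (our own illustration; it uses exact leverage scores of C instead of the approximate ones of Drineas et al. (2012), and the function names are ours) shows how this step can be computed using only an s × s submatrix of A.

```python
import numpy as np

def nystrom_intersection(A, C, s, rng=None):
    """Sketched intersection matrix U = (S^T C)^+ (S^T A S) (C^T S)^+ from Algorithm 2.

    S is an n x s leverage-score sketch built from an orthonormal basis of range(C);
    only the s x s submatrix A[idx, idx] of A is needed."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    Uc, sc, _ = np.linalg.svd(C, full_matrices=False)
    Uc = Uc[:, sc > 1e-12 * sc[0]]                 # orthonormal basis of range(C)
    probs = np.sum(Uc**2, axis=1) / Uc.shape[1]    # (exact) leverage scores of C
    idx = rng.choice(n, size=s, replace=True, p=probs)
    scale = 1.0 / np.sqrt(probs[idx] * s)
    StC = scale[:, None] * C[idx, :]                                # S^T C
    StAS = (scale[:, None] * A[np.ix_(idx, idx)]) * scale[None, :]  # S^T A S
    pinv_StC = np.linalg.pinv(StC)
    return pinv_StC @ StAS @ pinv_StC.T            # (S^T C)^+ (S^T A S) (C^T S)^+

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    B = rng.standard_normal((500, 30))
    A = B @ B.T                                    # SPSD test matrix of rank 30
    C = A[:, rng.choice(500, size=90, replace=False)]
    U = nystrom_intersection(A, C, s=200, rng=rng)
    print(np.linalg.norm(A - C @ U @ C.T) / np.linalg.norm(A))
```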

4.2 Running time analysis

Now we give a detailed analysis of the arithmetic operations of Algorithm 2.

1. Algorithm 2 needs O(nnz(A) + nk³/ǫ⁵ + nk) + Õ(nk²/ǫ⁴ + k³/ǫ⁵) to find O(k/ǫ) columns of A to construct C.
   (a) It needs O(nnz(A)) + Õ(nk²/ǫ⁴ + k³/ǫ⁵) to get Z ∈ R^{n×k} from Lemma 8.
   (b) It needs O(nk) to compute the leverage scores and O(n) to sample C1 and C2.
   (c) It needs O(nnz(A) + nk³/ǫ⁵) to get Q, Y and Δ from Lemma 9.
2. Algorithm 2 needs O(nk/ǫ³ + k²/ǫ⁶ + k³/ǫ⁷ + k³/ǫ⁵ + k²/ǫ⁴) + Õ(nk/ǫ) to construct U.
   (a) It needs Õ(nk/ǫ) to get the approximate leverage scores of C, as shown in (Drineas et al., 2012);
   (b) It needs O(k²/ǫ⁴) to get S^T C and O(nk/ǫ³ + k²/ǫ⁶) to get S^T A S.
   (c) It needs O(k³/ǫ⁵) to compute (S^T C)†.
   (d) It needs O(k³/ǫ⁷) to compute the matrix multiplications.

The total asymptotic arithmetic cost of the algorithm is

    O(nnz(A) + nk³/ǫ⁵ + nk/ǫ³ + k²/ǫ⁶ + k³/ǫ⁷ + k³/ǫ⁵ + k²/ǫ⁴ + nk) + Õ(nk²/ǫ⁴ + k³/ǫ⁵ + nk/ǫ).

4.3 Error Bound

The following theorems give our main approximation results regarding Algorithm 2.

Theorem 3 Given A ∈ R^{m×n}, C ∈ R^{m×c} and R ∈ R^{r×n}, let S be the leverage score sketching matrix of C with O(c/ǫ) rows and T be the leverage score sketching matrix of R with O(r/ǫ) columns. Let

    U⋆ = C†AR† = argmin_U ||A − CUR||_F

and

    Û = (SC)†SAT(RT)†.

Then we have

    ||A − CÛR||_F ≤ (1 + ǫ)||A − CU⋆R||_F.

Theorem 4 Given a symmetric matrix A ∈ R^{n×n}, a target rank k and an error parameter ǫ, run Algorithm 2. Then

    ||A − CUC^T||_F ≤ (1 + ǫ)||A − Ak||_F

holds with high probability.

5. Column Reconstruction with Respect to Spectral Norm

In this section, we give some results with respect to the spectral norm using leverage score sampling. First, we prove that leverage scores can be used to achieve approximate matrix multiplication with respect to the spectral norm when one matrix is orthonormal, as Theorem 5 shows. Furthermore, we show an upper bound on column reconstruction in terms of the stable rank using leverage score sampling.

5.1 Approximate matrix multiplication

Theorem 5 Given A ∈ R^{p×m} and a column orthonormal matrix U ∈ R^{p×n}, let [Ω, D] = LeverscoreSampling(U, ℓ) and S = ΩD. Then we have

    ||A^T S^T S U − A^T U||_2 / ||A||_2 ≤ √(2(sr(A) + p) log((m+n)/δ) / ℓ) + 2(sr(A) + p) log((m+n)/δ) / (3ℓ).    (1)

Cohen et al. (2015) showed that oblivious subspace embedding matrices can be used for approximate multiplication with respect to the spectral norm, namely

    ||A^T Π^T Π B − A^T B||_2 ≤ ǫ ||A||_2 ||B||_2 √((1 + sr(A)/k)(1 + sr(B)/k)),    (2)

when Π is a (k, ǫ)-oblivious subspace embedding matrix. Leverage score sketching matrices, in contrast, are not oblivious because they depend on the input matrices. When ℓ = O(k log((m+n)/δ))/ǫ², Equation (1) becomes

    ||A^T S^T S U − A^T U||_2 / ||A||_2 ≤ ǫ √((sr(A) + p)/k) + ǫ² (sr(A) + p)/k.

Under the conditions of Theorem 5, if Π is a (k, ǫ)-oblivious subspace embedding matrix, then Equation (2) becomes

    ||A^T Π^T Π U − A^T U||_2 / ||A||_2 ≤ ǫ √(1 + (sr(A) + p)/k + sr(A) p / k²).

The above two inequalities have similar forms and bounds. However, these bounds cannot be compared directly, because different ǫ and different ratios between sr(A) and p influence the result significantly. Theorem 10 shows that when one input matrix U is orthonormal, the leverage score sketching matrix of U can be used for approximate multiplication with respect to the Frobenius norm; Theorem 5 shows the analogous statement for the spectral norm.
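The following numpy experiment (our own illustration, not from the paper) mimics the setting of Theorem 5: it sketches with the leverage scores of a column-orthonormal U and reports the spectral-norm error of A^T S^T S U relative to ||A||_2 for increasing sample sizes ℓ.

```python
import numpy as np

def leverage_matmul(A, U, ell, rng=None):
    """Approximate A^T U by A^T S^T S U, sampling ell rows with the leverage
    scores of the column-orthonormal matrix U (the setting of Theorem 5)."""
    rng = np.random.default_rng() if rng is None else rng
    p, n = U.shape
    probs = np.sum(U**2, axis=1) / n               # leverage-score probabilities of U
    idx = rng.choice(p, size=ell, replace=True, p=probs)
    scale = 1.0 / np.sqrt(probs[idx] * ell)
    SA = scale[:, None] * A[idx, :]                # S A   (ell x m)
    SU = scale[:, None] * U[idx, :]                # S U   (ell x n)
    return SA.T @ SU                               # A^T S^T S U

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    p, m, n = 2000, 50, 20
    A = rng.standard_normal((p, m))
    U, _ = np.linalg.qr(rng.standard_normal((p, n)))
    exact = A.T @ U
    for ell in (100, 400, 1600):
        approx = leverage_matmul(A, U, ell, rng)
        print(ell, np.linalg.norm(approx - exact, 2) / np.linalg.norm(A, 2))
```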

The above two inequalities have similar forms and bounds. However, these bounds can not be compared directly, because different ǫ and different ratio of sr(A) and p will influence the result significiently. Theorem 10 shows that when one input matrix U is orthonormal then the leverage score sketching matrix of U can be used to approximate multiplication with respect to Frobenius norm. Theorem 5 shows that when one input matrix U is orthonormal then the leverage score sketching matrix of U can be used to approximate multiplication with respect to spectral norm. 8

Algorithm 3 Column Selection Algorithm Using Leverage Scores 1: Input: a real matrix A ∈ Rm×n , target rank k, sampling number ℓ; 2: Compute truncated SVD to get Vk ; 3: Run [Ω, D] = LeverscoreSampling(Z, ℓ), where ℓ > Θ(k log k) ; 4: Output: C = AΩ.

5.2 Column Selection Using Theorem 5, we can get a bound on column reconstruction with respect to spectral norm in the form of stable rank. Theorem 6 Given a matrix A ∈ Rm×n , a target rank k, C ∈ Rm×ℓ is returned from Algorithm 3 with A, k and ℓ > Θ(k log k) as input parameters. Then 

r

||A − CC†A||2 ≤ ||A − Ak ||2 1 +

2(ˆ r + k) log ℓ

r δ

! 2(ˆ r + k) log  + 3ℓ r δ

1−

1 q

k log ℓ

k δ

 

holds with probability at least 1 − 2δ, where rˆ is the stable rank of A − Ak and r is the rank ˜ r + k)/ǫ2 , where 0 < ǫ < 1, then we have of A. When ℓ = O(ˆ ||A − CC† A||2 ≤ (1 + ǫ)||A − Ak ||2 . In most cases, the rank of A − Ak is big due to numerical computation. However, stable rank is commonly small since it is the ratio of ||A||2F and ||A||22 , as we will see in emprical studies. The bound of Theorem 1 of Boutsidis et al. (2014) can not get a tight upper bound ˜ r + k)/ǫ2 , Theorem 6 when the rank of A is full or near full which is common. When ℓ = O(ˆ can get a relative error bound. Corollary 7 Given a matrix A ∈ Rm×n , a target rank k, C ∈ Rm×ℓ is returned from Algorithm 3 with A, k and ℓ > Θ(k log k) as input parameters. RT ∈ Rn×ℓ is returned from Algorithm 3 with AT , k and ℓ > Θ(k log k) as input parameters. Then, ˜ rˆ + k + ||A − CC†AR† R||2 ≤ O( ℓ

r

rˆ + k ) ℓ

holds with high probability, where rˆ is the stable rank of A − Ak . Proof Theorem 6 shows that ˜ rˆ + k + ||A − CC A||2 ≤ O( ℓ

r

rˆ + k ) ℓ

˜ rˆ + k + ||A − AR R||2 ≤ O( ℓ

r

rˆ + k ). ℓ



and †

9

Then, ||A − CC† AR† R||2 ≤||A − CC†A||2 + ||CC† A − CC†AR† R||2 ≤||A − CC†A||2 + ||CC† ||2 ||A − AR† R||2 =||A − CC†A||2 + ||A − AR† R||2 r r ˆ + k rˆ + k ˜ ≤O( + ). ℓ ℓ

6. Conclusion

In this paper, we have proposed the first O(nnz(A)) optimal CUR decomposition method. Our CUR decomposition algorithm is much simpler than the existing CUR algorithms and is still optimal. In particular, BSS sampling is no longer needed in our algorithms; BSS sampling is costly since it is a sequential procedure. We have also extended our CUR method to the modified Nyström method and obtained a fast Nyström decomposition that runs in O(nnz(A)) time, which is suitable for large-scale machine learning scenarios. Finally, we have shown that leverage score sketching matrices can be used to approximate matrix multiplication with respect to the spectral norm, just as in the Frobenius norm case, and we have obtained a tighter spectral-norm upper bound for column reconstruction using leverage score sampling.


Appendix A. Key Tools Used in CUR Decomposition

Lemma 8 (Lemma 3.4 in Boutsidis and Woodruff (2014)) Given a matrix A ∈ R^{m×n} of rank ρ, a target rank 2 ≤ k < ρ, and 0 < ǫ ≤ 1, there exists a randomized algorithm that computes Z ∈ R^{n×k} with Z^T Z = I_k such that, with probability at least 0.99,

    ||A − AZZ^T||_F² ≤ (1 + ǫ)||A − Ak||_F².

The algorithm requires O(nnz(A)) + Õ(nk²/ǫ⁴ + k³/ǫ⁵) time. We denote this procedure as Z = SparseSVD(A, k, ǫ).

Lemma 9 (Lemma 3.12 of Boutsidis and Woodruff (2014)) Let A ∈ R^{m×n} and V ∈ R^{m×c}. Assume that for some rank parameter k and accuracy parameter 0 < ǫ < 1,

    ||A − Π^F_{V,k}(A)||_F² ≤ ||A − Ak||_F².

Let V = QY be a QR-decomposition of V, where Q ∈ R^{m×c} and Y ∈ R^{c×c}. Let Γ = Q^T A W^T ∈ R^{c×ℓ}, where W^T ∈ R^{n×ℓ} is a sparse subspace embedding matrix with ℓ = O(c²/ǫ²). Let Δ ∈ R^{c×k} contain the top k left singular vectors of Γ. Then, with probability at least 0.99,

    ||A − QΔΔ^T Q^T A||_F² ≤ ||A − Ak||_F².

Q, Y and Δ can be computed in O(nnz(A) + mc³/ǫ²) time. We denote the construction of Q, Y and Δ as [Q, Y, Δ] = ApproxSubspaceSVD(A, V, k, ǫ).

Appendix B. Key Theorems in Later Proofs

Theorem 10 (Clarkson and Woodruff (2013)) For A ∈ R^{m×n} and an orthonormal matrix U ∈ R^{m×k}, there is t = Θ(ǫ^{-2}) such that, for a t × m leverage-score sketching matrix S for U,

    P( ||A^T S^T S U − A^T U||_F² < ǫ² ||A||_F² ||U||_F² ) ≥ 1 − δ,

for any fixed δ > 0.

Theorem 11 (Clarkson and Woodruff (2013)) For any rank-k matrix A ∈ R^{m×n} with row leverage scores, there is t = O(kǫ^{-2} log k) such that the leverage-score sketching matrix S ∈ R^{t×m} is an ǫ-embedding matrix for A, i.e.,

    ||SAx||_2² = (1 ± ǫ)||Ax||_2².

Theorem 12 Suppose C and A are matrices with m rows, and C has rank k. Let S be the leverage score sketching matrix of C with O(k/ǫ) rows, and suppose S is also a subspace embedding for C with error parameter ǫ_0 ≤ 1/√2. If Ŷ is the solution to

    min_Y ||S(CY − A)||_F²

and Y⋆ is the solution to

    min_Y ||CY − A||_F²,

then

    ||CŶ − A||_F ≤ (1 + ǫ)||CY⋆ − A||_F

and

    ||C(Ŷ − Y⋆)||_F ≤ 2√ǫ ||CY⋆ − A||_F

hold with high probability.

Theorem 13 Given matrices A ∈ R^{m×n}, C ∈ R^{m×c} and R ∈ R^{r×n}, we have

    C†AR† = argmin_U ||A − CUR||_F.

Theorem 14 Given a matrix A ∈ R^{m×n}, construct an m×n random matrix R that satisfies

    E[R] = A  and  ||R||_2 ≤ L.

Compute the per-sample second moment M = max{ ||E[RR^T]||_2, ||E[R^TR]||_2 }. Form the matrix sampling estimator

    R̄ = (1/ℓ) Σ_{i=1}^{ℓ} R_i,

where each R_i is an independent copy of R. Then, for all t ≥ 0,

    P( ||R̄ − A||_2 ≥ t ) ≤ (m + n) exp( (−ℓt²/2) / (M + 2Lt/3) ).

Theorem 15 Given a matrix A = AZZ^T + E ∈ R^{m×n}, where Z ∈ R^{n×k} and Z^T Z = I_k, let S ∈ R^{n×t} be any matrix such that rank(Z^T S) = k, and let C = AS ∈ R^{m×t}. Then

    ||A − CC†A||_ζ² ≤ ||A − Π^ζ_{C,k}(A)||_ζ² ≤ ||A − C(Z^T S)† Z^T||_ζ² ≤ ||E||_ζ² + ||ES(Z^T S)†||_ζ².

Appendix C. Proof of Theorem 5

We first prove a theorem which is the key to the proof of Theorem 5.

Theorem 16 Given matrices A ∈ R^{m×p} and B ∈ R^{p×n}, define the sampling probabilities

    p_j = (||a_{:,j}||_2² + ||b_{j,:}||_2²) / (||A||_F² + ||B||_F²),

where a_{:,j} is the j-th column of A and b_{j,:} is the j-th row of B. Sample ℓ columns of A and the corresponding rows of B with probability distribution {p_j}, and define

    R̄ = (1/ℓ) Σ_{i=1}^{ℓ} (1/p_{j_i}) a_{:,j_i} b_{j_i,:},

where j_i is the i-th sampled index. Then,

    ||R̄ − AB||_2 / (||A||_2 ||B||_2) ≤ √(2(sr(A) + sr(B)) log((m+n)/δ) / ℓ) + 2(sr(A) + sr(B)) log((m+n)/δ) / (3ℓ)

holds with probability at least 1 − δ.

Proof Let us define

    R = (1/p_j) a_{:,j} b_{j,:},

where j is drawn according to {p_j}. It is easy to check that E[R] = AB. It is also easy to check that ||R||_2 ≤ ½(sr(A) + sr(B)) and M = max{ ||E[RR^T]||_2, ||E[R^TR]||_2 } ≤ sr(A) + sr(B). Setting

    t = √(2(sr(A) + sr(B)) log((m+n)/δ) / ℓ) + 2(sr(A) + sr(B)) log((m+n)/δ) / (3ℓ)

and using Theorem 14 gives the result.
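As an illustration of Theorem 16 (our own numerical check, not part of the paper), the sampled outer-product estimator R̄ can be simulated directly; the error should shrink roughly like 1/√ℓ:

```python
import numpy as np

def sampled_product(A, B, ell, rng=None):
    """Estimator of Theorem 16: average of ell rescaled outer products a_{:,j} b_{j,:}."""
    rng = np.random.default_rng() if rng is None else rng
    col_a = np.sum(A**2, axis=0)                   # ||a_{:,j}||_2^2 for each column of A
    row_b = np.sum(B**2, axis=1)                   # ||b_{j,:}||_2^2 for each row of B
    probs = (col_a + row_b) / (col_a.sum() + row_b.sum())
    idx = rng.choice(A.shape[1], size=ell, replace=True, p=probs)
    return (A[:, idx] / probs[idx]) @ B[idx, :] / ell

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((80, 8)) @ rng.standard_normal((8, 5000))   # small stable rank
    B = rng.standard_normal((5000, 8)) @ rng.standard_normal((8, 60))
    exact = A @ B
    for ell in (500, 2000, 8000):
        est = sampled_product(A, B, ell, rng)
        err = np.linalg.norm(est - exact, 2) / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
        print(ell, err)
```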

Now we give the proof of Theorem 5.

Proof We observe that if Û = cU, where c is a constant that we may choose, then the sampling probabilities become

    p̂_j = (||a_{j,:}||_2² + c||u_{j,:}||_2²) / (||A||_F² + c||U||_F²),

and we define

    R̄ = (1/ℓ) Σ_{i=1}^{ℓ} (1/p̂_{j_i}) a_{j_i,:}^T (c u_{j_i,:}) = A^T Ŝ^T Ŝ (cU).

By Theorem 16, we have

    ||R̄ − A^T(cU)||_2 / (||A||_2 ||cU||_2) ≤ √(2(sr(A) + p) log((m+n)/δ) / ℓ) + 2(sr(A) + p) log((m+n)/δ) / (3ℓ).

Besides, we have

    ||R̄ − A^T(cU)||_2 / (||A||_2 ||cU||_2) = ||A^T Ŝ^T Ŝ(cU) − A^T(cU)||_2 / (||A||_2 ||cU||_2) = ||A^T Ŝ^T Ŝ U − A^T U||_2 / (||A||_2 ||U||_2).

Letting c go to infinity, p̂_j → ||u_{j,:}||_2² / ||U||_F² becomes the leverage score of U, and Ŝ → S. We thus get the result

    ||A^T S^T S U − A^T U||_2 / ||A||_2 ≤ √(2(sr(A) + p) log((m+n)/δ) / ℓ) + 2(sr(A) + p) log((m+n)/δ) / (3ℓ).

Appendix D. Proof of Theorem 6

Proof By Theorem 15 with ζ = 2, we have

    ||A − CC†A||_2 ≤ ||E||_2 + ||ES(Z^T S)†||_2.

Let E = A − Ak and Z = V_k. Using the fact that (Z^T S)† = (Z^T S)^T (Z^T S S^T Z)^{-1} when rank(Z^T S) = k, we have

    ||ES(Z^T S)†||_2 = ||E S S^T Z (Z^T S S^T Z)^{-1}||_2 ≤ ||E S S^T Z||_2 ||(Z^T S S^T Z)^{-1}||_2.    (3)

By Theorem 11, we have ǫ_0 = √(k log(k/δ) / ℓ). Hence, we have

    ||(Z^T S S^T Z)^{-1}||_2 = 1/(1 − ǫ_0) = 1 / (1 − √(k log(k/δ) / ℓ)).

For ||E S S^T Z||_2, we have

    ||E S S^T Z||_2 = ||E S S^T Z − EZ||_2                                                                          (4)
                    ≤ ||E||_2 ( √(2(sr(E) + k) log(rank(A)/δ) / ℓ) + 2(sr(E) + k) log(rank(A)/δ) / (3ℓ) )           (5)
                    = ||A − Ak||_2 ( √(2(sr(A − Ak) + k) log(rank(A)/δ) / ℓ) + 2(sr(A − Ak) + k) log(rank(A)/δ) / (3ℓ) ),

where Equation (4) follows from EZ = (A − Ak)V_k = 0, and Equation (5) follows from Theorem 5. Hence, we have

    ||ES(Z^T S)†||_2 ≤ ||A − Ak||_2 ( √(2(sr(E) + k) log(rank(A)/δ) / ℓ) + 2(sr(E) + k) log(rank(A)/δ) / (3ℓ) ) / ( 1 − √(k log(k/δ) / ℓ) ).

Appendix E. Proof of Theorem 2

Proof By Theorem 15, we have

    ||A − CC†A||_F² ≤ ||A − Π^F_{C,k}(A)||_F² ≤ ||E||_F² + ||ES(Z^T S)†||_F².

Let E = A − AZZ^T and S = ΩΓ. Then S^T is a row leverage-score sketching matrix of Z, where Z, Ω and Γ are computed in Algorithm 1. Additionally, S^T is also a subspace embedding matrix of Z with error parameter ǫ_0 = 1/2. Using the fact that (Z^T S)† = (Z^T S)^T (Z^T S S^T Z)^{-1}, we have

    ||ES(Z^T S)†||_F² = ||E S S^T Z (Z^T S S^T Z)^{-1}||_F²
                      ≤ ||E S S^T Z||_F² ||(Z^T S S^T Z)^{-1}||_2²          (6)
                      ≤ ε² ||E||_F² ||Z||_F² ||(Z^T S S^T Z)^{-1}||_2²      (7)
                      ≤ 4kε² ||E||_F²,                                       (8)

where Equation (6) follows from the fact ||AB||_F ≤ ||A||_2 ||B||_F, and Equation (7) follows from Theorem 10 with error parameter ε and EZ = A(I − ZZ^T)Z = 0. Equation (8) is due to Theorem 11 with error parameter ǫ_0 = 1/2: we have ||S^T Z||_2² = (1 ± ǫ_0)||Z||_2² = (1 ± ǫ_0), hence ||(Z^T S S^T Z)^{-1}||_2² ≤ (1 − ǫ_0)^{-2} = 4.

Let ǫ = 4kε², i.e., ε² = ǫ/(4k). Theorem 10 shows that S needs t = Θ(k/ǫ) columns, and S needs t = O(k log k) columns to be a subspace embedding matrix of Z with error parameter ǫ_0 = 1/2 by Theorem 11; hence t = max{Θ(k/ǫ), O(k log k)}. Combining Lemma 8, which gives ||E||_F² ≤ (1 + ǫ)||A − Ak||_F², we have

    ||A − CC†A||_F² ≤ ||A − Π^F_{C,k}(A)||_F² ≤ (1 + ǫ)² ||A − Ak||_F².

By Lemma 9, we have

    min_X ||QΔX − A||_F = ||QΔΔ^T Q^T A − A||_F ≤ ||A − Π^F_{C,k}(A)||_F.

Since (TD)^T ∈ R^{r×m} is the leverage score sketching matrix of QΔ with r = max{Θ(k/ǫ), O(k log k)}, where T and D are computed in Algorithm 1, combining Theorem 12 we have that if

    X̂ = (T^T QΔ)† T^T A = (DT^T QΔ)† DT^T A = argmin_X ||DT^T (QΔX − A)||_F,

where the equality follows from the fact that D is a diagonal matrix, then

    ||QΔX̂ − A||_F ≤ (1 + ǫ)||A − Π^F_{C,k}(A)||_F ≤ (1 + ǫ)² ||A − Ak||_F.

And

    QΔX̂ = QΔ(T^T QΔ)† T^T A = C Y†Δ(T^T QΔ)† R = CUR,

where U = Y†Δ(T^T QΔ)†. Hence, by rescaling ǫ, we have

    ||A − CUR||_F ≤ (1 + ǫ)||A − Ak||_F.


Appendix F. Proof of Theorem 3

Proof By Theorem 13, we have

    ||A − CC†AR†R||_F = ||A − CC†A||_F + ||CC†A − CC†AR†R||_F

because (A − CC†A)^T (CC†A − CC†AR†R) = 0. We rewrite the optimization problem in Theorem 13 as follows:

    min_U ||A − CUR||_F = min_{U,X} ||A − CX + CX − CUR||_F.

Then, we have

    min_{U,X} ||A − CX + CX − CUR||_F ≤ min_{U,X} ( ||A − CX||_F + ||CX − CUR||_F )
                                       = min_X ||A − CX||_F + min_U ||CX − CUR||_F
                                       = ||A − CC†A||_F + ||CC†A − CC†AR†R||_F
                                       = ||A − CC†AR†R||_F.

We first tackle min_X ||A − CX||_F. X̂ = (SC)†SA is the solution to

    min_X ||S(CX − A)||_F²,

so by Theorem 12 we have

    ||C(SC)†SA − A||_F ≤ (1 + ǫ)||A − CC†A||_F.    (9)

As to min_U ||CX̂ − CUR||_F, Ũ = X̂R† is a solution. Let X⋆ = C†A and U⋆ = X⋆R† = C†AR†. Then we have

    ||CX̂ − CŨR − (CX⋆ − CU⋆R)||_F ≤ ||CX̂ − CX⋆||_F + ||CŨR − CU⋆R||_F
                                    = ||CX̂ − CX⋆||_F + ||CX̂R†R − CX⋆R†R||_F
                                    ≤ ||CX̂ − CX⋆||_F + ||CX̂ − CX⋆||_F ||R†R||_2
                                    ≤ 2||CX̂ − CX⋆||_F
                                    ≤ 4√ǫ ||CX⋆ − A||_F,

where the last inequality follows from Theorem 12. Hence, we have

    ||CX̂ − CŨR||_F ≤ ||CX⋆ − CU⋆R||_F + 4√ǫ ||CX⋆ − A||_F.

Let

    Û = argmin_U ||CX̂T − CURT||_F.    (10)

By Theorem 12, we have

    ||CÛR − CX̂||_F ≤ (1 + ǫ)||CX̂ − CX̂R†R||_F
                   ≤ (1 + ǫ)||CC†A − CC†AR†R||_F + 4(1 + ǫ)√ǫ ||A − CC†A||_F.

Hence, we have

    min_{U,X} ( ||A − CX||_F + ||CX − CUR||_F ) ≤ ||A − CX̂||_F + ||CX̂ − CÛR||_F
        ≤ (1 + ǫ)||A − CC†A||_F + (1 + ǫ)||CC†A − CC†AR†R||_F + 4(1 + ǫ)√ǫ ||A − CC†A||_F
        ≤ (1 + 4√ǫ + ǫ + 4ǫ^{3/2})||A − CC†AR†R||_F.

By rescaling ǫ, we get the result.

Appendix G. Proof of Theorem 4

To prove the above theorem, we first give some important lemmas.

Lemma 17 Given two matrices A ∈ R^{m×n} and B ∈ R^{m×p} with R(B) ⊆ R(A), we have AA† = UU^T and BB† = UQQ^TU^T, where U ∈ R^{m×k1} and Q ∈ R^{k1×k2} are column orthonormal matrices, and k1 and k2 are the ranks of A and B, respectively.

Proof The first equation can be proved using an economic SVD: AA† = UΣV^T VΣ^{-1}U^T = UU^T, where U ∈ R^{m×k1}. Since R(B) ⊆ R(A), there exists a matrix X ∈ R^{k1×p} such that B = UX. Let X = QR be a pivoted QR decomposition of X, where Q ∈ R^{k1×k2} and R ∈ R^{k2×p} is a full row rank matrix. Hence BB† = UQRR†Q^TU^T = UQQ^TU^T.

Lemma 18 Given matrices A ∈ R^{m×n}, Z ∈ R^{m×p}, C ∈ R^{m×q}, assume R(Z) ⊆ R(C) ⊆ R(A), and let X ∈ R^{n×n} be a projection matrix. Then

    ||A − CC†AX||_F ≤ ||A − ZZ†AX||_F.

Proof By Lemma 17, CC† = UU^T, where U ∈ R^{m×k1} is a column orthonormal matrix, i.e., U^TU = I_{k1}, and k1 is the rank of C; and ZZ† = UQQ^TU^T, where Q is a column orthonormal matrix, i.e., Q^TQ = I_{k2}, and k2 is the rank of Z. X is a projection matrix with the properties X² = X and X^T = X. We have

    ||A − CC†AX||_F² − ||A − ZZ†AX||_F²
    = tr( A^TA − 2A^TUU^TAX + X^TA^TUU^TAX ) − tr( A^TA − 2A^TUQQ^TU^TAX + X^TA^TUQQ^TU^TAX )
    = tr( 2(A^TUQQ^TU^TAX − A^TUU^TAX) + X^TA^TUU^TAX − X^TA^TUQQ^TU^TAX )
    = tr( 2(A^TUQQ^TU^TAX² − A^TUU^TAX²) + X^TA^TUU^TAX − X^TA^TUQQ^TU^TAX )
    = tr( 2(X^TA^TUQQ^TU^TAX − X^TA^TUU^TAX) + X^TA^TUU^TAX − X^TA^TUQQ^TU^TAX )
    = tr( X^TA^TUQQ^TU^TAX − X^TA^TUU^TAX )
    = tr( X^TA^TU(QQ^T − I)U^TAX )
    = tr( X^TA^TU(QQ^T − QQ^T − Q̃Q̃^T)U^TAX )
    = −tr( X^TA^TU Q̃Q̃^T U^TAX )
    ≤ 0.

Here we use the fact that I = [Q, Q̃][Q, Q̃]^T = QQ^T + Q̃Q̃^T.

Now we begin to prove Theorem 4.

Proof Using Theorem 2, we have

    ||A − C1ŨC2^T||_F ≤ ||A − Ak||_F,

where Ũ is constructed as in Algorithm 1. By Theorem 13, we have

    ||A − C1C1†A(C2†)^TC2^T||_F ≤ ||A − C1ŨC2^T||_F ≤ ||A − Ak||_F.

By the fact that R(C1) ⊆ R(C) ⊆ R(A) and R(C2) ⊆ R(C) ⊆ R(A), using Lemma 18 twice, we get the following result:

    ||A − CC†A(C†)^TC^T||_F ≤ ||A − C1C1†A(C2†)^TC2^T||_F ≤ ||A − Ak||_F.

Since S is the leverage score sketching matrix of C, by Theorem 3 we have

    ||A − CUC^T||_F² ≤ (1 + ǫ)||A − CC†A(C†)^TC^T||_F².

By rescaling ǫ, we get the final result:

    ||A − CUC^T||_F ≤ (1 + ǫ)||A − Ak||_F.


References

David G. Anderson, Simon S. Du, Michael W. Mahoney, Christopher Melgaard, Kunming Wu, and Ming Gu. Spectral gap error bounds for improving CUR matrix decomposition and the Nyström method. In AISTATS, 2015.

Christos Boutsidis and David P. Woodruff. Optimal CUR matrix decompositions. Eprint Arxiv, pages 353–362, 2014.

Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90, 2013.

Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrix product in terms of stable rank. arXiv preprint arXiv:1507.02268, 2015.

Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.

Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012.

Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Eprint Arxiv, pages 567–575, 2013.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. The Journal of Machine Learning Research, 13(1):981–1006, 2012.

Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.

Christian Thurau, Kristian Kersting, and Christian Bauckhage. Deterministic CUR for improved large-scale data analysis: An empirical study. In SDM, pages 684–695. SIAM, 2012.

Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14(9):2729–2769, 2013.

Shusen Wang, Luo Luo, and Zhihua Zhang. The modified Nyström method: Theories, algorithms, and extension. arXiv preprint arXiv:1406.5675, 2014.

Shusen Wang, Zhihua Zhang, and Tong Zhang. Towards more efficient Nyström approximation and CUR matrix decomposition. arXiv preprint arXiv:1503.08395, 2015.

Jiyan Yang, Oliver Rübel, Prabhat, Michael W. Mahoney, and Benjamin P. Bowen. Identifying important ions and positions in mass spectrometry imaging data using CUR matrix decompositions. Analytical Chemistry, 2015a.

Tianbao Yang, Lijun Zhang, Rong Jin, and Shenghuo Zhu. An explicit sampling dependent spectral error bound for column subset selection. arXiv preprint arXiv:1505.00526, 2015b.
