Consistency of trace norm minimization


arXiv:0710.2848v1 [cs.LG] 15 Oct 2007

Francis R. Bach

[email protected]

INRIA - Willow Project, Département d'Informatique, École Normale Supérieure, 45 rue d'Ulm, 75230 Paris, France

Abstract

Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled.

1. Introduction

In recent years, regularization by various non-Euclidean norms has seen considerable interest. In particular, in the context of linear supervised learning, norms such as the $\ell_1$-norm may induce sparse loading vectors, i.e., loading vectors with low cardinality or $\ell_0$-norm. Such regularization schemes, also known as the Lasso (Tibshirani, 1994) for least-square regression, come with efficient path-following algorithms (Efron et al., 2004). Moreover, recent work has studied conditions under which such procedures consistently estimate the sparsity pattern of the loading vector (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006). When learning on rectangular matrices, the rank is a natural extension of the cardinality, and the sum of singular values, also known as the trace norm or the nuclear norm, is the natural extension of the $\ell_1$-norm; indeed, as the $\ell_1$-norm is the convex envelope of the $\ell_0$-norm on the unit ball (i.e., the largest lower-bounding convex function) (Boyd and Vandenberghe, 2003), the trace norm is the convex envelope of the rank over the unit ball of the spectral norm (Fazel et al., 2001). In practice, it leads to low-rank solutions (Fazel et al., 2001, Srebro et al., 2005) and has seen recent increased interest in the context of collaborative filtering (Srebro et al., 2005), multi-task learning (Abernethy et al., 2006, Argyriou et al., 2007) or classification with multiple classes (Amit et al., 2007).

In this paper, we consider the rank consistency of trace norm regularization with the square loss, i.e., if the data were actually generated by a low-rank matrix, will the matrix and its rank be consistently estimated? In Section 4, we provide necessary and sufficient conditions for the rank consistency that are extensions of corresponding results for the Lasso (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006) and the group Lasso (Bach,

2007). We do so under two sets of sampling assumptions detailed in Section 3.2: a fully i.i.d. assumption and a non-i.i.d. assumption which is natural in the context of collaborative filtering. As for the Lasso and the group Lasso, the necessary condition implies that such procedures do not always estimate the rank correctly; following the adaptive versions of the Lasso and group Lasso (Zou, 2006), we design an adaptive version that achieves $n^{-1/2}$-consistency and rank consistency, with no consistency conditions. Finally, in Section 6, we present a smoothing approach to convex optimization with the trace norm, while in Section 6.3, we show simulations on toy examples that illustrate the consistency results.

2. Notations

In this paper we consider various norms on vectors and matrices. On vectors $x\in\mathbb{R}^d$, we always consider the Euclidean norm, i.e., $\|x\| = (x^\top x)^{1/2}$. On rectangular matrices in $\mathbb{R}^{p\times q}$, however, we consider several norms, based on singular values (Stewart and Sun, 1990): the spectral norm $\|M\|_2$ is the largest singular value (defined as $\|M\|_2 = \sup_{x\in\mathbb{R}^q}\|Mx\|/\|x\|$), the trace norm (or nuclear norm) $\|M\|_*$ is the sum of singular values, and the Frobenius norm $\|M\|_F$ is the $\ell_2$-norm of singular values (also defined as $\|M\|_F = (\operatorname{tr} M^\top M)^{1/2}$). In Appendices A and B, we review and derive relevant tools and results regarding perturbation of singular values as well as the trace norm.

Given a matrix $M\in\mathbb{R}^{p\times q}$, $\operatorname{vec}(M)$ denotes the vector in $\mathbb{R}^{pq}$ obtained by stacking its columns into a single vector, and $A\otimes B$ denotes the Kronecker product between matrices $A\in\mathbb{R}^{p_1\times q_1}$ and $B\in\mathbb{R}^{p_2\times q_2}$, i.e., the matrix in $\mathbb{R}^{p_1p_2\times q_1q_2}$ defined by blocks of size $p_2\times q_2$ equal to $a_{ij}B$. We make constant use of the following identities: $(B^\top\otimes A)\operatorname{vec}(X) = \operatorname{vec}(AXB)$ and $\operatorname{vec}(uv^\top) = v\otimes u$. For more details and properties, see Golub and Loan (1996) and Magnus and Neudecker (1998). We also use the notation $\Sigma W$, for $\Sigma\in\mathbb{R}^{pq\times pq}$ and $W\in\mathbb{R}^{p\times q}$, to denote the matrix in $\mathbb{R}^{p\times q}$ such that $\operatorname{vec}(\Sigma W) = \Sigma\operatorname{vec}(W)$ (note the potential confusion with the usual matrix product $\Sigma W$ when $\Sigma$ is a matrix with $p$ columns).

We also use the following standard asymptotic notation: a random variable $Z_n$ is said to be of order $O_p(a_n)$ if for any $\eta > 0$, there exists $M > 0$ such that $\sup_n P(|Z_n| > M a_n) < \eta$. Moreover, $Z_n$ is said to be of order $o_p(a_n)$ if $Z_n/a_n$ converges to zero in probability, i.e., if for any $\eta > 0$, $P(|Z_n| > \eta a_n)$ converges to zero. See Van der Vaart (1998) and Shao (2003) for further definitions and properties of asymptotics in probability. Finally, we use the following two conventions: lowercase for vectors and uppercase for matrices, while bold fonts are reserved for population quantities.
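As a side illustration (not part of the original paper), the two Kronecker/vectorization identities above can be checked numerically; the snippet below is a minimal NumPy sketch, with our own variable names, using column-major (column-stacking) vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, q1, p2, q2 = 3, 4, 2, 5

A = rng.standard_normal((p2, q2))   # plays the role of A in (B^T ⊗ A) vec(X) = vec(A X B)
B = rng.standard_normal((p1, q1))
X = rng.standard_normal((q2, p1))   # shapes chosen so that A @ X @ B is well defined

def vec(M):
    # stack the columns of M into a single vector (column-major order)
    return M.reshape(-1, order="F")

# identity 1: (B^T ⊗ A) vec(X) = vec(A X B)
print(np.allclose(np.kron(B.T, A) @ vec(X), vec(A @ X @ B)))   # True

# identity 2: vec(u v^T) = v ⊗ u
u = rng.standard_normal(p2)
v = rng.standard_normal(q2)
print(np.allclose(vec(np.outer(u, v)), np.kron(v, u)))          # True
```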

3. Trace norm minimization

We consider the problem of predicting a real random variable $z$ as a linear function of a matrix $M\in\mathbb{R}^{p\times q}$, where $p$ and $q$ are two fixed strictly positive integers. Throughout this paper, we assume that we are given $n$ observations $(M_i, z_i)$, $i=1,\dots,n$, and we consider the following optimization problem with the square loss:
$$\min_{W\in\mathbb{R}^{p\times q}} \; \frac{1}{2n}\sum_{i=1}^n\big(z_i - \operatorname{tr} W^\top M_i\big)^2 + \lambda_n\|W\|_*, \qquad (1)$$

where $\|W\|_*$ denotes the trace norm of $W$.

3.1 Special cases

Regularization by the trace norm has numerous applications (see, e.g., Recht et al. (2007) for a review); in this paper, we are particularly interested in the following two situations:

Lasso and group Lasso When $x_i\in\mathbb{R}^m$, we can define $M_i = \operatorname{Diag}(x_i)\in\mathbb{R}^{m\times m}$ as the diagonal matrix with $x_i$ on the diagonal. In this situation, the minimization in Eq. (1) must lead to diagonal solutions (indeed, the minimum trace norm matrix with fixed diagonal is the corresponding diagonal matrix, which is a consequence of Lemma 20 and Proposition 21), and for a diagonal matrix the trace norm is simply the $\ell_1$-norm of the diagonal. Once we have derived our consistency conditions, we check in Section 4.5 that they actually lead to the known ones for the Lasso (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006). We can also see the group Lasso as a special case; indeed, if $x_i^j\in\mathbb{R}^{d_j}$ for $j=1,\dots,m$, $i=1,\dots,n$, then we define $M_i\in\mathbb{R}^{(\sum_{j=1}^m d_j)\times m}$ as the block-diagonal matrix (with non-square blocks) with diagonal blocks $x_i^j$, $j=1,\dots,m$. Similarly, the optimal $\hat W$ must share the same block-diagonal form, and its singular values are exactly the norms of each block, i.e., the trace norm is indeed the sum of the norms of each group. We also recover results from Bach (2007) in Section 4.5.

Note that the Lasso and group Lasso can be seen as special cases where the singular vectors are fixed. However, the main difficulty in analyzing trace norm regularization, as well as the main reason for its use, is that the singular vectors are not fixed, and they can often be seen as implicit features learned by the estimation procedure (Srebro et al., 2005). In this paper we derive consistency results about the values and number of such features.

Collaborative filtering and low-rank completion Another natural application is collaborative filtering, where two types of attributes $x$ and $y$ are observed and we consider bilinear forms in $x$ and $y$, which can be written as a linear form in $M = xy^\top$ (thus it corresponds to situations where all matrices $M_i$ have rank one). In this setting, the matrices $M_i$ are not usually i.i.d. but exhibit a statistical dependence structure outlined in Section 3.2. A special case is when no attributes are observed and we simply wish to complete a partially observed matrix (Srebro et al., 2005, Abernethy et al., 2006). The results presented in this paper do not immediately apply, because the dimension of the estimated matrix may grow with the number of observed entries; this situation is out of the scope of this paper.

3.2 Assumptions

We make the following assumptions on the sampling distributions of $M\in\mathbb{R}^{p\times q}$ for the problem in Eq. (1). We let $\hat\Sigma_{mm} = \frac{1}{n}\sum_{i=1}^n\operatorname{vec}(M_i)\operatorname{vec}(M_i)^\top\in\mathbb{R}^{pq\times pq}$, and we consider the following assumptions:

(A1) Given $M_i$, $i=1,\dots,n$, the $n$ values $z_i$ are i.i.d. and there exists $\mathbf W\in\mathbb{R}^{p\times q}$ such that for all $i$, $\mathbb{E}(z_i|M_1,\dots,M_n) = \operatorname{tr}\mathbf W^\top M_i$ and $\operatorname{var}(z_i|M_1,\dots,M_n)$ is a strictly positive constant $\sigma^2$. $\mathbf W$ is not equal to zero and does not have full rank.

(A2) There exists an invertible matrix $\Sigma_{mm}\in\mathbb{R}^{pq\times pq}$ such that $\mathbb{E}\|\hat\Sigma_{mm}-\Sigma_{mm}\|_F^2 = O(\zeta_n^2)$ for a certain sequence $\zeta_n$ that tends to zero.

(A3) The random variable $n^{-1/2}\sum_{i=1}^n\varepsilon_i\operatorname{vec}(M_i)$ converges in distribution to a normal distribution with mean zero and covariance matrix $\sigma^2\Sigma_{mm}$.

Assumption (A1) states that, given the input matrices $M_i$, $i=1,\dots,n$, we have a linear prediction model in which the loading matrix $\mathbf W$ is non-trivial and rank-deficient; the goal is to estimate this rank (as well as the matrix itself). We denote by $\mathbf W = \mathbf U\operatorname{Diag}(\mathbf s)\mathbf V^\top$ its singular value decomposition, with $\mathbf U\in\mathbb{R}^{p\times\mathbf r}$ and $\mathbf V\in\mathbb{R}^{q\times\mathbf r}$, where $\mathbf r\in(0,\min\{p,q\})$ denotes the rank of $\mathbf W$. We also denote by $\mathbf U_\perp\in\mathbb{R}^{p\times(p-\mathbf r)}$ and $\mathbf V_\perp\in\mathbb{R}^{q\times(q-\mathbf r)}$ any orthogonal complements of $\mathbf U$ and $\mathbf V$.

We let $\varepsilon_i = z_i - \operatorname{tr}\mathbf W^\top M_i$, $\hat\Sigma_{Mz} = \frac{1}{n}\sum_{i=1}^n z_i M_i\in\mathbb{R}^{p\times q}$, and $\hat\Sigma_{M\varepsilon} = \frac{1}{n}\sum_{i=1}^n\varepsilon_i M_i = \hat\Sigma_{Mz} - \hat\Sigma_{mm}\mathbf W\in\mathbb{R}^{p\times q}$. We may then rewrite Eq. (1) as
$$\min_{W\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(W)^\top\hat\Sigma_{mm}\operatorname{vec}(W) - \operatorname{tr} W^\top\hat\Sigma_{Mz} + \lambda_n\|W\|_*, \qquad (2)$$
or, equivalently,
$$\min_{W\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(W-\mathbf W)^\top\hat\Sigma_{mm}\operatorname{vec}(W-\mathbf W) - \operatorname{tr} W^\top\hat\Sigma_{M\varepsilon} + \lambda_n\|W\|_*. \qquad (3)$$

The sampling assumptions (A2) and (A3) may seem restrictive, but they are satisfied in the following two natural situations. The first corresponds to a classical fully i.i.d. problem, where the pairs $(z_i, M_i)$ are sampled i.i.d.:

Lemma 1 Assume (A1). If the matrices $M_i$ are sampled i.i.d., $z$ and $M$ have finite fourth-order moments, and $\mathbb{E}\big[\operatorname{vec}(M)\operatorname{vec}(M)^\top\big]$ is invertible, then (A2) and (A3) are satisfied with $\zeta_n = n^{-1/2}$.

Note the further refinement when, for each $i$, $M_i = x_iy_i^\top$ with $x_i$ and $y_i$ independent, which implies that $\Sigma_{mm}$ factorizes as a Kronecker product of the form $\Sigma_{yy}\otimes\Sigma_{xx}$, where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the (invertible) second-order moment matrices of $x$ and $y$.

The second situation corresponds to a collaborative filtering setting where two types of attributes are observed, e.g., $x$ and $y$, and for every pair $(x,y)$ we wish to predict $z$ as a bilinear form in $x$ and $y$: we first sample $n_x$ values for $x$ and $n_y$ values for $y$, and we select uniformly at random a subset of $n\leq n_xn_y$ observations from the $n_xn_y$ possible pairs. The following lemma, proved in Appendix C.1, shows that this set-up satisfies our assumptions:

Lemma 2 Assume (A1). Assume moreover that $n_x$ values $\tilde x_1,\dots,\tilde x_{n_x}$ are sampled i.i.d. and $n_y$ values $\tilde y_1,\dots,\tilde y_{n_y}$ are sampled i.i.d. from distributions with finite fourth-order moments and invertible second-order moment matrices $\Sigma_{xx}$ and $\Sigma_{yy}$. Assume also that a random subset of size $n$ of pairs $(i_k, j_k)$ in $\{1,\dots,n_x\}\times\{1,\dots,n_y\}$ is sampled uniformly. Then, if $n_x$, $n_y$ and $n$ tend to infinity, (A2) and (A3) are satisfied with $\Sigma_{mm} = \Sigma_{yy}\otimes\Sigma_{xx}$ and $\zeta_n = n^{-1/2} + n_x^{-1/2} + n_y^{-1/2}$.
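To make the sampling scheme of Lemma 2 concrete, here is a minimal NumPy sketch (ours, not part of the paper; all variable names and problem sizes are arbitrary) that draws the attributes, samples $n$ pairs without replacement, and compares $\hat\Sigma_{mm}$ to $\Sigma_{yy}\otimes\Sigma_{xx}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 4, 4, 2
nx, ny, n = 200, 200, 5000            # nx*ny possible pairs, n observed ones

# latent attributes and a rank-r population matrix W
X = rng.standard_normal((nx, p))
Y = rng.standard_normal((ny, q))
W = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))

# sample n distinct pairs (i_k, j_k) uniformly without replacement
pairs = rng.choice(nx * ny, size=n, replace=False)
I, J = pairs // ny, pairs % ny

# observations z_k = tr(W^T M_k) + noise, with rank-one M_k = x_{i_k} y_{j_k}^T
M = X[I][:, :, None] * Y[J][:, None, :]            # shape (n, p, q)
z = np.einsum('kpq,pq->k', M, W) + 0.1 * rng.standard_normal(n)

# empirical second-moment matrix, to be compared with Sigma_yy ⊗ Sigma_xx
vecM = M.transpose(0, 2, 1).reshape(n, p * q)      # rows are vec(M_k)^T (column-major vec)
Sigma_mm_hat = vecM.T @ vecM / n
Sigma_kron = np.kron(Y.T @ Y / ny, X.T @ X / nx)
print(np.linalg.norm(Sigma_mm_hat - Sigma_kron, 'fro'))   # small for large nx, ny, n
```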

3.3 Optimality conditions

From the expression of the subdifferential of the trace norm in Proposition 21 (Appendix B), we can identify the optimality condition for the problem in Eq. (1), which we will use constantly in the paper:

Proposition 3 The matrix $W$ with singular value decomposition $W = U\operatorname{Diag}(s)V^\top$ (with strictly positive singular values $s$) is optimal for the problem in Eq. (1) if and only if
$$\hat\Sigma_{mm}W - \hat\Sigma_{Mz} + \lambda_n UV^\top + N = 0, \qquad (4)$$
with $U^\top N = 0$, $NV = 0$ and $\|N\|_2\leq\lambda_n$.

This implies notably that $W$ and $\hat\Sigma_{mm}W - \hat\Sigma_{Mz}$ have simultaneous singular value decompositions, and that the singular values of $\hat\Sigma_{mm}W - \hat\Sigma_{Mz}$ are no larger than $\lambda_n$, and exactly equal to $\lambda_n$ for the corresponding strictly positive singular values of $W$. Note that when all matrices are diagonal (the Lasso case), we obtain the usual optimality conditions (see also Recht et al. (2007) for further discussions).
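The condition above can be checked numerically. The sketch below is our own illustration, not the paper's algorithm (the paper's own solver is the smoothed Newton method of Section 6): it solves Eq. (1) by proximal gradient descent with singular-value soft-thresholding and then verifies Eq. (4) on the result; all function names are ours.

```python
import numpy as np

def vec(A):
    return A.reshape(-1, order="F")              # stack columns

def unvec(v, p, q):
    return v.reshape((p, q), order="F")

def prox_trace_norm(W, t):
    # singular-value soft-thresholding: proximal operator of t * ||.||_*
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def solve_trace_norm(M, z, lam, n_iter=2000):
    # proximal gradient descent on Eq. (1) (equivalently Eq. (2))
    n, p, q = M.shape
    X = np.stack([vec(Mi) for Mi in M])          # rows are vec(M_i)^T
    Sigma = X.T @ X / n                          # \hat\Sigma_{mm}
    SigMz = unvec(X.T @ z / n, p, q)             # \hat\Sigma_{Mz}
    L = np.linalg.eigvalsh(Sigma).max()          # Lipschitz constant, step size 1/L
    W = np.zeros((p, q))
    for _ in range(n_iter):
        grad = unvec(Sigma @ vec(W), p, q) - SigMz
        W = prox_trace_norm(W - grad / L, lam / L)
    return W, Sigma, SigMz

def check_optimality(W, Sigma, SigMz, lam, tol=1e-6):
    # Eq. (4): Sigma_mm W - Sigma_Mz + lam U V^T + N = 0,
    # with U^T N = 0, N V = 0 and ||N||_2 <= lam
    p, q = W.shape
    G = unvec(Sigma @ vec(W), p, q) - SigMz
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = int(np.sum(s > tol))
    U, V = U[:, :r], Vt[:r, :].T
    N = -(G + lam * U @ V.T)
    return (np.linalg.norm(U.T @ N) < 1e-4 and np.linalg.norm(N @ V) < 1e-4
            and np.linalg.norm(N, 2) <= lam + 1e-4)
```

A first-order method of this kind converges more slowly than the Newton approach of Section 6, but it is sufficient for moderate sizes and makes the role of the subdifferential explicit.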

4. Consistency results

We consider two types of consistency: regular consistency, i.e., we want the probability $P(\|\hat W - \mathbf W\| > \varepsilon)$ to tend to zero as $n$ tends to infinity, for all $\varepsilon > 0$; and rank consistency, i.e., we want $P(\operatorname{rank}(\hat W)\neq\operatorname{rank}(\mathbf W))$ to tend to zero as $n$ tends to infinity. As in the case of the Lasso, consistency depends on the decay of the regularization parameter. Essentially, we obtain the following results:

a) if $\lambda_n$ does not tend to zero, then the trace norm estimate $\hat W$ is not consistent;

b) if $\lambda_n$ tends to zero faster than $n^{-1/2}$, then the estimate is consistent and its error is $O_p(n^{-1/2})$, while it is not rank-consistent with probability tending to one (see Section 4.1);

c) if $\lambda_n$ tends to zero exactly at rate $n^{-1/2}$, then the estimator is consistent with error $O_p(n^{-1/2})$, but the probability of estimating the correct rank converges to a limit in $(0,1)$ (see Section 4.2);

d) if $\lambda_n$ tends to zero more slowly than $n^{-1/2}$, then the estimate is consistent with error $O_p(\lambda_n)$ and its rank consistency depends on specific consistency conditions detailed in Section 4.3.

The following sections examine each of these cases and state precise theorems. We then consider some special cases, i.e., factored second-order moments and implications for the special cases of the Lasso and group Lasso.

The first proposition (proved in Appendix C.2) considers the case where the regularization parameter $\lambda_n$ converges to a certain limit $\lambda_0$. When this limit is zero, we obtain regular consistency (Corollary 5 below), while if $\lambda_0 > 0$, then $\hat W$ tends in probability to a limit which is always different from $\mathbf W$:

Proposition 4 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If $\lambda_n$ tends to a limit $\lambda_0 > 0$, then $\hat W$ converges in probability to the unique global minimizer of
$$\min_{W\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(W-\mathbf W)^\top\Sigma_{mm}\operatorname{vec}(W-\mathbf W) + \lambda_0\|W\|_*.$$

Corollary 5 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If $\lambda_n$ tends to zero, then $\hat W$ converges in probability to $\mathbf W$.

We now consider finer results when $\lambda_n$ tends to zero at certain rates, slower or faster than $n^{-1/2}$, or exactly at rate $n^{-1/2}$.

4.1 Fast decay of the regularization parameter

The following proposition—which is a consequence of standard results in M-estimation (Shao, 2003, Van der Vaart, 1998)—considers the case where $n^{1/2}\lambda_n$ tends to zero; we obtain that $\hat W$ is asymptotically normal with mean $\mathbf W$ and covariance matrix $n^{-1}\sigma^2\Sigma_{mm}^{-1}$, i.e., for fast decays, the first-order expansion is the same as the one with no regularization parameter:

Proposition 6 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If $n^{1/2}\lambda_n$ tends to zero, then $n^{1/2}(\hat W - \mathbf W)$ is asymptotically normal with mean zero and covariance matrix $\sigma^2\Sigma_{mm}^{-1}$.

We now consider the corresponding rank consistency results, when $\lambda_n$ goes to zero faster than $n^{-1/2}$. The following proposition (proved in Appendix C.3) states that for such a regularization parameter, the solution has rank strictly greater than $\mathbf r$ with probability tending to one and can thus not be rank consistent:

Proposition 7 Assume (A1), (A2) and (A3). If $n^{1/2}\lambda_n$ tends to zero, then $P(\operatorname{rank}(\hat W) > \operatorname{rank}(\mathbf W))$ tends to one.

4.2 $n^{-1/2}$-decay of the regularization parameter

We first consider regular consistency through the following proposition (proved in Appendix C.4), then rank consistency (proposition proved in Appendix C.5):

Proposition 8 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If $n^{1/2}\lambda_n$ tends to a limit $\lambda_0 > 0$, then $n^{1/2}(\hat W - \mathbf W)$ converges in distribution to the unique global minimizer of
$$\min_{\Delta\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(\Delta)^\top\Sigma_{mm}\operatorname{vec}(\Delta) - \operatorname{tr}\Delta^\top A + \lambda_0\big[\operatorname{tr}\mathbf U^\top\Delta\mathbf V + \|\mathbf U_\perp^\top\Delta\mathbf V_\perp\|_*\big],$$
where $\operatorname{vec}(A)\in\mathbb{R}^{pq}$ is normally distributed with mean zero and covariance matrix $\sigma^2\Sigma_{mm}$.

Proposition 9 Assume (A1), (A2) and (A3). If $n^{1/2}\lambda_n$ tends to a limit $\lambda_0 > 0$, then the probability that the rank of $\hat W$ is equal to the rank of $\mathbf W$ converges to $P(\|\Lambda - \lambda_0^{-1}\Theta\|_2\leq 1)\in(0,1)$, where $\Lambda\in\mathbb{R}^{(p-\mathbf r)\times(q-\mathbf r)}$ is defined in Eq. (6) (Section 4.3) and $\Theta\in\mathbb{R}^{(p-\mathbf r)\times(q-\mathbf r)}$ has a normal distribution with mean zero and covariance matrix
$$\sigma^2\big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top\Sigma_{mm}^{-1}(\mathbf V_\perp\otimes\mathbf U_\perp)\big]^{-1}.$$

The previous proposition ensures that the estimate $\hat W$ cannot be rank consistent with this decay of the regularization parameter. Note that when we take $\lambda_0$ small (i.e., we get closer to fast decays), the probability $P(\|\Lambda - \lambda_0^{-1}\Theta\|_2\leq 1)$ tends to zero, while when we take $\lambda_0$ large (i.e., we get closer to slow decays), the same probability tends to zero or one depending on the sign of $\|\Lambda\|_2 - 1$. This heuristic argument is made more precise in the following section.

4.3 Slow decay of the regularization parameter

When $\lambda_n$ tends to zero more slowly than $n^{-1/2}$, the first-order expansion is deterministic, as the following proposition shows (proof in Appendix C.6):

Proposition 10 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If $n^{1/2}\lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, then $\lambda_n^{-1}(\hat W - \mathbf W)$ converges in probability to the unique global minimizer $\Delta$ of
$$\min_{\Delta\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(\Delta)^\top\Sigma_{mm}\operatorname{vec}(\Delta) + \operatorname{tr}\mathbf U^\top\Delta\mathbf V + \|\mathbf U_\perp^\top\Delta\mathbf V_\perp\|_*. \qquad (5)$$
Moreover, we have $\hat W = \mathbf W + \lambda_n\Delta + O_p\big(\lambda_n(\lambda_n + \zeta_n + \lambda_n^{-1}n^{-1/2})\big)$.

The last proposition gives a first-order expansion of $\hat W$ around $\mathbf W$. From Proposition 18 (Appendix B), we immediately obtain that if $\mathbf U_\perp^\top\Delta\mathbf V_\perp$ is different from zero, then the rank of $\hat W$ is ultimately strictly larger than $\mathbf r$. The condition $\mathbf U_\perp^\top\Delta\mathbf V_\perp = 0$ is thus necessary for rank consistency when $\lambda_nn^{1/2}$ tends to infinity while $\lambda_n$ tends to zero. The next lemma (proved in Appendix C.7) gives a necessary and sufficient condition for $\mathbf U_\perp^\top\Delta\mathbf V_\perp = 0$.

Lemma 11 Assume $\Sigma_{mm}$ is invertible, and $\mathbf W = \mathbf U\operatorname{Diag}(\mathbf s)\mathbf V^\top$ is the singular value decomposition of $\mathbf W$. Then the unique global minimizer of $\operatorname{vec}(\Delta)^\top\Sigma_{mm}\operatorname{vec}(\Delta) + \operatorname{tr}\mathbf U^\top\Delta\mathbf V + \|\mathbf U_\perp^\top\Delta\mathbf V_\perp\|_*$ satisfies $\mathbf U_\perp^\top\Delta\mathbf V_\perp = 0$ if and only if
$$\Big\|\big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top\Sigma_{mm}^{-1}(\mathbf V_\perp\otimes\mathbf U_\perp)\big]^{-1}\big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top\Sigma_{mm}^{-1}(\mathbf V\otimes\mathbf U)\big]\operatorname{vec}(I)\Big\|_2\leq 1.$$

This leads us to consider the matrix $\Lambda\in\mathbb{R}^{(p-\mathbf r)\times(q-\mathbf r)}$ defined by
$$\operatorname{vec}(\Lambda) = \big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top\Sigma_{mm}^{-1}(\mathbf V_\perp\otimes\mathbf U_\perp)\big]^{-1}\big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top\Sigma_{mm}^{-1}(\mathbf V\otimes\mathbf U)\big]\operatorname{vec}(I), \qquad (6)$$
and the two (weak and strict) consistency conditions:
$$\|\Lambda\|_2\leq 1, \qquad (7)$$
$$\|\Lambda\|_2 < 1. \qquad (8)$$

Note that if $\Sigma_{mm}$ is proportional to the identity, both conditions are always satisfied, because then $\Lambda = 0$. We can now prove that the condition in Eq. (8) is sufficient for rank consistency when $n^{1/2}\lambda_n$ tends to infinity, while the condition in Eq. (7) is necessary for the existence of a sequence $\lambda_n$ such that the estimate is both consistent and rank consistent (which is a stronger result than restricting $\lambda_n$ to tend to zero more slowly than $n^{-1/2}$). The following two theorems are proved in Appendices C.8 and C.9:

Theorem 12 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If the condition in Eq. (8) is satisfied, and if $n^{1/2}\lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, then the estimate $\hat W$ is consistent and rank-consistent.

Theorem 13 Assume (A1), (A2) and (A3). Let $\hat W$ be a global minimizer of Eq. (1). If the estimate $\hat W$ is consistent and rank-consistent, then the condition in Eq. (7) is satisfied.

As opposed to the Lasso, where Eq. (7) is a necessary and sufficient condition for rank consistency (Yuan and Lin, 2007), this is not even true in general for the group Lasso (Bach, 2007). Looking at the limiting case $\|\Lambda\|_2 = 1$ would similarly lead to additional but more complex sufficient and necessary conditions, and is left for future research.

Moreover, it may seem surprising that, even when the sufficient condition in Eq. (8) is fulfilled, the first-order expansion of $\hat W$, i.e., $\hat W = \mathbf W + \lambda_n\Delta + o_p(\lambda_n)$, is such that $\mathbf U_\perp^\top\Delta\mathbf V_\perp = 0$, while nothing is said about $\mathbf U_\perp^\top\Delta\mathbf V$ and $\mathbf U^\top\Delta\mathbf V_\perp$, which are not equal to zero in general. This is due to the fact that the first $\mathbf r$ singular vectors $\mathbf U$ and $\mathbf V$ of $\mathbf W + \lambda_n\Delta$ are not fixed; indeed, the first $\mathbf r$ singular vectors (i.e., the implicit features) do rotate, but with no contribution on $\mathbf U_\perp\mathbf V_\perp^\top$. This is to be contrasted with the adaptive version, where the first-order expansion asymptotically has constant singular vectors (see Section 5).

Finally, in this paper we have only proved whether the probability of correct rank selection tends to zero or one. Proposition 9 suggests that when $\lambda_nn^{1/2}$ tends to infinity slowly, then this probability is close to $P(\|\Lambda - \lambda_n^{-1}n^{-1/2}\Theta\|_2\leq 1)$, where $\Theta$ has a normal distribution with known covariance matrix, and that it converges to one exponentially fast when $\|\Lambda\|_2 < 1$. We are currently investigating additional assumptions under which such results hold, and thus estimating the convergence rates of the probability of good rank selection, as done by Zhao and Yu (2006) for the Lasso.

4.4 Factored second-order moment

Note that in the situation where $n_x$ points in $\mathbb{R}^p$ and $n_y$ points in $\mathbb{R}^q$ are sampled i.i.d. and a random subset of $n$ pairs is selected, we can refine the condition as follows (because $\Sigma_{mm} = \Sigma_{yy}\otimes\Sigma_{xx}$):
$$\Lambda = (\mathbf U_\perp^\top\Sigma_{xx}^{-1}\mathbf U_\perp)^{-1}\,\mathbf U_\perp^\top\Sigma_{xx}^{-1}\mathbf U\,\mathbf V^\top\Sigma_{yy}^{-1}\mathbf V_\perp\,(\mathbf V_\perp^\top\Sigma_{yy}^{-1}\mathbf V_\perp)^{-1},$$
which is equal to (by the expression of inverses of partitioned matrices):
$$\Lambda = (\mathbf U_\perp^\top\Sigma_{xx}\mathbf U)(\mathbf U^\top\Sigma_{xx}\mathbf U)^{-1}(\mathbf V^\top\Sigma_{yy}\mathbf V)^{-1}(\mathbf V^\top\Sigma_{yy}\mathbf V_\perp).$$
This also happens when $M_i = x_iy_i^\top$ with $x_i$ and $y_i$ independent for all $i$.
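The quantity $\Lambda$ is straightforward to compute numerically. The following sketch (ours, not code from the paper; all names are made up) evaluates Eq. (6) for a factored $\Sigma_{mm} = \Sigma_{yy}\otimes\Sigma_{xx}$, checks it against the closed-form expression above, and reports $\|\Lambda\|_2$, which conditions (7) and (8) compare to one.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 5, 4, 2

# population quantities: singular subspaces of a rank-r W, and Sigma_mm = Sigma_yy ⊗ Sigma_xx
U0, _ = np.linalg.qr(rng.standard_normal((p, p)))
V0, _ = np.linalg.qr(rng.standard_normal((q, q)))
U, U_perp = U0[:, :r], U0[:, r:]
V, V_perp = V0[:, :r], V0[:, r:]
Ax, Ay = rng.standard_normal((p, p)), rng.standard_normal((q, q))
Sxx, Syy = Ax @ Ax.T + np.eye(p), Ay @ Ay.T + np.eye(q)
Smm_inv = np.kron(np.linalg.inv(Syy), np.linalg.inv(Sxx))   # (Sigma_yy ⊗ Sigma_xx)^{-1}

vec = lambda A: A.reshape(-1, order="F")

# Eq. (6)
Kp = np.kron(V_perp, U_perp)
K = np.kron(V, U)
vec_Lambda = np.linalg.solve(Kp.T @ Smm_inv @ Kp, Kp.T @ Smm_inv @ K @ vec(np.eye(r)))
Lam = vec_Lambda.reshape((p - r, q - r), order="F")

# factored expression of Section 4.4
Lam_factored = (U_perp.T @ Sxx @ U) @ np.linalg.inv(U.T @ Sxx @ U) \
               @ np.linalg.inv(V.T @ Syy @ V) @ (V.T @ Syy @ V_perp)
print(np.allclose(Lam, Lam_factored))                 # True
print("||Lambda||_2 =", np.linalg.norm(Lam, 2))       # strict condition (8): < 1
```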

4.5 Corollaries for the Lasso and group Lasso

For the Lasso or the group Lasso, all the results of Section 4.3 should hold with the additional conditions that $\mathbf W$ and $\Delta$ are diagonal (block-diagonal for the group Lasso). In this situation, the singular values of the diagonal matrix $\mathbf W = \operatorname{Diag}(\mathbf w)$ are the norms of the diagonal blocks, while the left singular vectors are equal to the normalized versions of the blocks (the signs, for the Lasso). However, the results developed in Section 4.3 do not immediately apply, since the assumption regarding the invertibility of the second-order moment matrix is not satisfied. For those problems, all matrices $M$ that are ever considered belong to a strict subspace of $\mathbb{R}^{p\times q}$, and we only need invertibility on that subspace. More precisely, we assume that all matrices $M$ are such that $\operatorname{vec}(M) = Hx$, where $H$ is a given design matrix in $\mathbb{R}^{pq\times s}$, $s$ is the number of implicit parameters, and $x\in\mathbb{R}^s$. If we replace the invertibility of $\Sigma_{mm}$ by the invertibility of $H^\top\Sigma_{mm}H$, then all results presented in Section 4.3 remain valid; in particular, the matrix $\Lambda$ may be written as
$$\operatorname{vec}(\Lambda) = \big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top H(H^\top\Sigma_{mm}H)^{-1}H^\top(\mathbf V_\perp\otimes\mathbf U_\perp)\big]^{\dagger}\big[(\mathbf V_\perp\otimes\mathbf U_\perp)^\top H(H^\top\Sigma_{mm}H)^{-1}H^\top(\mathbf V\otimes\mathbf U)\big]\operatorname{vec}(I), \qquad (9)$$
where $A^{\dagger}$ denotes the pseudo-inverse of $A$ (Golub and Loan, 1996).

We now apply Eq. (9) to the case of the group Lasso (which includes the Lasso as a special case). In this situation, we have $M = \operatorname{Diag}(x_1,\dots,x_m)$ with each $x_j\in\mathbb{R}^{d_j}$, $j=1,\dots,m$; we consider $\mathbf w$ as being defined by blocks $\mathbf w_1,\dots,\mathbf w_m$, where each $\mathbf w_j\in\mathbb{R}^{d_j}$. The design matrix $H$ is such that $Hw = \operatorname{vec}(\operatorname{Diag}(w))$, and the matrix $H^\top\Sigma_{mm}H$ is exactly equal to the joint covariance matrix $\Sigma_{xx}$ of $x = (x_1,\dots,x_m)$. Without loss of generality, we assume that the generating sparsity pattern corresponds to the first $\mathbf r$ blocks. We can then compute the singular value decomposition in closed form as
$$\mathbf U = \begin{pmatrix}\operatorname{Diag}(\mathbf w_i/\|\mathbf w_i\|)_{i\leq\mathbf r}\\ 0\end{pmatrix},\quad \mathbf V = \begin{pmatrix}I\\ 0\end{pmatrix},\quad \mathbf s = (\|\mathbf w_j\|)_{j\leq\mathbf r}.$$
If we denote, for each $j$, by $O_j$ a basis of the subspace orthogonal to $\mathbf w_j$, we have
$$\mathbf U_\perp = \begin{pmatrix}\operatorname{Diag}(O_i)_{i\leq\mathbf r} & 0\\ 0 & I\end{pmatrix},\quad \mathbf V_\perp = \begin{pmatrix}0\\ I\end{pmatrix}.$$
We can put these singular vectors into Eq. (9) and get that the relevant part of $(H^\top\Sigma_{mm}H)^{-1}H^\top(\mathbf V\otimes\mathbf U)\operatorname{vec}(I)$ is $(\Sigma_{xx}^{-1})_{J^cJ}\,\eta_J$, where $J=\{1,\dots,\mathbf r\}$ and $\eta_J$ is the vector of normalized $\mathbf w_j$, $j\in J$. Thus, for the group Lasso, we finally obtain
$$\|\Lambda\|_2 = \Big\|\operatorname{Diag}\!\Big[\big((\Sigma_{xx}^{-1})_{J^cJ^c}\big)^{-1}(\Sigma_{xx}^{-1})_{J^cJ}\,\eta_J\Big]\Big\|_2 = \Big\|\operatorname{Diag}\!\Big[(\Sigma_{xx})_{J^cJ}\big((\Sigma_{xx})_{JJ}\big)^{-1}\eta_J\Big]\Big\|_2 = \max_{i\in J^c}\big\|\Sigma_{x_ix_J}\Sigma_{x_Jx_J}^{-1}\eta_J\big\|,$$
where the second equality follows from the partitioned matrix inversion lemma.

The condition on the invertibility of $H^\top\Sigma_{mm}H$ is exactly the invertibility of the full joint covariance matrix of $x = (x_1,\dots,x_m)$, which is a standard assumption for the Lasso or the group Lasso (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006, Bach, 2007). Moreover, the condition $\|\Lambda\|_2\leq 1$ is exactly the one for the group Lasso (Bach, 2007), where pattern consistency is replaced by consistency of the number of non-zero groups. Note that we only obtain a result in terms of the number of selected groups of variables, and not in terms of the identities of the groups themselves. However, because of regular consistency, we know that at least the $\mathbf r$ true groups will be selected, so that a correct model size is asymptotically equivalent to the correct groups being selected.
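The last expression is easy to evaluate numerically. The sketch below is ours (the function name and the toy covariance are made up for illustration); it computes $\max_{i\in J^c}\|\Sigma_{x_ix_J}\Sigma_{x_Jx_J}^{-1}\eta_J\|$ from a joint covariance matrix, a group structure and the population loadings of the active groups.

```python
import numpy as np

def group_lasso_condition(Sigma_xx, groups, w_blocks, J):
    # ||Lambda||_2 = max_{i not in J} || Sigma_{x_i x_J} Sigma_{x_J x_J}^{-1} eta_J ||,
    # where eta_J stacks the normalized blocks w_j / ||w_j||, j in J
    idx_J = np.concatenate([groups[j] for j in J])
    eta_J = np.concatenate([w_blocks[j] / np.linalg.norm(w_blocks[j]) for j in J])
    v = np.linalg.solve(Sigma_xx[np.ix_(idx_J, idx_J)], eta_J)
    vals = [np.linalg.norm(Sigma_xx[np.ix_(groups[i], idx_J)] @ v)
            for i in range(len(groups)) if i not in J]
    return max(vals)

# toy usage with three groups of sizes 2, 2, 3; the first two groups are active (J = [0, 1])
rng = np.random.default_rng(0)
d = [2, 2, 3]
groups = np.split(np.arange(sum(d)), np.cumsum(d)[:-1])
A = rng.standard_normal((sum(d), sum(d)))
Sigma_xx = A @ A.T + np.eye(sum(d))
w_blocks = [rng.standard_normal(k) for k in d[:2]]
print(group_lasso_condition(Sigma_xx, groups, w_blocks, J=[0, 1]))  # strict condition: < 1
```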

5. Adaptive version

We can follow the adaptive version of the Lasso to provide a consistent algorithm with no consistency conditions such as Eq. (7) or Eq. (8). More precisely, we consider the least-squares estimate $\operatorname{vec}(\hat W_{LS}) = \hat\Sigma_{mm}^{-1}\operatorname{vec}(\hat\Sigma_{Mz})$. We have the following well-known result for least-squares regression:

Lemma 14 Assume (A1), (A2) and (A3). Then $n^{1/2}\big(\hat\Sigma_{mm}^{-1}\operatorname{vec}(\hat\Sigma_{Mz}) - \operatorname{vec}(\mathbf W)\big)$ converges in distribution to a normal distribution with zero mean and covariance matrix $\sigma^2\Sigma_{mm}^{-1}$.

We consider the singular value decomposition $\hat W_{LS} = U_{LS}\operatorname{Diag}(s_{LS})V_{LS}^\top$, where $s_{LS}\geq 0$. With probability tending to one, $\min\{p,q\}$ singular values are strictly positive (i.e., the rank of $\hat W_{LS}$ is full). We consider the full decomposition, where $U_{LS}$ and $V_{LS}$ are orthogonal square matrices and the matrix $\operatorname{Diag}(s_{LS})$ is rectangular. We complete the singular values $s_{LS}\in\mathbb{R}^{\min\{p,q\}}$ by $n^{-1/2}$ to reach dimensions $p$ and $q$ (we keep the same notation for both dimensions for simplicity). For $\gamma\in(0,1]$, we define
$$A = U_{LS}\operatorname{Diag}(s_{LS})^{-\gamma}U_{LS}^\top\in\mathbb{R}^{p\times p} \quad\text{and}\quad B = V_{LS}\operatorname{Diag}(s_{LS})^{-\gamma}V_{LS}^\top\in\mathbb{R}^{q\times q},$$
two positive definite symmetric matrices, and, following the adaptive Lasso of Zou (2006), we consider replacing $\|W\|_*$ by $\|AWB\|_*$—note that in the Lasso special case, this exactly corresponds to the adaptive Lasso of Zou (2006). We obtain the following consistency theorem (proved in Appendix C.10):

Theorem 15 Assume (A1), (A2) and (A3). If $\gamma\in(0,1]$, $n^{1/2}\lambda_n$ tends to 0 and $\lambda_nn^{1/2+\gamma/2}$ tends to infinity, then any global minimizer $\hat W_A$ of
$$\frac{1}{2n}\sum_{i=1}^n\big(z_i - \operatorname{tr} W^\top M_i\big)^2 + \lambda_n\|AWB\|_*$$
is consistent and rank consistent. Moreover, $n^{1/2}\operatorname{vec}(\hat W_A - \mathbf W)$ converges in distribution to a normal distribution with mean zero and covariance matrix
$$\sigma^2(\mathbf V\otimes\mathbf U)\big[(\mathbf V\otimes\mathbf U)^\top\Sigma_{mm}(\mathbf V\otimes\mathbf U)\big]^{-1}(\mathbf V\otimes\mathbf U)^\top.$$

Note the restriction $\gamma\leq 1$, which is due to the fact that the least-squares estimate $\hat W_{LS}$ only estimates the singular subspaces at rate $O_p(n^{-1/2})$. In Section 6.3, we illustrate the previous theorem on synthetic examples. In particular, we exhibit a singular behavior in the limiting case $\gamma = 1$.
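The construction of the adaptive weights $A$ and $B$ can be sketched as follows; this is our own illustration under the notation above, with our own helper name, and the reduction mentioned in the trailing comment is ours as well.

```python
import numpy as np

def adaptive_weights(M, z, gamma=0.5):
    # Build A and B of Section 5 from the least-squares estimate W_LS.
    n, p, q = M.shape
    X = np.stack([Mi.reshape(-1, order="F") for Mi in M])    # rows are vec(M_i)^T
    Sigma = X.T @ X / n
    W_ls = np.linalg.solve(Sigma, X.T @ z / n).reshape((p, q), order="F")
    U, s, Vt = np.linalg.svd(W_ls)                           # full decomposition
    m = min(p, q)
    s_p = np.concatenate([s, np.full(p - m, n ** -0.5)])     # complete by n^{-1/2}
    s_q = np.concatenate([s, np.full(q - m, n ** -0.5)])
    A = (U * s_p ** -gamma) @ U.T                            # U Diag(s)^{-gamma} U^T
    B = (Vt.T * s_q ** -gamma) @ Vt                          # V Diag(s)^{-gamma} V^T
    return A, B

# The adaptive penalty is lambda_n * ||A W B||_*.  One way to reuse a plain trace-norm
# solver (e.g., the proximal sketch given after Proposition 3) is to solve for W' = A W B
# with transformed data M_i' = A^{-1} M_i B^{-1}, since tr(W^T M_i) = tr(W'^T M_i'),
# and then recover W = A^{-1} W' B^{-1}.
```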

6. Algorithms and simulations

In this section we provide a simple algorithm to solve problems of the form
$$\min_{W\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(W)^\top\Sigma\operatorname{vec}(W) - \operatorname{tr} W^\top Q + \lambda\|W\|_*, \qquad (10)$$
where $\Sigma\in\mathbb{R}^{pq\times pq}$ is a positive definite matrix (note that we do not restrict $\Sigma$ to be of the form $\Sigma = A\otimes B$ where $A$ and $B$ are positive semidefinite matrices of sizes $p\times p$ and $q\times q$). We assume that $\operatorname{vec}(Q)$ is in the column space of $\Sigma$, so that the optimization problem is bounded from below (and thus the dual is feasible). In our setting, we have $\Sigma = \hat\Sigma_{mm}$ and $Q = \hat\Sigma_{Mz}$.

We focus on problems where $p$ and $q$ are not too large, so that we can apply Newton's method to obtain convergence up to machine precision, which is required for the fine analysis of rank consistency in Section 6.3. For more efficient algorithms with larger $p$ and $q$, see Srebro et al. (2005), Rennie and Srebro (2005) and Abernethy et al. (2006).

Because the dual norm of the trace norm is the spectral norm (see Appendix B), the dual is easily obtained as
$$\max_{V\in\mathbb{R}^{p\times q},\,\|V\|_2\leq 1} \; -\frac{1}{2}\operatorname{vec}(Q-\lambda V)^\top\Sigma^{-1}\operatorname{vec}(Q-\lambda V). \qquad (11)$$
Indeed, we have:
$$\begin{aligned}
\min_{W\in\mathbb{R}^{p\times q}} \frac{1}{2}\operatorname{vec}(W)^\top\Sigma\operatorname{vec}(W) - \operatorname{tr} W^\top Q + \lambda\|W\|_*
&= \min_{W\in\mathbb{R}^{p\times q}}\ \max_{V\in\mathbb{R}^{p\times q},\,\|V\|_2\leq 1} \frac{1}{2}\operatorname{vec}(W)^\top\Sigma\operatorname{vec}(W) - \operatorname{tr} W^\top Q + \lambda\operatorname{tr} V^\top W\\
&= \max_{V\in\mathbb{R}^{p\times q},\,\|V\|_2\leq 1}\ \min_{W\in\mathbb{R}^{p\times q}} \frac{1}{2}\operatorname{vec}(W)^\top\Sigma\operatorname{vec}(W) - \operatorname{tr} W^\top Q + \lambda\operatorname{tr} V^\top W\\
&= \max_{V\in\mathbb{R}^{p\times q},\,\|V\|_2\leq 1} -\frac{1}{2}\operatorname{vec}(Q-\lambda V)^\top\Sigma^{-1}\operatorname{vec}(Q-\lambda V),
\end{aligned}$$
where strong duality holds because both the primal and dual problems are convex and strictly feasible (Boyd and Vandenberghe, 2003).

6.1 Smoothing

The problem in Eq. (10) is convex but non-differentiable; in this paper we consider adding a strictly convex function to its dual in Eq. (11) in order to make it differentiable, while controlling the increase of the duality gap yielded by the added function (Bonnans et al., 2003). We thus consider the following smoothing of the trace norm, namely we define

$$F_\varepsilon(W) = \max_{V\in\mathbb{R}^{p\times q},\,\|V\|_2\leq 1} \operatorname{tr} V^\top W - \varepsilon B(V),$$
where $B(V)$ is a spectral function (i.e., a function that depends only on the singular values of $V$), equal to $B(V) = \sum_{i=1}^{\min\{p,q\}} b(s_i(V))$, where $b(s) = (1+s)\log(1+s) + (1-s)\log(1-s)$ if $|s|\leq 1$ and $+\infty$ otherwise ($s_i(V)$ denotes the $i$-th largest singular value of $V$). This function $F_\varepsilon$ may be computed in closed form as
$$F_\varepsilon(W) = \sum_{i=1}^{\min\{p,q\}} b^*(s_i(W)),$$
where $b^*(s) = \varepsilon\log(1+e^{s/\varepsilon}) + \varepsilon\log(1+e^{-s/\varepsilon}) - 2\varepsilon\log 2$. These functions are plotted in Figure 1; note that $|b^*(s)-|s||$ is uniformly bounded by $2\varepsilon\log 2$.

Figure 1: Spectral barrier functions: (left) primal function $b(s)$ and (right) dual function $b^*(s)$.
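As an aside (our own sketch, not code from the paper), the closed form above is easy to evaluate and to compare with the exact trace norm; the snippet below also checks the uniform bound on the smoothing error for a few values of $\varepsilon$.

```python
import numpy as np

def b_star(s, eps):
    # b*(s) = eps*log(1+e^{s/eps}) + eps*log(1+e^{-s/eps}) - 2*eps*log(2),
    # written with logaddexp for numerical stability
    return eps * (np.logaddexp(0.0, s / eps) + np.logaddexp(0.0, -s / eps)) - 2 * eps * np.log(2)

def smoothed_trace_norm(W, eps):
    # F_eps(W) = sum_i b*(s_i(W)): a smooth, uniformly close surrogate of ||W||_*
    s = np.linalg.svd(W, compute_uv=False)
    return b_star(s, eps).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
for eps in [1.0, 0.1, 0.01]:
    gap = np.linalg.norm(W, 'nuc') - smoothed_trace_norm(W, eps)
    # the gap lies between 0 and 2*eps*log(2)*min(p, q)
    print(eps, gap, 2 * eps * np.log(2) * min(W.shape))
```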

We finally get the following pairs of primal/dual optimization problems:
$$\min_{W\in\mathbb{R}^{p\times q}} \; \frac{1}{2}\operatorname{vec}(W)^\top\Sigma\operatorname{vec}(W) - \operatorname{tr} W^\top Q + \lambda F_{\varepsilon/\lambda}(W),$$
$$\max_{V\in\mathbb{R}^{p\times q},\,\|V\|_2\leq 1} \; -\frac{1}{2}\operatorname{vec}(Q-\lambda V)^\top\Sigma^{-1}\operatorname{vec}(Q-\lambda V) - \varepsilon B(V).$$

We can now optimize directly in the primal formulation, which is infinitely differentiable, using Newton's method. Note that the stopping criterion should be an $\varepsilon\times\min\{p,q\}$ duality gap, as the controlled smoothing also leads to a small additional gap on the solution of the original non-smoothed problem. More precisely, a duality gap of $\varepsilon\times\min\{p,q\}$ on the smoothed problem leads to a gap of at most $(1+2\log 2)\,\varepsilon\times\min\{p,q\}$ for the original problem.

6.2 Implementation details

Derivatives of spectral functions Derivatives of spectral functions of the form $B(W) = \sum_{i=1}^{\min\{p,q\}} b(s_i(W))$, where $b$ is an even, twice differentiable function such that $b(0) = b'(0) = 0$, are easily calculated as follows. Let $U\operatorname{Diag}(s)V^\top$ be the singular value decomposition of $W$. We then have the following Taylor expansion (Lewis and Sendov, 2002):
$$B(W+\Delta) = B(W) + \operatorname{tr}\Delta^\top U\operatorname{Diag}(b'(s_i))V^\top + \frac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{q}\frac{b'(s_i)-b'(s_j)}{s_i-s_j}\,(u_i^\top\Delta v_j)^2,$$
where the vector of singular values is completed by zeros, and $\frac{b'(s_i)-b'(s_j)}{s_i-s_j}$ is defined as $b''(s_i)$ when $s_i = s_j$.
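For instance (our own sketch, with our own helper names), the first-order term above gives the gradient of a smooth spectral function such as $F_\varepsilon$, which is what Newton's method needs (together with the second-order term):

```python
import numpy as np

def grad_spectral(W, db):
    # gradient of B(W) = sum_i b(s_i(W)) for an even b with b(0) = b'(0) = 0:
    # the first-order term of the expansion above, namely U Diag(b'(s_i)) V^T
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * db(s)) @ Vt

# example: b = b* of the smoothing above, so that B = F_eps and db is its derivative
eps = 0.1
db_star = lambda s: 1.0 / (1.0 + np.exp(-s / eps)) - 1.0 / (1.0 + np.exp(s / eps))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
G = grad_spectral(W, db_star)

# finite-difference check of the gradient
D = rng.standard_normal((4, 6)) * 1e-6
F = lambda A: (eps * (np.logaddexp(0, np.linalg.svd(A, compute_uv=False) / eps)
                      + np.logaddexp(0, -np.linalg.svd(A, compute_uv=False) / eps))
               - 2 * eps * np.log(2)).sum()
print(F(W + D) - F(W), np.sum(G * D))   # the two numbers should agree closely
```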

Choice of ε and computational complexity Following the common practice in barrier methods, we decrease the parameter $\varepsilon$ geometrically after each iteration of Newton's method (Boyd and Vandenberghe, 2003). Each Newton iteration has complexity $O(p^3q^3)$. Empirically, the number of iterations does not exceed a few hundred for solving one problem up to machine precision.¹ We are currently investigating theoretical bounds on the number of iterations through self-concordance theory (Boyd and Vandenberghe, 2003).

Start and end of the path In order to avoid considering useless values of the regularization parameter, and thus to use a well-adapted grid when trying several $\lambda$'s, we can restrict $\lambda$ to a specific interval. When $\lambda$ is large, the solution is exactly zero, while when $\lambda$ is small, the solution tends to $\operatorname{vec}(W) = \Sigma^{-1}\operatorname{vec}(Q)$. More precisely, if $\lambda$ is larger than $\|Q\|_2$, then the solution is exactly zero (because in this situation $0$ is in the subdifferential). On the other side, we consider for which $\lambda$ the point $\Sigma^{-1}\operatorname{vec}(Q)$ leads to a duality gap which is less than $\varepsilon\operatorname{vec}(Q)^\top\Sigma^{-1}\operatorname{vec}(Q)$, where $\varepsilon$ is small. A looser condition is obtained by taking $V = 0$, and the condition becomes $\lambda\|\Sigma^{-1}\operatorname{vec}(Q)\|_*\leq\varepsilon\operatorname{vec}(Q)^\top\Sigma^{-1}\operatorname{vec}(Q)$. Note that this is in the correct order (i.e., the lower bound is smaller than the upper bound), because $\operatorname{vec}(Q)^\top\Sigma^{-1}\operatorname{vec}(Q) = \langle\operatorname{vec}(Q),\Sigma^{-1}\operatorname{vec}(Q)\rangle\leq\|\Sigma^{-1}\operatorname{vec}(Q)\|_*\,\|\operatorname{vec}(Q)\|_2$. This allows us to design a good interval for searching for a good value of $\lambda$, or for computing the regularization path by uniform grid sampling (in log scale), or by numerical path following with predictor-corrector methods such as used by Bach et al. (2004).

6.3 Simulations

In this section, we perform simulations on toy examples to illustrate our consistency results. We generate random i.i.d. data $\tilde X$ and $\tilde Y$ with Gaussian distributions, select a low-rank matrix $\mathbf W$ at random, and generate $Z = \operatorname{diag}(\tilde X^\top\mathbf W\tilde Y) + \varepsilon$, where $\varepsilon$ has i.i.d. components with normal distributions with zero mean and known variance. In this section, we always use $\mathbf r = 2$, $p = q = 4$, while we consider several numbers of samples $n$ and several distributions for which the consistency conditions Eq. (7) and Eq. (8) may or may not be satisfied.²

In Figure 2, we plot regularization paths for $n = 10^3$, by showing the singular values of $\hat W$ compared to the singular values of $\mathbf W$, in two particular situations (Eq. (7) and Eq. (8) satisfied, and not satisfied), for the regular trace norm regularization and the adaptive versions with $\gamma = 1/2$ and $\gamma = 1$. Note that in the consistent case (top), the singular values and their cardinalities are well jointly estimated, both for the non-adaptive version (as predicted by Theorem 12) and the adaptive versions (Theorem 15), while the range of correct rank selection increases with the adaptive versions. However, in the inconsistent case, the non-adaptive regularization scheme (bottom left) cannot achieve regular consistency together with rank consistency (Theorem 13), while the adaptive schemes can. Note the particular behavior of the limiting case $\gamma = 1$, which still achieves both consistencies but with a singular behavior for large $\lambda$.

In Figure 3, we select the distribution used for the rank-consistent case of Figure 2, and compute the paths from 200 replications for $n = 10^2$, $10^3$, $10^4$ and $10^5$. For each $\lambda$, we plot the proportion of estimates with correct rank on the left plots (i.e., we get an estimation of $P(\operatorname{rank}(\hat W) = \operatorname{rank}(\mathbf W))$), while we plot the logarithm of the average root mean squared estimation error $\|\hat W - \mathbf W\|$ on the right plots.

1. MATLAB code can be downloaded from http://www.di.ens.fr/~fbach/tracenorm/
2. Simulations may be reproduced with MATLAB code available from http://www.di.ens.fr/~fbach/tracenorm/
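Referring back to the "start and end of the path" discussion above, the interval of useful regularization parameters can be computed as in the following sketch (ours; the function name and the default choice of $\varepsilon$ are ours):

```python
import numpy as np

def lambda_interval(Sigma, Q, eps=1e-3):
    # Candidate range of regularization parameters for Eq. (10), following the
    # "start and end of the path" discussion.
    p, q = Q.shape
    lam_max = np.linalg.norm(Q, 2)                     # above this, W = 0 is optimal
    w = np.linalg.solve(Sigma, Q.reshape(-1, order="F"))
    W_uncon = w.reshape((p, q), order="F")             # unregularized solution Sigma^{-1} vec(Q)
    # looser lower bound: lambda * ||Sigma^{-1} vec(Q)||_* <= eps * vec(Q)^T Sigma^{-1} vec(Q)
    lam_min = eps * (Q.reshape(-1, order="F") @ w) / np.linalg.norm(W_uncon, 'nuc')
    return lam_min, lam_max

# typical use: a geometric grid between lam_min and lam_max (uniform in log scale)
# lam_min, lam_max = lambda_interval(Sigma_hat, Q_hat)
# grid = np.geomspace(lam_min, lam_max, 50)
```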


Figure 2: Examples of paths of singular values for kΛk2 = 0.49 < 1 (consistent, top) and kΛk2 = 4.78 > 1 (inconsistent, bottom) rank selection: regular trace norm penalization (left) and adaptive penalization with γ = 1/2 (center) and γ = 1 (right). Estimated singular values are plotted in plain, while population singular values are dotted.

For the three regularization schemes, the range of values with high probability of correct rank selection increases as $n$ increases and, most importantly, good mean squared error is achieved (right plots); in particular, for the non-adaptive scheme (top plots), this corroborates the results from Proposition 9, which states that for $\lambda_n = \lambda_0 n^{-1/2}$ the probability tends to a limit in $(0,1)$: indeed, when $n$ increases, the value $\lambda_n$ which achieves a particular limit decays as $n^{-1/2}$, and considering the log-scale for $\lambda_n$ in Figure 3 and the uniform sampling for $n$ in log-scale as well, the regular spacing between the decaying parts observed in Figure 3 is coherent with our results.

In Figure 4, we perform the same operations but with the inconsistent case of Figure 2. For the non-adaptive case (top plots), the range of values of $\lambda$ that achieve high probability of correct rank selection does not increase when $n$ increases and stays bounded, in places where the estimation error is not tending to zero: in the inconsistent case, the trace norm regularization does not manage to solve the trade-off between rank consistency and regular consistency. However, the adaptive versions do, still with a somewhat singular behavior of the limiting case $\gamma = 1$.

Finally, in Figure 5, we consider 400 different distributions with various values of $\|\Lambda\|_2$ smaller or greater than one, and compute the regularization paths with $n = 10^3$ samples. From the paths, we consider the estimate $\hat W$ with correct rank and best distance to $\mathbf W$, and plot this best error versus $\log_{10}(\|\Lambda\|_2)$. For positive values of $\log_{10}(\|\Lambda\|_2)$, the best error is far from zero, and the error grows with the distance to zero; while for negative values, we get low errors, with lower errors for small $\log_{10}(\|\Lambda\|_2)$, corroborating the influence of $\|\Lambda\|_2$ described in Proposition 9.

7. Conclusion

We have presented an analysis of the rank consistency for the penalization by the trace norm, and derived general necessary and sufficient conditions. This work can be extended in several interesting ways: first, by going from the square loss to more general losses, in particular for other types of supervised learning problems such as classification; or by looking at the collaborative filtering setting where only some of the attributes are observed (Abernethy et al., 2006) and dimensions $p$ and $q$ are allowed to grow. Moreover, we are currently pursuing non-asymptotic extensions of the current work, making links with the recent work of Recht et al. (2007) and of Meinshausen and Yu (2006).

Appendix A. Tools for analysis of singular value decomposition

In this appendix, we review and derive precise results regarding singular value decompositions. We consider $W\in\mathbb{R}^{p\times q}$ and we denote by $W = U\operatorname{Diag}(s)V^\top$ its singular value decomposition, with $U\in\mathbb{R}^{p\times r}$ and $V\in\mathbb{R}^{q\times r}$ having orthonormal columns, and $s\in\mathbb{R}^r$ with strictly positive values ($r$ is the rank of $W$). Note that when a singular value $s_i$ is simple, i.e., does not coalesce with any other singular value, the vectors $u_i$ and $v_i$ are uniquely defined up to simultaneous sign flips, i.e., only the matrix $u_iv_i^\top$ is unique. However, when some singular values coalesce, the corresponding singular vectors are defined up to a

Figure 3: Synthetic example where consistency condition in Eq. (8) is satisfied: probability of correct rank selection (left) and logarithm of the expected mean squared estimation error (right), for several number of samples as a function of the regularization parameter, for regular regularization (top), adaptive regularization with γ = 1/2 (center) and γ = 1 (bottom).

Figure 4: Synthetic example where consistency condition in Eq. (7) is not satisfied: probability of correct rank selection (left) and logarithm of the expected mean squared estimation error (right), for several number of samples as a function of the regularization parameter, for regular regularization (top), adaptive regularization with γ = 1/2 (center) and γ = 1 (bottom).

Figure 5: Scatter plots of log10(kΛk2) versus the squared error of the best estimate with correct rank (i.e., such that rank(Ŵ) = r and kŴ − Wk as small as possible). See text for details.

rotation, and thus in general care must be taken and considering isolated singular vectors should be avoided (Stewart and Sun, 1990). All tools presented in this appendix are robust to the particular choice of the singular vectors. A.1 Jordan-Wielandt matrix We use the fact that singular values of   W can be obtained from the eigenvalues of the Jordan0 W ¯ = Wielandt matrix W ∈ R(p+q)×(p+q) (Stewart and Sun, 1990). Indeed this W⊤ 0 matrix has eigenvalues si and −si , i= 1, .. . , r, where  si arethe (strictly positive) singular u ui i values of W , with eigenvectors √12 and √12 where ui , vi are the left and vi −vi right associated singular vectors. Also, the remaining  eigenvalues are all equal to zero, with  u such that for all i ∈ {1, . . . , r}, eigensubspace (of dimension p+q−2r) composed of all v ¯ the eigenvectors of W ¯ corresponding to non zero u⊤ ui = v ⊤ vi = 0. We let denoteU    U U Diag(s) 0 1 1 ¯ ¯ ¯ √ √ eigenvalues in S. We have U = 2 and S = 2 and V −V 0 − Diag(s)     ⊤ 0 0 UV ⊤ ¯ =U ¯ S¯U ¯ ⊤, U ¯U ¯⊤ = UU ¯ sign(S) ¯U ¯⊤ = W . , and U ⊤ ⊤ 0 VV VU 0 A.2 Cauchy residue formula and eigenvalues ¯ , and a simple closed curve C in the complex plane that does not go Given the matrix W ¯ , then through any of the eigenvalues of W ¯)= 1 ΠC (W 2iπ 18

I

C

dλ ¯ λI − W

¯ is equal to the orthogonal projection onto the orthogonal sum of all eigensubspaces of W associated with eigenvalues in the interior of C (Kato, 1966). This is easilyH seen by writing 1 dλ down the eigenvalue decomposition and the Cauchy residue formula ( 2iπ C λ−λi = 1 if λi is in the interior int(C) of C and 0 otherwise), and: I I 2r X X dλ dλ 1 1 ⊤ = ui u⊤ u ¯ u ¯ × = i i . i ¯ 2iπ C λI − W 2iπ C λ − s¯i i=1 i, s¯i ∈int(C)

See Rudin (1987) for an introduction to complex analysis and Cauchy residue formula. ¯ onto a specific eigensubspace as: Moreover, we can obtain the restriction of W I I ¯ dλ 1 λdλ W ¯ ΠC (W ¯)= 1 = − W ¯ ¯ . 2iπ C λI − W 2iπ C λI − W

We let denote s1 and sr the largest and smallest strictly positive singular values of W ; if k∆k2 < sr /2, then W + ∆ has r singular values strictly greater than sr /2 and the remaining ones are strictly less than sr /2 (Stewart and Sun, 1990). Thus, if we denote C the oriented ¯ ) is the projector on the p + q − 2r-dimensional null space of W ¯, circle of radius sr /2, ΠC (W ¯ + ∆) ¯ is also the projector on the p + q − 2rand for any ∆ such that k∆k2 < sr /2, ΠC (W ¯ ¯ dimensional invariant subspace of W + ∆, which corresponds to the smallest eigenvalues. ¯ + ∆) ¯ that projector and Πr (W ¯ + ∆) ¯ = I − Πo (W ¯ + ∆) ¯ the orthogonal We let denote Πo (W projector (which is the projection onto the 2r-th principal subspace). We can now find expansions around ∆ = 0 as follows: I 1 ¯ )−1 ∆(λI ¯ ¯ − ∆) ¯ −1 dλ ¯ ¯ ¯ (λI − W −W Πo (W + ∆) − Πo (W ) = 2iπ C I 1 ¯ )−1 ∆(λI ¯ ¯ )−1 dλ = (λI − W −W 2iπ C I 1 ¯ )−1 ∆(λI ¯ ¯ )−1 ∆(λI ¯ ¯ − ∆) ¯ −1 dλ, (λI − W −W −W + 2iπ C

and

I 1 ¯ ¯ ¯ ¯ ¯ ¯ ¯ )−1 ∆(λI ¯ ¯ − ∆) ¯ −1 dλ (W + ∆)Πo (W + ∆)− W Πo (W ) = − λ(λI − W −W 2iπ C I 1 ¯ )−1 ∆(λI ¯ ¯ )−1 dλ λ(λI − W −W = − 2iπ C I 1 ¯ )−1 ∆(λI ¯ ¯ )−1 ∆(λI ¯ ¯ − ∆) ¯ −1 dλ, λ(λI − W −W −W − 2iπ C which lead to the following two propositions:

Proposition 16 Assume W has rank r and k∆k2 < sr /4 where sr is the smallest positive ¯ ) on the first r eigenvectors of W ¯ is such singular value of W . Then the projection Πr (W that ¯ + ∆) ¯ − Πo (W ¯ )k2 6 4 k∆k2 kΠo (W sr and ¯ 2. ¯ + ∆) ¯ − Πo (W ¯ ) − (I − U ¯U ¯ ⊤ )∆ ¯U ¯ S¯−1 U ¯⊤ − U ¯ S¯−1 U ¯ ⊤ ∆(I ¯ −U ¯U ¯ ⊤ )k2 6 8 k∆k kΠo (W 2 s2r 19

¯ )−1 k2 > 2/sr and k(λI − W ¯ − ∆) ¯ −1 k2 > 4/sr , which Proof For λ ∈ C we have: k(λI − W implies I ¯ + ∆) ¯ − Πr (W ¯ )k2 6 1 ¯ )−1 k2 k∆k2 k(λI − W ¯ − ∆) ¯ −1 k2 kΠr (W k(λI − W 2π C   1 sr 2 4 6 2π . k∆k2 2π 2 sr sr In order to prove the other result, we simply need to compute: I I X 1 1 −1 ¯ −1 ⊤ ⊤ 1 ¯ ¯ (λI − W ) ∆(λI − W ) dλ = dλ u ¯i u ¯i ∆¯ uj u ¯j 2iπ C 2iπ C (λ − s¯i )(λ − s¯j ) i,j   X 1j∈int(C) 1j ∈int(C) 1i∈int(C) / / ⊤ ⊤ 1i∈int(C) = u ¯i u ¯i ∆¯ uj u ¯j + s¯i s¯j i,j

¯U ¯ ⊤ )∆ ¯U ¯ S¯−1 U ¯⊤ + U ¯ S¯−1 U ¯ ⊤ ∆(I ¯ −U ¯U ¯ ⊤ ). = (I − U

Proposition 17 Assume W has rank r and k∆k2 < sr /4 where sr is the smallest positive ¯ ) on the first r eigenvectors of W ¯ is such singular value of W . Then the projection Πr (W that ¯ + ∆)( ¯ W ¯ + ∆) ¯ − Πo (W ¯ )W ¯ k2 6 2k∆k2 kΠo (W and ¯ + ∆)( ¯ W ¯ + ∆) ¯ − Πo (W ¯ )W ¯ + (I − U ¯U ¯ ⊤ )∆(I ¯ −U ¯U ¯ ⊤ )k2 6 kΠo (W

4 ¯ 2 k∆k2 . sr

¯ )−1 k2 > 2/sr and k(λI − W ¯ − ∆) ¯ −1 k2 > 4/sr , which Proof For λ ∈ C we have: k(λI − W implies I ¯ + ∆) ¯ − Πr (W ¯ )k2 6 1 ¯ )−1 k2 k∆k2 k(λI − W ¯ − ∆) ¯ −1 k2 kΠr (W |λ|k(λI − W 2π C   sr sr 2 4 1 2π k∆k2 . 6 2π 2 2 sr sr In order to prove the other result, we simply need to compute: I I X 1 λ ⊤ 1 ¯ )−1 ∆(λI ¯ ¯ )−1 dλ = − − λ(λI − W dλ −W u ¯i u ¯⊤ ∆¯ u u ¯ j j i 2iπ C 2iπ C (λ − s¯i )(λ − s¯j ) i,j X  = − u ¯i u ¯⊤ uj u ¯⊤ i ∆¯ j 1i∈int(C) 1j∈int(C) i,j

¯U ¯ ⊤ )∆(I ¯ −U ¯U ¯ ⊤ ). = −(I − U


¯ ) translates immediately into variations of the singular proThe variations of Π(W ⊤ ⊤ jections U U and V V . Indeed we get that the first order variation of U U ⊤ is −(I − U U ⊤ )∆V S −1 U ⊤ and the variation of V is equal to −(I − V V ⊤ )∆⊤ U S −1 V ⊤ , with errors bounded in spectral norm by s82 k∆k22 . Similarly, when restricted to the small singular valr

ues, the first order expansion is (I − U U ⊤ )∆(I − V V ⊤ ), with error term bounded in spectral norm by s4r k∆k22 . Those results lead to the following proposition that gives a local sufficient condition for rank(W + ∆) > rank(W ): Proposition 18 Assume W has rank r < min{p, q} with ordered singular value decomposition W = U Diag(s)V ⊤ . If s4r k∆k22 < k(I − U U ⊤ )∆(I − V V ⊤ )k2 , then rank(W + ∆) > r.

Appendix B. Some facts about the trace norm In this appendix, we review known properties of the trace norm that we use in this paper. Most of the results are extensions of similar results for the ℓ1 -norm on vectors. First, we have the following result: Lemma 19 (Dual norm, Fazel et al., 2001) The trace norm k · k∗ is a norm and its dual norm is the operator norm k · k. Note that the dual norm N (W ) is defined as (Boyd and Vandenberghe, 2003): N (W ) = sup trW ⊤ V. kV k∗ 61

This immediately implies the following result: Lemma 20 (Fenchel conjugate) We have: +∞ otherwise.

max trW ⊤ V − kW k∗ = 0 if kV k 6 1 and

W ∈Rp×q

In this paper, we need to compute the subdifferential and directional derivatives of the trace norm. We have from Recht et al. (2007) or Borwein and Lewis (2000): Proposition 21 (Subdifferential) If W = U Diag(s)V ⊤ with U ∈ Rp×m and V ∈ Rq×m having orthonormal columns,P and s ∈ Rm is strictly positive, is the singular value decomposition of W , then kW k∗ = m i=1 si and the subdifferential of k · k∗ is equal to n o ∂k · k∗ (W ) = U V ⊤ + M, such that kM k2 6 1, U ⊤ M = 0 and M V = 0 .

This result can be extended to compute directional derivatives:

Proposition 22 (Directional derivative) The directional derivative at W = U SV ⊤ is equal to: kW + ε∆k∗ − kW k∗ lim = trU ⊤ ∆V + kU⊥⊤ ∆V⊥ k∗ , ε ε→0+ where U⊥ ∈ Rp×(p−m) and V⊥ ∈ Rq×(q−m) are any orthonormal complements of U and V . 21

Proof From the subdifferential, we get the directional derivative (Borwein and Lewis, 2000) as lim

ε→0+

kW + ε∆k∗ − kW k∗ = max tr∆⊤ V ε V ∈∂k·k∗ (W )

which exactly leads to the desired result. The final result that we use is a bit finer as it gives an upper bound on the error in the previous limit: Proposition 23 Let W = U Diag(s)V ⊤ the ordered singular value decomposition, where rank(W ) = r, s > 0 and U⊥ and V⊥ be orthogonal complement of U and V ; then, if k∆k2 6 sr /4: s2 kW + ∆k∗ − kW k∗ − trU ⊤ ∆V − kU⊥⊤ ∆V⊥ k∗ 6 16 min{p, q} 31 k∆k22 . sr

Proof The trace norm of kW + ∆k∗ may be divided into the sum of the r largest and the sum of the remaining singular values. The sums of the remaining ones are given through Proposition 17 by kU⊥⊤ ∆V⊥ k∗ with an error bounded by min{p, q} s4r k∆k22 . For the first r singular values, we need to upperbound the second derivative of the sum of the r largest ¯ +∆ ¯ with strictly positive eigengap, which leads to the given bound by eigenvalues of W using the same Cauchy residue technique described in Appendix A.

Appendix C. Proofs C.1 Proof of Lemma 2 We let denote S ∈ {0, 1}nx ×ny the sampling matrix; i.e., Sij = 1 if the pair (i, j) is observed ˜ and Y˜ the data matrices. We can write Mk = and zero otherwise. We let denote X ⊤ ⊤ ˜ ˜ X δik δjk Y and: n

1X vec(Mk ) vec(Mk )⊤ = n k=1

=

n

1X ˜ ˜ ˜ ⊤ vec(δi δj⊤ ) vec(δi δj⊤ )⊤ (Y˜ ⊗ X) (Y ⊗ X) k k k k n k=1

1 ˜ ˜ ⊤ Diag(vec(S))(Y˜ ⊗ X), ˜ (Y ⊗ X) n

ˆ xx = n−1 X ˜ ⊤X ˜ and Σ ˆ yy = n−1 Y˜ ⊤ Y˜ ): which leads to (denoting Σ x x n

1X ˆ yy ⊗ Σ ˆ xx vec(Mk ) vec(Mk )⊤ − Σ n k=1

!

=

1 ˜ ˜ ⊤ Diag(vec(S − n/nx ny ))(Y˜ ⊗ X). ˜ (Y ⊗ X) n


We can thus compute the squared Frobenius norm:

2 n

1 X

⊤ ˆ ˆ vec(M ) vec(M ) − Σ ⊗ Σ

yy xx k k

n

k=1

=

=

F

1 ˜X ˜ ⊤ ) Diag(vec(S − n/nx ny ))(Y˜ Y˜ ⊤ ⊗ X ˜X ˜ ⊤) tr Diag(vec(S − n/nx ny ))(Y˜ Y˜ ⊤ ⊗ X n2 1 X ˜X ˜ ⊤ )ij,i′ j ′ . ˜X ˜ ⊤ )ij,i′ j ′ (Si′ j ′ − n/nx ny )(Y˜ Y˜ ⊤ ⊗ X (Sij − n/nx ny )(Y˜ Y˜ ⊤ ⊗ X n2 ′ ′ i,j,i ,j

We have, by properties of sampling without replacement (Hoeffding, 1963): E(Sij − n/nx ny )(Si′ j ′ − n/nx ny ) = n/nx ny (1 − n/nx ny ) if (i, j) = (i′ , j ′ ), 1 E(Sij − n/nx ny )(Si′ j ′ − n/nx ny ) = −n/nx ny (1 − n/nx ny ) if (i, j) 6= (i′ , j ′ ). nx ny − 1 This implies n

1X ˆ yy ⊗ Σ ˆ xx k2 |X, ˜ Y˜ ) vec(Mk ) vec(Mk )⊤ − Σ E(k F n k=1 1 X ˜ ˜⊤ 1 ˜X ˜ ⊤ )2ij,ij − = (Y Y ⊗ X nx ny n (nx ny − 1)nx ny n i,j

˜X ˜ ⊤ )2 ′ ′ (Y˜ Y˜ ⊤ ⊗ X ij,i j

X

(i,j)6=(i′ ,j ′ )

6

2 X k˜ yj k4 k˜ xi k4 . nx ny n i,j

This finally implies that

2

n

1 X

vec(Mk ) vec(Mk )⊤ − Σyy ⊗ Σxx E

n k=1 F 4X 4 4 2 ˆ xx − Σxx kF EkΣ ˆ yy k2F + 2EkΣ ˆ yy − Σyy k2F kΣxx k2F 6 Ekxk Ekyk + 2EkΣ n i,j

6 CEkxk4 Ekyk4 × (

1 1 1 + + ), n ny nx

for some constant C > 0. This implies (A2). To prove the asymptotic normality in (A3), we use the martingale central limit theorem (Hall and Heyde, 1980) with sequence ˜ Y˜ , ε1 , . . . , εk , (i1 , j1 ), . . . , (ik , jk )) for k 6 n. We consider ∆n,k = of σ-fields Fn,k = σ(X, −1/2 n εik jk yjk ⊗ xik ∈ Rpq as the martingale difference. We have E(∆n,k |Fn,k−1 ) = 0 and −1 2 ⊤ ⊤ E(∆n,k ∆⊤ n,k |Fn,k−1 ) = n σ yjk yjk ⊗ xik xik ,

with E(k∆n,k )k4 ) = O(n−2 ) because of the finite fourth order moments. Moreover, n X

2ˆ E(∆n,k ∆⊤ n,k |Fn,k−1 ) = σ Σmm ,

k=1

and thus tends in probability to σ 2 Σyy ⊗ Σxx because of The assumptions of the P(A2). n martingale central limit theorem are met, we have that k=1 vec(∆n,k ) is asymptotically normal with mean zero and covariance matrix σ 2 Σyy ⊗ Σxx , which concludes the proof. 23

C.2 Proof of Proposition 4 ˆ −1 Σ ˆ We may first restrict minimization over the ball {W, kW k∗ 6 kΣ mm M z k∗ } because the −1 ˆ ˆ optimum value is less than the value for W = Σmm ΣM z . Since this random variable is bounded in probability, we can reduce the problem to a compact set. The sequence of ˆ mm vec(W −W)−trW ⊤ Σ ˆ M ε +λn kW k∗ continuous random functions W 7→ 21 vec(W −W)⊤ Σ 1 ⊤ converges pointwise in probability to W 7→ 2 vec(W − W) Σmm vec(W − W) + λ0 kW k∗ with a unique global minimum (because Σmm is assumed invertible). We can thus apply standard result of consistency in M-estimation (Van der Vaart, 1998, Shao, 2003). C.3 Proof of Proposition 7 ˆ = n1/2 (W ˆ − W) is asymptotically normal with We consider the result of Proposition 6: ∆ −1/2 2 −1 ˆ 2 < mean zero and covariance σ Σmm . By Proposition 18 in Appendix B, if 4nsr k∆k 2 ˆ ⊥ k2 , then rank(W ˆ ) > r. For a random variable Θ with normal distribution kU⊤ ∆V ⊥ 4C −1/2 2 with mean zero and covariance matrix σ 2 Σ−1 mm , we let denote f (C) = P( sr kΘk2 < kU⊤ ⊥ ΘV⊥ k2 ). By the dominated convergence theorem, f (C) converges to one when C → ∞. Let ε > 0, thus there exists C0 > 0 such that f (C0 ) > 1 − ε/2. By the asymptotic normality −1/2 4C ˆ ⊥ k2 ) converges to f (C0 ) thus ∃n0 > 0 such that ∀n > n0 , ˆ 2 < kU⊤ ∆V result, P( 0 k∆k −1/2

4C P( s0r

−1/2

P( 4nsr

sr

2



ˆ ⊥ k2 ) > f (C0 ) − ε/2 > 1 − ε, which concludes the proof, because ˆ 2 < kU⊤ ∆V k∆k 2 ⊥

ˆ ⊥ k2 ) > P( ˆ 2 < kU⊤ ∆V k∆k 2 ⊥

−1/2

4C0 sr

ˆ ⊥ k2 ) as soon as n > C0 . ˆ 2 < kU⊤ ∆V k∆k 2 ⊥

C.4 Proof of Proposition 8 This is the same result as Fu and Knight (2000), but extended to the trace norm minimization, simply using the directional derivative result of Proposition 22 and the epiconvergence ˆ mm vec(∆) − theorem from Geyer (1994, 1996). Indeed, if we denote Vn (∆) = vec(∆)⊤ Σ ⊤ 1/2 1/2 −1/2 ⊤ ˆ tr∆ n ΣM ε + λ0 n (kW + n ∆k  ∗ − kWk∗ ) and V (∆) = vec(∆) Σmm vec(∆) − ∆V k tr∆⊤ A + λ0 trU⊤ ∆V + kU⊤ ⊥ ∗ , then for each ∆, Vn (∆) converges in probability ⊥ to V (∆), and V is strictly convex, which implies that it has an unique global minimum; thus the epi-convergence theorem can be applied, which concludes the proof. ˆ = W+ Note that a simpler analysis using regular tools in M-estimation leads to W ˆ + op (n−1/2 ), where ∆ ˆ is the unique global minimizer of n−1/2 ∆

min

∆∈Rp×q

i h 1 ˆ M ε ) + λ0 trU⊤ ∆V + kU⊤ ∆V⊥ k∗ , vec(∆)⊤ Σmm vec(∆) − tr∆⊤ (n1/2 Σ ⊥ 2

ˆ M ε (which is asymptotically normal with correct moi.e., we can actually take A = n1/2 Σ ments). 24

C.5 Proof of Proposition 9

We denote $\hat\Delta = n^{1/2}(\hat W - \mathbf{W})$. We first show that $\limsup_{n\to\infty} P(\mathrm{rank}(\hat W) = r)$ is smaller than the proposed limit $a$. We consider the following events:
$$E_0 = \{\mathrm{rank}(\hat W) = r\},\qquad E_1 = \{\|n^{-1/2}\hat\Delta\|_2 < s_r/2\},\qquad E_2 = \Big\{\tfrac{4n^{-1/2}}{s_r}\|\hat\Delta\|_2^2 < \|\mathbf{U}_\perp^\top\hat\Delta\mathbf{V}_\perp\|_2\Big\}.$$
By Proposition 18 in Appendix B, we have $E_1\cap E_2\subset E_0^c$, and thus it suffices to show that $P(E_1)$ tends to one, while $\limsup_{n\to\infty} P(E_2^c)\leq a$. The first assertion is a simple consequence of Proposition 8.

Moreover, by Proposition 8, $\hat\Delta$ converges in distribution to the unique global optimum $\Delta(A)$ of an optimization problem parameterized by a vector $A$ with normal distribution. For a given $\eta>0$, we consider the probability $P(\|\mathbf{U}_\perp^\top\Delta(A)\mathbf{V}_\perp\|_2\leq\eta)$. For any $A$, when $\eta$ tends to zero, the indicator function $1_{\|\mathbf{U}_\perp^\top\Delta(A)\mathbf{V}_\perp\|_2\leq\eta}$ converges to $1_{\|\mathbf{U}_\perp^\top\Delta(A)\mathbf{V}_\perp\|_2 = 0}$, which is equal to $1_{\|\Lambda(A)\|_2\leq\lambda_0}$, where
$$\mathrm{vec}(\Lambda(A)) = \Big[(\mathbf{V}_\perp\otimes\mathbf{U}_\perp)^\top\Sigma_{mm}^{-1}(\mathbf{V}_\perp\otimes\mathbf{U}_\perp)\Big]^{-1}(\mathbf{V}_\perp\otimes\mathbf{U}_\perp)^\top\Sigma_{mm}^{-1}\big((\mathbf{V}\otimes\mathbf{U})\,\mathrm{vec}(I) - \mathrm{vec}(A)\big).$$
By the dominated convergence theorem, $P(\|\mathbf{U}_\perp^\top\Delta(A)\mathbf{V}_\perp\|_2\leq\eta)$ converges to $a = P(\|\Lambda(A)\|_2\leq\lambda_0)$, which is the proposed limit. This limit is in $(0,1)$ because the normal distribution has an invertible covariance matrix and the set $\{\|\Lambda\|_2\leq 1\}$ and its complement have non-empty interiors.

Since $\hat\Delta = O_p(1)$, we can instead consider $E_3 = \big\{\tfrac{4n^{-1/2}}{s_r}M^2 < \|\mathbf{U}_\perp^\top\hat\Delta\mathbf{V}_\perp\|_2\big\}$ for a particular $M$, instead of $E_2$. Then, following the same line of argument as in Appendix C.3, we conclude that $\limsup_{n\to\infty}P(E_3^c)\leq a$, which concludes the first part of the proof.

We now show that $\liminf_{n\to\infty}P(\mathrm{rank}(\hat W) = r)\geq a$. A sufficient condition for rank consistency is the following: we denote by $\hat W = USV^\top$ the singular value decomposition of $\hat W$, and by $U_o$ and $V_o$ the singular vectors corresponding to all but the $r$ largest singular values. Since we have simultaneous singular value decompositions, a sufficient condition is that $\mathrm{rank}(\hat W)\geq r$ and $\big\|U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o\big\|_2 < \lambda_n(1-\eta)$. If $\|\Lambda(n^{1/2}\hat\Sigma_{M\varepsilon})\|_2\leq\lambda_0(1-\eta)$, then, by Lemma 11, $\mathbf{U}_\perp^\top\Delta(n^{1/2}\hat\Sigma_{M\varepsilon})\mathbf{V}_\perp = 0$, and we get, using the proof of Proposition 8 and the notation $\hat A = n^{1/2}\hat\Sigma_{M\varepsilon}$:
$$U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o = U_o^\top n^{-1/2}\big(\hat\Sigma_{mm}\Delta(\hat A) - \hat A\big)V_o + o_p(n^{-1/2}).$$
Moreover, because of regular consistency and a positive eigengap for $\mathbf{W}$, the projection onto the first $r$ singular vectors of $\hat W$ converges to the projection onto the first $r$ singular vectors of $\mathbf{W}$ (see Appendix A), which implies that the projection onto the orthogonal complement is also consistent, i.e., $U_oU_o^\top$ converges in probability to $\mathbf{U}_\perp\mathbf{U}_\perp^\top$ and $V_oV_o^\top$ converges in probability to $\mathbf{V}_\perp\mathbf{V}_\perp^\top$. Thus:
$$\big\|U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o\big\|_2 = \big\|U_oU_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_oV_o^\top\big\|_2 = n^{-1/2}\big\|\mathbf{U}_\perp\mathbf{U}_\perp^\top\big(\hat\Sigma_{mm}\Delta(\hat A) - \hat A\big)\mathbf{V}_\perp\mathbf{V}_\perp^\top\big\|_2 + o_p(n^{-1/2}) = n^{-1/2}\|\Lambda(\hat A)\|_2 + o_p(n^{-1/2}).$$
This implies that
$$\liminf_{n\to\infty}P\Big(\big\|U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o\big\|_2 < \lambda_n(1-\eta)\Big) \geq \liminf_{n\to\infty}P\big(\|\Lambda(\hat A)\|_2\leq\lambda_0(1-\eta)\big),$$
which converges to $a$ when $\eta$ tends to zero, which concludes the proof.
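As a purely illustrative aside (not part of the original proof), the limit $a = P(\|\Lambda(A)\|_2\leq\lambda_0)$ can be approximated by Monte Carlo using the closed-form expression for $\mathrm{vec}(\Lambda(A))$ above. The Python sketch below does so for small synthetic stand-ins for $\Sigma_{mm}$, $\mathbf{U}$, $\mathbf{V}$, $\sigma$ and $\lambda_0$; it also assumes that $\mathrm{vec}(A)$ is normal with mean zero and covariance $\sigma^2\Sigma_{mm}$ (the natural Gaussian limit of $n^{1/2}\hat\Sigma_{M\varepsilon}$ in this setting). All names and values in the code are assumptions of the sketch.

```python
# Monte Carlo sketch of a = P(||Lambda(A)||_2 <= lambda_0) on synthetic problem data.
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 4, 3, 1
sigma, lambda0 = 1.0, 0.5                       # made-up noise level and limit of n^(1/2) lambda_n

M = rng.standard_normal((p * q, p * q))
Sigma_mm = M @ M.T + np.eye(p * q)              # synthetic invertible input covariance
Qu = np.linalg.qr(rng.standard_normal((p, p)))[0]
Qv = np.linalg.qr(rng.standard_normal((q, q)))[0]
U, U_perp = Qu[:, :r], Qu[:, r:]
V, V_perp = Qv[:, :r], Qv[:, r:]

Sigma_inv = np.linalg.inv(Sigma_mm)
K_perp = np.kron(V_perp, U_perp)                # (V_perp x U_perp), shape (pq, (p-r)(q-r))
K = np.kron(V, U)
vec_I = np.eye(r).reshape(-1, order="F")        # vec(I_r), column-major as in vec()
G = K_perp.T @ Sigma_inv @ K_perp               # invertible Gram matrix

def Lambda_of(A):
    """Closed-form vec(Lambda(A)) from the proof, reshaped to a (p-r) x (q-r) matrix."""
    rhs = K_perp.T @ Sigma_inv @ (K @ vec_I - A.reshape(-1, order="F"))
    return np.linalg.solve(G, rhs).reshape(p - r, q - r, order="F")

# Draw vec(A) ~ N(0, sigma^2 Sigma_mm) (assumed Gaussian limit of n^(1/2) Sigma_hat_{M eps}).
n_mc = 5000
samples = rng.multivariate_normal(np.zeros(p * q), sigma ** 2 * Sigma_mm, size=n_mc)
hits = sum(np.linalg.norm(Lambda_of(s.reshape(p, q, order="F")), 2) <= lambda0 for s in samples)
print("Monte Carlo estimate of a:", hits / n_mc)
```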

C.6 Proof of Proposition 10

This is the same result as Fu and Knight (2000), but extended to trace norm minimization, simply using the directional derivative result of Proposition 22. If we write $\hat W = \mathbf{W} + \lambda_n\hat\Delta$, then $\hat\Delta$ is defined as the global minimum of
$$V_n(\Delta) = \frac{1}{2}\mathrm{vec}(\Delta)^\top\hat\Sigma_{mm}\,\mathrm{vec}(\Delta) - \lambda_n^{-1}\mathrm{tr}\,\Delta^\top\hat\Sigma_{M\varepsilon} + \lambda_n^{-1}\big(\|\mathbf{W}+\lambda_n\Delta\|_* - \|\mathbf{W}\|_*\big)$$
$$= \frac{1}{2}\mathrm{vec}(\Delta)^\top\Sigma_{mm}\,\mathrm{vec}(\Delta) + O_p(\zeta_n\|\Delta\|_2^2) + O_p(\lambda_n^{-1}n^{-1/2}) + \mathrm{tr}\,\mathbf{U}^\top\Delta\mathbf{V} + \|\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp\|_* + O_p(\lambda_n\|\Delta\|_2^2)$$
$$= V(\Delta) + O_p(\zeta_n\|\Delta\|_2^2) + O_p(\lambda_n^{-1}n^{-1/2}) + O_p(\lambda_n\|\Delta\|_2^2).$$
More precisely, if $M\lambda_n < s_r/2$,
$$\mathbb{E}\sup_{\|\Delta\|_2\leq M}|V_n(\Delta) - V(\Delta)| = \mathrm{cst}\times\Big(M^2\,\mathbb{E}\|\hat\Sigma_{mm}-\Sigma_{mm}\|_F + M\lambda_n^{-1}\,\mathbb{E}\big(\|\hat\Sigma_{M\varepsilon}\|^2\big)^{1/2} + \lambda_n M^2\Big) = O\big(M^2\zeta_n + M\lambda_n^{-1}n^{-1/2} + \lambda_n M^2\big).$$
Moreover, $V(\Delta)$ achieves its minimum at a bounded point $\Delta_0\neq 0$. Thus, by Markov's inequality, the minimum of $V_n(\Delta)$ over the ball $\|\Delta\|_2 < 2\|\Delta_0\|_2$ is, with probability tending to one, strictly inside the ball, and is thus also the unconstrained minimum, which leads to the proposition.

C.7 Proof of Proposition 11

The optimal $\Delta\in\mathbb{R}^{p\times q}$ should be such that $\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp$ has low rank, where $\mathbf{U}_\perp\in\mathbb{R}^{p\times(p-r)}$ and $\mathbf{V}_\perp\in\mathbb{R}^{q\times(q-r)}$ are orthogonal complements of the singular vectors $\mathbf{U}$ and $\mathbf{V}$. We now derive the condition under which the optimal $\Delta$ is such that $\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp$ is actually equal to zero: we consider the minimum of $\frac{1}{2}\mathrm{vec}(\Delta)^\top\Sigma_{mm}\,\mathrm{vec}(\Delta) + \mathrm{vec}(\Delta)^\top\mathrm{vec}(\mathbf{U}\mathbf{V}^\top)$ with respect to $\Delta$ such that $\mathrm{vec}(\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp) = (\mathbf{V}_\perp\otimes\mathbf{U}_\perp)^\top\mathrm{vec}(\Delta) = 0$. The solution of that constrained optimization problem is obtained through the following linear system (Boyd and Vandenberghe, 2003):
$$\begin{pmatrix}\Sigma_{mm} & \mathbf{V}_\perp\otimes\mathbf{U}_\perp\\ (\mathbf{V}_\perp\otimes\mathbf{U}_\perp)^\top & 0\end{pmatrix}\begin{pmatrix}\mathrm{vec}(\Delta)\\ \mathrm{vec}(\Lambda)\end{pmatrix} = \begin{pmatrix}-\mathrm{vec}(\mathbf{U}\mathbf{V}^\top)\\ 0\end{pmatrix},\qquad (12)$$


where $\Lambda\in\mathbb{R}^{(p-r)\times(q-r)}$ is the Lagrange multiplier for the equality constraint. We can solve explicitly for $\Delta$ and $\Lambda$, which leads to
$$\mathrm{vec}(\Lambda) = \Big[(\mathbf{V}_\perp\otimes\mathbf{U}_\perp)^\top\Sigma_{mm}^{-1}(\mathbf{V}_\perp\otimes\mathbf{U}_\perp)\Big]^{-1}(\mathbf{V}_\perp\otimes\mathbf{U}_\perp)^\top\Sigma_{mm}^{-1}(\mathbf{V}\otimes\mathbf{U})\,\mathrm{vec}(I),$$
and
$$\mathrm{vec}(\Delta) = -\Sigma_{mm}^{-1}\,\mathrm{vec}\big(\mathbf{U}\mathbf{V}^\top - \mathbf{U}_\perp\Lambda\mathbf{V}_\perp^\top\big).$$
Then the minimum of the function $F(\Delta)$ in Eq. (5) is such that $\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp = 0$ (and is thus equal to the $\Delta$ defined above) if and only if, for all $\Theta\in\mathbb{R}^{p\times q}$, the directional derivative of $F$ at $\Delta$ in the direction $\Theta$ is nonnegative, i.e.:
$$\lim_{\varepsilon\to 0^+}\frac{F(\Delta+\varepsilon\Theta) - F(\Delta)}{\varepsilon} \geq 0.$$
By Proposition 22, this directional derivative is equal to
$$\mathrm{tr}\,\Theta^\top\big(\Sigma_{mm}\Delta + \mathbf{U}\mathbf{V}^\top\big) + \|\mathbf{U}_\perp^\top\Theta\mathbf{V}_\perp\|_* = \mathrm{tr}\,\Theta^\top\mathbf{U}_\perp\Lambda\mathbf{V}_\perp^\top + \|\mathbf{U}_\perp^\top\Theta\mathbf{V}_\perp\|_* = \mathrm{tr}\,\Lambda^\top\mathbf{U}_\perp^\top\Theta\mathbf{V}_\perp + \|\mathbf{U}_\perp^\top\Theta\mathbf{V}_\perp\|_*.$$
Thus the directional derivative is always nonnegative if and only if, for all $\Theta'\in\mathbb{R}^{(p-r)\times(q-r)}$, $\mathrm{tr}\,\Lambda^\top\Theta' + \|\Theta'\|_* \geq 0$, i.e., if and only if $\|\Lambda\|_2\leq 1$, which concludes the proof.
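As a purely illustrative aside (not part of the original proof), the closed-form expressions for $\mathrm{vec}(\Lambda)$ and $\mathrm{vec}(\Delta)$ above can be evaluated numerically on synthetic stand-ins for $\Sigma_{mm}$, $\mathbf{U}$, $\mathbf{V}$: the Python sketch below checks the constraint $\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp = 0$ and the identity $\Sigma_{mm}\Delta + \mathbf{U}\mathbf{V}^\top = \mathbf{U}_\perp\Lambda\mathbf{V}_\perp^\top$ used in the directional-derivative computation, and reports $\|\Lambda\|_2$, whose comparison with $1$ is the consistency condition. All data in the code are synthetic.

```python
# Sketch: evaluate the closed-form Delta and Lambda on synthetic data and verify the
# two facts used in the proof above, then report the spectral norm of Lambda.
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 4, 3, 1

M = rng.standard_normal((p * q, p * q))
Sigma_mm = M @ M.T + np.eye(p * q)                     # synthetic invertible covariance
Qu = np.linalg.qr(rng.standard_normal((p, p)))[0]
Qv = np.linalg.qr(rng.standard_normal((q, q)))[0]
U, U_perp = Qu[:, :r], Qu[:, r:]
V, V_perp = Qv[:, :r], Qv[:, r:]

Sigma_inv = np.linalg.inv(Sigma_mm)
K_perp = np.kron(V_perp, U_perp)
K = np.kron(V, U)
vec_I = np.eye(r).reshape(-1, order="F")

# Closed-form Lagrange multiplier and minimizer, as in the proof above.
G = K_perp.T @ Sigma_inv @ K_perp
vec_Lambda = np.linalg.solve(G, K_perp.T @ Sigma_inv @ (K @ vec_I))
Lambda = vec_Lambda.reshape(p - r, q - r, order="F")
vec_Delta = -Sigma_inv @ (U @ V.T - U_perp @ Lambda @ V_perp.T).reshape(-1, order="F")
Delta = vec_Delta.reshape(p, q, order="F")

print(np.allclose(U_perp.T @ Delta @ V_perp, 0))                       # constraint: True
Sigma_Delta = (Sigma_mm @ vec_Delta).reshape(p, q, order="F")
print(np.allclose(Sigma_Delta + U @ V.T, U_perp @ Lambda @ V_perp.T))  # stationarity: True
print("||Lambda||_2 =", np.linalg.norm(Lambda, 2), "(consistency condition: <= 1)")
```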

C.8 Proof of Theorem 12

Regular consistency is obtained by Corollary 5. We consider the problem in Eq. (5) of Proposition 10, where $\lambda_n n^{1/2}\to\infty$ and $\lambda_n\to 0$. Since Eq. (8) is satisfied, the solution $\Delta$ indeed satisfies $\mathbf{U}_\perp^\top\Delta\mathbf{V}_\perp = 0$ by Lemma 11.

We have $\hat W = \mathbf{W} + \lambda_n\Delta + o_p(\lambda_n)$, and we now show that the optimality conditions are satisfied with rank $r$. From the regular consistency, the rank of $\hat W$ is, with probability tending to one, at least $r$ (because the rank is a lower semi-continuous function). We now need to show that it is actually equal to $r$. We denote by $\hat W = USV^\top$ the singular value decomposition of $\hat W$, and by $U_o$ and $V_o$ the singular vectors corresponding to all but the $r$ largest singular values. Since we have simultaneous singular value decompositions, we simply need to show that $\big\|U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o\big\|_2 < \lambda_n$ with probability tending to one. We have:
$$U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o = U_o^\top\big(\lambda_n\hat\Sigma_{mm}\Delta + o_p(\lambda_n) - O_p(n^{-1/2})\big)V_o = \lambda_n U_o^\top(\Sigma_{mm}\Delta)V_o + o_p(\lambda_n).$$
Moreover, because of regular consistency and a positive eigengap for $\mathbf{W}$, the projection onto the first $r$ singular vectors of $\hat W$ converges to the projection onto the first $r$ singular vectors of $\mathbf{W}$ (see Appendix A), which implies that the projection onto the orthogonal complement is also consistent, i.e., $U_oU_o^\top$ converges in probability to $\mathbf{U}_\perp\mathbf{U}_\perp^\top$ and $V_oV_o^\top$ converges in probability to $\mathbf{V}_\perp\mathbf{V}_\perp^\top$. Thus:
$$\big\|U_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_o\big\|_2 = \big\|U_oU_o^\top\big(\hat\Sigma_{mm}(\hat W-\mathbf{W}) - \hat\Sigma_{M\varepsilon}\big)V_oV_o^\top\big\|_2 = \lambda_n\big\|\mathbf{U}_\perp\mathbf{U}_\perp^\top(\Sigma_{mm}\Delta)\mathbf{V}_\perp\mathbf{V}_\perp^\top\big\|_2 + o_p(\lambda_n) = \lambda_n\|\Lambda\|_2 + o_p(\lambda_n).$$

This implies that the last expression, divided by $\lambda_n$, is asymptotically of magnitude strictly less than one, which concludes the proof.

C.9 Proof of Theorem 13

We have seen earlier that if $n^{1/2}\lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, then Eq. (7) is necessary for rank consistency. We just have to show that there is a subsequence that does satisfy this. If $\liminf\lambda_n > 0$, then we cannot have consistency (by Proposition 6); thus, by considering a subsequence, we can always assume that $\lambda_n$ tends to zero. We now consider the sequence $n^{1/2}\lambda_n$ and its accumulation points. If zero or a finite strictly positive value is one of them, then by Propositions 7 and 9 we cannot have rank consistency. Thus, by considering a subsequence, we are in the situation where $n^{1/2}\lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, which implies Eq. (7), by the definition of $\Lambda$ in Eq. (6) and Lemma 11.

C.10 Proof of Theorem 15

We denote by $U_{LS}^r$ and $V_{LS}^r$ the first $r$ columns of $U_{LS}$ and $V_{LS}$, and by $U_{LS}^o$ and $V_{LS}^o$ the remaining columns; we also denote by $s_{LS}^r$ the corresponding first $r$ singular values and by $s_{LS}^o$ the remaining singular values. From Lemma 14 and results in the appendix, we get that $\|s_{LS}^r - \mathbf{s}\|_2 = O_p(n^{-1/2})$, $\|s_{LS}^o\|_2 = O_p(n^{-1/2})$, $\|U_{LS}^r(U_{LS}^r)^\top - \mathbf{U}\mathbf{U}^\top\|_2 = O_p(n^{-1/2})$ and $\|V_{LS}^r(V_{LS}^r)^\top - \mathbf{V}\mathbf{V}^\top\|_2 = O_p(n^{-1/2})$. By writing $\hat W_A = \mathbf{W} + n^{-1/2}\hat\Delta_A$, $\hat\Delta_A$ is defined as the minimum of

$$\frac{1}{2}\mathrm{vec}(\Delta)^\top\hat\Sigma_{mm}\,\mathrm{vec}(\Delta) - n^{1/2}\mathrm{tr}\,\Delta^\top\hat\Sigma_{M\varepsilon} + n\lambda_n\big(\|A\mathbf{W}B + n^{-1/2}A\Delta B\|_* - \|A\mathbf{W}B\|_*\big).$$
We have:
$$A\mathbf{U} = U_{LS}\,\mathrm{Diag}(s_{LS})^{-\gamma}U_{LS}^\top\mathbf{U} = U_{LS}^r\,\mathrm{Diag}(s_{LS}^r)^{-\gamma}(U_{LS}^r)^\top\mathbf{U} + U_{LS}^o\,\mathrm{Diag}(s_{LS}^o)^{-\gamma}(U_{LS}^o)^\top\mathbf{U} = \mathbf{U}\,\mathrm{Diag}(\mathbf{s})^{-\gamma} + O_p(n^{-1/2}) + O_p(n^{-1/2}n^{\gamma/2}) = \mathbf{U}\,\mathrm{Diag}(\mathbf{s})^{-\gamma} + O_p(n^{-1/2}n^{\gamma/2}),$$
and
$$A\mathbf{U}_\perp = U_{LS}\,\mathrm{Diag}(s_{LS})^{-\gamma}U_{LS}^\top\mathbf{U}_\perp = U_{LS}^r\,\mathrm{Diag}(s_{LS}^r)^{-\gamma}(U_{LS}^r)^\top\mathbf{U}_\perp + U_{LS}^o\,\mathrm{Diag}(s_{LS}^o)^{-\gamma}(U_{LS}^o)^\top\mathbf{U}_\perp = \mathbf{U}_\perp\,\mathrm{Diag}(s_{LS}^o)^{-\gamma} + O_p(n^{\gamma/2-1/2}) = O_p(n^{\gamma/2}).$$
Similarly, we have $B\mathbf{V} = \mathbf{V}\,\mathrm{Diag}(\mathbf{s})^{-\gamma} + O_p(n^{-1/2}n^{\gamma/2})$ and $B\mathbf{V}_\perp = O_p(n^{\gamma/2})$. We can decompose any $\Delta\in\mathbb{R}^{p\times q}$ as
$$\Delta = (\mathbf{U}\ \mathbf{U}_\perp)\begin{pmatrix}\Delta_{rr} & \Delta_{ro}\\ \Delta_{or} & \Delta_{oo}\end{pmatrix}(\mathbf{V}\ \mathbf{V}_\perp)^\top.$$
We have assumed that $\lambda_n n^{1/2}n^{\gamma/2}$ tends to infinity. Thus,

• if $\mathbf{U}_\perp^\top\Delta = 0$ and $\Delta\mathbf{V}_\perp = 0$ (i.e., if $\Delta$ is of the form $\mathbf{U}\Delta_{rr}\mathbf{V}^\top$), then
$$n\lambda_n\big(\|A\mathbf{W}B + n^{-1/2}A\Delta B\|_* - \|A\mathbf{W}B\|_*\big) \leq \lambda_n n^{1/2}\|A\Delta B\|_* = \lambda_n n^{1/2}\|\mathrm{Diag}(\mathbf{s})^{-\gamma}\Delta_{rr}\,\mathrm{Diag}(\mathbf{s})^{-\gamma}\|_* + O_p(\lambda_n n^{\gamma/2}) = O_p(\lambda_n n^{1/2})$$
tends to zero.

• Otherwise, $n\lambda_n\big(\|A\mathbf{W}B + n^{-1/2}A\Delta B\|_* - \|A\mathbf{W}B\|_*\big)$ is larger than $\lambda_n n^{1/2}\|A\Delta B\|_* - 2\|A\mathbf{W}B\|_*$. The term $\|A\mathbf{W}B\|_*$ is bounded in probability because we can write $A\mathbf{W}B = \mathbf{U}\,\mathrm{Diag}(\mathbf{s})^{1-2\gamma}\mathbf{V}^\top + O_p(n^{-1/2+\gamma/2})$ and $\gamma\leq 1$. Besides, $\lambda_n n^{1/2}\|A\Delta B\|_*$ tends to infinity as soon as any of $\Delta_{or}$, $\Delta_{ro}$ or $\Delta_{oo}$ is different from zero. Indeed, by equivalence of finite-dimensional norms, $\lambda_n n^{1/2}\|A\Delta B\|_*$ is larger than a constant times $\lambda_n n^{1/2}\|A\Delta B\|_F$, which can be decomposed into four pieces along $(\mathbf{U},\mathbf{U}_\perp)$ and $(\mathbf{V},\mathbf{V}_\perp)$, corresponding asymptotically to $\Delta_{oo}$, $\Delta_{or}$, $\Delta_{ro}$ and $\Delta_{rr}$. The smallest of those terms grows faster than $\lambda_n n^{1/2}n^{\gamma/2}$, and thus tends to infinity.

Thus, since $\Sigma_{mm}$ is invertible, by the epi-convergence theorem of Geyer (1994, 1996), $\hat\Delta_A$ converges in distribution to the minimum of
$$\frac{1}{2}\mathrm{vec}(\Delta)^\top\Sigma_{mm}\,\mathrm{vec}(\Delta) - n^{1/2}\mathrm{tr}\,\Delta^\top\hat\Sigma_{M\varepsilon},$$
such that $\mathbf{U}_\perp^\top\Delta = 0$ and $\Delta\mathbf{V}_\perp = 0$. This minimum has a simple asymptotic distribution, namely $\Delta = \mathbf{U}\Theta\mathbf{V}^\top$ where $\Theta$ is asymptotically normal with mean zero and covariance matrix $\sigma^2\big[(\mathbf{V}\otimes\mathbf{U})^\top\Sigma_{mm}(\mathbf{V}\otimes\mathbf{U})\big]^{-1}$, which leads to the consistency and the asymptotic normality.

In order to finish the proof, we consider the optimality conditions, which can be written as: $A\Delta B$ and $A^{-1}\big(\hat\Sigma_{mm}\hat\Delta_A - n^{1/2}\hat\Sigma_{M\varepsilon}\big)B^{-1}$ having simultaneous singular value decompositions with proper decays of singular values, i.e., such that the first $r$ are equal to $\lambda_n n^{1/2}$ and the remaining ones are less than $\lambda_n n^{1/2}$. From the asymptotic normality, we get that $\hat\Sigma_{mm}\hat\Delta_A - n^{1/2}\hat\Sigma_{M\varepsilon}$ is $O_p(1)$; we can thus consider matrices of the form $A^{-1}\Theta B^{-1}$ where $\Theta$ is bounded, in the same way as we considered matrices of the form $A\Delta B$. We have:
$$A^{-1}\mathbf{U} = U_{LS}\,\mathrm{Diag}(s_{LS})^{\gamma}U_{LS}^\top\mathbf{U} = U_{LS}^r\,\mathrm{Diag}(s_{LS}^r)^{\gamma}(U_{LS}^r)^\top\mathbf{U} + U_{LS}^o\,\mathrm{Diag}(s_{LS}^o)^{\gamma}(U_{LS}^o)^\top\mathbf{U} = \mathbf{U}\,\mathrm{Diag}(\mathbf{s})^{\gamma} + O_p(n^{-1/2}),$$
and
$$A^{-1}\mathbf{U}_\perp = U_{LS}\,\mathrm{Diag}(s_{LS})^{\gamma}U_{LS}^\top\mathbf{U}_\perp = U_{LS}^r\,\mathrm{Diag}(s_{LS}^r)^{\gamma}(U_{LS}^r)^\top\mathbf{U}_\perp + U_{LS}^o\,\mathrm{Diag}(s_{LS}^o)^{\gamma}(U_{LS}^o)^\top\mathbf{U}_\perp = O_p(n^{-1/2}) + \mathbf{U}_\perp\,\mathrm{Diag}(s_{LS}^o)^{\gamma},$$
with similar expansions for $B^{-1}\mathbf{V}$ and $B^{-1}\mathbf{V}_\perp$. We obtain the first-order expansion:
$$A^{-1}\Theta B^{-1} = \mathbf{U}\,\mathrm{Diag}(\mathbf{s})^{\gamma}\Theta_{rr}\,\mathrm{Diag}(\mathbf{s})^{\gamma}\mathbf{V}^\top + \mathbf{U}_\perp\,\mathrm{Diag}(s_{LS}^o)^{\gamma}\Theta_{or}\,\mathrm{Diag}(\mathbf{s})^{\gamma}\mathbf{V}^\top + \mathbf{U}\,\mathrm{Diag}(\mathbf{s})^{\gamma}\Theta_{ro}\,\mathrm{Diag}(s_{LS}^o)^{\gamma}\mathbf{V}_\perp^\top + \mathbf{U}_\perp\,\mathrm{Diag}(s_{LS}^o)^{\gamma}\Theta_{oo}\,\mathrm{Diag}(s_{LS}^o)^{\gamma}\mathbf{V}_\perp^\top.$$
Because of the regular consistency, the first term is of the order of $\lambda_n n^{1/2}$ (so that the first $r$ singular values of $\hat W$ are strictly positive), while the three other terms have norms less than $O_p(n^{-\gamma/2})$, which is less than $O_p(n^{1/2}\lambda_n)$ by assumption. This concludes the proof.
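As a purely illustrative aside (not from the paper's experiments), the adaptive weights used above, $A = U_{LS}\,\mathrm{Diag}(s_{LS})^{-\gamma}U_{LS}^\top$ and $B = V_{LS}\,\mathrm{Diag}(s_{LS})^{-\gamma}V_{LS}^\top$, can be formed directly from the unregularized least-squares estimate. The Python sketch below does so on synthetic data in the square case ($p = q$, an assumption made for simplicity so that $A$ and $B$ are invertible) and shows that the adaptive penalty $\|AWB\|_*$ is much larger on directions aligned with the small singular values of $\hat W_{LS}$ than on the low-rank signal itself; all data and parameter values are made up for the example.

```python
# Sketch: adaptive weighting matrices from the least-squares estimate (assuming p = q).
import numpy as np

rng = np.random.default_rng(2)
p = q = 4
gamma, noise = 1.0, 0.05

W_true = np.outer(rng.standard_normal(p), rng.standard_normal(q))   # rank one signal
n = 5000
X = rng.standard_normal((n, p * q))
y = X @ W_true.reshape(-1, order="F") + noise * rng.standard_normal(n)

# Unregularized least-squares estimate W_LS (full rank with probability one).
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
W_ls = w_ls.reshape(p, q, order="F")

U_ls, s_ls, Vt_ls = np.linalg.svd(W_ls)          # full SVD, p = q singular values
A = U_ls @ np.diag(s_ls ** (-gamma)) @ U_ls.T
B = Vt_ls.T @ np.diag(s_ls ** (-gamma)) @ Vt_ls

def adaptive_penalty(W):
    """The adaptive trace norm penalty ||A W B||_* (sum of singular values)."""
    return np.linalg.svd(A @ W @ B, compute_uv=False).sum()

# The low-rank signal direction is only mildly penalized, while a direction aligned
# with the smallest (noise-level) singular value of W_LS is penalized heavily.
print(adaptive_penalty(W_true))
print(adaptive_penalty(U_ls[:, -1:] @ Vt_ls[-1:, :]))
```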

References

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.

Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Adv. NIPS 19, 2007.

F. Bach. Consistency of the group lasso and multiple kernel learning. Technical Report HAL-00164735, HAL, 2007.

F. R. Bach, R. Thibaux, and M. I. Jordan. Computing regularization paths for learning multiple kernels. In Advances in Neural Information Processing Systems 17, 2004.

J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer, 2003.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Number 3 in CMS Books in Mathematics. Springer-Verlag, 2000.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2003.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407, 2004.

M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, pages 4734–4739, 2001.

W. Fu and K. Knight. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356–1378, 2000.

C. J. Geyer. On the asymptotics of constrained M-estimation. Annals of Statistics, 22(4):1993–2010, 1994.

C. J. Geyer. On the asymptotics of convex stochastic optimization. Technical report, School of Statistics, University of Minnesota, 1996.

G. H. Golub and C. F. Van Loan. Matrix Computations. J. Hopkins Univ. Press, 1996.

P. Hall and C. C. Heyde. Martingale Limit Theory and Its Application. Academic Press, 1980.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

T. Kato. Perturbation Theory for Linear Operators. Springer-Verlag, 1966.

A. S. Lewis and H. S. Sendov. Twice differentiable spectral functions. SIAM J. Mat. Anal. App., 23(2):368–386, 2002.

J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, New York, 1998.

N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Technical Report 720, Dpt of Statistics, UC Berkeley, 2006.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Technical Report arXiv:0706.4138v1, arXiv, 2007.

J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. ICML, 2005.

W. Rudin. Real and Complex Analysis, Third edition. McGraw-Hill, Inc., New York, NY, USA, 1987. ISBN 0070542341.

J. Shao. Mathematical Statistics. Springer, 2003.

N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Adv. NIPS 17, 2005.

G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1994.

A. W. Van der Vaart. Asymptotic Statistics. Cambridge Univ. Press, 1998.

M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143–161, 2007.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, December 2006.

[Figure: singular value paths of the estimate as a function of $-\log(\lambda)$, for four settings: consistent, adaptive; consistent, non adaptive; inconsistent, adaptive; inconsistent, non adaptive.]

[Figure: $\log(\mathrm{RMS})$ of the estimate as a function of $-\log(\lambda)$, for $n = 10^2, 10^3, 10^4, 10^5$.]