Global Optimality in Low-rank Matrix Optimization

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B. Wakin∗

Department of Electrical Engineering and Computer Science, Colorado School of Mines

February 28, 2017

arXiv:1702.07945v1 [cs.IT]

∗ Email: {zzhu, qiuli, gtang, mwakin}@mines.edu. This work was supported by NSF grant CCF-1409261, NSF grant CCF-1464205, and NSF CAREER grant CCF-1149225.
Abstract

This paper considers the minimization of a general objective function f(X) over the set of non-square n × m matrices, where the optimal solution X⋆ is low-rank. To reduce the computational burden, we factorize the variable X into a product of two smaller matrices and optimize over these two matrices instead of X. Despite the resulting nonconvexity, recent studies in matrix completion and sensing have shown that the factored problem has no spurious local minima and obeys the so-called strict saddle property (the function has directional negative curvature at all critical points that are not local minima). We analyze the global geometry for a general and yet well-conditioned objective function f(X) whose restricted strong convexity and restricted strong smoothness constants are comparable. In particular, we show that the reformulated objective function has no spurious local minima and obeys the strict saddle property. These geometric properties imply that a number of iterative optimization algorithms (such as gradient descent) can provably solve the factored problem with global convergence.
1 Introduction
Consider the minimization of a general objective function f(X) over all n × m matrices:

minimize_{X ∈ R^{n×m}} f(X),    (1)
which we suppose admits a low-rank solution X⋆ ∈ R^{n×m} with rank(X⋆) = r⋆. Low-rank matrix optimizations of the form (1) appear in a wide variety of applications, including quantum tomography [1, 18], collaborative filtering [17, 33], sensor localization [4], low-rank matrix recovery from compressive measurements [12, 30], and matrix completion [13]. In order to find a low-rank solution, the nuclear norm is widely used in matrix inverse problems [12, 30] arising in machine learning [22], signal processing [16], and control [28]. Although nuclear norm minimization enjoys strong statistical guarantees [13], its computational complexity is very high (as most algorithms require performing an expensive singular value decomposition (SVD) in each iteration), prohibiting it from scaling to practical problems. To relieve the computational bottleneck, recent studies propose to factorize the variable into the Burer-Monteiro type decomposition [6, 7] with X = UV^T, and to optimize over the n × r and m × r (r ≥ r⋆) matrices U and V. With this parameterization of X, we can recast (1) into the following program:

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} h(U, V) := f(UV^T).    (2)
The bilinear nature of the parameterization renders the objective function of (2) nonconvex even when f(X) is a convex function. Hence, the objective function in (2) can potentially have spurious local minima (i.e., local minimizers that are not global minimizers) or “bad” saddle points that prevent a number of iterative algorithms from converging to the global solution. By analyzing the landscape of nonconvex functions, several recent works have shown that with an exact factorization (r = r⋆), the factored objective function h(U, V) in matrix inverse problems has no spurious local minima [3, 20, 29]. We generalize this line of work by focusing on a general objective function f(X) in the optimization (1), not necessarily a quadratic loss function coming from a matrix inverse problem. We provide a geometric analysis for the factored program (2) and show that all the critical points of the objective function are well-behaved. Our characterization of the geometry of the objective function ensures that a number of iterative optimization algorithms converge to a global minimum.
1.1 Summary of Results
The purpose of this paper is to analyze the geometry of the factored problem h(U, V) in (2). In particular, we attempt to understand the behavior of all the critical points of the objective function in the reformulated problem (2). Before presenting our main results, we lay out the necessary assumptions on the objective function f(X). As is known, without any assumptions on the problem, even minimizing a traditional quadratic objective function can be challenging. For this purpose, we focus on the model where f(X) is (2r, 4r)-restricted strongly convex and smooth, i.e., for any n × m matrices X, G with rank(X) ≤ 2r and rank(G) ≤ 4r, the Hessian of f(X) satisfies

α ‖G‖_F² ≤ [∇²f(X)](G, G) ≤ β ‖G‖_F²    (3)
for some positive α and β. A similar assumption is also utilized in [39, Conditions 5.3 and 5.4]. With this assumption on f(X), we summarize our main results in the following informal theorem.

Theorem 1 (informal). Suppose the function f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3) and the program (1) has a low-rank minimizer X⋆ ∈ R^{n×m} with rank(X⋆) = r⋆ ≤ r. Then the factored objective function h(U, V) (with an additional regularizer, see Theorem 3) in (2) has no spurious local minima and obeys the strict saddle property (see Definition 2 in Section 2).

Remark 1. The above result implies that we can recover the rank-r⋆ global minimizer X⋆ of (1) by many iterative algorithms (such as the trust region method [36] and stochastic gradient descent [19]) even from a random initialization. This is because 1) as guaranteed by Theorem 2, the strict saddle property ensures local search algorithms converge to a local minimum, and 2) there are no spurious local minima.

Remark 2. Since our main result only requires the (2r, 4r)-restricted strong convexity and smoothness property (3), aside from low-rank matrix recovery [12], it can also be applied to many other low-rank matrix optimization problems [38] which do not necessarily involve quadratic loss functions. Typical examples include robust PCA [5, 10], 1-bit matrix completion [8, 15], and Poisson principal component analysis (PCA) [31].
1.2 Related Works
Compared with the original program (1), the factored form (2) typically involves many fewer variables (or variables of much smaller size) and can be efficiently solved by simple but powerful methods (such as gradient descent [19, 25], the trust region method [35], and alternating methods [23]) in large-scale settings, even though it is nonconvex. In recent years, tremendous effort has been devoted to analyzing nonconvex optimizations by exploiting the geometry of the corresponding objective functions. These works can be separated into two types based on whether the geometry is analyzed locally or globally. One type of work analyzes the behavior of the objective function in a small neighborhood containing the global optimum and requires a good initialization that is close enough to a global minimum. Problems such as phase retrieval [11], matrix sensing [37], and semi-definite optimization [2] have been studied in this way. Another type of work attempts to analyze the landscape of the objective function and show that it obeys the strict saddle property. If this particular property holds, then simple algorithms such as gradient descent and the trust region method are guaranteed to converge to a local minimum from a random initialization [19, 25, 34] rather than requiring a good initial guess. We approach low-rank matrix optimization with general objective functions (1) via a similar geometric characterization. Similar geometric results
are known for a number of problems including complete dictionary learning [34], phase retrieval [36], orthogonal tensor decomposition [19], and matrix inverse problems [3, 20].

Our work is most closely related to certain recent works in low-rank matrix optimization. Bhojanapalli et al. [3] showed that the low-rank, positive semi-definite (PSD) matrix sensing problem has no spurious local minima and obeys the strict saddle property. Similar results were established for PSD matrix completion [20], PSD matrix factorization [27], and low-rank, PSD matrix optimization problems with generic objective functions [26]. Our work extends this line of analysis to general low-rank matrix (not necessarily PSD or even square) optimization problems. Another closely related work is the low-rank, non-square matrix sensing problem solved via the factorization approach [29]. We note that our general objective function framework includes the low-rank matrix sensing problem as a special case (see Section 3.3). Furthermore, our result covers both over-parameterization, where r > r⋆, and exact parameterization, where r = r⋆. Wang et al. [39] also considered the factored low-rank matrix minimization problem with a general objective function which satisfies the restricted strong convexity and smoothness condition. Their algorithms require good initializations for global convergence since they characterized only the local landscapes around the global optima. By categorizing the behaviors of all the critical points, our work differs from [39] in that we instead characterize the global landscape of the factored objective function.

This paper continues in Section 2 with formal definitions for strict saddles and the strict saddle property. We present the main results and their implications in matrix sensing and weighted low-rank approximation in Section 3. The proof of our main results is given in Section 4. We conclude the paper in Section 5.
2 Preliminaries

2.1 Notation
To begin, we briefly introduce some notation used throughout the paper. The symbols I and 0 respectively represent the identity matrix and zero matrix of appropriate size. The set of r × r orthonormal matrices is denoted by O_r := {R ∈ R^{r×r} : R^T R = I}. We write [A; B] for the vertical concatenation (stacking) of two matrices A and B with the same number of columns. If a function h(U, V) has two arguments, U ∈ R^{n×r} and V ∈ R^{m×r}, we occasionally use the notation h(W) when we stack these two arguments into a new one as W = [U; V]. For a scalar function f(Z) with a matrix variable Z ∈ R^{n×m}, its gradient is an n × m matrix whose (i, j)-th entry is [∇f(Z)]_{ij} = ∂f(Z)/∂Z_{ij} for all i ∈ {1, 2, . . . , n}, j ∈ {1, 2, . . . , m}. The Hessian of f(Z) can be viewed as an nm × nm matrix whose entries are [∇²f(Z)]_{ij} = ∂²f(Z)/(∂z_i ∂z_j) for all i, j ∈ {1, . . . , nm}, where z_i is the i-th entry of the vectorization of Z. An alternative way to represent the Hessian is by a bilinear form defined via [∇²f(Z)](A, B) = Σ_{i,j,k,l} (∂²f(Z)/(∂Z_{ij} ∂Z_{kl})) A_{ij} B_{kl} for any A, B ∈ R^{n×m}. The bilinear form for the Hessian is widely utilized throughout the paper.
2.2 Strict Saddle Property
Suppose h : R^n → R is a twice continuously differentiable objective function. We begin with the notion of strict saddles and the strict saddle property.

Definition 1 (Strict saddles). A critical point x is a strict saddle if the Hessian matrix evaluated at this point has a strictly negative eigenvalue, i.e., λ_min(∇²h(x)) < 0.

Definition 2 (Strict saddle property [19]). A twice differentiable function satisfies the strict saddle property if each critical point either corresponds to a local minimum or is a strict saddle.

Intuitively, the strict saddle property requires a function to have directional negative curvature at all critical points that are not local minima. This property allows a number of iterative algorithms such as noisy gradient descent [19] and the trust region method [14] to further decrease the function value at all the strict saddles and thus converge to a local minimum.

Theorem 2 ([19, 25, 35], informal). For a twice continuously differentiable objective function satisfying the strict saddle property, a number of iterative optimization algorithms (such as gradient descent and the trust region method) can find a local minimum.
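To make Definitions 1 and 2 concrete, the following sketch (ours, purely illustrative and not part of the original analysis) numerically checks the strict saddle condition on a scalar analogue of the factored problem, h(u, v) = (uv − 1)²; the function names, finite-difference step, and test points are our choices.

```python
import numpy as np

def h(x):
    # Scalar analogue of the factored problem: h(u, v) = (u*v - 1)^2,
    # i.e., h(U, V) = f(U V^T) with n = m = r = 1 and f(x) = (x - 1)^2.
    u, v = x
    return (u * v - 1.0) ** 2

def numerical_hessian(fun, x, eps=1e-4):
    """Central-difference approximation of the Hessian of a scalar function at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (fun(x + ei + ej) - fun(x + ei - ej)
                       - fun(x - ei + ej) + fun(x - ei - ej)) / (4 * eps ** 2)
    return H

# (u, v) = (0, 0) is a critical point but not a local minimum.
print(np.linalg.eigvalsh(numerical_hessian(h, [0.0, 0.0])).min())  # approx -2: a strict saddle
# (u, v) = (1, 1) is a global minimum (u*v = 1): no negative curvature appears.
print(np.linalg.eigvalsh(numerical_hessian(h, [1.0, 1.0])).min())  # approx 0
```

At the origin the smallest Hessian eigenvalue is approximately −2, so that critical point is a strict saddle; at (1, 1) no negative curvature is found, consistent with it being a global minimum.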
3 Problem Formulation and Main Results

3.1 Problem Formulation
This paper considers the problem (1) of minimizing a general function f(X) admitting a low-rank solution X⋆ with rank(X⋆) = r⋆ ≤ r. We factorize the variable X = UV^T with U ∈ R^{n×r}, V ∈ R^{m×r} and transform (1) into its factored counterpart (2). Throughout the paper, X, W, and Ŵ are matrices depending on U and V:

W = [U; V],    Ŵ = [U; −V],    X = UV^T.

Although the new variable W has much smaller size than X when r ≪ min{n, m}, the objective function in the factored problem (2) may have a much more complicated landscape due to the bilinear form in U and V. The reformulated objective function h(U, V) could introduce spurious local minima or degenerate saddle points even when f(X) is convex. Our goal is to guarantee that this does not happen.

Let X⋆ = Q_{U⋆} Σ⋆ Q_{V⋆}^T denote an SVD of X⋆, where Q_{U⋆} ∈ R^{n×r} and Q_{V⋆} ∈ R^{m×r} are orthonormal matrices of appropriate sizes, and Σ⋆ ∈ R^{r×r} is a diagonal matrix with non-negative diagonals (but with some zero diagonals if r > r⋆ = rank(X⋆)). We denote

U⋆ = Q_{U⋆} Σ⋆^{1/2},    V⋆ = Q_{V⋆} Σ⋆^{1/2},

where X⋆ = U⋆V⋆^T forms a balanced factorization of X⋆ since U⋆ and V⋆ have the same singular values. Throughout the paper, we utilize the following two ways to stack U⋆ and V⋆ together:

W⋆ = [U⋆; V⋆],    Ŵ⋆ = [U⋆; −V⋆].
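The balanced factors U⋆ and V⋆ above can be formed directly from a thin SVD; the short NumPy sketch below (our illustration, with arbitrary dimensions) verifies that the resulting factorization is exact and balanced in the sense that U⋆ and V⋆ share the same Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r_star = 8, 6, 2                      # illustrative sizes

# A rank-r_star ground truth X_star.
X_star = rng.standard_normal((n, r_star)) @ rng.standard_normal((r_star, m))

# Thin SVD X_star = Q_U diag(s) Q_V^T, truncated to the first r_star components.
Q_U, s, Q_Vt = np.linalg.svd(X_star, full_matrices=False)
Q_U, s, Q_V = Q_U[:, :r_star], s[:r_star], Q_Vt[:r_star, :].T

# Balanced factors U_star = Q_U Sigma^{1/2}, V_star = Q_V Sigma^{1/2}.
U_star = Q_U * np.sqrt(s)
V_star = Q_V * np.sqrt(s)

assert np.allclose(U_star @ V_star.T, X_star)              # exact factorization
assert np.allclose(U_star.T @ U_star, V_star.T @ V_star)   # equal Gram matrices (balance)
```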
Before moving on, we note that for any solution (U, V) to (2), (UΨ, VΦ) is also a solution to (2) for any Ψ, Φ ∈ R^{r×r} such that UΨΦ^TV^T = UV^T. In order to address this ambiguity (i.e., to reduce the search space of W for (2)), we utilize the trick in [29, 37, 39] by introducing a regularizer

g(U, V) = (µ/4) ‖U^TU − V^TV‖_F²    (4)

and solving the following problem

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} ρ(U, V) := f(UV^T) + g(U, V),    (5)

where µ > 0 controls the weight of the term ‖U^TU − V^TV‖_F², which will be discussed soon. We remark that W⋆ is still a global minimizer of the factored problem (5) since f(X) and g(W) achieve their global minima at X⋆ and W⋆, respectively. The regularizer g(W) is applied to force the difference between the two Gram matrices of U and V to be as small as possible. The global minimum of g(W) is 0, which is achieved when U and V have the same Gram matrices, i.e., when W belongs to

E := { W = [U; V] : U^TU − V^TV = 0 }.    (6)

Informally, we can view (5) as finding a point in E that also minimizes f(UV^T). This is formally established in Theorem 3.
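For concreteness, the following sketch (ours, not the authors' code) implements the regularized factored objective ρ(U, V) in (5) for a user-supplied smooth f and runs plain gradient descent from a random initialization; the gradient expressions are the ones derived later in Section 4, while the choices of µ, step size, and iteration count are arbitrary.

```python
import numpy as np

def rho_and_grad(U, V, f_grad, mu):
    """rho(U, V) = f(U V^T) + (mu/4)*||U^T U - V^T V||_F^2 as in (5), and its gradient.
    f_grad(X) must return the pair (f(X), grad f(X))."""
    X = U @ V.T
    fval, G = f_grad(X)
    D = U.T @ U - V.T @ V
    val = fval + 0.25 * mu * np.linalg.norm(D, 'fro') ** 2
    return val, G @ V + mu * U @ D, G.T @ U - mu * V @ D

def factored_gradient_descent(f_grad, n, m, r, mu=0.1, step=0.01, iters=3000, seed=0):
    """Plain gradient descent on (5) from a random initialization (ad hoc settings)."""
    rng = np.random.default_rng(seed)
    U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
    for _ in range(iters):
        _, gU, gV = rho_and_grad(U, V, f_grad, mu)
        U, V = U - step * gU, V - step * gV
    return U, V

# Example: the quadratic loss f(X) = 0.5*||X - X_star||_F^2, a special case of (1).
rng = np.random.default_rng(1)
n, m, r = 10, 8, 2
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
f_grad = lambda X: (0.5 * np.linalg.norm(X - X_star, 'fro') ** 2, X - X_star)
U, V = factored_gradient_descent(f_grad, n, m, r)
print(np.linalg.norm(U @ V.T - X_star, 'fro'))   # small for this toy instance
```

The quadratic loss used in the example is only one admissible f; any function satisfying (3) could be supplied through f_grad.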
3.2 Main Results
Our main argument is that the objective function ρ(W ) has no spurious local minima and satisfies the strict saddle property. This is equivalent to categorizing all the critical points into two types: 1) the global minima which correspond to the global solution of the original convex problem (1) and 2) strict saddles such that the Hessian matrix ∇2 ρ(W ) evaluated at these points has a strictly negative eigenvalue. We formally establish this in the following theorem, whose proof is given in the next section.
Theorem 3. For any µ > 0, each critical point W = [U; V] of ρ(W) defined in (5) satisfies

U^TU − V^TV = 0.    (7)

Furthermore, suppose the function f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3) with positive constants α and β satisfying β/α ≤ 1.5. Set µ ≤ α/16 for the factored problem (5). Then ρ(W) has no spurious local minimum, i.e., any local minimum of ρ(W) is a global minimum corresponding to the global solution of the original convex problem (1): UV^T = X⋆. In addition, ρ(W) obeys the strict saddle property that any critical point not being a local minimum is a strict saddle with

λ_min(∇²ρ(W)) ≤  −0.08 α σ_{r⋆}(X⋆),                       if r = r⋆,
                 −0.05 α · min{σ_{r_c}²(W), 2σ_{r⋆}(X⋆)},   if r > r⋆,
                 −0.1 α σ_{r⋆}(X⋆),                         if r_c = 0,    (8)

where r_c ≤ r is the rank of W, λ_min(·) represents the smallest eigenvalue, and σ_ℓ(·) denotes the ℓ-th largest singular value.
Remark 3. Equation (7) shows that, for any positive µ, any critical point W of the objective function in the factored problem (5) belongs to E. This demonstrates the reason for adding the regularizer g(U, V). Thus, any iterative optimization algorithm converging to some critical point of ρ(W) results in a solution within E. Furthermore, the strict saddle property along with the lack of spurious local minima ensures that a number of iterative optimization algorithms find the global minimum.

Remark 4. For any critical point W ∈ R^{(n+m)×r} that is not a local minimum, the right hand side of (8) is strictly negative, implying W is a strict saddle. We also note that Theorem 3 not only covers exact parameterization where r = r⋆, but also includes over-parameterization where r > r⋆.

Remark 5. The constants appearing in Theorem 3 are not optimized. We use µ ≤ α/16 simply to include µ = 1/16, which is utilized for the matrix sensing problem in [37]. If the ratio β/α between the restricted strong convexity and smoothness constants satisfies β/α ≤ 1.4, then we can show ρ(W) has no spurious local minima and obeys the strict saddle property for any µ ≤ α/4 (where µ = 1/4 is utilized for the matrix sensing problem in [29]). In all cases, a smaller µ yields a more negative constant in (8); see Section 4 for more discussion on this. This implies that when the restricted strong convexity constant α is not provided a priori, one can always choose a small µ to ensure the strict saddle property holds, and hence guarantee the global convergence of many iterative optimization algorithms.

We prove Theorem 3 in Section 4. Before proceeding, we present two stylized applications of Theorem 3 in matrix sensing and weighted low-rank approximation.
3.3 Stylized Applications

3.3.1 Matrix Sensing
We first consider the implication of Theorem 3 in the matrix sensing problem where

f(X) = (1/2) ‖A(X − X⋆)‖₂².
Here A : R^{n×m} → R^p is a known measurement operator satisfying the following restricted isometry property.

Definition 3 (Restricted Isometry Property (RIP) [30]). The map A : R^{n×m} → R^p satisfies the r-RIP with constant δ_r if

(1 − δ_r) ‖X‖_F² ≤ ‖A(X)‖₂² ≤ (1 + δ_r) ‖X‖_F²    (9)

holds for any n × m matrix X with rank(X) ≤ r.
Note that in this case, the Hessian quadrature form [∇²f(X)](Y, Y) for any n × m matrices X and Y is given by [∇²f(X)](Y, Y) = ‖A(Y)‖₂².
If A satisfies the 4r-restricted isometry property with constant δ_{4r}, then f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3) with constants α = 1 − δ_{4r} and β = 1 + δ_{4r}, since (1 − δ_{4r}) ‖Y‖_F² ≤ ‖A(Y)‖₂² ≤ (1 + δ_{4r}) ‖Y‖_F² for any matrix Y of rank at most 4r. Now applying Theorem 3, we can characterize the geometry of the following matrix sensing problem with the factorization approach:

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} (1/2) ‖A(UV^T − X⋆)‖₂² + g(U, V),    (10)

where g(U, V) is the added regularizer defined in (4).
Corollary 1. Suppose A satisfies the 4r-RIP with constant δ_{4r} ≤ 1/5, and set µ ≤ (1 − δ_{4r})/16. Then the objective function in (10) has no spurious local minima and satisfies the strict saddle property.

This result follows directly from Theorem 3 by noting that β/α = (1 + δ_{4r})/(1 − δ_{4r}) ≤ 1.5 if δ_{4r} ≤ 1/5. We remark that Park et al. [29, Theorem 4.3] provided a similar geometric result for (10). Compared to their result, which requires δ_{4r} ≤ 1/100, our result has a much weaker requirement on the RIP of the measurement operator.
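As an illustration of Corollary 1 (our sketch; the sizes, number of measurements, step size, and iteration count are ad hoc choices rather than values dictated by the theory), the code below builds a Gaussian measurement operator, which satisfies the RIP with high probability when p is sufficiently large [30], and minimizes the factored objective (10) by gradient descent from a random start.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, p = 12, 10, 2, 300          # p linear measurements
mu = 1.0 / 16.0

X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# Gaussian measurement operator A(X)_i = <A_i, X>, scaled so that E ||A(X)||^2 = ||X||_F^2.
A_mats = rng.standard_normal((p, n, m)) / np.sqrt(p)
A_op = lambda X: np.tensordot(A_mats, X, axes=([1, 2], [0, 1]))   # R^{n x m} -> R^p
A_adj = lambda y: np.tensordot(y, A_mats, axes=(0, 0))            # adjoint: R^p -> R^{n x m}

y = A_op(X_star)

def objective_and_grad(U, V):
    """Objective (10): 0.5*||A(U V^T) - y||^2 + g(U, V), with its gradient."""
    residual = A_op(U @ V.T) - y
    G = A_adj(residual)                     # gradient of f at X = U V^T
    D = U.T @ U - V.T @ V
    obj = 0.5 * residual @ residual + 0.25 * mu * np.linalg.norm(D, 'fro') ** 2
    return obj, G @ V + mu * U @ D, G.T @ U - mu * V @ D

# Gradient descent from a random start.
U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
for _ in range(5000):
    _, gU, gV = objective_and_grad(U, V)
    U, V = U - 0.01 * gU, V - 0.01 * gV
print(np.linalg.norm(U @ V.T - X_star, 'fro') / np.linalg.norm(X_star, 'fro'))  # small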
3.3.2 Weighted Low-Rank Matrix Factorization
We now consider the implication of Theorem 3 in the weighted matrix factorization problem [32], where

f(X) := (1/2) ‖Ω ◦ (X − X⋆)‖_F².
Here Ω is an n × m weight matrix consisting of positive elements and ◦ denotes the point-wise product between two matrices. In this case, the Hessian quadrature form [∇²f(X)](Y, Y) for any n × m matrices X and Y is given by [∇²f(X)](Y, Y) = ‖Ω ◦ Y‖_F². Thus f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3) with constants α = ‖Ω‖²_min and β = ‖Ω‖²_max, since ‖Ω‖²_min ‖Y‖_F² ≤ ‖Ω ◦ Y‖_F² ≤ ‖Ω‖²_max ‖Y‖_F², where ‖Ω‖_min and ‖Ω‖_max represent the smallest and largest entries in Ω, respectively. Now we consider the following weighted matrix factorization problem:

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} (1/2) ‖Ω ◦ (UV^T − X⋆)‖_F² + g(U, V),    (11)
where g(U, V) is the added regularizer defined in (4). For an arbitrary weight matrix Ω, the weighted low-rank factorization problem can be NP-hard [21] and can have spurious local minima. When the elements in the weight matrix Ω are concentrated, it is expected that (11) can be efficiently solved by a number of iterative optimization algorithms, as it is close to an (unweighted) matrix factorization problem (where Ω is a matrix of ones), which obeys the strict saddle property [27]. The following result characterizes the geometric structure of the objective function in (11) by directly applying Theorem 3.

Corollary 2. Suppose Ω satisfies ‖Ω‖²_max / ‖Ω‖²_min ≤ 1.5. Set µ ≤ ‖Ω‖²_min / 16. Then the objective function in (11) has no spurious local minima and satisfies the strict saddle property.
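A corresponding sketch for the weighted problem (11) (ours; the weight range is chosen only so that the hypothesis of Corollary 2 holds) is given below. The same gradient-descent loop as in the matrix sensing example can be applied to these gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 9, 7, 2

# Positive weights with a mild spread, so that ||Omega||_max^2 / ||Omega||_min^2 <= 1.5
# as required by Corollary 2 (here the ratio is at most 1.44).
Omega = rng.uniform(1.0, 1.2, size=(n, m))
mu = Omega.min() ** 2 / 16.0

X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

def weighted_objective_and_grad(U, V):
    """Objective (11): 0.5*||Omega o (U V^T - X_star)||_F^2 + g(U, V), with its gradient.
    The gradient of f at X is Omega^2 o (X - X_star), where o is the entrywise product."""
    E = U @ V.T - X_star
    G = Omega ** 2 * E
    D = U.T @ U - V.T @ V
    obj = 0.5 * np.sum((Omega * E) ** 2) + 0.25 * mu * np.linalg.norm(D, 'fro') ** 2
    return obj, G @ V + mu * U @ D, G.T @ U - mu * V @ D
```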
4 Proof of Theorem 3
In this section, we provide a formal proof of Theorem 3. The main argument involves showing that each critical point of ρ(W) either corresponds to the global solution of (1) or is a strict saddle whose Hessian ∇²ρ(W) has a strictly negative eigenvalue. Specifically, we show that W is a strict saddle by arguing that the Hessian ∇²ρ(W) has strictly negative curvature along ∆ := W − W⋆R, i.e., [∇²ρ(W)](∆, ∆) ≤ −τ ‖∆‖_F² for some τ > 0. Here R is an r × r orthonormal matrix chosen such that the distance between W and W⋆ rotated through R is as small as possible.
4.1 Supporting Results
We first present some useful results. Because of the restricted strong convexity and smoothness condition (3), the following result establishes that if (1) has an optimal solution X⋆ with rank(X⋆) ≤ r, then it is the unique global minimum of rank at most r.

Proposition 1. Suppose f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3) with positive α and β. Assume X⋆ is a global minimum of f(X) with rank(X⋆) = r⋆ ≤ r. Then there is no other optimum of (1) that has rank less than or equal to r.

Proof of Proposition 1. First note that if X⋆ is the globally optimal solution of the original unconstrained program (1), then ∇f(X⋆) = 0.
Now suppose there exists another global minimum X′ ≠ X⋆ with rank(X′) ≤ r such that f(X⋆) = f(X′). The second order Taylor expansion gives

f(X′) = f(X⋆) + ⟨∇f(X⋆), X′ − X⋆⟩ + (1/2)[∇²f(X̃)](X′ − X⋆, X′ − X⋆),

where X̃ = tX⋆ + (1 − t)X′ for some t ∈ [0, 1]. This Taylor expansion together with f(X⋆) = f(X′) and ∇f(X⋆) = 0 gives

[∇²f(X̃)](X′ − X⋆, X′ − X⋆) = 0,

which contradicts (3) since X′ − X⋆ ≠ 0 and both X̃ and X′ − X⋆ have rank at most 2r.
The (2r, 4r)-restricted strong convexity and smoothness assumption (3) also implies the following isometry property, whose proof is given in Appendix A.
Proposition 2. Suppose the function f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3) with positive α and β. Then for any n × m matrices Z, G, H of rank at most 2r, we have

| (2/(α + β)) [∇²f(Z)](G, H) − ⟨G, H⟩ | ≤ ((β − α)/(β + α)) ‖G‖_F ‖H‖_F.

The following result provides an upper bound on the energy of the difference WW^T − W⋆W⋆^T when projected onto the column space of W. Its proof is given in Appendix B.

Lemma 1. Suppose f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (3). For any critical point W of (5), let P_W ∈ R^{(m+n)×(m+n)} be the orthogonal projector onto the column space of W. Then

‖(WW^T − W⋆W⋆^T) P_W‖_F ≤ 2 ((β − α)/(β + α)) ‖X − X⋆‖_F.

We remark that Lemma 1 is a variant of [29, Lemma 3.2]. While the result there requires the 4r-RIP condition of the objective function, our result depends on the (2r, 4r)-restricted strong convexity and smoothness condition. Our result is also slightly tighter than [29, Lemma 3.2]. In addition, for any matrices C, D ∈ R^{n×r}, the following result relates the distance between CC^T and DD^T to the distance between C and D.

Lemma 2. For any matrices C, D ∈ R^{n×r} with ranks r₁ and r₂, respectively, let R = argmin_{R′ ∈ O_r} ‖C − DR′‖_F. Then

‖CC^T − DD^T‖_F² ≥ max{ 2(√2 − 1) σ_r²(D), min{σ_{r₁}²(C), σ_{r₂}²(D)} } ‖C − DR‖_F².

If C = 0, then we have

‖CC^T − DD^T‖_F² ≥ σ_{r₂}²(D) ‖C − DR‖_F².
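As a numerical sanity check of Lemma 2 (ours, under the reconstruction of the statement above), the sketch below draws random full-rank C and D, computes the optimal rotation R by solving the orthogonal Procrustes problem, and reports the smallest observed ratio between the two sides of the first inequality.

```python
import numpy as np

def procrustes(C, D):
    """R = argmin_{R in O_r} ||C - D R||_F: if D^T C = P S Q^T is an SVD, then R = P Q^T."""
    P, _, Qt = np.linalg.svd(D.T @ C)
    return P @ Qt

rng = np.random.default_rng(0)
n, r = 20, 4
worst_ratio = np.inf
for _ in range(200):
    C, D = rng.standard_normal((n, r)), rng.standard_normal((n, r))   # full rank a.s.
    R = procrustes(C, D)
    lhs = np.linalg.norm(C @ C.T - D @ D.T, 'fro') ** 2
    s_C = np.linalg.svd(C, compute_uv=False)[-1]    # sigma_r(C)
    s_D = np.linalg.svd(D, compute_uv=False)[-1]    # sigma_r(D)
    rhs = (max(2 * (np.sqrt(2) - 1) * s_D ** 2, min(s_C ** 2, s_D ** 2))
           * np.linalg.norm(C - D @ R, 'fro') ** 2)
    worst_ratio = min(worst_ratio, lhs / rhs)
print(worst_ratio)   # should be >= 1 (up to numerical error) if the inequality holds
```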
We present one more useful result in the following Lemma.
Lemma 3 ([26, Lemma 3]). For any matrices C, D ∈ R^{n×r}, let P_C be the orthogonal projector onto the range of C. Let R = argmin_{R′ ∈ O_r} ‖C − DR′‖_F. Then

‖C^T(C − DR)‖_F² ≤ (1/8) ‖CC^T − DD^T‖_F² + (3 + 1/(2(√2 − 1))) ‖(CC^T − DD^T) P_C‖_F².
Finally, we provide the gradient and Hessian expressions for ρ(W). The gradient of ρ(W) is given by

∇_U ρ(U, V) = ∇f(X)V + µU(U^TU − V^TV),
∇_V ρ(U, V) = ∇f(X)^TU − µV(U^TU − V^TV).

Standard computations give the Hessian quadrature form [∇²ρ(W)](∆, ∆) for any ∆ = [∆_U; ∆_V], where ∆_U ∈ R^{n×r} and ∆_V ∈ R^{m×r}:

[∇²ρ(W)](∆, ∆) = [∇²f(X)](∆_UV^T + U∆_V^T, ∆_UV^T + U∆_V^T) + 2⟨∇f(X), ∆_U∆_V^T⟩ + [∇²g(W)](∆, ∆),

where

[∇²g(W)](∆, ∆) = µ⟨Ŵ^TW, ∆̂^T∆⟩ + µ⟨Ŵ∆̂^T, ∆W^T⟩ + µ⟨ŴŴ^T, ∆∆^T⟩,

with ∆̂ = [∆_U; −∆_V].
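The displayed Hessian quadrature form can be verified numerically by comparing it with a second-order finite difference of t ↦ ρ(W + t∆); the sketch below (ours) does so for the quadratic loss, for which [∇²f(X)](A, B) = ⟨A, B⟩, with an arbitrary µ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, mu = 7, 5, 2, 0.1
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# A concrete smooth f (quadratic loss) just to exercise the formulas; any smooth f would do.
f      = lambda X: 0.5 * np.linalg.norm(X - X_star, 'fro') ** 2
grad_f = lambda X: X - X_star
hess_f = lambda X, A, B: np.sum(A * B)        # for this f, [grad^2 f(X)](A, B) = <A, B>

def rho(U, V):
    return f(U @ V.T) + 0.25 * mu * np.linalg.norm(U.T @ U - V.T @ V, 'fro') ** 2

U, V   = rng.standard_normal((n, r)), rng.standard_normal((m, r))
dU, dV = rng.standard_normal((n, r)), rng.standard_normal((m, r))
X      = U @ V.T
W, W_h = np.vstack([U, V]), np.vstack([U, -V])            # W and its "hat" version
Delta, Delta_h = np.vstack([dU, dV]), np.vstack([dU, -dV])

# Hessian quadrature form of rho along Delta, assembled from the displayed expressions.
quad = (hess_f(X, dU @ V.T + U @ dV.T, dU @ V.T + U @ dV.T)
        + 2 * np.sum(grad_f(X) * (dU @ dV.T))
        + mu * (np.sum((W_h.T @ W) * (Delta_h.T @ Delta))
                + np.sum((W_h @ Delta_h.T) * (Delta @ W.T))
                + np.sum((W_h @ W_h.T) * (Delta @ Delta.T))))

# Compare against a second-order finite difference of t -> rho(W + t*Delta).
t = 1e-4
fd = (rho(U + t * dU, V + t * dV) - 2 * rho(U, V) + rho(U - t * dU, V - t * dV)) / t ** 2
print(abs(quad - fd))   # close to zero (differs only by finite-difference error)
```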
4.2 The Formal Proof
Any critical point W of ρ(W) satisfies ∇ρ(W) = 0, i.e.,

∇f(X)V + µU(U^TU − V^TV) = 0,    (12)
∇f(X)^TU − µV(U^TU − V^TV) = 0.    (13)

By (13), we obtain

U^T∇f(X) = µ(U^TU − V^TV)V^T.

Left-multiplying (12) by U^T and plugging in the expression for U^T∇f(X) from the above equation gives

(U^TU − V^TV)V^TV + U^TU(U^TU − V^TV) = 0,

which further implies U^TU U^TU = V^TV V^TV. Note that U^TU and V^TV are the principal square roots (i.e., PSD square roots) of U^TU U^TU and V^TV V^TV, respectively. Utilizing the result that a PSD matrix has a unique principal square root [24], we obtain

U^TU = V^TV.    (14)

Thus, we can simplify (12) and (13) to

∇_U ρ(U, V) = ∇f(X)V = 0,    (15)
∇_V ρ(U, V) = ∇f(X)^TU = 0.    (16)
Now we turn to prove the strict saddle property and that there are no spurious local minima. First, note that as guaranteed by Proposition 1, X⋆ is the unique global minimizer of (1) among n × m matrices with rank at most r. Also, the gradient of f(X) vanishes at X⋆ since (1) is an unconstrained optimization problem. Denote the set of critical points of ρ(W) by

C := { W ∈ R^{(n+m)×r} : ∇ρ(W) = 0 }.
We separate C into two subsets:

C₁ := C ∩ { W ∈ R^{(n+m)×r} : UV^T = X⋆ },
C₂ := C ∩ { W ∈ R^{(n+m)×r} : UV^T ≠ X⋆ },

satisfying C = C₁ ∪ C₂. Since any critical point W satisfies (14), g(W) achieves its global minimum at W. Also f(X) achieves its global minimum at X⋆. We conclude that W is a globally optimal solution of ρ for any W ∈ C₁. If we show that any W ∈ C₂ is a strict saddle, then we prove that there are no spurious local minima as well as the strict saddle property. Thus, the remaining part is to show that C₂ is a set of strict saddles.

To show that C₂ is a set of strict saddles, it is sufficient to find a direction ∆ along which the Hessian has a strictly negative curvature for each of these points. We construct ∆ = W − W⋆R, the difference from W to its nearest global factor W⋆R, where

R = argmin_{R′ ∈ O_r} ‖W − W⋆R′‖_F.

Such a ∆ satisfies ∆ ≠ 0, since X ≠ X⋆ implies WW^T ≠ W⋆W⋆^T. Then we evaluate the Hessian bilinear form along the direction ∆:

[∇²ρ(W)](∆, ∆) = 2⟨∇f(X), ∆_U∆_V^T⟩ + [∇²f(X)](∆_UV^T + U∆_V^T, ∆_UV^T + U∆_V^T) + µ⟨Ŵ∆̂^T, ∆W^T⟩ + µ⟨ŴŴ^T, ∆∆^T⟩,    (17)

where Π₁ := ⟨∇f(X), ∆_U∆_V^T⟩, Π₂ := [∇²f(X)](∆_UV^T + U∆_V^T, ∆_UV^T + U∆_V^T), Π₃ := ⟨Ŵ∆̂^T, ∆W^T⟩, and Π₄ := ⟨ŴŴ^T, ∆∆^T⟩ denote the four terms above (without the factors 2 and µ); the remaining term µ⟨Ŵ^TW, ∆̂^T∆⟩ in [∇²g(W)](∆, ∆) vanishes since Ŵ^TW = U^TU − V^TV = 0 by (7).
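The rotation R and the direction ∆ used in this argument can be computed in closed form by solving an orthogonal Procrustes problem; a small sketch (ours, with arbitrary dimensions) follows.

```python
import numpy as np

def nearest_rotation(W, W_star):
    """R = argmin_{R in O_r} ||W - W_star R||_F (orthogonal Procrustes):
    if W_star^T W = P S Q^T is an SVD, then R = P Q^T."""
    P, _, Qt = np.linalg.svd(W_star.T @ W)
    return P @ Qt

rng = np.random.default_rng(0)
n_plus_m, r = 15, 3
W_star = rng.standard_normal((n_plus_m, r))
W = rng.standard_normal((n_plus_m, r))

R = nearest_rotation(W, W_star)
Delta = W - W_star @ R          # the direction along which negative curvature is exhibited

# Sanity check: R is orthogonal and is no worse than a few random rotations.
assert np.allclose(R.T @ R, np.eye(r))
best = np.linalg.norm(W - W_star @ R, 'fro')
for _ in range(20):
    Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
    assert best <= np.linalg.norm(W - W_star @ Q, 'fro') + 1e-9
```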
The following result (which is proved in Appendix D) states that Π₁ is strictly negative, while the remaining terms are relatively small, though they may be nonnegative:

Π₁ ≤ −α ‖X − X⋆‖_F²,    Π₂ ≤ β ‖W∆^T‖_F²,
Π₃ ≤ ‖W∆^T‖_F²,         Π₄ ≤ 2 ‖X − X⋆‖_F².    (18)
Now substituting (18) into (17) gives

[∇²ρ(W)](∆, ∆) = 2Π₁ + Π₂ + µΠ₃ + µΠ₄
  ≤ −2α ‖X − X⋆‖_F² + (β + µ) ‖W∆^T‖_F² + 2µ ‖X − X⋆‖_F²
  (i)≤ (−2α + 2µ) ‖X − X⋆‖_F² + (β + µ)(1/8) ‖WW^T − W⋆W⋆^T‖_F² + (β + µ)(12 + 2/(√2 − 1)) ((β − α)/(β + α))² ‖X − X⋆‖_F²
  (ii)≤ [ −2α + 2µ + (β + µ)( 1/2 + (12 + 2/(√2 − 1)) ((β − α)/(β + α))² ) ] ‖X − X⋆‖_F²
  (iii)≤ −0.2α ‖X − X⋆‖_F²,    (19)
where (i) utilizes Lemmas 1 and 3, (ii) utilizes the following inequality (which is proved in Appendix E)

‖WW^T − W⋆W⋆^T‖_F² ≤ 4 ‖X − X⋆‖_F²,    (20)

and (iii) holds because β/α ≤ 1.5 and µ ≤ α/16. Thus, if X ≠ X⋆, then [∇²ρ(W)](∆, ∆) is always negative. This implies that W is a strict saddle. To complete the proof, we utilize Lemma 2 to further bound the last term in (19):
[∇²ρ(W)](∆, ∆) ≤ −0.05α ‖WW^T − W⋆W⋆^T‖_F²
  ≤ −0.05α ‖∆‖_F² ·  2(√2 − 1) σ_{r⋆}²(W⋆),          if r = r⋆,
                     min{σ_{r_c}²(W), σ_{r⋆}²(W⋆)},   if r > r⋆,
                     σ_{r⋆}²(W⋆),                     if r_c = 0,
where r_c is the rank of W, the first inequality utilizes (20), and the second inequality follows from Lemma 2. We complete the proof of Theorem 3 by noting that σ_ℓ²(W⋆) = 2σ_ℓ(X⋆) for all ℓ ∈ {1, . . . , r⋆} since

W⋆ = [Q_{U⋆}Σ⋆^{1/2}; Q_{V⋆}Σ⋆^{1/2}] = [Q_{U⋆}/√2; Q_{V⋆}/√2] (√2 Σ⋆^{1/2}) I

is an SVD of W⋆, where we recall that X⋆ = Q_{U⋆}Σ⋆Q_{V⋆}^T is an SVD of X⋆.

Remark 6. From (19), we observe that a smaller µ yields a more negative bound on [∇²ρ(W)](∆, ∆). This can be explained intuitively as follows. First note that any critical point W satisfies (14) provided µ > 0, no matter how large or small µ is. The Hessian information about g(W) is represented by the terms Π₃ and Π₄. We have

Π₃ + Π₄ = ⟨Ŵ∆̂^T, ∆W^T⟩ + ⟨ŴŴ^T, ∆∆^T⟩
        = ⟨Ŵ^T∆, ∆^TŴ⟩ + ⟨Ŵ^T∆, Ŵ^T∆⟩
        = ⟨Ŵ^T∆, Ŵ^T∆ + ∆^TŴ⟩
        ≥ 0,

where the last line holds since for any r × r matrix A,

⟨A, A + A^T⟩ = (1/2)⟨A + A^T, A + A^T⟩ + (1/2)⟨A − A^T, A + A^T⟩ = (1/2)‖A + A^T‖_F² ≥ 0.
Thus the Hessian of g evaluated at any critical point W is a PSD matrix¹ instead of having a negative eigenvalue. In low-rank, PSD matrix optimization problems, the corresponding objective function (without any regularizer such as g(W)) is proved to have the strict saddle property [3, 26]. Therefore, h(W) is also expected to have the strict saddle property, and so is ρ(W) when µ is small, i.e., the Hessian of g(W) has little influence on the Hessian of ρ(W) when µ is small. Our results also indicate that when the restricted strong convexity constant α is not provided a priori, we can always choose a small µ to ensure the strict saddle property of ρ(W) is met, and hence we are guaranteed the global convergence of a number of local search algorithms applied to (5).

¹ This can also be observed since any critical point W is a global minimum point of g(W), which directly indicates that ∇²g(W) ⪰ 0.
5 Conclusion
This paper considers low-rank matrix optimization with general objective functions. To reduce the computational complexity, a matrix factorization technique has been utilized. Although the resulting optimization problem is nonconvex, we show that the reformulated objective function has a simple landscape: there are no spurious local minima, and any critical point that is not a local minimum is a strict saddle, i.e., the Hessian evaluated at this point has a strictly negative eigenvalue. These properties guarantee that a number of iterative optimization algorithms (such as gradient descent and the trust region method) will converge to the global optimum from a random initialization.
A Proof of Proposition 2
This proof follows similar steps to the proof of [9, Lemma 2.1]. First note that the bilinear form [∇²f(Z)](G, H) = Σ_{i,j,k,l} (∂²f(Z)/(∂Z_{ij}∂Z_{kl})) G_{ij} H_{kl} implies that [∇²f(Z)](G, H) scales with both G and H, i.e., [∇²f(Z)](aG, bH) = ab[∇²f(Z)](G, H) for any a, b ∈ R. If either G or H is zero, the claimed bound holds trivially since both sides are 0.
Now suppose both G and H are nonzero. By the scaling property of both sides, we may assume ‖G‖_F = ‖H‖_F = 1 without loss of generality. Note that the (2r, 4r)-restricted strong convexity and smoothness condition (3) implies

α ‖G − H‖_F² ≤ [∇²f(Z)](G − H, G − H) ≤ β ‖G − H‖_F²,
α ‖G + H‖_F² ≤ [∇²f(Z)](G + H, G + H) ≤ β ‖G + H‖_F².

Thus we have

−((β − α)/2)(‖G‖_F² + ‖H‖_F²) ≤ 2[∇²f(Z)](G, H) − (α + β)⟨G, H⟩ ≤ ((β − α)/2)(‖G‖_F² + ‖H‖_F²),

which further implies

|2[∇²f(Z)](G, H) − (α + β)⟨G, H⟩| ≤ β − α = (β − α) ‖G‖_F ‖H‖_F.

Dividing both sides by α + β gives the claimed bound.
B Proof of Lemma 1
First recall the notation X = UV^T, X⋆ = U⋆V⋆^T, and

W = [U; V],    Ŵ = [U; −V],    W⋆ = [U⋆; V⋆],    Ŵ⋆ = [U⋆; −V⋆].

It follows from (15) and (16) that any critical point W satisfies

[0, ∇f(X); ∇f(X)^T, 0] W = 0,

which gives

0 = ⟨[0, ∇f(X); ∇f(X)^T, 0], ZW^T⟩
  = ⟨[0, ∇f(X) − ∇f(X⋆); (∇f(X) − ∇f(X⋆))^T, 0], ZW^T⟩
  = ⟨∇f(X) − ∇f(X⋆) − ((α + β)/2)(X − X⋆), Z_UV^T + UZ_V^T⟩ + ((α + β)/2)⟨X − X⋆, Z_UV^T + UZ_V^T⟩    (21)

for any Z = [Z_U; Z_V] ∈ R^{(n+m)×r}; we denote the first term on the right hand side by k₁ and write k₂ := ⟨X − X⋆, Z_UV^T + UZ_V^T⟩ for the inner product in the second term. Here the second line utilizes the fact ∇f(X⋆) = 0. We bound k₁ by first using the integral form of the mean value theorem for ∇f(X):

k₁ = ∫₀¹ [∇²f(tX + (1 − t)X⋆)](X − X⋆, Z_UV^T + UZ_V^T) dt − ((α + β)/2)⟨X − X⋆, Z_UV^T + UZ_V^T⟩.

Noting that all three matrices tX + (1 − t)X⋆, X − X⋆, and Z_UV^T + UZ_V^T have rank at most 2r, it follows from Proposition 2 that

|k₁| ≤ ∫₀¹ | [∇²f(tX + (1 − t)X⋆)](X − X⋆, Z_UV^T + UZ_V^T) − ((α + β)/2)⟨X − X⋆, Z_UV^T + UZ_V^T⟩ | dt
     ≤ ((β − α)/2) ‖X − X⋆‖_F ‖Z_UV^T + UZ_V^T‖_F,

which when plugged into (21) gives

((α + β)/2) k₂ = −k₁ ≤ ((β − α)/2) ‖X − X⋆‖_F ‖Z_UV^T + UZ_V^T‖_F.    (22)
Now let Z = (WW^T − W⋆W⋆^T)(W^T)†, which gives ZW^T = (WW^T − W⋆W⋆^T)P_W. Here † denotes the pseudoinverse of a matrix and P_W is the orthogonal projector onto the range of W. Utilizing the fact Ŵ^TW = 0 from (7), we further connect the left hand side of (22) with ‖(WW^T − W⋆W⋆^T)P_W‖_F² by

((α + β)/2) k₂ = ((α + β)/2) k₂ + ((α + β)/4)⟨ŴŴ^T, ZW^T⟩
  = ((α + β)/4) ⟨[UU^T, X − 2X⋆; X^T − 2X⋆^T, VV^T], (WW^T − W⋆W⋆^T)P_W⟩
  = ((α + β)/4) ⟨WW^T − W⋆W⋆^T, (WW^T − W⋆W⋆^T)P_W⟩ + ((α + β)/4) ⟨Ŵ⋆Ŵ⋆^T, (WW^T − W⋆W⋆^T)P_W⟩    (23)
  ≥ ((α + β)/4) ⟨WW^T − W⋆W⋆^T, (WW^T − W⋆W⋆^T)P_W⟩
  = ((α + β)/4) ‖(WW^T − W⋆W⋆^T)P_W‖_F²,

where the inequality follows because ⟨Ŵ⋆Ŵ⋆^T, W⋆W⋆^TP_W⟩ = 0 (noting that Ŵ⋆^TW⋆ = 0) and ⟨Ŵ⋆Ŵ⋆^T, WW^TP_W⟩ = ⟨Ŵ⋆Ŵ⋆^T, WW^T⟩ ≥ 0 since it is the inner product between two PSD matrices. On the other hand, we give an upper bound on the right hand side of (22):

‖X − X⋆‖_F ‖Z_UV^T + UZ_V^T‖_F ≤ ‖X − X⋆‖_F √(2‖Z_UV^T‖_F² + 2‖UZ_V^T‖_F²)
  ≤ ‖X − X⋆‖_F ‖(WW^T − W⋆W⋆^T)P_W‖_F,

where the last line follows because ‖Z_UV^T‖_F² + ‖Z_VU^T‖_F² = ‖Z_UU^T‖_F² + ‖Z_VV^T‖_F² (since U^TU = V^TV), implying 2‖Z_UV^T‖_F² + 2‖UZ_V^T‖_F² = ‖ZW^T‖_F². This together with (22) and (23) completes the proof.
C Proof of Lemma 2
When C ≠ 0, the proof follows directly from the following results.
Lemma 4 ([26, Lemma 2]). For any matrices C, D ∈ R^{n×r} with ranks r₁ and r₂, respectively, let R = argmin_{R̃ ∈ O_r} ‖C − DR̃‖_F. Then

‖CC^T − DD^T‖_F ≥ min{σ_{r₁}(C), σ_{r₂}(D)} · ‖C − DR‖_F.

Lemma 5 ([37, Lemma 5.4]). For any matrices C, D ∈ R^{n×r} with rank(D) = r, let R = argmin_{R̃ ∈ O_r} ‖C − DR̃‖_F. Then

‖CC^T − DD^T‖_F² ≥ 2(√2 − 1) σ_r²(D) ‖C − DR‖_F².

If C = 0, then we have

‖CC^T − DD^T‖_F² = ‖DD^T‖_F² = Σ_{i=1}^{r₂} σ_i⁴(D) ≥ σ_{r₂}²(D) Σ_{i=1}^{r₂} σ_i²(D) = σ_{r₂}²(D) ‖D‖_F² = σ_{r₂}²(D) ‖C − DR‖_F².
D Proof of (18)
Bounding the term Π₁: Utilizing the fact that ∆_U = U − U⋆R and ∆_V = V − V⋆R, we have

Π₁ = ⟨∇f(X), ∆_U∆_V^T⟩
   = ⟨∇f(X), (U − U⋆R)(V − V⋆R)^T⟩
   = ⟨∇f(X), X + X⋆ − U⋆RV^T − UR^TV⋆^T⟩
  (i)= −⟨∇f(X), X − X⋆⟩
 (ii)= −⟨∇f(X) − ∇f(X⋆), X − X⋆⟩
(iii)≤ −α ‖X − X⋆‖_F²,

where (i) follows from (15) and (16), (ii) utilizes ∇f(X⋆) = 0, and (iii) follows by using the (2r, 4r)-restricted strong convexity property (3):

⟨∇f(X) − ∇f(X⋆), X − X⋆⟩ = ∫₀¹ [∇²f(tX + (1 − t)X⋆)](X − X⋆, X − X⋆) dt
                           ≥ ∫₀¹ α ⟨X − X⋆, X − X⋆⟩ dt
                           = α ‖X − X⋆‖_F²,

where the first line follows from the integral form of the mean value theorem for vector-valued functions, and the second line uses the fact that both tX + (1 − t)X⋆ and X − X⋆ have rank at most 2r, together with the (2r, 4r)-restricted strong convexity of the Hessian ∇²f(·).

Bounding the term Π₂: By the smoothness condition (3), we have

Π₂ = [∇²f(X)](∆_UV^T + U∆_V^T, ∆_UV^T + U∆_V^T)
   ≤ β ‖∆_UV^T + U∆_V^T‖_F²
   ≤ 2β (‖∆_UV^T‖_F² + ‖U∆_V^T‖_F²)
   = β ‖W∆^T‖_F²,

where the last line holds because ‖DU^T‖_F = ‖DV^T‖_F for any D ∈ R^{p×r} with arbitrary p ≥ 1, since any critical point W satisfies U^TU = V^TV.
Bounding the term Π₃:

Π₃ = ⟨U∆_U^T, ∆_UU^T⟩ + ⟨V∆_V^T, ∆_VV^T⟩ − 2⟨U∆_V^T, ∆_UV^T⟩
   ≤ ‖U∆_U^T‖_F² + ‖V∆_V^T‖_F² + ‖U∆_V^T‖_F² + ‖V∆_U^T‖_F²
   = ‖W∆^T‖_F².
Bounding the term Π₄:

Π₄ = ⟨ŴŴ^T, (W − W⋆R)(W − W⋆R)^T⟩
  (i)= −⟨ŴŴ^T, WW^T − W⋆W⋆^T⟩
   ≤ −⟨ŴŴ^T, WW^T − W⋆W⋆^T⟩ + ⟨Ŵ⋆Ŵ⋆^T, WW^T − W⋆W⋆^T⟩
   = −⟨ŴŴ^T − Ŵ⋆Ŵ⋆^T, WW^T − W⋆W⋆^T⟩
 (ii)≤ 2 ‖X − X⋆‖_F²,

where (i) holds because Ŵ^TW = 0, and (ii) follows because Ŵ⋆^TW⋆ = 0 and ⟨Ŵ⋆Ŵ⋆^T, WW^T⟩ ≥ 0 since it is the inner product between two PSD matrices.
E Proof of (20)
To show (20), expanding the left hand side of (20), it is equivalent to show

‖UU^T − U⋆U⋆^T‖_F² + ‖VV^T − V⋆V⋆^T‖_F² ≤ 2 ‖X − X⋆‖_F².

Expanding both sides of the above equation and utilizing the facts U^TU = V^TV and U⋆^TU⋆ = V⋆^TV⋆, it remains to show

trace(UU^TU⋆U⋆^T + VV^TV⋆V⋆^T) ≥ 2 trace(UV^TV⋆U⋆^T).

Thus, we obtain (20) by noting that the above equation is equivalent to

‖U⋆^TU − V⋆^TV‖_F² ≥ 0.
References

[1] Scott Aaronson. The learnability of quantum states. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 463, pages 3089–3114. The Royal Society, 2007.

[2] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint, 2015.

[3] Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221, 2016.

[4] Pratik Biswas and Yinyu Ye. Semidefinite programming for ad hoc wireless sensor network localization. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pages 46–54. ACM, 2004.

[5] Thierry Bouwmans, Necdet Serhat Aybat, and El-hadi Zahzah. Handbook of Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing. CRC Press, 2016.

[6] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[7] Samuel Burer and Renato DC Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.

[8] Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14(1):3619–3647, 2013.

[9] Emmanuel J Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
[10] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[11] Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

[12] Emmanuel J Candès and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

[13] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[14] Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.

[15] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, 3(3):189–223, 2014.

[16] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.

[17] Dennis DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proceedings of the 23rd International Conference on Machine Learning, pages 249–256. ACM, 2006.

[18] Steven T Flammia, David Gross, Yi-Kai Liu, and Jens Eisert. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012.

[19] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.

[20] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. arXiv preprint arXiv:1605.07272, 2016.

[21] Nicolas Gillis and François Glineur. Low-rank matrix approximation with weights or missing data is NP-hard. SIAM Journal on Matrix Analysis and Applications, 32(4):1149–1165, 2011.

[22] Zaid Harchaoui, Matthijs Douze, Mattis Paulin, Miroslav Dudik, and Jérôme Malick. Large-scale image classification with trace-norm regularization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3386–3393. IEEE, 2012.

[23] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.

[24] Charles R Johnson, Kazuyoshi Okubo, and Robert Reams. Uniqueness of matrix square roots and an application. Linear Algebra and Its Applications, 323(1):51–60, 2001.

[25] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. University of California, Berkeley, 1050:16, 2016.

[26] Qiuwei Li and Gongguo Tang. The nonconvex geometry of low-rank matrix optimizations with general objective functions. arXiv preprint arXiv:1611.03060, 2016.

[27] Xingguo Li, Zhaoran Wang, Junwei Lu, Raman Arora, Jarvis Haupt, Han Liu, and Tuo Zhao. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296, 2016.

[28] Karthik Mohan and Maryam Fazel. Reweighted nuclear norm minimization with application to system identification. In Proceedings of the 2010 American Control Conference, pages 2953–2959. IEEE, 2010.

[29] Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. arXiv preprint arXiv:1609.03240, 2016.

[30] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[31] Joseph Salmon, Zachary Harmany, Charles-Alban Deledalle, and Rebecca Willett. Poisson noise reduction with non-local PCA. Journal of Mathematical Imaging and Vision, 48(2):279–294, 2014.

[32] Nathan Srebro, Tommi Jaakkola, et al. Weighted low-rank approximations. In ICML, volume 3, pages 720–727, 2003.

[33] Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2004.

[34] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. arXiv preprint arXiv:1511.04777, 2015.

[35] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.

[36] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664, 2016.

[37] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.

[38] Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd. Generalized low rank models. arXiv preprint arXiv:1410.0342, 2014.

[39] Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. arXiv preprint arXiv:1610.05275, 2016.