arXiv:1608.03487v1 [math.OC] 11 Aug 2016
A Richer Theory of Convex Constrained Optimization with Reduced Projections and Improved Rates

Tianbao Yang†, Qihang Lin‡, Lijun Zhang♮
[email protected]  [email protected]  [email protected]
† Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA
‡ Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA
♮ National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Abstract

This paper focuses on convex constrained optimization problems, where the solution is subject to a convex inequality constraint. In particular, we aim at challenging problems for which both projection into the constrained domain and linear optimization under the inequality constraint are time-consuming, which renders both projected gradient methods and conditional gradient methods (a.k.a. the Frank-Wolfe algorithm) expensive. In this paper, we develop projection-reduced optimization algorithms for both smooth and non-smooth optimization with improved convergence rates. We first present a general theory of optimization with only one projection. Its application to smooth optimization with only one projection yields an $O(1/\epsilon)$ iteration complexity, which can be further reduced under strong convexity and improves over the $O(1/\epsilon^2)$ iteration complexity established before for non-smooth optimization. Then we introduce the local error bound condition and develop faster convergent algorithms for non-strongly convex optimization at the price of a logarithmic number of projections. In particular, we achieve a convergence rate of $\widetilde{O}(1/\epsilon^{2(1-\theta)})$ for non-smooth optimization and $\widetilde{O}(1/\epsilon^{1-\theta})$ for smooth optimization, where $\theta \in (0, 1]$ is a constant in the local error bound condition. An experiment on solving the constrained $\ell_1$ minimization problem in compressive sensing demonstrates that the proposed algorithms achieve significant speed-ups.
1. Introduction

In this paper, we aim at solving the following convex constrained optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x), \quad \text{s.t.} \quad c(x) \le 0 \qquad (1)$$
where $f(x)$ is a smooth or non-smooth convex function and $c(x)$ is a convex function. The problem finds applications in machine learning, signal processing, statistics, marketing optimization, etc. For example, in distance metric learning one needs to learn a positive semi-definite matrix such that similar examples are close to each other and dissimilar examples are far from each other (Weinberger et al., 2006; Xing et al., 2003), where the positive semi-definite (PSD) constraint can be cast into a convex inequality constraint. Another example, arising in compressive sensing, is to minimize the $\ell_1$ norm of a high-dimensional vector subject to a measurement constraint (Candès and Wakin, 2008). Although general
interior-point methods can be applied to solve the problem with linear convergence, they suffer from exceedingly high computational cost per iteration. Another solution is to employ the projected gradient (PG) method (Nesterov, 2004a) or the conditional gradient (CG) method (Frank and Wolfe, 1956), where the PG method needs to compute the projection into the constrained domain at each iteration and CG needs to solve a linear optimization problem under the constraint. However, for many constraints (e.g., PSD, a quadratic constraint) both projection into the constrained domain and linear optimization under the constraint are time-consuming, which restricts the applicability of these methods. Recently, a new direction has emerged towards addressing the challenge of expensive projections, namely to reduce the number of projections. In the seminal paper (Mahdavi et al., 2012), the authors proposed two algorithms with only one projection at the end of the iterations, for non-smooth convex and strongly convex optimization, respectively. The idea of both algorithms is to move the constraint function into the objective function and to control the violation of the constraint for intermediate solutions. While their algorithms enjoy an optimal convergence rate for non-smooth optimization (i.e., $O(1/\epsilon^2)$ iteration complexity) and a close-to-optimal convergence rate for strongly convex optimization (i.e., $\widetilde{O}(1/\epsilon)$),¹ there is still a lack of theory and algorithms with reduced projections for smooth convex optimization. In this paper, we bridge the gap by developing a general theory of optimization with only one projection, which gives an iteration complexity of $O(1/\epsilon)$ for smooth optimization with only one projection. Beyond that, we make non-trivial improvements on the convergence rates for both non-smooth and smooth non-strongly convex optimization at the price of a logarithmic number of projections. In particular, we show that under a mild local error bound condition, the iteration complexities can be improved to $\widetilde{O}(1/\epsilon^{2(1-\theta)})$ for non-smooth optimization and $\widetilde{O}(1/\epsilon^{1-\theta})$ for smooth optimization, where $\theta \in (0, 1]$ is a constant in the local error bound condition. To our knowledge, these are the best convergence rates with only a logarithmic number of projections for non-strongly convex optimization.
2. Related Work

The issue of high projection cost in projected gradient descent has received increasing attention in recent years. Most works are based on the Frank-Wolfe technique, which eschews the projection in favor of a linear optimization over the constrained domain (Jaggi, 2013; Hazan and Kale, 2012; Lacoste-Julien et al., 2013; Garber and Hazan, 2015). It happens that for many bounded domains (e.g., bounded balls for vectors and matrices, a PSD constraint with a bounded trace norm) the linear optimization over the constrained domain is much cheaper than projection into the constrained domain (Jaggi, 2013). However, there still exist many constraints that render both projection into the constrained domain and linear optimization under the constraint comparably expensive. Examples include polyhedral constraints, quadratic constraints, and a PSD constraint. To tackle these complex constraints, the idea of optimization with a reduced number of projections was proposed. In (Mahdavi et al., 2012), the authors developed two algorithms for solving non-smooth problems as in (1), one for non-strongly convex optimization and another for strongly convex optimization, which achieve (almost) optimal convergence rates

1. where $\widetilde{O}(\cdot)$ suppresses a logarithmic factor.
for the two settings in the worst case. However, that work does not address smooth optimization problems. In a recent work (Zhang et al., 2013), the authors studied smooth and strongly convex optimization; they proposed an algorithm with $O(\kappa\log(T))$ projections and proved an $O(1/T)$ convergence rate, where $\kappa$ is the condition number and $T$ is the total number of iterations. Nonetheless, if the condition number is high, the number of projections could be very large. It remains an open problem to develop optimization algorithms with only one projection for smooth problems. Additionally, if trading a logarithmic number of projections for improved convergence rates is attractive, can we achieve better convergence? In this paper, we provide affirmative answers to these questions. It is worth mentioning that although the present work focuses on deterministic optimization instead of stochastic optimization as considered in (Mahdavi et al., 2012; Zhang et al., 2013), the general theory of optimization with only one projection developed here can be easily extended to stochastic optimization by combining with recent advances in stochastic smooth optimization for finite-sum problems (Johnson and Zhang, 2013; Defazio et al., 2014; Zhu, 2016). Another major contribution of this paper is to develop improved convergence for optimization with reduced projections under the local error bound condition. Several recent works also exploit different forms of error bound conditions to improve convergence (Wang and Lin, 2014; So, 2013; Hou et al., 2013; Zhou et al., 2015; Yang and Lin, 2016). Most notably, the technique used in our work is closely related to (Yang and Lin, 2016). However, for constrained optimization problems the methods in (Yang and Lin, 2016) still need to conduct projections at each iteration.
3. Preliminaries

We present some preliminaries in this section. Let $\Omega = \{x \in \mathbb{R}^d : c(x) \le 0\}$ denote the constrained domain, $\Omega_*$ denote the optimal solution set, and $f_*$ denote the optimal objective value. We denote by $\nabla f(x)$ the gradient and by $\partial f(x)$ the subgradient of a smooth or non-smooth function, respectively. When $f(x)$ is a non-smooth function, we consider the problem as non-smooth constrained optimization. When both $f(x)$ and $c(x)$ are smooth, we consider the problem as smooth constrained optimization. A function $f(x)$ is $L$-smooth if it has a Lipschitz continuous gradient, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, where $\|\cdot\|$ denotes the Euclidean norm. A function $f(x)$ is $\mu$-strongly convex if it satisfies
$$f(x) \ge f(y) + \partial f(y)^\top(x - y) + \frac{\mu}{2}\|x - y\|^2$$
In the sequel, $\mathrm{dist}(x, \Omega)$ denotes the distance of $x$ to a set $\Omega$, i.e., $\mathrm{dist}(x, \Omega) = \min_{u\in\Omega}\|x - u\|$. We make the following assumptions to facilitate the development of our algorithms and theory.

Assumption 1 For the convex minimization problem (1), we assume (i) there exist $x_0 \in \Omega$ and $\epsilon_0 \ge 0$ such that $f(x_0) - \min_{x\in\Omega} f(x) \le \epsilon_0$; (ii) $\Omega_*$ is a non-empty convex compact set;
(iii) there exists a positive value $\rho > 0$ such that
$$\inf_{c(x)=0,\ v\in\partial c(x),\ v\neq 0} \|v\| \ge \rho \qquad (2)$$
or, more generally, there exists a constant $\rho > 0$ such that for any $x \in \mathbb{R}^d$, $x^\natural = \arg\min_{u\in\mathbb{R}^d, c(u)\le 0}\|u - x\|^2$ satisfies
$$\|x^\natural - x\| \le \frac{1}{\rho}[c(x)]_+ \qquad (3)$$
where $[s]_+$ is the hinge operator defined as $[s]_+ = s$ if $s \ge 0$ and $[s]_+ = 0$ if $s < 0$; (iv) there exists a strictly feasible solution such that $c(x) < 0$; (v) both $f(x)$ and $c(x)$ are defined everywhere and are Lipschitz continuous, with their Lipschitz constants denoted by $G$ and $G_c$, respectively.

Remark: We make several remarks about the assumptions. Assumptions (i) and (ii) are only exploited in developing improved convergence rates in Section 5, where assumption (i) supposes there is a lower bound of the optimal value $f_*$. The inequality in (2) is also assumed in (Mahdavi et al., 2012); it ensures that the distance of the final solution before projection to the constrained domain $\Omega$ is not too large. Note that the inequality in (3) is a more general condition than (2), as revealed shortly. We present more discussions about Assumption (iii) in subsection 3.1 and exhibit the value of $\rho$ for a number of commonly encountered constraints. The strict feasibility assumption (iv) allows us to explore the KKT conditions of the projection problem shown below. Traditional projected gradient descent methods need to solve the following projection problem at each iteration:
$$\Pi_\Omega[x] = \arg\min_{u\in\mathbb{R}^d,\ c(u)\le 0} \|u - x\|^2$$
Conditional gradient methods (a.k.a. the Frank-Wolfe technique) need to solve the following linear optimization at each iteration: $\min_{u\in\mathbb{R}^d, c(u)\le 0} u^\top\nabla f(x)$. For many constraint functions (see the examples given below), solving the projection problem and the linear optimization could be very expensive.

3.1. Discussion of the condition on the constraint function

We first show that the inequality in (3) is weaker than (2), since inequality (2) implies inequality (3), as stated in the following lemma.

Lemma 1 For any $x \in \mathbb{R}^d$, let $x^\natural = \arg\min_{c(u)\le 0}\|u - x\|^2$. If (2) holds, then (3) holds.

Below, we demonstrate that several constraints commonly encountered in practice satisfy Assumption (iii). Specifically, we discuss three types of constraints: polyhedral constraints, a quadratic constraint, and a PSD constraint.
Polyhedral constraints. First, we show that when $c(x)$ is a polyhedral function, i.e., its epigraph is a polyhedron, the inequality in (3) is satisfied. To this end, we explore the polyhedral error bound (PEB) condition (Gilpin et al., 2012; Yang and Lin, 2016). In particular, consider an optimization problem $\min_{x\in\mathbb{R}^d} h(x)$, where the epigraph of $h(x)$ is a polyhedron. Let $H_*$ denote the optimal set and $h_*$ denote the optimal value of the problem above. The PEB condition says that there exists $\rho > 0$ such that for any $x \in \mathbb{R}^d$
$$\mathrm{dist}(x, H_*) \le \frac{1}{\rho}(h(x) - h_*)$$
To show that the inequality (3) holds for a polyhedral function $c(x)$, we can consider the optimization problem $\min_{x\in\mathbb{R}^d}[c(x)]_+$. The optimal set of this problem is given by $H_* = \{x \in \mathbb{R}^d : c(x) \le 0\}$. Thus, when $c(x) > 0$, $x^\natural = \arg\min_{c(u)\le 0}\|u - x\|^2$ is the closest point in the optimal set to $x$. Therefore, if $c(x)$ is a polyhedral function, by the PEB condition there exists a $\rho > 0$ such that
$$\mathrm{dist}(x, H_*) = \|x - x^\natural\| \le \frac{1}{\rho}\left([c(x)]_+ - \min_x[c(x)]_+\right) = \frac{1}{\rho}[c(x)]_+$$
Let us consider a concrete example, where the problem has a set of affine inequalities $c_i^\top x - b_i \le 0$, $i = 1, \ldots, m$. There are two methods to encode these into a single constraint function $c(x) \le 0$. The first method is to use $c(x) = \max_{1\le i\le m} c_i^\top x - b_i$, which is a polyhedral function and therefore satisfies (3). The second method is to use $c(x) = \|[Cx - b]_+\|$, where $[a]_+ = \max(0, a)$ is applied entry-wise and $C = (c_1, \ldots, c_m)^\top$. Thus $[c(x)]_+ = \|[Cx - b]_+\|$. The inequality in (3) is then guaranteed by Hoffman's bound, and the parameter $\rho$ is given by the minimum non-zero eigenvalue of $C^\top C$ (Wang and Lin, 2014). Note that the projection onto a polyhedron is a linearly constrained quadratic programming problem, and the linear optimization over a polyhedron is a linear programming problem. Both have polynomial time complexity that would be high if $m$ and $d$ are large (Karmarkar, 1984; Kozlov et al., 1980).

Quadratic constraint. A quadratic constraint can take the form $\|Ax - y\|^2 \le \tau$, where $A \in \mathbb{R}^{m\times d}$ and $y \in \mathbb{R}^m$. Such a constraint appears in compressive sensing (Candès and Wakin, 2008),² where the goal is to reconstruct a sparse high-dimensional vector $x$ from a small number of noisy measurements $y = Ax + \varepsilon \in \mathbb{R}^m$ with $m \ll d$. The corresponding optimization problem is
$$\min_{x\in\mathbb{R}^d} \|x\|_1, \quad \text{s.t.} \quad \|Ax - y\|^2 \le \tau \qquad (4)$$
where $\tau \ge \|\varepsilon\|$ is an upper bound on the magnitude of the noise. To check Assumption (iii), we note that $c(x) = \|Ax - y\|^2 - \tau$ and $\nabla c(x) = A^\top(Ax - y)$. Let us consider that $A$ has a full row rank³ and denote $v = Ax - y$; then on the boundary $c(x) = 0$ we have $\|v\| = \sqrt{\tau}$ and $\|A^\top v\| \ge \sqrt{\tau\lambda_{\min}(AA^\top)}$, where $\lambda_{\min}(AA^\top) > 0$ is the minimum eigenvalue of $AA^\top \in \mathbb{R}^{m\times m}$. Therefore Assumption (iii) is satisfied with $\rho = \sqrt{\tau\lambda_{\min}(AA^\top)}$. It is notable that the projection and the linear optimization under the quadratic constraint require solving a quadratic programming problem and therefore could be expensive.

2. Here we use the squared constraint to make it a smooth function so that the proposed algorithms for smooth optimization are applicable by using a proximal gradient mapping to handle the $\ell_1$ norm.
3. This is reasonable because $m \ll d$.
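As an aside, the value of $\rho$ for the quadratic constraint is easy to check numerically. The following is a minimal sketch (all sizes and the bisection-based projection are our own illustrative choices, not the paper's implementation) that verifies inequality (3) with $\rho = \sqrt{\tau\lambda_{\min}(AA^\top)}$ on a random instance:

```python
import numpy as np

# Hedged numerical sanity check of Assumption (iii) for the quadratic
# constraint c(x) = ||Ax - y||^2 - tau; sizes are illustrative assumptions.
rng = np.random.default_rng(0)
m, d, tau = 20, 50, 0.5
A = rng.uniform(-1.0, 1.0, size=(m, d))   # full row rank w.h.p. since m << d
y = rng.normal(size=m)

def project(x, lo=0.0, hi=1e6, iters=200):
    """Project x onto {u : ||Au - y||^2 <= tau} via bisection on the KKT
    multiplier zeta: u(zeta) = (I + 2*zeta*A^T A)^{-1} (x + 2*zeta*A^T y)."""
    if np.sum((A @ x - y) ** 2) <= tau:
        return x
    I = np.eye(d)
    for _ in range(iters):
        zeta = 0.5 * (lo + hi)
        u = np.linalg.solve(I + 2.0 * zeta * A.T @ A, x + 2.0 * zeta * A.T @ y)
        # c(u(zeta)) decreases as zeta grows; shrink the bracket accordingly
        if np.sum((A @ u - y) ** 2) > tau:
            lo = zeta
        else:
            hi = zeta
    return u

rho = np.sqrt(tau * np.linalg.eigvalsh(A @ A.T)[0])  # sqrt(tau*lambda_min(AA^T))
x = 5.0 * rng.normal(size=d)                          # an infeasible point
x_nat = project(x)
c_x = np.sum((A @ x - y) ** 2) - tau
print(np.linalg.norm(x_nat - x), "<=", max(c_x, 0.0) / rho)  # inequality (3)
```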
PSD constraint. A PSD constraint $X \succeq 0$ for $X \in \mathbb{R}^{d\times d}$ can be written as an inequality constraint $-\lambda_{\min}(X) \le 0$, where $\lambda_{\min}(X)$ denotes the minimum eigenvalue of $X$. The subgradient of $c(X) = -\lambda_{\min}(X)$ when $\lambda_{\min}(X) = 0$ is given by $\mathrm{Conv}\{-uu^\top \mid \|u\|_2 = 1, Xu = 0\}$, i.e., the convex hull of the outer products of normalized vectors in the null space of the matrix $X$. To show the subgradient is lower bounded, we have
$$\partial c(X) = \mathrm{Conv}\{-uu^\top \mid \|u\| = 1,\ Xu = 0\} = \mathrm{Conv}\{-U \mid U \succeq 0,\ \mathrm{Tr}(X^\top U) = 0,\ \mathrm{rank}(U) = 1,\ \mathrm{Tr}(U) = 1\} = \{-U \mid U \succeq 0,\ \mathrm{Tr}(X^\top U) = 0,\ \mathrm{Tr}(U) = 1\}$$
which is the set of positive semi-definite matrices (negated) that are orthogonal to $X$ and have a trace of one. If the dimension of the null space of $X$ is $r$ with $1 \le r \le d$, we can show that the subgradient of $c(X)$ is lower bounded by $\rho = \frac{1}{\sqrt{r}} \ge \frac{1}{\sqrt{d}}$. We postpone the details to the appendix. Finally, we note that both projection and linear optimization under a PSD constraint need to conduct a singular value decomposition, which is time-consuming for a large matrix.
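To preview where the value $\rho = 1/\sqrt{r}$ comes from (the full details are in Appendix B), the key step is a short calculation: for any $U \succeq 0$ with $\mathrm{Tr}(U) = 1$ supported on the $r$-dimensional null space of $X$, with eigenvalues $\lambda_1, \ldots, \lambda_r \ge 0$,
$$\|U\|_F^2 = \sum_{i=1}^r \lambda_i^2 \ \ge\ \frac{1}{r}\Big(\sum_{i=1}^r \lambda_i\Big)^2 = \frac{\mathrm{Tr}(U)^2}{r} = \frac{1}{r}$$
by the Cauchy-Schwarz inequality, with equality when all $\lambda_i = 1/r$; hence $\|U\|_F \ge 1/\sqrt{r} \ge 1/\sqrt{d}$.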
4. A General Theory of Optimization with Only One Projection

In this section, we extend the idea of only one projection proposed in (Mahdavi et al., 2012) to a general theory, and then present optimization algorithms with only one projection for non-smooth and smooth optimization, respectively. To tackle the constraint, we introduce a penalty function $h_\gamma(x)$ parameterized by $\gamma$ such that
$$h_\gamma(x) \ge \lambda[c(x)]_+, \qquad h_\gamma(x) \le C\gamma,\ \forall x \text{ such that } c(x) \le 0 \qquad (5)$$
where $C$ is a constant and $\lambda$ is a constant such that $\lambda > G/\rho$. It is notable that the penalty function $h_\gamma(x)$ also depends on $\lambda$; however, since $\lambda$ will be set to a constant value, the dependence on $\lambda$ is omitted. We will construct such a penalty function $h_\gamma(x)$ for non-smooth and smooth optimization in the next subsections. We propose to optimize the following augmented objective function:
$$\min_{x\in\mathbb{R}^d} F_\gamma(x) = f(x) + h_\gamma(x) \qquad (6)$$
We can employ any applicable optimization algorithm to optimize $F_\gamma(x)$, pretending that there is no constraint, and finally obtain a solution $\hat{x}_T$ that is not necessarily feasible. In order to obtain a feasible solution, we perform one projection to get
$$\tilde{x}_T = \arg\min_{c(x)\le 0} \|x - \hat{x}_T\|^2 \qquad (7)$$
The following theorem allows us to convert the convergence of $\hat{x}_T$ for $F_\gamma(x)$ to that of $\tilde{x}_T$ for $f(x)$.
Theorem 2 Let $\mathcal{A}$ be any iterative optimization algorithm applied to $\min_x F_\gamma(x)$ with $T$ iterations, which starts with $x_1$ and returns $\hat{x}_T$ as the final solution. Assume $f(x)$ is $G$-Lipschitz continuous and $\mathcal{A}$ enjoys the following convergence of $\hat{x}_T$ for any $x \in \mathbb{R}^d$:
$$F_\gamma(\hat{x}_T) - F_\gamma(x) \le B_T(\gamma; x, x_1) \qquad (8)$$
where $B_T(\gamma; x, x_1) \to 0$ when $T \to \infty$. Then
$$f(\tilde{x}_T) - f(x_*) \le \frac{\lambda\rho}{\lambda\rho - G}\left(C\gamma + B_T(\gamma; x_*, x_1)\right) \qquad (9)$$
where $x_* \in \Omega_*$ is an optimal solution to (1).

Remark: It is worth mentioning that we omit some constant factors in the convergence bound $B_T(\gamma; x, x_1)$ that are irrelevant to our discussions. The notation $B_T(\gamma; x, x_1)$ emphasizes that it is a function of $\gamma$ and depends on $x_1$ and $x$; it will be referred to as $B_T$. In the next several subsections, we will see that by carefully choosing the penalty function $h_\gamma(x)$ we are able to provide nice convergence guarantees for smooth and non-smooth optimization with only one projection. In the above theorem, we assume the optimization algorithm $\mathcal{A}$ is deterministic. However, a similar result can be extended to a stochastic optimization algorithm, which we leave to future work.

Proof First, we consider $c(\hat{x}_T) \le 0$, which implies that $\hat{x}_T = \tilde{x}_T$. Then $F_\gamma(\tilde{x}_T) \ge f(\tilde{x}_T)$ and $F_\gamma(x_*) \le f(x_*) + C\gamma$; therefore $f(\tilde{x}_T) \le F_\gamma(\hat{x}_T) \le f(x_*) + C\gamma + B_T(\gamma; x_*, x_1)$. Then (9) follows since $\lambda\rho/(\lambda\rho - G) \ge 1$. Next, we assume $c(\hat{x}_T) > 0$. Inequality (8) implies that
$$f(\hat{x}_T) + \lambda[c(\hat{x}_T)]_+ \le f(x_*) + C\gamma + B_T(\gamma; x_*, x_1) \qquad (10)$$
By Assumption (iii), we have $[c(\hat{x}_T)]_+ \ge \rho\|\hat{x}_T - \tilde{x}_T\|$. Combined with (10) we have
$$\lambda\rho\|\hat{x}_T - \tilde{x}_T\| \le f(x_*) - f(\hat{x}_T) + C\gamma + B_T(\gamma; x_*, x_1) \le G\|\hat{x}_T - \tilde{x}_T\| + C\gamma + B_T(\gamma; x_*, x_1)$$
where the last inequality follows from the fact that $f(x_*) - f(\hat{x}_T) \le f(x_*) - f(\tilde{x}_T) + f(\tilde{x}_T) - f(\hat{x}_T) \le G\|\hat{x}_T - \tilde{x}_T\|$ by the Lipschitz property and $f(x_*) \le f(\tilde{x}_T)$. Therefore we have
$$\|\hat{x}_T - \tilde{x}_T\| \le \frac{C\gamma + B_T(\gamma; x_*, x_1)}{\lambda\rho - G}$$
Finally, we obtain
$$f(\tilde{x}_T) - f(x_*) \le f(\tilde{x}_T) - f(\hat{x}_T) + f(\hat{x}_T) - f(x_*) \le G\|\hat{x}_T - \tilde{x}_T\| + C\gamma + B_T(\gamma; x_*, x_1) \le \frac{\lambda\rho}{\lambda\rho - G}\left(C\gamma + B_T(\gamma; x_*, x_1)\right)$$
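To make the template concrete, below is a minimal sketch (not the authors' implementation) of the one-projection scheme behind Theorem 2: an unconstrained method is run on $F_\gamma$ for $T$ iterations and the single projection is performed only at the end. The oracles `grad_F`, `project`, and `step` are placeholders to be instantiated as in the following subsections.

```python
import numpy as np

def one_projection_solve(grad_F, project, x1, T, step):
    """Sketch of the one-projection scheme (Theorem 2): run an
    unconstrained (sub)gradient method on the augmented objective
    F_gamma, then project the averaged iterate onto {x : c(x) <= 0}
    exactly once at the end.
    grad_F  : (sub)gradient oracle of F_gamma = f + h_gamma
    project : the (expensive) projection Pi_Omega, called only once
    step    : step-size schedule, step(t) = eta_t
    """
    x = x1.copy()
    avg = np.zeros_like(x)
    for t in range(1, T + 1):
        x = x - step(t) * grad_F(x)   # no projection inside the loop
        avg += x
    x_hat = avg / T                   # hat{x}_T, possibly infeasible
    return project(x_hat)             # tilde{x}_T as in (7)
```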
4.1. Non-smooth Optimization

We discuss the application of the general theory of one-projection optimization to non-smooth optimization. For non-smooth optimization, we can choose $h(x) = \lambda[c(x)]_+$, so that $\gamma = 0$. We will use subgradient descent as an example to demonstrate the convergence for $f(x)$, though many other optimization algorithms designed for non-smooth optimization are applicable (e.g., (Nesterov, 2009)). The update of the subgradient descent method is given by the following for $t = 1, \ldots, T$:
$$x_{t+1} = x_t - \eta_t\partial F(x_t) \qquad (11)$$
Following the standard analysis for subgradient descent, we can establish the convergence of $F(x)$; in particular, $B_T = O(1/\sqrt{T})$ for a non-strongly convex function $f(x)$ and $B_T = \widetilde{O}(1/(\mu T))$ for a $\mu$-strongly convex function $f(x)$. Combining that with Theorem 2, we have the following convergence result, with the omitted proof included in the appendix.
Corollary 3 Suppose that Assumptions (iii) and (iv) hold and $\mathrm{dist}(x_1, \Omega_*) \le D$. Set $F(x) = f(x) + \lambda[c(x)]_+$ with $\lambda > G/\rho$. Let (11) run for $T$ iterations and $\hat{x}_T = \sum_{t=1}^T x_t/T$. If $f(x)$ is a convex function, we can set $\eta_t = D/((G + \lambda G_c)\sqrt{T})$ and achieve
$$f(\tilde{x}_T) - f_* \le \frac{\lambda\rho}{\lambda\rho - G}\cdot\frac{(G + \lambda G_c)D}{\sqrt{T}}$$
If $f(x)$ is a $\mu$-strongly convex function, we can set $\eta_t = 1/(\mu t)$ and achieve
$$f(\tilde{x}_T) - f_* \le \frac{\lambda\rho}{\lambda\rho - G}\cdot\frac{(G + \lambda G_c)^2(1 + \log T)}{\mu T}$$
Remark: The convergence rate remains the same as that of the projected subgradient method applied to (1), even though only one projection is conducted at the end. The same order of convergence results was established in (Mahdavi et al., 2012) for non-smooth optimization. It should be noted that the $\log(T)$ factor for strongly convex optimization can be removed using suffix averaging (Rakhlin et al., 2012) or polynomial-decay averaging (Shamir and Zhang, 2013). In the next section, we will develop improved convergence rates for non-smooth optimization.

4.2. Smooth Optimization

For smooth optimization, we consider both $f(x)$ and $c(x)$ to be smooth.⁴ Let the smoothness parameters of $f(x)$ and $c(x)$ be $L_f$ and $L_c$, respectively. In order to ensure the augmented function $F_\gamma(x)$ is still a smooth function, we construct the following penalty function:
$$h_\gamma(x) = \gamma\ln\left(1 + \exp\left(\frac{\lambda c(x)}{\gamma}\right)\right) \qquad (12)$$
The following proposition shows that $h_\gamma(x)$ is a smooth function and obeys the condition in (5).

4. It can be extended to the case when $f(x)$ is non-smooth but its proximal mapping can be easily solved.
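As an illustration, here is a direct sketch of the penalty (12) and of its gradient (the expression derived in Appendix C); `c` and `grad_c` are assumed to be user-supplied oracles:

```python
import numpy as np

def h_gamma(x, c, gamma, lam):
    """Penalty (12): gamma * ln(1 + exp(lam * c(x) / gamma)),
    computed via logaddexp for numerical stability."""
    return gamma * np.logaddexp(0.0, lam * c(x) / gamma)

def grad_h_gamma(x, c, grad_c, gamma, lam):
    """Gradient of (12): q(lam*c(x)/gamma) * lam * grad_c(x),
    where q(z) = exp(z) / (1 + exp(z)) is the logistic function."""
    z = lam * c(x) / gamma
    q = 0.5 * (1.0 + np.tanh(0.5 * z))   # stable form of exp(z)/(1+exp(z))
    return q * lam * grad_c(x)
```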
Proposition 1 Suppose $c(x)$ is $L_c$-smooth and $G_c$-Lipschitz continuous. The penalty function in (12) is a $\left(\lambda L_c + \frac{\lambda^2 G_c^2}{4\gamma}\right)$-smooth function and satisfies (i) $h_\gamma(x) \ge \lambda[c(x)]_+$ and (ii) $h_\gamma(x) \le \gamma\ln 2$, $\forall x$ such that $c(x) \le 0$.

Then $F_\gamma(x)$ is a smooth function and its smoothness parameter is given by $L_F = L_f + \lambda L_c + \frac{\lambda^2 G_c^2}{4\gamma}$. Next, we establish the convergence for $f(x)$ using Nesterov's optimal accelerated gradient (NAG) method. The update of one variant of NAG can be written as follows for $t = 0, 1, \ldots, T$:
$$x_{t+1} = y_t - \frac{1}{L_F}\nabla F_\gamma(y_t), \qquad y_{t+1} = x_{t+1} + \beta_{t+1}(x_{t+1} - x_t) \qquad (13)$$
where the value of $\beta_t$ can be set to different values depending on whether $f(x)$ is strongly convex or not (see Corollary 4). Previous works have established the convergence of $\hat{x}_T = x_T$ for $F_\gamma(x)$; in particular, $B_T = O\left(\frac{L_F}{T^2}\right)$ for smooth non-strongly convex optimization and $B_T = O\left(L_F\exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\right)$ for smooth and strongly convex optimization. By combining these results with Theorem 2 and appropriately setting $\gamma$, we can achieve the following convergence of $\tilde{x}_T$ for $f(x)$.
Corollary 4 Suppose that Assumptions (iii) and (iv) hold, $\mathrm{dist}(y_0, \Omega_*) \le D$, $\lambda > G/\rho$, $f(x)$ is $L_f$-smooth and $c(x)$ is $L_c$-smooth. Let (13) run for $T$ iterations and $\hat{x}_T = x_T$. If $f(x)$ is convex, we can set $\gamma = \frac{\lambda G_c D}{(T+1)\sqrt{2\ln 2}}$ and $\beta_t = \frac{\tau_{t-1} - 1}{\tau_t}$, where $\tau_t = \frac{1 + \sqrt{1 + 4\tau_{t-1}^2}}{2}$ with $\tau_0 = 1$, and achieve
$$f(\tilde{x}_T) - f_* \le \frac{\lambda\rho}{\lambda\rho - G}\left(\frac{\lambda G_c D\sqrt{2\ln 2}}{T + 1} + \frac{2(L_f + \lambda L_c)D^2}{(T + 1)^2}\right)$$
If $f(x)$ is $\mu$-strongly convex, we can set $\gamma = \frac{1}{T^{2\alpha}}$ with $\alpha \in (1/2, 1)$ and $\beta_t = \frac{\sqrt{L_F} - \sqrt{\mu}}{\sqrt{L_F} + \sqrt{\mu}}$, and achieve
$$f(\tilde{x}_T) - f_* \le O\left(\frac{1}{T^{2\alpha}} + \frac{1}{T^{4\alpha}}\right)$$
as long as $T \ge \left(\frac{L_f + \lambda L_c + \lambda^2 G_c^2/4}{\mu}\right)^{\frac{1}{2(1-\alpha)}}(4\alpha\ln T)^{\frac{1}{1-\alpha}}$.
Remark: The convergence results above indicate an $O(1/\epsilon)$ iteration complexity for smooth optimization and an $O(1/\epsilon^{1/(2\alpha)})$ iteration complexity with $\alpha \in (1/2, 1)$ for smooth and strongly convex optimization with only one projection.
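For concreteness, here is a sketch of the update (13) with the momentum schedule of Corollary 4 (convex case), followed by the single projection (7); `grad_F`, `L_F`, and `project` are assumed to be given:

```python
import numpy as np

def nag_one_projection(grad_F, L_F, project, y0, T):
    """NAG update (13) on F_gamma with tau_t = (1 + sqrt(1 + 4*tau_{t-1}^2))/2
    and beta_t = (tau_{t-1} - 1)/tau_t (convex case of Corollary 4),
    followed by one projection of x_T."""
    x_prev = y = y0.copy()
    tau_prev = 1.0
    for _ in range(T):
        x = y - grad_F(y) / L_F                       # gradient step at y_t
        tau = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tau_prev ** 2))
        beta = (tau_prev - 1.0) / tau
        y = x + beta * (x - x_prev)                   # momentum extrapolation
        x_prev, tau_prev = x, tau
    return project(x_prev)                            # tilde{x}_T
```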
5. Improved Convergence for Non-strongly Convex Optimization

In this section, we show that the iteration complexity can be further reduced for both smooth and non-smooth optimization at the price of a logarithmic number of projections. To facilitate the presentation, we first introduce some notation. The $\epsilon$-level set $\mathcal{L}_\epsilon$ and $\epsilon$-sublevel set $\mathcal{S}_\epsilon$ of $f(x)$ are defined as
$$\mathcal{L}_\epsilon = \{x \in \Omega : f(x) = f_* + \epsilon\}, \qquad \mathcal{S}_\epsilon = \{x \in \Omega : f(x) \le f_* + \epsilon\}$$
Note that Assumption (ii) guarantees that $\mathcal{S}_\epsilon$ is also bounded (Rockafellar, 1970). Let $x^\dagger_\epsilon$ denote the closest point in the $\epsilon$-sublevel set $\mathcal{S}_\epsilon$ to $x \in \Omega$, i.e.,
$$x^\dagger_\epsilon = \arg\min_{u\in\Omega}\|u - x\|^2, \quad \text{s.t.} \quad f(u) \le f_* + \epsilon \qquad (14)$$
Let $x_*$ denote the closest optimal solution in $\Omega_*$ to $x$, i.e., $x_* = \arg\min_{u\in\Omega_*}\|u - x\|^2$. Our development relies on the following key result (Yang and Lin, 2016). We include the proof in the appendix.

Lemma 2 Let $D_\epsilon = \max_{x\in\mathcal{L}_\epsilon}\mathrm{dist}(x, \Omega_*)$. Then for any $x \in \Omega$ and $\epsilon > 0$ we have
$$\|x - x^\dagger_\epsilon\| \le \frac{D_\epsilon}{\epsilon}\left(f(x) - f(x^\dagger_\epsilon)\right) \qquad (15)$$
In order to develop improved convergence, we explore the local error bound condition to further bound $D_\epsilon$.
Definition 3 A problem $\min_{x\in\Omega} f(x)$ satisfies a local error bound condition if there exist $\theta \in (0, 1]$ and $\sigma > 0$ such that for any $x \in \mathcal{S}_\epsilon$ we have
$$\mathrm{dist}(x, \Omega_*) \le \sigma(f(x) - f_*)^\theta$$
where $\Omega_*$ denotes the optimal set and $f_*$ denotes the optimal value.

Remark: The local error bound condition is a mild condition, and studies have shown that many problems enjoy it. For example, (Bolte et al., 2015) shows that the Kurdyka-Lojasiewicz (KL) property, which is satisfied by many convex functions, implies the local error bound condition. Additionally, (Yang and Lin, 2016) shows that problems that have a polyhedral epigraph satisfy the local error bound condition with $\theta = 1$. It is also easy to see that a locally strongly convex function satisfies the local error bound condition with $\theta = 1/2$: $\mu$-strong convexity around $\Omega_*$ gives $f(x) - f_* \ge \frac{\mu}{2}\mathrm{dist}^2(x, \Omega_*)$, i.e., $\mathrm{dist}(x, \Omega_*) \le \sqrt{2/\mu}\,(f(x) - f_*)^{1/2}$. In the next two subsections, we develop improved convergence for non-strongly convex optimization with a logarithmic number of projections based on the local error bound condition.

5.1. Improved Convergence for Non-smooth Optimization

To establish improved convergence for non-smooth optimization, we develop a new algorithm, shown in Algorithm 1, based on the subgradient descent (GD) method; we refer to it as LoPGD. The algorithm runs for $K$ epochs, and each epoch employs GD for minimizing $F(x)$ with a feasible solution $x_{k-1} \in \Omega$ as a starting point and $t$ iterations of updates. At the end of each epoch, the averaged solution $\hat{x}_k$ is projected into the constrained domain $\Omega$, and the resulting solution $x_k$ is used as the starting point for the next epoch. The step size $\eta_k$ is decreased by half every epoch, starting from a given value $\eta_1$. The theorem below establishes the iteration complexity of LoPGD for finding an $\epsilon$-optimal solution of (1) and also exhibits the values of $K$, $t$ and $\eta_1$. To simplify notation, we let $p = \frac{\lambda\rho}{\lambda\rho - G}$ and $\bar{G}^2 = G^2 + \lambda^2 G_c^2$.

Theorem 5 Suppose Assumptions (i)-(iv) hold. Let $\eta_1 = \frac{\epsilon_0}{2p\bar{G}^2}$, $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$ and $t = \frac{4\sigma^2 p^2\bar{G}^2}{\epsilon^{2(1-\theta)}}$, where $\theta$ and $\sigma$ are the constants appearing in the local error bound condition. Then there exists at least one $k \in \{1, \ldots, K\}$ such that $f(x_k) - f_* \le 2\epsilon$.
Remark: Since the projection is only conducted at the end of each epoch and the total number of epochs is at most $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$, the total number of projections is only logarithmic. The iteration complexity in Theorem 5 is $\widetilde{O}(1/\epsilon^{2(1-\theta)})$, which improves the one in Corollary 3 without strong convexity.
Algorithm 1 LoPGD
1: INPUT: $K \in \mathbb{N}_+$, $t \in \mathbb{N}_+$, $\eta_1$
2: Initialization: $x_0 \in \Omega$, $\epsilon_0$
3: for $k = 1, 2, \ldots, K$ do
4:   Let $x_1^k = x_{k-1}$
5:   for $s = 1, 2, \ldots, t - 1$ do
6:     Update $x_{s+1}^k = x_s^k - \eta_k\partial F(x_s^k)$
7:   end for
8:   Let $\hat{x}_k = \sum_{s=1}^t x_s^k/t$
9:   Compute $x_k = \Pi_\Omega[\hat{x}_k]$
10:  Update $\eta_{k+1} = \eta_k/2$
11: end for

Algorithm 2 LoPNAG
1: INPUT: $K \in \mathbb{N}_+$, a sequence of numbers $t_1, \ldots, t_K \in \mathbb{N}_+$, $\gamma_1$
2: Initialization: $x_0 \in \Omega$, $\epsilon_0$
3: for $k = 1, 2, \ldots, K$ do
4:   Let $y_0^k = x_{k-1}$
5:   for $s = 0, 1, 2, \ldots, t_k - 1$ do
6:     Update $x_{s+1}^k = y_s^k - \frac{1}{L_k}\nabla F_{\gamma_k}(y_s^k)$
7:     Update $y_{s+1}^k = x_{s+1}^k + \beta_{s+1}(x_{s+1}^k - x_s^k)$
8:   end for
9:   Let $\hat{x}_k = x_{t_k}^k$
10:  Compute $x_k = \Pi_\Omega[\hat{x}_k]$
11:  Update $\gamma_{k+1} = \gamma_k/2$
12: end for
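A compact sketch of Algorithm 1 (LoPGD) follows; the subgradient oracle of $F(x) = f(x) + \lambda[c(x)]_+$ and the projection $\Pi_\Omega$ are assumed as inputs, and the schedule constants are those of Theorem 5:

```python
import numpy as np

def lopgd(grad_F, project, x0, K, t, eta1):
    """Sketch of Algorithm 1 (LoPGD): K epochs of subgradient descent
    on F(x) = f(x) + lam*[c(x)]_+, one projection per epoch, and the
    step size halved between epochs (K is logarithmic in eps_0/eps)."""
    x_k, eta = x0.copy(), eta1
    for _ in range(K):
        x = x_k.copy()
        total = np.zeros_like(x)
        for _ in range(t):
            total += x                  # accumulates x_1^k, ..., x_t^k
            x = x - eta * grad_F(x)     # inner update, no projection
        x_k = project(total / t)        # x_k = Pi_Omega[hat{x}_k]
        eta *= 0.5                      # eta_{k+1} = eta_k / 2
    return x_k
```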
5.2. Improved Convergence for Smooth Optimization

Similar to non-smooth optimization, we also develop a new algorithm based on NAG, shown in Algorithm 2, where $L_k = L_{F_{\gamma_k}}$ is the smoothness parameter of $F_{\gamma_k}$ and $\beta_s = \frac{\tau_{s-1} - 1}{\tau_s}$, $s = 1, \ldots$, is a sequence with $\tau_s$ updated as in Corollary 4. We refer to this algorithm as LoPNAG. The theorem below exhibits the iteration complexity of LoPNAG and reveals the values of $K$, $\gamma_1$ and $t_1, \ldots, t_K$. To simplify notation, we let $\bar{L} = L_f + \lambda L_c$.

Theorem 6 Suppose Assumptions (i)-(iv) hold, $f(x)$ is $L_f$-smooth and $c(x)$ is $L_c$-smooth. Let $\gamma_1 = \frac{\epsilon_0}{6p\ln 2}$, $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$ and $t_k = \frac{\sigma}{\epsilon^{1-\theta}}\max\left\{\lambda G_c p\sqrt{18\ln 2},\ \sqrt{6(L_f + \lambda L_c)\epsilon_0/2^{k-1}}\right\}$, where $\theta$ and $\sigma$ are the constants appearing in the local error bound condition. Then there exists at least one $k \in \{1, \ldots, K\}$ such that $f(x_k) - f_* \le 2\epsilon$.

Remark: It is not difficult to show that the total number of iterations is bounded by $\widetilde{O}(1/\epsilon^{1-\theta})$, which improves the one in Corollary 4 without strong convexity.
6. An Experimental Result
We present an experimental result in this section to demonstrate the effectiveness of the proposed algorithms for solving the compressive sensing problem in (4). We generate synthetic data for testing. In particular, we generate a random measurement matrix $A \in \mathbb{R}^{m\times d}$ with $m = 1000$ and $d = 5000$. The entries of the matrix $A$ are generated independently from the uniform distribution over the interval $[-1, +1]$. The vector $x_* \in \mathbb{R}^d$ is generated with the same distribution at 100 randomly chosen coordinates. The noise $\varepsilon \in \mathbb{R}^m$ is a dense vector with independent random entries from the uniform distribution over the interval $[-\sigma, \sigma]$, where $\sigma$ is the noise magnitude and is set to 0.01. Finally, the vector $y$ is obtained as $y = Ax_* + \varepsilon$. We compare with the state-of-the-art optimization algorithm proposed in (Becker et al., 2011) (i.e., Nesterov's smoothing plus Nesterov's optimal method for the smoothed problem), known as NESTA.
Table 1: Comparison between the proposed LoPNAG and NESTA (Becker et al., 2011) for solving the measurement constrained compressive sensing problem.
LoPNAG:
Iters - Projs   Rec. Err.   Objective   Time (s)
5000 - 1        0.018017    52.042878   18.04
10000 - 2       0.018038    52.042418   35.88
15000 - 3       0.018043    52.042358   53.09
20000 - 4       0.018043    52.042358   70.24
25000 - 5       0.018043    52.042358   87.32

NESTA:
Iters - Projs   Rec. Err.   Objective   Time (s)
1000 - 2000     0.137798    52.703275   48.49
3000 - 6000     0.018669    52.050051   93.84
5000 - 10000    0.018659    52.050046   245.23
8000 - 16000    0.018657    52.050045   404.72
10000 - 20000   0.018657    52.050044   501.65
We use the Matlab package of NESTA.⁵ For a fair comparison, we also use the fast projection code in the NESTA package for conducting projections. We implement the proposed LoPNAG algorithm. To handle the unknown smoothness parameter in the proposed algorithm, we use the backtracking technique (Beck and Teboulle, 2009). The parameter $\gamma$ is initially set to 0.001 and decreased by half every 5000 iterations after a projection, and the target smoothing parameter $\mu$ in NESTA is set to $10^{-5}$. For the value of $\lambda$ in LoPNAG, we tune it from its theoretical value down to several smaller values and choose the one that yields the fastest convergence. We report the results in Table 1, which includes different numbers of iterations, the corresponding numbers of projections, the recovery error of the found solution compared to the underlying true sparse solution, the objective value (i.e., the $\ell_1$ norm of the found solution), and the running time. Note that each iteration of NESTA requires two projections because it maintains two extra sequences of solutions. From the results, we can see that LoPNAG converges significantly faster than NESTA. Even with only one projection, we are able to obtain a better solution than that of NESTA after running 10000 iterations.

5. http://statweb.stanford.edu/~candes/nesta/
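For reference, the synthetic instance described above can be generated along the following lines (a sketch under the stated distributions; the seed and the exact slack in $\tau$ are our own assumptions, and the solvers and timings are not reproduced here):

```python
import numpy as np

# Sketch of the synthetic compressive sensing instance of this section.
rng = np.random.default_rng(0)            # seed is an arbitrary choice
m, d, k, sigma = 1000, 5000, 100, 0.01

A = rng.uniform(-1.0, 1.0, size=(m, d))   # random measurement matrix
x_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
x_star[support] = rng.uniform(-1.0, 1.0, size=k)   # 100 nonzero coordinates
eps = rng.uniform(-sigma, sigma, size=m)           # dense bounded noise
y = A @ x_star + eps

tau = 1.1 * np.sum(eps ** 2)   # slack bound so that ||A x* - y||^2 <= tau
# Problem (4): min ||x||_1  subject to  ||A x - y||^2 <= tau
```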
7. Conclusion

In this paper, we have considered a convex optimization problem subject to a convex inequality constraint. We have developed a general theory of optimization with only one projection that yields an improved iteration complexity for smooth optimization compared with non-smooth optimization. By exploiting the local error bound condition, we have further developed new algorithms with a logarithmic number of projections that achieve better convergence for both smooth and non-smooth optimization. As future work, we plan to explore the application of the general theory to stochastic optimization.
References

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2:183–202, 2009.

Stephen Becker, Jérôme Bobin, and Emmanuel J. Candès. NESTA: A fast and accurate first-order method for sparse recovery. SIAM J. Img. Sci., 4:1–39, 2011. ISSN 1936-4954.
Jérôme Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.

Emmanuel J. Candès and Michael B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.

Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646–1654, 2014.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics (NRL), 3:149–154, 1956.

Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 541–549, 2015.

Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with log(1/epsilon) convergence for epsilon-equilibrium in two-person zero-sum games. Math. Program., 133(1-2):279–298, 2012.

Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the International Conference on Machine Learning (ICML), 2012.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Ke Hou, Zirui Zhou, Anthony Man-Cho So, and Zhi-Quan Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In Advances in Neural Information Processing Systems (NIPS), pages 710–718, 2013.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 427–435, 2013.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.

N. Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 302–311, 1984.

M.K. Kozlov, S.P. Tarasov, and L.G. Khachiyan. Polynomiale Lösbarkeit der konvexen quadratischen Programmierung. Zh. Vychisl. Mat. Mat. Fiz., 20:1319–1323, 1980. ISSN 0044-4669.

Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the International Conference on Machine Learning (ICML), pages 53–61, 2013.

M. Mahdavi, T. Yang, R. Jin, and S. Zhu. Stochastic gradient descent with only one projection. In Advances in Neural Information Processing Systems (NIPS), pages 503–511, 2012.

Yurii Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004a.

Yurii Nesterov. Introductory lectures on convex optimization: a basic course. Applied Optimization. Kluwer Academic Publ., 2004b. ISBN 1-4020-7553-7.
Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120:221–259, 2009.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

R.T. Rockafellar. Convex Analysis. Princeton Mathematical Series. Princeton University Press, 1970. URL https://books.google.com/books?id=wDgAoAEACAAJ.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 71–79, 2013.

Anthony Man-Cho So. Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. CoRR, abs/1309.0113, 2013.

Po-Wei Wang and Chih-Jen Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15(1):1523–1548, 2014.

Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (NIPS), pages 1473–1480, 2006.

E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), volume 15, pages 505–512, 2003.

Tianbao Yang and Qihang Lin. RSG: Beating SG without smoothness and/or strong convexity. CoRR, abs/1512.03107, 2016.

Lijun Zhang, Tianbao Yang, Rong Jin, and Xiaofei He. O(log T) projections for stochastic optimization of smooth and strongly convex functions. In Proceedings of the International Conference on Machine Learning (ICML), pages 1121–1129, 2013.

Zirui Zhou, Qi Zhang, and Anthony Man-Cho So. L1,p-norm regularization: Error bounds and convergence rate analysis of first-order methods. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1501–1510, 2015.

Zeyuan Allen-Zhu. Katyusha: Accelerated variance reduction for faster SGD. CoRR, abs/1603.05953, 2016.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML), pages 928–936, 2003.
Appendix A. Proof of Lemma 1

When $c(x) \le 0$, $x^\natural = x$ and there is nothing to prove. Therefore we consider $c(x) > 0$ and $x^\natural \ne x$. By the KKT conditions, there exist $\zeta \ge 0$ and $v \in \partial c(x^\natural)$ such that
$$x^\natural - x + \zeta v = 0, \qquad \zeta c(x^\natural) = 0$$
Since $x^\natural \ne x$, we have $\zeta > 0$, $c(x^\natural) = 0$ and $v \ne 0$. Therefore, $x - x^\natural$ points in the same direction as $v$. On the other hand,
$$c(x) = c(x) - c(x^\natural) \ge (x - x^\natural)^\top v = \|v\|\|x - x^\natural\| \ge \rho\|x - x^\natural\|$$
where the first inequality uses the convexity of $c$, the second equality uses the fact that $x - x^\natural$ points in the same direction as $v$, and the last inequality uses inequality (2).
Appendix B. Lower bound of the subgradient of the constraint function for a PSD constraint

We first show that
$$\mathrm{Conv}\{-uu^\top \mid \|u\| = 1,\ Xu = 0\} = \mathrm{Conv}\{-U \mid U \succeq 0,\ \mathrm{Tr}(X^\top U) = 0,\ \mathrm{rank}(U) = 1,\ \mathrm{Tr}(U) = 1\}$$
In fact, given any $u \in \mathbb{R}^d$ with $\|u\| = 1$ and $Xu = 0$, we can show $uu^\top \succeq 0$, $\mathrm{Tr}(X^\top uu^\top) = u^\top Xu = 0$, $\mathrm{rank}(uu^\top) = 1$ and $\mathrm{Tr}(uu^\top) = \|u\|^2 = 1$, which means $-uu^\top$ belongs to the set on the right. Since the set on the left is the convex hull of all such $-uu^\top$, the set on the left is included in the set on the right. On the other hand, given any element $-U$ from the set on the right, we can represent $U = \sum_{k=1}^K \lambda_k U_k$ where $\sum_{k=1}^K \lambda_k = 1$, $\lambda_k \ge 0$, $U_k = u_k u_k^\top$ for some $u_k$, $\mathrm{Tr}(X^\top U_k) = 0$ and $\mathrm{Tr}(U_k) = 1$ for $k = 1, \ldots, K$. These properties of $U_k$ imply $Xu_k = 0$ and $\|u_k\|^2 = 1$, so that $U$ is a convex combination of elements of the set on the left. Therefore, the set on the right is included in the set on the left.

Next, we want to show
$$\mathrm{Conv}\{-U \mid U \succeq 0,\ \mathrm{Tr}(X^\top U) = 0,\ \mathrm{rank}(U) = 1,\ \mathrm{Tr}(U) = 1\} = \{-U \mid U \succeq 0,\ \mathrm{Tr}(X^\top U) = 0,\ \mathrm{Tr}(U) = 1\}$$
It is easy to see that the set on the left is a subset of the set on the right. To show the opposite, given any element $-U$ from the set on the right, we consider the eigenvalue decomposition $U = \sum_{k=1}^K \lambda_k u_k u_k^\top$, where $K \le d$, and $\lambda_k > 0$ and $u_k$ are the eigenvalues and corresponding eigenvectors with $\|u_k\| = 1$. Since $X$ is PSD, the property $\mathrm{Tr}(X^\top U) = 0$ implies $\sum_{k=1}^K \lambda_k u_k^\top Xu_k = 0$, so that $Xu_k$ must be zero for $k = 1, \ldots, K$. As a result, $U = \sum_{k=1}^K \lambda_k U_k$ with $U_k = u_k u_k^\top$ being an element in the set on the left. Note that $\mathrm{Tr}(U) = \sum_{k=1}^K \lambda_k u_k^\top u_k = \sum_{k=1}^K \lambda_k = 1$. This means $U$ is in the set on the left as well.

If the dimension of the null space of $X$ is $r$ with $1 \le r \le d$, then we can write $X = V\Sigma V^\top$, where $\Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}$ is a diagonal matrix with $\Sigma_r \in \mathbb{R}^{(d-r)\times(d-r)}$. We can set the constant $\rho$ to be the optimal value of the following optimization problem:
$$\rho = \min_{U\in\mathbb{R}^{d\times d}} \|U\|_F \quad \text{s.t.} \quad U \succeq 0,\ \mathrm{Tr}(X^\top U) = 0,\ \mathrm{Tr}(U) = 1$$
To simplify the problem, we note that $\mathrm{Tr}(X^\top U) = \mathrm{Tr}(V\Sigma V^\top U) = \mathrm{Tr}(\Sigma V^\top UV) = 0$. Let $V^\top UV = \begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix}$, where $U_{11} \in \mathbb{R}^{(d-r)\times(d-r)}$ and $U_{22} \in \mathbb{R}^{r\times r}$. Because $\Sigma$ is a diagonal matrix with nonnegative entries, it follows that the diagonal entries of $U_{11}$ are all zeros; as a result $U_{11} = 0$ and consequently $U_{21} = U_{12}^\top = 0$, due to $V^\top UV \succeq 0$. As a result, $\|U\|_F = \|V^\top UV\|_F = \|U_{22}\|_F$ and $\mathrm{Tr}(U) = \mathrm{Tr}(V^\top UV) = \mathrm{Tr}(U_{22})$. Therefore, we get
$$\rho = \min_{U_{22}\in\mathbb{R}^{r\times r}} \|U_{22}\|_F, \quad \text{s.t.} \quad U_{22} \succeq 0,\ \mathrm{Tr}(U_{22}) = 1$$
As a result, $\rho = \frac{1}{\sqrt{r}} \ge \frac{1}{\sqrt{d}}$.
Appendix C. Proof of Proposition 1

The two inequalities are straightforward to prove; we prove the smoothness property. Let $q(z) = \exp(z)/(1 + \exp(z))$. It is not difficult to see that $q(z)$ is a $1/4$-Lipschitz continuous function. The gradient of $h_\gamma(x)$ is given by
$$\nabla h_\gamma(x) = \frac{\exp(\lambda c(x)/\gamma)}{1 + \exp(\lambda c(x)/\gamma)}\,\lambda\nabla c(x) = q(\lambda c(x)/\gamma)\,\lambda\nabla c(x)$$
Then
$$\|\nabla h_\gamma(x) - \nabla h_\gamma(y)\| = \|q(\lambda c(x)/\gamma)\lambda\nabla c(x) - q(\lambda c(y)/\gamma)\lambda\nabla c(y)\|$$
$$\le \|q(\lambda c(x)/\gamma)\lambda\nabla c(x) - q(\lambda c(y)/\gamma)\lambda\nabla c(x)\| + \|q(\lambda c(y)/\gamma)\lambda\nabla c(x) - q(\lambda c(y)/\gamma)\lambda\nabla c(y)\|$$
$$\le \frac{\lambda G_c}{4}\left|\frac{\lambda c(x)}{\gamma} - \frac{\lambda c(y)}{\gamma}\right| + \lambda L_c\|x - y\| \le \left(\frac{\lambda^2 G_c^2}{4\gamma} + \lambda L_c\right)\|x - y\|$$
Appendix D. Proof of Corollary 3

To prove the corollary, we need the convergence results of subgradient descent methods. We first prove the result for non-smooth convex optimization.

Proposition 2 ((Zinkevich, 2003)) Let $x_{t+1} = x_t - \eta\partial F(x_t)$ run for $T$ iterations. Assume $\|\partial F(x)\| \le \bar{G}$. Then for any $x \in \mathbb{R}^d$
$$F(\hat{x}_T) - F(x) \le \frac{\eta\bar{G}^2}{2} + \frac{\|x_1 - x\|^2}{2\eta T}$$
where $\hat{x}_T = \sum_{t=1}^T x_t/T$.

Since $F(x) = f(x) + \lambda[c(x)]_+$, we can let $\bar{G} = G + \lambda G_c$. To prove the result for convex optimization, we let $x = x_*$, the closest optimal solution in $\Omega_*$ to $x_1$, in the above convergence bound. Then we have
$$F(\hat{x}_T) - F(x_*) \le \frac{\eta\bar{G}^2}{2} + \frac{\mathrm{dist}^2(x_1, \Omega_*)}{2\eta T} \le \frac{\eta\bar{G}^2}{2} + \frac{D^2}{2\eta T} = \frac{\bar{G}D}{\sqrt{T}}$$
where the last equality is due to the value of $\eta$. Then, by combining with the result in Theorem 2, we have
$$f(\tilde{x}_T) - f(x_*) \le \frac{\lambda\rho}{\lambda\rho - G}\cdot\frac{\bar{G}D}{\sqrt{T}}$$
To prove the convergence for strongly convex optimization, we need the following result.

Proposition 3 ((Hazan et al., 2007)) Assume $F(x)$ is $\mu$-strongly convex and $\|\partial F(x)\| \le \bar{G}$. Let $x_{t+1} = x_t - \eta_t\partial F(x_t)$. If $\eta_t = 1/(\mu t)$, we have
$$F(\hat{x}_T) - F(x) \le \frac{\bar{G}^2(1 + \log T)}{2\mu T}$$
where $\hat{x}_T = \sum_{t=1}^T x_t/T$.

Then the convergence of $\tilde{x}_T$ for $f(x)$ is similarly proved.
Appendix E. Proof of Corollary 4

The following proposition shows the convergence of $F(x)$.

Proposition 4 (Beck and Teboulle, 2009; Nesterov, 2004b) Assume $F(x)$ is $L_F$-smooth. Let (13) run for $T$ iterations. If $F(x)$ is convex, we can set $\beta_t = \frac{\tau_{t-1}-1}{\tau_t}$, where $\tau_t = \frac{1 + \sqrt{1 + 4\tau_{t-1}^2}}{2}$ with $\tau_0 = 1$. Then for any $x \in \mathbb{R}^d$ we have
$$F(x_T) - F(x) \le \frac{2L_F\|y_0 - x\|^2}{(T+1)^2}$$
If $F(x)$ is $\mu$-strongly convex, we can set $\beta_t = \frac{\sqrt{L_F} - \sqrt{\mu}}{\sqrt{L_F} + \sqrt{\mu}}$. Then for any $x \in \mathbb{R}^d$ we have
$$F(x_T) - F(x) \le \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\left(F(y_0) - F(x) + \frac{\mu}{2}\|y_0 - x\|^2\right)$$

We first prove the convergence for a smooth convex function $f(x)$. From Theorem 2 and the construction of $h_\gamma(x)$ in (12), we have
$$f(\tilde{x}_T) - f(x_*) \le p\left(\gamma\ln 2 + \frac{2L_F\|y_0 - x_*\|^2}{(T+1)^2}\right)$$
where $p = \frac{\lambda\rho}{\lambda\rho - G}$. By Proposition 1, we have $L_F = L_f + \lambda L_c + \frac{\lambda^2 G_c^2}{4\gamma}$. Then
$$f(\tilde{x}_T) - f(x_*) \le p\left(\gamma\ln 2 + \frac{\lambda^2 G_c^2\|y_0 - x_*\|^2}{2\gamma(T+1)^2} + \frac{2(L_f + \lambda L_c)\|y_0 - x_*\|^2}{(T+1)^2}\right) \le p\left(\gamma\ln 2 + \frac{\lambda^2 G_c^2 D^2}{2\gamma(T+1)^2} + \frac{2(L_f + \lambda L_c)D^2}{(T+1)^2}\right) = p\left(\frac{\sqrt{2\ln 2}\,\lambda G_c D}{T+1} + \frac{2(L_f + \lambda L_c)D^2}{(T+1)^2}\right)$$
where the last equality is due to the value of $\gamma$.

Next, we prove the convergence for a smooth and strongly convex function $f(x)$. First, we have
$$F(x_T) - F(x_*) \le \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\left(\nabla F(x_*)^\top(y_0 - x_*) + \frac{L_F + \mu}{2}\|y_0 - x_*\|^2\right) \le \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\left(\bar{G}\|y_0 - x_*\| + L_F\|y_0 - x_*\|^2\right) \le \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\bar{G}D + \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)L_F D^2$$
Note that $x_*$ is not the optimal solution to $F(x)$, hence $\nabla F(x_*) \ne 0$, and we use the Lipschitz continuity property with $\bar{G} = G + \lambda G_c$. Then
$$f(\tilde{x}_T) - f(x_*) \le p\left(\gamma\ln 2 + \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\bar{G}D + \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)L_F D^2\right) \le p\left(\gamma\ln 2 + \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\bar{G}D + \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)(L_f + \lambda L_c)D^2 + \exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\frac{\lambda^2 G_c^2 D^2}{4\gamma}\right)$$
To avoid clutter, we consider the dominating terms. Consider $\gamma = \frac{1}{T^{2\alpha}} \le 1$. To bound the last term, we have
$$\exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\frac{\lambda^2 G_c^2 D^2}{4\gamma} = O\left(\exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)T^{2\alpha}\right) = O\left(\exp\left(-T\sqrt{\frac{\mu}{L_f + \lambda L_c + \frac{\lambda^2 G_c^2}{4\gamma}}}\right)T^{2\alpha}\right) \le O\left(\exp\left(-T\sqrt{\frac{\mu\gamma}{L_f + \lambda L_c + \lambda^2 G_c^2/4}}\right)T^{2\alpha}\right) = O\left(\exp\left(-T^{1-\alpha}\sqrt{\frac{\mu}{L_f + \lambda L_c + \lambda^2 G_c^2/4}}\right)T^{2\alpha}\right)$$
Consider $T$ to be sufficiently large such that $T \ge \left(\frac{L_f + \lambda L_c + \lambda^2 G_c^2/4}{\mu}\right)^{\frac{1}{2(1-\alpha)}}(4\alpha\ln T)^{\frac{1}{1-\alpha}}$; then
$$\exp\left(-T^{1-\alpha}\sqrt{\frac{\mu}{L_f + \lambda L_c + \lambda^2 G_c^2/4}}\right) \le \frac{1}{T^{4\alpha}}$$
Therefore
$$\exp\left(-T\sqrt{\frac{\mu}{L_F}}\right)\frac{\lambda^2 G_c^2 D^2}{4\gamma} \le O\left(\frac{1}{T^{2\alpha}}\right)$$
Similarly, we can show the other two exponential terms are dominated by $O(1/T^{4\alpha})$. As a result,
$$f(\tilde{x}_T) - f(x_*) \le O\left(\frac{1}{T^{2\alpha}} + \frac{1}{T^{4\alpha}}\right)$$
(16)
Let u = x in the first inequality we have ζ∂f (x†ǫ )⊤ (x − x†ǫ ) ≥ kx − x†ǫ k2 We argue that ζ > 0, otherwise x = x†ǫ contradicting to the assumption x 6∈ Sǫ . Therefore f (x) − f (x†ǫ ) ≥ ∂f (x†ǫ )⊤ (x − x†ǫ ) ≥
kx − x†ǫ k kx − x†ǫ k2 = kx − x†ǫ k ζ ζ
(17)
Next we prove that ζ is upper bounded. Since −ǫ = f (x∗ǫ ) − f (x†ǫ ) ≥ (x∗ǫ − x†ǫ )⊤ ∂f (x†ǫ ) where x∗ǫ is the closest point to x†ǫ in the optimal set. Let u = x∗ǫ in the inequality of (16), we have (x†ǫ − x)⊤ (x∗ǫ − x†ǫ ) ≥ ζ(x†ǫ − x∗ǫ )⊤ ∂f (x†ǫ ) ≥ ζǫ 18
Convex Constrained Optimization with Reduced Projections and Improved Rates
Thus ζ≤
Dǫ kx†ǫ − xk (x†ǫ − x)⊤ (x∗ǫ − x†ǫ ) ≤ ǫ ǫ
Therefore
ǫ kx − x†ǫ k ≥ ζ Dǫ
Combining the above inequality with (17) we have f (x) − f (x†ǫ ) ≥
ǫ kx − x†ǫ k Dǫ
which completes the proof.
Appendix G. Proof of Theorem 5 Let ǫk =
ǫ0 2k .
We assume x1 , . . . , xK−1 6∈ S2ǫ ; otherwise the result holds trivially. Let x†k,ǫ ∈ Ω denote
the closest point to xk in the sublevel set Sǫ of f (x). Then f (x†k,ǫ ) = f∗ + ǫ, k = 1, . . . , K − 1, which is because we assume that x1 , . . . , xK−1 6∈ S2ǫ . We will prove by induction that f (xk ) − f∗ ≤ ǫk + ǫ. It holds for k = 0 because of Assumption (i). We assume it holds for k − 1 and prove it is true for k ≤ K. We consider the k-th epoch of LoPGD. For any x ∈ Rd F (b xk ) − F (x) ≤
¯2 ηk G kxk−1 − xk22 + 2 2ηk t
Let x = x†k−1,ǫ ∈ Ω. Then F (b xk ) −
F (x†k−1,ǫ )
¯2 kxk−1 − x†k−1,ǫ k2 ηk G + ≤ 2 2ηk t
Since F (x†k−1,ǫ ) = f (x†k−1,ǫ ) + λ[c(x†k−1,ǫ )]+ = f (x†k−1,ǫ ) F (b xk ) = f (b xk ) + λ[c(b xk )]+ Then f (b xk ) + λ[c(b xk )]+ −
f (x†k−1,ǫ )
! ¯2 kxk−1 − x†k−1,ǫ k2 ηk G ≤ + 2 2ηk t {z } | Bt
Then
f (b xk ) + λ[c(b xk )]+ ≤ f (x†k−1,ǫ ) + Bt Then λρkb xk − xk k2 ≤ f (x†k−1,ǫ ) − f (b xk ) + Bt ≤ f (x†k−1,ǫ ) − f (xk ) + f (xk ) − f (b xk ) + Bt Assume f (xk ) − f∗ > ǫ (otherwise the proof is done), thus f (x†k−1,ǫ ) ≤ f (xk ). Then b k k + Bt λρkb xk − xk k ≤ Gkxk − x 19
Yang Lin Zhang
leading to kb xk − xk k ≤
Bt λρ − G
Then f (xk ) − f (x†k−1,ǫ ) ≤ f (xk ) − f (b xk ) + f (b xk ) − f (x†k−1,ǫ )
! ¯2 kxk−1 − x†k−1,ǫ k2 ηk G + 2 2ηk t ! ! ¯2 ¯2 Dǫ2 (f (xk−1 ) − f (x†k−1,ǫ ))2 σ 2 (f (xk−1 ) − f (x†k−1,ǫ ))2 ηk G ηk G ≤p + + 2 2ηk tǫ2 2 2ηk tǫ2(1−θ))
λρ Bt = p ≤ Gkb xk − xk k + Bt = λρ − G ≤p
where the third inequality uses Lemma 2 and the last inequality uses the local error bound condition. Since we assume f (xk−1 ) − f∗ ≤ ǫk−1 + ǫ. Thus f (xk−1 ) − f (x†k−1,ǫ ) ≤ ǫk−1 . Then f (xk ) − f (x†k−1,ǫ ) ≤ By noting the values of ηk =
ǫk−1 ¯2 2pG
and t =
¯2 pσ 2 ǫ2k−1 pηk G + 2 2ηk tǫ2(1−θ))
¯2 4σ2 p2 G , ǫ2(1−θ)
f (xk ) − f (x†k−1,ǫ ) ≤
we have
ǫk−1 ǫk−1 + = ǫk . 4 4
Therefore f (xk ) − f∗ ≤ ǫk + ǫ due to the assumption f (xk−1 ) ≥ f∗ + 2ǫ and f (x†k−1,ǫ ) = ǫ. By induction, we therefore show that with at most K = log2 (ǫ0 /ǫ) epochs, we have f (xK ) − f∗ ≤ ǫK + ǫ ≤ 2ǫ
Appendix H. Proof of Theorem 6 Following a similar analysis and using the convergence of NAG and Proposition 1, we have f (xk ) −
f (x†k−1,ǫ )
≤ p γk ln 2 +
λ2 G2c kxk−1 − x†k−1,ǫ k22 2γk t2k
+
(Lf + λLc )kxk−1 − x†k−1,ǫ k22 t2k
!
By using Lemma 2 and the local error bound condition, we have λ2 G2c σ 2 ǫ2k−1 (Lf + λLc )σ 2 ǫ2k−1 † f (xk ) − f (xk−1,ǫ ) ≤ p γk ln 2 + + 2γk t2k ǫ2(1−θ) t2k ǫ2(1−θ) √ p ǫ σ 6(Lf + λLc )ǫk−1 } into the above Plugging the values of γk = 6pk−1 ln 2 , tk = ǫ1−θ max{λGc p 18 ln 2, inequality yields ǫk−1 ǫk−1 ǫk−1 + + = ǫk f (xk ) − f (x†k−1,ǫ ) ≤ 6 6 6 Then f (xk ) − f∗ ≤ ǫk + ǫ 20
Convex Constrained Optimization with Reduced Projections and Improved Rates
Therefore the total number of iterations is σ ǫ1−θ
K q X √ 1/2 max λGc p 18 ln 2 log2 (ǫ0 /ǫ), 6(Lf + λLc ) ǫk−1 k=1
!
√ q √ 2ǫ0 σ ≤ 1−θ max λGc p 18 ln 2 log2 (ǫ0 /ǫ), 6(Lf + λLc ) √ ǫ 2−1
21