FINE TUNING NESTEROV'S STEEPEST DESCENT ALGORITHM FOR DIFFERENTIABLE CONVEX PROGRAMMING

CLÓVIS C. GONZAGA¶ AND ELIZABETH W. KARAS§

Abstract. We modify the first order algorithm for convex programming described by Nesterov in his book [5]. In his algorithms, Nesterov makes explicit use of a Lipschitz constant L for the function gradient, which is either assumed known [5] or is estimated by an adaptive procedure [7]. We eliminate the use of L at the cost of an extra imprecise line search, and obtain an algorithm which keeps the optimal complexity properties and also inherits the global convergence properties of the steepest descent method for general continuously differentiable optimization. Besides this, we develop an adaptive procedure for estimating a strong convexity constant for the function. Numerical tests for a limited set of toy problems show an improvement in performance when compared with the original Nesterov's algorithms.

Key words. Unconstrained convex problems, optimal method

AMS subject classifications. 62K05, 65K05, 68Q25, 90C25, 90C60
1. Introduction. We study the nonlinear programming problem

(P)    minimize f(x)
       subject to x ∈ IR^n,

where f: IR^n → IR is convex and continuously differentiable, with a Lipschitz constant L > 0 for the gradient and a convexity parameter µ ≥ 0. It means that for all x, y ∈ IR^n,

(1.1)    ||∇f(x) − ∇f(y)|| ≤ L||x − y||

and

(1.2)    f(x) ≥ f(y) + ∇f(y)^T(x − y) + (µ/2)||x − y||^2.
If µ > 0, the function is said to be strongly convex. Note that µ ≤ L. Assume that the problem admits an optimal solution x* (not necessarily unique) and consider an algorithm which generates a sequence {x_k}, starting at a given point x_0 ∈ IR^n. The complexity results here will consist in establishing bounds for the number of iterations k which ensure f(x_k) − f(x*) ≤ ε, for a given precision ε > 0. The bounds will depend on ||x_0 − x*||, on the Lipschitz constant L and on the strong convexity constant µ (if available). No method for solving (P) based on accumulated first order information can achieve an iteration-complexity bound of less than O(1/√ε) [4, 13]. The steepest descent method with constant step lengths cannot achieve a complexity better than O(1/ε) [5]. Results for higher order methods are discussed in Nesterov and Polyak [9].

¶ Department of Mathematics, Federal University of Santa Catarina. Cx. Postal 5210, 88040-970 Florianópolis, SC, Brazil; e-mail: [email protected]. Research done while visiting LIMOS – Université Blaise Pascal, Clermont-Ferrand, France. This author is supported by CNPq and PRONEX - Optimization.
§ Department of Mathematics, Federal University of Paraná. Cx. Postal 19081, 81531-980 Curitiba, PR, Brazil; e-mail: [email protected]. Research done while visiting the Tokyo Institute of Technology, Global Edge Institute. Supported by CNPq, PRONEX - Optimization and MEXT's program.
Nesterov [5] proves that with insignificant increase in computational effort per iteration, the computational complexity of the steepest descent method is lowered to the optimal value O(1/√ε) for general smooth convex functions. For strongly convex functions, Nesterov's methods are linearly convergent, with a complexity bound of √(L/µ) log(1/ε) (see [5, Thm 2.2.2]). For general convex functions, the bound obtained in this theorem is

(1.3)    f(x_k) − f(x*) ≤ 4L||x_0 − x*||^2/(k + 2)^2 ≈ 4L||x_0 − x*||^2/k^2.
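Bound (1.3) translates directly into an iteration estimate: f(x_k) − f(x*) ≤ ε once (k + 2)^2 ≥ 4L||x_0 − x*||^2/ε. A small sketch of this count (the values of L, ||x_0 − x*|| and ε below are illustrative, not taken from the paper):

```python
import math

def nesterov_iterations(L, R, eps):
    """Smallest k >= 0 with 4*L*R**2/(k + 2)**2 <= eps, i.e. the
    O(1/sqrt(eps)) iteration count implied by the bound (1.3).
    R plays the role of ||x0 - x*||."""
    return max(0, math.ceil(math.sqrt(4.0 * L * R * R / eps)) - 2)

# Reducing eps by a factor of 100 multiplies the count by about 10:
print(nesterov_iterations(100.0, 1.0, 1e-2))
print(nesterov_iterations(100.0, 1.0, 1e-4))
```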
These results were extended to more general convex problems, including problems with so-called "simple constraints" (see, for instance, [2, 6, 7, 8]). In the special case of minimizing a strictly convex quadratic function, conjugate direction methods with exact line searches find the optimal solution in no more than n iterations and satisfy (see [11, p. 170]):

(1.4)    f(x_k) − f(x*) ≤ L||x_0 − x*||^2/(2(2k + 1)^2) ≈ L||x_0 − x*||^2/(8k^2).
Note that this bound is independent of n, and is about 32 times better than the bound (1.3) obtained by Nesterov for general convex functions. Nemirovsky and Yudin [4, Sec 7.2.2] show that no first order method can be more than twice better than conjugate gradients for strongly convex quadratic problems. For non-quadratic convex functions, the same reference [4, Sec 8.2.2] shows that both the Fletcher-Reeves and Polak-Ribière versions of the conjugate gradient method are not optimal. This is done by constructing examples in which, even with exact line searches, both methods need a multiple of 1/ε iterations to achieve an ε accuracy.
Our paper does a fine tuning of Nesterov's algorithm [5, Scheme 2.2.6]. His method relies heavily on the usage of a Lipschitz constant L, which is either given or adaptively estimated as in Nesterov [7] at the cost of extra gradient computations. The methods achieve optimal complexity, and when L is given this is done with the smallest possible computational time per iteration: only one gradient computation and no function evaluations (besides the ones used in the line search, if it is done at all). We trade this economy in function evaluations for an extra inexact line search, eliminating the usage of L and improving the efficiency of each iteration, while keeping the optimal complexity in number of iterations. Nesterov's algorithm takes advantage of a strong convexity parameter µ > 0 whenever one is known. Our best algorithm, presented in Section 4, uses a decreasing sequence of estimates for the parameter µ.
Remark. Estimates of the Lipschitz constant were used by Nesterov [7]. In another context – constrained minimization of non-smooth homogeneous convex functions – Richtárik [12] uses an adaptive scheme for guessing bounds for the distance between a point x_0 and an optimal solution x*. This bound determines the number of subgradient steps needed to obtain a desired precision.
We found no method for guessing the strong convexity constant µ in the literature.
Structure of the method. A detailed explanation of the method is in Nesterov's book [5, Sec. 2.2.1]. Here we give a rough description of its main features. It uses local properties of f(·) (the steepest descent directions) and global properties of convex functions, by generating a sequence of functional upper bounds which are locally imposed on the epigraph of f(·) at each iteration. The functional upper bounds are simple distance functions with the
shape

(1.5)    φ_k(x) = φ*_k + (γ_k/2)||x − v_k||^2,
where φ*_k ∈ IR, γ_k > 0 and v_k ∈ IR^n. Note that these functions are strictly convex, with minimum φ*_k attained at v_k. At the start, v_0 = x_0 is a given point and γ_0 > 0 is given. The method will compute a sequence of points x_k, a sequence of second order constants γ_k ≥ µ such that γ_k → µ, and a sequence of functions φ_k(·) as above. The construction is such that at each iteration φ*_k ≥ f(x_k) (in the present paper, we require φ*_k = f(x_k)). Hence any optimal solution x* satisfies f(x*) ≤ φ*_k ≤ φ_k(x*), and so we can think of φ_k(·) as an upper bound imposed on the epigraph of f(·). As k increases, the functions φ_k become flatter. This is shown in Fig. 1.1.
Fig. 1.1. The mechanics of the algorithm.
Nesterov shows how to construct these functions so that, defining

    λ_k = (γ_k − µ)/(γ_0 − µ),

the following relation holds: f(x_k) ≤ φ_k(·) ≤ λ_k φ_0(·) + (1 − λ_k)f(·). With this property, an immediate examination of an optimal solution results in

    f(x_k) ≤ φ_k(x*) ≤ λ_k φ_0(x*) + (1 − λ_k)f(x*) = f(x*) + λ_k(φ_0(x*) − f(x*)),

and we see that f(x_k) → f(x*) with the speed with which λ_k → 0. Nesterov's method ensures λ_k = O(1/k^2), and this means optimal complexity: an error of ε > 0 is achieved in O(1/√ε) iterations [5, Chap. 2]. In the present paper we shall not use these parameters λ_k, dealing directly with the sequences (γ_k). Moreover, the complete complexity proof made in Section 4 is for a modified algorithm based on the construction of an adaptive sequence of strong convexity parameter estimates µ_k ≥ µ such that µ_k → µ.
The practical efficiency of the method depends on how much each iteration explores the local properties around each iterate x_k to push λ_k down. Of course, the worst case complexity cannot be improved, but the speed can be improved in practical problems. This is the goal of this paper.
Structure of the paper. In Section 2 we present a prototype algorithm, its main variants, and we describe all variables, functions and parameters needed in the analysis.
Section 3 discusses some details on the implementation and on simplifications of the methods in several different environments. Section 4 describes the modified algorithm based on adaptive estimates of the strong convexity parameter and does a complete complexity analysis, proving that Nesterov's complexity bound is preserved. Finally, Section 5 compares numerically the performance of the algorithms. The numerical results show that our modifications improve the performance of Nesterov's algorithms in most examples in a set of toy problems, mainly when no good bound for L is known.

2. Description of the main algorithms. In this section we present a prototype algorithm which includes the choice of two parameters (α_k and θ_k) in each iteration: different choices of these parameters give different algorithms.

2.1. The algorithm. Here we state the prototype algorithm. At this moment it is not possible to understand the meaning of each procedure: this will be the subject of the rest of this section. Consider the problem (P). Here we assume that the strong convexity parameter µ is given (possibly null). In Section 4 we modify the algorithm to use an adaptive decreasing sequence of estimates for µ.

Algorithm 2.1. Prototype algorithm.
Data: x_0 ∈ IR^n, v_0 = x_0, γ_0 > µ.
k = 0
repeat
    d_k = v_k − x_k.
    Choose θ_k ∈ [0, 1].
    y_k = x_k + θ_k d_k.
    If ∇f(y_k) = 0, then stop with y_k as an optimal solution.
    Steepest descent step: x_{k+1} = y_k − ν∇f(y_k).
    Choose α_k ∈ (0, 1].
    γ_{k+1} = (1 − α_k)γ_k + α_k µ.
    v_{k+1} = (1/γ_{k+1})((1 − α_k)γ_k v_k + α_k(µy_k − ∇f(y_k))).
    k = k + 1.

A complete proof of the optimal complexity will be done in Section 4, for an algorithm based on adaptive estimates of µ. The proof for a fixed µ is simpler and follows from that one.
The variables associated with each iteration of the algorithm are:
• x_k, v_k ∈ IR^n, with x_0 given and v_0 = x_0.
• α_k ∈ [0, 1).
• γ_k ∈ IR_++: second order constants. γ_0 > µ is given. If L is available, then Nesterov [5, Thm. 2.2.2] suggests the choice γ_0 = L.
• y_k ∈ IR^n: a point in the segment joining x_k and v_k,

    y_k = x_k + θ_k(v_k − x_k).

It is from y_k that a steepest descent step is computed in each iteration to obtain x_{k+1}.
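Algorithm 2.1 can be sketched in code. The instantiation below is not the authors' implementation: it assumes L is known, uses the fixed steepest descent step ν = 1/L, and takes Nesterov's choices (2.15)-(2.16), described in Sec. 2.3, for α_k and θ_k; the test function and constants are purely illustrative.

```python
import math

def prototype(grad, x0, L, mu, gamma0, iters):
    """A sketch of Algorithm 2.1 with Nesterov's parameter choices:
    alpha_k solves (2.15), theta_k is given by (2.16), and the steepest
    descent step uses the fixed length nu = 1/L. Vectors are plain lists;
    grad(x) must return the gradient of f at x."""
    n = len(x0)
    x, v, gamma = list(x0), list(x0), gamma0
    for _ in range(iters):
        # alpha_N: positive root of 2*L*a**2 - (1 - a)*gamma - a*mu = 0  (2.15)
        b = gamma - mu
        alpha = (-b + math.sqrt(b * b + 8.0 * L * gamma)) / (4.0 * L)
        theta = alpha * gamma / (gamma + alpha * mu)                 # (2.16)
        y = [x[i] + theta * (v[i] - x[i]) for i in range(n)]
        g = grad(y)
        x_next = [y[i] - g[i] / L for i in range(n)]   # steepest descent step
        gamma_next = (1.0 - alpha) * gamma + alpha * mu
        v = [((1.0 - alpha) * gamma * v[i] + alpha * (mu * y[i] - g[i])) / gamma_next
             for i in range(n)]
        x, gamma = x_next, gamma_next
    return x

# Toy strongly convex quadratic f(x) = 0.5*x1**2 + 50*x2**2 (L = 100, mu = 1):
grad = lambda z: [z[0], 100.0 * z[1]]
sol = prototype(grad, [1.0, 1.0], L=100.0, mu=1.0, gamma0=100.0, iters=300)
print(sol)  # both coordinates approach the minimizer at the origin
```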
• x_{k+1} = y_k − ν∇f(y_k): next iterate, computed by any efficient line search procedure. We know that if L is known, then the step length ν = 1/L ensures that f(y_k) − f(x_{k+1}) ≥ ||∇f(y_k)||^2/(2L). We require that the line search used must be at least as good as this whenever L is known. An Armijo search with well chosen parameters is usually the best choice, and even if L is not known it ensures a decrease of at least ||∇f(y_k)||^2/(4L), which is also acceptable for the complexity analysis. Hence we shall assume that at all iterations

(2.1)    f(x_{k+1}) ≤ f(y_k) − (1/(4L))||∇f(y_k)||^2.
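An Armijo search of the kind mentioned above can be sketched as follows; the parameter values (σ = 1/2 and the halving factor) are one common choice, not prescribed by the paper.

```python
def armijo_step(f, y, g, t0=1.0, sigma=0.5, beta=0.5, max_tries=50):
    """Backtracking Armijo search along the steepest descent direction -g:
    accept the first t with f(y - t*g) <= f(y) - sigma*t*||g||^2.
    With sigma = 1/2 and an L-Lipschitz gradient, every t <= 1/L is
    accepted, so halving stops at some t >= 1/(2L) and the decrease is
    at least ||g||^2/(4L), matching (2.1)."""
    fy = f(y)
    g2 = sum(gi * gi for gi in g)
    t = t0
    for _ in range(max_tries):
        x_new = [y[i] - t * g[i] for i in range(len(y))]
        if f(x_new) <= fy - sigma * t * g2:
            return x_new, t
        t *= beta
    return x_new, t  # fallback; not reached for smooth f with these parameters

# One-dimensional example: f(x) = 50*x**2 (so L = 100), started at y = [1.0].
f = lambda z: 50.0 * z[0] ** 2
x_new, t = armijo_step(f, [1.0], [100.0])   # gradient at y is [100.0]
print(t, x_new)
```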
The geometry of one iteration of the algorithm is shown in Fig. 2.1 (left): y_k is between x_k and v_k, and x_{k+1} is obtained by a steepest descent step from y_k. The directions (x_{k+1} − y_k) and (v_{k+1} − v_k) are both collinear with ∇f(y_k).
Fig. 2.1. One iteration of the algorithm
Choice of the parameters: different variants of the algorithm stem from different choices of the parameters α_k and θ_k. The study of efficient choices will be the main task of this paper. The choices made by Nesterov and by us will be described in Sec. 2.3, after a technical section on the construction of the functions φ_k(·).

2.2. Construction of the functions φ_k(·). Each iteration starts with the simple quadratic function φ_k(·) as defined in (1.5) (we call "simple" a quadratic function whose Hessian has the shape tI, with t ≥ 0). A new simple quadratic φ_{k+1}(·) is constructed as a convex combination of φ_k(·) and a quadratic approximation ℓ(·) of f(·) around y_k. This is shown in Fig. 2.1 (right) for a one-dimensional problem. Nesterov chooses this combination so that the minimum value φ*_{k+1} of φ_{k+1}(·) satisfies φ*_{k+1} ≥ f(x_{k+1}) ≥ f(x*). We shall require that φ*_{k+1} = f(x_{k+1}).
Now we quote Nesterov [5] to describe the construction of the functions φ_k(·). We start from given x_k, v_k, γ_k, φ_k(·). At this point we assume that y_k is also given (its calculation will be one of the tasks of this paper). Assume also that x_{k+1} has been computed and that (2.1) holds. Once y_k and x_{k+1} are given, we must compute α_k. Here we change the notation to include α as a variable and define the function

    (α, x) ∈ IR × IR^n ↦ φ_{k+1}(α, x).
The function φ_{k+1}(·, ·) is an affine combination of φ_k(·) and a quadratic approximation ℓ_k(·) of f(·) around y_k:

(2.2)    φ_{k+1}(α, x) = (1 − α)φ_k(x) + αℓ_k(x),

where

    ℓ_k(x) = f(y_k) + ∇f(y_k)^T(x − y_k) + (µ/2)||x − y_k||^2.
Remark. In this subsection we shall not require ℓ_k(·) to be a lower quadratic approximation, i.e., µ > 0 may not be a strong convexity constant.
If µ > 0, then ℓ_k(·) is minimized at x̂ = y_k − ∇f(y_k)/µ, with

(2.3)    min_{x∈IR^n} ℓ_k(x) = ℓ_k(x̂) = f(y_k) − ||∇f(y_k)||^2/(2µ),

and ℓ_k(·) may be expressed as ℓ_k(x) = ℓ_k(x̂) + (µ/2)||x − x̂||^2. We shall assume that ℓ_k(·) is not constant, otherwise y_k would be an optimal solution for the problem (P). Define for α ∈ IR,
(2.4)    φ*_{k+1}(α) = inf_{x∈IR^n} φ_{k+1}(α, x) ∈ IR ∪ {−∞}

and

(2.5)    γ_{k+1}(α) = (1 − α)γ_k + αµ.
The function φ_{k+1}(α, ·) is also a simple quadratic function, with Hessian γ_{k+1}(α)I. If γ_{k+1}(α) > 0, then φ*_{k+1}(α) is finite. If γ_{k+1}(α) < 0, then φ*_{k+1}(α) = −∞. The fact below follows trivially from (2.5):
Fact 2.2. For γ_k > µ ≥ 0, {α ∈ IR | γ_{k+1}(α) > 0} = (−∞, α_max), with α_max = γ_k/(γ_k − µ).
So φ*_{k+1}(α) is well defined for any α < α_max, and φ*_{k+1}(α) = −∞ for α > α_max because in this case γ_{k+1}(α) < 0. Here we shall look at what happens for α = α_max and isolate a simple situation where φ_k and ℓ_k have coincident minimizers.
Lemma 2.3. Consider γ_k > µ and α_max = γ_k/(γ_k − µ). Then one and only one of the following situations occurs:
(i) v_k is a minimizer of ℓ_k and φ*_{k+1} is linear on (−∞, α_max], with φ*_{k+1}(α) = (1 − α)φ*_k + αℓ_k(v_k);
(ii) φ*_{k+1}(α_max) = −∞.
Proof. Let us assume that inf_{x∈IR^n} φ_{k+1}(α_max, x) is finite. By construction, φ_{k+1}(α_max, ·) is affine, because γ_{k+1}(α_max) = 0; since its infimum is finite, φ_{k+1}(α_max, ·) = (1 − α_max)φ_k(·) + α_max ℓ_k(·) is constant. If µ = 0, then α_max = 1, ℓ_k(·) is constant and obviously v_k is a minimizer of ℓ_k.
On the other hand, if µ > 0, we can write ℓ_k(x) = ℓ_k(x̂) + (µ/2)||x − x̂||^2, where x̂ is the minimizer of ℓ_k(·). Differentiating the expression for φ_{k+1}(α_max, ·) in relation to x, we have for any x ∈ IR^n,
    (1 − α_max)γ_k(x − v_k) + α_max µ(x − x̂) = 0.

If, by contradiction, v_k ≠ x̂, then substituting x = v_k we get α_max µ = 0, which cannot be true, proving that v_k = x̂. So, in any case, v_k is a minimizer of both φ_k(·) and ℓ_k(·), and for any α ∈ (−∞, α_max], φ*_{k+1}(α) = (1 − α)φ*_k + αℓ_k(v_k), which is linear, completing the proof.

Lemma 2.4. For any α ∈ (−∞, α_max), the function x ↦ φ_{k+1}(α, x) is a simple quadratic function given by

(2.6)    φ_{k+1}(α, x) = φ*_{k+1}(α) + (γ_{k+1}(α)/2)||x − v_{k+1}(α)||^2,

with

(2.7)    φ*_{k+1}(α) = f(y_k) − (α^2/(2γ_{k+1}(α)))||∇f(y_k)||^2
                + (1 − α)[φ*_k − f(y_k) + (αγ_k/γ_{k+1}(α))((µ/2)||y_k − v_k||^2 + ∇f(y_k)^T(v_k − y_k))],

(2.8)    v_{k+1}(α) = (1/γ_{k+1}(α))((1 − α)γ_k v_k + α(µy_k − ∇f(y_k)))

and

(2.9)    γ_{k+1}(α) = (1 − α)γ_k + αµ.

In particular, if µ > 0 then

(2.10)    φ*_{k+1}(1) = f(y_k) − ||∇f(y_k)||^2/(2µ).
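The formulas (2.6)-(2.9) lend themselves to a quick numerical check: the combination (2.2) of the simple quadratic φ_k and the model ℓ_k must again be a simple quadratic whose coefficient and minimizer are given by (2.9) and (2.8). A sketch with arbitrary, purely illustrative data:

```python
def simplicity_gap(alpha, gamma_k, mu, phi_star, v_k, y_k, grad_y, f_y):
    """Returns |phi_{k+1}(alpha, probe) - phi_{k+1}(alpha, v_next)
    - (gamma_next/2)*||probe - v_next||^2| at a probe point; this gap
    vanishes if (2.8)-(2.9) give the minimizer and coefficient of the
    combination (2.2)."""
    n = len(v_k)
    phi_k = lambda x: phi_star + 0.5 * gamma_k * sum((x[i] - v_k[i]) ** 2 for i in range(n))
    ell_k = lambda x: (f_y + sum(grad_y[i] * (x[i] - y_k[i]) for i in range(n))
                       + 0.5 * mu * sum((x[i] - y_k[i]) ** 2 for i in range(n)))
    phi_next = lambda x: (1.0 - alpha) * phi_k(x) + alpha * ell_k(x)
    gamma_next = (1.0 - alpha) * gamma_k + alpha * mu                    # (2.9)
    v_next = [((1.0 - alpha) * gamma_k * v_k[i] + alpha * (mu * y_k[i] - grad_y[i]))
              / gamma_next for i in range(n)]                            # (2.8)
    probe = [0.3, -1.2]
    return abs(phi_next(probe) - phi_next(v_next)
               - 0.5 * gamma_next * sum((probe[i] - v_next[i]) ** 2 for i in range(n)))

err = simplicity_gap(alpha=0.4, gamma_k=2.0, mu=0.5, phi_star=1.0,
                     v_k=[1.0, 0.0], y_k=[0.5, 0.5], grad_y=[1.0, -2.0], f_y=0.7)
print(err)  # ~0 up to floating point roundoff
```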
Proof. The proof is straightforward and done in detail in Nesterov's book [5].
We now use Danskin's theorem in the format presented in Bertsekas [1] to prove that φ*_{k+1}(·) is concave.
Lemma 2.5. The function α ∈ IR ↦ φ*_{k+1}(α) constructed above is concave in IR and differentiable in (−∞, α_max).
Proof. If φ*_{k+1}(α_max) is finite, then Lemma 2.3 ensures that φ*_{k+1}(·) is linear, and hence this case is proved. Assume then that φ*_{k+1}(α_max) = −∞. We must prove that φ*_{k+1}(·) is concave and differentiable in (−∞, α_max). We will prove the concavity in an arbitrary interval [α_1, α_2] ⊂ (−∞, α_max). For each α ∈ [α_1, α_2], φ_{k+1}(α, ·) has a unique minimizer v_{k+1}(α). By continuity of v_{k+1}(·) in (2.8), v_{k+1}([α_1, α_2]) is bounded, i.e., there exists a compact set V ⊂ IR^n such that

    φ*_{k+1}(α) = min_{x∈V} φ_{k+1}(α, x).

So, the hypotheses for Danskin's theorem are satisfied:
• φ_{k+1}(·, x) is concave (in fact linear) for each x ∈ IR^n,
• φ_{k+1}(α, ·) has a unique minimizer in the compact set V for each α ∈ [α_1, α_2].
We conclude that φ*_{k+1}(·) is concave and differentiable in [α_1, α_2]. Since [α_1, α_2] was taken arbitrarily, this proves the concavity and differentiability in (−∞, α_max), completing the proof.
Setting y_k = x_k + θ_k d_k, with d_k = v_k − x_k and θ_k ∈ [0, 1], (2.7) may be written as

(2.11)    φ*_{k+1}(α) = f(y_k) − (α^2/(2γ_{k+1}))||∇f(y_k)||^2 + (1 − α)ζ(α, θ_k)

with

(2.12)    ζ(α, θ) = φ*_k − f(x_k + θd_k) + (αγ_k/((1 − α)γ_k + αµ))((µ/2)(1 − θ)^2||d_k||^2 + (1 − θ)f'(x_k + θd_k, d_k)).
Lemma 2.6. Assume that f(x_k) ≤ φ*_k, y_k = x_k + θ_k d_k with d_k = v_k − x_k and θ_k ∈ [0, 1]. Assume also that x_{k+1} is obtained by a steepest descent step from y_k, satisfying (2.1), where the Lipschitz constant L is possibly infinite. Then

(2.13)    φ*_{k+1}(α) ≥ f(x_{k+1}) + (1/(4L) − α^2/(2γ_{k+1}))||∇f(y_k)||^2 + (1 − α)ζ(α, θ_k).

Furthermore,

(2.14)    ζ(α, θ) ≥ (−θ + (αγ_k/((1 − α)γ_k + αµ))(1 − θ)) f'(x_k + θd_k, d_k).

Proof. The expression (2.13) follows directly from (2.11) and the fact that the steepest descent step satisfies (2.1). Using the facts that f(x_k) ≤ φ*_k and µ ≥ 0 in the definition of ζ(·, ·), we can write

    ζ(α, θ) ≥ f(x_k) − f(x_k + θd_k) + (αγ_k/((1 − α)γ_k + αµ))(1 − θ)f'(x_k + θd_k, d_k).

By the convexity of f(·), we know that f(x_k) − f(x_k + θd_k) ≥ −θf'(x_k + θd_k, d_k), and so

    ζ(α, θ) ≥ (−θ + (αγ_k/((1 − α)γ_k + αµ))(1 − θ)) f'(x_k + θd_k, d_k),

completing the proof.

2.3. Choice of the parameters. In this section we assume that µ ≥ 0 is a strong convexity constant for f(·). We present two different schemes for choosing the parameters θ_k and α_k in Algorithm 2.1. For both schemes we have two main objectives:
• Prove that for all choices of θ_k, it is always possible to compute α_k such that φ*_{k+1}(α_k) = f(x_{k+1});
• Prove that in any case α_k is large, i.e., α_k = Ω(√γ_{k+1}).
We begin by describing Nesterov's choice.
Nesterov's choice [5, Scheme 2.2.6]. When a Lipschitz constant L is known, Nesterov's choices are:
• α_N is computed so that the central term in (2.13) is null. So, α_N is the positive solution of the second degree equation

(2.15)    2Lα^2 − (1 − α)γ_k − αµ = 0.

• θ_N is so that the right hand side of (2.14) is null:

(2.16)    θ_N = α_N γ_k/(γ_k + α_N µ).

Note: If L is unknown but exists, α_N and θ_N are well defined (but unknown). If L does not exist, we set α_N = 0, θ_N = 0.
Nesterov's original method sets α_k = α_N and θ_k = θ_N, obtaining directly from (2.13) and ζ(α_N, θ_N) = 0 that

(2.17)    φ*_{k+1}(α_N) ≥ f(x_{k+1}) and α_N = √(γ_{k+1}/(2L)).

Our choice.
• Compute θ_k ∈ [0, 1] such that

(2.18)    f(x_k + θ_k d_k) ≤ f(x_k) and (θ_k = 1 or f'(x_k + θ_k d_k, d_k) ≥ 0).
These are an Armijo condition with parameter 0 and a Wolfe condition with parameter 0.
• Compute α_k ∈ [0, 1] as the largest root of the equation

(2.19)    Aα^2 + Bα + C = 0

with

    G = γ_k((µ/2)||v_k − y_k||^2 + ∇f(y_k)^T(v_k − y_k)),
    A = G + (1/2)||∇f(y_k)||^2 + (µ − γ_k)(f(x_k) − f(y_k)),
    B = (µ − γ_k)(f(x_{k+1}) − f(x_k)) − γ_k(f(y_k) − f(x_k)) − G,
    C = γ_k(f(x_{k+1}) − f(x_k)).
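Extracting α_k from (2.19) is an elementary quadratic solve; a sketch (the guards for A = 0 and for a negative discriminant are our additions to make the helper total; the sample coefficients are illustrative):

```python
import math

def largest_root_alpha(A, B, C):
    """Largest real root of A*alpha**2 + B*alpha + C = 0, as needed for
    alpha_k in (2.19). Returns None when no real root exists (a case the
    adaptive method of Section 4 must handle)."""
    if A == 0.0:
        return -C / B
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    return max((-B + root) / (2.0 * A), (-B - root) / (2.0 * A))

# Illustrative coefficients: 2*alpha**2 - 3*alpha + 1 has roots 0.5 and 1.0.
print(largest_root_alpha(2.0, -3.0, 1.0))  # 1.0
```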
As we will see, this equation comes from the equality φ*_{k+1}(α) = f(x_{k+1}). The next lemma uses the definition of α_N to show under what conditions we can compute α_k ≥ α_N such that this equality holds.
By construction, we have

(2.20)    φ*_{k+1}(0) = φ*_k ≥ f(x_k),

(2.21)    φ*_{k+1}(1) = inf_{x∈IR^n} ℓ_k(x) ≤ f(x*) ≤ f(x_{k+1}).
Let us first eliminate a trivial case: due to (2.21), if φ∗k+1 (1) = f (xk+1 ) then xk+1 is an optimal solution, and we may terminate the algorithm. Let us assume that φ∗k+1 (1) < f (xk+1 ).
The following lemma does not need the assumption that µ > 0 is a strong convexity constant.
Lemma 2.7. Assume that φ*_k > f(x_{k+1}) > inf_{x∈IR^n} ℓ_k(x). Then φ*_{k+1}(α) = f(x_{k+1}) has one or two real roots, and the largest root belongs to [0, 1]. The roots are computed by solving the second degree equation (2.19).
Proof. Let ᾱ be the largest α ∈ [0, 1) such that φ*_{k+1}(ᾱ) = max_{α∈[0,1]} φ*_{k+1}(α) (which exists because φ*_{k+1}(·) is concave and continuous in [0, 1), with φ*_{k+1}(1) < φ*_{k+1}(ᾱ)). Since φ*_{k+1}(1) < φ*_{k+1}(ᾱ) and f(x_{k+1}) ≥ φ*_{k+1}(1), the concave function decreases from ᾱ to 1, and crosses the value f(x_{k+1}) exactly once in [ᾱ, +∞), at some point α_0 ∈ [0, 1]. Setting φ*_{k+1}(α) = f(x_{k+1}) in (2.7) and multiplying both sides of the equation by γ_{k+1}(α) = (1 − α)γ_k + αµ, we obtain a second degree equation in α. The largest root of this equation must be α_0, completing the proof.
Lemma 2.8. Consider an iteration k of Algorithm 2.1. Assume that x_k is not a minimizer of f and φ*_{k+1}(1) < f(x_{k+1}). Assume that y_k = x_k + θ_k d_k is chosen so that ζ(α_N, θ_k) ≥ 0. Then there exists a value α_k ∈ [α_N, 1] that solves the equation φ*_{k+1}(α_k) = f(x_{k+1}) in (2.7).
Proof. Assume that α_N and y_k = x_k + θ_k d_k have been computed as in the lemma. Since x_k is not a minimizer of f, by the convexity of f and the definition of ℓ_k(·),

    φ*_k ≥ f(x_k) > f(x*) ≥ ℓ_k(x*) ≥ inf_{x∈IR^n} ℓ_k(x).

On the other hand,

    f(x_{k+1}) > φ*_{k+1}(1) = inf_{x∈IR^n} ℓ_k(x).

Using Lemma 2.6 with α = α_N and ζ(α_N, θ_k) ≥ 0, we get from (2.13) that φ*_{k+1}(α_N) ≥ f(x_{k+1}). Now from Lemma 2.7 the equation φ*_{k+1}(α) = f(x_{k+1}) has a real root, and the largest root α_0 is in [0, 1]. Note that α_0 ≥ α_N, because α_0 is the largest root of the concave function φ*_{k+1}(·) − f(x_{k+1}).
We conclude from this lemma that the equality φ*_{k+1}(α_k) = f(x_{k+1}) can always be achieved whenever ζ(α_N, θ_k) ≥ 0.
Theorem 2.9. Algorithm 2.1 with either of the choices (2.16) or (2.18) for θ_k is well defined and generates sequences {x_k} and {φ*_k} such that for all k = 0, 1, . . ., φ*_k ≥ f(x_k). Moreover, the parameters α_k and γ_k satisfy

(2.22)    α_k ≥ √(γ_{k+1}/(2L)).

Proof. (i) Let us first prove that both versions of the algorithm are well defined:
Nesterov: immediate. Taking θ_k = θ_N we obtain by definition of α_N and θ_N that ζ(α_N, θ_N) = 0, and Lemma 2.8 ensures that α_k ∈ [α_N, 1) can be computed as desired.
Our method: since f(x_k) − f(x_k + θ_k d_k) ≥ 0 and f'(x_k + θ_k d_k, d_k) ≥ 0 by construction, then ζ(α, θ_k) ≥ 0 for any α ∈ [0, 1], and Lemma 2.8 can be used similarly.
(ii) In both cases we obtain ζ(α_k, θ_k) ≥ 0. Hence from (2.13),

    f(x_{k+1}) = φ*_{k+1}(α_k) ≥ f(x_{k+1}) + (1/(4L) − α_k^2/(2γ_{k+1}))||∇f(y_k)||^2,
which immediately gives (2.22), completing the proof.
We conclude from this that Nesterov's algorithm, which uses the value θ_k = θ_N with α_k = α_N, can be improved by re-calculating α_k using (2.7) (and keeping θ_k = θ_N).
Remark. Even computing the best value for α_k, the sequence {f(x_k)} is not necessarily decreasing.
Remark. In the case µ = 0, the expressions get simplified. Then α_N is the positive solution of Lα_k^2 = (1 − α_k)γ_k, and expression (2.14) is reduced to

    ζ(α, θ) ≥ ((α − θ)/(1 − α)) f'(x_k + θd_k, d_k).

In this case, θ_N = α_N.

3. More on the choice of θ_k. Our choice of θ_k uses a line search along d_k. In this section we discuss the amount of computation needed in this line search, and show that when L and µ are known it may be done in polynomial time.
We conclude from the preceding analysis that what must be shown for a given choice of y_k = x_k + θ_k d_k is that ζ(α_N, θ_k) ≥ 0. The following lemma summarizes the conditions that ensure this property.
Lemma 3.1. Let y_k = x_k + θ_k d_k be the choice made by Algorithm 2.1 in iteration k, and let θ_N be the Nesterov choice (2.16). Assume that one of the following conditions is satisfied:
(i) f(y_k) ≤ f(x_k) and f'(y_k, d_k) ≥ 0;
(ii) f(y_k) ≤ f(x_k) and θ_k ≥ θ_N;
(iii) f'(y_k, d_k) ≥ 0 and θ_k ≤ θ_N;
(iv) f'(x_k, d_k) ≥ −(µ/2)||d_k||^2 and θ_k = 0.
Then ζ(α_N, θ_k) ≥ 0 (and α_k ≥ α_N may be computed using (2.7)).
Proof. (i) It has been proved in Theorem 2.9.
Let us substitute θ_k = θ_N + ∆θ in (2.14), with γ_N = (1 − α_N)γ_k + α_N µ:

    ζ(α_N, θ_k) ≥ (−θ_N + (α_N γ_k/γ_N)(1 − θ_N) − ∆θ − (α_N γ_k/γ_N)∆θ) f'(x_k + θ_k d_k, d_k).

By definition of θ_N, −θ_N + α_N γ_k(1 − θ_N)/γ_N = 0. Hence

(3.1)    ζ(α_N, θ_k) ≥ −(1 + α_N γ_k/γ_N) ∆θ f'(x_k + θ_k d_k, d_k).

(ii) Assume that f(y_k) ≤ f(x_k) and θ_k ≥ θ_N. If f'(y_k, d_k) ≤ 0, then the result follows from (3.1) because ∆θ ≥ 0. Otherwise, it follows from (i).
(iii) Assume now that f'(y_k, d_k) ≥ 0 and θ_k ≤ θ_N. Then the result follows from (3.1) because ∆θ ≤ 0.
(iv) This case is also trivial by (2.12), completing the proof.
All we need is to specify in Algorithm 2.1 the choice of θ_k and α_k as follows:

    Choose θ_k ∈ [0, 1] satisfying one of the conditions in Lemma 3.1.
    Compute α_k such that φ*_{k+1} = f(x_{k+1}) using (2.7).

We must discuss the computation of θ_k, which requires a line search. Of course, if L is given, we can use θ_k = θ_N and ensure the optimal complexity, but larger values of α_k may be obtained.
We would like to obtain θ_k (and then α_k) such that ζ(α_k, θ_k) is as large as possible. Since the determination of α_k depends on the steepest descent step from y_k = x_k + θ_k d_k, we cannot maximize ζ(·, ·). But positive values of ζ(α, θ_k) will be obtained if θ_k is such that f(x_k) − f(x_k + θ_k d_k) and f'(x_k + θ_k d_k, d_k) are made as large as possible (of course, these are conflicting objectives).
• We define (but do not compute) two points:

(3.2)    θ' ∈ argmin{f(x_k + θd_k) | θ ≥ 0},

(3.3)    θ'' = max{θ ∈ [0, 1] | f(x_k + θd_k) ≤ f(x_k)}.

Let us examine two important cases:
• If f'(x_k, d_k) ≥ 0, then we should choose θ_k = 0: a steepest descent step from x_k follows.
• If f(x_k + d_k) ≤ f(x_k), then we may choose θ_k = 1: a steepest descent step from v_k follows.
If none of these two special situations occur, then 0 < θ' < θ'' < 1. Any point θ_k ∈ [θ', θ''] gives ζ(α, θ_k) ≥ 0 for any α ∈ [0, 1], and y_k = x_k + θ_k d_k satisfies condition (i) of Lemma 3.1. A point in this interval can be computed by the interval reduction algorithm described in the Appendix.
We have no complexity estimates for a line search procedure such as this if L and µ are not available. When these constants are known, it is possible to limit the number of iterations in the line search to keep Nesterov's complexity. We shall discuss briefly the choice of θ_k in three possible situations, according to our knowledge of the constants.

Case 1. No knowledge of L or µ. We start the iteration checking the special cases above: if f(x_k + d_k) ≤ f(x_k), then we choose θ_k = 1: a steepest descent step from v_k will follow. Otherwise, we must decide whether f'(x_k, d_k) ≥ 0. This may be done as follows:

    Compute f(x_k + θ̃d_k) for a small value of θ̃ > 0.
    If f(x_k + θ̃d_k) ≥ f(x_k), compute ∇f(x_k) and then f'(x_k, d_k) = ∇f(x_k)^T d_k.
    Else f'(x_k, d_k) < 0.

If f'(x_k, d_k) ≥ 0, set θ_k = 0. Else, do a line search as in the Appendix, starting with the points 0, θ̃ and 1.

Case 2: L is given. In this case we start the search with θ_N computed as in (2.16). The line search can be abbreviated to keep the number of function calculations within a given bound, resulting in a step satisfying one of the first three conditions in Lemma 3.1. There are two cases to consider, depending on the sign of f(x_k) − f(x_k + θ_N d_k).
• f(x_k + θ_N d_k) < f(x_k): Condition (ii) of Lemma 3.1 is satisfied for any θ_k ≥ θ_N such that f(x_k + θ_k d_k) ≤ f(x_k). We may use an interval reduction method as in the Appendix, or simply take steps to increase θ. A trivial method is the following, using a constant β > 1:

    θ_k = θ_N
    while βθ_k ≤ 1 and f(x_k + βθ_k d_k) ≤ f(x_k), set θ_k = βθ_k.

In any method, the number of function calculations may be limited by a given constant, adopting as θ_k the largest computed step length which satisfies the descent condition.
• f(x_k + θ_N d_k) > f(x_k): Condition (iii) of Lemma 3.1 holds for θ_N, and we intend to reduce the value of θ. For an interval reduction search between 0 and θ_N we need
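The trivial step-increasing method above can be sketched as follows; the evaluation cap max_evals and the one-dimensional test function are illustrative, not part of the paper:

```python
def expand_theta(f, x, d, theta_N, beta=2.0, max_evals=10):
    """Step-increasing search from Case 2: starting at Nesterov's theta_N,
    multiply theta by beta > 1 while theta stays in [0, 1] and the descent
    condition f(x + theta*d) <= f(x) holds; the number of extra function
    evaluations is capped by max_evals."""
    fx = f(x)
    theta = theta_N
    for _ in range(max_evals):
        trial = beta * theta
        if trial > 1.0:
            break
        if f([x[i] + trial * d[i] for i in range(len(x))]) > fx:
            break
        theta = trial
    return theta

# Toy 1-D example: along d, f decreases up to theta = 0.8 and then increases.
f = lambda z: (z[0] - 0.8) ** 2
theta = expand_theta(f, [0.0], [1.0], theta_N=0.1)
print(theta)  # 0.1 -> 0.2 -> 0.4 -> 0.8; the next doubling would leave [0, 1]
```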
an intermediate point ν such that f(x_k + νd_k) ≤ f(x_k), if it exists. We propose the following procedure: compute f(x_k + νd_k) for ν

Case 3: L and µ > 0 are given. This is a special case of Case 2, with an interesting property: a point satisfying condition (i) in Lemma 3.1 can be computed in polynomial time without computing ∇f(x_k), using the next lemma.
Lemma 3.2. Consider θ' = argmin{f(x_k + θd_k) | θ ≥ 0} and θ'' > θ' such that f(x_k + θ''d_k) = f(x_k), if it exists. Define θ̄ = µ/(2L).
If f(x_k + θ̄d_k) ≤ f(x_k) − (µ^2/(8L))||d_k||^2, then θ' ≥ θ̄ and θ'' − θ' ≥ θ̄;
otherwise f'(x_k, d_k) ≥ −(µ/2)||d_k||^2.
Proof. Assume that f(x_k + θ'd_k) ≤ f(x_k + θ̄d_k) ≤ f(x_k) − (µ^2/(8L))||d_k||^2. Using the Lipschitz condition at θ' with f'(x_k + θ'd_k, d_k) = 0, we get for θ ∈ IR,

    f(x_k + θd_k) ≤ f(x_k + θ'd_k) + (L/2)(θ − θ')^2||d_k||^2.

Using the assumption,

    f(x_k + θd_k) ≤ f(x_k) + (L(θ − θ')^2 − µ^2/(4L))||d_k||^2/2.

For θ = 0, this gives Lθ'^2 − µ^2/(4L) ≥ 0, or equivalently θ'^2 ≥ θ̄^2, proving the first inequality. For θ = θ'', f(x_k + θ''d_k) = f(x_k), and we obtain immediately (θ'' − θ')^2 ≥ θ̄^2.
Assume now that f(x_k + θ̄d_k) > f(x_k) − (µ^2/(8L))||d_k||^2. By the Lipschitz condition for ∇f(·), we know that

    f(x_k + θ̄d_k) ≤ f(x_k) + f'(x_k, d_k)θ̄ + (L/2)θ̄^2||d_k||^2.

Assuming by contradiction that f'(x_k, d_k) < −(µ/2)||d_k||^2, we obtain

    f(x_k + θ̄d_k) < f(x_k) − (µ/2)θ̄||d_k||^2 + (L/2)θ̄^2||d_k||^2
                  = f(x_k) − (µ^2/(4L))||d_k||^2 + (µ^2/(8L))||d_k||^2
                  = f(x_k) − (µ^2/(8L))||d_k||^2,

contradicting the hypothesis and completing the proof.
In this case we may use a golden search, as follows:

    If f(x_k + θ̄d_k) ≥ f(x_k) − (µ^2/(8L))||d_k||^2, set θ_k = 0 and stop (condition (iv) in Lemma 3.1 holds).
    If f(x_k + d_k) ≤ f(x_k), set θ_k = 1 and stop (condition (ii) in Lemma 3.1 holds).
    (Now we know that θ' ≤ θ'' − µ/(2L) < 1.)
    Use an interval reduction search as in the Appendix in the interval [θ̄, 1].
If we use a golden section search, then at each iteration j before the stopping condition is met, we have A ≤ θ' ≤ θ'' ≤ B, and the interval length satisfies B − A ≤ β^j, with β = (√5 − 1)/2 ≈ 0.62. Using the lemma above, a point B ∈ [θ', θ''] will have been found when β^j ≤ µ/(2L). Hence the number of iterations of the search will be bounded by

    j_max = log(µ/(2L))/log β.
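A quick numerical check of this budget (the constants L = 100 and µ = 1 are illustrative):

```python
import math

def golden_search_budget(L, mu):
    """Number of golden-section iterations j needed so that the interval
    length beta**j falls below theta_bar = mu/(2L), beta = (sqrt(5)-1)/2."""
    beta = (math.sqrt(5.0) - 1.0) / 2.0
    return math.ceil(math.log(mu / (2.0 * L)) / math.log(beta))

j = golden_search_budget(100.0, 1.0)
beta = (math.sqrt(5.0) - 1.0) / 2.0
print(j)
print(beta ** j <= 1.0 / 200.0)  # the interval is now shorter than mu/(2L)
```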
This provides a bound on the computational effort per iteration of O(log(2L/µ)) function evaluations.

4. Adaptive convexity parameter µ. In computational tests (see next section) we see that the use of a strong convexity constant is very effective in reducing the number of iterations. But very seldom is such a constant available. In this section we state an algorithm which uses a decreasing sequence of estimates for the value of the constant µ. We assume that a strong convexity parameter µ* (possibly null) is given, and the method will generate a sequence µ_k → µ*, starting with µ_0 ∈ [µ*, γ_0). We still need an extra hypothesis: that the level set associated with x_0 is bounded. Note that since f(·) is convex, this is equivalent to the hypothesis that the optimal set is bounded.
The algorithm iterations are the same as in Algorithm 2.1, using a parameter µ_k ≥ µ*. The parameter is reduced (we do this by setting µ_k = max{µ*, µ_k/10}) in two situations:
• When γ_k − µ* < β(µ_k − µ*), where β > 1 is fixed (we used β = 1.02), meaning that γ_k is too close to µ_k.
• When it is impossible to satisfy the equation φ*_{k+1}(α) = f(x_{k+1}) for α ∈ [0, 1]. By Lemma 2.7, this can only happen when f(x_{k+1}) ≤ min_{x∈IR^n} ℓ(x). This minimum is given by φ*_{k+1}(1) = f(y_k) − ||∇f(y_k)||^2/(2µ_k), using expression (2.10). So, µ_k cannot be larger than

    µ̃ = ||∇f(y_k)||^2/(2(f(y_k) − f(x_{k+1}))).
If µk > µ̃, we reduce µk.

Notation: Denote φk+1(x) = φk+1(αk, x) and φ∗k+1 = φ∗k+1(αk).

Algorithm 4.1. Algorithm with adaptive convexity parameter.
Data: x0 ∈ IRn, v0 = x0, γ0 > 0, β > 1, µ∗ = µ, µ0 ∈ [µ∗, γ0).
k = 0. Set µ+ = µ0.
repeat
    dk = vk − xk.
    Choose θk ∈ [0, 1] as in Algorithm 2.1.
    yk = xk + θk dk.
    If ∇f(yk) = 0, then stop with yk as an optimal solution.
    Steepest descent step: xk+1 = yk − ν∇f(yk). If L is known, ν ≥ 1/L.
    If (γk − µ∗) < β(µ+ − µ∗), then choose µ+ ∈ [µ∗, γk/β].
    Compute µ̃ = ‖∇f(yk)‖² / (2(f(yk) − f(xk+1))).
    If µ+ ≥ µ̃, then µ+ = max{µ∗, µ̃/10}.
    Compute αk as the largest root of (2.19) with µ = µ+.
    γk+1 = (1 − αk)γk + αk µ+.
    vk+1 = (1/γk+1) ((1 − αk)γk vk + αk(µ+ yk − ∇f(yk))).
    Set µk = µ+.
    k = k + 1.

We suggest γ0 = L if L is known, µ0 = max{µ∗, γ0/100}, β = 1.02. We also suggest the following reduction scheme for µ+, provided that β ≤ 10: µ+ = max{µ∗, γk/10}. Independently of this choice, the following fact holds:

Fact 4.2. For any k ≥ 0, (γk − µ∗) ≥ β(µk − µ∗).

Proof. If at the beginning of an iteration k we have (γk − µ∗) < β(µ+ − µ∗), then we choose µ+ ∈ [µ∗, γk/β] with β > 1. Then β(µ+ − µ∗) = βµ+ − βµ∗ ≤ γk − βµ∗ ≤ γk − µ∗. As µk ≤ µ+, this completes the proof.

We now show the optimality of the algorithm, using an additional hypothesis.

Hypothesis. Given x0 ∈ IRn, assume that the level set associated with x0 is bounded, i.e.,

D = sup{‖x − y‖ | f(x) ≤ f(x0), f(y) ≤ f(x0)} < ∞.

Define Q = D²/2.
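The adaptive update of the estimate µ+ inside an iteration of Algorithm 4.1 combines the two reduction tests described above. A minimal sketch in Python rather than the authors' Matlab; the function and argument names are ours, and the reduction factors follow the suggestions in the text.

```python
def update_mu(mu_plus, gamma_k, grad_norm_yk, f_yk, f_xk1,
              mu_star=0.0, beta=1.02):
    # Test 1: gamma_k is too close to mu_+, so choose mu_+ in [mu_star,
    # gamma_k/beta]; the suggested choice (valid for beta <= 10) is gamma_k/10.
    if gamma_k - mu_star < beta * (mu_plus - mu_star):
        mu_plus = max(mu_star, gamma_k / 10.0)
    # Test 2: mu_+ must stay below mu_tilde, the largest convexity parameter
    # compatible with the descent achieved by the steepest descent step.
    mu_tilde = grad_norm_yk ** 2 / (2.0 * (f_yk - f_xk1))
    if mu_plus >= mu_tilde:
        mu_plus = max(mu_star, mu_tilde / 10.0)
    return mu_plus
```

Note that both tests clamp the new estimate from below by µ∗, so the sequence of estimates never drops below the given strong convexity parameter.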
Lemma 4.3. Consider γ0 > µ∗. Let x∗ be an optimal solution of (P). Then, at all iterations of Algorithm 4.1,

(4.1)  φk(x∗) − f(x∗) ≤ ((γ0 + L)/(γ0 − µ∗)) Q (γk − µ∗).

Proof. First let us prove that (4.1) holds for k = 0. By the definition of φ0 and the convexity of f with Lipschitz constant L for the gradient, we have

φ0(x∗) − f(x∗) = f(x0) − f(x∗) + (γ0/2)‖x∗ − x0‖²
              ≤ ((L + γ0)/2)‖x∗ − x0‖²
              ≤ (L + γ0)Q.

Thus, we can use induction. By (2.2) and (1.2), we have

φk+1(x) = (1 − αk)φk(x) + αk (f(yk) + ∇f(yk)ᵀ(x − yk) + (µk/2)‖x − yk‖²)
        ≤ (1 − αk)φk(x) + αk (f(x) + ((µk − µ∗)/2)‖x − yk‖²).

Thus,

φk+1(x∗) − f(x∗) ≤ (1 − αk)(φk(x∗) − f(x∗)) + αk Q(µk − µ∗).
Using the induction hypothesis and the definition of γk+1, we have

φk+1(x∗) − f(x∗) ≤ (1 − αk)((γ0 + L)/(γ0 − µ∗)) Q(γk − µ∗) + αk Q(µk − µ∗)
                ≤ ((γ0 + L)/(γ0 − µ∗)) Q ((1 − αk)γk + αk µk − µ∗)
                = ((γ0 + L)/(γ0 − µ∗)) Q (γk+1 − µ∗),
completing the proof.

Now we prove a technical lemma, following a very similar result in Nesterov [5, Lemma 2.2.4].

Lemma 4.4. Consider a positive sequence {λk}. Assume that there exists M > 0 such that λk+1 ≤ (1 − M√λk+1) λk. Then, for all k > 0,

λk ≤ 4/(M²k²).

Proof. Denote ak = 1/√λk. Since {λk} is a decreasing sequence, we have

ak+1 − ak = (√λk − √λk+1)/(√λk √λk+1) = (λk − λk+1)/(√λk √λk+1 (√λk + √λk+1)) ≥ (λk − λk+1)/(2λk √λk+1).

Using the hypothesis,

ak+1 − ak ≥ (λk − (1 − M√λk+1)λk)/(2λk √λk+1) = M/2.

Thus ak ≥ a0 + Mk/2 ≥ Mk/2, and consequently λk ≤ 4/(M²k²), completing the proof.
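A quick numerical check of Lemma 4.4 (ours, not from the paper): we generate the extremal sequence satisfying the recurrence with equality and verify the bound.

```python
import math

def worst_case_sequence(lam0, M, n):
    # Generate the sequence satisfying the recurrence of Lemma 4.4 with
    # equality: lam[k+1] = (1 - M*sqrt(lam[k+1])) * lam[k].  Setting
    # s = sqrt(lam[k+1]) gives s^2 + M*lam[k]*s - lam[k] = 0, whose positive
    # root is computed below.
    lam = [lam0]
    for _ in range(n):
        lk = lam[-1]
        s = (-M * lk + math.sqrt((M * lk) ** 2 + 4.0 * lk)) / 2.0
        lam.append(s * s)
    return lam

# Lemma 4.4 asserts lam[k] <= 4/(M^2 k^2); the equality case should meet it.
lam = worst_case_sequence(lam0=10.0, M=0.5, n=100)
bound_ok = all(lam[k] <= 4.0 / (0.5 * k) ** 2 + 1e-12 for k in range(1, 101))
```

The bound holds for any positive starting value λ0, since the proof only uses a0 > 0.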
Now we can discuss how fast (γk − µ∗) goes to zero, in order to estimate the rate of convergence of the algorithm.

Lemma 4.5. Consider γ0 > µ∗. Then, for all k > 0,

(4.2)  γk − µ∗ ≤ min{ (1 − ((β − 1)/β)√(µ∗/(2L)))^k (γ0 − µ∗),  8β²L/((β − 1)²k²) }.

Proof. By the definition of γk+1,

γk+1 − µ∗ = (1 − αk)(γk − µ∗) + αk(µk − µ∗).

By Fact 4.2, (γk − µ∗) ≥ β(µk − µ∗). Thus,

(4.3)  γk+1 − µ∗ ≤ (1 − αk)(γk − µ∗) + (αk/β)(γk − µ∗) = (1 − ((β − 1)/β)αk)(γk − µ∗).

By Theorem 2.9,

(4.4)  αk ≥ √(γk+1/(2L)) ≥ √((γk+1 − µ∗)/(2L)).

Since γk+1 ≥ µ∗, (4.4) also gives αk ≥ √(µ∗/(2L)). Thus,

γk+1 − µ∗ ≤ (1 − ((β − 1)/β)√(µ∗/(2L)))(γk − µ∗),

and hence γk − µ∗ ≤ (1 − ((β − 1)/β)√(µ∗/(2L)))^k (γ0 − µ∗), proving the first inequality in (4.2). On the other hand, from (4.3) and (4.4), we have

γk+1 − µ∗ ≤ (1 − ((β − 1)/(β√(2L))) √(γk+1 − µ∗)) (γk − µ∗).

Applying Lemma 4.4 with λk = γk − µ∗, for k ≥ 0, and M = (β − 1)/(β√(2L)), we complete the proof.

The next theorem gives the complexity bound of the algorithm.

Theorem 4.6. Consider γ0 > µ∗ ≥ 0. Then Algorithm 4.1 generates a sequence {xk} such that, for all k > 0,

f(xk) − f(x∗) ≤ min{ (1 − ((β − 1)/β)√(µ∗/(2L)))^k,  8β²L/((β − 1)²(γ0 − µ∗)k²) } Q(L + γ0).

Proof. By construction, f(xk) ≤ φk(x) for all x ∈ IRn, in particular for x = x∗. Using this and Lemma 4.3, we have

f(xk) − f(x∗) ≤ ((L + γ0)/(γ0 − µ∗)) Q(γk − µ∗).

Applying Lemma 4.5, we complete the proof.

This theorem ensures that an error of ε > 0 for the final objective function value is achieved in O(1/√ε) iterations.

Global convergence. In our algorithms, each iteration produces f(xk+1) ≤ f(yk) ≤ f(xk). Hence, if we consider the sequence x1, y1, x2, y2, . . ., we have a descent algorithm in which every other step is a steepest descent step with an Armijo step length. The global convergence of such a method is a standard result.

5. Numerical results. In this section we report the results of a limited set of computational experiments, comparing the variants of Algorithm 2.1 and Algorithm 4.1. The codes are written in Matlab, and the stopping criterion is based on the norm of the objective function gradient and, whenever the optimal value is available, on the function values. In all the algorithms the line searches are of the Armijo type, and the same routine is used for all. We did not work on the efficiency of the line searches, and so it does not make much sense to compare methods based on the number of function evaluations, with one exception: the original Nesterov algorithm never computes function values, as will be commented on below. Hence our comparisons will be based
on the number of iterations, with one or two gradient computations per iteration and one or two line searches per iteration (see below). Here are the algorithms to be tested, with their acronyms: • [Nesterov1] Algorithm 2.1 with θk = θN , αk = αN . • [αG , θG ] Algorithm 2.1 with our choice of parameters. • [Adaptive] Algorithm 4.1 with our choice of parameters. • [Nesterov2] Nesterov’s algorithm with adaptive estimates for L, from reference [7]. • [Cauchy] Steepest descent algorithm. • [F-R] Fletcher-Reeves as presented in [10, p.120]. We shall see that in most examples [Adaptive] has the best performance among the optimal methods cited. When there are good estimates for the constants L and µ, [Nesterov1] may be the best. At the end of the section we compare our method with the Fletcher and Reeves conjugate gradient method, and conclude that in many cases this last method has the best performance. Now we describe our tests for three different convex functions: (i) Quadratic function with µ = 1 and L = 1000. (ii) Nesterov’s “worst function in the world”. (iii) A convex function including exponentials. In (i) and (iii), the Hessian eigenvalues at the unique minimizer are randomly distributed between µ = 1 and L = 1000. (i) Quadratic functions. Given n, L > µ > 0 (in these tests, L = 1000 and µ = 1), the quadratic test functions are given by x ∈ IRn 7→ f (x) =
(1/2) xᵀQx,
where Q is a diagonal matrix with random entries in the interval [1, L]. For each experiment, we generate Q and a random initial point x0 with ‖x0‖∞ = 1. We generated several instances of problems with n assuming the values 10, 100, 200, 500, 1000, 2500, which were solved by all methods.

Figure 5.1 shows the performance of the methods assuming ignorance of µ and L. The algorithms were fed with the data µ = 0, L = 100000. On the left, we show the function values for a typical example in dimension 1000, and on the right the performance profiles for all methods on the whole set of problems. In all cases [Adaptive] used the smallest number of iterations, followed by [Nesterov2] and [αG, θG]. [Nesterov1] always needed over 10 times the number of iterations of [Adaptive].

Figure 5.2 shows the results for the same tests, but using the knowledge of the values µ = 1 and L = 1000. Now [Nesterov1] performs very well, and even if its number of iterations is a little larger, each iteration is faster because only one line search is needed. Now [Nesterov2] is slower because it does not take advantage of the knowledge of µ.

(ii) Nesterov's "worst function in the world". This function is defined and explained in Nesterov [5, p.67]. Given L > µ > 0 (we took µ = 1, L = 1000), set Q = L/µ. The function is

(5.1)  f(x) = (µ(Q − 1)/8) (x1² + Σ_{i=1}^{n−1} (xi − xi+1)²) + (µ/2)‖x‖².
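The test function (5.1) and its gradient can be sketched in plain Python (the paper's experiments are in Matlab; the function names are ours). As in (5.1), no linear term is included, so the minimizer is the origin.

```python
def worst_function(x, mu=1.0, L=1000.0):
    # f(x) = mu(Q-1)/8 * (x_1^2 + sum (x_i - x_{i+1})^2) + mu/2 * ||x||^2
    Q = L / mu
    quad = x[0] ** 2 + sum((x[i] - x[i + 1]) ** 2 for i in range(len(x) - 1))
    return mu * (Q - 1.0) / 8.0 * quad + mu / 2.0 * sum(t * t for t in x)

def worst_gradient(x, mu=1.0, L=1000.0):
    Q, n = L / mu, len(x)
    g = [0.0] * n
    g[0] = 2.0 * x[0]
    # gradient of sum (x_i - x_{i+1})^2 has a tridiagonal structure
    for i in range(n - 1):
        d = 2.0 * (x[i] - x[i + 1])
        g[i] += d
        g[i + 1] -= d
    return [mu * (Q - 1.0) / 8.0 * gi + mu * xi for gi, xi in zip(g, x)]
```

Since f is quadratic, the gradient can be validated exactly against central finite differences.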
Fig. 5.1. Quadratic functions with µ and L unknown.
Fig. 5.2. Quadratic functions with µ and L given.
Generating a set of problems for the same values of n as before and several starting points, Figure 5.3 shows the performance profiles for all problems in two situations: on the left, considering µ and L unknown (µ = 0 and L = 100000 fed to the algorithms); on the right, with µ = 1 and L = 1000 given to the algorithms. We see that the behavior is similar to case (i), with one detail: [Nesterov1] was worse than [Cauchy] for unknown µ and L. This is certainly due to the choice L = 100000, which affects the computation of θN in the algorithm.
Fig. 5.3. Performance profile for the function defined by (5.1).
(iii) Function with exponentials. This function is convex and includes exponentials. The Hessian eigenvalues at the optimizer are randomly distributed between 1 and 1000:

(5.2)  f(x) = (1/2)‖x‖² + Σ_{i=1}^{n} qi (e^{xi} − xi − 1),

where qi ∈ [1, 999] for i = 1, . . . , n. The minimizer is x = 0, with f(0) = 0. We use the same values of n as in the former examples. For this function the Hessian matrix H(x) is diagonal with Hii(x) = qi e^{xi} + 1, and hence there is no global Lipschitz constant for the gradient. Figure 5.4 shows a typical run with n = 1000 and the performance profile for µ = 0, L = 100000 fed to the algorithms.
Fig. 5.4. Objective function value and performance profile for a function with exponentials.
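A test problem of the form (5.2) can be generated as follows; a Python sketch (the paper's codes are Matlab), with names and the random seed convention being ours.

```python
import math
import random

def make_exp_problem(n, seed=0):
    # Build f(x) = ||x||^2/2 + sum q_i (exp(x_i) - x_i - 1) with random
    # q_i in [1, 999], as in (5.2); returns the function and its gradient.
    rng = random.Random(seed)
    q = [rng.uniform(1.0, 999.0) for _ in range(n)]

    def f(x):
        return 0.5 * sum(t * t for t in x) + sum(
            qi * (math.exp(xi) - xi - 1.0) for qi, xi in zip(q, x))

    def grad(x):
        # d/dx_i = x_i + q_i (exp(x_i) - 1); Hessian H_ii = 1 + q_i exp(x_i)
        return [xi + qi * (math.exp(xi) - 1.0) for qi, xi in zip(q, x)]

    return f, grad
```

The minimizer is x = 0 with f(0) = 0 and ∇f(0) = 0, which makes stopping tests easy to verify.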
Again, [Adaptive] has the best performance, and [Nesterov2] takes about 5 times the number of iterations (computing at least two gradients per iteration). Here again, [Nesterov1] is slower than [Cauchy].

Comparison with the Fletcher-Reeves algorithm. As we saw in the Introduction, conjugate gradient algorithms are the most efficient first order methods for quadratic functions. They are also known to perform very well for a strictly convex function in a region where the second derivatives are near their values at an optimal solution (possibly using restarts and precise line searches). There exists a vast literature on conjugate gradient methods, and we do not intend to review it here. We shall construct an example in which the second derivatives are ill-behaved near the optimal solution, and observe that our method performs better than Fletcher-Reeves. For a drastic but not so simple counterexample, see the reference [4] discussed in the Introduction.

The separable function. We start by defining a function g : IR → IR as follows: partition the set [0, +∞) into p intervals defined by the points 0, 2^{−p+2}, 2^{−p+3}, . . . , 2^0, +∞, and associate with each interval a positive value chosen at random. Now integrate the resulting function twice to obtain a strictly convex piecewise quadratic function. Given τj ∈ [1, L], j = 1, . . . , n, L > 0, our separable function is

f(x) = Σ_{j=1}^{n} τj g(xj).

This function has a global minimum at the origin, its Hessian is always positive diagonal, and its derivatives are L-Lipschitz continuous, but the Hessian H(x) differs very much from H(0) at any point x with ‖x‖∞ > 2^{−p+2}.
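The construction of g can be sketched in Python under mild assumptions of ours that the paper leaves open: we draw the per-interval value (which plays the role of g″) uniformly from [1, Lmax], and we extend g evenly to negative arguments so that the origin is the global minimizer.

```python
import bisect
import random

def make_g(p, seed=0, Lmax=1000.0):
    # Pick a random positive second derivative on each of the p intervals
    # [0, 2^{-p+2}), ..., [2^0, +inf) and integrate twice, matching value and
    # slope at the breakpoints so that g is C^1 and strictly convex.
    rng = random.Random(seed)
    pts = [0.0] + [2.0 ** (-p + 2 + j) for j in range(p - 1)]  # breakpoints
    curv = [rng.uniform(1.0, Lmax) for _ in pts]               # g'' per interval
    val, slope = [0.0], [0.0]          # g(0) = 0, g'(0) = 0
    for j in range(1, len(pts)):
        h = pts[j] - pts[j - 1]
        val.append(val[-1] + slope[-1] * h + 0.5 * curv[j - 1] * h * h)
        slope.append(slope[-1] + curv[j - 1] * h)

    def g(t):
        t = abs(t)                     # even extension to t < 0 (our choice)
        j = bisect.bisect_right(pts, t) - 1
        h = t - pts[j]
        return val[j] + slope[j] * h + 0.5 * curv[j] * h * h

    return g
```

Because the curvature jumps between adjacent dyadic intervals, the Hessian of the resulting separable f changes sharply as the iterates approach the origin, which is exactly what makes the example hard for conjugate gradient methods.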
Figure 5.5 shows a typical behavior of our algorithm and [F-R] for problems with dimension n = 200 and n = 2000. The algorithm [F-R] was implemented with restarts at every n iterations, and also when the search direction becomes almost orthogonal to the gradient. The line search used was either golden section or Armijo. In practically all runs with dimensions from 100 to 10000, [F-R] needed more iterations than [Adaptive]. f
Fig. 5.5. "Separable function" values in IR^200 and IR^2000.
Appendix: interval reduction search. Let g : [0, 1] → IR be a convex differentiable function. Assume that g(0) = 0, g′(0) < 0 and g′(1) > 0. We study the problem of finding θ ∈ [0, 1] such that g(θ) ≤ 0 and g′(θ) ≥ 0. This can be seen as a line search problem with an Armijo constant 0 and a Wolfe condition with constant 0, for which there are algorithms in the literature (see for instance [3, 10]). In the present case, in which the function is convex, we can use a simple interval reduction algorithm, based on the following step: assume that three points 0 ≤ A < ν < B ≤ 1 are given, satisfying g(A) ≥ g(ν) ≤ g(B). Note that, as consequences of the convexity of g, the interval [A, B] contains a minimizer of g and g′(B) ≥ 0. The problem can be solved by the following interval reduction scheme:

Algorithm 5.1. Interval reduction algorithm.
while g(B) > 0,
    Choose ξ ∈ (A, B), ξ ≠ ν. Set u = min{ν, ξ}, v = max{ν, ξ}.
    If g(u) ≤ g(v), set B = v, ν = u; else set A = u, ν = v.

The initial value of ν and the values of ξ at each iteration may be chosen so that u and v define the golden sections of the interval [A, B], and then the interval length is reduced by a factor of (√5 − 1)/2 ≈ 0.62 at each iteration. We can also choose ξ as the minimizer of the quadratic interpolating g(A), g(ν), g(B); in this case one must avoid the choice ξ = ν, in which case ξ must be perturbed.

Acknowledgements. The authors would like to thank Prof. Masakazu Kojima and Prof. Mituhiro Fukuda for useful discussions about the subject. We also thank the anonymous referees and Prof. Paulo J. S. Silva for their valuable comments and suggestions, which very much improved this paper. This research was partially completed while the second author was visiting the Tokyo Institute of Technology under the project supported by MEXT's program "Promotion of Environmental Improvement
for Independence of Young Researchers" under the Special Coordination Funds for Promoting Science and Technology.

REFERENCES

[1] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, USA, 2003.
[2] G. Lan, Z. Lu, and R. D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Technical report, School of ISyE, Georgia Tech, USA, 2006. Accepted in Mathematical Programming.
[3] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease. Technical Report MCS-P330-1092, Mathematics and Computer Science Division, Argonne National Laboratory, 1992.
[4] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York, 1983.
[5] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, 2004.
[6] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[7] Y. Nesterov. Gradient methods for minimizing composite objective function. Discussion paper 76, CORE, UCL, Belgium, 2007.
[8] Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245-259, 2007.
[9] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108:177-205, 2006.
[10] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, 1999.
[11] B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.
[12] P. Richtárik. Improved algorithms for convex minimization in relative scale. SIAM Journal on Optimization, 21(3):1141-1167, 2011.
[13] N. Shor. Minimization Methods for Non-Differentiable Functions. Springer Verlag, Berlin, 1985.