Optimization of Convex Functions with Random Pursuit∗
arXiv:1111.0194v1 [math.OC] 1 Nov 2011
S. U. Stich†
C. L. Müller‡
B. Gärtner§
October 31, 2011
Abstract We consider unconstrained randomized optimization of convex objective functions. We analyze the Random Pursuit algorithm, which iteratively computes an approximate solution to the optimization problem by repeated optimization over a randomly chosen one-dimensional subspace. This randomized method only uses zeroth-order information about the objective function and does not need any problem-specific parametrization. We prove convergence and give convergence rates for smooth objectives assuming that the one-dimensional optimization can be solved exactly or approximately by an oracle. A convenient property of Random Pursuit is its invariance under strictly monotone transformations of the objective function. It thus enjoys identical convergence behavior on a wider function class. To support the theoretical results we present extensive numerical performance results of Random Pursuit, two gradient-free algorithms recently proposed by Nesterov, and a classical adaptive step-size random search scheme. We also present an accelerated heuristic version of the Random Pursuit algorithm which significantly improves standard Random Pursuit on all numerical benchmark problems. A general comparison of the experimental results reveals that (i) standard Random Pursuit is effective on strongly convex functions with moderate condition number, and (ii) the accelerated scheme is comparable to Nesterov’s fast gradient method and outperforms adaptive step-size strategies.
Keywords. continuous optimization, convex optimization, randomized algorithm, line search
AMS. 90C25, 90C56, 68W20, 62L10
∗ The project CG Learning acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 255827.
† Institute of Theoretical Computer Science, ETH Zürich, and Swiss Institute of Bioinformatics, [email protected]
‡ Institute of Theoretical Computer Science, ETH Zürich, and Swiss Institute of Bioinformatics, [email protected]
§ Institute of Theoretical Computer Science, ETH Zürich, [email protected]
1 Introduction
Randomized zeroth-order optimization schemes were among the first algorithms proposed to numerically solve unconstrained optimization problems [1, 4, 30]. These methods are usually easy to implement, do not require gradient or Hessian information about the objective function, and comprise a randomized mechanism to iteratively generate new candidate solutions. In many areas of modern science and engineering such methods are indispensable in the simulation (or black-box) optimization context, where higher-order information about the simulation output is not available or does not exist. Compared to deterministic zeroth-order algorithms such as direct search methods [18] or interpolation methods [6] randomized schemes often show faster and more robust performance in real-world applications. While probabilistic convergence guarantees even for non-convex objectives are readily available for many randomized algorithms [34], provable convergence rates are often not known or unrealistically slow. Notable exceptions can be found in the literature on adaptive step size random search (also known as Evolution Strategies) [3, 11], on Markov chain methods for volume estimation, rounding, and optimization [33], and in Nesterov’s recent work on complexity bounds for gradient-free convex optimization [26]. Although Nesterov’s algorithms are termed “gradient-free” their working mechanism does, in fact, rely on approximate directional derivatives that have to be available via a suitable oracle. We here relax this requirement and investigate a true randomized gradient- and derivative-free optimization algorithm: Random Pursuit (RP µ ). The method comprises two very elementary primitives: a random direction generator and an (approximate) line search routine. We establish theoretical performance bounds of this algorithm for the unconstrained convex minimization problem min f (x)
subject to x ∈ R^n, (1)
where f is a smooth convex function. We assume that there is a global minimum and that the curvature of f can be bounded by a constant. Each iteration of Random Pursuit consists of two steps: a random direction is sampled uniformly at random from the unit sphere, and the next iterate is chosen so as to (approximately) minimize the objective function along this direction. This method ranges among the simplest possible optimization schemes as it solely relies on two easy-to-implement primitives: a random direction generator and an (approximate) one-dimensional line search. A convenient feature of the algorithm is that it inherits the invariance under strictly monotone transformations of the objective function from the line search oracle. The algorithm thus enjoys convergence guarantees even for non-convex objective functions that can be transformed into convex objectives via a suitable strictly monotone transformation.
Although Random Pursuit is fully gradient- and derivative-free, it can still be understood from the perspective of the classical gradient method. The gradient method (GM) is an iterative algorithm where the current approximate solution xk ∈ R^n is improved along the direction of the negative gradient with some step size λk:

xk+1 = xk + λk (−∇f(xk)). (2)
When the descent direction is replaced by a random vector, the generic scheme reads

xk+1 = xk + λk u, (3)
where u is a random vector distributed uniformly over the unit sphere. A crucial aspect of the performance of this randomized scheme is the determination of the step size. Rastrigin [30] studied the convergence of this scheme on quadratic functions for fixed step sizes λk where only improving steps are accepted. Many authors observed that variable step size methods yield faster convergence [21, 15]. Schumer and Steiglitz [32] were among the first to develop an effective step size adaptation rule, based on maximizing the expected one-step progress on the sphere function. A similar analysis was independently obtained by Rechenberg for the (1+1)-Evolution Strategy (ES) [31]. Mutseniyeks and Rastrigin proposed to choose the step size such as to minimize the function value along the random direction [23]. This algorithm is identical to Random Pursuit with an exact line search. Convergence analyses on strongly convex functions have been provided by Krutikov [19] and Rappl [29]. Rappl proved linear convergence of RP µ without giving exact convergence rates. Krutikov showed linear convergence in the special case where the search directions are given by n linearly independent vectors used in cyclic order. Karmanov [13, 14, 35] already conducted an analysis of Random Pursuit on general convex functions. Thus far, Karmanov's work has not been widely recognized by the optimization community, but his results are very close to the work presented here. We enhance Karmanov's results in a number of ways: (i) we prove expected convergence rates also under approximate line search; (ii) we show that continuous sampling from the unit sphere can be replaced with discrete sampling from the set {±ei : i = 1, . . . , n} of signed unit vectors, without changing the expected convergence rates; (iii) we provide a large number of experimental results, showing that Random Pursuit is a competitive algorithm in practice; (iv) we introduce a heuristic improvement of Random Pursuit that is even faster on all our benchmark functions; (v) we point out that Random Pursuit can also be applied to a number of relevant non-convex functions, without sacrificing any theoretical or practical performance guarantees. On the other hand, while we prove fast convergence only in expectation, Karmanov's more intricate analysis also yields fast convergence with high probability.
Polyak [28] describes step size rules for the closely related randomized gradient descent scheme:

xk+1 = xk + λk · (f(xk + µk u) − f(xk))/µk · u, (4)
where convergence is proved for µk → 0 but no convergence rates are established. Nesterov [26] studied different variants of method (4) and its accelerated versions for smooth and non-smooth optimization problems. He showed that scheme (4) is at most O(n²) times slower than the standard (sub-)gradient method. The use of exact directional derivatives reduces the gap further to O(n). For smooth problems the method is only O(n) slower than the standard gradient method, and accelerated versions are O(n²) slower than fast gradient methods. Kleiner et al. [17] studied a variant of algorithm (3) for unconstrained semidefinite programming: Random Conic Pursuit. There, each iteration comprises two steps: (i) the algorithm samples a rank-one matrix (not necessarily uniformly) at random; (ii) a two-dimensional optimization problem is solved that consists of finding the optimal linear combination of the rank-one matrix and the current semidefinite matrix. The solution determines the next iterate of the algorithm. In the case of trace-constrained semidefinite problems only a one-dimensional line search is necessary. Kleiner and co-workers proved convergence of this algorithm when directions are chosen uniformly at random. The dependency between convergence rate and dimension is, however, not known. Nonetheless, their work greatly inspired our own efforts, which is also reflected in the name "Random Pursuit" for the algorithm under study. The present article is structured as follows. In Section 2 we present the Random Pursuit algorithm with approximate line search. We introduce the necessary notation and formulate the assumptions on the objective function. In Section 3 we derive a number of useful results on the expectation of scaled random vectors. In Section 4 we calculate the expected one-step progress of Random Pursuit with approximate line search (RP µ).
We show that (besides some additive error term) this progress is by a factor of O(n) worse than the one-step progress of the gradient method. These results allow us to derive the final convergence results in Section 5. We show that RP µ meets the convergence rates of the standard gradient method up to a factor of O(n), i.e., linear convergence on strongly convex functions and convergence rate O(1/k) for general convex functions. The linear convergence on strongly convex functions is best possible: for the sphere function our method meets the lower bound [12]. For strongly convex objective functions the method is robust against small absolute or relative errors in the line search. In Section 6 we present numerical experiments on selected test problems. We compare RP µ with the standard gradient method, Nesterov's random gradient scheme and its accelerated version [26], an adaptive step size random search,
and an accelerated heuristic version of RP µ . In Section 7 we discuss the theoretical and numerical results as well as the present limitations of the scheme that may be alleviated by more elaborate randomization primitives. We also provide a number of promising future research directions.
2 The Random Pursuit (RP) algorithm
We consider problem (1) where f is a differentiable convex function with bounded curvature (to be defined below). The algorithm RP µ is a variant of scheme (3) where the step sizes are determined by a line search. Formally, we define the following oracles:

Definition 2.1 (Line search oracle). For x ∈ R^n, a direction u ∈ S^{n−1} and a convex function f, a function LS : R^n × S^{n−1} → R with

LS(x, u) = arg min_{h ∈ R} f(x + hu) (5)

is called an exact line search oracle. For accuracy µ ≥ 0, the functions LSapprox^abs_µ and LSapprox^rel_µ with

LS(x, u) − µ ≤ LSapprox^abs_µ(x, u) ≤ LS(x, u) + µ, and (6)

max{0, (1 − µ)} · LS(x, u) ≤ LSapprox^rel_µ(x, u) ≤ LS(x, u), (7)

are absolute, respectively relative, approximate line search oracles. By LSapproxµ we denote either of the two.
This means that we allow an inexact line search to return a value h̃ close to the optimal value h∗ = LS(x, u). To simplify subsequent calculations, we also require that h̃ ≤ h∗ in the case of relative approximation, but this requirement is not essential. As the optimization problem (5) cannot be solved exactly in most cases, we will describe and analyze our algorithm by means of the two latter approximation routines. The formal definition of algorithm RP µ is shown in Algorithm 1. At iteration k of the algorithm a direction u ∈ S^{n−1} is chosen uniformly at random and the next iterate xk+1 is calculated from the current iterate xk as

xk+1 := xk + LSapproxµ(xk, u) · u. (8)
This algorithm only requires function evaluations; no additional first- or second-order information about the objective is needed. Note also that besides the starting point no further input parameters describing function properties (e.g., a curvature constant) are necessary. The actual run time will, however, depend on the specific properties of the objective function.
Algorithm 1 Random Pursuit (RP µ)
Input: A problem of the form (1); N ∈ N: number of iterations; x0: an initial iterate
Output: Approximate solution xN to (1)
1: for k ← 0 to N − 1 do
2:   Choose uk uniformly at random from S^{n−1}
3:   Set xk+1 ← xk + LSapproxµ(xk, uk) uk
4: end for
5: return xN
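For concreteness, Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' MATLAB code: the exact line search is approximated by a simple ternary search over an assumed bracket [−10, 10], which plays the role of LSapproxµ, and directions are drawn by normalizing Gaussian vectors.

```python
import math
import random

def line_search(f, x, u, lo=-10.0, hi=10.0, tol=1e-8):
    """Approximate minimization of h -> f(x + h*u) by ternary search
    on [lo, hi] (assumes f is convex along the line)."""
    phi = lambda h: f([xi + h * ui for xi, ui in zip(x, u)])
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def random_pursuit(f, x0, iterations, rng=random.Random(0)):
    """Algorithm 1: repeatedly minimize f along a uniformly random direction."""
    x = list(x0)
    for _ in range(iterations):
        u = [rng.gauss(0.0, 1.0) for _ in x]        # Gaussian vector ...
        norm = math.sqrt(sum(ui * ui for ui in u))
        u = [ui / norm for ui in u]                  # ... normalized onto S^{n-1}
        h = line_search(f, x, u)
        x = [xi + h * ui for xi, ui in zip(x, u)]
    return x

# Example: minimize a simple strongly convex quadratic in R^3.
f = lambda x: sum((xi - 1.0) ** 2 for xi in x)
x = random_pursuit(f, [5.0, -3.0, 4.0], iterations=100)
```

On this 3-dimensional quadratic, 100 iterations already bring f(x) very close to the minimum value 0, consistent with the linear convergence analyzed in Section 5.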
2.1 Discrete Sampling
As our analysis below reveals, the random vector uk enters the analysis only in terms of expectations of the form E[⟨x, uk⟩ uk] and E[‖⟨x, uk⟩ uk‖²]. In Lemmas 3.3 and 3.4 we show that these expectations are the same for uk ∼ S^{n−1} and uk ∼ {±ei : i = 1, . . . , n}, the set of signed unit vectors. It follows that continuous sampling from S^{n−1} can be replaced with discrete sampling from {±ei : i = 1, . . . , n} without affecting our guarantees on the expected runtime. Under this modification, fast convergence still holds with high probability, but the bounds get worse [14].
2.2 Quasiconvex functions
If f and g are functions, g is called a strictly monotone transformation of f if

f(x) < f(y) ⇔ g(f(x)) < g(f(y)), x, y ∈ R^n.
It is clear from this that the distribution of xk in RP µ is the same for the function f and the function g ◦ f , if g is a strictly monotone transformation of f . This follows from the fact that the result of any line search is invariant under strictly monotone transformations. This observation allows us to run RP µ on any strictly monotone transformation of any convex function f , with the same theoretical and practical performance as on f itself. The functions obtainable in this way form a subclass of the class of quasiconvex functions, and they include non-convex functions as well. In Section 6.2.3 we will experimentally verify the invariance of RP µ under strictly monotone transformations on one instance of a quasiconvex function.
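This invariance is easy to check numerically: a line search finds the same minimizer along a direction whether it is applied to f or to g ∘ f. The sketch below is illustrative; the convex one-dimensional restriction and the strictly increasing transformation g(t) = e^t are assumed examples, and the ternary search stands in for a generic line search oracle.

```python
import math

def argmin_on_line(phi, lo=-10.0, hi=10.0, tol=1e-9):
    """Ternary search for the minimizer of a unimodal phi on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# f restricted to a line x + h*u, for some fixed x and direction u.
f_line = lambda h: (h - 2.0) ** 2 + 1.0      # convex, minimizer h* = 2
g_of_f = lambda h: math.exp(f_line(h))       # strictly monotone transform of f

h_f = argmin_on_line(f_line)
h_gf = argmin_on_line(g_of_f)
# h_f and h_gf agree up to the search tolerance, so the RP iterates
# would be identical for f and for g∘f.
```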
2.3 Function Basics
We now introduce some important inequalities that are useful for the subsequent analysis. We always assume that the objective function is differentiable and convex. The latter property is equivalent to

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩, x, y ∈ R^n. (9)
We also require that the curvature of f is bounded. By this we mean that for some constant L1,

f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L1/2) ‖x − y‖², x, y ∈ R^n. (10)
We will also refer to this inequality as the quadratic upper bound. It means that the deviation of f from any of its linear approximations can be bounded by a quadratic function. A differentiable function is strongly convex with parameter m if the quadratic lower bound

f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ (m/2) ‖y − x‖², x, y ∈ R^n, (11)
holds. Let x∗ be the unique minimizer of a strongly convex function f with parameter m. Then (11) implies the useful relation

(m/2) ‖x − x∗‖² ≤ f(x) − f(x∗) ≤ (1/2m) ‖∇f(x)‖², for all x ∈ R^n. (12)
The former inequality uses ∇f(x∗) = 0, and the latter one follows from (11) via

f(x∗) ≥ f(x) + ⟨∇f(x), x∗ − x⟩ + (m/2) ‖x∗ − x‖²
     ≥ f(x) + min_{y ∈ R^n} { ⟨∇f(x), y − x⟩ + (m/2) ‖y − x‖² } = f(x) − (1/2m) ‖∇f(x)‖²

by standard calculus.
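As a quick sanity check of relation (12), both sides can be evaluated on a function where m, the gradient, and the minimizer are known in closed form. The diagonal quadratic below is a hypothetical example chosen for illustration: f(x) = ½ Σ aᵢxᵢ² is strongly convex with parameter m = min aᵢ, and its unique minimizer is x∗ = 0 with f(x∗) = 0.

```python
# Illustrative diagonal quadratic: f(x) = 0.5 * sum(a_i * x_i^2).
a = [1.0, 4.0, 10.0]
m = min(a)  # strong convexity parameter

def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad_sq(x):
    # squared gradient norm; grad f(x)_i = a_i * x_i
    return sum((ai * xi) ** 2 for ai, xi in zip(a, x))

x = [3.0, -1.0, 0.5]
gap = f(x) - 0.0                              # f(x) - f(x*)
lower = 0.5 * m * sum(xi * xi for xi in x)    # (m/2) ||x - x*||^2
upper = grad_sq(x) / (2.0 * m)                # (1/2m) ||grad f(x)||^2
# relation (12): lower <= gap <= upper
```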
3 Expectation of scaled random vectors
We now study the projection of a fixed vector x onto a random vector u. This will help analyze the expected progress of Algorithm 1. We start with the case u ∼ N(0, In) and then extend it to u ∼ S^{n−1}. Throughout this section, let x ∈ R^n be a fixed vector and u ∈ R^n a random vector drawn according to some distribution. We will need the following facts about the moments of the standard normal distribution.

Lemma 3.1. (i) Let ν ∼ N(0, 1) be drawn from the standard normal distribution over the reals. Then

E[ν] = E[ν³] = 0, E[ν²] = 1, E[ν⁴] = 3.
(ii) Let u ∼ N(0, In) be drawn from the standard normal distribution over R^n. Then

E_u[uuᵀ] = In, E_u[(uuᵀ)²] = (n + 2) In.

Proof. Part (i) is standard, and the two matrix equations easily follow from (i) via

(uuᵀ)_{ij} = u_i u_j, ((uuᵀ)²)_{ij} = u_i u_j Σ_k u_k².
Lemma 3.2 (Normal distribution). Let u ∼ N(0, In). Then

E_u[⟨x, u⟩ u] = x, and E_u[‖⟨x, u⟩ u‖²] = (n + 2) ‖x‖².

Proof. We calculate E_u[⟨x, u⟩ · u] = E_u[uuᵀ x] = E_u[uuᵀ] x = x, by Lemma 3.1(ii). For the second moment we get

E_u[‖⟨x, u⟩ · u‖²] = E_u[xᵀ(uuᵀ)² x] = xᵀ E_u[(uuᵀ)²] x = (n + 2) ‖x‖²,

again using Lemma 3.1(ii).

Lemma 3.3 (Spherical distribution). Let u ∼ S^{n−1}. Then

E_u[⟨x, u⟩ u] = (1/n) x, and E_u[‖⟨x, u⟩ u‖²] = E_u[⟨x, u⟩²] = (1/n) ‖x‖².

Proof. Let v ∼ N(0, In). We observe that the random vector w = v/‖v‖ has the same distribution as u. In particular,

E_u[uuᵀ] = E_v[vvᵀ/‖v‖²] = E_v[vvᵀ] / E_v[‖v‖²] = In/n, (13)

where we have used that the two random variables vvᵀ/‖v‖² and ‖v‖² are independent (see [8]), along with

E_v[vvᵀ] = In, E_v[‖v‖²] = n,

a consequence of Lemma 3.1. Now we use (13) to compute

E_u[⟨x, u⟩ · u] = E_u[uuᵀ] x = (In/n) x = (1/n) x

and

E_u[⟨x, u⟩²] = E_u[xᵀ uuᵀ x] = xᵀ E_u[uuᵀ] x = (1/n) xᵀx = (1/n) ‖x‖².

The same result can be derived when the vector u is chosen to be a random signed unit vector.
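Lemma 3.3 can also be checked by simulation; the sketch below (a quick illustrative check, not part of the proof) draws directions from S^{n−1} by normalizing Gaussian vectors and compares the empirical averages with x/n and ‖x‖²/n, up to sampling error. The dimension, test vector, and sample size are arbitrary choices.

```python
import math
import random

rng = random.Random(42)
n = 3
x = [1.0, -2.0, 0.5]
N = 200_000

sum_proj = [0.0] * n     # accumulates <x,u> * u
sum_sq = 0.0             # accumulates <x,u>^2
for _ in range(N):
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    u = [vi / norm for vi in v]          # u uniform on S^{n-1}
    dot = sum(xi * ui for xi, ui in zip(x, u))
    sum_proj = [s + dot * ui for s, ui in zip(sum_proj, u)]
    sum_sq += dot * dot

mean_proj = [s / N for s in sum_proj]    # should approach x/n
mean_sq = sum_sq / N                     # should approach ||x||^2 / n
```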
Lemma 3.4. Let u ∼ U := {±ei : i = 1, . . . , n}, where ei denotes the i-th standard unit vector in R^n. Then

E_u[⟨x, u⟩ u] = (1/n) x, and E_u[‖⟨x, u⟩ u‖²] = E_u[⟨x, u⟩²] = (1/n) ‖x‖².

Proof. We calculate

E_u[⟨x, u⟩ · u] = (1/2n) Σ_{u∈U} ⟨x, u⟩ · u = (1/n) Σ_{i=1}^n x_i e_i = (1/n) x,

and similarly

E_u[⟨x, u⟩²] = (1/2n) Σ_{u∈U} ⟨x, u⟩² = (1/n) Σ_{i=1}^n x_i² = (1/n) ‖x‖².
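Since U is finite, the expectations in Lemma 3.4 can be evaluated exactly by summing over all 2n signed unit vectors, as the following illustrative sketch does (the dimension and test vector are arbitrary).

```python
n = 4
x = [2.0, -1.0, 0.5, 3.0]

# Enumerate U = {±e_i}: each u is a signed standard unit vector.
U = []
for i in range(n):
    for sign in (1.0, -1.0):
        u = [0.0] * n
        u[i] = sign
        U.append(u)

# Exact E[<x,u> u] and E[<x,u>^2], averaging uniformly over the 2n vectors.
exp_proj = [0.0] * n
exp_sq = 0.0
for u in U:
    dot = sum(xi * ui for xi, ui in zip(x, u))
    exp_proj = [s + dot * ui / len(U) for s, ui in zip(exp_proj, u)]
    exp_sq += dot * dot / len(U)
# exp_proj equals x/n and exp_sq equals ||x||^2/n, exactly as in Lemma 3.4.
```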
4 Single step progress
To prepare the convergence proof of Algorithm RP µ in the next section, we study the expected progress in a single step, which is the quantity E [f (xk+1 ) | xk ] . It turns out that we need to proceed differently, depending on whether the function under consideration is strongly convex (the easier case) or not. We start with a preparatory lemma for both cases. We first analyze the case when an approximate line search with absolute error is applied. Using an approximate line search with relative error will be reduced to the case of an exact line search.
4.1 Line search with absolute error
Lemma 4.1 (Absolute Error). Let xk ∈ R^n be the current iterate and xk+1 ∈ R^n the next iterate generated by algorithm RP µ with absolute line search accuracy µ. For every positive h ∈ R and every point z ∈ R^n we have

E[f(xk+1) | xk] ≤ f(xk) + (h/n) ⟨∇f(xk), z − xk⟩ + (L1 h²/2n) ‖z − xk‖² + L1 µ²/2.

Proof. Let x′k+1 := xk + LS(xk, uk) uk be the exact line search optimum. Here, uk ∈ S^{n−1} is the chosen search direction. By definition of the approximate line search (6), we have

f(xk+1) ≤ max_{|ν|≤µ} f(x′k+1 + ν uk) ≤ f(x′k+1) + max_{|ν|≤µ} { ⟨∇f(x′k+1), ν uk⟩ + (L1/2) ν² } = f(x′k+1) + L1 µ²/2, (14)

where we used the quadratic upper bound (10) in the second inequality with x = x′k+1 and y = x′k+1 + ν uk, and ⟨∇f(x′k+1), uk⟩ = 0 because x′k+1 is the exact optimum along uk. Since x′k+1 is the exact line search optimum, we in particular have

f(x′k+1) ≤ f(xk + tk uk) ≤ f(xk) + ⟨∇f(xk), tk uk⟩ + (L1/2) tk², for all tk ∈ R, (15)

where we have applied (10) a second time. Putting together (14) and (15), and taking expectations, we get

E_{uk}[f(xk+1) | xk] ≤ f(xk) + E_{uk}[ ⟨∇f(xk), tk uk⟩ + (L1/2) tk² | xk ] + L1 µ²/2. (16)

Now it is time to choose tk such that we can control the expectations on the right-hand side. We set tk := h ⟨z − xk, uk⟩, where h > 0 and z ∈ R^n are the "free parameters" of the lemma. Via Lemma 3.3, this entails

E_{uk}[tk uk] = (h/n)(z − xk), E_{uk}[tk²] = (h²/n) ‖z − xk‖²,

and the lemma follows.
4.2 Line search with relative error
In the case of relative line search error, we can prove a variant of Lemma 4.1 with a denominator n′ slightly larger than n. As a result, the analysis under relative line search error reduces to the analysis of exact line search (approximate line search error 0) in a slightly higher dimension; in the sequel, we will therefore only deal with absolute line search error.

Lemma 4.2 (Relative Error). Let xk ∈ R^n be the current iterate and xk+1 ∈ R^n the next iterate generated by algorithm RP µ with relative line search accuracy µ. For every positive h ∈ R and every point z ∈ R^n we have

E[f(xk+1) | xk] ≤ f(xk) + (h/n′) ⟨∇f(xk), z − xk⟩ + (L1 h²/2n′) ‖z − xk‖²,

where n′ = n/(1 − µ).

Proof. By the definition (7) of relative line search error, xk+1 is a convex combination of xk and x′k+1, the exact line search optimum. More precisely, we can compute that

xk+1 = (1 − γ)xk + γ x′k+1,

where γ ≥ 1 − µ. By convexity of f, we thus have

f(xk+1) ≤ (1 − γ) f(xk) + γ f(x′k+1) ≤ µ f(xk) + (1 − µ) f(x′k+1),

since f(x′k+1) ≤ f(xk). Hence

E[f(xk+1) | xk] ≤ µ f(xk) + (1 − µ) E[f(x′k+1) | xk]. (17)

Using Lemma 4.1 with absolute line search error 0 yields a bound for the latter term:

E[f(x′k+1) | xk] ≤ f(xk) + (h/n) ⟨∇f(xk), z − xk⟩ + (L1 h²/2n) ‖z − xk‖².

Putting this together with (17) yields

E[f(xk+1) | xk] ≤ f(xk) + (1 − µ) [ (h/n) ⟨∇f(xk), z − xk⟩ + (L1 h²/2n) ‖z − xk‖² ],

and with n′ = n/(1 − µ), the lemma follows.
4.3 Towards the strongly convex case
Here we use z = xk − ∇f(xk) in Lemma 4.1, the value of z that leads to the smallest right-hand side in the inequality of the lemma.

Corollary 4.1. Let xk ∈ R^n be the current iterate and xk+1 ∈ R^n the next iterate generated by algorithm RP µ with absolute line search accuracy µ. For any positive hk ≤ 1/L1 it holds that

E[f(xk+1) | xk] ≤ f(xk) − (hk/2n) ‖∇f(xk)‖² + L1 µ²/2.

Proof. Lemma 4.1 with z = xk − ∇f(xk) yields

E[f(xk+1) | xk] ≤ f(xk) − (hk/n) ⟨∇f(xk), ∇f(xk)⟩ + (L1 hk²/2n) ‖∇f(xk)‖² + L1 µ²/2.

We conclude

E[f(xk+1) | xk] ≤ f(xk) − (hk/n)(1 − L1 hk/2) ‖∇f(xk)‖² + L1 µ²/2,

where (hk/n)(1 − L1 hk/2) ≥ hk/2n since hk ≤ 1/L1.
4.4 Towards the general convex case
For this case, we apply Lemma 4.1 with z = x∗.

Corollary 4.2. Let xk ∈ R^n be the current iterate and xk+1 ∈ R^n the next iterate generated by algorithm RP µ with absolute line search accuracy µ. Let x∗ ∈ R^n be one of the minimizers of the function f. For any hk ≥ 0 it holds that

E[f(xk+1) − f(x∗) | xk] ≤ (1 − hk/n) (f(xk) − f(x∗)) + (L1 hk²/2n) ‖x∗ − xk‖² + L1 µ²/2.

Proof. We use Lemma 4.1 with z = x∗ and apply convexity (9) to bound the term ⟨∇f(xk), x∗ − xk⟩ from above by f(x∗) − f(xk). Subtracting f(x∗) from both sides yields the inequality of the corollary.
5 Convergence results
Here we use the previously derived bounds on the expected single step progress (Corollaries 4.1 and 4.2) to show convergence of the algorithm.
5.1 Convergence analysis for strongly convex functions
We first prove that algorithm RP µ converges linearly in expectation on strongly convex functions. Although strong convexity is a global property, it suffices that the function is strongly convex in a neighborhood of its minimizer; see Theorem 5.2.

Theorem 5.1. Let f be strongly convex with parameter m, and let the sequence {xk}_{k≥0} be generated by RP µ with absolute line search accuracy µ. Then for any N ≥ 0, we have

E[f(xN) − f(x∗)] ≤ (1 − m/(L1 n))^N (f(x0) − f(x∗)) + L1² nµ²/(2m).

Proof. We use Corollary 4.1 with h = 1/L1 and the quadratic lower bound (12) to estimate the progress in one step as

E[f(xk+1) − f(x∗) | xk] ≤ f(xk) − f(x∗) − (1/(2nL1)) ‖∇f(xk)‖² + L1 µ²/2
 ≤ (1 − m/(nL1)) (f(xk) − f(x∗)) + L1 µ²/2,

where the second inequality uses (12). After taking expectations (over xk), the partition theorem of conditional expectations yields the recurrence

E[f(xk+1) − f(x∗)] ≤ (1 − m/(nL1)) E[f(xk) − f(x∗)] + L1 µ²/2.
This implies

E[f(xN) − f(x∗)] ≤ (1 − m/(nL1))^N (f(x0) − f(x∗)) + (L1 µ²/2) ω(N),

with

ω(N) := Σ_{i=0}^{N−1} (1 − m/(L1 n))^i ≤ L1 n/m.

The bound of the theorem follows.

We remark that by strong convexity also the error ‖xN − x∗‖ can be bounded using the results of this theorem. That is, the algorithm converges not only in function value, but also in terms of the iterates themselves. Each strongly convex function has a unique minimizer x∗. Using the quadratic lower bound (12) we recall that

f(x) − f(x∗) ≥ (m/2) ‖x − x∗‖², for all x ∈ R^n. (18)
It turns out that instead of strong convexity (11), the weaker condition (18) is sufficient for linear convergence.

Theorem 5.2. Suppose f has a unique minimizer x∗ satisfying (18) with parameter m, and let the sequence {xk}_{k≥0} be generated by RP µ with absolute line search accuracy µ. Then for any N ≥ 0, we have

E[f(xN) − f(x∗)] ≤ (1 − m/(4L1 n))^N (f(x0) − f(x∗)) + L1² nµ²/(2m).

Proof. We use Corollary 4.2 with property (18) to get

E[f(xk+1) − f(x∗) | xk] ≤ (1 − hk/n) (f(xk) − f(x∗)) + (L1 hk²/2n) ‖xk − x∗‖² + L1 µ²/2
 ≤ (1 − hk/n + L1 hk²/(mn)) (f(xk) − f(x∗)) + L1 µ²/2.

Setting hk to m/(2L1), the term in the bracket becomes (1 − m/(4L1 n)). Now the proof continues as the proof of Theorem 5.1.
5.2 Convergence analysis for convex functions
We now prove that algorithm RP µ converges in expectation on smooth (not necessarily strongly) convex functions. The rate is, however, not linear anymore.
Theorem 5.3. Let x∗ be a minimizer of f, and let the sequence {xk}_{k≥0} be generated by RP µ with absolute line search accuracy µ. Assume there exists R such that ‖y − x∗‖ < R for all y with f(y) ≤ f(x0). Then for any N ≥ 0, we have

E[f(xN) − f(x∗)] ≤ Q/(N + 1) + N L1 µ²/4,

where Q = max{2nL1 R², f(x0) − f(x∗)}.

Proof. By assumption, there exists R ∈ R such that ‖xk − x∗‖ ≤ R for all k = 0, 1, . . . , N. With Corollary 4.2 it follows for any step size hk ≥ 0:

E[f(xk+1) − f(x∗) | xk] ≤ (1 − hk/n) (f(xk) − f(x∗)) + (L1 hk²/2n) R² + L1 µ²/2. (19)

Taking expectations we obtain

E[f(xk+1) − f(x∗)] ≤ (1 − hk/n) E[f(xk) − f(x∗)] + (hk/n)² (nL1 R²/2) + L1 µ²/2.

By Lemma A.1, the choice hk := 2n/(k + 1) for k = 0, . . . , N − 1 and summing up the rightmost error term yields

E[f(xN) − f(x∗)] ≤ Q/(N + 1) + (L1 µ²/2) ω′(N), (20)

with

ω′(N) := 1 + Σ_{k=1}^{N−1} Π_{i=k}^{N−1} (1 − 2/(i + 1)) ≤ 1 + Σ_{k=1}^{N−1} (1 − 2/N)^k ≤ N/2. (21)

We note that for ε > 0 the exact algorithm RP 0 needs O(n/ε) steps to guarantee an approximation error of ε. According to the discussion preceding Lemma 4.2, this still holds under an approximate line search with fixed relative error. In the absolute error model, however, the error bound of Theorem 5.3 becomes meaningless as N → ∞. Nevertheless, for N_opt = 2√(Q/(L1 µ²)) the bound yields

E[f(x_{N_opt}) − f(x∗)] ≤ µ √(Q L1).
5.3 Remarks
We emphasize that the constant L1 and the strong convexity parameter m that describe the behavior of the function are only needed for the theoretical analysis of RP µ. These parameters are not input parameters of the algorithm. No pre-calculation or estimation of these parameters is thus needed in order to use the algorithm on convex functions. Moreover, the presented analysis does not need parameters that describe the properties of the function on the whole domain. It is sufficient to restrict our view to the sub-level set determined by the initial iterate. Consequently, if the function parameters improve in a neighborhood of the optimum, the performance of the algorithm may be better than theoretically predicted by the worst-case analysis.
6 Computational experiments
We complement the presented theoretical analysis with extensive numerical optimization experiments on selected benchmark functions. We compare the performance of the RP µ algorithm with a number of gradient-free algorithms that share the simplicity of Random Pursuit in terms of the computational search primitives used. We also introduce a heuristic acceleration scheme for Random Pursuit, the accelerated RP µ method (ARP µ ). We finally present as reference method a steepest descent scheme that uses analytic gradient information. The test function set comprises two quadratic functions with different condition numbers, two variants of Nesterov’s smooth function [25], and a non-convex funnel-shaped function. We first detail the algorithms, their input requirements, and necessary parameter choices. We then present the definition of the test functions, describe the experimental performance evaluation protocol, and present the numerical results.
6.1 Algorithms
We now introduce the set of tested algorithms. All methods have been implemented in MATLAB and will be made publicly available on the authors' web site.

6.1.1 Random Gradient methods
We consider two randomized methods that are discussed in detail in [26]. The first algorithm, the Random Gradient Method (RG), implements the iterative scheme described in (4). A necessary ingredient for the algorithm is an oracle that provides directional derivatives. The accuracy of the directional derivatives is controlled by the finite difference step size µ. A pseudo-code representation of the approximate Random Gradient method
(RG µ ) along with a convergence proof is described in [26, Section 5]. We implemented RG µ and used the parameter setting µ = 1E − 5. A necessary input to the RG µ algorithm is the function-dependent Lipschitz constant L1 that is used to determine the step size λk = 1/(4(n+4)L1 ). We also consider Nesterov’s fast Random Gradient Method (FG) [26]. This algorithm simultaneously evolves two iterates in the search space where, in each iteration, a directional derivative is approximately computed at specific linear combinations of these points. In [26, Section 6] Nesterov provides a pseudo-code for the approximate scheme FG µ and proves convergence on strongly convex functions. We implemented the FG µ scheme and used the parameter setting µ = 1E − 5. Further necessary input parameters are both the L1 constant and the strong convexity parameter m of the respective test function.
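A minimal sketch of the random gradient scheme (4) in Python may help make the oracle concrete. This is an illustrative reimplementation, not the authors' or Nesterov's MATLAB code: the directional derivative is approximated by the finite difference (f(x + µu) − f(x))/µ with µ = 1e−5, the step size is λk = 1/(4(n+4)L1) as stated above, and the test quadratic is an assumed example.

```python
import random

def random_gradient_method(f, x0, L1, iterations, mu=1e-5,
                           rng=random.Random(0)):
    """Sketch of scheme (4): step along a random Gaussian direction,
    scaled by a finite-difference directional derivative."""
    x = list(x0)
    n = len(x)
    lam = 1.0 / (4.0 * (n + 4) * L1)   # step size from the known constant L1
    fx = f(x)
    for _ in range(iterations):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        shifted = f([xi + mu * ui for xi, ui in zip(x, u)])
        d = (shifted - fx) / mu         # approximate directional derivative
        x = [xi - lam * d * ui for xi, ui in zip(x, u)]
        fx = f(x)
    return x

# Example: sphere-like quadratic with Hessian 2*I, hence L1 = 2.
f = lambda x: sum(xi * xi for xi in x)
x = random_gradient_method(f, [4.0, -2.0, 1.0], L1=2.0, iterations=3000)
```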
6.1.2 Random Pursuit methods
In the implementation of the RP µ algorithm we choose the sampling directions uniformly at random from the hypersphere. We use the built-in MATLAB routine fminunc.m with optimset('TolX', µ) as approximate line search oracle LSapproxµ, with µ = 1E−5. Inspired by the FG scheme we also designed an accelerated Random Pursuit algorithm (ARP µ), summarized in Algorithm 2. The structure of this algorithm is similar to Nesterov's FG µ scheme; in ARP µ, however, the step size calculation is provided by the line search oracle.

Algorithm 2 Accelerated Random Pursuit (ARP µ)
Input: N, x0, m, L1
1: Set θ = 1/(L1 n), γ0 ≥ m, v0 = x0
2: for k = 0 to N do
3:   Compute βk > 0 satisfying θ^{−1} βk² = (1 − βk)γk + βk m =: γk+1
4:   Set λk = βk m/γk+1, δk = βk γk/(γk + βk m), and yk = (1 − δk)xk + δk vk
5:   uk ∼ N(0, In)
6:   Set xk+1 = yk + LSapproxµ(yk, uk) uk
7:   Set vk+1 = (1 − λk)vk + λk yk + (LSapproxµ(yk, uk)/βk) uk
8: end for

Although we currently lack theoretical guarantees for this scheme, we report its experimental performance. Analogously to the FG µ algorithm, the accelerated RP µ algorithm needs the function-dependent parameters L1 and m as input. The line search oracle is identical to the one in standard Random Pursuit. Both RP µ algorithms are invariant under strictly monotone transformations g(·) of the objective function, which makes them applicable to a wider class of (possibly non-convex) functions.
6.1.3 Adaptive step size random search methods
The previous randomized schemes proceed along random directions either with pre-calculated step sizes or with line search oracles. In adaptive step size random search methods the step size is dynamically controlled so as to approximately maintain a certain probability p of finding an improving iterate. Schumer and Steiglitz [32] were among the first to propose such a scheme. In the bio-inspired optimization literature, the method is known as the (1+1)-Evolution Strategy (ES) [31]. Jägersküpper [11] provides a convergence proof of the ES on convex quadratic functions. We here consider the generic ES scheme summarized in Algorithm 3.

Algorithm 3 (1+1)-Evolution Strategy (ES) with adaptive step size control
Input: N, x0, σ0, probability of improvement p = 0.27
1: Set cs = e^(1/3), cf = cs · e^(−p/(1−p)).
2: for k = 0 to N do
3:   uk ∼ N(0, In)
4:   if f(xk + σk uk) ≤ f(xk) then
5:     Set xk+1 = xk + σk uk and σk+1 = cs · σk.
6:   else
7:     Set xk+1 = xk and σk+1 = cf · σk.
8:   end if
9: end for
Depending on the specific random direction generator and the underlying test function, different optimality conditions can be formulated for the probability p. Schumer and Steiglitz [32] suggest the setting p = 0.27, which is also used in this work. For all considered test functions the initial step size σ0 has been determined experimentally so as to guarantee the targeted p at the start (see Table 6 for the respective values). The ES algorithm shares RPµ's invariance under strictly monotone transformations of the objective function.
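Algorithm 3 admits a direct transcription; the sketch below uses the step-size constants as stated above and a toy sphere objective (our own example, not taken from the paper's experimental setup).

```python
import random, math

def evolution_strategy(f, x0, sigma0, iters=5000, p=0.27):
    """(1+1)-ES with adaptive step size as in Algorithm 3: multiply
    sigma by cs on success and by cf on failure."""
    cs = math.exp(1.0 / 3.0)
    cf = cs * math.exp(-p / (1.0 - p))
    x, sigma, fx = list(x0), sigma0, f(x0)
    for _ in range(iters):
        u = [random.gauss(0.0, 1.0) for _ in range(len(x))]
        y = [xi + sigma * ui for xi, ui in zip(x, u)]
        fy = f(y)
        if fy <= fx:                       # improving offspring: accept, enlarge sigma
            x, fx, sigma = y, fy, cs * sigma
        else:                              # failure: keep parent, shrink sigma
            sigma = cf * sigma
    return x

# example run on the shifted sphere in n = 8 dimensions
random.seed(0)
sphere = lambda z: 0.5 * sum((zi - 1.0) ** 2 for zi in z)
xes = evolution_strategy(sphere, [0.0] * 8, sigma0=1.0)
```

Note that one iteration costs exactly one function evaluation, which is why the iteration and FES counts of the ES coincide in the tables below.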
6.1.4 Standard Gradient Method
In order to compare the numerical efficiency of the randomized zeroth-order schemes with that of first-order methods, we also consider the standard Gradient Method (GM) as outlined in (2). The step size is set to λk = 1/L1 [25]. The function-dependent constant L1 is thus part of the input to the GM algorithm.
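In the same notation, the Gradient Method with constant step size 1/L1 is a two-line loop. This is a minimal sketch; the function name is our own.

```python
def gradient_method(grad, x0, L1, iters=100):
    """Standard Gradient Method: x_{k+1} = x_k - (1/L1) * grad f(x_k)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - gi / L1 for xi, gi in zip(x, g)]
    return x

# on the shifted sphere (grad f(x) = x - 1, L1 = 1) a single step is exact
x = gradient_method(lambda z: [zi - 1.0 for zi in z], [0.0] * 4, 1.0, iters=1)
```

This one-step behavior on the sphere is consistent with the single GM iteration reported for f1 in Table 2.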
6.2 Benchmark functions
We now present the set of test functions used for the numerical performance evaluation of the different optimization schemes. We present the three function classes and detail the specific function instances and their properties.

6.2.1 Quadratic functions
We consider quadratic test functions of the form

f(x) = (1/2) (x − 1)ᵀ Q (x − 1) ,   (22)

where x ∈ R^n and Q ∈ R^(n×n) is a diagonal matrix. For given L1 the diagonal entries are chosen in the interval [1, L1]. The minimizer of this function class is x* = 1 with f(x*) = 0, and the derivative is ∇f(x) = Q(x − 1). We consider two different matrix instances. Setting Q = In, the n-dimensional identity matrix, the function reduces to the shifted sphere function, denoted here by f1. In order to obtain a quadratic function with anisotropic axis lengths we use a matrix Q whose first n/2 diagonal entries are equal to L1 and whose remaining entries are set to 1. This ellipsoidal function is denoted by f2.

6.2.2 Nesterov's smooth function
We also consider Nesterov's smooth function as introduced in Nesterov's textbook [25]. The generic version of this function reads

f3(x) = (L1/4) [ (1/2)( x1² + Σ_{i=1}^{n−1} (x_{i+1} − x_i)² + x_n² ) − x1 ] .   (23)

This function has derivative ∇f3(x) = (L1/4)(Ax − e1), where A is the n×n tridiagonal matrix

A = ( 2 −1 0 ... 0 ; −1 2 −1 ... 0 ; ... ; 0 ... −1 2 −1 ; 0 ... 0 −1 2 )

and e1 = (1, 0, ..., 0)ᵀ. The optimal solution is located at

x*_i = 1 − i/(n+1) ,  for i = 1, ..., n.
This function is not strongly convex. Adding a regularization term leads, however, to a strongly convex function with parameter m. Given L1 ≥ m > 0, the regularized function reads

f4(x) = ((L1 − m)/4) [ (1/2)( x1² + Σ_{i=1}^{n−1} (x_{i+1} − x_i)² + x_n² ) − x1 ] + (m/2) ‖x‖² .   (24)

This function is strongly convex with parameter m. Its derivative is ∇f4(x) = ( ((L1 − m)/4) A + m In ) x − ((L1 − m)/4) e1, and the optimal solution x* satisfies ( A + (4m/(L1 − m)) In ) x* = e1.

6.2.3 Funnel function
We finally consider the following funnel-shaped function

f5(x) = log( 1 + 10 √((x − 1)ᵀ(x − 1)) ) ,   (25)

where x ∈ R^n. The minimizer of this function is x* = 1 with f5(x*) = 0. Its derivative for x ≠ 1 is ∇f5(x) = 10/(1 + 10 √((x − 1)ᵀ(x − 1))) · sign(x − 1). A one-dimensional graph of f5 is shown in the left panel of Figure 6. The function f5 arises from a strictly monotone transformation of f1 and thus belongs to the class of strictly quasi-convex functions.
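For reference, the benchmark functions can be written down directly. The following is a plain-Python transcription of (22)–(25) under the default parameters used in the experiments; vector arguments are plain lists.

```python
import math

def f1(x):
    """Shifted sphere: (1/2)(x-1)^T (x-1)."""
    return 0.5 * sum((xi - 1.0) ** 2 for xi in x)

def f2(x, L1=1000.0):
    """Ellipsoid (22): first n/2 diagonal entries of Q equal L1, the rest 1."""
    n = len(x)
    return 0.5 * sum((L1 if i < n // 2 else 1.0) * (xi - 1.0) ** 2
                     for i, xi in enumerate(x))

def f3(x, L1=1000.0):
    """Nesterov's smooth function (23)."""
    n = len(x)
    s = x[0] ** 2 + x[n - 1] ** 2 + sum((x[i + 1] - x[i]) ** 2
                                        for i in range(n - 1))
    return (L1 / 4.0) * (0.5 * s - x[0])

def f4(x, L1=1000.0, m=1.0):
    """Regularized strongly convex variant (24)."""
    n = len(x)
    s = x[0] ** 2 + x[n - 1] ** 2 + sum((x[i + 1] - x[i]) ** 2
                                        for i in range(n - 1))
    return ((L1 - m) / 4.0) * (0.5 * s - x[0]) + (m / 2.0) * sum(xi ** 2 for xi in x)

def f5(x):
    """Funnel (25): strictly monotone transformation of the sphere."""
    r = math.sqrt(sum((xi - 1.0) ** 2 for xi in x))
    return math.log(1.0 + 10.0 * r)
```

One can check numerically that x*_i = 1 − i/(n+1) minimizes f3, in line with the optimality condition Ax* = e1 stated above.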
6.3 Numerical optimization results
To illustrate the performance of Random Pursuit in comparison with the other randomized methods we here present and discuss the key numerical results. For all numerical tests we follow the identical protocol. All algorithms use the starting point x0 = 0 for all test functions. In order to compare the performance of the different algorithms across different test functions, we follow Nesterov's approach [26] and report relative solution accuracies with respect to the scale S ≈ (1/2) L1 R², where R := ‖x0 − x*‖ is the Euclidean distance between the starting point and the optimal solution of the respective function. The properties of the four convex and continuously differentiable test functions and the quasi-convex funnel function, along with the upper bounds on R² and the corresponding scales S, are summarized in Table 1.
      Name             function class     L1     m    R²           S
f1    Sphere           strongly convex    1      1    n            n/2
f2    Ellipsoid        strongly convex    1000   1    n            500n
f3    Nesterov smooth  convex             1000   -    (n+1)/3      500(n+1)/3
f4    Nesterov strong  strongly convex    1000   1    √1000/4      125·√1000
f5    Funnel           not convex         -      -    n            n/2

Table 1: Test functions with parameters L1, m, R² and the used scale S.
Due to the inherent randomness of a single search run we perform 25 runs with different random number seeds for each pair of problem instance and algorithm. We compare the different methods based on two record values: (i) the minimal, mean, and maximal number of iterations and (ii) the minimal, mean, and maximal number of function evaluations (FES) needed to reach a certain solution accuracy. While the former records serve as a means to compare the number of oracle calls in the different methods, the latter only consider evaluations of the objective function as relevant performance cost. It is evident that measuring the performance of the algorithms in terms of oracle calls favors Random Pursuit, because the line search oracle "does more work" than an oracle that, for instance, provides a directional derivative. For Random Gradient methods the number of FES is just twice the number of iterations when a first-order finite difference scheme is used for directional derivatives. For the ES algorithm the numbers of iterations and FES are identical. For Random Pursuit methods the relation between iterations and FES depends on the specific test function, the relative accuracy parameter µ, and the actual implementation of the line search, i.e., whether golden section or binary search is used. For the comparison of the randomized schemes with the first-order GM we also discard a factor of n in the number of iterations in order to account for the reduced information available to the random methods.

6.3.1 Performance on the quadratic test functions for n ≤ 256
We first consider the two quadratic test functions in n = 2², ..., 2⁸ dimensions. Table 2 summarizes the minimum, maximum, and mean number of iterations (in blocks of size n) needed for each algorithm to reach the absolute accuracy 1.91·10⁻⁶ S on the sphere function f1. For the first-order GM algorithm the absolute number of iterations is reported.

        RP              RG              GM    FG              ARP             ES
n       min max mean    min max mean          min max mean    min max mean    min max mean
4       5   17  10      40  65  53      1     31  49  39      5   17  10      28  46  38
8       8   16  12      39  53  44      1     30  40  35      5   13  11      28  43  35
16      10  14  12      33  41  37      1     30  37  33      10  14  12      30  42  36
32      11  14  12      31  36  33      1     28  35  31      11  16  12      33  41  37
64      12  14  13      30  34  32      1     28  33  31      12  14  13      33  41  37
128     12  14  13      30  32  31      1     29  32  31      12  14  13      35  40  37
256     13  14  13      30  31  30      1     29  31  30      13  14  13      35  40  37

Table 2: Recorded minimum, maximum, and mean #iterations/n on the sphere function f1 to reach a relative accuracy of 1.91·10⁻⁶. For GM the absolute number of iterations is recorded.

Three key observations can be made from these data. First, all algorithms show the theoretically expected linear scaling of the run time with the dimension for strongly convex functions. Second, Random Pursuit and accelerated Random Pursuit achieve almost identical performance. The same is true for
the algorithm pair RG/FG. Third, the Random Pursuit algorithms outperform all other zeroth-order methods in terms of the number of iterations. Only the last observation changes when the number of FES is considered. Table 3 summarizes the number of function evaluations (in blocks of size n) for all algorithms on f1.

        RP              RG              FG              ARP             ES
n       min max mean    min max mean    min max mean    min max mean    min max mean
4       20  69  39      80  131 106     62  99  78      20  69  39      28  46  38
8       34  65  47      78  105 87      59  81  70      22  53  43      28  43  35
16      38  54  48      65  81  73      61  74  66      40  56  47      30  42  36
32      45  57  50      62  71  66      57  69  62      43  62  50      33  41  37
64      47  57  52      60  68  64      57  66  61      50  56  52      33  41  37
128     50  56  53      59  64  62      58  64  61      51  56  54      35  40  37
256     56  63  59      59  62  61      58  62  60      56  63  59      35  40  37

Table 3: Recorded minimum, maximum, and mean #FES/n on the sphere function f1 to reach a relative accuracy of 1.91·10⁻⁶.

We see that the RPµ algorithms outperform the Random Gradient methods for low dimensions and perform equally well for n = 256. However, the adaptive step size ES algorithm outperforms all other methods for n ≥ 8. The line search oracle in the RPµ algorithms consumes on average four FES per iteration. We also observe that the gap between the minimum and maximum number of FES shrinks with increasing dimension for all methods. For the high-conditioned ellipsoidal function f2 we observe a genuinely different behavior of the algorithms. In Figure 1 we graphically show for each algorithm the mean number of iterations (in blocks of size n) needed to reach the absolute accuracy 1.91·10⁻⁶ S on f2. The minimum, maximum, and mean numbers of iterations are reported in the Appendix in Table 7. We again observe the theoretically expected linear scaling of the number of iterations with the dimension. The mean number of iterations now spans two orders of magnitude across the different algorithms. Standard Random Pursuit outperforms the RG and the ES algorithm as well as the first-order GM scheme. Moreover, the accelerated RPµ scheme outperforms the FG scheme by a factor of 4. All methods show, however, an increased overall run time due to the high condition number of the quadratic form. This is also reflected in the increased number of FES needed by the line search oracle in the RPµ algorithms, which now consumes on average 12–14 FES per iteration. Figure 2 shows for each algorithm the recorded mean number of FES (in blocks of size n) on f2. We observe that Random Pursuit still outperforms Random Gradient for small dimensions but needs a comparable number of FES for n ≥ 64 (around 30,000 FES in blocks of n). The ES, the ARPµ, and the FG algorithm need an order of magnitude fewer FES. The accelerated RPµ is only outperformed by the FG algorithm.
The performance of the ES algorithm is, however, remarkable given that it does not need information about the parameters L1 and m, which are of fundamental importance for the accelerated schemes.
Figure 1: Average number of iterations (in log scale) vs. dimension n on the ellipsoidal function f2 to reach a relative accuracy of 1.91·10⁻⁶. Further data are available in Table 7.

6.3.2 Performance on the full benchmark set for n = 64
We now illustrate the behavior of the different algorithms on the full benchmark set for fixed dimension n = 64. We observed similar qualitative behavior for all other dimensions. Table 4 contains the number of iterations needed to reach the scale-dependent accuracy 1.91·10⁻⁶ for all algorithms. We observe that Random Pursuit outperforms the RG, the GM, and the ES algorithm, and that the ARPµ algorithm outperforms all tested schemes in terms of the number of iterations on all functions except the sphere function.

            RP                 RG                    GM     FG                ARP              ES
function    min  max  mean     min   max   mean            min  max  mean    min  max  mean   min   max   mean
f1          12   14   13       30    34    32         1    30   34   32      12   14   13     33    41    37
f2          1899 2096 2001     16601 17333 16868   3934    990  1079 1038    233  250  242    5451  5954  5729
f3          2068 2191 2136     18922 19075 19004   4474    892  970  942     192  678  473    5766  6050  5916
f4          954  1023 995      8727  8995  8854    2086    441  534  458     137  188  159    2651  2854  2751
f5          26   30   28       -     -     -          -    -    -    -       26   30   28     73    85    78

Table 4: Average number of iterations in blocks of size n = 64 to reach the scale-dependent accuracy 1.91·10⁻⁶. For GM the exact number of iterations is reported. Observed minimum iterations across all algorithms are marked in bold face for each function.

The performance of the ES scheme is similar to that of the GM algorithm. We consistently observe an improved performance of all algorithms on the regularized strongly convex function f4 as compared to its convex counterpart f3. This expected behavior is most pronounced for the ARPµ scheme where, on average, the number of iterations is reduced to 159/473 ≈ 1/3.

Figure 2: Average number of FES (in log scale) vs. dimension n on the ellipsoidal function f2 to reach a relative accuracy of 1.91·10⁻⁶. Further data are available in Table 8.

For function f4 we illustrate the convergence behavior of the different algorithms in Figure 3. After a short algorithm-dependent initial phase we observe linear convergence of all algorithms for fixed dimension, i.e., a constant reduction of the logarithm of the distance to the minimum per iteration. We also observe that accelerated Random Pursuit consistently outperforms standard Random Pursuit for all measured accuracies on f4 (see Table 12 for the corresponding numerical data). This behavior is less pronounced for the function pair f1/f2 as shown in Figure 4. On f1 both Random Pursuit schemes have identical progress rates that are also consistent with the theoretically predicted one. On f2 Random Pursuit outperforms the accelerated scheme for low accuracies (see also Table 9 for the numerical data) but is quickly overtaken due to the faster progress rate of the accelerated scheme. We also observe that the theoretically predicted worst-case progress rate (dotted line in the right panel of Figure 4) does not reflect the true progress on this test function. Comparison of the numerical results on the function pair f1/f5 (see Figure 5) demonstrates the expected invariance of the Random Pursuit algorithms and the ES scheme under strictly monotone transformations. These algorithms enjoy the same convergence behavior (up to small random variations) while the Random Gradient schemes fail to converge to the target accuracy. We also report the performance of the different algorithms in terms of the number of FES needed to reach the target accuracy of 1.91·10⁻⁶ for the different test functions. For all algorithms the minimum, maximum, and average numbers of FES are recorded in Table 5. We observe that the RPµ algorithm outperforms the standard Random Gradient method on all tested functions. However, Random Pursuit is not competitive with the accelerated schemes and the ES algorithm. The accelerated RPµ scheme is
Figure 3: Average accuracy (in log scale) vs. number of iterations (in blocks of size n) for all algorithms on f4 in n = 64 dimensions. Further data are available in the Appendix in Table 12.

            RP                    RG                     FG                 ARP                ES
function    min   max   mean      min   max   mean       min  max  mean     min  max  mean     min   max   mean
f1          47    57    52        60    68    64         57   66   61       50   56   52       33    41    37
f2          27723 30272 29071     33202 34667 33736      1980 2159 2077     3035 3247 3159     5451  5954  5729
f3          25520 27034 26351     37844 38150 38008      1785 1939 1885     2199 8149 5609     5766  6050  5916
f4          11629 12482 12122     17455 17990 17708      883  1069 916      1557 2134 1825     2651  2854  2751
f5          338   384   360       -     -     -          -    -    -        342  399  361      73    85    78

Table 5: Average number of FES in blocks of size n = 64 to reach the scale-dependent accuracy 1.91·10⁻⁶. Observed minimum FES across all algorithms are marked in bold face for each function.

only outperformed by the FG algorithm. The latter scheme shows particularly good performance on the convex function f3 with considerably lower variance. For functions f2–f4 the RPµ algorithms need around 12–15 FES per line search oracle call. We emphasize again that the performance of the adaptive step size ES scheme is remarkable given that it does not need any function-specific parametrization. A comparison to the parameter-free Random Pursuit scheme shows that the ES needs around four times fewer FES on functions f2–f4. We also remark that Random Pursuit with discrete sampling, i.e., using the set of signed unit vectors for sample generation (see Section 2.1), yields numerical results on the present benchmark that are consistent with our theoretical predictions. We observed improved performance of Random Pursuit with discrete sampling on the function triplet f1/f2/f5. This is expected, as the coordinate system of these functions coincides with the standard basis; algorithms that move along the standard coordinate directions are thus favored. On the function pair f3/f4 we do not see any significant deviation from the presented performance results.
Figure 4: Numerical convergence rate of standard and accelerated Random Pursuit on f1 (left panel) and f2 (right panel) in n = 64 dimensions. For both instances the theoretically predicted worst-case progress rate (dotted line) is shown for comparison.

A comparison between theory and experiments shows that (i) the observed convergence of Random Pursuit is about two times faster than the worst-case results obtained in Section 5, (ii) the scaling with respect to the parameter L1 (sphere vs. ellipsoid) is better than expected from theory, and (iii) the invariance properties of the RPµ algorithms and the ES scheme are confirmed. We finally note that the present numerical results for the Random Gradient methods are consistent with the ones presented in [26].
7 Discussion and Conclusion
In this article we have analyzed the Random Pursuit (RPµ) algorithm for zeroth-order optimization of convex functions. The algorithm iteratively computes an approximate solution to the optimization problem by repeated optimization over a randomly chosen one-dimensional subspace. The one-dimensional subproblem is solved by an exact or approximate line search oracle. The algorithm only uses zeroth-order information about the objective function and does not need any problem-specific parametrization. We have derived a convergence proof and convergence rates for Random Pursuit on convex functions. We have used a quadratic upper bound technique to bound the expected single-step progress of the algorithm. This results in global linear convergence for strongly convex functions and convergence of the order 1/k for general convex functions. For line search oracles with relative error µ the same results have been obtained with convergence rates reduced by a factor of 1/(1 − µ). For inexact line search with absolute
error the convergence rates stay unaltered, and convergence is ensured on strongly convex functions. Convergence on general convex functions can be established if the number of steps does not exceed a suitable constant that depends on the properties of the function and the dimensionality. The number of iterations required by Random Pursuit exceeds that of the standard (first-order) Gradient Method by a factor of n. Jägersküpper showed that no better performance can be expected for strongly convex functions [12]. He derived a lower bound for algorithms of the form (3) where at each iteration the step size along the random direction is chosen so as to minimize the distance to the minimizer x*. On sphere functions f(x) = (x − x*)ᵀ(x − x*) Random Pursuit coincides with the described scheme, thus achieving the lower bound. The numerical experiments showed that (i) standard Random Pursuit is effective on strongly convex functions with moderate condition number, and (ii) the accelerated scheme is comparable to Nesterov's fast gradient method and outperforms the ES algorithm. The experimental results also revealed that (i) the observed convergence of RPµ is around two times faster than the worst-case theoretical prediction, (ii) the scaling with respect to the parameter L1 is better than expected from theory, and (iii) both continuous and discrete sampling can be employed in Random Pursuit. We confirmed the invariance of the RPµ algorithms and the ES under monotone transformations of the objective function on the quasi-convex funnel-shaped function f5, where Random Gradient algorithms fail. We also emphasize that the observed performance of the ES scheme is remarkable given that it does not need any function-specific input parameters.

Figure 5: Numerical convergence rate of the RPµ, the ES, and the RG scheme on f1 and f5 in n = 64 dimensions. The accuracy is measured in terms of the logarithmic distance to the optimum, log(‖xk − x*‖2).
The present theoretical and experimental results hint at a number of potential enhancements of standard Random Pursuit in future work. First, RPµ's convergence rate depends on the function-specific parameter L1 that bounds the curvature of the objective function. Any reduction of this dependency would imply faster convergence on a large class of functions. The empirical results on the function pair f1/f2 (see Table 9) also suggest that complicated accelerated schemes do not present any significant advantage on functions with small constant L1. It is conceivable that Random Pursuit can incorporate a mechanism to learn second-order information about the function "on the fly", thus improving the conditioning of the original optimization problem and potentially reducing it to the L1 ≈ 1 case. This may be possible using techniques from randomized Quasi-Newton approaches [2, 20] or differential geometry [5]. It is noteworthy that heuristic versions of such an adaptation mechanism have proved extremely useful in practice for adaptive step size algorithms [16, 10, 22]. Second, we have not considered Random Pursuit for constrained optimization problems of the form:

min f(x)   subject to x ∈ K ,   (26)
where K ⊂ R^n is a convex set. The key challenge is how to treat iterates xk+1 = xk + LSapprox(xk, u)u generated by the line search oracle that lie outside the domain K. A classic idea is to apply a projection operator πK and use the resulting x′k+1 := πK(xk+1) as the next iterate. However, finding a projection onto a convex set (except for simple bodies such as hyper-parallelepipeds) can be as difficult as the original optimization problem. Moreover, it is an open question whether general convergence can be ensured, and what convergence rates can be achieved. Another possibility is to constrain the line search to the intersection of the line and the convex body K. In this case, it is evident that one can only expect exponentially slow convergence for this method. Consider the linear function f(x) = 1ᵀx and K = R^n_+. Once an iterate xk lies at the boundary ∂K of the domain, say the first coordinate of xk is zero, then only directions u with positive first coordinate may lead to an improvement. As soon as a constant fraction of the coordinates are zero, the probability of finding an improving direction is exponentially small. Karmanov [14] proposed the following combination of projection and line search constraining: first, a random point y at some fixed distance from the current iterate is drawn uniformly at random and then projected onto the set K. A constrained line search is then performed along the line through the current iterate xk and πK(y). It remains open to study the convergence rate of this method. Finally, we envision convergence guarantees and provable convergence rates for Random Pursuit on more general function classes. The invariance of the line search oracle under strictly monotone transformations of the objective function already implied that Random Pursuit converges on certain
strictly quasi-convex functions (see the right panel of Figure 6 for the graph of such an instance). It also seems within reach to derive convergence guarantees for Random Pursuit on the class of globally convex (or δ-convex) functions [9] or on convex functions with bounded perturbations [27]. This may be achieved by appropriately adapting line search methods to these function classes. In summary, we believe that the theoretical and experimental results on Random Pursuit represent a promising first step toward the design of competitive derivative-free optimization methods that are easy to implement, possess theoretical convergence guarantees, and are useful in practice.

Figure 6: Right panel: Graph of function f5 in 1D. Left panel: Graph of a globally convex function fGC.
Acknowledgments. We sincerely thank Martin Jaggi for several helpful discussions.
References

[1] R. L. Anderson, Recent advances in finding best operating conditions, Journal of the American Statistical Association, 48 (1953), pp. 789–798.

[2] B. Betro and L. De Biase, A Newton-like method for stochastic optimization, in Towards Global Optimization, vol. 2, North-Holland, 1978, pp. 269–289.

[3] H. G. Beyer, The Theory of Evolution Strategies, Natural Computing, Springer-Verlag, New York, NY, USA, 2001.

[4] S. H. Brooks, A discussion of random methods for seeking maxima, Operations Research, 6 (1958), pp. 244–251.

[5] H. B. Cheng, L. T. Cheng, and S. T. Yau, Minimization with the affine normal direction, Comm. Math. Sci., 3 (2005), pp. 561–574.

[6] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Book Series on Optimization, SIAM, 2009.

[7] E. Hazan, Sparse approximate solutions to semidefinite programs, in Proceedings of the 8th Latin American Conference on Theoretical Informatics, LATIN'08, Berlin, Heidelberg, 2008, Springer-Verlag, pp. 306–316.

[8] R. Heijmans, When does the expectation of a ratio equal the ratio of expectations?, Statistical Papers, 40 (1999), pp. 107–115.

[9] T. C. Hu, V. Klee, and D. Larman, Optimization of globally convex functions, SIAM Journal on Control and Optimization, 27 (1989), pp. 1026–1047.

[10] C. Igel, T. Suttorp, and N. Hansen, A computational efficient covariance matrix update and a (1+1)-CMA for evolution strategies, in GECCO '06: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, New York, NY, USA, 2006, ACM, pp. 453–460.

[11] J. Jägersküpper, Rigorous runtime analysis of the (1+1) ES: 1/5-rule and ellipsoidal fitness landscapes, in Foundations of Genetic Algorithms, A. Wright, M. Vose, K. De Jong, and L. Schmitt, eds., vol. 3469 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2005, pp. 356–361.
[12] ———, Lower bounds for hit-and-run direct search, in Stochastic Algorithms: Foundations and Applications, J. Hromkovič, R. Královič, M. Nunkesser, and P. Widmayer, eds., vol. 4665 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2007, pp. 118–129.
[13] V. G. Karmanov, Convergence estimates for iterative minimization methods, USSR Computational Mathematics and Mathematical Physics, 14 (1974), pp. 1–13.

[14] ———, On convergence of a random search method in convex minimization problems, Theory of Probability and its Applications, 19 (1974), pp. 788–794. In Russian.
[15] D. C. Karnopp, Random search techniques for optimization problems, Automatica, 1 (1963), pp. 111–121.

[16] G. Kjellström and L. Taxen, Stochastic optimization in system design, IEEE Trans. Circ. and Syst., 28 (1981).

[17] A. Kleiner, A. Rahimi, and M. I. Jordan, Random conic pursuit for semidefinite programming, in Neural Information Processing Systems, 2010.

[18] T. G. Kolda, R. M. Lewis, and V. Torczon, Optimization by direct search: New perspectives on some classical and modern methods, SIAM Review, 45 (2004), pp. 385–482.

[19] V. N. Krutikov, On the rate of convergence of the minimization method along vectors in given directional system, USSR Comput. Maths. Phys., 23 (1983), pp. 154–155. In Russian.

[20] D. Leventhal and A. S. Lewis, Randomized Hessian estimation and directional search, Optimization, 60 (2011), pp. 329–345.

[21] R. L. Maybach, Solution of optimal control problems on a high-speed hybrid computer, Simulation, 7 (1966), pp. 238–245.

[22] C. L. Müller and I. F. Sbalzarini, Gaussian adaptation revisited – an entropic view on covariance matrix adaptation, in EvoApplications, C. Di Chio et al., eds., no. 6024 in Lecture Notes in Computer Science, Springer, 2010, pp. 432–441.

[23] V. A. Mutseniyeks and L. A. Rastrigin, Extremal control of continuous multi-parameter systems by the method of random search, Eng. Cybernetics, 1 (1964), pp. 82–90.
[24] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.

[25] Y. Nesterov, Introductory Lectures on Convex Optimization, Kluwer, Boston, 2004.

[26] ———, Random gradient-free minimization of convex functions, tech. report, ECORE, 2011.
[27] H. X. Phu, Minimizing convex functions with bounded perturbation, SIAM Journal on Optimization, 20 (2010), pp. 2709–2729.

[28] B. Polyak, Introduction to Optimization, Optimization Software, Inc., Publications Division, New York, 1987.

[29] G. Rappl, On linear convergence of a class of random search algorithms, ZAMM – Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 69 (1989), pp. 37–45.

[30] L. A. Rastrigin, The convergence of the random search method in the extremal control of a many parameter system, Automation and Remote Control, 24 (1963), pp. 1337–1342.

[31] I. Rechenberg, Evolutionsstrategie; Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart–Bad Cannstatt, 1973.

[32] M. Schumer and K. Steiglitz, Adaptive step size random search, IEEE Transactions on Automatic Control, 13 (1968), pp. 270–276.

[33] S. Vempala, Recent progress and open problems in algorithmic convex geometry, in IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2010), K. Lodaya and M. Mahajan, eds., vol. 8 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, 2010, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, pp. 42–64.

[34] A. A. Zhigljavsky and A. G. Žilinskas, Stochastic Global Optimization, Springer-Verlag, Berlin, Germany, 2008.

[35] R. Zieliński and P. Neumann, Stochastische Verfahren zur Suche nach dem Minimum einer Funktion, Akademie-Verlag, Berlin, Germany, 1983.
S. U. STICH, C. L. MÜLLER AND B. GÄRTNER
A Lemmas
Lemma A.1. Let {ft}t∈N be a sequence with ft ∈ R+. Suppose

    ft+1 ≤ (1 − θ/t) ft + Cθ²/t²,   for t ≥ 1,

for some constants θ > 1 and C > 0. Then it follows by induction that

    ft ≤ Q(θ)/t,   where Q(θ) = max{θ²C/(θ − 1), f1}.

A very similar result was stated without proof in [24]; Hazan [7] uses the same argument.

Proof. For t = 1 it holds that f1 ≤ Q(θ) by definition of Q(θ). Suppose now that ft ≤ Q(θ)/t for some t ≥ 1. If Q(θ) = θ²C/(θ − 1), then

    ft+1 ≤ θ²C(t − θ)/((θ − 1)t²) + Cθ²/t² = θ²C(t − 1)/((θ − 1)t²) ≤ θ²C/((θ − 1)(t + 1)),

where the last inequality uses (t − 1)(t + 1) ≤ t². If on the other hand Q(θ) = f1, then

    f1 ≥ θ²C/(θ − 1) ⇔ (θ − 1)f1 ≥ θ²C,

and it follows that

    ft+1 ≤ (t − θ)f1/t² + Cθ²/t² = (t − 1)f1/t² + (θ²C − (θ − 1)f1)/t² ≤ f1/(t + 1).
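As a quick numerical sanity check of Lemma A.1 (not part of the paper), the recursion can be iterated with equality, which is the worst admissible case once 1 − θ/t ≥ 0, and compared against the bound Q(θ)/t. The values of θ, C, and the starting index t0 below are arbitrary illustrative choices with t0 ≥ θ, so that the inductive step applies.

```python
# Sanity check of Lemma A.1: iterate f_{t+1} = (1 - theta/t) f_t + C theta^2 / t^2
# (equality = worst admissible case for 1 - theta/t >= 0) and verify
# f_t <= Q(theta)/t with Q(theta) = theta^2 C / (theta - 1).
# theta, C, t0 are illustrative choices, not values from the paper.

def check_lemma(theta=2.5, C=1.3, t0=3, T=100_000):
    Q = theta**2 * C / (theta - 1.0)   # the relevant branch of Q(theta)
    f = Q / t0                         # start at the largest value the claim allows
    for t in range(t0, T):
        # bound of Lemma A.1, with a tiny tolerance for floating-point rounding
        assert f <= Q / t * (1 + 1e-12), (t, f, Q / t)
        f = (1.0 - theta / t) * f + C * theta**2 / t**2
    return True

print(check_lemma())
```

Even when started at the largest value the claim allows, the iterates stay below Q(θ)/t, mirroring the strict slack Q/((t + 1)t²) that appears in the inductive step.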
B Tables

B.1 Initial σ0 of the ES algorithm for all test functions
Table 6 reports the empirically determined optimal initial step sizes σ0 used as input to the ES algorithm.

dim |    f1    |    f2    |    f3     |    f4
  4 | 0.79158  | 1.3897   | 0.2054    | 0.20395
  8 | 0.49167  | 0.78761  | 0.08922   | 0.088145
 16 | 0.32692  | 0.49500  | 0.04134   | 0.041273
 32 | 0.22292  | 0.32547  | 0.019911  | 0.019905
 64 | 0.15542  | 0.22243  | 0.0097212 | 0.0097127
128 | 0.10925  | 0.15638  | 0.0048305 | 0.0048335
256 | 0.076658 | 0.10902  | 0.0024171 | 0.0024114

Table 6: The initial values of the step size σ0 for the (1+1)-ES on the test functions for various dimensions.
B.2 Data for the ellipsoid test functions for n ≤ 256

Tables 7 and 8 report the numerical data used to produce Figures 1 and 2.

  n | RP             | RG                | GM   | FG             | ARP         | ES             (cells: min/max/mean)
  4 | 236/472/364    | 28322/33608/31549 | 3934 | 966/1575/1282  | 124/682/322 | 2491/4557/3784
  8 | 787/1241/1088  | 22461/24610/23666 | 3934 | 981/1262/1155  | 174/285/226 | 3786/5799/4906
 16 | 1326/1763/1624 | 18981/20403/19805 | 3934 | 975/1164/1076  | 218/256/232 | 4967/6034/5400
 32 | 1769/2026/1880 | 17381/18393/17858 | 3934 | 968/1102/1048  | 221/256/237 | 5183/6145/5625
 64 | 1899/2096/2001 | 16601/17333/16868 | 3934 | 990/1079/1038  | 233/250/242 | 5451/5954/5729
128 | 1987/2145/2076 | 16183/16721/16376 | 3934 | 978/1061/1030  | 237/252/245 | 5512/5964/5753
256 | 2063/2173/2117 | 15960/16276/16115 | 3934 | 1007/1053/1026 | 238/251/245 | 5603/6065/5805

Table 7: Ellipsoid function f2 to accuracy 1.91·10−6, S = 3200, L = 1000. #iterations/n (GM: #iterations).
  n | RP                | RG                | FG             | ARP            | ES             (cells: min/max/mean)
  4 | 3155/6236/4775    | 56643/67217/63097 | 1933/3150/2564 | 1408/6983/3631 | 2491/4557/3784
  8 | 11043/17124/15216 | 44923/49221/47331 | 1963/2525/2310 | 2113/3483/2774 | 3786/5799/4906
 16 | 19182/25225/23320 | 37962/40806/39610 | 1951/2329/2152 | 2730/3239/2930 | 4967/6034/5400
 32 | 25768/29258/27302 | 34762/36785/35715 | 1937/2203/2097 | 2870/3243/3043 | 5183/6145/5625
 64 | 27723/30272/29071 | 33202/34667/33736 | 1980/2159/2077 | 3035/3247/3159 | 5451/5954/5729
128 | 28791/30757/29894 | 32365/33441/32753 | 1956/2121/2059 | 3139/3354/3259 | 5512/5964/5753
256 | 29363/30691/30016 | 31920/32552/32230 | 2013/2106/2053 | 3210/3411/3322 | 5603/6065/5805

Table 8: Ellipsoid function f2 to accuracy 1.91·10−6, S = 3200, L = 1000. #function evals./n.
B.3 Number of iterations for increasing accuracy for n = 64

Tables 9–13 summarize the number of iterations needed to achieve a corresponding accuracy (acc) for fixed dimension n = 64.

acc.      | RP       | RG       | GM | FG       | ARP      | ES       (cells: min/max/mean)
6.25·10−2 | 2/3/3    | 6/8/7    | 1  | 6/8/7    | 2/3/3    | 6/9/8
3.12·10−2 | 3/4/3    | 8/10/8   | 1  | 8/10/8   | 3/4/3    | 8/12/10
1.56·10−2 | 4/5/4    | 9/12/10  | 1  | 9/12/10  | 4/5/4    | 10/14/12
7.81·10−3 | 4/6/5    | 11/13/12 | 1  | 11/13/12 | 4/5/5    | 12/16/14
3.91·10−3 | 5/6/5    | 13/15/14 | 1  | 13/15/14 | 5/6/5    | 14/18/16
1.95·10−3 | 5/7/6    | 14/17/15 | 1  | 14/17/15 | 5/7/6    | 15/20/18
9.77·10−4 | 6/8/7    | 16/18/17 | 1  | 16/18/17 | 6/8/7    | 17/22/20
4.88·10−4 | 7/9/7    | 18/20/19 | 1  | 18/20/19 | 7/9/7    | 18/24/21
2.44·10−4 | 7/9/8    | 19/22/20 | 1  | 19/22/20 | 7/9/8    | 20/26/23
1.22·10−4 | 8/10/9   | 21/24/22 | 1  | 21/24/22 | 8/10/9   | 22/28/25
6.10·10−5 | 9/10/10  | 22/26/24 | 1  | 22/26/24 | 9/11/9   | 23/31/27
3.05·10−5 | 9/11/10  | 24/28/25 | 1  | 24/28/25 | 10/11/10 | 26/32/29
1.53·10−5 | 10/12/11 | 25/29/27 | 1  | 25/29/27 | 10/12/11 | 28/35/31
7.63·10−6 | 11/13/12 | 27/31/29 | 1  | 27/31/29 | 11/13/12 | 29/36/33
3.81·10−6 | 11/13/12 | 28/32/30 | 1  | 28/32/30 | 12/13/12 | 31/38/35
1.91·10−6 | 12/14/13 | 30/34/32 | 1  | 30/34/32 | 12/14/13 | 33/41/37

Table 9: Sphere function f1, m = 1, L = 1, S = 32, n = 64. #iterations/n (GM: #iterations).
acc.      | RP             | RG                | GM   | FG           | ARP         | ES             (cells: min/max/mean)
6.25·10−2 | 2/3/2          | 9/11/10           | 1    | 5/18/16      | 64/79/71    | 5/9/6
3.12·10−2 | 2/3/3          | 11/13/12          | 1    | 17/21/18     | 70/86/77    | 6/10/8
1.56·10−2 | 3/4/3          | 13/15/14          | 1    | 31/359/261   | 77/93/84    | 7/19/10
7.81·10−3 | 4/132/53       | 15/20/17          | 1    | 340/423/380  | 84/101/91   | 18/465/268
3.91·10−3 | 104/298/210    | 381/1103/677      | 124  | 404/484/444  | 92/125/107  | 481/928/723
1.95·10−3 | 273/460/373    | 1863/2578/2150    | 470  | 464/544/504  | 101/140/129 | 936/1410/1177
9.77·10−4 | 433/624/536    | 3340/4062/3624    | 817  | 521/601/562  | 129/153/143 | 1376/1869/1631
4.88·10−4 | 598/787/700    | 4815/5536/5094    | 1163 | 576/657/618  | 142/164/153 | 1826/2325/2085
2.44·10−4 | 761/954/862    | 6280/7018/6564    | 1509 | 630/712/673  | 153/174/162 | 2264/2774/2538
1.22·10−4 | 921/1118/1024  | 7754/8488/8036    | 1856 | 683/767/727  | 161/183/170 | 2732/3239/2994
6.10·10−5 | 1080/1284/1187 | 9232/9961/9508    | 2202 | 736/820/781  | 168/191/178 | 3172/3690/3447
3.05·10−5 | 1243/1446/1350 | 10712/11439/10980 | 2549 | 787/873/833  | 176/199/187 | 3635/4138/3905
1.53·10−5 | 1406/1607/1512 | 12179/12906/12453 | 2895 | 839/925/885  | 189/214/201 | 4083/4593/4361
7.63·10−6 | 1570/1766/1675 | 13654/14388/13923 | 3241 | 890/977/936  | 203/227/217 | 4537/5042/4819
3.81·10−6 | 1732/1928/1837 | 15130/15854/15395 | 3588 | 940/1029/988 | 219/238/230 | 4989/5492/5273
1.91·10−6 | 1899/2096/2001 | 16601/17333/16868 | 3934 | 990/1079/1038| 233/250/242 | 5451/5954/5729

Table 10: Ellipsoid function f2, m = 1, L = 1000, S = 3200, n = 64. #iterations/n (GM: #iterations).
acc.      | RP             | RG                | GM   | FG          | ARP         | ES             (cells: min/max/mean)
6.25·10−2 | 0/0/0          | 0/0/0             | 1    | 0/0/0       | 0/0/0       | 0/0/0
3.12·10−2 | 0/0/0          | 0/0/0             | 1    | 0/0/0       | 0/0/0       | 0/0/0
1.56·10−2 | 0/0/0          | 0/0/0             | 1    | 0/0/0       | 0/0/0       | 0/0/0
7.81·10−3 | 1/1/1          | 3/5/4             | 1    | 2/4/3       | 0/1/1       | 2/4/3
3.91·10−3 | 3/4/3          | 18/23/21          | 5    | 8/13/10     | 3/19/8      | 6/10/9
1.95·10−3 | 8/12/10        | 74/84/79          | 19   | 29/40/34    | 12/66/25    | 22/31/26
9.77·10−4 | 26/37/31       | 256/283/269       | 64   | 55/84/70    | 22/75/41    | 73/102/86
4.88·10−4 | 79/109/92      | 790/837/811       | 191  | 104/152/130 | 33/132/61   | 233/292/257
2.44·10−4 | 200/264/228    | 1993/2065/2022    | 477  | 164/258/201 | 35/173/87   | 577/700/633
1.22·10−4 | 405/501/453    | 3955/4053/4004    | 945  | 224/328/279 | 58/199/127  | 1142/1344/1249
6.10·10−5 | 665/778/723    | 6349/6465/6412    | 1512 | 293/382/348 | 74/255/151  | 1867/2101/1998
3.05·10−5 | 948/1060/1005  | 8849/8964/8917    | 2101 | 369/427/402 | 88/391/213  | 2641/2891/2780
1.53·10−5 | 1229/1345/1288 | 11375/11482/11435 | 2694 | 397/613/442 | 96/449/265  | 3406/3692/3563
7.63·10−6 | 1509/1619/1570 | 13886/14016/13958 | 3288 | 450/894/677 | 129/533/342 | 4223/4461/4348
3.81·10−6 | 1792/1902/1853 | 16401/16545/16482 | 3881 | 463/935/887 | 188/632/400 | 5008/5254/5133
1.91·10−6 | 2068/2191/2136 | 18922/19075/19004 | 4474 | 892/970/942 | 192/678/473 | 5766/6050/5916

Table 11: Nesterov smooth f3 (23), L = 1000, S = 10833, n = 64. #iterations/n (GM: #iterations).
acc.      | RP           | RG             | GM   | FG          | ARP         | ES             (cells: min/max/mean)
6.25·10−2 | 1/2/1        | 7/9/8          | 2    | 4/5/5       | 1/6/2       | 3/5/4
3.12·10−2 | 3/5/4        | 25/31/28       | 7    | 10/16/13    | 5/19/11     | 8/13/11
1.56·10−2 | 8/12/10      | 79/90/82       | 19   | 27/37/33    | 9/45/22     | 19/35/27
7.81·10−3 | 19/31/24     | 199/219/204    | 48   | 46/70/59    | 21/53/32    | 52/76/65
3.91·10−3 | 43/62/50     | 415/458/432    | 102  | 74/107/88   | 25/58/41    | 109/152/135
1.95·10−3 | 82/106/90    | 772/824/791    | 187  | 104/140/118 | 39/88/53    | 214/272/247
9.77·10−4 | 131/166/146  | 1248/1341/1284 | 303  | 138/177/157 | 49/96/65    | 352/447/399
4.88·10−4 | 195/236/214  | 1841/1965/1900 | 449  | 169/216/191 | 52/109/73   | 533/638/587
2.44·10−4 | 267/317/295  | 2540/2694/2621 | 619  | 195/259/228 | 64/114/82   | 740/875/810
1.22·10−4 | 352/409/384  | 3326/3513/3420 | 807  | 244/293/266 | 68/127/95   | 976/1132/1059
6.10·10−5 | 441/506/479  | 4174/4387/4277 | 1009 | 282/335/306 | 89/132/107  | 1233/1403/1325
3.05·10−5 | 539/605/579  | 5057/5288/5168 | 1219 | 321/376/342 | 96/160/119  | 1508/1681/1603
1.53·10−5 | 642/715/682  | 5969/6206/6079 | 1434 | 351/402/376 | 108/168/130 | 1788/1974/1885
7.63·10−6 | 740/816/786  | 6893/7138/7001 | 1650 | 383/430/409 | 117/177/138 | 2071/2269/2173
3.81·10−6 | 845/920/890  | 7816/8064/7926 | 1868 | 423/453/435 | 127/184/148 | 2357/2568/2462
1.91·10−6 | 954/1023/995 | 8727/8995/8854 | 2086 | 441/534/458 | 137/188/159 | 2651/2854/2751

Table 12: Nesterov strongly convex function f4 (23), m = 1, L = 1000, S = 1000, n = 64. #iterations/n (GM: #iterations).
acc.      | RP       | ARP      | ES       (cells: min/max/mean; RG, GM, FG: no data)
6.25·10−2 | 4/6/5    | 4/6/5    | 13/17/14
3.12·10−2 | 7/9/7    | 7/8/7    | 19/25/22
1.56·10−2 | 8/11/9   | 8/10/9   | 24/32/27
7.81·10−3 | 10/13/11 | 10/12/11 | 29/37/32
3.91·10−3 | 11/15/12 | 12/14/12 | 32/42/36
1.95·10−3 | 13/16/14 | 13/15/14 | 35/46/40
9.77·10−4 | 14/17/15 | 14/17/15 | 39/50/43
4.88·10−4 | 16/19/17 | 15/18/17 | 43/54/47
2.44·10−4 | 17/20/18 | 17/20/18 | 47/58/51
1.22·10−4 | 18/22/19 | 18/22/19 | 51/62/55
6.10·10−5 | 19/23/21 | 19/23/21 | 55/66/59
3.05·10−5 | 21/24/22 | 21/25/22 | 59/70/63
1.53·10−5 | 22/25/23 | 22/26/23 | 62/74/67
7.63·10−6 | 23/27/25 | 23/28/25 | 67/77/70
3.81·10−6 | 25/28/26 | 24/29/26 | 71/81/74
1.91·10−6 | 26/30/28 | 26/30/28 | 73/85/78

Table 13: Funnel function f5 (25), S = 32, n = 64. #iterations/n.
B.4 Number of function evaluations for increasing accuracy for fixed dimension n = 64
Tables 14-18 summarize the number of function evaluations needed to achieve a corresponding accuracy (acc) for fixed dimension n = 64.
acc.      | RP       | RG       | FG       | ARP      | ES       (cells: min/max/mean)
6.25·10−2 | 10/14/12 | 11/15/14 | 12/15/13 | 9/13/11  | 6/9/8
3.12·10−2 | 12/17/14 | 15/19/17 | 15/18/17 | 12/16/14 | 8/12/10
1.56·10−2 | 15/20/17 | 18/23/20 | 18/21/20 | 15/19/17 | 10/14/12
7.81·10−3 | 17/23/20 | 22/27/24 | 21/25/23 | 17/22/20 | 12/16/14
3.91·10−3 | 20/27/22 | 25/30/27 | 24/28/26 | 20/25/22 | 14/18/16
1.95·10−3 | 22/29/25 | 28/34/30 | 27/32/29 | 22/29/25 | 15/20/18
9.77·10−4 | 25/33/28 | 32/37/34 | 29/35/33 | 24/32/27 | 17/22/20
4.88·10−4 | 27/35/31 | 35/40/37 | 32/38/36 | 27/35/30 | 18/24/21
2.44·10−4 | 30/38/33 | 38/44/41 | 35/42/39 | 29/38/33 | 20/26/23
1.22·10−4 | 32/40/36 | 41/47/44 | 38/45/42 | 32/40/36 | 22/28/25
6.10·10−5 | 35/42/39 | 44/52/47 | 41/48/45 | 36/43/38 | 23/31/27
3.05·10−5 | 38/44/42 | 47/56/51 | 44/51/49 | 39/46/41 | 26/32/29
1.53·10−5 | 41/48/44 | 51/58/54 | 47/55/52 | 41/49/44 | 28/35/31
7.63·10−6 | 43/52/47 | 53/61/57 | 51/58/55 | 44/51/47 | 29/36/33
3.81·10−6 | 45/54/50 | 57/65/60 | 54/63/58 | 47/53/49 | 31/38/35
1.91·10−6 | 47/57/52 | 60/68/64 | 57/66/61 | 50/56/52 | 33/41/37

Table 14: Sphere function f1, m = 1, L = 1, S = 32, n = 64. #function evals./n.
acc.      | RP                | RG                | FG             | ARP            | ES             (cells: min/max/mean)
6.25·10−2 | 12/21/16          | 18/22/20          | 10/36/33       | 637/809/733    | 5/9/6
3.12·10−2 | 15/25/19          | 21/25/23          | 34/41/37       | 705/883/802    | 6/10/8
1.56·10−2 | 20/32/25          | 25/30/28          | 62/717/522     | 777/966/872    | 7/19/10
7.81·10−3 | 29/1671/662       | 30/39/34          | 680/847/760    | 863/1092/963   | 18/465/268
3.91·10−3 | 1416/3925/2803    | 761/2206/1354     | 808/968/887    | 978/1418/1185  | 481/928/723
1.95·10−3 | 3809/6240/5125    | 3725/5155/4300    | 928/1087/1008  | 1054/1613/1489 | 936/1410/1177
9.77·10−4 | 6096/8577/7458    | 6681/8124/7248    | 1042/1203/1124 | 1526/1790/1688 | 1376/1869/1631
4.88·10−4 | 8495/10961/9848   | 9629/11072/10189  | 1152/1315/1236 | 1702/1955/1835 | 1826/2325/2085
2.44·10−4 | 10936/13455/12278 | 12560/14036/13129 | 1259/1425/1347 | 1844/2108/1966 | 2264/2774/2538
1.22·10−4 | 13340/15896/14705 | 15507/16977/16072 | 1365/1533/1455 | 1960/2242/2088 | 2732/3239/2994
6.10·10−5 | 15726/18390/17144 | 18463/19921/19015 | 1471/1641/1562 | 2068/2362/2211 | 3172/3690/3447
3.05·10−5 | 18147/20800/19566 | 21423/22877/21960 | 1575/1747/1666 | 2188/2508/2353 | 3635/4138/3905
1.53·10−5 | 20573/23183/21970 | 24357/25812/24906 | 1678/1851/1770 | 2389/2743/2552 | 4083/4593/4361
7.63·10−6 | 22981/25521/24375 | 27308/28776/27846 | 1780/1954/1873 | 2627/2918/2789 | 4537/5042/4819
3.81·10−6 | 25337/27890/26733 | 30259/31707/30790 | 1880/2057/1975 | 2857/3083/2993 | 4989/5492/5273
1.91·10−6 | 27723/30272/29071 | 33202/34667/33736 | 1980/2159/2077 | 3035/3247/3159 | 5451/5954/5729

Table 15: Ellipsoid function f2, m = 1, L = 1000, S = 3200, n = 64. #function evals./n.
acc.      | RP                | RG                | FG             | ARP            | ES             (cells: min/max/mean)
6.25·10−2 | 0/0/0             | 0/0/0             | 0/0/0          | 0/0/0          | 0/0/0
3.12·10−2 | 0/0/0             | 0/0/0             | 0/0/0          | 0/0/0          | 0/0/0
1.56·10−2 | 0/0/0             | 0/0/0             | 0/0/0          | 0/0/0          | 0/0/0
7.81·10−3 | 5/8/7             | 7/10/8            | 5/7/6          | 4/12/7         | 2/4/3
3.91·10−3 | 22/33/27          | 36/45/42          | 17/25/21       | 29/163/66      | 6/10/9
1.95·10−3 | 73/123/96         | 147/169/158       | 58/80/67       | 102/604/221    | 22/31/26
9.77·10−4 | 282/410/344       | 511/566/538       | 110/167/140    | 198/690/375    | 73/102/86
4.88·10−4 | 930/1304/1094     | 1579/1675/1621    | 208/305/261    | 301/1264/584   | 233/292/257
2.44·10−4 | 2440/3232/2788    | 3987/4130/4045    | 327/517/401    | 328/1716/866   | 577/700/633
1.22·10−4 | 4996/6186/5588    | 7909/8106/8009    | 448/656/557    | 574/2132/1332  | 1142/1344/1249
6.10·10−5 | 8225/9610/8942    | 12697/12930/12824 | 586/765/696    | 763/2864/1621  | 1867/2101/1998
3.05·10−5 | 11739/13117/12445 | 17699/17928/17833 | 739/855/803    | 903/4485/2372  | 2641/2891/2780
1.53·10−5 | 15220/16645/15948 | 22750/22963/22870 | 795/1225/885   | 1004/5201/3010 | 3406/3692/3563
7.63·10−6 | 18678/20032/19423 | 27772/28033/27917 | 900/1788/1355  | 1410/6242/3979 | 4223/4461/4348
3.81·10−6 | 22154/23511/22902 | 32801/33091/32963 | 926/1870/1775  | 2145/7381/4691 | 5008/5254/5133
1.91·10−6 | 25520/27034/26351 | 37844/38150/38008 | 1785/1939/1885 | 2199/8149/5609 | 5766/6050/5916

Table 16: Nesterov smooth f3 (23), L = 1000, S = 10833, n = 64. #function evals./n.
acc.      | RP                | RG                | FG          | ARP            | ES             (cells: min/max/mean)
6.25·10−2 | 8/15/12           | 13/18/15          | 8/11/9      | 8/52/15        | 3/5/4
3.12·10−2 | 29/45/35          | 50/61/56          | 21/32/26    | 43/173/93      | 8/13/11
1.56·10−2 | 74/120/97         | 157/179/164       | 54/75/66    | 83/421/193     | 19/35/27
7.81·10−3 | 198/342/255       | 397/438/408       | 93/139/119  | 184/497/290    | 52/76/65
3.91·10−3 | 490/718/570       | 831/915/864       | 148/214/175 | 223/555/391    | 109/152/135
1.95·10−3 | 969/1258/1071     | 1543/1648/1583    | 208/281/237 | 373/896/516    | 214/272/247
9.77·10−4 | 1578/2012/1757    | 2495/2682/2568    | 276/354/314 | 477/989/662    | 352/447/399
4.88·10−4 | 2374/2877/2605    | 3681/3929/3800    | 338/433/383 | 530/1144/763   | 533/638/587
2.44·10−4 | 3260/3883/3607    | 5079/5387/5242    | 389/519/457 | 675/1209/876   | 740/875/810
1.22·10−4 | 4315/5011/4710    | 6653/7026/6840    | 487/587/532 | 727/1370/1037  | 976/1132/1059
6.10·10−5 | 5417/6220/5882    | 8347/8774/8555    | 563/671/612 | 978/1437/1179  | 1233/1403/1325
3.05·10−5 | 6621/7435/7116    | 10113/10577/10337 | 642/752/685 | 1074/1782/1323 | 1508/1681/1603
1.53·10−5 | 7872/8772/8371    | 11937/12411/12159 | 702/804/753 | 1212/1882/1468 | 1788/1974/1885
7.63·10−6 | 9069/10005/9628   | 13786/14275/14002 | 766/861/818 | 1286/1986/1559 | 2071/2269/2173
3.81·10−6 | 10331/11264/10885 | 15633/16129/15852 | 845/906/870 | 1448/2078/1687 | 2357/2568/2462
1.91·10−6 | 11629/12482/12122 | 17455/17990/17708 | 883/1069/916| 1557/2134/1825 | 2651/2854/2751

Table 17: Nesterov strongly convex function f4 (23), m = 1, L = 1000, S = 1000, n = 64. #function evals./n.
acc.      | RP          | ARP         | ES       (cells: min/max/mean; RG, GM, FG: no data)
6.25·10−2 | 37/58/45    | 40/51/46    | 13/17/14
3.12·10−2 | 63/86/71    | 62/76/70    | 19/25/22
1.56·10−2 | 83/110/92   | 84/102/92   | 24/32/27
7.81·10−3 | 103/129/112 | 103/123/112 | 29/37/32
3.91·10−3 | 118/150/132 | 123/144/132 | 32/42/36
1.95·10−3 | 140/168/151 | 140/167/151 | 35/46/40
9.77·10−4 | 159/187/171 | 159/191/171 | 39/50/43
4.88·10−4 | 178/207/190 | 177/212/192 | 43/54/47
2.44·10−4 | 196/230/210 | 197/231/211 | 47/58/51
1.22·10−4 | 215/252/232 | 218/265/234 | 51/62/55
6.10·10−5 | 236/270/252 | 239/289/255 | 55/66/59
3.05·10−5 | 257/293/273 | 257/313/275 | 59/70/63
1.53·10−5 | 279/316/294 | 277/330/295 | 62/74/67
7.63·10−6 | 297/340/316 | 295/355/317 | 67/77/70
3.81·10−6 | 320/365/339 | 318/378/338 | 71/81/74
1.91·10−6 | 338/384/360 | 342/399/361 | 73/85/78

Table 18: Funnel function f5 (25), S = 32, n = 64. #function evals./n.