Log-log Convergence for Noisy Optimization

S. Astete-Morales, J. Liu, O. Teytaud
TAO (Inria), LRI, UMR 8623 (CNRS - Univ. Paris-Sud), France + AI Lab, National Dong Hwa University, Hualien, Taiwan
Abstract. We consider noisy optimization problems, without the assumption that the variance of the noise vanishes in the neighborhood of the optimum. We show mathematically that simple rules with an exponential number of resamplings lead to a log-log convergence rate: in this case, the log of the distance to the optimum is linear in the log of the number of evaluations. The same holds with a number of resamplings polynomial in the inverse step-size. We show empirically that this convergence rate is also obtained with a polynomial number of resamplings. In this polynomial-resampling setting, using classical evolution strategies and an ad hoc choice of the number of resamplings, we seemingly get the same rate as that obtained with Estimation of Distribution Algorithms specifically designed for the noisy setting. We also experiment with non-adaptive polynomial resamplings. Compared to the state of the art, our results provide (i) proofs of log-log convergence for evolution strategies (which were not covered by existing results) in the case of objective functions with quadratic expectation and constant noise, (ii) log-log rates also for objective functions with expectation E[f(x)] = ||x − x*||^p, where x* represents the optimum, and (iii) experiments with different parametrizations than those considered in the proof. These results suggest some simple re-evaluation schemes. This paper extends [1].
1 Introduction
In this introduction, we first present the noisy optimization setting and its local case. We then classify existing optimization algorithms for such settings. Afterwards, we discuss log-linear and log-log scales for convergence and give an overview of the paper. Throughout the paper, log denotes the natural logarithm and N is a standard Gaussian random variable (possibly multidimensional, depending on the context), except when it is explicitly specified that N may be any random variable with bounded density.

Noisy optimization. This term denotes the optimization of an objective function which has internal stochastic effects. When the algorithm requests fitness(·) at a point x, it in fact receives fitness(x, θ) for a realization of a random variable θ. All calls to fitness(·) are based on independent realizations of the same random variable θ. The goal of a noisy optimization algorithm is to find x such that E[fitness(x, θ)] is minimized (or nearly minimized).
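To make this setting concrete, here is a minimal Python sketch of such a noisy oracle, using the fitness f(x) = ||x||^p + N studied later in the paper; the function names and parameter values are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_fitness(x, p=2.0):
    """One call fitness(x, theta): ||x||^p plus an independent standard
    Gaussian noise realization; the noise variance does NOT vanish at the
    optimum x* = 0, which is the setting studied in this paper."""
    return np.linalg.norm(x) ** p + rng.normal()

# Goal: minimize E[fitness(x, theta)]. Here we estimate this expectation
# at a fixed x by averaging independent calls (resampling).
x = np.ones(3)
estimate = np.mean([noisy_fitness(x) for _ in range(1000)])
```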
Local noisy optimization. Local noisy optimization refers to the optimization of an objective function for which the main difficulty is noise, not local minima. Hence, diversity mechanisms as in [2] or [3], in spite of their qualities, are not relevant here. We also restrict our work to noisy settings in which the noise does not decrease to 0 around the optimum. This constraint makes our work different from [4]. In [5, 6] we can find noise models related to ours, but the results presented here are not covered by their analysis. On the other hand, in [7–9], different noise models (with Bernoulli fitness values) are considered, including a noise whose variance does not decrease to 0 (as in the present paper). They provide general lower bounds, or convergence rates for specific algorithms, whereas we consider convergence rates for classical evolution strategies equipped with resamplings.

Classification of local noisy optimization algorithms. We classify local noisy optimization algorithms into the following three families:
– Algorithms based on sampling, as far as they can, close to the optimum. In this category, we include evolution strategies [10, 6, 5] and EDAs [11], as well as pattern search methods designed for noisy cases [12–14]. Typically, these algorithms are based on noise-free algorithms and evaluate individuals multiple times in order to reduce the effect of noise. Authors studying such algorithms focus on the number of resamplings; it can be chosen by estimating the noise level [15], or using the step-size, or, as in parts of the present work, in a non-adaptive manner.
– Algorithms which learn (model) the objective function, sample at locations in which the model is not precise enough, and then assume that the optimum is nearly the optimum of the learnt model. Surrogate models and Gaussian processes [16] belong to this family. However, Gaussian processes are usually designed to achieve global convergence (i.e. good properties on multimodal functions) rather than local convergence (i.e. good properties on unimodal functions); in this paper, we focus on local convergence.
– Algorithms which combine both ideas: they assume that learning the objective function is a good way of handling noise, but consider that points too far from the optimum are of little use for optimization. This assumption makes sense at least when the objective function is too hard to learn over the whole search domain. CLOP [7, 8] is such an approach.

Log-linear scale and log-log scale: uniform and non-uniform rates. Ensuring the convergence of an algorithm and analyzing the rate at which it converges are among the main goals in the study of optimization algorithms. In the noise-free case, evolution strategies typically converge linearly in log-linear scale, that is, the logarithm of the distance to the optimum typically scales linearly with the number of evaluations (see Section 2.1 for more details). The case of noisy fitness values leads to a log-log convergence [9]. We investigate
conditions under which such a log-log convergence is possible. In particular, we focus on uniform rates. Uniform means that all points are under a linear curve in the log-log scale. Formally, the rate is the infimum of C such that with probability 1 − δ, for m sufficiently large, all iterates after m fitness evaluations verify log ||x_m|| ≤ −C log m, where x_m is the m-th evaluated individual. That is, all points are supposed to be "good" (i.e. to satisfy the inequality), not only the best point of a given iteration. In contrast, a non-uniform rate would be the infimum of C such that log ||x_{k_m}|| ≤ −C log k_m for some increasing sequence (k_m). The state of the art in this matter exhibits various results. For an objective function with expectation E[f(x)] = ||x − x*||^2, when the variance is not supposed to decrease in the neighborhood of the optimum, it is known that the best possible slope in this log-log graph is −1/2 (see [17]), but without uniform rate. When optimizing f(x) = ||x||^p + N, this slope is provably limited to −1/p under the locality assumption (i.e. when sampling far from the optimum does not help; see [9] for a formalization of this assumption), and it is known that some ad hoc EDA can reach −1/(2p) (see [18]). For evolution strategies, the slope is not known. Also, the optimal rate for E[f(x)] = ||x − x*||^p for p ≠ 2 is unknown; we show that our evolution strategies with simple re-evaluation schemes have linear convergence in log-log representation in such a case.

Algorithms considered in this paper. We here focus on simple re-evaluation rules in evolution strategies, based on choosing the number of resamplings. We start with rules which decide the number of re-evaluations depending only on the iteration number n, i.e. independently of the step-size σ_n, the parents x_n and the fitness values. To the best of our knowledge, these simple rules have not been analyzed so far. Nonetheless, they have strong advantages: we get a linear slope in the log-log curve with simple rules depending only on n, whereas rules based on numbers of resamplings defined as a function of σ_n have a strong sensitivity to parameters. Also, evolution strategies, contrary to algorithms with good non-uniform rates, have a nice empirical behavior from the point of view of uniform rates, as shown mathematically by [18].

Overview of the paper. In this paper we present mathematical proofs and experimental results on the convergence of the evolutionary algorithms described in the following sections, which include resampling rules aiming to cancel the effect of noise. The theoretical analysis uses an exponential number of resamplings, together with a scale invariance assumption. This result is extended to an adaptive resampling rule (Section 2.3), in which the number of evaluations depends on the step-size only; there we also get rid of the scale invariance assumption. Essentially, the algorithms for which we get a proof have the same dynamics as in the noise-free case; they just use enough resamplings to cancel the noise. This is consistent with the existing literature, in particular [18], which shows a log-log convergence for an Estimation of Distribution Algorithm with exponentially decreasing step-size and exponentially increasing number of resamplings. In the experimental part, we see that another solution is a polynomially increasing number of resamplings (independently of σ_n; the number of resamplings just smoothly increases with the number of iterations, in a non-adaptive manner), leading to a slower convergence when considering the progress rate per iteration, but the same log-log convergence when considering the progress rate per evaluation. We obtain positive experimental results even with the unproved polynomial (non-adaptive) number of re-evaluations; these are perhaps the most satisfactory (most stable) results. We also obtain convergence with adaptive rules (number of resamplings depending on the step-size), but the results are seemingly less stable than with non-adaptive methods.
2 Theoretical analysis: exponential non-adaptive rules can lead to log-log convergence
Section 2.1 is devoted to some preliminaries. Section 2.2 presents results in the scale-invariant case, for an exponential number of resamplings and non-adaptive rules. Section 2.3 will focus on adaptive rules, with numbers of resamplings depending on the step-size.

2.1 Preliminary: noise-free case
In the noise-free case, for some evolution strategies, the following results hold almost surely (see e.g. Theorem 4 in [19]): log(σ_n)/n converges to some constant −A < 0, and log(||x_n||)/n converges to some constant −A′ < 0. This implies that for any ρ < A, log(σ_n) ≤ −ρn for n sufficiently large; hence sup_{n≥1} (log(σ_n) + ρn) is finite. With these almost sure results, now consider V the quantile 1 − δ/4 of exp(sup_{n≥1} (log(σ_n) + ρn)). Then, with probability at least 1 − δ/4, ∀n ≥ 1, σ_n ≤ V exp(−ρn). We can apply the same trick for lower bounding σ_n, and for upper and lower bounding ||x_n||, each with probability 1 − δ/4, so that all bounds hold simultaneously with probability at least 1 − δ. Hence, for any α < A′, α′ > A′, ρ < A, ρ′ > A, there exist C > 0, C′ > 0, V > 0, V′ > 0 such that with probability at least 1 − δ,

∀n ≥ 1,  C′ exp(−α′n) ≤ ||x_n|| ≤ C exp(−αn);   (1)

∀n ≥ 1,  V′ exp(−ρ′n) ≤ σ_n ≤ V exp(−ρn).   (2)
We will first show, in Section 2.2, our noisy optimization result (Theorem 1): (i) in the scale-invariant case, (ii) using Eq. 1 (supposed to hold in the noise-free case). We will then show similar results in Section 2.3: (i) without scale invariance, (ii) using Eq. 2 (supposed to hold in the noise-free case), (iii) with other resampling schemes.
2.2 Scale-invariant case, with exponential number of resamplings
We consider Alg. 1, a version of multi-membered evolution strategies, the (µ, λ)-ES. µ denotes the number of parents and λ the number of offspring (µ ≤ λ). In every generation, λ offspring are produced from a population of µ parents, and selection takes place among these offspring: individuals are ranked according to their fitness(·), and the µ best are kept. Here x_n denotes the parent at iteration n.

Algorithm 1 An evolution strategy, with exponential number of resamplings. With K = 1 and ζ = 1 we obtain the case without resampling. N is an arbitrary random variable with bounded density (each use is independent of the others); a bounded density is sufficient, N need not be Gaussian. The update rule is deliberately left unspecified: any rule satisfying the assumptions of Theorem 1 below is admissible.
Parameters: K > 0, ζ ≥ 0, λ ≥ µ > 0, a dimension d > 0.
Input: an initial x_1 ∈ R^d and an initial σ_0 > 0.
n ← 1
while true do
    Generate λ individuals i_1, . . . , i_λ independently using
        i_j = x_n + σ_{n,j} N.   (3)
    Evaluate each of them r_n = ⌈Kζ^n⌉ times and average their fitness values.
    Select the µ best individuals j_1, . . . , j_µ.
    Update: from x_n, σ_n, i_1, . . . , i_λ and j_1, . . . , j_µ, compute x_{n+1} and σ_{n+1}.
    n ← n + 1
end while
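As an illustration, the sketch below implements Alg. 1 in Python under assumptions of ours: Gaussian mutations, the fitness of Eq. 4, and a simple self-adaptive update of x_n and σ_n (the theorem leaves the update rule open, requiring only the noise-free behavior of Eq. 5). It is a sketch, not a reference implementation.

```python
import numpy as np

def es_exponential_resampling(d=2, mu=2, lam=4, K=2.0, zeta=2.0,
                              p=2.0, iterations=15, seed=0):
    """(mu, lam)-ES with r_n = ceil(K * zeta**n) resamplings (Alg. 1 sketch)."""
    rng = np.random.default_rng(seed)
    noisy_fitness = lambda x: np.linalg.norm(x) ** p + rng.normal()
    parent, sigma = np.ones(d), 1.0                      # x_1 = (1, ..., 1)
    for n in range(1, iterations + 1):
        r_n = int(np.ceil(K * zeta ** n))                # exponential resampling rule
        offspring = []
        for _ in range(lam):
            s = sigma * np.exp(rng.normal() / np.sqrt(d))  # mutated step-size
            x = parent + s * rng.normal(size=d)            # i_j = x_n + s N
            avg = np.mean([noisy_fitness(x) for _ in range(r_n)])
            offspring.append((avg, s, x))
        best = sorted(offspring, key=lambda t: t[0])[:mu]  # mu best on averages
        parent = np.mean([x for _, _, x in best], axis=0)  # illustrative update
        sigma = float(np.exp(np.mean([np.log(s) for _, s, _ in best])))
    return parent
```

With K = ζ = 1, r_n = 1 for all n and the loop reduces to the noise-free baseline.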
We now state our first theorem, under a log-linear convergence assumption (the assumption in Eq. 5 is just Eq. 1).

Theorem 1. Consider the fitness function

f(z) = ||z||^p + N   (4)

over R^d and x_1 = (1, 1, . . . , 1). Consider an evolution strategy with population size λ and parent population size µ such that, without resampling, for any δ > 0, for some α > 0, α′ > 0, with probability 1 − δ/2, with objective function fitness(x) = ||x||,

∃C, C′;  C′ exp(−α′n) ≤ ||x_n|| ≤ C exp(−αn).   (5)

Assume, additionally, that there is scale invariance:

σ_n = C″||x_n||   (6)

for some C″ > 0. Then, for any δ > 0, there are K_0 > 0, ζ_0 > 0 such that for K ≥ K_0, ζ > ζ_0, Eq. 1 also holds with probability at least 1 − δ, for the fitness function of Eq. 4 and the resampling rule of Alg. 1.
Remarks: (i) Informally speaking, our theorem shows that if a scale-invariant algorithm converges in the noise-free case, then it also converges in the noisy case with the exponential resampling rule, at least if the parameters are large enough (a similar effect of constants was pointed out in [4] in a different setting). (ii) We assume that the optimum is at 0 and the initial point x_1 at (1, . . . , 1). These assumptions have no influence when we use algorithms invariant under rotation and translation. (iii) We show a log-linear convergence rate as in the noise-free case, but at the cost of more evaluations per iteration. When normalized by the number of function evaluations, we get log ||x_n|| linear in the logarithm of the number of function evaluations, as detailed in Corollary 1.

Proof of the theorem: Throughout the proof, N denotes a standard Gaussian random variable (in dimension 1 or d, depending on the context). Consider an arbitrary δ > 0, n ≥ 1 and δ_n = exp(−γn) for some γ > 0. Define p_n the probability that two generated points, e.g. i_1 and i_2, are such that | ||i_1||^p − ||i_2||^p | ≤ δ_n.

Step 1: Using Eq. 3 and Eq. 6, we show that

p_n ≤ B′ exp(−γ′n)   (7)

for some B′ > 0, γ′ > 0 depending on γ, d, p, C′, C″, α′.

Proof of Step 1: with N_1 and N_2 two d-dimensional independent standard Gaussian random variables,

p_n ≤ P(| ||1 + C″N_1||^p − ||1 + C″N_2||^p | ≤ δ_n/||x_n||^p).   (8)

Define densityMax the supremum of the density of | ||1 + C″N_1||^p − ||1 + C″N_2||^p |; since δ_n/||x_n||^p ≤ (C′)^{−p} exp((pα′ − γ)n), we get p_n ≤ densityMax (C′)^{−p} exp((pα′ − γ)n), hence the expected result with γ′ = γ − pα′ and B′ = densityMax (C′)^{−p}. Notice that densityMax is upper bounded. In particular, γ′ is arbitrarily large, provided that γ is sufficiently large.

Step 2: Consider now p_n^{(1)}, the probability that there exist i_1 and i_2 such that | ||i_1||^p − ||i_2||^p | ≤ δ_n. Then p_n^{(1)} ≤ λ² p_n ≤ B′λ² exp(−γ′n).

Step 3: Consider now p_n^{(2)}, the probability that |N/√(Kζ^n)| ≥ δ_n/2. First, we write p_n^{(2)} = P(|N| ≥ (δ_n/2)√(Kζ^n)). So, by Chebyshev's inequality, p_n^{(2)} ≤ B″ exp(−γ″n), with γ″ = log(ζ) − 2γ arbitrarily large provided that ζ is large enough, and B″ = 4/K.

Step 4: Consider now p_n^{(3)}, the probability that |N/√(Kζ^n)| ≥ δ_n/2 at least once for the λ evaluated individuals of iteration n. Then p_n^{(3)} ≤ λ p_n^{(2)}.

Step 5: In this step we consider the probability that two individuals are misranked due to noise. Let p_n^{(4)} be the probability that at least two points i_a and i_b at iteration n verify

||i_a||^p ≤ ||i_b||^p   (9)

and

noisyEvaluation(i_a) ≥ noisyEvaluation(i_b),   (10)

where noisyEvaluation(i) is the average of the multiple evaluations of individual i. Eqs. 9 and 10 can occur simultaneously only if either two points have very similar fitness (difference less than δ_n) or the noise is big (larger than δ_n/2). Therefore p_n^{(4)} ≤ p_n^{(1)} + p_n^{(3)} ≤ λ²p_n + λp_n^{(2)}, so p_n^{(4)} ≤ (B′ + B″)λ² exp(−min(γ′, γ″)n).

Step 6: Step 5 was about the probability that at least two points at iteration n are misranked due to noise. We now consider Σ_{n≥1} p_n^{(4)}, which is an upper bound on the probability that in at least one iteration there is a misranking of two individuals. If γ′ and γ″ are large enough, Σ_{n≥1} p_n^{(4)} < δ. This implies that, with probability at least 1 − δ, provided that K and ζ have been chosen large enough for γ′ and γ″ to be large enough, we get the same rankings of points as in the noise-free case; this proves the expected result.

The following corollary shows that this is a log-log convergence.

Corollary 1: log-log convergence with exponential resampling. With e_n the number of evaluations at the end of iteration n, we have e_n = Kζ(ζ^n − 1)/(ζ − 1). We then get, from Eq. 1,

log(||x_n||)/log(e_n) → −α/log ζ   (11)

with probability at least 1 − δ (indeed, log(e_n) grows like n log ζ, while Eq. 1 gives log ||x_n|| ≈ −αn). Eq. 11 is the convergence in log-log scale. We have shown this property for an exponentially increasing number of resamplings, which is indeed similar to R-EDA [18]: it converges within a small number of iterations, but with exponentially many resamplings per iteration. In the experimental Section 3, we will check what happens in the polynomial case.
2.3 Extension: adaptive resamplings and removing the scale invariance assumption
We have assumed above scale invariance. This is obviously not a nice feature of our proof, because scale invariance does not correspond to anything real: in a real setting, we do not know the distance to the optimum. We show below an extension of the result above, using the assumption of a log-linear convergence of σ_n as in Eq. 2 instead of the scale invariance used before. In the corollary below, we also get rid of the non-adaptive rule with exponential number of resamplings, replaced by a number of resamplings depending only on the step-size σ_n, as in Alg. 2. In one corollary, we switch to both (i) an adaptive resampling rule and (ii) no scale invariance; each change can in fact be proved independently of the other.
Algorithm 2 An evolution strategy, with number of resamplings polynomial in the inverse step-size. The case without resampling corresponds to Y = 1 and η = 0. N is an arbitrary random variable with bounded density (each use is independent of the others).
Parameters: Y > 0, η ≥ 0, λ ≥ µ > 0, a dimension d > 0.
Input: an initial x_1 ∈ R^d and an initial σ_0 > 0.
n ← 1
while true do
    Generate λ individuals i_1, . . . , i_λ independently using
        i_j = x_n + σ_{n,j} N.   (12)
    Evaluate each of them r_n = ⌈Y σ_n^{−η}⌉ times and average their fitness values.
    Select the µ best individuals j_1, . . . , j_µ.
    Update: from x_n, σ_n, i_1, . . . , i_λ and j_1, . . . , j_µ, compute x_{n+1} and σ_{n+1}.
    n ← n + 1
end while
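Relative to the sketch given after Alg. 1, only the resampling rule changes; a minimal version of the Alg. 2 rule, with Y and η as in Eq. 13 below:

```python
import numpy as np

def resamplings_adaptive(sigma, Y=1.0, eta=2.0):
    """Alg. 2 rule: r_n = ceil(Y * sigma**(-eta)); more resamplings
    as the step-size sigma shrinks (i.e. as we approach the optimum)."""
    return int(np.ceil(Y * sigma ** (-eta)))
```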
Corollary 2: adaptive resampling and no scale invariance. The proof of Theorem 1 also holds without scale invariance, under the following assumptions:
– For any δ > 0, there are constants ρ > 0, V > 0, ρ′ > 0, V′ > 0 such that with probability at least 1 − δ, Eq. 2 holds.
– The number of re-evaluations is

Y (1/σ_n)^η   (13)

with Y and η sufficiently large.
– Individuals are still randomly drawn using x_n + σ_n N for some random variable N with bounded density.

Remark: This setting is useful in cases like self-adaptive algorithms, in which we do not directly use a Gaussian random variable, but a Gaussian random variable multiplied e.g. by exp(Gaussian/√d), with Gaussian a standard Gaussian random variable. For example, SA-ES algorithms as in [19] are covered by this proof, because they converge log-linearly as explained in Section 2.1.

Proof of Corollary 2: Two steps of the proof differ, namely Step 1 and Step 2. We here adapt the proofs of these two steps.

Adapting Step 1: Eq. 8 becomes Eq. 14:

p_n ≤ P(| ||1 + C″_n N_1||^p − ||1 + C″_n N_2||^p | ≤ δ_n/||x_n||^p),   (14)

where C″_n = σ_n/||x_n|| ≥ t′ exp(−tn) for some t > 0, t′ > 0 depending on ρ, ρ′, V, V′ only. Eq. 14 leads to p_n ≤ (C″_n)^{−d} densityMax (C′)^{−p} exp((pα′ − γ)n), hence the expected result with γ′ = γ − pα′ − dt. densityMax is upper bounded thanks to the third condition of Corollary 2.

Adapting Step 2: It is sufficient to show that the number of resamplings at each iteration is larger than in Theorem 1, so that Step 2 still holds. Eq. 13 implies that the number of re-evaluations at iteration n is at least Y (1/V)^η exp(ρηn). This is more than Kζ^n, at least if Y and η are large enough. This leads to the same conclusion as in Theorem 1, except that we obtain probability 1 − 2δ instead of 1 − δ (not a real issue, as we can run the argument with δ/2).

The following corollary shows that our result leads to the log-log convergence.

Corollary 3: log-log convergence for adaptive resampling. With e_n the number of evaluations at the end of iteration n, we have e_n = Y (1/V)^η exp(ρη) (exp(ρηn) − 1)/(exp(ρη) − 1). We then get, from Eq. 1,

log(||x_n||)/log(e_n) → −α/(ρη)   (15)

with probability at least 1 − δ. Eq. 15 is the convergence in log-log scale.
3 Polynomial number of resamplings: experiments
We here consider a polynomial number of resamplings, as in Alg. 3.
Algorithm 3 An evolution strategy, with polynomial number of resamplings. The case without resampling corresponds to K = 1 and ζ = 0.
Parameters: K > 0, ζ ≥ 0, λ ≥ µ > 0, a dimension d > 0.
Input: an initial x_1 ∈ R^d and an initial σ_0 > 0.
n ← 1
while true do
    Generate λ individuals i_1, . . . , i_λ independently using
        σ_{n,j} = σ_n × exp((1/√d) N),
        i_j = x_n + σ_{n,j} N.   (16)
    Evaluate each of them r_n = ⌈Kn^ζ⌉ times and average their fitness values.
    Select the µ best individuals j_1, . . . , j_µ.
    Update: from x_n, σ_n, i_1, . . . , i_λ and j_1, . . . , j_µ, compute x_{n+1} and σ_{n+1}.
    n ← n + 1
end while
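Again relative to the sketch given after Alg. 1, Alg. 3 replaces the resampling rule by a non-adaptive polynomial one (the self-adaptive mutation σ_{n,j} = σ_n exp(N/√d) is already in that sketch); a minimal version:

```python
import numpy as np

def resamplings_polynomial(n, K=2.0, zeta=2.0):
    """Alg. 3 rule: r_n = ceil(K * n**zeta); depends only on the
    iteration counter n, not on the step-size (non-adaptive)."""
    return int(np.ceil(K * n ** zeta))
```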
Experiments are performed in a “real” setting, without scale invariance. Importantly, our mathematical results establish log-log convergence only under the assumption that the constants are large enough. We present results with fitness function f(x) = ||x||^p + N with p = 2 in Fig. 1.
In experiments with the following parameters (as recommended in [10, 20]): p = 1 or p = 4, dimension 2, 3, 4, 5, ζ = 1, 2, 3, µ = min(d, ⌈λ/4⌉), λ = ⌈d√d⌉, slopes are usually better than −1/(2p) for ζ = 2 or ζ = 3, and worse for ζ = 1. Non-presented experiments show that ζ = 0 performs very poorly. Seemingly, results for large ζ are farther from the asymptotic regime; we conjecture that the asymptotic regime is −1/(2p), but that it is reached later when ζ is large. R-EDA [18] reaches −1/(2p); we seemingly get slightly better, but this might be due to a non-asymptotic effect. Fig. 1 provides results with high numbers of evaluations.
[Figure 1: four log-log plots of log(||y||) versus log(Evaluations), all with K = 2 and p = 2: (a) d = 2, ζ = 2, slope = −0.3267; (b) d = 4, ζ = 2, slope = −0.2829; (c) d = 2, ζ = 1, slope = −0.2126; (d) d = 4, ζ = 1, slope = −0.1404.]
Fig. 1: Experiments in dimension 2, 3, 4 with ζ = 1, 2 (number of evaluations on the x-axis) for r_n = ⌈Kn^ζ⌉ (i.e. polynomial, non-adaptive) with µ = 2, λ = 4, p = 2 and K = 2. The slope is evaluated on the second half of the iterations. We get slopes close to −1/(2p). All results are averaged over 20 runs.
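For completeness, a small sketch of the slope estimation described in the caption: an ordinary least-squares fit on the second half of the run, in log-log scale. The input arrays are assumed to be recorded during a run (e.g. by the sketch given after Alg. 1).

```python
import numpy as np

def loglog_slope(evaluations, distances):
    """Fit log||y|| = slope * log(e) + b on the second half of the run."""
    log_e = np.log(np.asarray(evaluations, dtype=float))
    log_d = np.log(np.asarray(distances, dtype=float))
    half = len(log_e) // 2
    slope, _ = np.polyfit(log_e[half:], log_d[half:], 1)
    return slope
```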
4 Experiments with adaptivity: Y σ_n^{−η} re-evaluations
We here show experiments with Alg. 2. The algorithm should converge linearly in log-log scale, as shown by Corollary 3, at least for large enough values of Y and η.
Notice that we consider values of µ, λ for which log-linear convergence is proved in the noise-free setting (see Section 2.1). In all this section, µ = min(d, ⌈λ/4⌉), λ = ⌈d√d⌉. Slopes estimated in the case η = 2 (usually the most favorable, and an important case arising naturally in sufficiently differentiable problems) are given in Table 1 (left) for dimension d = 100. In this case we are far from the asymptotic regime.
Table 1: Left: dimension 100; estimated slope for the adaptive rule with r_n = ⌈(1/σ_n)²⌉ resamplings at iteration n. Right: dimension 10; estimated slope for the adaptive rule with r_n = ⌈Y(1/σ_n)²⌉ resamplings at iteration n (Y = 1 as in the previous curves, and Y = 20 for checking the impact on convergence; the negative slope (apparent convergence) for Y = 20 is stable, as is the divergence or stagnation for Y = 1 with p = 4). Slopes are estimated on the second half of the curve.

d = 100
p    slope for Y = 1
1    −0.52
2    −0.51
4    −0.45

d = 10
p    slope for Y = 1    slope for Y = 20
1    −0.51              −0.50
2    −0.18              −0.17
4    > 0                −0.08
We get results close to −1/2 in all cases. This slope of −1/2 is reachable by algorithms which learn a model of the fitness function, as e.g. [7]. In this high-dimensional case, we are far from the slope −1/(2p), which might be the asymptotic value; this is suggested by experiments in dimension 10, summarized in Table 1 (right). We also point out that the known complexity bound is −1/p (from [9]), and maybe the slope can reach −1/p in some cases. Results with Y(1/σ)^η re-evaluations are moderately stable (impact of Y, in particular). This supports our preference for stable rules, such as non-adaptively choosing n² re-evaluations per individual at iteration n.
5 Conclusion
We have shown mathematically log-log convergence results and studied experimentally the slope of this convergence. These results were shown for evolution strategies, which are known for having good uniform rates rather than good non-uniform rates. We summarize these two parts below and give some research directions.

Log-log convergence. We have shown that log-log convergence (i.e. linear convergence with the log of the number of evaluations on the x-axis and the log of the distance to the optimum on the y-axis) occurs in various cases:
– Non-adaptive rules, with a number of resamplings exponential in the iteration counter. Here we have a mathematical proof, which includes the assumption of scale invariance; as shown by Corollary 2, this can be extended to non-scale-invariant algorithms;
– Adaptive rules, with a number of resamplings polynomial in 1/σ_n, with σ_n the step-size. Here we have a mathematical proof; however, there is a strong sensitivity to the constants Y and η which determine the number of resamplings per individual, Y(1/σ_n)^η;
– Non-adaptive rules, with a polynomial number of resamplings. This case is a quite convenient scheme experimentally, but we have no proof.

Slope in log-log convergence. Experimentally, the best slope in the log-log representation is often close to −1/(2p) for the fitness function f(x) = ||x||^p + N. It is known that under modeling assumptions (i.e. when the function is regular enough to be optimized by learning), it is possible to do better than that (the slope becomes −1/2 in parametric cases, see [7] and references therein), but −1/(2p) is the best known exponent under the locality assumption. Basically, the locality assumption ensures that most points are reasonably good, whereas some specialized noisy optimization algorithms sample a few very good points and essentially sample individuals far from the optimum (see e.g. [7]).

Further work. The main further work is the mathematical analysis of the polynomial number of resamplings in the non-adaptive case. Also, a combination of adaptive and non-adaptive rules might be interesting; adaptive rules are intuitively satisfactory, but non-adaptive polynomial rules provide simple, efficient solutions, with empirically easy (no parameter tuning) results. If our life depended on a scheme, we would for the moment choose a simple polynomial rule with a number of re-evaluations quadratic in the iteration number, in spite of its (perhaps) moderate elegance due to the lack of adaptivity.
References
1. Astete-Morales, S., Liu, J., Teytaud, O.: Noisy optimization convergence rates. In: GECCO (Companion). (2013) 223–224
2. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (1998) 455–492
3. Auger, A., Jebalia, M., Teytaud, O.: XSE: quasi-random mutations for evolution strategies. In: Proceedings of Evolutionary Algorithms. (2005) 12 pages
4. Jebalia, M., Auger, A., Hansen, N.: Log-linear convergence and divergence of the scale-invariant (1+1)-ES in noisy environments. Algorithmica (2010)
5. Arnold, D.V., Beyer, H.G.: A general noise model and its effects on evolution strategy performance. IEEE Transactions on Evolutionary Computation 10 (2006) 380–391
6. Finck, S., Beyer, H.G., Melkozerov, A.: Noisy optimization: a theoretical strategy comparison of ES, EGS, SPSA & IF on the noisy sphere. In: GECCO. (2011) 813–820
7. Coulom, R.: CLOP: Confident local optimization for noisy black-box parameter tuning. In: Advances in Computer Games. Springer Berlin Heidelberg (2012) 146–157
8. Coulom, R., Rolet, P., Sokolovska, N., Teytaud, O.: Handling expensive optimization with large noise. In: Foundations of Genetic Algorithms. (2011)
9. Teytaud, O., Decock, J.: Noisy optimization complexity. In: FOGA - Foundations of Genetic Algorithms XII, Adelaide, Australia (2013)
10. Beyer, H.G.: The Theory of Evolution Strategies. Natural Computing Series. Springer, Heidelberg (2001)
11. Yang, X., Birkfellner, W., Niederer, P.: Optimized 2d/3d medical image registration using the estimation of multivariate normal algorithm (EMNA). In: Biomedical Engineering. (2005)
12. Anderson, E.J., Ferris, M.C.: A direct search algorithm for optimization with noisy function evaluations. SIAM Journal on Optimization 11 (2001) 837–857
13. Lucidi, S., Sciandrone, M.: A derivative-free algorithm for bound constrained optimization. Comp. Opt. and Appl. 21 (2002) 119–142
14. Kim, S., Zhang, D.: Convergence properties of direct search methods for stochastic optimization. In: Proceedings of the Winter Simulation Conference. WSC ’10 (2010) 1003–1011
15. Hansen, N., Niederberger, S., Guzzella, L., Koumoutsakos, P.: A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Transactions on Evolutionary Computation 13 (2009) 180–197
16. Villemonteix, J., Vazquez, E., Walter, E.: An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization (2008)
17. Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Annals of Mathematical Statistics 38 (1967) 191–200
18. Rolet, P., Teytaud, O.: Bandit-based estimation of distribution algorithms for noisy optimization: Rigorous runtime analysis. In: Proceedings of LION 4 (2009) 97–110; presented at TRSH 2009, Birmingham
19. Auger, A.: Convergence results for the (1,λ)-SA-ES using the theory of ϕ-irreducible Markov chains. Theoretical Computer Science 334 (2005) 35–69
20. Fournier, H., Teytaud, O.: Lower bounds for comparison-based evolution strategies using VC-dimension and sign patterns. Algorithmica (2010)