Derivative Free Optimization

Anne Auger (Inria)* and Laurent Dumas (U. Versailles)
AMS Master - Optimization, Paris-Saclay Master
* [email protected]

2015

Course Organization

Part I - Stochastic Methods

taught by Anne Auger

Friday 27/11, 04/12, 11/12, 18/12, 08/01 from 2pm to 5:15pm

Part II - Deterministic Methods

taught by Laurent Dumas

Friday 15/01, 22/01, 29/01, 05/02, 12/02 from 2pm to 5:15pm

Final Exam

Friday 19/02 from 2pm till 5pm

Announcement: Master Thesis / PhD Projects

We propose various master thesis projects that can be followed by a PhD thesis directly related to this class, i.e. around adaptive stochastic search algorithms (in particular CMA-ES).

Projects range from
- Theory (convergence proofs using stochastic approximation theory or Markov chain theory, ...)
- Algorithm design (for large-scale, multi-objective, constrained, ... problems)
- Applications (CIFRE PhD thesis for example)

Some subjects are posted here: randopt.gforge.inria.fr/thesisprojects/ or contact me directly: [email protected]

First Example of a Black-Box Continuous Optimization Problem

Problem Statement: Continuous Domain Search/Optimization

- Task: minimize an objective function (fitness function, loss function) in continuous domain

    f : X ⊆ Rⁿ → R,  x ↦ f(x)

- Black box scenario (direct search scenario): the algorithm only submits points x and observes values f(x)
  - gradients are not available or not useful
  - problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding
- Search costs: number of function evaluations

What Makes a Function Difficult to Solve? Why stochastic search?

- non-linear, non-quadratic, non-convex
  (on linear and quadratic functions much better search policies are available)
- ruggedness
  (non-smooth, discontinuous, multimodal, and/or noisy function)
- dimensionality (size of search space)
  ((considerably) larger than three)
- non-separability
  (dependencies between the objective variables)
- ill-conditioning
  (gradient direction versus Newton direction)

[Figures: a rugged 1-D function on [−4, 4], and iso-fitness lines on [−3, 3]² illustrating the gradient direction versus the Newton direction]

Curse of Dimensionality The term Curse of dimensionality (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. Example: Consider placing 100 points onto a real interval, say [0, 1]. To get similar coverage, in terms of distance between adjacent points, of the 10-dimensional space [0, 1]¹⁰ would require 100¹⁰ = 10²⁰ points. The 100 points now appear as isolated points in a vast empty space. Consequence: a search policy (e.g. exhaustive search) that is valuable in small dimensions might be useless in moderate or large dimensional search spaces.
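The grid-coverage argument above is a one-line computation; a small sketch makes the growth explicit (the function name is illustrative, not from the course):

```python
# Grid points needed to keep the same spacing in dimension n:
# a grid with k points per axis on [0, 1] has spacing 1/(k - 1);
# covering [0, 1]^n at the same spacing needs k^n points.

def grid_points_needed(k: int, n: int) -> int:
    """Number of points of a k-per-axis grid of [0, 1]^n."""
    return k ** n

print(grid_points_needed(100, 1))    # 100 points on [0, 1]
print(grid_points_needed(100, 10))   # 100**10 == 10**20 points on [0, 1]^10
```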

Separable Problems

Definition (Separable Problem)
A function f is separable if

  arg min_{(x₁,...,xₙ)} f(x₁, ..., xₙ) = ( arg min_{x₁} f(x₁, ...), ..., arg min_{xₙ} f(..., xₙ) )

⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes

Example: additively decomposable functions

  f(x₁, ..., xₙ) = Σ_{i=1}^{n} fᵢ(xᵢ)

Rastrigin function: f(x) = 10n + Σ_{i=1}^{n} (xᵢ² − 10 cos(2π xᵢ))

[Figure: iso-fitness lines of the 2-D Rastrigin function on [−3, 3]²]
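The definition can be exercised directly: since the Rastrigin function is additively decomposable, n independent 1-D searches recover its global optimum. The sketch below uses a crude 1-D grid search per coordinate (an illustrative choice, not an algorithm from the course):

```python
import numpy as np

# Separable case: minimize f(x) = 10 n + sum_i (x_i^2 - 10 cos(2 pi x_i))
# by n independent 1-D optimizations of the component x^2 - 10 cos(2 pi x).

def rastrigin(x):
    x = np.asarray(x, dtype=float)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def minimize_1d(fi, lo=-5.12, hi=5.12, num=100_001):
    """Grid-search minimizer of a 1-D component function fi."""
    grid = np.linspace(lo, hi, num)
    return grid[np.argmin(fi(grid))]

n = 5
component = lambda t: t**2 - 10 * np.cos(2 * np.pi * t)
x_star = np.array([minimize_1d(component) for _ in range(n)])

print(x_star)              # each coordinate close to 0, the global optimum
print(rastrigin(x_star))   # close to 0
```

A non-separable function offers no such shortcut: the coordinate-wise minimizers then depend on the other coordinates.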

Non-Separable Problems

Building a non-separable problem from a separable one (1,2)

Rotating the coordinate system:
- f : x ↦ f(x) separable
- f : x ↦ f(Rx) non-separable, with R a rotation matrix

[Figure: iso-fitness lines of a separable function on [−3, 3]² and, after applying R, of its rotated non-separable counterpart]

1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). Reevaluating genetic algorithm performance under coordinate rotation of benchmark functions: A survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3):263-278
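The construction x ↦ f(Rx) is easy to reproduce numerically. A common way to draw a random rotation (an implementation choice, not prescribed by the slide) is the QR decomposition of a Gaussian matrix:

```python
import numpy as np

# Turn a separable function into a non-separable one: x -> f(R x),
# with R orthogonal. Note that the optimum value is unchanged.

rng = np.random.default_rng(0)

def random_rotation(n, rng):
    Q, R_ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R_))   # sign fix for a uniform rotation

def rastrigin(x):
    x = np.asarray(x, dtype=float)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

n = 4
R = random_rotation(n, rng)
rotated = lambda x: rastrigin(R @ x)   # non-separable version of rastrigin

print(np.allclose(R @ R.T, np.eye(n)))   # R is orthogonal: True
print(rotated(np.zeros(n)))              # optimum value unchanged: 0.0
```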

Ill-Conditioned Problems

If f is convex quadratic,

  f : x ↦ (1/2) xᵀHx = (1/2) Σᵢ h_{i,i} xᵢ² + (1/2) Σ_{i≠j} h_{i,j} xᵢ xⱼ

with H a positive definite, symmetric matrix, then H is the Hessian matrix of f.

Ill-conditioned means a high condition number of the Hessian matrix H:

  cond(H) = λ_max(H) / λ_min(H)

Example / exercise: f(x) = (1/2)(x₁² + 9x₂²), condition number equals 9.

[Figure: shape of the iso-fitness lines of f on [−1, 1]²]

Consider the curvature of the iso-fitness lines: ill-conditioned means "squeezed" lines of equal function value (high curvatures). The gradient direction −f′(x)ᵀ then differs from the Newton direction −H⁻¹f′(x)ᵀ. The condition number equals nine here; condition numbers up to 10¹⁰ are not unusual in real-world problems.
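Both quantities on this slide can be checked numerically. For the example f(x) = (1/2)(x₁² + 9x₂²) the Hessian is diag(1, 9), and for a convex quadratic the Newton direction points straight at the optimum while the gradient direction does not (when cond(H) > 1):

```python
import numpy as np

# Condition number of the Hessian of f(x) = (1/2)(x1^2 + 9 x2^2).
H = np.diag([1.0, 9.0])
eigvals = np.linalg.eigvalsh(H)
cond = eigvals.max() / eigvals.min()
print(cond)  # 9.0

# Newton direction -H^{-1} f'(x)^T versus gradient direction -f'(x)^T.
x = np.array([1.0, 1.0])
grad = H @ x                             # f'(x)^T = H x for f(x) = (1/2) x^T H x
newton_step = -np.linalg.solve(H, grad)
print(x + newton_step)                   # lands exactly on the optimum [0, 0]
```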

Stochastic Search

A black box search template to minimize f : Rⁿ → R:

Initialize distribution parameters θ, set population size λ ∈ N.
While not terminate:
  1. Sample distribution P(x|θ) → x₁, ..., x_λ ∈ Rⁿ
  2. Evaluate x₁, ..., x_λ on f
  3. Update parameters θ ← F_θ(θ, x₁, ..., x_λ, f(x₁), ..., f(x_λ))

Everything depends on the definition of P and F_θ.

In Evolutionary Algorithms the distribution P is often implicitly defined via operators on a population, in particular selection, recombination and mutation. This is the natural template for Estimation of Distribution Algorithms.
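The template can be written down almost verbatim in code. In the sketch below, θ, the sampling rule and the update F_θ are plugged in with deliberately naive toy choices (recenter on the best sample, shrink σ geometrically); this is only meant to show the skeleton, not a real algorithm from the course:

```python
import numpy as np

# The black-box search template: sample P(x|theta), evaluate, update theta.
def stochastic_search(f, theta0, sample, update, lam=10, iterations=100, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = theta0
    for _ in range(iterations):
        xs = [sample(theta, rng) for _ in range(lam)]   # 1. sample P(x|theta)
        fs = [f(x) for x in xs]                         # 2. evaluate on f
        theta = update(theta, xs, fs)                   # 3. theta <- F_theta(...)
    return theta

# Toy instantiation (illustrative only): theta = (mean, sigma).
sphere = lambda x: float(np.sum(x**2))
sample = lambda th, rng: th[0] + th[1] * rng.standard_normal(len(th[0]))
update = lambda th, xs, fs: (xs[int(np.argmin(fs))], th[1] * 0.95)

mean, sigma = stochastic_search(sphere, (np.zeros(3) + 2.0, 1.0), sample, update)
print(sphere(mean))   # much smaller than the starting value
```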

A Simple Example: The Pure Random Search
(Also an Ineffective Example)

The pure random search:
- Sample uniformly at random a solution
- Return the best solution ever found

Exercise:
- Consider uniform sampling on [−1, 1]ⁿ and f(x) = ‖x‖₁²
- Show that the algorithm converges in probability
- Compute the expected hitting time of a ball of radius ε (we can consider the infinity norm)

Non-adaptive algorithm: for the pure random search, P(x|θ) is independent of θ (i.e. there is no θ to be adapted): the algorithm is "blind". In this class we present algorithms that are much better than that.

Evolution Strategies

New search points are sampled normally distributed, as perturbations of m:

  xᵢ = m + σ yᵢ   for i = 1, ..., λ,   with yᵢ i.i.d. ∼ N(0, C)

where xᵢ, m ∈ Rⁿ, σ ∈ R₊, C ∈ Rⁿˣⁿ, and
- the mean vector m ∈ Rⁿ represents the favorite solution
- the so-called step-size σ ∈ R₊ controls the step length
- the covariance matrix C ∈ Rⁿˣⁿ determines the shape of the distribution ellipsoid

Here, all new points are sampled with the same parameters. The question remains how to update m, C, and σ.

Normal Distribution

1-D case: probability density of the standard normal distribution N(0, 1), with (expected (mean) value, variance) = (0, 1):

  p(x) = (1/√(2π)) exp(−x²/2)

[Figure: density of N(0, 1) on [−4, 4]]

General case: normal distribution N(m, σ²) with (expected value, variance) = (m, σ²):

  p_{m,σ}(x) = (1/(σ√(2π))) exp(−(x − m)²/(2σ²))

- A normal distribution is entirely determined by its mean value and variance
- The family of normal distributions is closed under linear transformations: if X is normally distributed then a linear transformation aX + b is also normally distributed
- Exercise: show that m + σ N(0, 1) = N(m, σ²)

Normal Distribution, General Case

A random variable following a 1-D normal distribution is determined by its mean value m and variance σ². In the n-dimensional case it is determined by its mean vector and covariance matrix.

Covariance Matrix

If the entries in a vector X = (X₁, ..., Xₙ)ᵀ are random variables, each with finite variance, then the covariance matrix Σ is the matrix whose (i, j) entries are the covariances

  Σᵢⱼ = cov(Xᵢ, Xⱼ) = E[(Xᵢ − µᵢ)(Xⱼ − µⱼ)]

where µᵢ = E(Xᵢ). Considering the expectation of a matrix as the expectation of each entry, we have

  Σ = E[(X − µ)(X − µ)ᵀ]

Σ is symmetric and positive semi-definite.

The Multi-Variate (n-Dimensional) Normal Distribution

Any multi-variate normal distribution N(m, C) is uniquely determined by its mean value m ∈ Rⁿ and its symmetric positive definite n × n covariance matrix C. Its density reads

  p_{N(m,C)}(x) = (1/((2π)^{n/2} |C|^{1/2})) exp( −(1/2)(x − m)ᵀ C⁻¹ (x − m) )

The mean value m
- determines the displacement (translation): N(m, C) = m + N(0, C)
- is the value with the largest density (modal value)
- the distribution is symmetric about the distribution mean

[Figure: density of a 2-D normal distribution]

The covariance matrix C
- determines the shape
- geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ Rⁿ | (x − m)ᵀ C⁻¹ (x − m) = 1}

Lines of Equal Density

- N(m, σ²I) ∼ m + σ N(0, I): one degree of freedom (σ); components are independent standard normally distributed
- N(m, D²) ∼ m + D N(0, I): n degrees of freedom; components are independent, scaled
- N(m, C) ∼ m + C^{1/2} N(0, I): (n² + n)/2 degrees of freedom; components are correlated

where I is the identity matrix (isotropic case), D is a diagonal matrix (reasonable for separable problems), and A × N(0, I) ∼ N(0, AAᵀ) holds for all A.
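The property A N(0, I) ∼ N(0, AAᵀ) is what makes sampling from N(m, C) cheap: any A with AAᵀ = C works. The sketch below uses a Cholesky factor as A (a common implementation choice; the slide's C^{1/2} denotes the symmetric square root, another valid A):

```python
import numpy as np

# Sample N(m, C) as m + A N(0, I), where A A^T = C (Cholesky factor).
rng = np.random.default_rng(2)
m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

A = np.linalg.cholesky(C)                    # A @ A.T == C
samples = m + rng.standard_normal((100_000, 2)) @ A.T

print(samples.mean(axis=0))                  # approximately m
print(np.cov(samples.T))                     # approximately C
```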

Where are we?

Problem Statement
  Black Box Optimization and Its Difficulties
  Non-Separable Problems
  Ill-Conditioned Problems
Stochastic search algorithms - basics
  A Search Template
  A Natural Search Distribution: the Normal Distribution
  Adaptation of Distribution Parameters: What to Achieve?
Adaptive Evolution Strategies
  Mean Vector Adaptation
  Step-size control: Theory, Algorithms
  Covariance Matrix Adaptation: Rank-One Update, Cumulation—the Evolution Path, Rank-µ Update

Adaptation: What Do We Want to Achieve?

New search points are sampled normally distributed,

  xᵢ = m + σ yᵢ   for i = 1, ..., λ,   with yᵢ i.i.d. ∼ N(0, C)

where xᵢ, m ∈ Rⁿ, σ ∈ R₊, C ∈ Rⁿˣⁿ.

- The mean vector should represent the favorite solution.
- The step-size σ controls the step length and thus the convergence rate; its adaptation should allow to reach the fastest convergence rate possible.
- The covariance matrix C ∈ Rⁿˣⁿ determines the shape of the distribution ellipsoid; its adaptation should allow to learn the "topography" of the problem, particularly important for ill-conditioned problems: C ∝ H⁻¹ on convex quadratic functions.


Evolution Strategies (ES): Simple Update for the Mean Vector

Let µ be the number of parents and λ the number of offspring.

Plus (elitist) and comma (non-elitist) selection:
- (µ + λ)-ES: selection in {parents} ∪ {offspring}
- (µ, λ)-ES: selection in {offspring}

ES algorithms emerged in the community of bio-inspired methods, where a parallel between optimization and the evolution of species as described by Darwin originally served as inspiration. Nowadays this parallel is mainly visible through the terminology used: candidate solutions are parents or offspring, the objective function is a fitness function, ...

(1 + 1)-ES: sample one offspring from the parent m,

  x = m + σ N(0, C)

If x is better than m, select m ← x.

The (µ/µ, λ)-ES: Update of the Mean Vector
Non-elitist selection and intermediate (weighted) recombination

Given the i-th solution point xᵢ = m + σ yᵢ with yᵢ ∼ N(0, C), let x_{i:λ} denote the i-th ranked solution point, such that f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ}).

Notation: we denote by y_{i:λ} the vector such that x_{i:λ} = m + σ y_{i:λ}.
Exercise: realize that y_{i:λ} is generally not distributed as N(0, C).

The new mean reads

  m ← Σ_{i=1}^{µ} wᵢ x_{i:λ} = m + σ Σ_{i=1}^{µ} wᵢ y_{i:λ} = m + σ y_w

where w₁ ≥ ... ≥ w_µ > 0, Σ_{i=1}^{µ} wᵢ = 1, and 1 / Σ_{i=1}^{µ} wᵢ² =: µ_w ≈ λ/4.

The best µ points are selected from the new solutions (non-elitist) and weighted intermediate recombination is applied.
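One generation of this mean update can be sketched as follows, here with C = I. The particular log-linear weights are a commonly used illustrative choice (the slide only requires w₁ ≥ ... ≥ w_µ > 0 summing to one):

```python
import numpy as np

# One (mu/mu_w, lambda) generation: sample, rank by f, recombine the best mu.
def mean_update(f, m, sigma, lam, rng):
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                # w_1 >= ... >= w_mu > 0, sum = 1
    mu_w = 1.0 / np.sum(w**2)                   # variance effective selection mass

    y = rng.standard_normal((lam, len(m)))      # y_i ~ N(0, I), i.e. C = I
    x = m + sigma * y                           # x_i = m + sigma y_i
    order = np.argsort([f(xi) for xi in x])     # f(x_{1:lam}) <= ... <= f(x_{lam:lam})
    y_w = w @ y[order[:mu]]                     # y_w = sum_i w_i y_{i:lam}
    return m + sigma * y_w, mu_w

rng = np.random.default_rng(3)
sphere = lambda x: float(np.sum(x**2))
m = np.ones(5)
m_new, mu_w = mean_update(sphere, m, sigma=0.3, lam=10, rng=rng)
print(mu_w)                         # roughly lam / 4
print(sphere(m), sphere(m_new))
```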

Invariance Under Monotonically Increasing Functions

Rank-based algorithms: the update of all parameters uses only the ranks,

  f(x_{1:λ}) ≤ f(x_{2:λ}) ≤ ... ≤ f(x_{λ:λ})
  ⟺ g(f(x_{1:λ})) ≤ g(f(x_{2:λ})) ≤ ... ≤ g(f(x_{λ:λ}))

for all g strictly monotonically increasing: g preserves ranks.


Why Step-Size Control?

[Figure: function value versus function evaluations on f(x) = Σ_{i=1}^{n} xᵢ² in [−2.2, 0.8]ⁿ for n = 10, comparing random search and the (1+1)-ES with constant step-size, too small step-size, too large step-size, and the optimal (scale-invariant) step-size]

Why Step-Size Control? (continued)

On f(x) = Σ_{i=1}^{n} xᵢ² for n = 10 and x₀ ∈ [−0.2, 0.8]ⁿ, plotting ‖m − x*‖ = √f(x):

- [Figure: (5/5_w, 10)-ES, 11 runs, with optimal step-size]
- [Figure: (5/5_w, 10)-ES, 2×11 runs, optimal versus adaptive step-size, the latter with too small initial σ]
- (5/5_w, 10)-ES, comparing the number of f-evaluations to reach ‖m‖ = 10⁻⁵: (1100 − 100)/650 ≈ 1.5
- (5/5_w, 10)-ES, comparing the optimal versus the default damping parameter d_σ: 1700/1100 ≈ 1.5

Why Step-Size Control? (overview)

[Figure, left: function value versus function evaluations, comparing random search, constant σ, adaptive step-size σ, and optimal (scale-invariant) step-size. Figure, right: normalized progress versus normalized step size, with σ_opt ≈ ‖parent‖/n]

The evolution window refers to the step-size interval where reasonable performance is observed.

Step-Size Control: Theory

- On a well-conditioned problem (sphere function f(x) = ‖x‖²), step-size adaptation should allow to reach (close to) optimal convergence rates. One needs to be able to solve optimally the simple scenarios (linear function, sphere function) that quite often (always?) need to be solved when addressing a real-world problem.
- Is it possible to quantify the optimal convergence rate for step-size adaptive ESs?

Lower Bound for Convergence
Exemplified on the (1+1)-ES

Consider a (1+1)-ES with any step-size adaptation mechanism.

(1+1)-ES with adaptive step-size, iteration k:

  X̃_{k+1} = X_k + σ_k N_{k+1},   with (N_k)_k i.i.d. ∼ N(0, I)

where X_k is the parent, σ_k the step size, and X̃_{k+1} the offspring, and

  X_{k+1} = X̃_{k+1} if f(X̃_{k+1}) ≤ f(X_k), otherwise X_{k+1} = X_k.

Lower Bound for Convergence (II)
Exemplified on the (1+1)-ES

Theorem
For any objective function f : Rⁿ → R and for any y* ∈ Rⁿ,

  E[ln ‖X_{k+1} − y*‖] ≥ E[ln ‖X_k − y*‖] − τ   (lower bound)

where

  τ = max_{σ ∈ R₊} φ(σ),   φ(σ) := −E[ln ‖e₁ + σN‖]

with e₁ = (1, 0, ..., 0) and N ∼ N(0, I).

"Tight" lower bound

Theorem
The lower bound is reached on the sphere function f(x) = g(‖x − y*‖) (with g : R → R an increasing mapping) for step-size proportional to the distance to the optimum, σ_k = σ ‖x − y*‖, with σ := σ_opt such that φ(σ_opt) = τ.

(Log)-Linear Convergence of Scale-Invariant Step-Size ES

Theorem
The (1+1)-ES with step-size proportional to the distance to the optimum, σ_k = σ ‖X_k‖, converges (log)-linearly on the sphere function f(x) = g(‖x‖) (with g : R → R an increasing mapping) in the sense

  (1/k) ln( ‖X_k‖ / ‖X_0‖ ) → −φ(σ) =: CR_{(1+1)}(σ)   as k → ∞,

almost surely.

[Figure, left: distance to the optimum versus function evaluations for n = 20 and σ = 0.6/n. Figure, right: CR_{(1+1)}(σ) × dimension versus σ × dimension for dimensions 2, 3, 5, 10, 20, 160, with the minimizing σ marked for each dimension]
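The theorem is easy to observe empirically: with the scale-invariant step-size σ_k = σ‖X_k‖ on the sphere, the empirical rate (1/k) ln(‖X_k‖/‖X_0‖) settles at a negative constant. A minimal simulation, using n = 20 and σ = 0.6/n as in the plot:

```python
import numpy as np

# (1+1)-ES with scale-invariant step-size on the sphere f(x) = ||x||^2.
rng = np.random.default_rng(4)
n = 20
sigma = 0.6 / n                 # step-size multiplier, as in the figure
x = np.ones(n)
x0_norm = np.linalg.norm(x)

k = 5000
for _ in range(k):
    cand = x + sigma * np.linalg.norm(x) * rng.standard_normal(n)
    if np.linalg.norm(cand) <= np.linalg.norm(x):   # elitist selection
        x = cand

rate = np.log(np.linalg.norm(x) / x0_norm) / k
print(rate)   # negative constant: log-linear convergence
```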

Asymptotic Results (n → ∞)

Theorem
Let σ > 0. The convergence rate of the (1+1)-ES with scale-invariant step-size on spherical functions satisfies, in the limit,

  lim_{n→∞} n × CR_{(1+1)}(σ/n) = −(σ/√(2π)) exp(−σ²/8) + (σ²/2)(1 − Φ(σ/2))

where Φ is the cumulative distribution function of the standard normal distribution. The optimal convergence rate decreases to zero like 1/n.

[Figure: n × CR_{(1+1)} versus σ × dimension; the minimum is close to −0.2]

Summary of Theory Results

[Figure, left: function value versus function evaluations, comparing random search, constant σ, adaptive step-size σ, and optimal (scale-invariant) step-size. Figure, right: normalized progress versus normalized step size, with σ_opt ≈ ‖parent‖/n; the evolution window refers to the step-size interval where reasonable performance is observed]


Methods for Step-Size Control

- 1/5-th success rule (a,b), often applied with "+"-selection: increase the step-size if more than 20% of the new solutions are successful, decrease otherwise
- σ-self-adaptation (c), applied with ","-selection: mutation is applied to the step-size and the better one, according to the objective function value, is selected; simplified "global" self-adaptation
- path length control (d) (Cumulative Step-size Adaptation, CSA) (e), applied with ","-selection

a Rechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
b Schumer and Steiglitz 1968, Adaptive step size random search, IEEE TAC
c Schwefel 1981, Numerical Optimization of Computer Models, Wiley
d Hansen and Ostermeier 2001, Completely derandomized self-adaptation in evolution strategies, Evol. Comput. 9(2)
e Ostermeier et al. 1994, Step-size adaptation based on non-local use of selection information, PPSN IV

One-Fifth Success Rule

[Figures: sketches of the two regimes; with probability of success p_s ≈ 1/2 the step-size is "too small" (increase σ), with p_s ≈ 1/5 it is about right]

p_s: number of successful offspring / number of offspring (per generation),

  σ ← σ × exp( (1/3) × (p_s − p_target) / (1 − p_target) )

Increase σ if p_s > p_target, decrease σ if p_s < p_target.

For the (1+1)-ES: p_target = 1/5.
- If the offspring is better than the parent: p_s = 1, σ ← σ × exp(1/3)
- Else: p_s = 0, σ ← σ / exp(1/3)^{1/4}

Why 1/5?

Asymptotic convergence rate and probability of success of the scale-invariant step-size (1+1)-ES on the sphere (asymptotic results, i.e. n = ∞, see the slides before):

[Figure: CR_{(1+1)} × dimension and probability of success versus σ × dimension, with the minimum of CR_{(1+1)} marked]

1/5 is a trade-off between the optimal probability of success on the sphere and on the corridor.

Path Length Control (CSA): The Concept of Cumulative Step-Size Adaptation

Measure the length of the evolution path, the pathway of the mean vector m in the generation sequence:

  xᵢ = m + σ yᵢ,   m ← m + σ y_w

[Figure: two evolution paths; when single steps cancel each other out (short path), decrease σ; when they point in similar directions (long path), increase σ]

Path Length Control (CSA): The Equations

Sampling of solutions and notations as on the slide "The (µ/µ, λ)-ES - Update of the mean vector", with C equal to the identity.

Initialize m ∈ Rⁿ, σ ∈ R₊, evolution path p_σ = 0; set c_σ ≈ 4/n, d_σ ≈ 1.

  m ← m + σ y_w,   where y_w = Σ_{i=1}^{µ} wᵢ y_{i:λ}   (update mean)

  p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w y_w

where √(1 − (1 − c_σ)²) accounts for the factor 1 − c_σ and √µ_w accounts for the weights wᵢ, and

  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ − 1 ) )   (update step-size)

The multiplicative factor is greater than one if and only if ‖p_σ‖ is greater than its expectation.

Step-Size Adaptation: What is Achieved

(1+1)-ES with one-fifth success rule:

[Figure: function value versus function evaluations on f(x) = Σ_{i=1}^{n} xᵢ² in [−0.2, 0.8]ⁿ for n = 10, comparing random search, constant σ, optimal (scale-invariant) step-size, and the adaptive step-size σ (blue), which achieves linear convergence]

(5/5, 10)-CSA-ES with default parameters:

[Figure: ‖m − x*‖ versus function evaluations on f(x) = Σ_{i=1}^{n} xᵢ² in [−0.2, 0.8]ⁿ for n = 30]
