Global Optimization by Modified Diffusion

Oleg V. Poliannikov, Elena Zhizhina and Hamid Krim

September 30, 2006

Abstract

In this paper, we study a stochastic diffusion dynamics with a general diffusion coefficient. The main result is that adapting the diffusion coefficient to the Hamiltonian allows the dynamics to escape wide local minima and speeds up its convergence to the global minima. We prove the convergence of the invariant measure of the modified dynamics to a measure concentrated on the set of global minima and show how to choose a diffusion coefficient for a certain class of Hamiltonians.

1 Introduction

Gibbs field based stochastic methods have long been recognized as an effective approach to problems of global optimization; see for instance [2, 4, 5, 3, 1, 8]. Their essence can be summarized in the following way. Consider an equilibrium stochastic dynamics with a stationary Gibbs measure, where the latter is associated with some energy functional H. The dynamics is then changed so that it is no longer in equilibrium, and its limit distribution is concentrated on the set of global minima of H. This approach to optimization is generally called simulated annealing. More precisely, consider a stochastic diffusion dynamics whose invariant measure is given by

    dµ(x) = (1/Z) e^{−βH(x)} dm(x),    (1)

where X = [a, b]^d ⊂ R^d, H : X → R is the energy functional, and m is the normalized Lebesgue measure on X. We seek to find the set of global minima of H. The classical technique for solving this problem is to stochastically perturb the deterministic gradient descent:

    dX(t) = −∇H(X(t)) dt + σ(t) dW(t),    (2)

where W(t) is a realization of the standard Brownian motion. It is known that in order for the limiting measure to concentrate on the set of global minima of H, the diffusion coefficient should slowly decay to zero. We conventionally refer to the behavior of the function σ as the cooling schedule of our dynamics. We call this standard dynamics (spatially) homogeneous because the diffusion coefficient does not depend on the state variable X(t).

In this paper, we propose a new diffusion process whose distinguishing feature is a spatially inhomogeneous diffusion coefficient. It is important that the stationary Gibbs distribution of the newly introduced dynamics be identical to that of the homogeneous diffusion. We show, however, that by appropriately constructing the inhomogeneous diffusion, one can improve the speed of convergence of the overall dynamics to the stationary distribution. We prove that the order of the speed of convergence cannot be improved upon, but the corresponding coefficient can in principle be chosen optimally. The inhomogeneous diffusion coefficient that leads to the optimal speed of convergence depends on the functional H at hand; its exact form for a general H remains an open problem. Of particular interest in many applications is the situation where the global minimum of H is so narrow that a standard diffusion tends to overlook it. We demonstrate that it is possible to adapt the diffusion to the cost functional and hence alleviate this problem. The adapted diffusion is shown to outperform its classical counterpart.

This paper is organized as follows. In the second section we formulate results on the stochastic dynamics and its approximations, which are non-homogeneous Markov chains. In the third section we analyze and compare the convergence properties of the modified diffusions and of the Langevin dynamics using numerical simulations. Conclusions and a description of possible extensions are deferred to the fourth section. Finally, Appendix A contains the proof of the main result.
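For concreteness, the classical annealed dynamics (2) can be simulated with a plain Euler–Maruyama step. The sketch below is illustrative only: the step size, the decay law for σ(t), and the clipping to the domain are ad hoc choices made for this sketch, not the schedule analyzed in this paper.

```python
import numpy as np

def classical_annealing(grad_H, x0, n_steps=10_000, sigma0=1.0, decay=1e-3,
                        bounds=(-np.pi, np.pi), dt=1e-2, rng=None):
    """Euler-Maruyama discretization of dX = -H'(X) dt + sigma(t) dW
    with a slowly decaying noise level sigma(t) (illustrative schedule)."""
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    for n in range(n_steps):
        sigma = sigma0 / np.sqrt(1.0 + decay * n)          # slow cooling
        x += -grad_H(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        x = float(np.clip(x, *bounds))                     # stay inside X = [a, b]
    return x
```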

2 Main Results

Let X = [−R, R] ⊂ R, and denote by F(X) a space of functions on X with periodic boundary conditions. Let H : X → R be a function bounded from below. For definiteness, assume that

    H(x) ≥ 0 for all x ∈ [−R, R],    min_{x ∈ [−R, R]} H(x) = 0.    (3)

Proposition 1. Consider a continuous time Markov process on X whose infinitesimal generator L_a is given by

    L_a f(x) = ½ δ(a) ω(x) f″(x) − [ω(x) H′(x) − ½ δ(a) ω′(x)] f′(x),    f ∈ F(X).    (4)

Then the stationary distribution γ̃^a has a Gibbs density, i.e.

    dγ̃^a(x) = (1/Z_{δ(a)}) e^{−2H(x)/δ(a)} dm(x),    (5)

where Z_{δ(a)} is a normalization constant, and m is the normalized Lebesgue measure on X.

Proof. Follows from the equality ∫ L_a f(x) dγ̃^a(x) = 0 for any f ∈ D(L_a).

Consider an approximation in time of the above diffusion process. It is a Markov chain X_n given by

    X_{n+1} = X_n − [ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)] a + √(a δ(a) ω(X_n)) W_n,    (6)

where a, δ(a) > 0 are parameters, ω(x) is a fixed nonnegative smooth function, and (W_n) is an i.i.d. random sequence with E[W_n] = 0, var[W_n] = 1.

Proposition 2. For any a, δ(a) > 0 the Markov chain (6) has a stationary distribution, which will be denoted γ^a.

Proof. See, for example, [6].
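As a reading aid, the update rule (6) translates directly into code. The sketch below is a minimal transcription under stated assumptions: the handles H_prime, omega and omega_prime are user-supplied, W_n is taken to be standard Gaussian (any zero-mean, unit-variance choice is admissible), and boundary handling on X = [−R, R] is left to the caller.

```python
import numpy as np

def modified_step(x, a, delta_a, H_prime, omega, omega_prime, rng):
    """One update of the Markov chain (6):
    X_{n+1} = X_n - [omega(X_n) H'(X_n) - 0.5 delta(a) omega'(X_n)] * a
              + sqrt(a * delta(a) * omega(X_n)) * W_n,
    with E[W_n] = 0 and var[W_n] = 1 (standard normal here)."""
    drift = (omega(x) * H_prime(x) - 0.5 * delta_a * omega_prime(x)) * a
    noise = np.sqrt(a * delta_a * omega(x)) * rng.standard_normal()
    return x - drift + noise
```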

We now show that as a → 0, the stationary measure γ^a becomes concentrated on the set of global minima of H. First, we introduce the following notation. We denote by O_ε(x) the ε-neighborhood of any x ∈ X, and by U_ε(H) the union of the ε-neighborhoods of all global minima of H:

    O_ε(x) = X ∩ (x − ε, x + ε),    U_ε(H) = ∪_{x : H(x) = 0} O_ε(x).    (7)

Theorem 1. Let ε > 0 and assume the existence of C_0 > 0 such that δ(a) ≥ C_0/(−ln a). Then γ^a(U_ε(H)) → 1 as a → 0 and δ(a) → 0.

Proof. See Appendix A.

It is seen from Equations (A18)–(A20) that the rate of convergence of the modified diffusion remains similar, since the parameter δ(a) in the cooling procedure decreases in the same way as for the classical dynamics. The speed of convergence, however, may be improved by minimizing the coefficients in (A18)–(A19). We note that if X̃_n ≈ X_n, the coefficients in (A18)–(A19) have the same expression:

    K(ω) ≡ E_{γ^a}[ω(X_n) (H′(X_n))³].    (8)

This, together with the form of the modified drift, suggests that the function ω could be chosen inversely proportional to H′:

    ω(X) ≡ ω(H′(X)) = { 1/|H′(X)|,  |H′(X)| > k,
                        1/k,        |H′(X)| ≤ k,    (9)

where a suitable choice of the parameter k ensures the stability of the numerical algorithm in the neighborhood of local minima, where H′(X) ≈ 0.
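A direct implementation of the choice (9) is shown below; the threshold k = 0.1 is purely illustrative, and the function assumes the scalar value of H′(X) is already available.

```python
def omega_adapted(h_prime_value, k=0.1):
    """Diffusion coefficient of Equation (9): 1/|H'(X)| when |H'(X)| > k,
    and 1/k otherwise, which keeps omega bounded near critical points
    where H'(X) is approximately 0."""
    return 1.0 / max(abs(h_prime_value), k)
```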

3 Numerical Simulations

In this section we present simulation results that demonstrate the performance of the newly proposed modified diffusion versus the standard dynamics. In both simulation cases we use the same cooling schedule:

    a_n = a_0 e^{−Kn},    (10)

where K > 0, K ≪ 1, e.g. K = 0.1. The functions H are chosen such that their global minima are narrow relative to other local minima, which is one of the most challenging situations in applications. We observe that with the choice of ω in (9), wide minima are naturally disfavored, as the jump size is large in such areas. Points then tend to concentrate in the vicinities of the global minima, where the value of H′ is larger and hence the random jumps are naturally suppressed.
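The pieces above combine into the simulation loop sketched below. It is a schematic reading of the scheme, not the authors' code: each cooling level a_n = a_0 e^{−Kn} is held for a fixed number of inner steps, δ(a) is taken as C_0/(−ln a) in line with Theorem 1 (C_0 = 1 is an arbitrary illustrative constant, valid for a < 1), and the ω′ correction term of (6) is dropped for simplicity.

```python
import numpy as np

def modified_annealing(H_prime, x0, n_stages=100, steps_per_stage=200,
                       a0=0.5, K=0.1, C0=1.0, k=0.1, bounds=None, rng=None):
    """Modified diffusion (6) with omega from (9) and cooling schedule (10)."""
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    for n in range(n_stages):
        a = a0 * np.exp(-K * n)                 # cooling schedule (10)
        d = C0 / max(-np.log(a), 1e-3)          # delta(a) >= C0 / (-ln a), cf. Theorem 1
        for _ in range(steps_per_stage):
            w = 1.0 / max(abs(H_prime(x)), k)   # omega(x), Equation (9)
            # the 0.5 * delta * omega'(x) drift correction of (6) is omitted here
            x += -w * H_prime(x) * a + np.sqrt(a * d * w) * rng.standard_normal()
            if bounds is not None:
                x = float(np.clip(x, *bounds))
    return x
```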

3.1 Example 1

Consider the function

    H_1(X) = α(X + π)² − (1 − α) · 1/(1 + X²/d),    (11)

where X = [−2π, π], α = 0.08 and d = 0.01. Let the initial point be X_0 = −3. We let 100 points start from X_0 and evolve each according to its own realization of the diffusion process. We observe that because of the shape of H, points tend to come back to the local minimum they started from if a is small (Figures 1 and 2). The modified diffusion, on the other hand, forces a particle to leave the neighborhood of the starting point but not that of the global minimum (Figure 3).
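For reference, the cost function of Example 1 and its derivative can be coded as below; the narrow-well term is written here as 1/(1 + X²/d), which with d = 0.01 gives a well of width on the order of √d ≈ 0.1 (this placement of d in (11) is an assumption of the sketch).

```python
import numpy as np

ALPHA, D = 0.08, 0.01   # parameters of Example 1

def H1(x):
    # H1(x) = alpha * (x + pi)^2 - (1 - alpha) / (1 + x^2 / d)
    return ALPHA * (x + np.pi) ** 2 - (1 - ALPHA) / (1 + x ** 2 / D)

def H1_prime(x):
    # d/dx of H1: quadratic part plus the derivative of the narrow well
    return 2 * ALPHA * (x + np.pi) + (1 - ALPHA) * (2 * x / D) / (1 + x ** 2 / D) ** 2
```

With the hypothetical helper sketched above, a single run would then read modified_annealing(H1_prime, x0=-3.0, bounds=(-2*np.pi, np.pi)); this is an illustrative call, not the setup used to produce the figures.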

3.2 Example 2

Consider now

    H_2(X) = X² cos(10X),    X ∈ [−π, π],    X_0 = 0.    (12)

Again we note that for smaller a_0, the classical diffusion fails to leave local minima, while higher values of a_0 result in uniform coverage of the entire domain, similar to the high-temperature regime. The modified diffusion, as expected, does a far better job at finding the global minima; see Figures 4–6.
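The second test function and its derivative, used as the drift in both dynamics, read as follows (a straightforward transcription of (12)).

```python
import numpy as np

def H2(x):
    # H2(x) = x^2 * cos(10 x) on [-pi, pi]
    return x ** 2 * np.cos(10 * x)

def H2_prime(x):
    # product rule: 2 x cos(10 x) - 10 x^2 sin(10 x)
    return 2 * x * np.cos(10 * x) - 10 * x ** 2 * np.sin(10 * x)
```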

3.3 Example 3

Finally, we consider a two dimensional example

    H_3(X, θ) = βΦ_1(X, θ) + (1 − β)Φ_2(X),    (13)

where Φ_1 and Φ_2 are "data fidelity" and "smoothness" terms, respectively:

    Φ_1(X, θ) = ‖X − θ‖²,    Φ_2(X) = −1/(1 + (1/d²)(X_1 − X_2)²),    X = (X_1, X_2)^⊤,    (14)

with θ = (−5, 5)^⊤, X_0 = θ, β = 0.01, d = 0.5, a_0 = 0.5. We observe (Figure 9) that, as in the previous examples, particles tend to cluster around the global minimum under the modified diffusion, whereas they hover around a local minimum for a small discretization step, and spread over the entire computation domain if the step is too large.

Figure 1: Classical diffusion: H(X) = H_1(X), X_0 = −3, a_0 = 0.5. (Panels: final distribution of points over H(x); derivative H′(x); histogram of points.)
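For the two dimensional example, only the gradient of H_3 is needed by the dynamics. The sketch below assumes the scheme (6) is applied coordinate-wise with ω evaluated on the gradient norm; the paper states the construction in one dimension, so this extension is an assumption of the sketch, as are the helper names.

```python
import numpy as np

BETA, D = 0.01, 0.5
THETA = np.array([-5.0, 5.0])

def H3_grad(x):
    """Gradient of H3(X) = beta * ||X - theta||^2
       - (1 - beta) / (1 + (X1 - X2)^2 / d^2),  X = (X1, X2)."""
    grad_phi1 = 2.0 * (x - THETA)                                # data fidelity term
    u = x[0] - x[1]
    dphi2_du = (2.0 * u / D ** 2) / (1.0 + (u / D) ** 2) ** 2    # smoothness term
    grad_phi2 = np.array([dphi2_du, -dphi2_du])
    return BETA * grad_phi1 + (1.0 - BETA) * grad_phi2

def modified_step_2d(x, a, delta_a, k=0.1, rng=None):
    """One adapted-diffusion step in 2D, with omega = 1/max(||grad H3||, k)."""
    rng = np.random.default_rng() if rng is None else rng
    g = H3_grad(x)
    w = 1.0 / max(np.linalg.norm(g), k)
    return x - w * g * a + np.sqrt(a * delta_a * w) * rng.standard_normal(2)
```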

4 Conclusions

Problems of global optimization are of great theoretical as well as practical importance. Gibbs field based methods have a tremendous potential because of their tractability and ease of implementation. The main shortcoming of these methods is the slow speed of convergence to the global minimum. In this paper we have considered diffusion dynamics with a general diffusion coefficient. We have shown that while it is impossible to improve the order of the speed of convergence, the speed of convergence itself may be improved by choosing a diffusion coefficient adapted to the particular functional H. The resulting dynamics then outperforms its classical counterpart at finding the global minimum of an energy functional.

Figure 2: Classical diffusion: H(X) = H_1(X), X_0 = −3, a_0 = 1.

We refer to our proposed approach as an adapted-diffusion-based algorithm, since the choice of the function ω(x) depends on the energy function H. In this case, the diffusion coefficient is non-homogeneous and is determined by the remainder terms (A18)–(A19). We did not discuss here how to find an optimal ω(x) for a given function H. Our main goal is to show that we can construct a non-homogeneous diffusion which escapes wide local minima and stays at deep and tight minima of the energy function H. When solving examples where the global minimum is very narrow, we proceed by cooling the dynamics where the value of H′ is large and heating it up where H′ ≈ 0. In so doing, we force the diffusion to favor the global minimum and reject other local extrema. As may be seen from the above simulations, our proposed algorithm reveals deep and tight minima, whereas the classical diffusion escapes these minima after a short time and prefers to stay at wide local minima. This is an important property, which is desirable in global optimization problems for functionals with deep and tight minima.

A Proof of Theorem 1

The main line of the proof follows reasoning from [7].


Figure 3: Modified diffusion: H(X) = H_1(X), X_0 = −3, a_0 = 0.5.

Fix ε > 0 and consider two smooth functions φ_1(x), φ_2(x) satisfying

    φ_1(x) = { 1,       H(x) > 3ε,
               α_1(x),  H(x) ∈ [2ε, 3ε],
               0,       H(x) < 2ε,    (A1)

and

    φ_2(x) = { 1,       H(x) < ε,
               α_2(x),  H(x) ∈ [ε, 3ε],
               0,       H(x) > 3ε,    (A2)

where 0 ≤ α_1(x), α_2(x) ≤ 1. Construct a function

    φ_a(x) = φ_1(x) − η(a) φ_2(x),    (A3)

such that

    ∫ φ_a(x) dγ̃^a(x) = 0.    (A4)

Clearly,

    η(a) = ∫ φ_1(x) e^{−2H(x)/δ(a)} dm(x) / ∫ φ_2(x) e^{−2H(x)/δ(a)} dm(x),    (A5)

and because we can bound the numerator and the denominator,

    ∫ φ_1(x) e^{−2H(x)/δ(a)} dm(x) ≤ e^{−4ε/δ(a)},    ∫ φ_2(x) e^{−2H(x)/δ(a)} dm(x) ≥ m_ε e^{−2ε/δ(a)},    (A6)

with m_ε ≡ m({x : H(x) ≤ ε}), we have

    η(a) ≤ e^{−4ε/δ(a)} / (m_ε e^{−2ε/δ(a)}) = (1/m_ε) e^{−2ε/δ(a)} → 0    (A7)

when δ(a) → 0.

Figure 4: Classical diffusion: H(X) = H_2(X), X_0 = 0, a_0 = 0.5.

Consider the equation

    L_a ψ_a = φ_a.    (A8)

Because of (A4), Equation (A8) has a solution. Expanding this equation, we get

    ½ δ(a) ω(x) ψ_a″ − [ω(x) H′ − ½ δ(a) ω′] ψ_a′ = φ_a(x),    (A9)

which is solvable for ψ_a′, and therefore

    ψ_a′(x) = (e^{2H(x)/δ(a)} / ω(x)) ∫_{−R}^{x} (2/δ(a)) φ_a(u) e^{−2H(u)/δ(a)} du,    (A10)

and

    ψ_a″(x) = (2/δ(a)) φ_a(x)/ω(x)
              + (2/δ(a)) ((ω(x) H′(x) − ½ δ(a) ω′(x)) / ω²(x)) e^{2H(x)/δ(a)} ∫_{−R}^{x} (2/δ(a)) φ_a(u) e^{−2H(u)/δ(a)} du.    (A11)

Figure 5: Classical diffusion: H(X) = H_2(X), X_0 = 0, a_0 = 50.

Differentiating once again, we get to the leading order as δ(a) → 0:

    ψ_a‴(x) = (2/δ(a))³ ((H′(x))² / ω(x)) e^{2H(x)/δ(a)} ∫_{−R}^{x} φ_a(u) e^{−2H(u)/δ(a)} du + O((1/δ²(a)) e^{2H(x)/δ(a)}).    (A12)

Let

    ψ_a(x) = ∫_{−R}^{x} ψ_a′(u) du.

Now, with the help of Equation (6), we construct a Taylor approximation of ψ_a(X_{n+1}):

    ψ_a(X_{n+1}) = ψ_a(X_n)
        − [(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) a − √(a δ(a) ω(X_n)) W_n] ψ_a′(X_n)
        + ½ [(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) a − √(a δ(a) ω(X_n)) W_n]² ψ_a″(X_n)
        − (1/6) [(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) a − √(a δ(a) ω(X_n)) W_n]³ ψ_a‴(X̃_n),    (A13)

where X̃_n is a point in the interval between X_n and X_{n+1}. Taking the expectation of both sides of the above equality, and using the stationarity of X_n and the equalities E W_n = E W_n³ = 0, E W_n² = 1, we have:

    0 = −E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) ψ_a′(X_n)] a
        + ½ E[((ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) a − √(a δ(a) ω(X_n)) W_n)² ψ_a″(X_n)]
        − (1/6) E[((ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) a − √(a δ(a) ω(X_n)) W_n)³ ψ_a‴(X̃_n)]
      = −a E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) ψ_a′(X_n)] + ½ a δ(a) E_{γ^a}[ω(X_n) ψ_a″(X_n)]
        + ½ a² E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n))² ψ_a″(X_n)]
        − ½ a² δ(a) E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) ω(X_n) ψ_a‴(X̃_n)] + O(a³).    (A14)

Figure 6: Modified diffusion: H(X) = H_2(X), X_0 = 0, a_0 = 0.5.

Figure 7: Classical diffusion: H(X_1, X_2) = H_3(X_1, X_2), X_0 = (−5, 5)^⊤, a_0 = 0.9.

Figure 8: Classical diffusion: H(X_1, X_2) = H_3(X_1, X_2), X_0 = (−5, 5)^⊤, a_0 = 20.

Figure 9: Modified diffusion: H(X_1, X_2) = H_3(X_1, X_2), X_0 = (−5, 5)^⊤, a_0 = 0.5.

Dividing by a, from Equations (A9) and (A14) we obtain:

    −E_{γ^a}[φ_a(X_n)] = (a/2) E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n))² ψ_a″(X_n)]
        − (a/2) δ(a) E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) ω(X_n) ψ_a‴(X̃_n)] + O(a²).    (A15)

Recall that

    E_{γ^a} φ_a(X_n) = ∫ φ_1(x) γ^a(dx) − η(a) ∫ φ_2(x) γ^a(dx).    (A16)

Since φ_2(x) is a bounded function and η(a) → 0, the entire second term on the right hand side vanishes as a → 0. Also

    ∫ φ_1(x) γ^a(dx) ≥ γ^a({H(x) > 3ε}) > 0,    (A17)

so in order to prove the theorem, we need to show that the right hand side of Equation (A15) tends to 0 as a → 0. From Equation (A11) we have, for the first term in Equation (A15) as a → 0,

    a E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n))² ψ_a″(X_n)] = O((a/δ²(a)) e^{C/δ(a)}) E_{γ^a}[ω(X_n) (H′(X_n))³].    (A18)

Analogously, using the representation in Equation (A12), the second term in Equation (A15) can be written as

    a δ(a) E_{γ^a}[(ω(X_n) H′(X_n) − ½ δ(a) ω′(X_n)) ω(X_n) ψ_a‴(X̃_n)]
        = O((a/δ²(a)) e^{C/δ(a)}) E_{γ^a}[ω²(X_n) H′(X_n) (H′(X̃_n))² / ω(X̃_n)]    (A19)

as a → 0. Expressions (A18)–(A19) go to zero if

    ln a − 2 ln δ(a) + C/δ(a) → −∞,

which is satisfied if δ(a) → 0 under

    δ(a) ≥ −C_0 / ln a    (A20)

with a constant C_0 > C. Finally, for arbitrary ε > 0, γ^a({H(x) > 3ε}) → 0 as a → 0. The theorem is proved.

References

[1] T.-S. Chiang, C.-R. Hwang, and S.-J. Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25:737–753, 1987.

[2] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

[3] S. Geman and C.-R. Hwang. Diffusion for global optimization. SIAM Journal on Control and Optimization, 24:1031–1043, 1986.

[4] D. Geman and G. Reynolds. Constrained restoration and the recovery of discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(3):367–383, 1992.

[5] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

[6] T. M. Liggett. Interacting Particle Systems. Springer, 2005.

[7] L. Ljung, G. Pflug, and H. Walk. Stochastic Approximation and Optimization of Random Systems. Birkhäuser Verlag, Basel, Switzerland, 1992.

[8] G. Winkler. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods: A Mathematical Introduction. Springer, New York, 2003.

