Convergence and first hitting time of simulated annealing algorithms for continuous global optimization

M. Locatelli
Dipartimento di Informatica, Università di Torino
Corso Svizzera 185, 10149 Torino, Italy
e-mail: [email protected]

Abstract. In this paper simulated annealing algorithms for continuous global optimization are considered. Under the simplifying assumption of known optimal value, the convergence of the algorithms and an upper bound for the expected first hitting time, i.e. the expected number of iterations before reaching the global optimum value within accuracy $\varepsilon$, are established. The obtained results are compared with those for the ideal algorithm PAS (Pure Adaptive Search) and for the simple PRS (Pure Random Search) algorithm.

KEYWORDS: global optimization, simulated annealing, convergence, first hitting time

1 Introduction

The simulated annealing approach was inspired by a physical phenomenon. If we reduce the temperature of a liquid, the thermal mobility of its molecules is lost. If the decrease is slow enough, a pure crystal is formed, corresponding to a state of minimum energy. If the decrease is too fast, a polycrystalline or an amorphous state with higher energy is reached. In [21] a Monte Carlo method was proposed to simulate the physical process. In [6] and [17] the analogy between the reduction of the energy in the physical system and the reduction of the objective function in an optimization problem led to the definition of simulated annealing (SA) algorithms for combinatorial optimization problems. In this paper we consider the application of SA algorithms to continuous global optimization problems, i.e. problems of the form
$$f^* = \min_{x \in X} f(x),$$
where the feasible region $X \subseteq \mathbb{R}^n$ is a continuous domain and the objective function $f$ is a continuous function. There is a wide literature on practical and theoretical results about SA algorithms for continuous global optimization. From the practical point of view we cite [1], [3], [5], [7], [8], [10], [12]-[15], [16], [22], [24], [25], [26]. In particular, the most thoroughly investigated SA software is the ASA code introduced and developed in [12]-[15] and available at the web site http://www.ingber.com/#ASA-REPRINTS. From the theoretical point of view convergence studies have been presented in [2], [8], [9], [11], [18]-[20]. A SA algorithm for the solution of continuous global optimization problems can be described as follows.

SA algorithm

Step 0 Let $x_0 \in X$, $t_0 \geq 0$ and $k := 0$.

Step 1 Sample a point $y_{k+1}$ from a given distribution $D(\cdot\,; x_k)$, and evaluate $f$ at $y_{k+1}$.

Step 2 Find a value $A = A(x_k, y_{k+1}, t_k) \in [0,1]$, where $t_k$ is a parameter called temperature; sample a value $u$ from the uniform distribution over $[0,1]$; if $u \leq A$ move to $y_{k+1}$, i.e. set $x_{k+1} := y_{k+1}$, otherwise stay in $x_k$, i.e. set $x_{k+1} := x_k$.

Step 3 Let $t_{k+1} = U(x_{k+1}) \geq 0$ be the new value of the temperature.

Step 4 Check a stopping criterion and, if it fails, set $k := k + 1$ and go back to Step 1.

One important assumption which will be made throughout this paper is that the optimal value $f^*$ is known, but not its position. This assumption is fulfilled in some special cases. For instance, a nonlinear system
$$r_i(x) = 0, \qquad i = 1, \ldots, \ell,$$

can be solved by globally minimizing the nonnegative objective function
$$f(x) = \sum_{i=1}^{\ell} [r_i(x)]^2.$$

Any solution $x^*$ of the nonlinear system satisfies $f(x^*) = 0$, so that for this problem $f^* = 0$. When $f^*$ is not known it can be replaced by an estimate. A way to do that, and to ensure convergence of the algorithm even without the knowledge of $f^*$, will be discussed briefly after the description of the acceptance function $A$, the next candidate distribution $D$, and the so-called cooling schedule $U$ employed in this paper. These three functions must be specified in order to define a simulated annealing algorithm. To completely specify the algorithm we should also describe a stopping criterion, but since we are interested here in the convergence and the expected time to reach the global optimum value within accuracy $\varepsilon$, this issue will not be discussed. The optimal value will be employed both in the definition of the cooling schedule $U$ and in the definition of the distribution $D$ of the next candidate point.
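As a small illustration of this special case, the following sketch builds such a merit function for a hypothetical two-equation system (the system itself is ours, chosen only for the example); its optimal value $f^* = 0$ is known in advance even though the root is not:

```python
import math

# Hypothetical 2x2 nonlinear system, chosen only for illustration:
#   r1(x) = x1^2 + x2 - 1 = 0,   r2(x) = x1 - x2 = 0.
def residuals(x):
    x1, x2 = x
    return [x1 ** 2 + x2 - 1.0, x1 - x2]

def f(x):
    """Nonnegative merit function f(x) = sum_i r_i(x)^2; every root of
    the system is a global minimizer, so f* = 0 is known in advance."""
    return sum(r * r for r in residuals(x))

f_star = 0.0  # known optimal value

# One root of this particular system is x1 = x2 = (sqrt(5) - 1) / 2.
root = (math.sqrt(5.0) - 1.0) / 2.0
assert abs(f((root, root)) - f_star) < 1e-12
```

The point is that $f^*$ is available by construction, while the location of the minimizer is exactly what the global optimization algorithm must find.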

Acceptance function The Metropolis function
$$A = A(x_k, y_{k+1}, t_k) = \min\left\{1,\; \exp\left(\frac{f(x_k) - f(y_{k+1})}{t_k}\right)\right\} \qquad (1)$$

will be employed. The rule is such that we move with probability 1 to a better point $y_{k+1}$, i.e. a point with lower function value than $x_k$, but we may also accept a move to a worse point (so-called backtracking) with a probability which is an increasing function of the temperature $t_k$ and decreases to 0 as $t_k$ approaches 0 (by definition, the probability is equal to 0 for $t_k = 0$).
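A direct transcription of the Metropolis rule (1) as a Python sketch (function and parameter names are ours):

```python
import math
import random

def metropolis_accept(f_x, f_y, t, rng=random.random):
    """Metropolis acceptance rule (1): accept the candidate y with
    probability min(1, exp((f(x) - f(y)) / t))."""
    if f_y <= f_x:
        return True       # improving (or equal) moves are always accepted
    if t == 0.0:
        return False      # zero temperature: backtracking is impossible
    # worse move: accepted with probability exp((f(x) - f(y)) / t) < 1
    return rng() <= math.exp((f_x - f_y) / t)
```

Note that for a fixed worse candidate the acceptance probability grows with $t$, which is exactly the backtracking behavior described above.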

Cooling schedule The function $U$ is defined as follows:
$$t_k = U(x_k) = \alpha_1\,[f(x_k) - f^*]^{g_1}, \qquad (2)$$

where $\alpha_1 > 0$ and $g_1 \geq 0$ are constants. Basically, as the distance from the optimal value increases, the temperature and, consequently, the acceptance probability also increase. Thus, it is easier to accept worse points when we are far from the optimal value.

Next candidate distribution The distribution $D$ at iteration $k$ is the uniform distribution over $X \cap S(x_k, R_{x_k})$, i.e. over the intersection of the feasible set $X$ with a sphere $S(x_k, R_{x_k})$ with center in the current point $x_k$ and radius defined as follows:
$$R_{x_k} = \alpha_2\,[f(x_k) - f^*]^{g_2}, \qquad (3)$$

for some $\alpha_2 > 0$ and $g_2 \geq 0$. Again, if we are far from the optimal value we are able to perform larger steps. The choice of relating both the temperature and the step size to the distance of the function value in the current point $x_k$ from the global optimum value has been experimentally tested in [26] with quite encouraging results. As already remarked, it is possible to construct convergent algorithms even in the case that the optimal value $f^*$ is not known. It is necessary to substitute $f^*$ in (2) with the estimate $f_k - c_k$, and in (3) with another estimate $f_k - r_k$, where $f_k = \min_{i=0,\ldots,k} f(x_i)$ is the record value at iteration $k$, while $\{c_k\}$ and $\{r_k\}$ are nonincreasing deterministic sequences converging to 0. In order to ensure convergence it is required that both $\{c_k\}$ and $\{r_k\}$ converge to 0 "slowly enough" (see [20] for a proof in the case $g_2 = 0$). We also underline that, even if, for the sake of simplicity, the distribution $D$ has been chosen to be the uniform one over $X \cap S(x_k, R_{x_k})$, more general distributions with the same support set could be chosen without compromising the convergence of the algorithm. We finally note that, in view of the definitions of $A$, $U$ and $D$, the Markovian property holds, i.e. for any $k \in \mathbb{N}$ and for any $A, B, C_i \subseteq X$, $i < k-1$, it holds that
$$P[x_k \in A \mid x_{k-1} \in B,\; x_i \in C_i,\; i < k-1] = P[x_k \in A \mid x_{k-1} \in B]. \qquad (4)$$

Moreover, all the transition probabilities are stationary, i.e. for any $A, B \subseteq X$ and any $k, j \in \mathbb{N}$
$$P[x_{k+j} \in A \mid x_j \in B] = P[x_k \in A \mid x_0 \in B]. \qquad (5)$$
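Putting the three ingredients together, one iteration of the scheme above can be sketched in Python. The concrete parameter values `a1, g1, a2, g2` below are illustrative assumptions only, not prescriptions from the analysis (which, as shown later, requires $g_1 > 3$ and relates $g_2$ to the growth of $f$ near the optimum):

```python
import math
import random

def uniform_in_ball(center, radius):
    """Uniform point in the n-dimensional ball S(center, radius):
    a random direction (normalized Gaussian vector) scaled by
    radius * U^(1/n)."""
    n = len(center)
    d = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in d))
    r = radius * random.random() ** (1.0 / n)
    return [c + r * v / norm for c, v in zip(center, d)]

def sa_step(x, f, f_star, a1=1.0, g1=4.0, a2=2.0, g2=0.5):
    """One SA iteration on X = [0,1]^n with temperature (2) and radius (3)
    both tied to the gap f(x) - f*. Parameter values are illustrative."""
    gap = f(x) - f_star
    t = a1 * gap ** g1            # cooling schedule (2)
    radius = a2 * gap ** g2       # step radius (3)
    # Uniform candidate over S(x, R) intersected with [0,1]^n, by rejection.
    while True:
        y = uniform_in_ball(x, radius)
        if all(0.0 <= yi <= 1.0 for yi in y):
            break
    # Metropolis acceptance (1); at t = 0 backtracking is impossible.
    if f(y) <= f(x) or (t > 0.0 and random.random() <= math.exp((f(x) - f(y)) / t)):
        return y
    return x
```

Rejection sampling is one simple way to realize the uniform distribution over $X \cap S(x_k, R_{x_k})$ when $X$ is a box; any sampler with the same support would do, consistently with the remark above on more general distributions.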

The paper is organized as follows. In Section 2 an assumption will be introduced, and both the convergence and an upper bound for the expected first hitting time, i.e. the expected number of iterations to reach the global optimum value within accuracy $\varepsilon$, will be derived. The assumption depends on two functions which will be defined in Section 3. Finally, in Section 4 the results obtained for the SA algorithm will be compared to those of two other algorithms, an ideal one, the PAS (Pure Adaptive Search) algorithm, and a simple one, the PRS (Pure Random Search) algorithm. Some technical proofs will be given in the Appendix.

2 Convergence and expected time to reach accuracy $\varepsilon$

In this section we introduce a single assumption and see how it can be used in order to prove the convergence of the algorithm. The assumption introduces some bounds on the probability of moving into or staying inside the set
$$B_\varepsilon = \{x \in X : f(x) \leq f^* + \varepsilon\},$$
i.e. the set of points within accuracy $\varepsilon > 0$ from the global optimum value.

Assumption 1 There exist two functions
$$t: \mathbb{R}_+ \longrightarrow \mathbb{R}_+, \qquad N: \mathbb{R}_+ \longrightarrow \mathbb{N},$$
and a positive constant $C$ such that
$$\forall\, x \in X : \quad P[x_{k+N(\varepsilon)} \in B_\varepsilon \mid x_k = x] \geq C\varepsilon, \qquad (6)$$
$$\forall\, i \in \mathbb{N} : \quad P[x_{k+i} \in B_\varepsilon \mid x_k \in B_\varepsilon] \geq [1 - t(\varepsilon)]^i, \qquad (7)$$
and
$$\frac{t(\varepsilon)N(\varepsilon)}{\varepsilon} \to 0 \quad \text{as } \varepsilon \to 0. \qquad (8)$$

While (6) gives a lower bound for the probability of moving into the set $B_\varepsilon$ in $N(\varepsilon)$ iterations from any point in $X$, from (7) we can derive the upper bound $1 - [1 - t(\varepsilon)]^{N(\varepsilon)}$ for the probability of moving out of the set $B_\varepsilon$ after $N(\varepsilon)$ iterations (see Figure 1).

Figure 1: Bounds for the probability of entering (at least $C\varepsilon$) and exiting (at most $1 - [1 - t(\varepsilon)]^{N(\varepsilon)}$) the set $B_\varepsilon$ in $N(\varepsilon)$ iterations.

Finally, (8)

guarantees that it is "easier" to get into $B_\varepsilon$ than to move out of it in $N(\varepsilon)$ iterations, i.e. that the probability of moving into it decreases to 0 more slowly than the probability of moving out of it as $\varepsilon \to 0$. Indeed, (8) implies that $t(\varepsilon)N(\varepsilon) \to 0$ as $\varepsilon \to 0$, so that
$$1 - [1 - t(\varepsilon)]^{N(\varepsilon)} \sim 1 - \exp\{-t(\varepsilon)N(\varepsilon)\} \sim t(\varepsilon)N(\varepsilon). \qquad (9)$$

In other words, over a number $N(\varepsilon)$ of iterations the improving component of SA (i.e. the tendency to sample points with better function values) is stronger than its backtracking component (i.e. the acceptance of worse points). It is not clear yet how the functions $t$ and $N$ should be chosen in order to guarantee that Assumption 1 holds. This will be made clear in the next section, where it will be shown that the choice of these functions is strictly related to the objective function $f$ and to the parameters $\alpha_1$, $\alpha_2$, $g_1$ and $g_2$ appearing in (2) and (3). At the moment we only show that this assumption guarantees the convergence in probability of the algorithm to the global optimum, as established in Theorem 1, and allows us to give an upper bound for the expected number of iterations before reaching the level set $B_\varepsilon$, as established in Theorem 2.
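The asymptotics in (9) are easy to check numerically. The sketch below verifies both the exact inequality $1 - (1-t)^N \leq tN$ (Bernoulli's inequality) and the asymptotic agreement for small $t(\varepsilon)N(\varepsilon)$; the specific $(t, N)$ pairs are arbitrary illustrative values:

```python
# Numeric sanity check of (9): as t*N -> 0,
#   1 - (1 - t)^N  ~  1 - exp(-t*N)  ~  t*N,
# while 1 - (1 - t)^N <= t*N holds exactly (Bernoulli's inequality).
for t, N in [(1e-4, 10), (1e-5, 100), (1e-6, 1000)]:
    exit_prob = 1.0 - (1.0 - t) ** N
    assert exit_prob <= t * N
    # relative error shrinks with t*N (roughly (N - 1) * t / 2)
    assert abs(exit_prob - t * N) / (t * N) < 1e-2
```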

Theorem 1 Under Assumption 1, it holds that
$$\forall\, \varepsilon > 0 : \quad P[x_k \in B_\varepsilon] \to 1 \quad \text{as } k \to \infty.$$

Proof. The proof follows the ideas of Theorem 2 in [20]. We want to prove that for any $\delta > 0$ there exists a positive integer $K$ such that
$$\forall\, k \geq K : \quad P[x_k \in B_\varepsilon] \geq 1 - \delta. \qquad (10)$$

First we note that for any $\varepsilon' \leq \varepsilon$ it holds that
$$P[x_k \in B_\varepsilon] \geq P[x_k \in B_{\varepsilon'}]. \qquad (11)$$

Moreover, for any $k \in [iN(\varepsilon'), (i+1)N(\varepsilon'))$, it holds that
$$P[x_k \in B_{\varepsilon'}] \geq P[x_k \in B_{\varepsilon'} \mid x_{iN(\varepsilon')} \in B_{\varepsilon'}]\; P[x_{iN(\varepsilon')} \in B_{\varepsilon'}]. \qquad (12)$$

It follows from (7) that
$$P[x_k \in B_{\varepsilon'} \mid x_{iN(\varepsilon')} \in B_{\varepsilon'}] \geq [1 - t(\varepsilon')]^{k - iN(\varepsilon')} \geq [1 - t(\varepsilon')]^{N(\varepsilon')}, \qquad (13)$$
where the last quantity is, in view of (8), asymptotic to $\exp\{-t(\varepsilon')N(\varepsilon')\}$ and thus converges to 1 as $\varepsilon' \to 0$. Therefore, by combining (11), (12) and (13), it follows that for $\varepsilon'$ small enough
$$P[x_k \in B_\varepsilon] \geq P[x_{iN(\varepsilon')} \in B_{\varepsilon'}]\left(1 - \frac{\delta}{2}\right).$$

Then, in order to prove (10) it is enough to prove that there exist a positive integer $I$ and a small enough $\varepsilon'$ such that
$$\forall\, i \geq I : \quad P[x_{iN(\varepsilon')} \in B_{\varepsilon'}] \geq 1 - \frac{\delta}{2}.$$
Now let
$$a_i(\varepsilon') = P[x_{iN(\varepsilon')} \in B_{\varepsilon'}],$$
$$u(\varepsilon') = P[x_{(i+1)N(\varepsilon')} \notin B_{\varepsilon'} \mid x_{iN(\varepsilon')} \in B_{\varepsilon'}],$$
$$z(\varepsilon') = P[x_{(i+1)N(\varepsilon')} \in B_{\varepsilon'} \mid x_{iN(\varepsilon')} \notin B_{\varepsilon'}].$$

We note that, in view of the stationarity condition (5), the values $u(\varepsilon')$ and $z(\varepsilon')$ are independent of $i$. It holds that
$$a_{i+1}(\varepsilon') = [1 - u(\varepsilon')]\,a_i(\varepsilon') + z(\varepsilon')\,[1 - a_i(\varepsilon')]. \qquad (14)$$

Condition (6) gives the lower bound $\underline{z}(\varepsilon') = C\varepsilon'$ for $z(\varepsilon')$, while Condition (7) gives the upper bound
$$\bar{u}(\varepsilon') = 1 - [1 - t(\varepsilon')]^{N(\varepsilon')}$$
for $u(\varepsilon')$. We note that, in view of (8) and (9),
$$\frac{\bar{u}(\varepsilon')}{\underline{z}(\varepsilon')} \to 0 \quad \text{as } \varepsilon' \to 0. \qquad (15)$$
By substituting these bounds in (14), it follows that
$$a_{i+1}(\varepsilon') - a_i(\varepsilon') \geq \underline{z}(\varepsilon') - a_i(\varepsilon')\,[\underline{z}(\varepsilon') + \bar{u}(\varepsilon')]. \qquad (16)$$

We note that $a_{i+1}(\varepsilon') < a_i(\varepsilon')$ implies
$$a_i(\varepsilon') > \frac{\underline{z}(\varepsilon')}{\underline{z}(\varepsilon') + \bar{u}(\varepsilon')} =: v(\varepsilon'), \qquad (17)$$

i.e. the sequence $\{a_i(\varepsilon')\}$ can only decrease when it gets above the level $v(\varepsilon')$. In view of (15),
$$v(\varepsilon') \to 1 \quad \text{as } \varepsilon' \to 0. \qquad (18)$$
Moreover, from (16) and $a_i(\varepsilon') \in [0,1]$,
$$a_i(\varepsilon') - a_{i+1}(\varepsilon') \leq [a_i(\varepsilon') - 1]\,\underline{z}(\varepsilon') + a_i(\varepsilon')\,\bar{u}(\varepsilon') \leq \bar{u}(\varepsilon'), \qquad (19)$$

i.e. the maximum allowed decrease for the sequence $\{a_i(\varepsilon')\}$ is bounded from above by $\bar{u}(\varepsilon')$. In view of (8) and (9) it holds that
$$\bar{u}(\varepsilon') \to 0 \quad \text{as } \varepsilon' \to 0. \qquad (20)$$

Now we distinguish two cases. The first one is the case in which the sequence $\{a_i(\varepsilon')\}$ never decreases. Then it must converge to a limit $\bar{a}(\varepsilon')$ and, taking the limit for $i \to \infty$ of both sides in (16), it holds that
$$\bar{a}(\varepsilon') \geq \frac{\underline{z}(\varepsilon')}{\underline{z}(\varepsilon') + \bar{u}(\varepsilon')} = v(\varepsilon').$$
Therefore, for $i$ big enough,
$$a_i(\varepsilon') \geq v(\varepsilon') - \bar{u}(\varepsilon').$$
The second case occurs when there exists a $j \in \mathbb{N}$ such that $a_{j+1}(\varepsilon') < a_j(\varepsilon')$. It follows from (17) and (19) that in such a case, for any $i \geq j$,
$$a_i(\varepsilon') \geq \frac{\underline{z}(\varepsilon')}{\underline{z}(\varepsilon') + \bar{u}(\varepsilon')} - \bar{u}(\varepsilon') = v(\varepsilon') - \bar{u}(\varepsilon').$$
Therefore, in both cases there exists a positive integer $I$ such that
$$\forall\, i \geq I : \quad a_i(\varepsilon') \geq v(\varepsilon') - \bar{u}(\varepsilon').$$
It follows from (18) and (20) that
$$\lim_{\varepsilon' \to 0}\, [v(\varepsilon') - \bar{u}(\varepsilon')] = 1.$$
Then, for any small enough $\varepsilon'$ and any $i \geq I$ it holds that $a_i(\varepsilon') \geq 1 - \frac{\delta}{2}$, as we wanted to prove.

Even more important than the convergence result is to derive an upper bound for the expected number of iterations to reach the level set $B_\varepsilon$. Let us define the sequence $\{\xi_i\}_{i=0,1,\ldots}$ as follows:
$$\xi_i = x_{iN(\varepsilon)}, \qquad i = 0, 1, \ldots,$$
i.e. $\{\xi_i\}$ is a subsequence of the sequence $\{x_k\}$. For any nonnegative integer $t$ let
$$T^\varepsilon_\xi = t \iff \xi_t \in B_\varepsilon,\; \xi_i \notin B_\varepsilon\; \forall\, i = 0, \ldots, t-1,$$
and
$$T^\varepsilon_x = t \iff x_t \in B_\varepsilon,\; x_i \notin B_\varepsilon\; \forall\, i = 0, \ldots, t-1.$$
The random variables $T^\varepsilon_\xi$ and $T^\varepsilon_x$ denote respectively the first time the sequence $\{\xi_i\}$ enters the level set $B_\varepsilon$, and the first time the sequence $\{x_i\}$ enters the same set. Obviously, for any nonnegative integer $t$ it holds that
$$T^\varepsilon_\xi = t \implies T^\varepsilon_x \leq tN(\varepsilon),$$
so that
$$T^\varepsilon_x \leq N(\varepsilon)\, T^\varepsilon_\xi,$$
and, in particular,
$$E[T^\varepsilon_x] \leq N(\varepsilon)\, E[T^\varepsilon_\xi]. \qquad (21)$$

The next theorem gives an upper bound for $E[T^\varepsilon_x]$.

Theorem 2 It holds that
$$E[T^\varepsilon_x] \leq \frac{N(\varepsilon)}{C\varepsilon},$$
where $C$ is the same constant as in (6).

Proof. By definition,
$$E[T^\varepsilon_\xi] = \sum_{t=1}^{\infty} t\, P[\xi_t \in B_\varepsilon,\; \xi_i \notin B_\varepsilon\; \forall\, i = 0, \ldots, t-1],$$
which can be rewritten as
$$\sum_{t=1}^{\infty} t\, P[\xi_t \in B_\varepsilon \mid \xi_i \notin B_\varepsilon\; \forall\, i = 0, \ldots, t-1] \prod_{j=0}^{t-1} P[\xi_j \notin B_\varepsilon \mid \xi_i \notin B_\varepsilon\; \forall\, i = 0, \ldots, j-1].$$
In view of (6) this can be bounded from above by the expected number of trials to obtain a success by repeating a Bernoulli experiment with parameter $C\varepsilon$, which is equal to $1/C\varepsilon$. Now the result of the theorem immediately follows from (21).

3 Conditions for satisfying Assumption 1

As already remarked in the previous section, while Assumption 1 guarantees the convergence of the algorithm and allows us to establish an upper bound on the expected number of iterations to reach the set $B_\varepsilon$, it is not clear how we should choose the functions $t$ and $N$ so that the assumption is satisfied. In this section we will introduce some further assumptions which will enable us to define the functions $t$ and $N$, and to relate their definition to some properties of the objective function $f$, in particular in a neighborhood of the global optimum, and to the parameters $\alpha_1$, $\alpha_2$, $g_1$ and $g_2$ appearing in (2) and (3). In order to better understand the assumptions and some intermediate results, these will be illustrated through the following very simple example.

Example 1 Let the objective function be $f(x) = \|x - 0.5e\|_2^2$, and let the feasible set be the unit hypercube $X = [0,1]^n$. The global optimum for this problem is obviously the point $0.5e$, where $e$ is the vector $(1, \ldots, 1)$.
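For reference, Example 1 in code (pure Python; the dimension is chosen arbitrarily for the sketch):

```python
n = 3  # illustrative dimension

def f(x):
    """Example 1: f(x) = ||x - 0.5 e||_2^2 on the unit hypercube [0,1]^n."""
    return sum((xi - 0.5) ** 2 for xi in x)

x_star = [0.5] * n   # the unique global minimizer 0.5 e
f_star = 0.0         # known optimal value

# f(x) - f* equals ||x - x*||^2 exactly, so the growth condition of
# Observation 1 below holds globally here with exponent p = 2.
x = [0.2, 0.8, 0.5]
gap = f(x) - f_star
assert abs(gap - sum((a - b) ** 2 for a, b in zip(x, x_star))) < 1e-15
assert f(x_star) == f_star
```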

We note that, though very simple, this example is representative of the class of strictly convex quadratic problems and, under some regularity assumptions, many objective functions can be approximated by functions in this class in a neighborhood of the global optimum. This is important because, as we will see in what follows, appropriate choices of the parameters are related to the behavior of the objective function in a neighborhood of the global optimum. The assumptions will not be introduced all at once, but where they are needed. The first assumption introduces some restrictions on the structure of the problems to be solved.

Assumption 2

1. The feasible set $X$ is the unit hypercube $[0,1]^n$, and the objective function $f$ is continuous.

2. The set $X^* = \{x \in X : f(x) = f^*\}$ is a singleton $\{x^*\}$.

We underline that some relaxations of these assumptions are possible. For instance, the feasible region $X$ could be any convex compact body, while Assumption 2.2 can be relaxed into the requirement that the set $X^*$ has finite cardinality. In order to simplify the proofs, these relaxations are not considered here. The second assumption requires that there exists a neighborhood of the global optimum such that each point in the current level set is reachable in a single iteration, i.e. the support of the distribution $D$ contains the whole current level set (see also Figure 2).

Assumption 3 There exists $\delta_1 > 0$ such that
$$x_k \in B_{\delta_1} \implies S(x_k, R_{x_k}) \supseteq B_{f(x_k) - f^*}.$$

Figure 2: The support of the distribution $D$ at point $x_k$ contains the level set $B_{f(x_k) - f^*}$.

The next observation shows how Assumption 3 can be satisfied by choosing $\alpha_2$ and $g_2$ in (3) according to the variability of the objective function $f$ in a neighborhood of the global optimum.

Observation 1 Let $p, \lambda, \delta_1 > 0$ be such that
$$\forall\, x \in B_{\delta_1} : \quad f(x) - f^* \geq \lambda\, \|x - x^*\|^p. \qquad (22)$$
Then, by choosing
$$\alpha_2 \geq \frac{2}{\lambda^{g_2}} \quad \text{and} \quad g_2 \leq \frac{1}{p}, \qquad (23)$$
Assumption 3 holds.

Proof. Let $x_k \in B_{\delta_1}$ and denote by $\delta_k = f(x_k) - f^* \leq \delta_1$ its distance from the optimum value. We denote by $\ell(\delta_k)$ the maximum distance between the optimal point $x^*$ and the level set $B_{\delta_k}$, and by $\bar{x}_k$ a point where the maximum is attained, i.e.
$$\ell(\delta_k) = \max_{x \in B_{\delta_k}} \|x - x^*\|, \qquad \bar{x}_k \in \arg\max_{x \in B_{\delta_k}} \|x - x^*\|.$$
We note that the maximum exists because from the continuity of $f$ and the compactness of $X$ it follows that $B_{\delta_k}$ is also a compact set. The value $\delta_1$ is chosen small enough so that for any $\delta_k \leq \delta_1$ it holds that
$$\ell(\delta_k) \leq 1. \qquad (24)$$
From the triangle inequality it follows that
$$\forall\, x, y \in B_{\delta_k} : \quad \|x - y\| \leq \|x - x^*\| + \|y - x^*\| \leq 2\ell(\delta_k).$$
Therefore,
$$R_{x_k} \geq 2\ell(\delta_k) \implies S(x_k, R_{x_k}) \supseteq B_{\delta_k}, \qquad (25)$$
and Assumption 3 is satisfied. By definition of $R_{x_k}$, and noting that $f(\bar{x}_k) \leq f(x_k)$,
$$R_{x_k} = \alpha_2\,[f(x_k) - f^*]^{g_2} \geq \alpha_2\,[f(\bar{x}_k) - f^*]^{g_2},$$
and, in view of (22), it holds that
$$R_{x_k} \geq \alpha_2\, \lambda^{g_2}\, \|\bar{x}_k - x^*\|^{g_2 p} = \alpha_2\, \lambda^{g_2}\, [\ell(\delta_k)]^{g_2 p}.$$
Finally, from (23) and (24) it follows that $R_{x_k} \geq 2\ell(\delta_k)$, and from (25) the observation follows.

Now let us derive the values $\alpha_2$ and $g_2$ for our example.

Example 1 (continued) We immediately note that for this example $p = 2$ and $\lambda = 1$. Consequently, in view of Observation 1, we should choose
$$g_2 \leq \frac{1}{2} \quad \text{and} \quad \alpha_2 \geq 2.$$

Assumption 3 allows us to derive a lower bound for the probability of moving, in a single iteration, from the set $B_{2\delta} \setminus B_\delta$ into the set $B_\delta$ when $\delta$ is small enough. The lower bound, derived in the following lemma, is given by the function $s: \mathbb{R}_+ \to \mathbb{R}_+$ defined as follows:
$$s(\delta) = \frac{m(B_\delta)}{V(\alpha_2\,[2\delta]^{g_2})}, \qquad (26)$$
where $m$ denotes the Lebesgue measure and $V(r)$ the volume of an $n$-dimensional sphere with radius $r$.

Lemma 1 Let Assumption 3 hold. Then for any $\delta \in (0, \delta_1/2]$
$$P[x_{k+1} \in B_\delta \mid x_k \in B_{2\delta} \setminus B_\delta] \geq s(\delta),$$
where $s(\delta)$ is defined in (26).

Proof. Since $2\delta \leq \delta_1$, it follows from Assumption 3 and $x_k \in B_{2\delta} \setminus B_\delta$ that
$$S(x_k, R_{x_k}) \supseteq B_{f(x_k) - f^*} \supseteq B_\delta.$$
Then, in view of the definition of the next candidate point distribution,
$$P[x_{k+1} \in B_\delta \mid x_k \in B_{2\delta} \setminus B_\delta] = \frac{m(B_\delta)}{m(S(x_k, R_{x_k}) \cap X)}.$$
Finally, from $f(x_k) - f^* \leq 2\delta$ it follows that $R_{x_k} \leq \alpha_2\,[2\delta]^{g_2}$ and, consequently,
$$P[x_{k+1} \in B_\delta \mid x_k \in B_{2\delta} \setminus B_\delta] \geq \frac{m(B_\delta)}{V(\alpha_2\,[2\delta]^{g_2})} = s(\delta),$$
as we wanted to prove.

Now we compute the lower bound for our example.

Example 1 (continued) It is easily seen that
$$m(B_\delta) = V(\delta^{1/2}) = K_n\, \delta^{n/2},$$
where $K_n$ is a constant depending on the dimension. Therefore,
$$s(\delta) = \frac{K_n\, \delta^{n/2}}{K_n\, (\alpha_2\, 2^{g_2})^n\, \delta^{n g_2}} = O\left(\delta^{n\left(\frac{1}{2} - g_2\right)}\right). \qquad (27)$$

The next step is to derive an upper bound for the probability of moving out, in a single iteration, from the set $B_\delta$. The upper bound will yield the function $t$ appearing in (7). Before deriving it we need to introduce an observation.

Observation 2 It holds that
$$\forall\, x \in [0,1]^n,\; \forall\, r \in [0,1] : \quad m(S(x,r) \cap [0,1]^n) \geq \gamma\, m(S(x,r)) = \gamma\, V(r),$$
where $\gamma = \frac{1}{2^n}$.

Proof. For a fixed $r \in [0,1]$ the minimal value of $m(S(x,r) \cap [0,1]^n)$ is attained at a corner of the hypercube and is equal to $\frac{1}{2^n}\, V(r)$.

Now we are ready to derive the upper bound for the probability of moving out of the set $B_\delta$ in a single iteration. Given the following three functions,
$$t_1(\delta) = \exp\left\{-\frac{1}{\alpha_1\,(\delta/2)^{g_1 - 1}}\right\},$$
$$t_2(\delta) = \exp\left\{-\frac{1}{\alpha_1}\,\delta^{\frac{1 - g_1}{2}}\right\},$$
$$t_3(\delta) = \frac{m\left(B_{\delta + \delta^{(1+g_1)/2}} \setminus B_\delta\right)}{\gamma\, V\left(\alpha_2\,(\delta/2)^{g_2}\right)},$$
where $\gamma$ is the same constant as in Observation 2, the function $t$ is defined as follows:
$$t(\delta) = t_1(\delta) + t_2(\delta) + t_3(\delta).$$
The next lemma shows that the function $t$ defines the upper bound we are looking for.

Lemma 2 Let Assumption 2 hold. Then
$$\forall\, \delta > 0 : \quad P[x_{k+1} \notin B_\delta \mid x_k \in B_\delta] \leq t(\delta).$$

Proof. See the Appendix.

It is now easy to see that the function $t$ defined above satisfies (7). Indeed, it holds that
$$P[x_{k+i} \in B_\varepsilon \mid x_k \in B_\varepsilon] \geq \prod_{h=1}^{i} P[x_{k+h} \in B_\varepsilon \mid x_{k+j} \in B_\varepsilon,\; j \leq h-1],$$
where, in view of the Markovian property (4), each term in the product can be rewritten as
$$P[x_{k+h} \in B_\varepsilon \mid x_{k+h-1} \in B_\varepsilon].$$
Thus, in view of Lemma 2, each term in the product can be bounded from below by $1 - t(\varepsilon)$, and the whole product by $[1 - t(\varepsilon)]^i$, so that (7) is satisfied. Now we need to introduce the function $N$. To this aim we have to introduce a final assumption, which basically states that, in a single iteration and close enough to the global optimum, it is easier for the algorithm to move into $B_\delta$ (improving component of SA) than to move out of it (backtracking component of SA) as $\delta$ decreases to 0. As we will see, this assumption introduces a restriction on the choices of $g_1$ and $g_2$.

Assumption 4 It holds that
$$\frac{t(\delta)}{s(\delta)} \to 0 \quad \text{as } \delta \to 0.$$
We note that in particular it must hold that $t(\delta) \to 0$ as $\delta \to 0$. This immediately introduces a restriction on the choice of $g_1$. Indeed, from the definition of the functions $t_1$ and $t_2$ we see that, in order to have $t(\delta) \to 0$, it must hold that $g_1 > 1$. The function $t(\delta)$ will now be derived for our example.

Example 1 (continued) It holds that
$$m\left(B_{\delta + \delta^{(1+g_1)/2}} \setminus B_\delta\right) = V\left(\left[\delta + \delta^{(1+g_1)/2}\right]^{1/2}\right) - V\left(\delta^{1/2}\right).$$
Then
$$t_3(\delta) = \frac{K_n\left(\delta + \delta^{(1+g_1)/2}\right)^{n/2} - K_n\,\delta^{n/2}}{\gamma\, K_n\left(\frac{\alpha_2}{2^{g_2}}\right)^n \delta^{n g_2}} = O\left(\delta^{\frac{g_1 - 1}{2} + n\left(\frac{1}{2} - g_2\right)}\right).$$
For $g_1 > 1$ and any $g_2 \leq \frac{1}{2}$, $t_3(\delta)$ dominates the terms $t_1(\delta)$ and $t_2(\delta)$ as $\delta \to 0$, so that $t(\delta)$ is of the same order as $t_3(\delta)$, i.e.
$$t(\delta) = O\left(\delta^{\frac{g_1 - 1}{2} + n\left(\frac{1}{2} - g_2\right)}\right). \qquad (28)$$
In this example Assumption 4 turns out to be a restriction on the choice of the value of $g_1$ (but not of $g_2$). Indeed, by recalling (27),
$$\frac{t(\delta)}{s(\delta)} = O\left(\delta^{\frac{g_1 - 1}{2}}\right),$$
and we must have $g_1 > 1$.

In order to introduce the function $N$ we need some steps. The first step introduces a function $M: \mathbb{R}_+ \to \mathbb{N}$ with the following properties:
$$M(\delta)\,t(\delta) \to 0 \quad \text{and} \quad M(\delta)\,s(\delta) \to \infty \quad \text{as } \delta \to 0. \qquad (29)$$
Under Assumption 4 such a function always exists. Now we derive a possible function for our example.

Example 1 (continued) Let
$$M(\delta) = \left\lceil \frac{\sqrt{\log_2 \frac{1}{\delta}}}{\delta^{\left(\frac{1}{2} - g_2\right)n}} \right\rceil. \qquad (30)$$
It is easy to see that this function satisfies (29).

The following lemma proves that, for $\delta$ small enough, we can move from the set $B_{2\delta} \setminus B_\delta$ to the set $B_\delta$ in $M(\delta)$ iterations with probability at least $\frac{1}{2}$.

Lemma 3 There exists $\delta_2 \in (0, \delta_1/2]$ such that for all $\delta \in (0, \delta_2]$
$$P[x_{k+M(\delta)} \in B_\delta \mid x_k \in B_{2\delta} \setminus B_\delta] \geq \frac{1}{2}.$$

Proof. See the Appendix.

The previous lemma shows that, when we are in a neighborhood of the global optimum (the set $B_{\delta_2}$), we are able to halve the distance from $f^*$ with probability at least $\frac{1}{2}$ within a number of iterations which is related to the distance from $f^*$. Then, we are ready for the second step towards the definition of the function $N$. First we introduce the following functions for any integer $j \geq 0$:
$$Q(j, \delta_2) = \sum_{i=1}^{j} M\left(\frac{\delta_2}{2^i}\right) \;\text{ for } j \geq 1, \qquad Q(0, \delta_2) = 0.$$
Then, let $N_1: \mathbb{R}_+ \to \mathbb{N}$ be a function defined as follows:
$$N_1(\varepsilon) = Q(J, \delta_2), \qquad J = \left\lceil \log_2 \frac{\delta_2}{\varepsilon} \right\rceil. \qquad (31)$$

The following lemma proves that we can move from any point inside $B_{\delta_2}$ to the set $B_\varepsilon$ in $N_1(\varepsilon)$ iterations with probability at least $C_1\varepsilon$, for some positive constant $C_1$.

Lemma 4 There exists a constant $C_1 > 0$ such that
$$\forall\, x \in B_{\delta_2},\; \varepsilon > 0 : \quad P[x_{k+N_1(\varepsilon)} \in B_\varepsilon \mid x_k = x] \geq C_1\varepsilon.$$

Proof. We note that, also exploiting the Markovian property (4),
$$P[x_{k+N_1(\varepsilon)} \in B_\varepsilon \mid x_k = x] \geq \prod_{j=1}^{J} P\left[x_{k+Q(j,\delta_2)} \in B_{\delta_2/2^j} \,\Big|\, x_{k+Q(j-1,\delta_2)} \in B_{\delta_2/2^{j-1}}\right]$$
(see also Figure 3). Indeed, by the definition of $J$ in (31), it holds that
$$B_{\delta_2/2^J} \subseteq B_\varepsilon.$$

Figure 3: The set $B_{\delta_2}$ is reached with positive probability in $T$ iterations, while we can move from the set $B_{\delta_2/2^{j-1}}$ to the set $B_{\delta_2/2^j}$ in $M(\delta_2/2^j)$ iterations with probability $\frac{1}{2}$.

By noting that for each $j \in \{1, \ldots, J\}$
$$Q(j, \delta_2) - Q(j-1, \delta_2) = M\left(\frac{\delta_2}{2^j}\right),$$
it follows from Lemma 3 that each term in the product is bounded from below by $\frac{1}{2}$. Therefore,
$$P[x_{k+N_1(\varepsilon)} \in B_\varepsilon \mid x_k = x] \geq \left(\frac{1}{2}\right)^{\left\lceil \log_2 \frac{\delta_2}{\varepsilon} \right\rceil} \geq C_1\varepsilon,$$
where $C_1$ is a positive constant depending on $\delta_2$.

We are almost ready for the definition of the function $N$, but we need a further step. We cannot simply set $N \equiv N_1$ because Lemma 4 only guarantees that condition (6) is satisfied for $x \in B_{\delta_2}$, and not for any $x \in X$. In order to complete the definition of $N$ we only need to prove that the set $B_{\delta_2}$ is reachable with positive probability in a finite number of iterations. This result is stated in the following lemma.

Lemma 5 There exist a positive integer $T$ and a constant $\beta > 0$ such that
$$\forall\, x \in X : \quad P[x_{k+T} \in B_{\delta_2} \mid x_k = x] \geq \beta.$$

Proof. See the Appendix.

Now we show how Lemma 5 can be used to complete the definition of $N$. We define $N$ as follows:
$$N(\varepsilon) = T + N_1(\varepsilon), \qquad (32)$$
and in the following lemma we prove that, for $N$ defined as above, condition (6) is satisfied (see also Figure 3).

Lemma 6 Let $N$ be defined as in (32). Then there exists a constant $C > 0$ such that, for any $\varepsilon > 0$ and any $x \in X$,
$$P[x_{k+N(\varepsilon)} \in B_\varepsilon \mid x_k = x] \geq C\varepsilon. \qquad (33)$$

Proof. A lower bound for the probability in (33) is given by
$$P[x_{k+T} \in B_{\delta_2} \mid x_k = x]\; P[x_{k+N_1(\varepsilon)+T} \in B_\varepsilon \mid x_{k+T} \in B_{\delta_2}]. \qquad (34)$$
In view of Lemma 5, the first probability in (34) can be bounded from below by $\beta$. In view of Lemma 4, the second probability in (34) can be bounded from below by $C_1\varepsilon$. Therefore, by choosing $C = \beta C_1 > 0$ the result of the lemma follows.

As a final result we derive an upper bound for $N(\varepsilon)$ for our example, and see that in this example condition (8) imposes a further restriction on the choice of $g_1$.

Example 1 (continued) First we notice that the function $M(\delta)$ chosen for our example increases as $\delta$ decreases to 0. Then, an upper bound for $N(\varepsilon)$ is given by
$$T + \left\lceil \log_2 \frac{\delta_2}{\varepsilon} \right\rceil M(\varepsilon) = O\left( \frac{\log_2^{3/2} \frac{1}{\varepsilon}}{\varepsilon^{n\left(\frac{1}{2} - g_2\right)}} \right). \qquad (35)$$
By recalling the asymptotic result (28) with $\delta = \varepsilon$, it holds that
$$\frac{t(\varepsilon)N(\varepsilon)}{\varepsilon} = O\left( \log_2^{3/2} \frac{1}{\varepsilon} \cdot \varepsilon^{\frac{g_1 - 1}{2} - 1} \right),$$
where the right-hand side converges to 0 if $g_1 > 3$.

4 A comparison between PAS, PRS and SA

Now we are able to compute the upper bound for the expected number of iterations before reaching the level set $B_\varepsilon$ for our example.

Example 1 (continued) According to Theorem 2 an upper bound is given by $\frac{N(\varepsilon)}{C\varepsilon}$, while an upper bound for $N(\varepsilon)$ is given by (35). Therefore,
$$E[T^\varepsilon_x] \leq O\left( \frac{\log_2^{3/2} \frac{1}{\varepsilon}}{\varepsilon^{n\left(\frac{1}{2} - g_2\right) + 1}} \right).$$
In particular, if we choose $g_2 = \frac{1}{2}$, the upper bound is
$$O\left( \frac{\log_2^{3/2} \frac{1}{\varepsilon}}{\varepsilon} \right). \qquad (36)$$

The result obtained for our example is significant. As already remarked, many objective functions can be approximated by strictly convex quadratic functions in a neighborhood of the global optimum. Since an appropriate choice of the parameters is only related to the behavior of $f$ in a neighborhood of the global optimum, the order (36) with respect to $\frac{1}{\varepsilon}$ of the upper bound for the expected number of iterations can be extended to more general functions. Now we compare this result with those of the ideal algorithm PAS (Pure Adaptive Search) and of the simple algorithm PRS (Pure Random Search). The PAS algorithm, introduced in [27], can be described as follows.

The PAS algorithm

Step 0 Set $k = 0$, $X_0 = X$;

Step 1 sample $x_{k+1}$ from the uniform distribution over
$$X_k = \{y \in X : f(y) < f(x_k)\};$$

Step 2 if a stopping criterion is met, stop; otherwise, set $k = k+1$ and return to Step 1.

In [27] it is shown that, under the assumptions of Lipschitz continuity of $f$ and convexity of $X$, the expected number of PAS iterations needed to reach $B_\varepsilon$ is bounded from above by
$$1 + n \cdot \log\left(\frac{Ld}{\varepsilon}\right), \qquad (37)$$
where $L$ is the Lipschitz constant of $f$ and $d$ is the diameter of $X$. The drawback of the PAS algorithm lies in the uniform generation over the level set $X_k$ at Step 1 of the algorithm. In [27] it is pointed out that there is no known efficient procedure for generating a point uniformly distributed over a general region. Therefore, PAS can be considered as an ideal algorithm at whose results real algorithms should aim. A variant of the PAS algorithm is the HAS (Hesitant Adaptive Search) algorithm (see [4]), which tries to bring PAS closer to SA algorithms. Indeed, while in PAS each new iterate $x_{k+1}$ improves the function value of the previous iterate, in HAS hesitation is allowed with a given probability, i.e. at some iterations it may hold that $x_{k+1} = x_k$. As previously seen, SA algorithms also allow hesitation; however, with respect to SA algorithms, HAS does not allow backtracking, i.e. the new iterate $x_{k+1}$ cannot have a function value higher than $f(x_k)$, while this is possible for SA algorithms (see also [23] for a further analysis of the relation between SA and PAS). We also mention here the well-known PRS (Pure Random Search) algorithm, defined as follows.

The PRS algorithm

Step 0 Set $k = 0$;

Step 1 sample $x_{k+1}$ from the uniform distribution over $X$;

Step 2 if a stopping criterion is met, stop; otherwise, set $k = k+1$ and return to Step 1.
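PRS is trivial to implement when $X$ is a box, and for Example 1 its expected hitting time can even be checked empirically against the formula $m(X)/m(B_\varepsilon)$ discussed below. The following sketch (sample sizes and $\varepsilon$ are illustrative choices of ours) does exactly that:

```python
import random

random.seed(1)

def prs_hitting_time(f, f_star, eps, n, max_iter=10**6):
    """Pure Random Search on X = [0,1]^n: i.i.d. uniform samples,
    stopping at the first iteration k with f(x_k) <= f* + eps."""
    for k in range(1, max_iter + 1):
        x = [random.random() for _ in range(n)]
        if f(x) <= f_star + eps:
            return k
    return max_iter

# For Example 1 with n = 2 and eps = 0.05, B_eps is a disk of radius
# sqrt(eps) around (0.5, 0.5) contained in the unit square, so
# m(B_eps) = pi * eps and the expected hitting time m(X)/m(B_eps)
# is 1/(pi * 0.05), roughly 6.4 iterations; in general the order
# is eps^(-n/2).
f = lambda x: sum((xi - 0.5) ** 2 for xi in x)
eps = 0.05
times = [prs_hitting_time(f, 0.0, eps, n=2) for _ in range(500)]
mean_time = sum(times) / len(times)   # Monte Carlo estimate of m(X)/m(B_eps)
```

The hitting time of PRS is geometric with success probability $m(B_\varepsilon)/m(X)$ per iteration, which is exactly the statement of (38) below.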

where m denotes the Lebesgue measure. For our example the expected time is O("? n2 ). The result (36) is a step towards the upper bound (37) for the ideal PAS algorithm, and compares very favourably with the upper bound (38) for the PRS algorithm. The strong improvement with respect to PRS is remarkable, because the proposed SA algorithm only di ers from PRS because it allows hesitation, and because at each iteration the region over which a uniform point is sampled is not the whole feasible region but a neighborhood of the current point. With respect to PAS, the SA algorithm does not have the linear dependence on n appearing in (37). The constant of proportionality in (36) may depend exponentially on n. On the other hand, we can not expect that a real algorithm is able to solve the NP-hard global optimization problems without having to do with the curse of dimensionality.

5 Conclusion In this paper a SA algorithm has been investigated under the simplifying assumption that the global optimum value is known. The algorithm, described in Section 1, generates at each iteration a uniform random point over a sphere whose radius depends on the di erence between the current function value and the optimal value f  , and accept or reject it according to the Metropolis acceptance function where also the temperature depends on the same di erence. Under appropriate assumptions, introduced in Sections 2 and 3, the algorithm has been proven to be convergent and an upper bound for the expected rst hitting time, i.e. the expected number of iterations to reach the level set B" , has been given. In Section 4 the obtained results have been compared with those for the ideal algorithm PAS (Pure Adaptive Search) and for the simple PRS (Pure Random Search) algorithm, and it has been shown that, under an appropriate choice of the parameters related to the behavior of the objective 21

function in a neighborhood of the global optimum, the SA algorithm is a step towards the PAS results. Currently, some computations are exploring the eciency of SA algorithms where both the temperature and the radius depend on the distance from the optimal value (or an estimate of it). Some promising results have been already obtained in [26], but in that paper only the case g1 ; g2 = 1 has been considered, and the results only cover the case of known optimal value. The complete computational results will appear in a future work but partial results are available at the web site http://www.di.unito/locatell/part comp.ps.


Appendix

A Proof of Lemma 2

Proof. We first note that the probability P[x_{k+1} ∉ B_δ | x_k ∈ B_δ] can be split into the sum

  P[x_{k+1} ∉ B_δ | x_k ∈ B_{δ/2}] P[x_k ∈ B_{δ/2} | x_k ∈ B_δ]
  + P[x_{k+1} ∉ B_δ | x_k ∈ B_δ \ B_{δ/2}] P[x_k ∈ B_δ \ B_{δ/2} | x_k ∈ B_δ],

which can be bounded from above by

  P[x_{k+1} ∉ B_δ | x_k ∈ B_{δ/2}] + P[x_{k+1} ∉ B_δ | x_k ∈ B_δ \ B_{δ/2}].   (39)

We note that x_{k+1} ∉ B_δ and x_k ∈ B_{δ/2} imply f(x_{k+1}) − f(x_k) ≥ δ/2, while x_k ∈ B_{δ/2} implies

  t_{x_k} ≤ Δ₁ [δ/2]^{g₁}.

Therefore, an upper bound for the first probability in (39) is

  exp{ −(δ/2) / (Δ₁ [δ/2]^{g₁}) } = t₁(δ).

The second probability in (39) can be further split into the sum

  P[x_{k+1} ∉ B_{δ+δ^{(1+g₁)/2}} | x_k ∈ B_δ \ B_{δ/2}]   (40)

  + P[x_{k+1} ∈ B_{δ+δ^{(1+g₁)/2}} \ B_δ | x_k ∈ B_δ \ B_{δ/2}].   (41)

By arguments completely analogous to those for the derivation of an upper bound for the first probability in (39), it follows that probability (40) is bounded from above by

  exp{ −[f(x_{k+1}) − f(x_k)] / t_{x_k} } ≤ exp{ −δ^{(1+g₁)/2} / (Δ₁ δ^{g₁}) } = t₂(δ).

Probability (41) is bounded from above by the probability of sampling a point inside the set B_{δ+δ^{(1+g₁)/2}} \ B_δ, i.e., in view of the definition of the next candidate distribution D, by

  m(B_{δ+δ^{(1+g₁)/2}} \ B_δ) / m(S(x_k, R_{x_k}) ∩ X),

which, in view of Observation 2, can be bounded from above by

  m(B_{δ+δ^{(1+g₁)/2}} \ B_δ) / V(R_{x_k}).

Finally, by observing that x_k ∈ B_δ \ B_{δ/2} implies R_{x_k} ≥ Δ₂ [δ/2]^{g₂}, it follows that an upper bound for probability (41) is t₃(δ). Therefore, an upper bound for (39) is given by the sum of the three functions t₁, t₂ and t₃, i.e. by the function t, as we wanted to prove.
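The opening step of this proof — writing a conditional probability as a total-probability split over a disjoint partition of the conditioning event, then dropping the weights — can be checked mechanically on a toy finite probability space. The weights and events below are arbitrary illustrations, not objects from the paper:

```python
from fractions import Fraction as F

# Finite space {0,...,5} with arbitrary rational weights; A, C1, C2 are
# arbitrary stand-ins for the events in the proof (C1 and C2 disjoint).
w = {0: F(1, 12), 1: F(2, 12), 2: F(3, 12), 3: F(1, 12), 4: F(4, 12), 5: F(1, 12)}
A = {1, 3, 5}            # plays the role of {x_{k+1} not in B_delta}
C1 = {0, 1}              # plays the role of {x_k in B_{delta/2}}
C2 = {2, 3}              # plays the role of {x_k in B_delta \ B_{delta/2}}
C = C1 | C2

def P(E):                # probability of an event
    return sum(w[s] for s in E)

def Pcond(E, G):         # conditional probability P[E | G]
    return P(E & G) / P(G)

lhs = Pcond(A, C)
split = Pcond(A, C1) * Pcond(C1, C) + Pcond(A, C2) * Pcond(C2, C)
assert lhs == split                            # exact total-probability split
assert lhs <= Pcond(A, C1) + Pcond(A, C2)      # the bound giving (39)
print(lhs)
```

Exact rational arithmetic makes the identity an equality check rather than a floating-point approximation.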

B Proof of Lemma 3

Proof. Let us denote by

  G_{j₁}^{j₂}(U) = {x_{k+j} ∈ U, j = j₁, …, j₂},   j₁ ≤ j₂, U ⊆ X,

the event that the sequence {x_k} never leaves the set U from iteration k + j₁ up to iteration k + j₂. For any j ∈ {1, …, M(δ)} let us consider the following event (also illustrated in Figure 4)

  E_j = G_0^{j−1}(B_{2δ} \ B_δ) ∩ {x_{k+j} ∈ B_δ} ∩ G_{j+1}^{M(δ)}(B_δ),   (42)

i.e. the event of staying inside the set B_{2δ} \ B_δ up to iteration k + j − 1, moving into B_δ at iteration k + j, and, finally, never leaving the set B_δ up to iteration k + M(δ). Each event E_j implies the event {x_{k+M(δ)} ∈ B_δ}, which is, consequently, implied also by the union of the events E_j. Therefore

  P[x_{k+M(δ)} ∈ B_δ | x_k ∈ B_{2δ} \ B_δ] ≥ P[∪_{j=1}^{M(δ)} E_j | x_k ∈ B_{2δ} \ B_δ],

and since the events E_j are pairwise disjoint

  P[x_{k+M(δ)} ∈ B_δ | x_k ∈ B_{2δ} \ B_δ] ≥ Σ_{j=1}^{M(δ)} P[E_j | x_k ∈ B_{2δ} \ B_δ].   (43)

In view of the definition of the event E_j, each term in the sum can be rewritten as the following product

  P[G_{j+1}^{M(δ)}(B_δ) | G_0^{j−1}(B_{2δ} \ B_δ) ∩ {x_{k+j} ∈ B_δ}]
  × P[G_0^{j−1}(B_{2δ} \ B_δ) ∩ {x_{k+j} ∈ B_δ} | x_k ∈ B_{2δ} \ B_δ].   (44)

[Figure 4: Graphical illustration of the event E_j: the chain stays in B_{2δ} \ B_δ from k to k + j − 1, moves into B_δ at k + j, and remains in B_δ from k + j + 1 to k + M(δ).]

Taking into account the Markovian property (4), the first probability in (44) can be rewritten as follows

  Π_{h=j+1}^{M(δ)} P[x_{k+h} ∈ B_δ | x_{k+h−1} ∈ B_δ],

which, in view of Lemma 2, can be bounded from below by

  [1 − t(δ)]^{M(δ)−j} ≥ [1 − t(δ)]^{M(δ)}.   (45)

Therefore, a lower bound for the sum in (43) is given by

  [1 − t(δ)]^{M(δ)} Σ_{j=1}^{M(δ)} P[G_0^{j−1}(B_{2δ} \ B_δ) ∩ {x_{k+j} ∈ B_δ} | x_k ∈ B_{2δ} \ B_δ].   (46)

Since the events

  G_0^{j−1}(B_{2δ} \ B_δ) ∩ {x_{k+j} ∈ B_δ},   j ∈ {1, …, M(δ)},

are pairwise disjoint, the sum in (46) turns out to be the probability of the event

  E¹ = {∃ j ∈ {1, …, M(δ)} : x_{k+j} ∈ B_δ, x_{k+i} ∈ B_{2δ} \ B_δ for i < j},

i.e. the event that, within M(δ) iterations, we will move from the set B_{2δ} \ B_δ into the set B_δ. Therefore, the lower bound (46) for the sum in (43) is equivalent to

  [1 − t(δ)]^{M(δ)} P[E¹ | x_k ∈ B_{2δ} \ B_δ].

It can be seen that the event E¹ is implied by the following event

  E² = {∃ j ∈ {1, …, M(δ)} : x_{k+j} ∈ B_δ} ∩ G_0^{M(δ)}(B_{2δ}).

Indeed, E² not only requires that from B_{2δ} \ B_δ we will move, within M(δ) iterations, into B_δ, but also that after having reached B_δ we will never leave B_{2δ} up to iteration k + M(δ). Therefore

  P[E¹ | x_k ∈ B_{2δ} \ B_δ] ≥ P[E² | x_k ∈ B_{2δ} \ B_δ],

and a lower bound for (43) is then

  [1 − t(δ)]^{M(δ)} P[E² | x_k ∈ B_{2δ} \ B_δ].

Now we note that

  E² = G_0^{M(δ)}(B_{2δ}) \ G_0^{M(δ)}(B_{2δ} \ B_δ).

Therefore

  P[E² | x_k ∈ B_{2δ} \ B_δ] = P[G_0^{M(δ)}(B_{2δ}) | x_k ∈ B_{2δ} \ B_δ] − P[G_0^{M(δ)}(B_{2δ} \ B_δ) | x_k ∈ B_{2δ} \ B_δ].

In a way completely analogous to the derivation of the lower bound (45) for the first probability in (44), it can be seen that

  P[G_0^{M(δ)}(B_{2δ}) | x_k ∈ B_{2δ} \ B_δ] ≥ [1 − t(2δ)]^{M(δ)}.   (47)

Moreover, in view of the Markovian property (4), it holds that

  P[G_0^{M(δ)}(B_{2δ} \ B_δ) | x_k ∈ B_{2δ} \ B_δ] = Π_{j=1}^{M(δ)} P[x_{k+j} ∈ B_{2δ} \ B_δ | x_{k+j−1} ∈ B_{2δ} \ B_δ].

In view of Lemma 1, for δ ≤ 1/2 it holds that for any j ∈ {1, …, M(δ)}

  P[x_{k+j} ∈ B_{2δ} \ B_δ | x_{k+j−1} ∈ B_{2δ} \ B_δ] ≤ 1 − P[x_{k+j} ∈ B_δ | x_{k+j−1} ∈ B_{2δ} \ B_δ] ≤ 1 − s(δ).

Then,

  P[G_0^{M(δ)}(B_{2δ} \ B_δ) | x_k ∈ B_{2δ} \ B_δ] ≤ [1 − s(δ)]^{M(δ)}.   (48)

Therefore, by combining the bounds (47) and (48),

  P[E² | x_k ∈ B_{2δ} \ B_δ] ≥ [1 − t(2δ)]^{M(δ)} − [1 − s(δ)]^{M(δ)}.

Then, we can conclude that a lower bound for (43) is

  [1 − t(δ)]^{M(δ)} { [1 − t(2δ)]^{M(δ)} − [1 − s(δ)]^{M(δ)} },

which, in view of (29), converges to 1 as δ → 0. Therefore, there exists δ₂ > 0 such that for any δ ≤ δ₂ the lower bound is greater than 1/2, which concludes the proof.
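The first-passage decomposition at the heart of this proof — disjoint events E_j, each implying {x_{k+M} ∈ B_δ}, whose total probability is therefore a valid lower bound as in (43) — can be verified exactly on a small abstract Markov chain by enumerating all trajectories. The transition matrix below is an arbitrary illustration, not derived from the SA algorithm:

```python
from itertools import product

# Abstract 3-state chain: state 0 ~ B_{2δ}\B_δ, state 1 ~ B_δ, state 2 ~ rest.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.8, 0.1],
     [0.4, 0.1, 0.5]]
M = 5

def path_prob(path):
    """Probability of a full trajectory, conditional on its starting state."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= P[a][b]
    return p

paths = [(0,) + tail for tail in product(range(3), repeat=M)]
p_end_in = sum(path_prob(pa) for pa in paths if pa[-1] == 1)

def in_Ej(pa, j):
    # E_j: stay in state 0 up to step j-1, enter state 1 at step j, stay there
    return all(s == 0 for s in pa[:j]) and all(s == 1 for s in pa[j:])

p_union = sum(path_prob(pa) for pa in paths
              if any(in_Ej(pa, j) for j in range(1, M + 1)))

# Each E_j forces the chain to end in state 1, and the E_j are pairwise
# disjoint, so their total probability lower-bounds P[x_M in state 1].
assert p_union <= p_end_in + 1e-12
print(p_end_in, p_union)
```

Enumerating all 3^5 trajectories keeps the check exact rather than a simulation, so the inequality holds to floating-point precision.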

C Proof of Lemma 5

Proof. Let

  T = ⌈ 2 diam(X) / (Δ₂ [δ₂/2]^{g₂}) ⌉,   (49)

where diam(X) = √n denotes the diameter of [0, 1]ⁿ. Consider the segment [x_k, x*] and the following sequence of points

  w_i = x_k + i (Δ₂/2) [δ₂/2]^{g₂} (x* − x_k)/‖x* − x_k‖.

Let

  ν_{x_k} = arg min { i : w_i ∉ [x_k, x*] or w_i ∈ B_{3δ₂/4} }.

We note that, in view of the definition (49) of T,

  ∀ x_k ∈ X : ν_{x_k} ≤ T.

Next define a finite sequence of points {w̄_i, i = 0, …, ν_{x_k}} as follows. If ν_{x_k} = 0, set w̄_0 = x_k; otherwise set

  w̄_i = w_i,   i ∈ {0, …, ν_{x_k} − 1},
  w̄_{ν_{x_k}} ∈ [x_k, x*] ∩ ∂B_{3δ₂/4},

where ∂B_{3δ₂/4} denotes the border of the set B_{3δ₂/4}. Since, by assumption, the objective function f is continuous over the compact set X, it is also uniformly continuous over X. This implies that there exists η = η(δ₂) > 0 such that

  ∀ x, y ∈ X : ‖x − y‖ ≤ η ⇒ |f(x) − f(y)| < δ₂/4.

Therefore, for all i ∈ {0, …, ν_{x_k} − 1},

  w̄_i ∉ B_{3δ₂/4},  x_{k+i} ∈ S(w̄_i, ρ) ⇒ x_{k+i} ∉ B_{δ₂/2},   (50)

and

  x_{k+ν_{x_k}} ∈ S(w̄_{ν_{x_k}}, ρ) ⇒ x_{k+ν_{x_k}} ∈ B_{δ₂}.   (51)

Now let

  ρ = min{ η, (Δ₂/4) [δ₂/2]^{g₂}, 1 }.

We prove that for any i ∈ {0, …, ν_{x_k} − 1}

  x_{k+i} ∈ S(w̄_i, ρ) ⇒ S(x_{k+i}, R_{x_{k+i}}) ⊇ S(w̄_{i+1}, ρ).   (52)

Indeed, in view of (50),

  R_{x_{k+i}} = Δ₂ [f(x_{k+i}) − f*]^{g₂} ≥ Δ₂ [δ₂/2]^{g₂}.

Moreover, for any y ∈ S(w̄_{i+1}, ρ), it follows from the triangular inequality that

  ‖y − x_{k+i}‖ ≤ ‖y − w̄_{i+1}‖ + ‖w̄_{i+1} − w̄_i‖ + ‖w̄_i − x_{k+i}‖
              ≤ (Δ₂/4)[δ₂/2]^{g₂} + (Δ₂/2)[δ₂/2]^{g₂} + (Δ₂/4)[δ₂/2]^{g₂}
              = Δ₂ [δ₂/2]^{g₂} ≤ R_{x_{k+i}},

i.e. (52) holds. Now let us denote by

  F_{k+i} = {x_{k+j} ∈ S(w̄_j, ρ), 0 ≤ j < i}

the event that the sequence of points generated by the algorithm from iteration k up to iteration k + i − 1 jumps through the spheres S(w̄_j, ρ). Then, a lower bound for P[x_{k+T} ∈ B_{δ₂} | x_k = x] is given by

  Π_{i=0}^{ν_{x_k}} P[x_{k+i} ∈ S(w̄_i, ρ) | F_{k+i}] · P[x_{k+T} ∈ B_{δ₂} | F_{k+ν_{x_k}+1}],

i.e. the probability that we move inside the spheres S(w̄_i, ρ) until we reach the set S(w̄_{ν_{x_k}}, ρ) ⊆ B_{δ₂} (see (51)) at iteration k + ν_{x_k}, and then we lie inside the set B_{δ₂} in all the subsequent T − ν_{x_k} iterations (see also Figure 5).

[Figure 5: A possible way to be in the set B_{δ₂} after T iterations with a positive probability, starting from any point x_k ∈ X: the chain jumps through the spheres S(w̄_0, ρ), S(w̄_1, ρ), …, S(w̄_{ν_{x_k}}, ρ) ⊆ B_{δ₂}, then stays in B_{δ₂} from k + ν_{x_k} + 1 to k + T.]

Each term in the product is bounded from below by

  [ m(S(w̄_i, ρ) ∩ X) / m(S(x_{k+i−1}, R_{x_{k+i−1}}) ∩ X) ] exp{ −[f(x_{k+i}) − f(x_{k+i−1})] / t_{x_{k+i−1}} }.

We note that f(x_{k+i}) − f(x_{k+i−1}) can be bounded from above by F̄ = max_{x∈X} f(x) − f*. Moreover, in view of (50),

  t_{x_{k+i−1}} ≥ Δ₁ [δ₂/2]^{g₁}.

Finally, it follows from Observation 2 and ρ ≤ 1 that

  m(S(w̄_i, ρ) ∩ X) ≥ V(ρ),

while, obviously,

  m(S(x_{k+i−1}, R_{x_{k+i−1}}) ∩ X) ≤ m(X).

Therefore, a lower bound for each term in the product is

  [V(ρ)/m(X)] exp{ −F̄ / (Δ₁ [δ₂/2]^{g₁}) }.

Moreover, it follows from Lemma 2 that

  P[x_{k+T} ∈ B_{δ₂} | x_{k+ν_{x_k}} ∈ B_{δ₂}] ≥ [1 − t(δ₂)]^T.

Therefore, by combining all the bounds found so far, the result of the lemma follows by choosing the constant of the lemma equal to

  ( [1 − t(δ₂)] [V(ρ)/m(X)] exp{ −F̄ / (Δ₁ [δ₂/2]^{g₁}) } )^T.
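The chain-of-spheres construction in this proof can be illustrated numerically. The sketch below is an assumption-laden simplification: all parameter values are arbitrary, and the last point is simply projected to the end of the segment rather than to the boundary of B_{3δ₂/4}. It places points with spacing (Δ₂/2)[δ₂/2]^{g₂} along [x_k, x*] and checks that their number never exceeds the T of (49):

```python
import math

def chain(xk, xstar, Delta2, delta2, g2):
    """Points w_i along [xk, xstar] spaced by (Delta2/2)*(delta2/2)**g2;
    the final point is clipped to the segment's end (a simplification of
    the proof's stopping rule on the boundary of B_{3*delta2/4})."""
    step = (Delta2 / 2.0) * (delta2 / 2.0) ** g2
    dist = math.dist(xk, xstar)
    n_steps = math.ceil(dist / step)       # first index leaving the segment
    pts = []
    for i in range(n_steps + 1):
        t = min(i * step / dist, 1.0)
        pts.append(tuple(a + t * (b - a) for a, b in zip(xk, xstar)))
    return step, pts

n = 2
diam = math.sqrt(n)                        # diameter of X = [0, 1]^n
Delta2, delta2, g2 = 0.5, 0.2, 1.0
step, pts = chain((0.9, 0.1), (0.05, 0.8), Delta2, delta2, g2)
T = math.ceil(2 * diam / (Delta2 * (delta2 / 2.0) ** g2))   # as in (49)

assert len(pts) - 1 <= T                   # the index never exceeds T
# consecutive points are at most one spacing apart (the last gap may be shorter)
assert all(math.dist(a, b) <= step + 1e-12 for a, b in zip(pts, pts[1:]))
print(len(pts) - 1, T)
```

Because the spacing is half the minimal sampling radius Δ₂[δ₂/2]^{g₂} available outside B_{δ₂/2}, each sphere of radius ρ around w̄_{i+1} stays inside the sampling sphere of any point near w̄_i, which is exactly what (52) needs.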

References

[1] M.M. Ali, C. Storey, Aspiration Based Simulated Annealing Algorithm, Journal of Global Optimization, 11(2), 181-191 (1997)

[2] C.J.P. Belisle, Convergence Theorems for a Class of Simulated Annealing Algorithms on