Evolution under Partial Information

Duc-Cuong Dang
Per Kristian Lehre

ASAP Research Group
School of Computer Science
University of Nottingham

[email protected]
[email protected]

ABSTRACT


Complete and accurate information about the quality of candidate solutions is not always available in real-world optimisation. It is often prohibitively expensive to evaluate candidate solutions on more than a few test cases, or the evaluation mechanism itself is unreliable. While evolutionary algorithms are popular optimisation methods, their theoretical understanding is lacking for the case of partial information. This paper initiates the runtime analysis of evolutionary algorithms where only partial information about fitness is available. Two scenarios are investigated. In partial evaluation of solutions, only a small amount of information about the problem is revealed in each fitness evaluation. We formulate a model that makes this scenario concrete for pseudo-Boolean optimisation. In partial evaluation of populations, only a few individuals in the population are evaluated, and the fitness values of the other individuals are missing or incorrect. For both scenarios, we prove that given a set of specific conditions, non-elitist evolutionary algorithms can optimise many functions in expected polynomial time even when vanishingly little information is available. The conditions imply a small enough mutation rate and a large enough population size. The latter emphasises the importance of populations in evolution.

In the past decades, there have been significant advances in the theoretical understanding of Evolutionary Algorithms (EAs) and their behaviour on optimisation problems. A large number of results on the time complexity of EAs have been rigorously proved [14]. Most of these analyses assume that the true fitness value of a solution can be obtained by calling an evaluation procedure, and the time complexity is then measured as the number of such calls.

In real-world optimisation problems, it often happens that complete and accurate information about the quality of candidate solutions is either unavailable or prohibitively expensive to obtain. Optimisation problems with noise, dynamic objective functions or stochastic data, generally known as optimisation under uncertainty [6], are examples of the unavailability. On the other hand, expensive evaluations often occur in engineering, especially in structural and engine design. For example, in order to determine the fitness of a solution, the solution may have to be put through a real experiment or a simulation, which can be time- and resource-consuming, or even require the collection and processing of a large amount of data. Such complex tasks give rise to the use of surrogate models [5] to assist EAs, where a cheap, approximate procedure fully or partially replaces the expensive evaluations; full replacement is equivalent to the case of unavailability. We summarise this kind of problem as optimisation relying only on imprecise, partial or incomplete information about the problem.

While EAs have been widely and successfully used in this challenging area of optimisation [5, 6], few rigorous theoretical studies have been dedicated to fully understanding the behaviour of EAs in such environments. Especially for discrete optimisation, existing studies are often limited to the case of imprecise information, and only extreme configurations of EAs are considered.
In [2, 3], the inefficiency of the (1+1) EA has been rigorously demonstrated for noisy and dynamic optimisation of the OneMax function. The reason behind this inefficiency is that in such environments it is very difficult for the algorithm to decide whether one solution is better than another, so it often chooses the wrong candidate. Such an effect could be reduced by having a population, which the (1+1) EA lacks. It has been shown in [11] that with a very large or infinite population, noise has no effect on EAs. However, the analysis presented in [11] is based on approximations which are not rigorous (though they make sense for infinite populations); therefore, the exact order of the required population size cannot be specified.

Categories and Subject Descriptors
F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity

General Terms Theory, Algorithms, Complexity

Keywords Partial Evaluation, Runtime Analysis, Non-elitism

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. GECCO’14, July 12–16, 2014, Vancouver, BC, Canada. Copyright 2014 ACM 978-1-4503-2662-9/14/07 ...$15.00. http://dx.doi.org/10.1145/2576768.2598375.


1. INTRODUCTION

Algorithm 1 Population Selection-Variation Algorithm
Require: Finite search space X, and initial population P_0 ∼ Unif(X^λ).
1: for t = 0, 1, 2, . . . until termination condition met do
2:   for i = 1 to λ do
3:     Sample I_t(i) ∈ [λ] according to p_sel(P_t).
4:     x := P_t(I_t(i)).
5:     Sample x′ according to p_mut(x).
6:     P_{t+1}(i) := x′.
7:   end for
8: end for
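To make the scheme concrete, the following is a minimal Python sketch of Algorithm 1, instantiated with the operators analysed later in the paper (binary tournament selection and bitwise mutation). The function name and parameters are illustrative assumptions, not part of the original implementation.

```python
import random

def run_scheme(fitness, n, lam, chi, max_gens, rng=random.Random(0)):
    """Sketch of Algorithm 1: non-elitist selection-variation scheme.

    Here p_sel is binary tournament selection and p_mut is bitwise
    mutation (each bit flipped with probability chi/n), the special
    case of Algorithm 1 studied in this paper.
    """
    # Initial population P0 ~ Unif(X^lambda) over bitstrings of length n.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    for _ in range(max_gens):
        nxt = []
        for _ in range(lam):
            # Binary tournament: sample two parents, keep the fitter one.
            x, y = rng.choice(pop), rng.choice(pop)
            parent = x if fitness(x) >= fitness(y) else y
            # Bitwise mutation with rate chi/n.
            child = [b ^ (rng.random() < chi / n) for b in parent]
            nxt.append(child)
        pop = nxt  # populations do not overlap: the scheme is non-elitist
        if any(sum(ind) == n for ind in pop):
            break  # stop early once the all-ones string appears (OneMax)
    return pop
```

On OneMax (`fitness=sum`) with a modest population and χ = 1, the sketch quickly evolves high-fitness bitstrings, illustrating the population-based, non-overlapping generational scheme.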

In this paper, we initiate the runtime analysis of evolutionary algorithms where only partial or incomplete information about fitness is available. Two scenarios are investigated: (i) in partial evaluation of solutions, only a small amount of information about the problem is revealed in each fitness evaluation; we formulate a model that makes this scenario concrete for pseudo-Boolean optimisation; (ii) in partial evaluation of populations, only a few individuals in the population are evaluated, and the fitness values of the other individuals are missing. For both scenarios, we rigorously prove that given a set of specific conditions, non-elitist evolutionary algorithms can optimise many functions in expected polynomial time even with very little information available. The conditions imply a small enough mutation rate and a large enough population size. The latter emphasises the importance of having a population, and of understanding the population size required to overcome an environment of incomplete or imprecise information.

The remainder of the paper is organised as follows. The next section presents the generalised model of optimisation and summarises our main tool for analysing EAs with non-elitist populations, which is presented in a companion paper. In Section 3, the scenario of partial evaluation of solutions is presented, with specific examples from pseudo-Boolean optimisation. Partial evaluation of populations is discussed and analysed in Section 4, and finally some conclusions are drawn.

2. ANALYTICAL TOOLS

For any positive integer n, define [n] := {1, 2, . . . , n}. The Hamming distance is denoted by H(·, ·). The natural logarithm is denoted by ln(·), and the logarithm to base 2 by log(·). For a bitstring x of length n, define |x|_1 := Σ_{i=1}^n x_i. Given a fully defined function f(x), the generalised optimisation problem of f(x) is defined as follows.

Definition 1. In the problem of optimising f(x) knowing only F_c(x) (c a positive parameter), the evaluation of a bitstring x returns F_c(x) instead of f(x). For c = 1, F_1(x) = f(x), i.e. the static/classical optimisation problem, called optimising f(x) for short. For c < 1, F_c(x) is a random variable parameterised by c and x.

An illustrative example of optimisation with imprecise information is the noisy case. In [3], F_c(x) has probability c of being equal to f(x); otherwise, with probability 1 − c, it equals the fitness of a solution picked uniformly at random from the Hamming neighbourhood of x with radius 1. Our examples of F_c(x) for optimisation with incomplete information are given in the next section. The runtime is still defined in terms of the discovery of a true optimal solution.

Definition 2. Assuming maximisation, the runtime of an algorithm A optimising f(x) knowing only F_c(x) is the number of times F_c(x) is sampled by A until, for the first time, a solution x* with f(x*) = max{f(x)} is produced by A.

We focus on when A finds a true optimal solution for the first time, and ignore how A recognises such an achievement. The algorithms analysed in this paper follow the scheme of Algorithm 1. In each step t, the scheme generates a new population P_{t+1} from the current one P_t. Each individual of P_{t+1} is generated independently by picking one parent from P_t using a selection mechanism, denoted p_sel, then mutating it using a variation operator, denoted p_mut. For purely analytical purposes, I_t(i) is used to denote the sorting index i of P_t, i.e. P_t(I_t(i)) returns the (i/λ)-ranked individual of P_t. Note that the populations do not overlap, hence the scheme is non-elitist. It is natural to assume that p_sel is monotone, in the following sense. Denote by p_sel(i | P) the probability that individual i of P is selected by p_sel. A selection mechanism p_sel is f-monotone if for all P ∈ X^λ and all pairs i, j ∈ [λ] it holds that

p_sel(i | P) ≥ p_sel(j | P) ⟺ f(P(i)) ≥ f(P(j)).

We then define the cumulative selection probability of p_sel as the probability of selecting an individual with fitness at least as good as that of the γ-ranked individual. Formally,

Definition 3. The cumulative selection probability β associated with selection mechanism p_sel is defined for all γ ∈ (0, 1] and P ∈ X^λ by

β(γ, P) := Σ_{y∈P} p_sel(y | P) · [f(y) ≥ f(P(I(⌈γλ⌉)))]

Algorithm 1 has been investigated in a string of papers [7, 10, 8, 1]. The algorithm has exponential expected runtime when the selective pressure is low relative to the mutation rate [7, 10]. Conversely, when the cumulative selection probability is sufficiently high, upper bounds on the expected runtime can be derived using a fitness-level technique [8]. The following theorem, recently shown in [1], is a further improvement which provides tighter upper bounds when β(γ) ≥ γ(1 + ε) for arbitrarily small ε > 0.

Theorem 4 ([1]). Given a function f : X → R and an f-based partition (A_1, . . . , A_{m+1}), let T be the number of function evaluations until Algorithm 1 with an f-monotone selection mechanism p_sel obtains an element of A_{m+1} for the first time. If there exist parameters p_0, s_1, . . . , s_m, s* ∈ (0, 1], γ_0 ∈ (0, 1) and δ > 0 such that

(C1) p_mut(y ∈ A_j^+ | x ∈ A_j) ≥ s_j ≥ s* for all j ∈ [m]
(C2) p_mut(y ∈ A_j ∪ A_j^+ | x ∈ A_j) ≥ p_0 for all j ∈ [m]
(C3) β(γ, P) p_0 ≥ (1 + δ)γ for all P ∈ X^λ and γ ∈ (0, γ_0)
(C4) λ ≥ (2/a) ln(16m/(a c s*)) with a = δ²γ_0/(2(1 + δ)), ε = min{δ/2, 1/2} and c = ε⁴/24

then

E[T] ≤ (2/(cε)) ( mλ(1 + ln(1 + cλ)) + (p_0/((1 + δ)γ_0)) Σ_{j=1}^m 1/s_j ).

In the theorem, A_j^+ is a short notation for ∪_{k>j} A_k. In this paper, we focus on the special case of Algorithm 1 where


the search space X is the set of bitstrings of length n, the selection operator p_sel is binary tournament selection, and the mutation operator p_mut is bitwise mutation, where each bit is flipped with probability χ/n. Note that the parameter χ is not necessarily constant with respect to n. The following simplifications and auxiliary results will be useful.

Corollary 5. Given any constant γ_0 ∈ (0, 1), if 1/δ ∈ poly(m) and 1/s* ∈ poly(m), then there exists a constant b such that condition (C4) is satisfied with

(C4') λ ≥ (b/δ²) ln(m).

Proof. For γ_0 a constant and δ ∈ (0, 1), we remark that 1/ε = O(1/δ), 1/c = O(1/δ⁴) and 1/a = (2/γ_0)(1/δ² + 1/δ) = O(1/δ²). Since 1/δ ∈ poly(m) and 1/s* ∈ poly(m), there must exist a constant d > 0 such that m^d ≥ 16m/(a c s*). Thus it suffices to choose b := 8d/γ_0 so that

(b/δ²) ln(m) = (4d/γ_0)(1/δ² + 1/δ²) ln(m)
             ≥ (4/γ_0)(1/δ² + 1/δ) ln(m^d)
             ≥ (2/a) ln(16m/(a c s*))

Then any λ ≥ (b/δ²) ln(m) satisfies (C4).

Corollary 6. Given any constant γ_0 ∈ (0, 1), for any δ ∈ (0, 1), the expected runtime of Algorithm 1 satisfying conditions (C1–4) is

O( (1/δ⁵) ( mλ(1 + ln(1 + δ⁴λ)) + Σ_{j=1}^m 1/s_j ) ).

Proof. For δ ∈ (0, 1), we have ε = min{δ/2, 1/2} = δ/2. Hence 2/(cε) = 48/ε⁵ = 1536/δ⁵. In addition, p_0/((1 + δ)γ_0) ≤ 1/γ_0, so by Theorem 4

E[T] ≤ (1536/δ⁵) ( mλ(1 + ln(1 + δ⁴λ/384)) + (1/γ_0) Σ_{j=1}^m 1/s_j )

and the result follows.

We also use drift analysis [9] to prove the inefficiency of the (1+1) EA under partial/incomplete information. The following theorem, a corollary of Theorem 2.3 in [4], was presented in [9].

Theorem 7 (Hajek's theorem [9]). Let {X_t}_{t≥0} be a stochastic process over some bounded state space S ⊆ [0, ∞), and let F_t be the filtration generated by X_0, . . . , X_t. Given a(n) and b(n) depending on a parameter n such that b(n) − a(n) = Ω(n), define T := min{t | X_t ≥ b(n)}. If there exist positive constants λ, ε, D such that

(L1) E[X_{t+1} − X_t | F_t, X_t > a(n)] ≤ −ε
(L2) (|X_{t+1} − X_t| | F_t) ≺ Y with E[e^{λY}] ≤ D

then there exists a positive constant c such that Pr(T ≤ e^{cn} | X_0 < a(n)) ≤ e^{−Ω(n)}.

Informally, if the progress toward state b(n) becomes negative from state a(n) onwards (L1) and big jumps are very rare (L2), then E[T] is exponential, e.g. by Markov's inequality

E[T | X_0 < a(n)] ≥ Pr(T > e^{cn} | X_0 < a(n)) e^{cn} ≥ (1 − e^{−Ω(n)}) e^{cn} = e^{Ω(n)}.

3. PARTIAL EVALUATION OF PSEUDO-BOOLEAN FUNCTIONS

We consider pseudo-Boolean functions over bitstrings of length n. It is well known that for any pseudo-Boolean function f : {0, 1}^n → R, there exists a set S := {S_1, . . . , S_k} of k subsets S_i ⊆ [n] with associated weights w_i ∈ R such that

f(x) = Σ_{i=1}^k w_i Π_{j∈S_i} x_j.   (1)

For example, we have k = n and S_i := {i} for linear functions, of which OneMax is the particular case where w_i = 1 for all i ∈ [n]. For LeadingOnes, we also get k = n and w_i = 1, but S_i := [i]. In a classical optimisation problem, full access to (S, W) is guaranteed, so f(x) can be determined exactly from Equation (1). In an optimisation problem under incomplete or partial information, access to (S, W) is made random, and so often incomplete. There are many ways to define the randomisation; in this paper we focus on the following model:

F_c(x) := Σ_{i=1}^k w_i R_i Π_{j∈S_i} x_j  with R_i ∼ Bin(1, c).   (2)

Informally, each subset S_i has probability c of being taken into account in the evaluation. We consider classical functions, such as OneMax and LeadingOnes, as typical examples. As mentioned in the previous section, F_c(x) is used instead of f(x) in each partial evaluation. We first look at the case of OneMax; the optimisation problem is then denoted OneMax(c). Literally, each 1-bit has only probability c of being counted in a fitness evaluation. We will rigorously prove that for 1/c ∈ poly(n), a population-based EA can optimise OneMax(c) in expected polynomial time. On the other hand, the classical (1+1) EA needs exponential time in expectation for any constant c < 1. For simplicity, we restrict p_sel in our analysis to binary tournament selection. The algorithm is detailed in Algorithm 2.

Algorithm 2 EAs (2-Tournament, Partial Information)
1: Sample P_0 ∼ Unif(X^λ), where X = {0, 1}^n.
2: for t = 0, 1, 2, . . . until termination condition met do
3:   for i = 1 to λ do
4:     Sample two parents x, y ∼ Unif(P_t).
5:     f_x := F_c(x) and f_y := F_c(y).
6:     if f_x > f_y then
7:       z := x
8:     else if f_x < f_y then
9:       z := y
10:    else
11:      z ∼ Unif({x, y})
12:    end if
13:    Flip each bit position in z with probability χ/n.
14:    P_{t+1}(i) := z.
15:  end for
16: end for

Theorem 4 requires lower bounds on β(γ) and δ so that the necessary condition on the mutation rate χ/n can be established. The value of β(γ) depends on the probability that the fitter of two individuals x and y is selected in lines 6–12 of


Algorithm 2. Formally, for any x and y with f(x) > f(y), we want a lower bound on the probability that the algorithm selects z = x.
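As a concrete illustration, the following Python sketch implements the partial-evaluation model of Equation (2) specialised to OneMax, where each 1-bit is counted with probability c, together with the tournament of lines 6–12 of Algorithm 2. The helper names are hypothetical, and the small experiment at the end only empirically checks the flavour of the bound derived below.

```python
import random

def partial_onemax(x, c, rng):
    """F_c for OneMax: each 1-bit is counted with probability c (Eq. (2))."""
    return sum(1 for bit in x if bit == 1 and rng.random() < c)

def tournament_winner(x, y, c, rng):
    """Lines 6-12 of Algorithm 2: compare noisy evaluations, ties broken uniformly."""
    fx, fy = partial_onemax(x, c, rng), partial_onemax(y, c, rng)
    if fx > fy:
        return x
    if fx < fy:
        return y
    return rng.choice([x, y])

# Empirical sanity check: with f(x) > f(y), the truly fitter individual x
# should win with probability somewhat above 1/2, consistent with the
# lower bound Pr(z = x) >= (1 + c*Pr(X = Y))/2 proved in Lemma 8.
rng = random.Random(1)
x = [1] * 12 + [0] * 8   # f(x) = 12
y = [1] * 10 + [0] * 10  # f(y) = 10
wins = sum(tournament_winner(x, y, 0.3, rng) is x for _ in range(20000))
win_rate = wins / 20000  # somewhat above 0.5 for c = 0.3
```

Note how small c makes the two noisy evaluations hard to distinguish, which is exactly why the analysis below bounds Pr(z = x) in terms of the tie probability Pr(X = Y).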

Lemma 8. For any two bitstrings x and y of length n with f(x) > f(y), we have Pr(z = x) ≥ (1/2)(1 + c Pr(X = Y)), where X and Y are identical independent random variables following the distribution Bin(f(y), c).

Proof. Each of the f(x) 1-bits of x has probability c of being counted, so F_c(x) ∼ Bin(f(x), c). Similarly, F_c(y) ∼ Bin(f(y), c). Remarking that f(x) = f(y) + (f(x) − f(y)), we can decompose F_c(x) and F_c(y) further. Let X, Y and Δ be independent random variables such that X ∼ Bin(f(y), c), Y ∼ Bin(f(y), c) and Δ ∼ Bin(f(x) − f(y), c); then F_c(x) = X + Δ and F_c(y) = Y. By the law of total probability,

Pr(z = x) ≥ Pr(X > Y) + Pr(X = Y)(Pr(Δ > 0) + Pr(Δ = 0)/2)
          = Pr(X > Y) + Pr(X = Y)(Pr(Δ ≥ 0) − Pr(Δ = 0)/2)
          = Pr(X > Y) + Pr(X = Y)(1 − (1 − c)^{f(x)−f(y)}/2)
          ≥ Pr(X > Y) + Pr(X = Y)(1 − (1 − c)/2)

The last inequality is due to (1 − c) ∈ (0, 1) and f(x) − f(y) ≥ 1. Because X and Y are identical and independent, we have Pr(X < Y) = Pr(X > Y). We also have the total probability Pr(X > Y) + Pr(X = Y) + Pr(X < Y) = 1. The two results imply Pr(X > Y) = (1 − Pr(X = Y))/2. Substituting this into the bound on Pr(z = x),

Pr(z = x) ≥ (1 − Pr(X = Y))/2 + Pr(X = Y)(1 − (1 − c)/2)
          = (1/2)(1 + c Pr(X = Y))

Corollary 9. For any two bitstrings x and y of length n with f(x) > f(y), we have

Pr(z = x) ≥ (1/2) ( 1 + (64c/81)/(6 √((n − 1)c(1 − c)) + 1) )

Proof. From Lemma 8, we apply the result of Lemma 21 to X, Y ∼ Bin(n − 1, c) (in the worst case, f(y) = n − 1) with d = 3 (chosen arbitrarily).

Lemma 10. For any γ ∈ (0, 1), the cumulative selection probability of Algorithm 2 is at least

β(γ) ≥ γ ( 1 + ((1 − γ)64c/81)/(6 √((n − 1)c(1 − c)) + 1) )

Proof. Without loss of generality, for any inputs x and y of the tournament selection we assume that f(x) ≥ f(y). Recall that β(γ) is the probability of picking an individual with fitness at least equal to the fitness of the γ-ranked individual, i.e. one belonging to the best γ-portion of the population. Therefore, it suffices that either both x and y are picked from the portion, or at least x is picked from the portion and then wins the tournament:

β(γ) ≥ γ² + 2γ(1 − γ) Pr(z = x) = γ(1 + (1 − γ)(2 Pr(z = x) − 1))

From Corollary 9, we have

2 Pr(z = x) − 1 ≥ 2 · (1/2) ( 1 + (64c/81)/(6 √((n − 1)c(1 − c)) + 1) ) − 1
                = (64c/81)/(6 √((n − 1)c(1 − c)) + 1)

So β(γ) ≥ γ ( 1 + ((1 − γ)64c/81)/(6 √((n − 1)c(1 − c)) + 1) ).

Corollary 11. For any c ∈ (0, 1) and any constant γ_0 ∈ (0, 1), there exists a constant a ∈ (0, 1) such that β(γ) ≥ γ(1 + 2δ) for all γ ∈ (0, γ_0], where δ = min{ac, a√(c/n)}.

Proof. From Lemma 10, for all γ ∈ (0, γ_0] we have

β(γ) ≥ γ ( 1 + uc/(v √(nc) + 1) )

where u = (1 − γ_0)64/81 and v = 6. If c ≤ 1/n, then √(nc) ≤ 1 and

uc/(v √(nc) + 1) ≥ (u/(v + 1)) c.

If c > 1/n, then √(nc) > 1 and

uc/(v √(nc) + 1) > uc/(v √(nc) + √(nc)) = (u/(v + 1)) √(c/n).

The statement now follows by choosing a = u/(2(v + 1)).

Theorem 12. Given c ∈ (0, 1) such that 1/c ∈ poly(n), there exist constants a and b such that Algorithm 2 with χ = δ/3 and λ = b ln(n)/δ², where δ = min{ac, a√(c/n)}, optimises OneMax(c) in expected time

O( n ln(n)/c⁷ )             if c ≤ 1/n, and
O( n^{9/2} ln(n)/c^{7/2} )  if c > 1/n.

Proof. Remark that Corollary 11 and 1/c ∈ poly(n) imply 1/δ ∈ poly(n), δ ∈ (0, 1) and χ < 1/3. We then use the partition A_j := {x ∈ {0, 1}^n : |x|_1 = j} to analyse the runtime. The probability of improving a solution at fitness level j by mutation is lower bounded by the probability that a single 0-bit is flipped and no other bit is flipped,

(n − j)(χ/n)(1 − χ/n)^{n−1} > (1 − j/n) χ (1 − χ/n)^n

It follows from Theorem 18 and χ < 1/3 that (1 − χ/n)^n ≥ (1 − χ) > 2/3. Therefore, χ(1 − χ/n)^n > (δ/3)(2/3) = 2δ/9, and it suffices to choose the parameters s_j := (1 − j/n)(δ/9) and s* := δ/(9n) so that (C1) is satisfied. In addition, (1 − χ/n)^n is the probability of not flipping any bit in the mutation, hence picking p_0 = 1 − χ satisfies (C2).


It now follows from Corollary 11 that for all γ ∈ (0, γ_0]

β(γ) p_0 ≥ γ(1 + 2δ)(1 − χ) = γ(1 + 2δ)(1 − δ/3)
         = γ(1 − δ/3 + 2δ − 2δ²/3)
         ≥ γ(1 − δ/3 + 2δ − 2δ/3) = γ(1 + δ)

Therefore, (C3) is satisfied with the given value of δ. Because 1/δ ∈ poly(n), we also have 1/s* ∈ poly(n), and by Corollary 5 there exists a constant b such that condition (C4) is satisfied with λ = (b/δ²) ln(m). All conditions are satisfied, and by Corollary 6, the expected optimisation time is

O( (1/δ⁵) ( mλ(1 + ln(1 + δ⁴λ)) + Σ_{j=1}^m 1/s_j ) )
= O( (1/δ⁵) ( (n ln(n)/δ²)(1 + ln(1 + δ² ln(n))) + (n/δ) Σ_{j=1}^n 1/j ) ).

By the definition of δ, it follows that δ = O(1/√n), so 1 + ln(1 + δ² ln(n)) = O(1). Furthermore, (n/δ) Σ_{j=1}^n 1/j = O(n ln(n)/δ) is dominated by the left term n ln(n)/δ². Hence, the optimisation time is E[T] = O(n ln(n)/δ⁷). The theorem follows by noting that δ = ac if c ≤ 1/n, and δ = a√(c/n) otherwise.

We now consider partial evaluation for LeadingOnes, giving the optimisation problem LeadingOnes(c). Literally, each product Π_{j=1}^i x_j, i ∈ [n], has the same probability c of contributing to the fitness value in each evaluation. The following result holds for the function.

Theorem 13. For any c ∈ (0, 1) with 1/c ∈ poly(n), there exist constants a and b such that Algorithm 2 with χ = δ/3 and λ = b ln(n)/δ², where δ = min{ac, a√(c/n)}, optimises LeadingOnes(c) in expected time

O( n ln(n)/c⁷ )                     if c ≤ 1/n, and
O( n^{9/2} ln(n)/c^{7/2} + n⁵/c³ )  if c > 1/n.

Proof. Conditions (C2)–(C4) are shown exactly as in the proof of Theorem 12. For condition (C1), use the canonical partition A_j := {x ∈ {0, 1}^n : LeadingOnes(x) = j}. The probability of improving a solution at fitness level j by mutation is lower bounded by the probability that the leftmost 0-bit is flipped, and no other bit is flipped,

(χ/n)(1 − χ/n)^{n−1} ≥ (1/n)(χ(1 − χ)) > 2δ/(9n)

We therefore choose s_j = s* = δ/(9n), and the runtime is

E[T] = O( (1/δ⁵) ( (n ln(n)/δ²)(1 + ln(1 + δ² ln(n))) + n²/δ ) )
     = O( n ln(n)/δ⁷ + n²/δ⁶ )

The result now follows by noting that δ = ac if c ≤ 1/n, and δ = a√(c/n) otherwise.

We have shown that non-elitist EAs with populations of polynomial size can optimise OneMax(c) and LeadingOnes(c) in expected polynomial runtime, measured precisely in partial evaluation calls, for very small c, i.e. 1/c ∈ poly(n). In particular, for any constant c ∈ (0, 1), the non-elitist EAs can optimise LeadingOnes(c) in O(n⁵) and OneMax(c) in O(n^{9/2} ln(n)). We now show that the classical (1+1) EA, summarised in Algorithm 3, already requires exponential runtime on OneMax(c) for any constant c < 1.

Algorithm 3 (1+1) EA (Partial Information)
1: Sample x_0 ∼ Unif({0, 1}^n).
2: for t = 0, 1, 2, . . . until termination condition met do
3:   x′ := x_t.
4:   Flip each bit position in x′ with probability 1/n.
5:   if F_c(x′) ≥ F_c(x_t) then
6:     x_{t+1} := x′.
7:   else
8:     x_{t+1} := x_t.
9:   end if
10: end for

Theorem 14. For any constant c ∈ (0, 1), the expected optimisation time of the (1+1) EA on OneMax(c) is e^{Ω(n)}.

Proof. We use Theorem 7 with X_t the Hamming distance from 0^n (the all-zero bitstring) to x_t, i.e. X_t := H(0^n, x_t), and b(n) := n. The use of F_c(x) implies that the algorithm can accept degraded solutions with fewer 1-bits than the current solution. This typically happens when no bit position of x′ is flipped except a 1-bit, and the evaluation of x′ does not recognise the change; denote such an event by E. To analyse Pr(E), denote by E_i the event that a specific bit position i is flipped and no other bit is, and F_c(x′) does not evaluate position i. Then

Pr(E_i) = (1 − c)(1/n)(1 − 1/n)^{n−1} ≥ (1 − c)/(ne)

Assuming that there are currently j 1-bits in x_t, the events E_i for those bits are non-overlapping. In addition, under the condition that one of those events E_i happens (so with probability j Pr(E_i)), we use X to denote the number of the remaining 1-bits of x_t being evaluated, and Y the number of evaluated 1-bits of x′. Indeed, X and Y are identical and independent variables following the same binomial distribution, so Pr(Y ≥ X) ≥ 1/2. We get

Pr(E) ≥ j · ((1 − c)/(ne)) · Pr(Y ≥ X) ≥ j(1 − c)/(2ne)

The drift is then

E[X_{t+1} − X_t | F_t] ≤ −1 · j(1 − c)/(2ne) + (n − j) · (1/n)

We have a negative drift when j(1 − c)/(2ne) ≥ (n − j)/n, i.e.

j ≥ n/(1 + (1 − c)/(2e)) = n ( 1 − (1 − c)/(2e + 1 − c) )

From that, for any constant ε ∈ (0, 1 − c), there exists

a(n) := n ( 1 − (1 − c − ε)/(2e + 1 − c) )

so that


E[X_{t+1} − X_t | F_t, X_t > a(n)] ≤ −ε/(2e)

With n sufficiently large, we have b(n) − a(n) = Ω(n), and condition (L1) of Theorem 7 is satisfied. Condition (L2) holds trivially due to the property of the Hamming distance |H(x_t, 0^n) − H(x_{t+1}, 0^n)| ≤ H(x_t, x_{t+1}) ≤ H(x_t, x′). So

(|X_t − X_{t+1}| | F_t) ≺ Z where Z := H(x_t, x′)

Then Z ∼ Bin(n, 1/n) and E[e^{λZ}] ≤ e for λ = ln(2). We also remark that if each bit position of x_0 is initialised with probability 1/2, then X_0 will be highly concentrated near n/2 ≤ a(n). It then follows from Theorem 7 that the (1+1) EA requires expected exponential runtime to optimise OneMax(c).

4. PARTIAL EVALUATION OF POPULATIONS

In the previous section, we saw that EAs with populations do not need complete information, but only a small amount of information about the problem in each solution evaluation, to discover a true optimal solution in expected polynomial runtime. One might wonder whether the same holds for the information contained within a population. We first motivate how such a lack of information about the population can arise. In real-life applications, the evaluation of solutions can be both time-consuming (e.g. requiring extensive simulation) and inaccurate. We associate a probability 1 − r with the event that the quality of a solution is not available to the algorithm. We model such a scenario in Algorithm 4 by assuming that the outcome of any comparison between two individuals is only available to the algorithm with probability r.

Algorithm 4 EAs (2-Tournament, Partially Eval. Pop.)
1: Sample P_0 ∼ Unif(X^λ), where X = {0, 1}^n.
2: for t = 0, 1, 2, . . . until termination condition met do
3:   for i = 1 to λ do
4:     Sample two parents x, y ∼ Unif(P_t).
5:     z := argmax{f(x), f(y)} with probability r, and z := x otherwise.
6:     Flip each bit position in z with probability χ/n.
7:     P_{t+1}(i) := z.
8:   end for
9: end for

Unlike Algorithm 2, the algorithm in this section uses complete information about the problem to evaluate solutions. However, the evaluation of individuals in the parent population is not systematic but random. Two individuals x and y are sampled uniformly at random from the population as parents (line 4). With probability r, the individuals are evaluated and the fitter individual is selected (line 5). Otherwise, the parent x is chosen arbitrarily, without regard to fitness. As before, an efficient implementation is possible, but that does not change the outcome of our analysis. The analysis is more straightforward than in the previous section.

Lemma 15. For any r ∈ (0, 1) and constant γ_0 ∈ (0, 1), there exists a constant a ∈ (0, 1) such that Algorithm 4 satisfies β(γ) ≥ γ(1 + 2δ) for all γ ∈ (0, γ_0), where δ = ar.

Proof. An individual among the best γ-portion of the population is selected if either (i) a uniformly chosen individual is taken as parent (with probability 1 − r), and this individual belongs to the best γ-portion, or (ii) the tournament selection happens (with probability r), and at least one of the selected parents belongs to the γ-portion. Hence

β(γ) ≥ γ(1 − r) + r(1 − (1 − γ)²) = γ(1 + (1 − γ)r)

So for all γ ∈ (0, γ_0], β(γ) ≥ γ(1 + (1 − γ_0)r), and the statement follows by choosing a = (1 − γ_0)/2.

The lemma can be generalised to k-tournaments; however, larger tournament sizes may cause overhead in the average number of evaluations per generation. This number is a random variable following Bin(kλ, r). Hence, we focus on k = 2. Similarly to the previous section, we consider r small, i.e. 1/r ∈ poly(n). The following theorem holds for the runtime on OneMax.

Theorem 16. There exist constants a and b such that Algorithm 4, under the condition r ∈ (0, 1) and 1/r ∈ poly(n), with χ = δ/3 and λ = b ln(n)/δ² where δ = ar, optimises OneMax in O(n ln(n)/r⁷).

Proof. We use the same partition as in Theorem 12, and the proof idea is similar. We first remark that by Lemma 15 and 1/r ∈ poly(n), we have 1/δ ∈ poly(n), δ ∈ (0, 1) and χ < 1/3. The probability of improving a solution at fitness level j by mutation is lower bounded by

(n − j)(χ/n)(1 − χ/n)^{n−1} > (1 − j/n)(2δ/9)

So we can pick s_j := (1 − j/n)(δ/9), s* := δ/(9n) and p_0 = 1 − χ so that (C1) and (C2) are satisfied. It now follows from Lemma 15 that for all γ ∈ (0, γ_0]

β(γ) p_0 ≥ γ(1 + 2δ)(1 − χ) = γ(1 + 2δ)(1 − δ/3) ≥ γ(1 + δ)

Therefore, (C3) is satisfied with the given value of δ. Because 1/δ ∈ poly(n), we also have 1/s* ∈ poly(n), and by Corollary 5 there exists a constant b such that condition (C4) is satisfied with λ = (b/δ²) ln(m). All conditions are satisfied, and by Corollary 6 the expected optimisation time is

O( (1/δ⁵) ( (n ln(n)/δ²)(1 + ln(1 + δ² ln(n))) + (n/δ) Σ_{j=1}^n 1/j ) ) = O( n ln(n)/δ⁷ )

The result now follows by noting that δ = ar.

When the parameter r is small, the fitness function is evaluated only a few times per generation. More precisely, the fitness function is evaluated 2N times, where N ∼ Bin(λ, r). The proof of Theorem 4 pessimistically assumes that the fitness function is evaluated 2λ times per generation. A more sophisticated analysis taking this into account may lead to tighter bounds than those above. We can also use Algorithm 4 to optimise LeadingOnes. The following theorem holds for its runtime.

Theorem 17. There exist constants a and b such that Algorithm 4, under the condition r ∈ (0, 1) and 1/r ∈ poly(n), with χ = δ/3 and λ = b ln(n)/δ² where δ = ar, optimises LeadingOnes in O(n ln(n)/r⁷ + n²/r⁶).


Proof. We use the same approach as in the proof of Theorem 13, including the partition. We therefore have the same s_j = s* = δ/(9n), and the expected runtime for any r ∈ (0, 1) with 1/r ∈ poly(n) is

O( n ln(n)/δ⁷ + n²/δ⁶ ) = O( n ln(n)/r⁷ + n²/r⁶ )
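The selection step of Algorithm 4 is simple enough to sketch directly. The Python code below (with hypothetical helper names) implements line 5 of Algorithm 4 and empirically checks the formula β(γ) ≥ γ(1 − r) + r(1 − (1 − γ)²) = γ(1 + (1 − γ)r) from the proof of Lemma 15, using a population with distinct fitness values so that the bound holds with equality.

```python
import random

def select_parent(pop, fitness, r, rng):
    """Line 5 of Algorithm 4: the tournament outcome is only available
    with probability r; otherwise parent x is taken without evaluation."""
    x, y = rng.choice(pop), rng.choice(pop)
    if rng.random() < r:  # fitness information available: evaluated tournament
        return x if fitness(x) >= fitness(y) else y
    return x  # no fitness information: arbitrary parent

# Empirical check with gamma = 0.5 and r = 0.8: an individual from the
# best gamma-portion should be selected with probability about
# gamma*(1 + (1 - gamma)*r) = 0.5 * 1.4 = 0.7.
rng = random.Random(0)
pop = list(range(100))  # 100 individuals with distinct fitnesses 0..99
hits = sum(select_parent(pop, lambda v: v, 0.8, rng) >= 50
           for _ in range(50000))
rate = hits / 50000  # close to 0.7
```

The sketch makes the trade-off visible: lowering r saves fitness evaluations (about 2rλ per generation in expectation) at the cost of weaker selective pressure, which the analysis compensates for with a lower mutation rate and a larger population.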

Our results have shown that EAs do not need the fitness values of all individuals in their populations to optimise a function efficiently. Another interpretation is that competition between every individual before reproduction is not necessary for efficient evolution. Note that non-elitism can be considered a way of reducing the competitiveness in populations. These observations have many analogies with evolution in biology.

5. CONCLUSIONS

We have analysed rigorously the runtime of evolutionary algorithms under circumstances where only partial information about fitness is available. Two scenarios have been studied. In partial evaluation of solutions, only a small amount of information about the problem is revealed in each fitness evaluation. In partial evaluation of populations, only a few individuals in the population are evaluated, and the fitness values of the other individuals are missing. For both scenarios, we have proved that non-elitist evolutionary algorithms satisfying a set of specific conditions can optimise many functions in expected polynomial time even when very little information is available. The conditions often imply a small enough mutation rate and/or a large enough population size. The results emphasise the importance of populations in evolution. Future work should evaluate the advantages of populations in optimisation under imprecision and uncertainty, such as in noisy or stochastic optimisation.

Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 618091 (SAGE) and from the British Engineering and Physical Science Research Council (EPSRC) grant no EP/F033214/1 (LANCS).

6. REFERENCES

[1] D.-C. Dang and P. K. Lehre. Refined upper bounds on the expected runtime of non-elitist populations from fitness-levels. To appear in the Proceedings of the 16th Annual Conference on Genetic and Evolutionary Computation (GECCO'14), Vancouver, Canada, 2014.
[2] S. Droste. Analysis of the (1+1) EA for a dynamically bitwise changing OneMax. In Proceedings of the 2003 International Conference on Genetic and Evolutionary Computation, GECCO'03, pages 909–921, Berlin, Heidelberg, 2003. Springer-Verlag.
[3] S. Droste. Analysis of the (1+1) EA for a noisy OneMax. In Proceedings of the 2004 International Conference on Genetic and Evolutionary Computation, GECCO'04, pages 1088–1099, Berlin, Heidelberg, 2004. Springer-Verlag.
[4] B. Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502–525, 1982.
[5] Y. Jin. Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm and Evolutionary Computation, 1(2):61–70, 2011.
[6] Y. Jin and J. Branke. Evolutionary optimization in uncertain environments - a survey. IEEE Transactions on Evolutionary Computation, 9(3):303–317, 2005.
[7] P. K. Lehre. Negative drift in populations. In Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, PPSN'10, pages 244–253, Berlin, Heidelberg, 2010. Springer-Verlag.
[8] P. K. Lehre. Fitness-levels for non-elitist populations. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO'11, pages 2075–2082, New York, NY, USA, 2011. ACM.
[9] P. K. Lehre. Drift analysis. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO'12, pages 1239–1258, New York, NY, USA, 2012. ACM.
[10] P. K. Lehre and X. Yao. On the impact of mutation-selection balance on the runtime of evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 16(2):225–241, 2012.
[11] B. L. Miller and D. E. Goldberg. Genetic algorithms, selection schemes, and the varying effects of noise. Evolutionary Computation, 4:113–131, 1996.
[12] D. S. Mitrinović. Elementary Inequalities. P. Noordhoff Ltd, Groningen, 1964.
[13] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[14] P. S. Oliveto, J. He, and X. Yao. Time complexity of evolutionary algorithms for combinatorial optimization: A decade of results. International Journal of Automation and Computing, 4(3):281–293, 2007.

APPENDIX

Theorem 18 (Bernoulli's inequality [12]). For any integer n ≥ 0 and any real number x ≥ −1, it holds that (1 + x)^n ≥ 1 + nx.

Theorem 19 (Jensen's inequality [12]). For any function f(x) convex on [α, β] and a_i ∈ [α, β] with i ∈ [n],

f( (1/n) Σ_{i=1}^n a_i ) ≤ (1/n) Σ_{i=1}^n f(a_i)

Theorem 20 (Chebyshev's inequality [13]). For any random variable X with finite expected value μ and finite non-zero variance σ², it holds that Pr(|X − μ| ≥ dσ) ≤ 1/d² for any d > 0.

Lemma 21. Let X and Y be identical independent discrete random variables with finite expected value μ and finite non-zero variance σ². Then

Pr(X = Y) ≥ (1 − 1/d²)²/(2dσ + 1) for any d ≥ 1

Proof. For any k, ℓ ∈ R with ⌈k⌉ ≤ ⌊ℓ⌋, we have

Pr(X = Y) = Σ_{i∈Z} Pr(X = i)² ≥ Σ_{i=⌈k⌉}^{⌊ℓ⌋} Pr(X = i)²
          ≥ ( Σ_{i=⌈k⌉}^{⌊ℓ⌋} Pr(X = i) )² / (⌊ℓ⌋ − ⌈k⌉ + 1)
          ≥ Pr(k < X < ℓ)² / (⌊ℓ⌋ − ⌈k⌉ + 1)

The second inequality is due to Theorem 19 with the convex function f(x) = x². It suffices to pick k = μ − dσ and ℓ = μ + dσ, so that ⌊ℓ⌋ − ⌈k⌉ + 1 ≤ 2dσ + 1 and, by Theorem 20, we get

Pr(X = Y) ≥ Pr(|X − μ| < dσ)²/(2dσ + 1)
          ≥ (1 − Pr(|X − μ| ≥ dσ))²/(2dσ + 1)
          ≥ (1 − 1/d²)²/(2dσ + 1)