Minimum Population Search – Lessons from building a heuristic technique with two population members

Antonio Bolufé-Röhler

Stephen Chen

School of Mathematics and Computer Science
University of Havana
Havana, Cuba
[email protected]

School of Information Technology
York University
Toronto, Canada
[email protected]

Abstract—Population-based heuristics can be effective at optimizing difficult multi-modal problems. However, population size has to be selected correctly to achieve the best results. Searching with a smaller population increases the chances of convergence and the efficient use of function evaluations, but it also induces the risk of premature convergence. Larger populations can reduce this risk but can cause poor efficiency. This paper presents a new method specifically designed to work with very small populations. Computational results show that this new heuristic can achieve the benefits of smaller populations and largely avoid the risk of premature convergence.

Keywords—heuristic search; population-based methods; multi-modality; search efficiency; scalability

I. INTRODUCTION

Heuristic search methods can be roughly divided into two broad groups: single-point and population-based methods. Single-point heuristics are iterative methods which continuously improve a single solution. From a current point, a set of new solutions is generated and a decision is made whether or not to “move” to one of them. The generation of the new solutions usually follows a closeness criterion, so they are frequently called “neighbor solutions”. The search behavior of single-point heuristics can be seen as a walk through the search space. Since new solutions are created from only one solution, single-point heuristics strongly rely on local information. One of their main advantages is that they can directly advance toward the local optimum of the attraction basin where the current solution is located. Although the walk to the corresponding optimum can require an exponential number of steps in some cases, this direct advance toward (local) optima has proven to be successful on many optimization functions.

In multi-modal search spaces, single-point methods often require an escaping mechanism to avoid getting trapped in poor local optima. To avoid (premature) convergence, algorithms based on local search may apply different approaches such as accepting non-improving solutions, restarting from different solutions, or modifying the problem’s landscape. Nevertheless, finding good local optima may be a difficult task if the number of optima is very large, the function does not present an adequate global structure, interaction between variables occurs, or the search space is high dimensional.

A population is useful to deal with multi-modality, as it can lead to a better model of the function’s landscape. Population-based methods can increase exploration since several solutions are simultaneously kept in the population and information among them can be continuously exchanged. Using a population allows a search technique to identify and target the most promising regions of the search space, to find and follow gradients, and to learn distribution functions which capture the correlations among variables. By exploiting these techniques, population-based heuristics such as genetic algorithms (GA) [1], ant colony optimization (ACO) [2], particle swarm optimization (PSO) [3], estimation of distribution algorithms (EDA) [4], and differential evolution (DE) [5] have proven to be very effective optimization methods.

However, a new and fundamental issue arises when using population methods. The selection of the population size has been extensively addressed in the literature, and a brief review of the previous work reveals that there are many differing recommendations for population size. For example, the recommended population size for (standard) PSO ranges from 20-100 particles [3] while approximately 15 particles should be used in multi-swarm systems [6], the recommended population size in DE is also 20-100 members [5], and estimation of distribution algorithms usually require much larger populations of up to 2000 members [4]. The optimal population size can vary greatly with the search strategy and its specific implementation. The population size is also directly related to the convergence rate: larger populations lead to higher exploration, but they can also lead to fewer generations and lower efficiency when a fixed budget of function evaluations (FEs) is used.
Searching with a smaller population increases the chances of convergence and the efficient use of function evaluations, but it also reduces exploration and induces the risk of premature convergence. If these drawbacks can be limited, population-based heuristics might be able to benefit more from using smaller populations. A population with one member is indistinguishable from single-point search, so the minimum size for a population-based method is two.

The next section analyses the main features of single-point and population-based methods. This analysis leads to a discussion about how to search with a minimum population size in Section III. Section IV then provides an implementation of the proposed method, and computational results are presented in Section V. A discussion about the new search strategy is carried out in Section VI before the paper is summarized in Section VII.

II. BACKGROUND

Population-based methods require a population large enough to maintain diversity and ideally guarantee the exploration of many different search regions. Conversely, it is known that small populations may lead to a loss of diversity, stagnation, or premature convergence. However, if techniques for explicitly increasing diversity and exploration are included, it may be possible to benefit from the faster convergence of smaller populations without these negative features. For example, the recommended population size for EDAs should be the square root of the problem size [7], but such a large population is rarely used in practice. If an explicit diversity preservation technique such as over-selection is applied, then it is possible to achieve good results for EDA with much smaller populations (e.g. 50 for DIM=100) [8]. In GAs, the optimum population size varies depending on the problem, and highly multi-modal problems have been shown to benefit from very large populations (e.g. 500-1000 for DIM=10) [1]. But, depending on the problem's encoding, much smaller populations (e.g. 20) can also improve results [9]. For PSO, the recommended population size is in the range of 20-100 [3], but the use of dynamic neighborhoods, restarts, and mutation can lead to good results with a population as small as 5 [10]. In multi-swarm systems which have an explicit/separate diversification mechanism, good results can be obtained with a swarm size of 15 [11].

Independent of the heuristic method used, small population sizes have proven to be effective in fields such as Interactive Evolutionary Computation (IEC) and Large Scale Global Optimization (LSGO). Based on human-machine interactions, IEC algorithms need to perform as few evaluations as possible. Considerably reducing the population size has proven here to be an effective technique to increase efficiency [12].
In large scale optimization, as a result of the exponential increase in the search space volume versus the (usually) linear increase in function evaluation budgets, a higher efficiency in the use of evaluations is also required [13]. However, depending on the search technique, the reduction of the population size has a limit. In DE, the minimum size is four since the three random solutions and the target solution need to be different [5]. If the population size is reduced to its absolute minimum, i.e. two, several advantages provided from a population-based approach could be compromised, for instance, the ability to simultaneously explore different regions of the search space. A minimum population size also increases the chance of losing diversity, and a small non-diversified population will make it much harder to escape from local optima. Thus, with small populations, it becomes more difficult to effectively optimize multi-modal functions.
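The four-member lower bound for DE mentioned above follows directly from its index selection: the three random members of the DE/rand/1 mutation must differ from each other and from the target. A minimal illustrative sketch (the function name and structure are ours, not from a DE library):

```python
import random

def de_rand_1_indices(pop_size, target):
    """Pick the three mutually distinct random indices r1, r2, r3 used by
    DE/rand/1, all different from the target index. With fewer than four
    members, no valid selection exists."""
    if pop_size < 4:
        raise ValueError("DE/rand/1 needs a population of at least 4")
    candidates = [i for i in range(pop_size) if i != target]
    return random.sample(candidates, 3)  # three distinct indices, target excluded
```

With exactly four members, the selection is forced: the three non-target indices are always chosen.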

Several techniques have been developed by single-point methods to deal with these disadvantages. For example, Tabu Search takes steps toward unexplored (i.e. not tabu) directions to promote exploration and to escape local optima [14]. Simulated Annealing (SA) may accept non-improving solutions with a probability based on a “temperature” parameter. By gradually reducing this parameter, SA transitions from a global to a more local search [15]. A search method with a minimum population could try to benefit from single-point search features such as increasing exploration steps toward unexplored regions or using a parameter-based technique to transition from global to local search.

Combining the strengths of single-point and population-based methods is not new in the field of (meta)heuristic algorithms. This hybridization can be stage-wise: first, running a population-based global search heuristic (such as ACO or GA), and then starting a single-point method (such as hill climbing) from the best found solutions [2]. Another approach is to simultaneously promote global and local search in an attempt to balance exploration and exploitation. Many heuristics implicitly implement this approach through mechanisms such as probabilistically accepting non-improving solutions in SA [15] or attracting particles toward the best found solution(s) in PSO [3]. Another idea is simplex methods, which use a population to determine the search direction and step size from a deterministic set of operations (e.g. reflection, expansion, and contraction) [16]. Many of these methods have concurrent processes for exploration and exploitation. If a small (local search) step leads to a better solution, methods such as PSO (which stores personal best positions) and DE (which uses elitist selection) will accept this solution and make future comparisons against it.
Since redirecting the search towards a new region may depend on finding a solution there with a better fitness than this improved solution, concurrent exploitation can interfere with the explorative processes of these search techniques. Many of these methods also have mechanisms to reduce exploration over time, such as the cooling schedule of SA to reduce temperature [15], the constriction factor in PSO to slow down particles [3], and the “self-scaling” of DE in which convergence leads to shorter difference vectors which lead to more convergence [5]. An explicit maximum step size is also common for many of these techniques, but they often do not have minimum step sizes. In the absence of a minimum step size which could control the convergence rate, population-based heuristics usually depend on using populations that are large enough to limit the loss of diversity and thus avoid premature convergence.

Although not part of “standard” implementations, crowding and niching techniques also exist that can slow or prevent convergence on multi-modal problems. In niching, a population can split itself across different parts of the search space by using multiple, minimally-interacting subpopulations [17]. In crowding, each new solution is compared against a subset of the population, and the new solution replaces the one most similar to it if it is better [18]. The use of these types of explicit techniques for controlling convergence could be an essential part of a heuristic method with a minimum population size.
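The crowding rule described above can be sketched in a few lines. This is a simplified illustration (names are ours): for brevity it compares the new solution against the whole population rather than the subset used in the cited scheme.

```python
import numpy as np

def crowding_replace(population, fitness, new_sol, new_fit):
    """Crowding replacement sketch: the new solution replaces the most
    similar existing member, but only if it improves on that member's
    fitness (minimization assumed)."""
    dists = [np.linalg.norm(new_sol - p) for p in population]
    i = int(np.argmin(dists))        # index of the most similar member
    if new_fit < fitness[i]:         # replace only if better
        population[i] = new_sol
        fitness[i] = new_fit
    return population, fitness
```

Because replacement is restricted to the nearest member, distant members survive even when they are worse, which is what slows convergence on multi-modal problems.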

III. MINIMUM POPULATION SEARCH

The generation of new solutions is a key difficulty that a population-based method with only two members will face. New solutions in population-based heuristics need to achieve a balance between exploiting the best features of their “parents” and maintaining enough diversity to keep exploring the search space. With only two members, the primary use of recombination may lead to poor results [9]. The small population provides few building blocks to combine, and this limited diversity will likely lead to premature convergence. Learning a distribution function with only two samples of the search space is also not feasible. One operation that two points do support is subtraction, which yields a difference vector. Difference vectors are directly or indirectly used in several heuristic methods such as DE [5], PSO [3], and Nelder-Mead (NM) [16]. Difference vectors are essentially slopes between two points, and they provide an effective way to identify exploitable gradients. Even with a small population, it is possible to generate new trial points by adding a difference vector to a population member (1), as in DE. However, since only two population members are available, xb has to be one of x1 or x2. As a consequence, this procedure will only generate solutions in the subspace determined by x1 and x2, i.e. the line that joins these two points.

xp = xb + F · (x1 − x2)    (1)
Methods such as DE overcome this “difficulty” by having larger populations. Furthermore, after generating the “perturbed” vector (xp), the actual trial solution is created through recombination with a separate target solution. This recombination step allows DE to search beyond the lines that are defined by the difference vectors [5]. A way to perform this recombination with only two members is to combine xp with the solution that was not chosen as the base. However, since this point will be on the same line as the difference vector, this recombination process will have a negligible exploratory effect. An alternative way to generate a trial point would be to take an orthogonal step away from the x1-x2 line. This orthogonal step will promote exploration, and it could be an effective substitute for the recombination usually performed in DE.

Following these ideas, a simple search technique can be developed for a minimum population size. During each iteration, two new solutions could be generated by taking a difference vector step away from each member and then an orthogonal step away from the x1-x2 line. By selecting the best two points among the old and new population members (i.e. elitist selection), the population could be evolved. Although the use of difference vectors to exploit the best features of the current population and the use of the orthogonal step to explore into the “missing dimension(s)” can be considered a coherent balance of exploration and exploitation, the small population and elitist selection criteria can still lead to premature convergence. This can be especially problematic for multi-modal problems where the use of such a small population may not be enough to model the topography of the whole search space.
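The combination of a difference-vector step and an orthogonal step can be sketched in 2D as follows. This is a paraphrase of the idea, not the authors' reference implementation; in two dimensions the orthogonal direction is simply the difference vector rotated by 90 degrees.

```python
import numpy as np

def trial_point(x1, x2, base, F, O_step):
    """Generate a trial point from a base member (one of x1, x2):
    step along the x1-x2 difference vector, then step perpendicular
    to the x1-x2 line."""
    d = x1 - x2                          # difference vector
    orth = np.array([-d[1], d[0]])       # 90-degree rotation: orthogonal in 2D
    orth = orth / np.linalg.norm(orth)   # unit orthogonal direction
    return base + F * d + O_step * orth
```

With O_step = 0 the trial point stays on the x1-x2 line (as in (1)); a nonzero O_step moves it into the dimension that the difference vector alone cannot reach.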

To avoid premature convergence, it is possible to apply a technique designed to control the convergence rate and maintain diversity. For this purpose, the thresheld convergence technique is a good candidate [19]. Its underlying idea is to control convergence by disallowing solutions which are too close to already existing population members. Applying thresheld convergence to minimum population search is straightforward. Once a point on the x1-x2 line has been generated using a difference vector, the length of the orthogonal step can be adjusted to have the new trial point fall into an “acceptable” range from its parent solution. To permit a final convergence, the threshold distance is continuously decreased as the search proceeds (see Algorithm 1).

IV. IMPLEMENTATION

The initial implementation of MPS has been tested on two-dimensional search spaces. Working in two dimensions helps isolate the concepts of MPS, and it is easier to graphically describe and plot the algorithm's search behavior. In two dimensions, each iteration of MPS starts with the generation of “line” points along the subspace (line) determined by the two population members. This is done by adding the difference vector formed by x1 and x2 to each population member (base) xi. In (2), Fi is a scaling factor which determines the sign and size of the step from the base vector along the x1-x2 line.

linei = xi + Fi · (x1 − x2)    (2)
The step made along the difference vector can be viewed as an exploitative component. To add exploration, the actual trial points are generated by adding to each “line” point a step perpendicular to the x1-x2 line (3). As can be seen in (3), the orthogonal step is also scaled by a factor denoted as Ostep_i. The signs of both scaling factors (Fi and Ostep_i) determine the direction of the difference vector and orthogonal step relative to the base vector, and their absolute values determine the size of each step. Lastly, the final distance from the new trial point (triali) to its “parent” solution (xi) must fall between the minimum (min_step) and maximum (max_step) distances fixed by the threshold.

triali = linei + Ostep_i · orth    (3)

The distance between a “line” point and its parent solution must be smaller than the maximum allowed step (max_step), so Fi is drawn with a uniform distribution from [−max_step/dist, max_step/dist], where dist is the distance between the two population members. The size of the orthogonal step must then guarantee that the distance from triali to the base vector xi stays within the [min_step, max_step] interval. Thus, the Ostep_i factor is selected with a uniform distribution from [min_orthi, max_orthi]. The min_orthi and max_orthi values are calculated by (4) and (5), respectively, in which dbi is the Euclidean norm (distance) between the “line” point (linei) and its corresponding parent solution (xi).

min_orthi = sqrt(max(min_step² − dbi², 0))    (4)

max_orthi = sqrt(max(max_step² − dbi², 0))    (5)

Algorithm 1  Minimum Population Search
  Generate initial population
  Repeat
    Calculate threshold function (Equation 6)
    Generate offspring solutions (Fig. 1)
    Update population
  Until stopping criteria are met
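Putting Equations (2)–(5) together, the offspring generation of one iteration might look like the following sketch. Variable names, the random sign of the orthogonal step, and the omission of boundary clamping are our assumptions, not the paper's reference code.

```python
import numpy as np

def mps_offspring(x1, x2, min_step, max_step, rng):
    """One MPS offspring step in 2D under the threshold [min_step, max_step]:
    a difference-vector step (2) plus an orthogonal step (3) whose size is
    bounded by (4) and (5)."""
    diff = x1 - x2
    dist = np.linalg.norm(diff)
    orth = np.array([-diff[1], diff[0]]) / dist   # unit vector orthogonal to x1-x2
    offspring = []
    for xi in (x1, x2):
        # Step along the difference vector, bounded by max_step (Eq. 2).
        F = rng.uniform(-max_step / dist, max_step / dist)
        line = xi + F * diff
        db = np.linalg.norm(line - xi)
        # Orthogonal step sized so the trial stays within [min_step, max_step]
        # of its parent (Eqs. 4 and 5), with a random sign for the direction.
        min_orth = np.sqrt(max(min_step**2 - db**2, 0.0))
        max_orth = np.sqrt(max(max_step**2 - db**2, 0.0))
        O = rng.uniform(min_orth, max_orth) * rng.choice([-1.0, 1.0])
        offspring.append(line + O * orth)
    return offspring
```

Since the trial's squared distance from its parent is db² + O², the bounds (4) and (5) guarantee that this distance always lands in [min_step, max_step].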

Fig. 1. Visualization of MPS search process in 2D: x1 and x2 are the current population members, the crosses are the difference vectors (2) and the diamonds are the trial points (after the orthogonal deviation and clamping). The discontinuous circle lines show the minimum and maximum step thresholds around x1 and x2.

Before evaluating the new trial points, their feasibility is checked and they are clamped to the boundaries if they are outside the feasible region. If the distance from the clamped solution to any of the two current population members is less than min_step, then the clamped solution is rejected, i.e., not evaluated. The two best solutions are selected to form the new population from the current population and those trial points which were evaluated. This process is repeated until a stopping condition is met.

The key parts of thresheld convergence are the min_step and max_step values which control the convergence rate. They are updated by a rule similar to that used in previous attempts to control convergence for PSO [20] and DE [21] in which an initial threshold is selected that then decays over the course of the search process. Equation (6) shows how min_step is calculated (max_step = 2 · min_step). In (6), α represents a fraction of the main space diagonal, FEs is the total available amount of function evaluations, k is the number of evaluations used so far, and γ is the parameter that controls the decay rate of the threshold. The current implementation of MPS uses α=0.3 and γ=3.

min_step = α · diagonal · ([FEs − k] / FEs)^γ    (6)
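Equation (6) is straightforward to implement. The sketch below uses the reported α=0.3 and γ=3 as defaults; the function name is ours.

```python
def min_step(k, total_fes, diagonal, alpha=0.3, gamma=3):
    """Decaying threshold of Eq. (6): starts at alpha * diagonal and
    decays to zero as the evaluation budget is consumed. max_step is
    twice this value."""
    return alpha * diagonal * ((total_fes - k) / total_fes) ** gamma
```

With γ=3 the decay is polynomial: the threshold shrinks slowly at first (keeping the two members apart and the search global) and rapidly near the end of the budget, permitting final convergence.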

To allow the convergence control mechanism to work effectively, the initial points should be at least the initial threshold value (α·diagonal) away from each other. In two dimensions, a simple way to achieve this is to select both initial points to be on the space diagonal. For example, if the search space is bounded by the same lower and upper bound on each dimension (as in the used benchmark functions), then the initial points could be x1 = (bound/2, bound/2) and x2 = (−bound/2, −bound/2). Fig. 1 shows the search strategy of MPS in a two-dimensional search space.
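The diagonal initialization can be sketched as follows, assuming a search space of [−bound, bound] in each dimension as in the benchmark functions; the helper name is ours.

```python
import numpy as np

def init_population(bound, alpha=0.3, dim=2):
    """Place the two members on the main space diagonal of
    [-bound, bound]^dim so they start at least alpha * diagonal apart."""
    x1 = np.full(dim, bound / 2.0)
    x2 = np.full(dim, -bound / 2.0)
    diagonal = np.linalg.norm(np.full(dim, 2.0 * bound))  # main diagonal length
    assert np.linalg.norm(x1 - x2) >= alpha * diagonal
    return x1, x2
```

The two points are separated by half the diagonal length, which satisfies the α=0.3 requirement with room to spare.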

V. COMPUTATIONAL RESULTS

A set of experiments has been designed to test the effectiveness of the proposed algorithm. These experiments have been performed using the Black-Box Optimization Benchmark (BBOB) functions [22]. There are 24 BBOB functions divided into five sets: 1 – separable functions, 2 – functions with low or moderate conditioning, 3 – unimodal functions with high conditioning, 4 – multi-modal functions with adequate global structure, and 5 – multi-modal functions with weak global structure. In Table I, some key attributes of the functions are indicated. Different instances can be generated for each function, e.g. each instance has a different optimal value (shifted in f-space).

TABLE I
BBOB FUNCTIONS

Set   #   Function Name                               s   u   gs
 1    1   Sphere                                      X   X   X
      2   Ellipsoidal, original                       X   X   X
      3   Rastrigin                                   X
      4   Büche-Rastrigin                             X
      5   Linear Slope                                X   X
 2    6   Attractive Sector                               X
      7   Step Ellipsoidal                                X
      8   Rosenbrock, original                            X
      9   Rosenbrock, rotated                             X
 3   10   Ellipsoidal, rotated                            X   X
     11   Discus                                          X   X
     12   Bent Cigar                                      X
     13   Sharp Ridge                                     X
     14   Different Powers                                X
 4   15   Rastrigin, rotated                                  X
     16   Weierstrass                                         X
     17   Schaffers F7                                        X
     18   Schaffers F7, moderately ill-conditioned            X
     19   Composite Griewank-Rosenbrock F8F2                  X
 5   20   Schwefel
     21   Gallagher’s Gaussian 101-me Peaks
     22   Gallagher’s Gaussian 21-hi Peaks
     23   Katsuura
     24   Lunacek bi-Rastrigin

Names and selected attributes of the 24 functions in the BBOB problem set – separable (s), unimodal (u), global structure (gs).

The experiments include comparisons to other population-based heuristics with distinctive search strategies (i.e. UMDA, PSO, and DE) and the simplex-based method Nelder-Mead. The UMDA algorithm is a standard implementation using Gaussian density functions and truncation selection [4]. The implementation of standard PSO [3] is the same as that described more fully in [20]. The DE method is the highly common and frequently effective variant labeled DE/rand/1/bin [5]. The Nelder-Mead implementation is based on the fminsearch method in Matlab [16] with a modification to allow bound constraints.

A. Component Analysis

The minimum population search heuristic has two main steps: the use of difference vectors to generate the “line” points along the x1-x2 line, and the orthogonal step for generating the actual trial points away from this line. The first step allows the exchange of information between the population members. However, the limited diversity of a very small population makes difference vectors alone insufficient as an exploration mechanism. The orthogonal step plays a role similar to mutation – it provides the necessary diversity to avoid premature convergence. In a system such as differential evolution, the recombination between the perturbed solution and another target solution allows DE to search beyond the lines that are defined by the difference vectors [5]. Evolutionary strategies (ES) may use a Gaussian distribution to mutate a solution and perform exploration [23]. Both approaches could have been used in MPS instead of the orthogonal step. An experiment was performed to test the effectiveness of these alternate techniques. Results in Table II compare the performance of using recombination (MPS_REC), Gaussian mutation (MPS_MUT), and the orthogonal step (MPS_ORTH). The variation based on recombination (MPS_REC) takes a component from the “line” point and another from the population member not used as a base.
The variation based on Gaussian mutation (MPS_MUT) adds to each component the product of a Gaussian random variable (with µ=0 and σ=1) and the norm of the difference vector. Multiplying by the norm makes mutation scale as the search proceeds; this is similar to the self-adaptation of single-point evolutionary strategies [23]. This comparison shows that MPS_ORTH outperforms the alternative variations based on DE and ES on almost all functions. The increased exploration provided by the orthogonal step makes it a more effective search component than simple recombination or mutation. The step leads the search to regions orthogonal to the gradient formed by the two population members, providing a more reliable mechanism for maintaining diversity. Thus, this newly designed search mechanism is better suited for heuristics with small populations and for problems with multi-modal fitness functions.

To maintain diversity in the population, the size of the orthogonal step is as important as its direction. The MPS_ORTH variation only uses a uniform distribution between zero and the length of the difference vector for the size of the orthogonal step. To further improve exploration, MPS includes a mechanism for controlling convergence which determines an acceptable minimum and maximum size for this step. A minimum step size may avoid premature convergence on multi-modal functions and stalling on unimodal functions. Results in Table II show that MPS (with thresheld convergence) leads to the best performance on 23 of the 24 BBOB functions.

B. Optimal Population Sizes

Minimum population search has been designed to perform well when using the smallest possible population. Thus, it should provide better performance than other heuristic methods when the population size is reduced. The next set of experiments attempts to confirm this hypothesis. Table III shows the results for several search techniques (UMDA, PSO, DE, NM, and MPS) with different population sizes (100, 50, 25, 10, 5, 2) using a budget of 500 FEs. Among the 24 BBOB functions, F15 (Rastrigin Rotated), F16 (Weierstrass), F17 (Schaffers F7), and F19 (Composite Griewank-Rosenbrock) were selected to show the efficiency in performance. These functions are characterized as multi-modal, with good global structure and no ill-conditioning. Since Nelder-Mead and MPS use fixed population sizes, results for them are only provided for a size of two (although Nelder-Mead actually uses a population of three members).

In Table III, UMDA, PSO, and DE can be seen to have their best performance with different population sizes. Based on statistical methods, UMDA has its best results with large populations (50 and 100). PSO and DE have their best results for population sizes of 10 and 25, respectively. However, when the population size is reduced to small values (2 and 5), the performances of UMDA, PSO, and DE all drop. These results show that even for low dimensions such as DIM=2, the search strategies of UMDA, PSO, and DE all require relatively large population sizes to provide good results.
TABLE II
COMPONENT ANALYSIS

Set   F   MPS_REC    MPS_MUT    MPS_ORTH   MPS
 1    1   7.21E+00   2.59E+00   7.00E−04   2.06E−08
      2   6.00E+05   5.26E+05   8.72E+03   7.84E+00
      3   3.65E+01   2.62E+01   1.22E+01   9.82E−01
      4   6.78E+01   5.37E+01   1.06E+01   1.44E+00
      5   2.06E+01   1.26E+01   1.15E+01   7.56E−05
 2    6   1.11E+04   1.56E+02   6.14E+02   7.71E−05
      7   2.58E+01   3.54E+00   3.72E−01   2.56E−02
      8   3.26E+03   5.67E+02   7.00E−01   3.95E−01
      9   1.77E+03   4.63E+02   1.77E+00   2.36E−01
 3   10   1.27E+05   3.05E+04   1.70E+03   1.24E+01
     11   8.10E+04   1.65E+03   1.45E+01   1.17E+01
     12   3.58E+06   3.57E+04   1.46E+02   1.40E+01
     13   4.73E+02   1.47E+02   1.90E+01   7.82E−01
     14   6.53E+00   1.25E+00   3.68E−03   4.59E−04
 4   15   6.27E+01   1.28E+01   1.12E+01   9.26E−01
     16   1.19E+01   5.38E+00   1.09E+00   5.45E−02
     17   6.86E+00   1.92E+00   1.42E+00   7.27E−03
     18   2.35E+01   6.87E+00   2.36E+00   6.87E−01
     19   2.41E+00   6.98E−01   3.01E−01   1.84E−02
 5   20   5.11E+02   1.49E+00   9.74E−01   4.83E−01
     21   6.64E+00   2.22E+00   2.93E+00   5.10E−01
     22   2.58E+01   1.04E+01   6.28E+00   3.36E−01
     23   3.75E+00   2.83E+00   1.85E+00   5.77E−01
     24   1.79E+01   8.19E+00   9.68E+00   2.37E+00

Comparison of MPS with variations based on DE and ES. Results are the mean error over 25 trials from known optimum with 500 FEs.

TABLE III
COMPARISON ON DIFFERENT POPULATION SIZES

F15: Rastrigin Rotated
Size   UMDA      DE        PSO       NM        MPS
100    1.88e+0   1.81e+0   3.65e−1   --        --
50     5.14e−1   1.09e+0   1.24e+0   --        --
25     1.36e+0   9.16e−1   5.75e−1   --        --
10     8.96e+0   2.49e+0   6.61e−1   --        --
5      4.28e+1   1.11e+1   1.23e+0   --        --
2      9.55e+1   4.64e+1   9.06e+0   9.82e+0   5.31e−1

F16: Weierstrass
Size   UMDA      DE        PSO       NM        MPS
100    1.75e−1   3.32e−1   2.05e−1   --        --
50     2.85e−1   1.31e−1   1.15e−1   --        --
25     2.07e−1   5.68e−2   2.59e−1   --        --
10     2.52e+0   1.66e+0   3.40e−2   --        --
5      2.93e+1   5.91e+0   7.00e−2   --        --
2      5.38e+1   2.97e+1   1.52e+0   3.78e+0   1.61e−2

F17: Schaffers F7
Size   UMDA      DE        PSO       NM        MPS
100    2.24e−1   2.14e−1   3.34e−3   --        --
50     5.85e−3   7.99e−2   1.20e−1   --        --
25     1.48e−1   4.53e−2   6.02e−3   --        --
10     1.69e+0   2.15e−1   3.94e−3   --        --
5      9.51e+0   2.90e+0   6.98e−2   --        --
2      2.35e+1   1.19e+1   2.82e+0   1.75e+1   9.68e−4

F19: Composite Griewank-Rosenbrock
Size   UMDA      DE        PSO       NM        MPS
100    4.39e−2   2.96e−2   1.22e−2   --        --
50     1.46e−2   3.36e−2   2.16e−2   --        --
25     4.10e−2   1.06e−2   2.33e−2   --        --
10     6.69e−1   1.09e−1   8.70e−3   --        --
5      7.92e+0   2.04e−1   1.02e−2   --        --
2      1.68e+1   2.43e−1   4.07e−1   1.45e+0   1.02e−2

Mean error from known optimum with different population sizes.

The results in Table III also confirm that the optimal population size is algorithm-dependent. Performing well on a multi-modal search space for even DIM=2 requires a population size large enough to escape/avoid local optima and to search multiple attraction basins. Algorithms such as Nelder-Mead (with a population of size 3) do not provide competitive results against UMDA, PSO, and DE, which can use larger populations. However, MPS provides results that are similar to (or better than) those of UMDA, PSO, and DE for all of the tested population sizes. These results show that it is possible to search effectively on multi-modal spaces with a population-based heuristic that uses the minimum population size of two.

C. Reduced Evaluation Budgets

An advantage of smaller populations is the ability to perform more generations with a fixed budget of FEs. More generations increase the chances that a search technique converges over the best found regions. The following experiments analyze the behavior of several population-based methods including MPS, DE, PSO, and UMDA with varying budgets of allowed function evaluations. Population sizes for UMDA, DE, and PSO are 100, 25, and 10, respectively (as per their best results in Table III). Nelder-Mead results are not included for two reasons. First, its results (mean error) are comparatively much worse than those from the other algorithms, and second, its deterministic search strategy causes NM to have minimal variation in performance with different amounts of FEs. Results are again presented for BBOB F15, F16, F17, and F19 to focus mainly on multi-modal functions with good global structure (Figs. 2 and 3).

As the amount of FEs is reduced, the number of generations drops, and the difficulty of detecting and converging over the best regions of multi-modal search spaces increases. Figs. 2 and 3 show how the performance of UMDA, PSO, DE, and MPS degrades as the amount of available FEs is reduced. Results show that with enough FEs (from 2000 to 5000), all of the methods provide similar performance. However, with low amounts of FEs (500 and 1000), performance often decreases noticeably. For all the tested functions, MPS provides the best performance with low FEs, and it consistently outperforms the other heuristics when only 500 FEs are used.

D. General Results

The final set of experiments compares the performance of all the heuristic methods for each of the BBOB functions. The results in Table IV show that MPS provides good performance for the targeted multi-modal functions and for unimodal and separable problems as well. In BBOB sets 4 and 5 (multi-modal functions with adequate and weak global structure), MPS reports the best results on 6 of the 10 functions.

Fig. 2. Mean error over 25 trials for various amounts of FEs on functions F15 (Rastrigin Rotated) and F16 (Weierstrass).

Fig. 3. Mean error over 25 trials for various amounts of FEs on functions F17 (Schaffers F7) and F19 (Composite Griewank-Rosenbrock).

E. Preliminary Results for Three Dimensions

Preliminary results for minimum population search in three dimensions (DIM=3) are presented in Table V. The implemented version of MPS uses a population of three members instead of two. To generate the "plane" points within the (DIM−1)-dimensional hyperplane defined by the three members, a linear combination of the three individuals is used instead of difference vectors. After the plane points have been generated, the rest of the heuristic proceeds as in the two-member version. Table V shows that MPS provides the best results on 8 of the 10 functions from BBOB sets 4 and 5.
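The three-member generation scheme described above can be sketched as follows. This is an illustrative reconstruction from the description, not the authors' implementation: the function name, the step-size parameters (`min_step`, `max_step`), and the random-weight scheme are all assumptions.

```python
import numpy as np

def mps3_trial_point(pop, min_step, max_step, rng=None):
    """Sketch of trial-point generation for a three-member MPS variant.

    A "plane" point is a linear combination of the three members with
    weights summing to 1 (so it lies in the plane they define), followed
    by an orthogonal step out of that plane.
    """
    rng = np.random.default_rng() if rng is None else rng
    pop = np.asarray(pop, dtype=float)             # shape (3, d)
    # Random weights summing to 1: the combination stays in the plane.
    w = rng.random(3)
    w /= w.sum()
    plane_pt = w @ pop
    # Orthonormal basis of the plane, spanned by two difference vectors.
    basis, _ = np.linalg.qr((pop[1:] - pop[0]).T)  # columns span the plane
    # Orthogonal step: random direction with its in-plane part removed.
    r = rng.standard_normal(pop.shape[1])
    ortho = r - basis @ (basis.T @ r)
    norm = np.linalg.norm(ortho)
    if norm > 1e-12:                               # degenerate when d = 2
        plane_pt = plane_pt + ortho * rng.uniform(min_step, max_step) / norm
    return plane_pt
```

With three members lying in the z = 0 plane of a 3-D space, for example, the returned point's x and y coordinates fall inside the triangle of the members, while its z coordinate is the orthogonal step.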

VI. DISCUSSION

The use of small populations and increased efficiency are key features of minimum population search. These features should make MPS a promising method for large-scale global optimization. With increasing dimensionality, the volume of the search space grows exponentially, and this exponential increase in candidate solutions leads to a decrease in performance for most available optimization algorithms [24]. Despite the exponential increase in problem complexity, the function evaluation budget was increased only linearly in [24]. As a result, one of the main challenges when scaling a heuristic method to extremely high-dimensional search spaces is an insufficient amount of FEs to explore the whole function landscape.

Small populations allow more generations and increase efficiency, but at the risk of losing diversity and converging prematurely. To maintain diversity, MPS generates new solutions a minimum distance away from the current population members. Due to the lack of an explicit mechanism for preserving diversity, standard implementations of UMDA, DE, and PSO require relatively large populations – even for DIM=2 (see Table III). With a large population, fewer generations are possible, and it becomes harder for these methods to converge with a very limited budget of FEs. This limitation leads to a drop in performance even on low-dimensional problems (see Figs. 2 and 3 and Tables IV and V). With increasing dimensions, it is expected that the optimal population size will also increase.

TABLE IV
PERFORMANCE IN DIM=2 FOR ALL BBOB FUNCTIONS

Set  F   UMDA      PSO       DE        NM        MPS
1    1   5.79E-05  1.83E-03  3.27E-05  0.00E+00  2.06E-08
1    2   9.27E-02  1.24E+01  3.51E-02  0.00E+00  7.84E+00
1    3   1.13E+00  1.62E+00  1.13E+00  9.49E+00  9.82E-01
1    4   2.04E+00  2.31E+00  1.65E+00  7.87E+00  1.44E+00
1    5   3.39E-02  0.00E+00  0.00E+00  0.00E+00  7.56E-05
2    6   1.08E+00  7.63E-02  1.89E-02  2.45E-06  7.71E-05
2    7   6.44E-02  5.45E-02  2.01E-02  9.74E+00  2.56E-02
2    8   1.07E-01  1.65E-01  1.11E-01  0.00E+00  3.95E-01
2    9   9.99E-02  9.74E-02  1.13E-01  0.00E+00  2.36E-01
3    10  2.46E+01  2.04E+01  2.98E-01  6.24E-01  1.24E+01
3    11  1.96E+01  1.85E+01  3.12E+01  2.57E-01  1.45E+01
3    12  5.42E+00  1.29E+01  2.72E+00  1.07E+00  1.40E+01
3    13  2.45E+00  2.20E+00  1.55E-01  8.37E-11  7.82E-01
3    14  1.26E-03  8.31E-03  5.52E-04  0.00E+00  4.59E-04
4    15  1.11E+00  1.68E+00  1.52E+00  1.03E+01  9.26E-01
4    16  5.17E-01  2.43E-01  5.80E-01  4.29E+00  5.45E-02
4    17  7.09E-02  1.30E-01  5.78E-02  1.78E+01  7.27E-03
4    18  6.63E-01  8.88E-01  2.46E-01  1.38E+02  6.87E-01
4    19  3.32E-02  3.71E-02  5.71E-02  1.39E+00  1.84E-02
5    20  5.99E-01  5.65E-01  4.03E-01  1.37E+00  4.83E-01
5    21  1.04E-01  7.92E-02  1.33E-01  1.98E+00  5.10E-01
5    22  7.72E-02  5.43E-02  1.61E-01  4.80E+00  3.36E-01
5    23  1.57E+00  1.61E+00  1.69E+00  1.26E+00  5.77E-01
5    24  2.97E+00  2.68E+00  3.18E+00  1.26E+01  2.37E+00
Mean error over 25 trials from the known optimum with 500 FEs.

TABLE V
PERFORMANCE IN DIM=3 FOR ALL BBOB FUNCTIONS

Set  F   UMDA      PSO       DE        NM        MPS
1    1   3.24E-03  2.60E-02  4.16E-03  0.00E+00  5.28E-07
1    2   9.33E+00  5.57E+02  4.84E+00  0.00E+00  1.78E+03
1    3   4.43E+00  6.08E+00  5.45E+00  1.62E+01  2.55E+00
1    4   7.85E+00  9.24E+00  8.20E+00  2.75E+01  4.27E+00
1    5   2.25E-01  0.00E+00  3.61E-02  0.00E+00  5.38E-01
2    6   3.54E+00  1.28E+00  9.62E-01  1.86E-02  3.37E-03
2    7   2.10E-01  3.40E-01  6.63E-02  8.50E+00  1.36E-01
2    8   1.30E+00  2.28E+00  1.53E+00  8.24E-03  6.71E-01
2    9   1.50E+00  2.04E+00  1.69E+00  2.61E-02  7.96E-01
3    10  1.03E+03  7.88E+02  4.05E+01  3.18E+00  1.77E+03
3    11  4.73E+01  3.13E+01  4.70E+00  5.70E+00  2.32E+01
3    12  2.35E+03  1.74E+03  6.32E+02  4.05E+00  8.87E+00
3    13  8.24E+00  1.31E+01  4.25E+00  7.98E-01  2.15E+00
3    14  1.57E-02  4.01E-02  1.25E-02  0.00E+00  1.25E-03
4    15  4.51E+00  6.25E+00  6.22E+00  2.11E+01  2.37E+00
4    16  2.90E+00  1.61E+00  2.65E+00  5.76E+00  2.45E-01
4    17  2.47E-01  4.48E-01  3.47E-01  1.46E+01  3.47E-02
4    18  1.22E+00  1.74E+00  1.25E+00  5.70E+01  5.19E-01
4    19  3.98E-01  5.61E-01  2.86E-01  2.45E+00  2.52E-01
5    20  1.41E+00  1.34E+00  1.27E+00  1.56E+00  7.96E-01
5    21  6.20E-01  5.01E-01  7.31E-01  4.23E+00  8.73E-01
5    22  9.02E-01  5.61E-01  8.95E-01  5.83E+00  1.56E+00
5    23  2.02E+00  2.06E+00  2.12E+00  1.24E+00  6.41E-01
5    24  7.41E+00  8.19E+00  9.47E+00  2.68E+01  4.98E+00
Mean error over 25 trials from the known optimum with 500 FEs.

Since n points define an (n−1)-dimensional hyperplane, the difference vectors formed by n points are constrained to an (n−1)-dimensional subspace. To search beyond that subspace, a heuristic based on difference vectors (e.g. NM, PSO, and DE) requires either a population larger than the function's dimensionality or an explicit mechanism to "step away" from this subspace. However, previous applications of DE [25] and PSO [26] to large-scale global optimization (DIM=1000) used population sizes of only 50. In MPS, exploitation is performed by generating points within the "population (hyper)plane", while an orthogonal "step away" guarantees exploration into the missing dimension. To allow exploration along all dimensions, the population size in MPS should be increased proportionally to the function's dimensionality, i.e. n = DIM. Compared to search techniques which seem to perform best with population sizes of n >> DIM (see Table III), the ability of MPS to scale to high-dimensional search spaces is very promising.

Another distinctive feature of MPS is that it has been specifically designed for multi-modal functions. Thus, MPS includes features which may weaken its performance on unimodal functions. For example, taking an orthogonal step away from the plane determined by the two (or three) population members, instead of following the difference-vector gradient, may worsen convergence on unimodal functions such as F2 (Ellipsoidal) and F5 (Linear Slope).
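The rank argument above can be checked numerically. The following NumPy sketch (illustrative, not the paper's code) shows that the difference vectors of n points span at most an (n−1)-dimensional subspace, and how an orthogonal "step away" direction can be obtained by projecting a random vector off that subspace:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n = 10, 3                     # a 10-dimensional space, 3 population members
pop = rng.standard_normal((n, d))

# Every pairwise difference vector is a linear combination of x_i - x_0,
# so the differences span at most an (n-1)-dimensional subspace.
diffs = np.array([pop[i] - pop[j] for i in range(n) for j in range(n) if i != j])
assert np.linalg.matrix_rank(diffs) <= n - 1   # rank 2 here, despite d = 10

# An orthogonal "step away" direction: remove the in-subspace component
# of a random vector by projecting onto a basis of the difference vectors.
basis, _ = np.linalg.qr((pop[1:] - pop[0]).T)  # columns span the subspace
r = rng.standard_normal(d)
ortho = r - basis @ (basis.T @ r)
assert np.allclose(diffs @ ortho, 0.0)         # orthogonal to every difference
```

No recombination of the three members can produce a component along `ortho`, which is why a difference-vector heuristic with n << DIM needs an explicit orthogonal step to reach the remaining dimensions.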
Heuristics such as PSO and DE are designed and expected to perform well on both unimodal and multi-modal problems. However, with the advent of hyper-heuristics which can select an appropriate heuristic for a given problem [27], a specialist technique for multi-modal search spaces is a useful addition to complement the dominant performance of NM (and other local optimization methods) on unimodal functions (see Tables IV and V).

VII. SUMMARY

Finding the optimal population size is a key issue for population-based methods. Smaller populations can provide advantages such as faster convergence and a more efficient use of the available function evaluations, but the risk of premature convergence increases. Based on the hypothesis that the optimal population size depends on the internal mechanics of each heuristic, this paper presents a new heuristic specifically designed to perform well with a minimum population size. This method is efficient and provides good results on multi-modal functions. Future work will focus on scaling MPS to high-dimensional problems.

REFERENCES

[1] K. Deb and S. Agrawal, "Understanding interactions among genetic algorithm parameters," Foundations of Genetic Algorithms, Morgan Kaufmann, 1999, pp. 265–286.
[2] M. Dorigo and L. M. Gambardella, "Ant Colony System: a cooperative learning approach to the traveling salesman problem," IEEE Transactions on Evolutionary Computation, 1(1), pp. 53–66, 1997.
[3] D. Bratton and J. Kennedy, "Defining a standard for particle swarm optimization," IEEE SIS, 2007, pp. 120–127.
[4] P. Larrañaga and J. A. Lozano, "Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation," Kluwer Academic Publishers, 2011, pp. 181–193.
[5] R. Storn and K. Price, "Differential Evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, 11:341–359, 1997.
[6] A. Bolufé-Röhler and S. Chen, "An analysis of sub-swarms in multi-swarm systems," Australasian AI, 2011, pp. 271–280.
[7] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz, "Bayesian optimization algorithm, population sizing, and time to convergence," GECCO, 2000, pp. 275–282.
[8] Y. Hong, S. Kwong, Q. Ren, and X. Wang, "Over-selection: an attempt to boost EDA under small population size," IEEE CEC, 2007, pp. 1075–1082.
[9] C. Reeves, "Using genetic algorithms with small populations," ICGA, 1993, pp. 92–99.
[10] J. Cabrera Fuentes and C. Coello Coello, "Micro-MOPSO: a multi-objective particle swarm optimizer that uses a very small population size," Multi-Objective Swarm Intelligent Systems, pp. 83–104, 2010.
[11] A. Bolufé-Röhler and S. Chen, "Multi-swarm hybrid for multi-modal optimization," IEEE CEC, 2012, pp. 1759–1766.
[12] H. Takagi, "Interactive evolutionary computation: fusion of the capabilities of EC optimization and human evaluation," Proceedings of the IEEE, 2001, pp. 1275–1296.
[13] J. Brest, A. Zamuda, B. Bošković, M. S. Maučec, and V. Žumer, "High-dimensional real-parameter optimization using self-adaptive differential evolution algorithm with population size reduction," IEEE CEC, 2008, pp. 2032–2039.
[14] F. Glover and M. Laguna, "Tabu Search," Kluwer Academic Publishers, Boston, 1997.
[15] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671–680, 1983.
[16] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, "Convergence properties of the Nelder-Mead simplex method in low dimensions," SIAM Journal on Optimization, vol. 9, pp. 112–147, 1998.
[17] R. Brits, A. P. Engelbrecht, and F. Van den Bergh, "A niching particle swarm optimizer," SEAL, 2002, pp. 692–696.
[18] R. Thomsen, "Multi-modal optimization using crowding-based differential evolution," IEEE CEC, 2004, pp. 1382–1389.
[19] S. Chen, C. Xudiera, and J. Montgomery, "Simulated annealing with thresheld convergence," IEEE CEC, 2012, pp. 1946–1952.
[20] S. Chen and J. Montgomery, "Particle swarm optimization with thresheld convergence," IEEE CEC, 2013.
[21] A. Bolufé-Röhler, S. Estévez-Velarde, A. Piad-Morffis, S. Chen, and J. Montgomery, "Differential evolution with thresheld convergence," IEEE CEC, 2013.
[22] N. Hansen, S. Finck, R. Ros, and A. Auger, "Real-parameter black-box optimization benchmarking 2009: noiseless functions definitions," INRIA Technical Report RR-6829, 2009.
[23] T. Bäck, F. Hoffmeister, and H. Schwefel, "A survey of evolution strategies," ICGA, 1991, pp. 2–9.
[24] K. Tang, X. Yao, P. N. Suganthan, C. MacNish, Y. Chen, C. Chen, and Z. Yang, "Benchmark functions for the CEC'2008 special session and competition on high-dimensional real-parameter optimization," Technical report, USTC, China, 2007.
[25] H. Wang, Z. Wu, S. Rahnamayan, and D. Jiang, "Sequential DE enhanced by neighborhood search for large scale global optimization," IEEE CEC, 2010, pp. 4056–4062.
[26] S. Zhao, P. N. Suganthan, and S. Das, "Dynamic multi-swarm particle swarm optimizer with subregional harmony search," IEEE CEC, 2010, pp. 1983–1990.
[27] E. K. Burke, E. Hart, G. Kendall, J. Newall, P. Ross, and S. Schulenburg, "Hyper-heuristics: an emerging direction in modern search technology," Handbook of Metaheuristics, Kluwer, 2003, pp. 457–474.
