Linkage Identification Based on Epistasis Measures to Realize Efficient Genetic Algorithms

Masaharu Munetomo
Center for Information and Multimedia Studies, Hokkaido University, North 11, West 5, Sapporo 060-0811 Japan.
E-mail: [email protected]

Abstract – Genetic algorithms (GAs) process building blocks (BBs), which are mixed and tested through genetic recombination operators. To process BBs effectively, linkage identification, which detects sets of tightly linked loci, is essential. This paper proposes linkage identification with epistasis measures (LIEM), which detects linkage groups based on a pairwise epistasis measure.

I. Introduction

Genetic algorithms (GAs) are considered a robust search technique for difficult problems such as combinatorial optimization. GAs repeatedly apply genetic operators such as crossover, mutation, and selection to a population of strings, each of which represents a point in the problem domain. Recombination operators such as crossover exchange substrings between a pair of strings, and perturbation operators such as mutation flip a bit of a string. Generally speaking, the essence of genetic search lies in processing building blocks (BBs), that is, short, high-performing sub-solutions, through recombination operators to speed up optimization.

GAs have been applied to a wide spectrum of problems. A number of researchers have reported that GAs solve their problems quite efficiently; others have reported quite poor performance in their applications. Various reasons for the poor performance have been considered, such as the inherent GA-difficulty of the problem itself or an initial population that is too small. However, in most cases where GAs perform poorly, the recombination mechanism, which is considered essential to genetic optimization, does not work effectively because of improper encoding or poorly designed recombination operators, mainly because the importance of tight linkage in encoding and recombination is ignored.

Ensuring tight linkage is essential for recombination operators to work effectively. A set of loci that belong to a BB should be tightly linked in order to survive the genetic optimization process. In this paper, the word linkage refers to a set of loci that belong to the same BB, and a linkage group or linkage set is defined as a set of tightly linked loci that may form a BB.

0-7803-7282-4/02/$10.00 ©2002 IEEE

This paper proposes a linkage identification procedure that detects linkage groups to realize efficient genetic recombination. The paper is organized as follows. Section II reviews the history of linkage identification, from messy GAs to current work on linkage identification algorithms. Section III describes the proposed algorithm, linkage identification with epistasis measures (LIEM), in detail. Sections IV and V present and discuss empirical investigations on nonlinear functions of a sum of GA-difficult trap functions, for which accurate linkage groups have been considered difficult to detect.

II. Linkage identification

Decomposability of a problem into subproblems is essential for effective search with optimization algorithms based on a divide-and-conquer strategy. For example, a linear combination of sub-functions is clearly decomposable: a function $h(x_1, x_2) = f(x_1) + g(x_2)$ can be optimized by solving the optimization sub-problems for $f(x_1)$ and $g(x_2)$ separately. In genetic optimization, problem decomposition based on BBs and their mixing by recombination operators are essential for effective search. This does not mean that strict decomposability is necessary for effective GA search; rather, quasi-decomposability is required. Problems without any decomposability cannot be solved effectively with any divide-and-conquer strategy.

Classical GAs did not consider problem decomposition explicitly; they process BBs indirectly with general or problem-specific crossover operators. To search for BBs directly, the messy GA (mGA)[2] processes under- or over-specified strings that represent schemata. The mGA does not have an explicit linkage representation; it tries to learn linkage indirectly by searching for BBs through cut and splice operators applied to a number of BB candidates (schemata) generated in its primordial phase. The gene expression mGA (gemGA)[5, 4, 6], on the other hand, assigns a weight to each locus in order to check local optimality with respect to that locus and to detect candidate loci that belong to the same linkage group. Another approach, employed in the linkage learning GA (LLGA)[3], is based on a two-point-like crossover operator applied to circularly encoded strings. The crossover of the LLGA tends to preserve tight linkage if the problem is loosely separable into sub-functions that are not uniformly scaled (that is, the contribution of each sub-function to the overall function value is not uniform).

These GAs try to identify linkage groups during the optimization process. Linkage identification procedures, in contrast, generate linkage groups directly from an initial population before optimization starts. The linkage identification by nonlinearity check (LINC)[9, 8] pioneered direct linkage identification. The LINC detects arbitrary nonlinearity for each pair of loci, based on a population of strings, to obtain sets of tightly linked loci. The LINC algorithm is shown in figure 1. It calculates the following fitness changes under bitwise perturbations for each pair of loci $(i, j)$:

$$
\begin{aligned}
\Delta f_i(s) &= f(\ldots \bar{s}_i \ldots\ldots) - f(\ldots s_i \ldots\ldots) \\
\Delta f_j(s) &= f(\ldots\ldots \bar{s}_j \ldots) - f(\ldots\ldots s_j \ldots) \\
\Delta f_{ij}(s) &= f(\ldots \bar{s}_i \ldots \bar{s}_j \ldots) - f(\ldots s_i \ldots s_j \ldots),
\end{aligned}
\tag{1}
$$

where $f(s)$ is the fitness of $s$ and $\bar{s} = 1 - s$ ($0 \to 1$ or $1 \to 0$) stands for a perturbation.

    algorithm LINC
      P = initialize N strings
      for each s in P
        for i = 0 to length-1
          s' = Perturb(s, i);
          df1 = f(s') - f(s);
          for j = i to length-1
            if i != j then
              s' = Perturb(s, j);
              df2 = f(s') - f(s);
              s'' = Perturb(s', i);
              df12 = f(s'') - f(s);
              if |df12 - (df1 + df2)| > epsilon then
                /* nonlinearity detected between i and j */
                adding j to the linkage_set[i];
                adding i to the linkage_set[j];
              endif
            endif
          endfor
        endfor
      endfor

Figure 1: The Linkage Identification by Nonlinearity Check (LINC)

In the LINC algorithm, if $\Delta f_{ij}(s) \neq \Delta f_i(s) + \Delta f_j(s)$ is satisfied for at least one string $s$ in a population $P$, we consider loci $i$ and $j$ to be tightly linked and include them in the same linkage group. (In practice, we should allow a small amount of error in the condition; therefore, we employ $|\Delta f_{ij}(s) - (\Delta f_i(s) + \Delta f_j(s))| > \epsilon$.)
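The nonlinearity check of figure 1 can be sketched in runnable form as follows. This is a minimal illustration, not the author's implementation: strings are assumed to be Python lists of 0/1, and the names `perturb`, `linc`, `toy_fitness`, and `all_strings` are introduced here only for the example. The toy fitness has a single nonlinear interaction, between bits 0 and 1.

```python
def perturb(s, i):
    """Return a copy of bit list s with position i flipped (s_i -> 1 - s_i)."""
    t = list(s)
    t[i] = 1 - t[i]
    return t


def linc(fitness, population, epsilon=1e-9):
    """Pairwise nonlinearity check over every string in the population."""
    length = len(population[0])
    linkage_set = [set() for _ in range(length)]
    for s in population:
        f_s = fitness(s)
        for i in range(length):
            s_i = perturb(s, i)
            df_i = fitness(s_i) - f_s
            for j in range(i + 1, length):
                df_j = fitness(perturb(s, j)) - f_s
                df_ij = fitness(perturb(s_i, j)) - f_s
                # Link i and j if the joint change is not the sum of the
                # individual changes for this string.
                if abs(df_ij - (df_i + df_j)) > epsilon:
                    linkage_set[i].add(j)
                    linkage_set[j].add(i)
    return linkage_set


def toy_fitness(s):
    # Hypothetical test function: bits 0 and 1 interact, bit 2 is independent.
    return s[0] * s[1] + s[2]


all_strings = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(linc(toy_fitness, all_strings))  # [{1}, {0}, set()]
```

Because a pair is linked as soon as any one string violates additivity, the result depends on whether the population contains such a witness string, which is why population sizing matters for this procedure.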


It is essential to check the above condition for all strings in a sufficiently large population. Even if GA-easy linearity is detected in one string, this does not mean the problem is GA-easy, because a strong nonlinearity may be detected in another string (for example, a GA-difficult trap function is linear along its deceptive attractor). Therefore, in principle it is necessary to check the condition for all possible strings in order to detect the nonlinearity of the problem. It is practically impossible to check the condition for all possible strings, whose number is of order $O(2^l)$, where $l$ is the string length. Instead, when we assume that the maximum order of BBs is fixed to $k$ that satisfies $k$ […] $e_{13} > e_{12} > e_{14} > e_{16} > e_{15}$, and second, we pick three loci according to the sorted $e_{ij}$ and obtain {3, 2, 4} as tightly linked with locus 1; consequently, the obtained linkage group is {1, 2, 3, 4}.


Note that loci with a zero epistasis measure ($e_{ij} = 0$) should not be included in the linkage group: under the above definition of the epistasis measure, a pair of loci $(i, j)$ is not considered tightly linked when $e_{ij} = 0$.

To apply the LIEM, we need to assume the maximum length of BBs, that is, the fixed number of loci $k$ defined above. In this paper, we call this order the difficulty number $d$ because it represents the difficulty of the problem for genetic recombination. According to the population sizing discussed in the previous section[9], an initial population of $O(2^k)$ strings is necessary to obtain correct linkage groups. Put another way, it is more natural to argue that when the initial population size is fixed, the maximum length of BBs that can be detected is fixed as well.

Figure 3 gives a detailed description of the LIEM algorithm. The algorithm starts by setting the initial population size $N = c \cdot 2^d$ and randomly generating a population $P$ of $N$ strings. After the initialization, the epistasis measures e[i][j] are calculated for each pair of loci $(i, j)$ from the fitness changes of all strings $s$ in the population under perturbation (perturb(s,i) means that the $i$-th position of string $s$ is perturbed: $s_i \leftarrow \bar{s}_i = 1 - s_i$). The calculated epistasis measures are then sorted. The linkage group of locus $i$ (l[i][*]) is obtained by selecting loci in the order of the sorted measures (excluding those whose measure is zero or, in practice, smaller than a small value epsilon). The time complexity of the LIEM algorithm is $O(2^k l^2)$ because perturbations are applied to all $O(l^2)$ pairs of loci for each string in a properly sized population of $O(2^k)$ strings.
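The procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the author's code: the function names (`random_population`, `epistasis_matrix`, `liem`), the default `c=4`, and the fixed random seed are hypothetical choices. The sketch also assumes that a linkage group consists of locus i itself plus at most d-1 partners, matching the output format of figure 4, and it exploits the symmetry of the measure to evaluate each pair only once.

```python
import random


def perturb(s, i):
    """Return a copy of bit list s with position i flipped."""
    t = list(s)
    t[i] = 1 - t[i]
    return t


def random_population(length, difficulty, c=4, seed=0):
    """Generate N = c * 2^d random bit strings (c and seed are assumptions)."""
    rng = random.Random(seed)
    n = c * 2 ** difficulty
    return [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]


def epistasis_matrix(fitness, population):
    """e[i][j] = max over strings of |df_ij - (df_i + df_j)|."""
    length = len(population[0])
    e = [[0.0] * length for _ in range(length)]
    for s in population:
        f_s = fitness(s)
        df = [fitness(perturb(s, i)) - f_s for i in range(length)]
        for i in range(length):
            s_i = perturb(s, i)
            for j in range(i + 1, length):
                df_ij = fitness(perturb(s_i, j)) - f_s
                ep = abs(df_ij - (df[i] + df[j]))
                if ep > e[i][j]:
                    e[i][j] = e[j][i] = ep  # the measure is symmetric
    return e


def liem(fitness, population, difficulty, epsilon=1e-9):
    """For each locus, keep the partners with the largest epistasis measure."""
    e = epistasis_matrix(fitness, population)
    length = len(e)
    groups = []
    for i in range(length):
        ranked = sorted((j for j in range(length) if j != i),
                        key=lambda j: e[i][j], reverse=True)
        group = [i]
        for j in ranked[:difficulty - 1]:
            if e[i][j] <= epsilon:  # (near-)zero measure: not linked
                break
            group.append(j)
        groups.append(group)
    return groups
```

A call such as `liem(fitness, random_population(length, d), d)` then returns one list per locus. Unlike a strict nonlinearity check, only the ranking of the measures matters here, so weak cross-group epistasis does not by itself merge groups.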

IV. Empirical investigations

The LIEM algorithm is expected to identify correct linkage groups not only for quasi-linear combinations of GA-difficult sub-functions but also for some (weakly) nonlinear combinations of the sub-functions. It is difficult to validate the effectiveness of the LIEM on a wide spectrum of practical problems. Therefore, in this empirical study, we perform experiments on linear and nonlinear functions of a sum of GA-difficult trap functions. In the following simulation experiments, we employ the sum of trap functions defined as

$$
h(x) = \sum_{i=1}^{10} f_i(u_i), \tag{3}
$$

$$
f_i(u_i) =
\begin{cases}
4 - u_i & \text{if } 0 \le u_i \le 4 \\
5 & \text{if } u_i = 5,
\end{cases}
\tag{4}
$$

where $u_i$ is the number of ones (unitation) in the $i$-th 5-bit substring of $x$. To control the difficulty of linkage identification by changing the nonlinearity of the whole fitness function, we employ $f(x) = h(x)^n$ ($n = 1, 2, \cdots$) as test functions.
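The test functions of equations (3) and (4) are simple to write down. The sketch below assumes that $x$ is a list of 0/1 values whose length is a multiple of the block size and that sub-function $i$ reads the $i$-th consecutive block; the names `trap`, `h`, and `f` and the parameter `k` are illustrative.

```python
def trap(u, k=5):
    """Trap sub-function of equation (4): k - 1 - u for u < k, and k at u = k."""
    return k if u == k else k - 1 - u


def h(x, k=5):
    """Sum of trap functions over consecutive k-bit substrings, equation (3)."""
    return sum(trap(sum(x[b:b + k]), k) for b in range(0, len(x), k))


def f(x, n=1, k=5):
    """Test function f(x) = h(x)^n."""
    return h(x, k) ** n


print(h([0] * 50))       # 40: the deceptive attractor (all zeros)
print(h([1] * 50))       # 50: the global optimum (all ones)
print(f([1] * 50, n=2))  # 2500
```

Within a block, fitness increases as ones are removed until the block is all zeros, while the all-ones block scores highest; this is what makes each sub-function deceptive.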

Table 1 shows the percentage of linkage groups correctly identified by the LIEM and the LINC. (Even if the strings are randomly encoded, the same results should be obtained owing to the nature of the algorithms.)

Table 1: % of linkage correctly identified

    n (in h(x)^n)    LIEM    LINC
    1                100     100
    2                100     0
    3                94      0
    4                76      0
    5                60      0

The LINC is vulnerable to nonlinearity of the overall fitness function because the algorithm detects strict nonlinearity to obtain linkage groups. We can introduce a threshold in the condition; however, such a modification only removes minor effects caused by a small amount of noise and cannot solve the above problem, which is caused by relatively weak nonlinearity. The LIEM, on the other hand, achieves robust linkage identification for these nonlinear test functions.

The linkage groups obtained for the functions $h(x)$, $h(x)^2$, and $h(x)^5$ are shown in figure 4. The numbers after ":" represent the set of loci tightly linked to the locus specified before ":". For example, "0 : 0 1 2 3 4" means that the set of loci {0, 1, 2, 3, 4} forms the linkage group for locus 0. For $h(x)$, a linear combination of trap functions, and $h(x)^2$, a weakly nonlinear one, the LIEM obtains correct results in which each 5-bit sub-function is detected as a linkage group. For $h(x)^5$, a relatively strongly nonlinear function of the sum of trap functions, the LIEM sometimes fails to identify linkage groups. For example, the linkage group obtained for locus 48 is {48, 46, 47, 49, 0}, whereas it should be {48, 46, 47, 49, 45}.

To understand why linkage identification fails for nonlinear functions, we plot the values of the epistasis measures for the test functions in figure 5 ($h(x)$), figure 6 ($h(x)^2$), and figure 7 ($h(x)^5$). In the figures, the X-axis and Y-axis represent a pair of loci (bit positions), and the Z-axis shows the value of the epistasis measure of the pair. As figure 5 shows, the landscape of the epistasis measures for the linear function $h(x)$ has a clear distinction between tightly linked and loosely linked pairs of loci. For $h(x)^2$ (figure 6), the landscape becomes more complex because of the nonlinearity; however, it is still not difficult to detect linkage groups. For $h(x)^5$, a function with strong nonlinearity, the landscape of the epistasis measures has a complex structure, which makes it difficult for the LIEM to identify correct linkage groups. This difficulty is natural because the nonlinearity introduced by $h(x)^5$ is large compared with the nonlinearity of the trap functions, and it becomes difficult to detect the difference in epistasis between them.

V. Discussions

Through the empirical studies above, we have shown that the LIEM can obtain correct linkage groups for weakly nonlinear functions of a sum of GA-difficult trap functions. This result implies that correct linkage can be detected when the overall nonlinearity of the fitness function is small enough compared with the nonlinearity inside a linkage group. This is because the LIEM observes the difference between the strong epistasis among loci inside a linkage group and the relatively weak epistasis among other locus pairs.

In this paper, we adopt a simple pairwise epistasis measure based on the condition of the LINC. Although the proposed definition seems reasonable, it is still important to seek a more general definition of epistasis measures. In our definition in equation (2), we calculate the maximum difference between simultaneous and individual fitness changes under perturbation. This definition considers only the single maximum instance in the population and does not deal with the overall fitness landscape. By replacing the maximum with some nonlinear function, we may design another measure that takes the population-wide fitness landscape into consideration.

Besides designing epistasis measures based on the LINC, we may consider another definition based on the linkage identification by nonmonotonicity detection (LIMD) proposed elsewhere[10, 11]. The LIMD generates linkage groups by detecting violations of its monotonicity condition. The monotonicity condition is based on the idea that a monotonic function is easy for all search algorithms, including GAs, and that violations of monotonicity should be detected to find GA-difficulty. Although the basic idea of the LIMD differs from that of the LINC, which is based on epistasis, we might define an epistasis-like measure from the LIMD condition that may give more accurate identification of linkage groups.
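The behavior of the LINC on $h(x)^n$ for $n \ge 2$ in Table 1 can be made concrete with a short calculation. The following is a sketch for $n = 2$ only, and the shorthand $a$, $b$, $c$ is introduced here for illustration; it is not notation from the paper. For a string $s$ with $h = h(s)$, write $a = \Delta h_i(s)$, $b = \Delta h_j(s)$, and $c = \Delta h_{ij}(s)$ for the changes in $h$ under perturbation. Then for $f = h^2$:

$$
\begin{aligned}
\Delta f_i &= (h + a)^2 - h^2 = 2ha + a^2, \\
\Delta f_j &= (h + b)^2 - h^2 = 2hb + b^2, \\
\Delta f_{ij} &= (h + c)^2 - h^2 = 2hc + c^2, \\
\Delta f_{ij} - (\Delta f_i + \Delta f_j) &= 2h\,(c - a - b) + (c^2 - a^2 - b^2).
\end{aligned}
$$

If loci $i$ and $j$ lie in different trap sub-functions, $h$ is additive across them, so $c = a + b$ and the difference reduces to $2ab$. Every single-bit flip changes the value of the trap function in equation (4), so $a \neq 0$ and $b \neq 0$, and a strict nonlinearity check reports linkage for such pairs; this is consistent with the 0% entries for the LINC. If $i$ and $j$ lie in the same sub-function, $c - a - b$ can be nonzero and is multiplied by $2h$, so the within-group values tend to exceed the cross-group value $2ab$ when $h$ is large relative to $|a|$ and $|b|$. That gap is what a ranking-based measure relies on.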

VI. Conclusion

Linkage identification is essential for genetic recombination to work effectively and reliably. This paper proposes a linkage identification procedure based on a pairwise epistasis measure calculated for each pair of loci. The key idea of the procedure is to detect the difference between strong and weak epistasis. The proposed LIEM is a simple yet powerful procedure that generates correct linkage groups even for some nonlinear functions of a sum of trap functions, for which the LINC cannot obtain correct results. Through empirical studies that illustrate the landscape of the epistasis measures for several test functions, we show the effectiveness of linkage identification based on these measures.

References

[1] Y. Davidor. Epistasis variance: A viewpoint on GA-hardness. Foundations of Genetic Algorithms 1, pages 23–35, 1991.
[2] D. E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems, 3(5):493–530, 1989.
[3] G. R. Harik and D. E. Goldberg. Learning linkage. Foundations of Genetic Algorithms 4, pages 247–262, 1996.
[4] H. Kargupta. The gene expression messy genetic algorithm. Proceedings of the 1996 IEEE Conference on Evolutionary Computation, pages 814–819, Piscataway, NJ, 1996. IEEE Service Center.
[5] H. Kargupta. SEARCH, evolution, and the gene expression messy genetic algorithm. Unclassified Report LA-UR 96-60, Los Alamos National Laboratory, Los Alamos, NM, 1996.
[6] H. Kargupta and S. Bandyopadhyay. Further experiments on the scalability of the GEMGA. Proceedings of the Parallel Problem Solving From Nature V, pages 315–324, 1998.
[7] B. Naudts, D. Suys, and A. Verschoren. Epistasis as a basic concept in formal landscape analysis. In Proceedings of the Seventh International Conference on Genetic Algorithms, 1997.
[8] Masaharu Munetomo and David E. Goldberg. Designing a genetic algorithm using the linkage identification by nonlinearity check. Technical Report IlliGAL Report No.98014, University of Illinois at Urbana-Champaign, 1998.
[9] Masaharu Munetomo and David E. Goldberg. Identifying linkage by nonlinearity check. Technical Report IlliGAL Report No.98012, University of Illinois at Urbana-Champaign, 1998.
[10] Masaharu Munetomo and David E. Goldberg. Identifying linkage groups by nonlinearity/nonmonotonicity detection. In Proceedings of the 1999 Genetic and Evolutionary Computation Conference, 1999.
[11] Masaharu Munetomo and David E. Goldberg. Linkage identification by non-monotonicity detection for overlapping functions. Evolutionary Computation, 7(4), 1999.


    algorithm LIEM
      N = c*2^difficulty;
      P = initialize N strings;
      /* Calculate epistasis measure e[i][j] */
      for i = 0 to l-1
        for j = 0 to l-1
          e[i][j] = 0;
          if i != j then
            for each s in P
              s' = perturb(s, i);
              f1 = fitness(s') - fitness(s);
              s'' = perturb(s, j);
              f2 = fitness(s'') - fitness(s);
              s''' = perturb(s', j);
              f12 = fitness(s''') - fitness(s);
              ep[s] = |f12 - (f1+f2)|;
              if(ep[s] > e[i][j]) then
                e[i][j] = ep[s];
              endif
            endfor
          endif
        endfor
      endfor
      /* Generate linkage group l[i][k] where k = 0, 1,..., difficulty-1 */
      for i = 0 to l-1
        for j = 0 to l-1
          id[j] = j;
        endfor
        /* id[k] is the locus with the k-th largest measure after sorting */
        sort e[i][j] together with id[j] in descending order of e[i][j];
        /* select linkages */
        for k = 0 to difficulty-1
          if(e[i][k] > epsilon)
            l[i][k] = id[k];
          else
            break;
          endif
        endfor
      endfor

Figure 3: The Linkage Identification with Epistasis Measure


Results for $h(x)^5$:

    0 : 0 1 2 3 4
    1 : 1 0 2 3 4
    2 : 2 0 1 3 4
    3 : 3 0 1 2 4
    4 : 4 0 1 2 3
    5 : 5 8 6 7 9
    6 : 6 5 7 8 9
    7 : 7 6 9 8 5
    8 : 8 5 6 7 9
    9 : 9 6 7 8 5
    10 : 10 14 13 11 6
    11 : 11 8 43 14 0
    12 : 12 8 18 6 0
    . .
    43 : 43 11 21 42 44
    44 : 44 42 43 40 41
    45 : 45 46 47 41 43
    46 : 46 45 48 47 0
    47 : 47 49 48 46 45
    48 : 48 46 47 49 0
    49 : 49 47 46 48 43

Results for $h(x)$ and $h(x)^2$ (two listings):

    0 : 0 1 2 3 4
    1 : 1 0 3 2 4
    2 : 2 0 1 4 3
    3 : 3 0 1 4 2
    4 : 4 0 1 3 2
    5 : 5 8 7 6 9
    6 : 6 8 7 5 9
    7 : 7 8 6 5 9
    8 : 8 5 6 7 9
    9 : 9 8 7 6 5
    10 : 10 14 13 12 11
    11 : 11 14 13 12 10
    12 : 12 13 11 10 14
    . .
    43 : 43 40 41 42 44
    44 : 44 43 41 42 40
    45 : 45 46 49 47 48
    46 : 46 48 45 47 49
    47 : 47 49 48 46 45
    48 : 48 46 47 49 45
    49 : 49 47 48 45 46

    0 : 0 1 2 3 4
    1 : 1 0 2 3 4
    2 : 2 0 1 4 3
    3 : 3 0 1 4 2
    4 : 4 0 1 2 3
    5 : 5 8 7 6 9
    6 : 6 8 7 5 9
    7 : 7 8 6 5 9
    8 : 8 5 6 7 9
    9 : 9 8 7 6 5
    10 : 10 13 12 11 14
    11 : 11 14 12 13 10
    12 : 12 10 11 13 14
    . .
    43 : 43 40 41 42 44
    44 : 44 40 41 42 43
    45 : 45 46 49 48 47
    46 : 46 48 45 47 49
    47 : 47 49 48 46 45
    48 : 48 46 47 49 45
    49 : 49 47 45 46 48

Figure 4: Results of the LIEM for test functions

Figure 5: Epistasis measures for $h(x)$

Figure 6: Epistasis measures for $h(x)^2$

Figure 7: Epistasis measures for $h(x)^5$