Anti-correlation Measures in Genetic Programming

R I (Bob) McKay and H A Abbass
School of Computer Science, University of New South Wales at the Australian Defence Force Academy, Northcott Drive, Campbell, ACT 2600, Australia
rim,[email protected]

Abstract

We compare three diversity-preserving mechanisms, implicit fitness sharing, negative correlation learning, and a new form, root-quartic negative correlation learning, on a standard genetic programming problem, the 6-multiplexer. On this problem, root-quartic negative correlation learning significantly outperforms standard negative correlation learning, and marginally outperforms implicit fitness sharing. We analyse the difference between standard and root-quartic negative correlation learning, and provide a partial explanation for the improved performance.

Index terms: committee learning, fitness sharing, anti-correlation, genetic programming, population diversity

1. Introduction

Premature convergence is an important problem in genetic programming, as in other areas of evolutionary computation. A number of mechanisms for preserving diversity and avoiding premature convergence have been proposed for use in genetic programming, notably the island model [1] and implicit fitness sharing (IFS) [2,3]. Negative correlation learning (NCL) [4] was introduced in the field of evolutionary neural networks as an alternative mechanism for diversity preservation in committee learning systems. Here, we investigate the use of negative correlation, and introduce an alternative measure, root-quartic negative correlation (RTQRT-NCL). We describe some experiments on the 6-multiplexer problem which investigate the relative performance of IFS, NCL and RTQRT-NCL. We compare these experiments with similar comparisons between NCL and RTQRT-NCL acting on neural network ensembles [5]. We provide preliminary theoretical justification for the differences between NCL and RTQRT-NCL, and suggest further directions for research in negative correlation learning.

2. Background

2.1. Diversity Preserving Mechanisms

In applications of standard genetic programming to concept learning, the fitness fraw(i) for an individual is calculated as the sum of rewards reward(i(c)) for the individual cases c from a set C of cases:

$$f_{raw}(i) = \sum_{c \in C} reward(i(c))$$

In implicit fitness sharing, the reward is divided amongst all individuals making the same prediction:

$$f_{share}(i) = \sum_{c \in C} \frac{reward(i(c))}{\sum_{i' : i'(c) = i(c)} reward(i'(c))}$$
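
To make the contrast concrete, the following is a minimal Python sketch (our illustration, not the authors' implementation) of raw fitness and implicit fitness sharing for boolean concept learning. The 0/1 reward and the toy population are assumptions made purely for illustration.

    # Minimal sketch of raw fitness and implicit fitness sharing (IFS).
    # Assumption (not from the paper): predictions are booleans, and
    # reward(i(c)) is 1 for a correct prediction, 0 otherwise.

    def raw_fitness(predictions, targets):
        """f_raw(i): sum of rewards over all cases c in C."""
        return sum(1.0 for p, t in zip(predictions, targets) if p == t)

    def shared_fitness(population_predictions, targets, i):
        """f_share(i): each case's reward is divided amongst all
        individuals making the same prediction on that case."""
        my_preds = population_predictions[i]
        total = 0.0
        for c, target in enumerate(targets):
            if my_preds[c] != target:
                continue  # reward(i(c)) = 0, contributes nothing
            # Sum of rewards of all individuals i' with i'(c) == i(c).
            sharers = sum(1.0 for preds in population_predictions
                          if preds[c] == my_preds[c])
            total += 1.0 / sharers
        return total

    # Usage: three individuals, four cases.
    targets = [True, False, True, True]
    pop = [[True, False, True, False],   # individual 0: 3 correct
           [True, False, False, False],  # individual 1: 2 correct
           [False, True, True, True]]    # individual 2: 2 correct
    print([raw_fitness(p, targets) for p in pop])               # [3.0, 2.0, 2.0]
    print([shared_fitness(pop, targets, i) for i in range(3)])  # [1.5, 1.0, 1.5]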

Negative correlation learning makes an additional assumption, namely that the primary reward is a mean square error function:

$$f_{mse}(i) = \frac{1}{2C} \sum_{c \in C} Error^2$$

NCL adapts the reward with an additional term providing reward for individuals being negatively correlated with other individuals in the population (we use the notation i(c) to represent the output of individual i on case c, V(c) to represent the correct value for case c, and M_I(c) to represent the mean over all individuals from I):

$$f_{ncl}(i) = \frac{1}{C} \left[ \frac{1}{2} \sum_{c \in C} (i(c) - V(c))^2 + \lambda \sum_{c \in C} (i(c) - M_I(c)) \sum_{j \neq i} (j(c) - M_I(c)) \right]$$

Our initial interest in NCL stemmed from work on populations of partial functions in genetic programming. In that context, it is highly desirable to incorporate an information cost metric into the reward function, to avoid generating functions which specialise on only one or two cases. However, we have found it very difficult to combine information metrics with IFS, the diversity preserving mechanism we have been using until now. We tried NCL as an alternative, because it was easy to combine with information metrics, and also because it fitted well with ensemble methods of population evaluation, which are highly desirable for evaluating populations of partial functions. Our initial experiments were very disappointing, as NCL seemed unable to generate diversity comparable with IFS. Our initial explanation for this was that the NCL penalty function

$$penalty_{ncl}(i, c) = (i(c) - M_I(c)) \sum_{j \neq i} (j(c) - M_I(c))$$

could be re-written (since the deviations from the mean sum to zero, $\sum_{j \neq i} (j(c) - M_I(c)) = -(i(c) - M_I(c))$):

$$penalty_{ncl}(i, c) = -(i(c) - M_I(c))^2$$

That is, it can be reformulated purely in terms of differences from the mean. This seemed to us an important disadvantage, as increasing difference from the mean is not necessarily a good, or the only, way of increasing the spread of a population. This led us to a set of criteria for selecting penalty functions:
• The function should tend to maximise the distance between all networks, providing diversity
• The function should be dimensionally consistent with the error function (in this case, mean square error MSE)
• The penalty function should not be expressible directly in terms of differences from the mean
• The penalty function should not have a larger value than the error function
We were then led to an alternative penalty term, RTQRT-NCL:

$$penalty_{rtqrt}(i, c) = -\sqrt[4]{\frac{1}{I} \sum_{j \in I} (i(c) - j(c))^4}$$

Initially, we believed that RTQRT-NCL could not be reformulated in terms of differences from the mean (because of cross terms), and hence we expected improved performance relative to NCL. However, after achieving that performance, we discovered a reformulation of RTQRT-NCL in terms of differences from the mean:

$$penalty_{rtqrt}(i, c) = -\sqrt[4]{(i(c) - M_I(c))^4 + \frac{6}{I} (i(c) - M_I(c))^2 \sum_{j \in I} (j(c) - M_I(c))^2 - \frac{4}{I} (i(c) - M_I(c)) \sum_{j \in I} (j(c) - M_I(c))^3 + \frac{1}{I} \sum_{j \in I} (j(c) - M_I(c))^4}$$

And hence we were led to seek a better explanation of its improved performance. We attempt to provide a partial explanation in the discussion section of this paper.
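
For concreteness, the following minimal Python sketch (an illustration under the definitions above, not the original system) evaluates the NCL fitness and both penalty terms on real-valued outputs, and checks numerically that the NCL penalty collapses to minus the squared deviation from the mean. The toy outputs and targets are invented.

    # Minimal sketch of the NCL fitness and the two penalty terms defined above.

    def ncl_penalty(outputs, i):
        """penalty_ncl(i,c) = (i(c) - M_I(c)) * sum_{j != i} (j(c) - M_I(c))."""
        m = sum(outputs) / len(outputs)
        return (outputs[i] - m) * sum(o - m for j, o in enumerate(outputs) if j != i)

    def rtqrt_penalty(outputs, i):
        """penalty_rtqrt(i,c) = -( (1/I) * sum_{j in I} (i(c) - j(c))^4 )^(1/4)."""
        mean4 = sum((outputs[i] - o) ** 4 for o in outputs) / len(outputs)
        return -(mean4 ** 0.25)

    def ncl_fitness(case_outputs, targets, i, lam):
        """f_ncl(i): per-case half squared error plus lambda times the NCL
        penalty, averaged over the cases (cf. the definition of f_ncl above)."""
        per_case = [0.5 * (outputs[i] - v) ** 2 + lam * ncl_penalty(outputs, i)
                    for outputs, v in zip(case_outputs, targets)]
        return sum(per_case) / len(per_case)

    # One case with four individuals' outputs; check that the NCL penalty
    # equals -(i(c) - M_I(c))^2, as derived above.
    outputs = [0.2, 0.4, 0.9, 0.1]
    m = sum(outputs) / len(outputs)
    for i in range(len(outputs)):
        assert abs(ncl_penalty(outputs, i) + (outputs[i] - m) ** 2) < 1e-12

    print(ncl_fitness([outputs], targets=[1.0], i=2, lam=0.5))
    print(rtqrt_penalty(outputs, 2))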

2.2. Genetic Programming Details

This work uses the grammar-guided genetic programming paradigm [6], in particular the DCTG-GP genetic programming system [7,8]. In committee learning, the population as a whole is used to determine the prediction of the system on a newly presented case, instead of the prediction being made by a single - usually the fittest - individual. A number of different mechanisms may be used to determine the prediction made by the population; in this work we used weighted voting - the members of the population vote to determine the prediction, but the votes are weighted by the fitness of the individual. In our experience on this and similar problems, a linear weight to the vote gives too little influence to the votes of the fittest individuals, so we instead use the fourth power of the individual's fitness. The system used is described in more detail in [3] and [9].
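
As an illustration of the voting rule just described, here is a minimal Python sketch (hypothetical names, not the DCTG-GP code) of a committee prediction weighted by the fourth power of individual fitness.

    # Minimal sketch of committee prediction by fitness-weighted voting,
    # with votes weighted by fitness**4 as described above.

    def committee_predict(predictions, fitnesses):
        """Weighted boolean vote: each individual's vote counts with weight
        fitness**4; the committee predicts the class with the larger total."""
        weight_true = sum(f ** 4 for p, f in zip(predictions, fitnesses) if p)
        weight_false = sum(f ** 4 for p, f in zip(predictions, fitnesses) if not p)
        return weight_true >= weight_false

    # Usage: a single very fit individual can outweigh several weaker ones.
    preds = [True, False, False, False]
    fits = [0.95, 0.60, 0.55, 0.50]
    print(committee_predict(preds, fits))  # True: 0.95**4 > 0.60**4 + 0.55**4 + 0.50**4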

3. Experimental Design

Our initial experiments use a single simple problem, the 6-multiplexer. The 6-multiplexer problem is to predict, from the inputs, the output of a multiplexer having two address and four data lines. The search space is the set of boolean combinations of the address and data values using 'and', 'or', 'not' and three-way 'if' combinators; the grammar is given in Table 1.

Table 1: 6-Multiplexer Grammar
EXPR → BOOL
BOOL → TERM
BOOL → and BOOL BOOL
BOOL → or BOOL BOOL
BOOL → not BOOL
BOOL → if BOOL BOOL BOOL
TERM → a0
TERM → a1
TERM → d0
TERM → d1
TERM → d2
TERM → d3

Table 2: GP Parameters
Parameter                      Specification
Number of Runs                 100
Generations/Run                100
Population Size (1st/later)    300/150
Max depth (initial pop)        8
Max depth (subsequent)         10
Tournament size                5
Crossover Probability          0.9
Mutation Probability           0.1

The raw fitness was the proportion of the 64 cases correctly predicted. Runs were terminated at 200 generations. 20 experiments were conducted. Nine experiments used the NCL measure in the fitness function, with λ ranging from 0.1 to 0.9. Nine equivalent experiments were performed with the RTQRT-NCL measure. One experiment used raw fitness; it is included in the graphs for NCL and RTQRT-NCL since it is equivalent to using these measures with a λ value of zero. Finally, the experiment was repeated using implicit fitness sharing as the fitness measure.
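
For reference, the 6-multiplexer target and the raw fitness used here (the proportion of the 64 cases predicted correctly) can be sketched as below. This is an illustrative Python fragment, not the DCTG-GP implementation, and the address-decoding convention is one common choice assumed for the example.

    from itertools import product

    # The 6-multiplexer target: two address bits (a0, a1) select one of the
    # four data bits (d0..d3). Illustration only; the address-decoding
    # convention here is an assumption.

    def multiplexer6(a0, a1, d0, d1, d2, d3):
        return (d0, d1, d2, d3)[2 * a1 + a0]

    def raw_fitness(candidate):
        """Proportion of the 64 input cases predicted correctly."""
        cases = list(product([0, 1], repeat=6))
        correct = sum(candidate(*case) == multiplexer6(*case) for case in cases)
        return correct / len(cases)

    # Usage: a candidate that ignores the address lines and always returns d0.
    print(raw_fitness(lambda a0, a1, d0, d1, d2, d3: d0))  # 0.625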

4. Results

The population fitness by generation for NCL and RTQRT-NCL is shown in Figures 1 and 2.

Figure 1: Fitness vs Generation, NCL

Figure 2: Fitness vs Generation, RTQRT-NCL

It is clear even from a cursory glance that RTQRT-NCL error rates are significantly lower than NCL error rates over the range of values of λ, and that for the best values of λ for each, RTQRT-NCL gives better performance than NCL. This is confirmed by Table 3, which shows, for each value of λ, the mean number of generations, out of the 100 generations in each run, where perfect accuracy on the 6-multiplexer was not achieved. It is thus a proxy measure of the expected number of generations before perfect accuracy is achieved.

Table 3: Mean Number of Generations without Perfect Accuracy
λ      NCL      RTQRT-NCL
0      74.31    74.31
0.1    56.24    62.38
0.2    40.37    56.57
0.3    34.36    57.63
0.4    37.48    56.89
0.5    48.78    37.63
0.6    57.96    30
0.7    73.36    24.7
0.8    78.91    26.02
0.9    82.51    30.42

These results are shown graphically in Figure 3. The equivalent value for implicit fitness sharing is 31.11. Thus, at least on this measure, RTQRT-NCL with a λ value between 0.6 and 0.9 outperforms IFS, though at the cost of an extra parameter (λ) which must be appropriately set. Figure 4 shows the error of raw fitness and IFS against generation. In comparison with Figure 2, it appears to confirm that the optimal λ values for RTQRT-NCL out-perform IFS.

Figure 3: Mean Number of Generations without Perfect Accuracy

Figure 4: Fitness vs Generation, raw fitness and implicit fitness sharing (IFS)

5. Discussion

To summarise the results, with the appropriate choice of λ, root-quartic negative correlation learning outperforms implicit fitness sharing, which in turn outperforms standard negative correlation learning. Both forms of negative correlation learning outperform raw fitness except when extreme values of λ are specified. These results are highly attractive, but we are left in the situation where our original motivation for the choice of RTQRT-NCL over NCL does not explain them (given that the RTQRT-NCL penalty function can also be re-expressed in terms of differences from the mean).


Figure 5: Points f1 and f2 are fixed, points m1 and m2 move symmetrically about the mean

An alternative possible explanation lies in the overall shape of the RTQRT penalty function. This function is highly complex, and not particularly susceptible to global analysis. However, consideration of a simple specific case may cast some light on the behaviour. We consider a simplified 1-dimensional model, in which there are two fixed points f1 and f2 at the ends of an interval, and two movable points m1 and m2 symmetrically distributed about the mean within the interval. We consider the variation in the penalty function, and its gradient, as the movable points move within the interval. Figure 5 illustrates the situation under consideration. Figure 6 shows the values of the RTQRT and NCL penalty functions against the distance of the movable points m1 and m2 from the mean. Figure 7 shows the effect of the distance of the movable points from the mean on the gradient of the penalty functions at each of the points.

Figure 6: RTQRT function values: Points f1 and f2 fixed, m1 and m2 move symmetrically about the mean

Figure 7: RTQRT gradient values: Points f1 and f2 fixed, m1 and m2 move symmetrically about the mean

The important points to note from Figures 6 and 7 are the greater concavity of both the function values and the gradients for the RTQRT-NCL penalty function when compared with the NCL penalty function. The RTQRT-NCL function increases the pressure for separated points to move further apart, without also increasing the pressure for clustered points to separate. Thus, compared with NCL, the RTQRT-NCL penalty function more strongly favours the creation of widely separated, but small, clusters. This we believe to be a useful learning characteristic, at least for the 6-multiplexer problem.
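
The simplified 1-dimensional model can be reproduced with a short script such as the sketch below (our illustration, not the original plotting code). The interval endpoints and the particular point examined are assumptions, and the sign and scaling of the curves in the original figures may differ.

    # Sketch of the simplified 1-D model behind Figures 5-7: two fixed points
    # f1 and f2 at the ends of an interval and two movable points m1, m2
    # placed symmetrically about the mean. Endpoints are illustrative.

    def ncl_penalty(points, i):
        m = sum(points) / len(points)
        return (points[i] - m) * sum(p - m for j, p in enumerate(points) if j != i)

    def rtqrt_penalty(points, i):
        mean4 = sum((points[i] - p) ** 4 for p in points) / len(points)
        return -(mean4 ** 0.25)

    f1, f2 = 50.0, 100.0
    mean = (f1 + f2) / 2.0
    for d in range(0, 21, 5):               # distance of m1, m2 from the mean
        pts = [f1, mean - d, mean + d, f2]  # f1, m1, m2, f2
        i = 1                               # examine the movable point m1
        print(d, round(ncl_penalty(pts, i), 2), round(rtqrt_penalty(pts, i), 2))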

5.1. Further Work

Promising results have been obtained with RTQRT-NCL in committee learning with both neural network ensembles [5], and here with genetic programming on a simple problem. Clearly, there is a need to extend these experiments to a wider range of problems, and we intend to do so in the near future. We also believe that there is considerable value in investigating penalty functions which cannot be expressed directly in terms of differences from the mean, and we intend to pursue that direction further.

6. Conclusions

The root-quartic penalty function for negative correlation learning is a promising candidate for further work, showing a worthwhile improvement over the standard negative correlation learning penalty function on the 6-multiplexer problem, and also a slight improvement over implicit fitness sharing on this problem, at the cost of an additional tuning parameter. The results so far justify further investigation of this penalty function and its application in genetic programming.

Acknowledgments

Xin Yao made the initial suggestion of investigating negative correlation learning in genetic programming. Daryl Essam has provided a valuable sounding board for discussions on partial function genetic programming and diversity measures, which led to the work described here. More generally, the work has benefited greatly from the insights of members of the Machine Intelligence and Communication Group of the University of New South Wales at the Australian Defence Force Academy. The system has been implemented through modifications to Brian Ross's innovative DCTG-GP system.

References

[1] Andre, D and Koza, J R: 'Parallel Genetic Programming on a Network of Transputers', in J Rosca (ed), Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications, pp 111-120, Morgan Kaufmann, 1995.
[2] Langdon, W B: Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming!, Kluwer Academic Publishers, Boston, 1998.
[3] McKay, R I: 'Fitness Sharing in Genetic Programming', Proceedings, Genetic and Evolutionary Computation Conference 2000, pp 435-442, Morgan Kaufmann, San Francisco, 2000.
[4] Liu, Y and Yao, X: 'Ensemble Learning via Negative Correlation', Neural Networks 12 (10), pp 1399-1404, 1999.
[5] McKay, R I and Abbass, H A: 'Analyzing Anti-correlation in Ensemble Learning', submitted to the Australasian Conference on Neural Networks and Expert Systems, 2001.
[6] Whigham, P A: 'Grammatically-biased Genetic Programming', in J Rosca (ed), Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications, pp 33-41, Morgan Kaufmann, 1995.
[7] Ross, B J: 'Logic-based Genetic Programming with Definite Clause Translation Grammars', Proceedings GECCO-99, Morgan Kaufmann, 1999.
[8] Abramson, H and Dahl, V: Logic Grammars, Springer-Verlag, 1989.
[9] McKay, R I: 'Committee Learning of Partial Functions in Fitness-Shared Genetic Programming', Proceedings, SEAL 2000.