Fast Grammar-Based Evolution Using Memoization

Martin Luerssen and David Powers
Artificial Intelligence Laboratory, School of Computer Science, Engineering and Mathematics, Flinders University, Adelaide, Australia
{martin.luerssen,david.powers}@flinders.edu.au

Abstract. A streamlined, open-source implementation of Shared Grammar Evolution represents candidate solutions as grammars that can share production rules. It offers competitive search performance, while requiring little user-tuning of parameters. Uniquely, the system natively supports the memoization of return values computed during evaluation, which are stored with each rule and also shared between solutions. Significant improvements in evaluation time, up to 3.9-fold in one case, were observed when solving a set of classic GP problems – and even greater improvements can be expected for computation-intensive tasks. Additionally, the rule-based caching of intermediate representations, specifically of the terminal stack, was explored. It was shown to produce significant, although lesser speedups that were partly negated by computational overhead, but may be useful in dynamic and memory-bound tasks otherwise not amenable to memoization. Key words: Evolutionary algorithms, genetic programming, grammatical evolution, shared grammar evolution, memoization

1 Introduction

Genetic Programming (GP) [1] operates on variable-length n-ary syntax trees, which offer greater representational flexibility than the fixed-length strings used in the canonical genetic algorithm. One of the main constraints of GP is the need for closure, i.e., the requirement that any combination of arguments is syntactically legal. With CFG-GP, Whigham [2] not only introduced the use of context-free grammars (CFGs) for this purpose, but also included an automatic mechanism for modifying the grammar based on the fittest solutions, an idea that has since found further refinement in Grammar-Model-based Program Evolution [3]. Adapting the CFG in this way improves the search bias towards not just valid, but better performing solutions. Grammars can also assist in another GP challenge: that of achieving scalability. For a variable length representation, search spaces are effectively unbounded; solutions can exist in higher-dimensional spaces, even if they are structurally simple. Scalability can be facilitated here by a “divide and conquer” approach, which
decomposes the larger problem into weakly correlated subproblems that can be dealt with independently. Software engineering practice has inspired techniques such as Automatically Defined Functions (ADFs) [4] that extend GP by enabling the creation and reuse of discovered modules. Nature, however, has its own means of promoting modularity. Embryogenesis efficiently captures the difference between a compact, searchable representation (genotype) and the functional, high-dimensional solution (phenotype). This mapping process can be modelled with a generative grammar; L-systems have been particularly popular in this context [5]. Shared Grammar Evolution (SGE) [6] combines the three uses of grammars for closure, bias, and modularity. Solutions obtained through SGE comply with a user-defined CFG – but are also represented directly as simple, deterministic CFGs. One of the notable side-benefits of this is intrinsic support for memoization, that is, the caching of return values for parts of the evolved solution. As an evolved population converges, we can expect the similarity between members to increase, so memoization can accelerate evolution by reducing or even eliminating the evaluation time of shared modules. This paper explores the performance benefits of memoization when solving several classic GP problems with SGE. Toward this purpose, we present a streamlined version of SGE that is now publicly available under an open source license.

1.1 Grammatical Evolution

Grammatical Evolution (GE) [7] is currently the most widely published technique for applying grammars to evolution. GE and SGE can be applied to the same tasks and share in common that solutions are derived from context-free grammars (CFGs) defined in Backus-Naur Form (BNF). However, the underlying mechanics differ substantially. GE employs a genetic algorithm, where the genotype is a variable-length bit string that is read left to right to generate 8-bit integers (so-called codons). Each codon, taken modulo the number of alternative productions available for the nonterminal currently being replaced, selects the production rule to apply, starting from the axiom (starting symbol) and ending when all nonterminals have been replaced. If all codons are read but not all nonterminals are replaced, the expansion wraps and continues from the start of the genome, unless a pre-determined maximum number of wrappings has already been exceeded.
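The mapping just described can be sketched in a few lines; the toy grammar and function name below are our own illustration, not GEVA's API:

```python
# Illustrative sketch of the GE genotype-to-phenotype mapping described
# above; the grammar and all names are assumptions, not GEVA's API.
GRAMMAR = {
    "<expr>": [["(", "<op>", "<expr>", "<expr>", ")"], ["<leaf>"]],
    "<op>":   [["add"], ["sub"], ["mul"]],
    "<leaf>": [["x"], ["1"]],
}

def ge_map(codons, axiom="<expr>", max_wraps=3):
    """Expand nonterminals left to right; each codon modulo the number of
    alternatives for the current nonterminal selects the production.
    The genome wraps when codons run out, up to max_wraps times."""
    symbols = [axiom]
    i, wraps = 0, 0
    while any(s in GRAMMAR for s in symbols):
        if i == len(codons):              # genome exhausted: wrap around
            i, wraps = 0, wraps + 1
            if wraps > max_wraps:
                return None               # mapping failed
        j = next(k for k, s in enumerate(symbols) if s in GRAMMAR)
        alts = GRAMMAR[symbols[j]]
        symbols[j:j + 1] = alts[codons[i] % len(alts)]
        i += 1
    return " ".join(symbols)
```

For instance, `ge_map([0, 0, 1, 0, 1, 1])` yields the expression `( add x 1 )`, while a genome that keeps selecting the recursive alternative exhausts its wrapping budget and fails.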

2 Shared Grammar Evolution

With SGE, the user must likewise provide an initial template CFG, which delimits the space of valid solution candidates. If we had knowledge of what made a good solution, we could perhaps define a grammar that only contains good solutions. The user rarely has this information in advance, but it is possible to sample this space. We do so by deriving a solution candidate from the original grammar, which again involves choosing between alternative production rules according to
the original CFG. However, in SGE, these choices specify an i-grammar (individual grammar; see Figure 1) that produces just that one solution. i-grammar rules that originate from the same template rule are legal alternatives to each other, just as i-grammar axioms constitute alternative solution candidates. Applying selection to the solution population implies selection of i-grammar rules through their contribution to the selected solutions. The rule alternatives that survive are those that contribute to better solutions. As our search converges towards the optimum, so should the suitability of the production rules and hence the building blocks from which we build the solutions.
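The sampling process described above might be sketched as follows, with a toy regression grammar; the rule naming (e.g. `<exp>|1`) follows Figure 1, but the data structures are our own illustration, not COGENT's:

```python
import random

# Toy sketch of instantiating a deterministic i-grammar from a template
# CFG, as described above; names and structures are assumptions.
TEMPLATE = {
    "<exp>":  [["(", "<op>", "<exp>", "<exp>", ")"], ["<leaf>"]],
    "<op>":   [["add"], ["sub"], ["mul"]],
    "<leaf>": [["VAR"], ["CONST"]],
}

def instantiate(nonterminal, template, igrammar, rng, depth=0, max_depth=6):
    """Create one deterministic i-grammar rule for `nonterminal` by picking
    a single alternative and recursively instantiating its nonterminals."""
    alts = template[nonterminal]
    if depth >= max_depth:  # steer away from recursive alternatives
        alts = [a for a in alts if nonterminal not in a] or alts
    successor = []
    for symbol in rng.choice(alts):
        if symbol in template:
            successor.append(instantiate(symbol, template, igrammar, rng,
                                         depth + 1, max_depth))
        else:
            successor.append(symbol)
    name = "%s|%d" % (nonterminal, len(igrammar) + 1)
    igrammar[name] = successor
    return name

def expand(name, igrammar):
    """i-grammar rules are deterministic: expansion yields exactly one string."""
    out = []
    for symbol in igrammar[name]:
        out.extend(expand(symbol, igrammar) if symbol in igrammar else [symbol])
    return out
```

Each call to `instantiate` returns an axiom whose expansion is a single, fixed sentence of the template language, which is what makes the rules interchangeable building blocks.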


Fig. 1. Solution candidates are described by production rules initially derived from the user-defined template grammar. Mutation involves replacing an existing rule with another rule, either part of another solution (a reuse) or generated from the template grammar (a reinstantiation). In this example, rule B|1 is shared between two solutions; memoized results of evaluating this rule can be reused by both solutions.

2.1 Shared Representation

Due to the overhead of having to describe each rule, representing a solution as a series of production rules is not very efficient. However, if multiple rules have identical successors, they can be represented by a single rule. Instead of keeping duplicates, i-grammar rules are shared across multiple solution candidates. If these solutions are similar, then the population of solutions and rules can be represented more compactly. Moreover, since i-grammar rules are context-free and deterministic, every rule leads to a specific derivation. We can therefore treat an i-grammar rule as an encapsulated module or subroutine that may be shared and reused among solution candidates. Once it is not required any longer, the rule is automatically eliminated. Unlike in [6], our new implementation relies solely on a reference-counting scheme; e.g., a rule that is only referred to by one
other rule, even if that rule is referred to many times, will now have a usage count of only one.
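A minimal sketch of such a deduplicating, reference-counted rule store, assuming a dictionary-based design of our own rather than COGENT's actual data structures:

```python
# Sketch of a shared rule store with direct-reference counting as
# described above (illustrative, not COGENT's implementation): identical
# successor sequences are interned as one rule, and a rule is reclaimed
# as soon as nothing refers to it any longer.
class RuleStore:
    def __init__(self):
        self.by_successor = {}   # successor tuple -> rule id
        self.successors = {}     # rule id -> successor tuple
        self.refcount = {}       # rule id -> number of direct references
        self.next_id = 0

    def intern(self, successor):
        """Return the rule for this successor sequence, creating it (and
        taking one reference per sub-rule occurrence) on first use."""
        successor = tuple(successor)
        rid = self.by_successor.get(successor)
        if rid is None:
            rid = self.next_id
            self.next_id += 1
            self.by_successor[successor] = rid
            self.successors[rid] = successor
            self.refcount[rid] = 0
            for symbol in successor:
                if symbol in self.successors:   # sub-rule, not a terminal
                    self.refcount[symbol] += 1
        return rid

    def acquire(self, rid):
        self.refcount[rid] += 1

    def release(self, rid):
        """Drop one direct reference; reclaim the rule (and recursively
        its sub-rules) once no reference remains."""
        self.refcount[rid] -= 1
        if self.refcount[rid] > 0:
            return
        successor = self.successors.pop(rid)
        del self.by_successor[successor]
        del self.refcount[rid]
        for symbol in successor:
            if symbol in self.successors:
                self.release(symbol)
```

Note that, as in the text, a rule's count reflects only its direct referrers: a sub-rule used twice by one parent has a count of two regardless of how many solutions use that parent.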

2.2 Mutation

Each i-grammar rule has syntactically valid alternatives based on the template rule it relates to. A new solution candidate can be created by choosing an existing candidate, then choosing a rule from it, and replacing that rule with an alternative, which can either be taken from any other i-grammar (a reuse of a rule) or derived as a new sequence from the template grammar (a reinstantiation of a rule). The chance of a particular rule being reused is proportional to its reference count; a rule that is part of many solutions, but only called by a few other rules – i.e., rarely added successfully to a new solution on its own terms – will be less likely to be chosen. The reinstantiation probability is decided by the user. A notable exception is that reinstantiation will be automatically chosen instead of reuse if no valid alternative exists, or if the production to be replaced is the same as the replacement. During reinstantiation, the choice between reuse and reinstantiation is made separately for every rule in the new derivation. If we want to maintain both the parent and the offspring solution, the production rules of the parent, between the starting rule and where the replacement occurs, must be copied and re-pointed to the replacement or to the correspondingly modified successor. The new implementation is simplified compared to [6], in that replacing a production rule only affects one point in the derivation tree, not all points where that rule is called. No recursion can arise from such singular changes, so for problem tasks that benefit from a recursive description, recursion must be represented within the solution space by providing recursion operators in the terminal set. Furthermore, we rely on non-elitist selection to account for cases where progress is only possible through multiple changes on separate branches of the derivation tree.
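The reuse-versus-reinstantiation choice can be sketched as follows (illustrative only; the path-copying needed to preserve the parent solution is omitted, and all names are our own):

```python
import random

# Toy sketch of choosing a replacement rule during mutation, per the
# description above; not COGENT's actual implementation.
def pick_replacement(alternatives, refcount, old_rule, p_reinstantiate,
                     reinstantiate, rng):
    """`alternatives` are i-grammar rules derived from the same template
    rule as `old_rule`. Reuse is weighted by reference count; a fresh
    derivation is forced when no distinct alternative exists."""
    candidates = [r for r in alternatives
                  if r != old_rule and refcount.get(r, 0) > 0]
    if not candidates or rng.random() < p_reinstantiate:
        return reinstantiate()   # derive a brand-new rule from the template
    weights = [refcount[r] for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A `reinstantiate` callback standing in for a fresh template derivation makes the forced-reinstantiation cases easy to see: an empty candidate list always falls through to it.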

2.3 Memoization

The shared representation does not merely lead to a potentially smaller memory footprint. As each solution candidate is derived in a series of rule expansion steps, we can also interpret these as nested subroutine calls. If the same rule is used by many candidates, then this is equivalent to calling the same subroutine, which will return the same result. Why not store the result of the first call and re-use it afterwards? Memoization is a well-established technique that enables a subroutine to record, in a local table, values that have previously been calculated. The term was coined by Donald Michie [8] and is frequently encountered in the context of functional programming languages. Memoization lowers a function’s time cost in exchange for space cost; that is, memoized functions are optimized for speed in exchange for greater use of memory space. This kind of caching is highly applicable to the evolution of programs and other hierarchies that include common modules that are reused many times. Its benefits have been exploited
on only a few problems with expensive evaluation functions, such as robot evolution [9] and chess playing [10]. In these instances, however, memoization is applied only at the level of the terminals, whereas in SGE every rule that can have a result associated with it (a cacheable rule) can be memoized. To be able to look up a result in a table, rather than recompute it, a number of constraints must be satisfied. Firstly, there needs to be a deterministic outcome that, preferably, is isolated from its context. If a production A leads to an expression such as X + X, then 2A would be 2X + X, so knowing the result of X + X is not helpful here (unlike if A were (X + X)). Secondly, memoization becomes impractical for tasks that involve side-effects, e.g., dynamic control tasks, because of the indirect dependencies that arise between program parts. Memoizing the evaluated results is therefore not useful in all cases. As an alternative, we could instead cache the expanded terminal string. It may be represented as, or compiled into, a form that is easy to compute, i.e., through an intermediate representation, such as bytecode. The table look-up would then involve retrieving not the value of a particular evaluation, but essentially a subprogram that accomplishes the same task as the derivation of the production rule would – but, ideally, faster. Separating the evaluated representation from the grammar in this way can bring about further opportunities, e.g., evaluating solutions on specialized hardware. In this paper, we limit ourselves to testing the idea of using the terminal stack as the intermediate representation. Thus, rather than repeatedly deriving from rules and evaluating operator precedences, the ordered symbol sequence will be shared, with higher-level sequences constructed from encapsulated, lower-level sequences (or so-called fragments).
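As a sketch of the per-rule value caching described in this section, assuming a toy arithmetic rule encoding of our own rather than COGENT's:

```python
# Minimal sketch of classic memoization at the rule level: each cacheable
# rule stores the vector of values it returned for all fitness cases, so
# any solution sharing that rule skips re-evaluating it. Illustrative of
# the idea only; the encoding is an assumption.
OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def evaluate(rule, igrammar, cases, cache):
    """`igrammar` maps a rule name to ("VAR",) or (op, left_rule, right_rule).
    Returns one value per fitness case, memoized per rule."""
    if rule in cache:                 # shared rule already evaluated
        return cache[rule]
    node = igrammar[rule]
    if node[0] == "VAR":
        values = list(cases)          # the raw input for every case
    else:
        left = evaluate(node[1], igrammar, cases, cache)
        right = evaluate(node[2], igrammar, cases, cache)
        values = [OPS[node[0]](a, b) for a, b in zip(left, right)]
    cache[rule] = values
    return values
```

The same `cache` persists across all solutions that share rules, which is what turns population convergence into a speed advantage; as noted above, this only works when rule evaluation is deterministic and free of side-effects.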

3 Implementation

SGE as presented in this paper has been implemented as an open source software package written in C++, called COGENT [11]. COGENT reads a template grammar as well as a set of parameters from a text file, evolves solution candidates to the problem, and evaluates these on user-defined objective functions. Any number of objectives can be optimized at the same time, and the objectives can be organized according to their importance. Objectives of identical importance are sorted according to Pareto dominance using NSGA-II [13]. If, for selection purposes, two solutions are ranked identically, further user-defined objectives of successively lower importance are taken into account, possibly describing other aspects of the solution, such as crowding distance. Grammar definitions are BNF-like, extended by allowing multiple production rules (rather than just the first) to be marked as starting rules by following the left-hand side with "::+". Such starting rules are automatically isolated for memoization; other cacheable rules can be denoted with "::=". Rules that have only terminals as their successors are regarded as constant and are represented by one rule for each successor combination. Memory for memoization is obtained from a Boost memory pool [12] for fast allocation when creating a cacheable rule. This is not feasible for variable size

Table 1. COGENT parameters for the five problem tasks.

Quintic Regression
  Description: symbolic regression of x^5 + x^4 + x^3 + x^2 + x for 20 equidistant points in x = [-1, 1)
  Template CFG:
    <exp>  ::+ ( <op> <exp> <exp> ) | <leaf>
    <op>   ::- add | sub | mul
    <leaf> ::- VAR | CONST

6th-order Regression
  Description: symbolic regression of x^6 - 2x^4 - x^2 for 20 equidistant points in x = [-1, 1)
  Template CFG:
    <exp>  ::+ ( <op> <exp> <exp> ) | <leaf>
    <op>   ::- add | sub | mul | div
    <leaf> ::- VAR | CONST

Even-5 Parity
  Description: returns true if an even number of 5 Boolean inputs is true (over all 32 combinations of inputs)
  Template CFG:
    <exp>  ::+ <var> | ( <op2> <exp> <exp> ) | <op1> <exp>
    <op2>  ::- or | and | xor
    <op1>  ::- not
    <var>  ::- var0 | var1 | var2 | var3 | var4

6-bit Multiplexer
  Description: return the value of a data register (d0, d1, d2, d3) specified by the binary address (a0, a1) (over all 64 combinations of inputs)
  Template CFG:
    <exp>  ::+ ( <op> )
    <op>   ::- and <exp> <exp> | or <exp> <exp> | not <exp> | if ( <exp> ) ( <exp> ) ( <exp> ) | var0 | var1 | var2 | var3 | var4 | var5

Santa Fe Ant Trail
  Description: collect 89 food pellets laid out on a toroidal 32 × 32 grid
  Template CFG:
    <line>  ::+ <op> | <if> | <prog2> | <prog3>
    <op>    ::- move | left | right
    <if>    ::- iffoodahead <line> <line>
    <prog2> ::- prog <line> <line>
    <prog3> ::- prog <line> <line> <line>

Evolution Parameters
  Generations: 100
  Population size: 100 (200 for 6th-order and multiplexer)
  Selection: applied to the combined set of parents and offspring; elite of 10, with the remaining 90 solutions chosen by tournament of size 3
  Solution size limit: maximum of 800 rules per solution (if exceeded, discard solution and retry)
  Objectives: minimize error; if equal, minimize solution size, then minimize solution age (generations)

allocations, such as for the memoization of intermediate representations. In those cases, the pool only contains pointers to the allocations, which are made conventionally via malloc. COGENT supports multiple threads for deriving, memoizing, and evaluating solutions, but not for modifying the grammar, due to the extensive and costly mutex locking this would require.
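To illustrate the grammar format described in this section, a minimal template for symbolic regression might be written as follows; the angle-bracket notation for nonterminals and the exact layout are our assumption, not taken verbatim from the COGENT distribution:

```text
<exp>  ::+ ( <op> <exp> <exp> ) | <leaf>
<op>   ::- add | sub | mul
<leaf> ::- VAR | CONST
```

Here "::+" marks a starting rule, which is automatically isolated for memoization; additional cacheable rules would use "::=", and ordinary rules "::-".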

4 Experiment

The experimental aim is to determine whether evolution can be accelerated using memoization, but the extent to which this is possible may be highly dependent on the problem task and the exact evolutionary parameters. Five classic GP problems were chosen that are already well understood and also quick to evaluate. The observed results should therefore be closer to the lower bound of what can be achieved with memoization, rather than artificially inflated. As shown in Table 1, the experimental setup contains few surprises, but note that SGE does not have mutation or recombination parameters in the traditional sense; the reinstantiation probability controls the changes that are made to solutions. Since rules that are freshly derived from the template grammar are unshared, we expect this parameter to directly affect the memoization benefit. We hence investigated the effect of no reinstantiation, of reinstantiating in 25% of cases, in 50% of cases, and always. Furthermore, the evolutionary runs are timed for four different memoization scenarios:

1. no caching: the solution is explicitly derived for each problem case
2. solution caching: the derived solution (i.e., the terminal stack; see Section 2.3) is stored and reused between problem cases
3. fragment caching: each cacheable rule stores its own intermediate terminal stack
4. value caching: each cacheable rule stores the return value of its evaluation (i.e., classic memoization)

Finally, the quintic regression, parity problem, and ant trail were evolved using GEVA [14], an open-source implementation of Grammatical Evolution, where they are included as example problems (with parameters equivalent to the above).

4.1 Results

Table 2 lists performance statistics for the best evolved solutions as well as the total computation time, which is split into an evaluation time and evolution time. The former specifies how long the program spent on deriving and evaluating all solutions, whereas the latter includes the remaining computing time, mostly of the evolutionary algorithm itself and the collection of experimental statistics. Results are averaged over 100 runs. Random seeds remain constant across different memoization options so that timings are based on the same evolutionary outcomes. There is surprisingly little objective performance variation between the different reinstantiation probabilities; only on the two regression problems does the 100% setting perform significantly worse than the alternative configurations (p < 0.001 on a two-tailed t-test). It suggests that one need not be particularly careful with this parameter. However, reinstantiation is theoretically very limited, as it derives changes directly from the template grammar, i.e., a fixed distribution. Its apparent effectiveness may simply reflect the ineffectiveness of the opposite: if we just draw building blocks from within the population, overall entropy will drop and premature convergence may happen. Conversely, random draws from the template grammar increase entropy, so a trade-off is ultimately necessary, especially for deceptive problems; i.e., some – but not too much – reinstantiation is needed. Since it appears to work well for a broad range of values, however, we cannot deduce much more from this experiment. The memoization results offer greater clarity. Firstly, evaluating a solution by deriving it for every variation of a terminal value is about an order of magnitude slower than if we retain an intermediate representation (the terminal stack) across all the problem cases. 
If we break the representation up further by storing substacks with each cacheable rule, we gain an improvement in evaluation speed of more than 50% on the quintic and parity problems and of more than 100% on the 6th-order polynomial (all significant at p < 0.001), but only of a few percent on the multiplexer (not significant). One explanation for this lies in the evolved solution size. For the 6th-order polynomial, it is around 300 rules by the 100th generation, but for the multiplexer, it is only 20-30 rules. The extent to which productions are shared is directly affected by this; the share ratio indicates that there is 6× as much sharing happening with the polynomial as with the multiplexer. Additionally,
Table 2. Evolution statistics for each problem, averaged over 100 runs. Best refers to the fittest solution. The Share Ratio is the number of expressed rules across all solutions divided by the rules in the global grammar, averaged over all generations. Run times (in milliseconds) are for each complete run, given as evaluation + evolution time for each memoization scenario.

Experiment          Success  Best   Best   Share   None          Solution     Fragment      Value
                    rate     error  size   ratio
Quintic
  0% reinst.        99%      0.003   39.2  5.47    3,575+104     372+105      241+266       254+111
  25% reinst.       94%      0.008   58.5  6.02    4,638+117     495+118      289+353       296+128
  50% reinst.       90%      0.009   90.5  6.99    5,223+127     591+126      327+430       323+135
  100% reinst.      83%      0.015   80.4  4.81    5,338+208     531+208      322+296       559+225
6th-order
  0% reinst.        41%      0.009  361.2  9.95    40,091+485    5,039+495    2,135+3,233   1,297+541
  25% reinst.       38%      0.010  315.6  9.23    36,000+460    4,384+470    1,962+2,883   1,206+515
  50% reinst.       44%      0.011  315.6  9.09    36,068+473    4,395+483    2,000+2,911   1,225+529
  100% reinst.      14%      0.020  296.4  6.91    29,592+540    3,242+550    1,552+2,331   1,455+598
Even-5 Parity
  0% reinst.        79%      1.34    35.1  6.11    7,284+102     596+103      385+296       339+109
  25% reinst.       80%      0.99    38.7  6.89    8,058+107     656+108      414+311       351+113
  50% reinst.       81%      0.97    37.0  5.67    8,056+113     678+114      420+341       373+119
  100% reinst.      78%      1.20    36.4  3.73    7,376+183     615+186      414+443       641+197
6-bit Multiplexer
  0% reinst.        75%      1.36    29.5  2.81    22,696+376    2,468+378    2,202+703     1,546+388
  25% reinst.       85%      0.57    24.7  2.60    20,283+364    2,280+366    2,054+662     1,458+375
  50% reinst.       82%      0.63    20.6  2.23    17,973+351    1,997+354    1,813+613     1,383+361
  100% reinst.      71%      1.10    19.9  1.91    9,992+320     1,224+315    1,185+441     1,195+322
Ant Trail
  0% reinst.        11%      21.2    75.5  4.56    147+157       N/A          N/A           N/A
  25% reinst.       7%       22.6    80.3  4.48    148+168       N/A          N/A           N/A
  50% reinst.       15%      20.3    85.1  4.58    147+176       N/A          N/A           N/A
  100% reinst.      19%      21.6    32.7  2.91    138+449       N/A          N/A           N/A

most of the rules of the multiplexer grammar are not defined as cacheable (and are not designed to be), whereas all of the rules for the polynomial are. These factors influence how worthwhile memoization is. Storing the intermediate representation with each cacheable rule involves a notable overhead, as can be observed from the evolution time. On all the cacheable problems tested here, it negates the entire performance benefit of this strategy. However, on real-world problems with more costly evaluation functions, we should still expect a substantial net benefit. A more convincing result is produced by memoization in the classic sense of caching evaluation results. Here, the evolution cost is only marginally higher than baseline, but the evaluation improvement is even greater: up to 3.9× on the 6th-order polynomial, and between 1.4× and 1.9× on the other problems (all significant at p < 0.001). These numbers exclude the 100% reinstantiation setting, as hardly any improvement is observed there. 100% reinstantiation should lead to less speedup, because any
rules obtained via reinstantiation are new and unshared, and we accordingly observe a much lower share ratio for 100% than for any other setting. It is not clear, however, why this has a much greater impact with value caching than with stack caching – we may be hitting an implementation bottleneck of some kind here. The Santa Fe ant trail turned out to be a task not suitable for memoization, as it is not only recurrent but also a single-case problem that can be evaluated directly and more efficiently from its i-grammar than through any other means. We include it solely for comparative purposes. Likewise intended for informal comparison are the GEVA outcomes: 20% success rate (with error 0.727) for the quintic regression, 76% success rate (with error 1.61) for the parity problem, and 8% success rate (with error 23.88) for the ant trail. COGENT appears to perform slightly better on these problems, significantly so for the regression (p < 10^-10), but since a GE expert could likely improve upon this, we can only say that our system appears to be competitive without much fine-tuning.

5 Conclusion

In this paper we have presented a streamlined implementation of a novel scheme for evolving solutions from a user-defined CFG, which is performance-competitive with GE, requires few parameters to be tuned, and is available as open source [11]. A particularly noteworthy feature is its support for memoization, through which we achieved significant improvements to evaluation speed of 1.4–3.9× over just caching the expanded solution between sample presentations. As the tested problems were quite simple, we expect this to be a lower bound; real-world problems with expensive evaluation functions should benefit far more. Memoization was also noted to be more effective for larger solutions that share many production rules. Connected to this, we also explored the reinstantiation probability, which defines the balance between creating new rules from the user-defined CFG and exploiting existing rules in the population. It affects both sharing and the objective error of evolved solutions, although significant changes (for the worse) were only observed when reuse of existing rules was discouraged completely. The observed tolerance to parameter changes is postulated to be due to its complex relationship to diversity in the population and the associated exploration-exploitation balance, which needs to be investigated further. At present, our scheme is used only at its most basic; there are many opportunities for enhancement. For instance, the rule to be changed is chosen randomly, but one could instead impose and modify probabilities for each rule, e.g., in an ant-system-like manner. Experimental evaluation of such ideas will greatly profit from faster evolution. Multicore support is currently limited to shared-memory multithreading for the derivation and evaluation of solutions. We intend to expand this to a distributed-memory, multi-grammar approach, but this has implications for the applicability of memoization that need to be explored.
We also hope to expand on the notion of intermediate representations, especially for more complex problems that share correspondingly complex modules, with more selective caching to emphasize the performance benefits (over the drawbacks) of memoization, and compilation to faster representations that can run on specialized hardware, such as GPUs. We will continue to make these changes publicly available; fellow researchers are welcome to check them out and contribute.

6 Acknowledgments

We acknowledge and appreciate the support of the Thinking Head SRI grant TS0669874 of the Australian Research Council (ARC) and the National Health and Medical Research Council (NH&MRC).

References

1. Koza, J.: Genetic programming: on the programming of computers by means of natural selection. MIT Press, Cambridge (1992)
2. Whigham, P.: Grammatically-based genetic programming. In: Rosca, J. (ed.) Workshop on Genetic Programming: From Theory to Real-World Applications, pp. 33–41. Morgan Kaufmann Publishers, San Mateo (1995)
3. Shan, Y., McKay, R., Baxter, R., Abbass, H., Essam, D., Nguyen, H.: Grammar Model-based Program Evolution. In: IEEE Congress on Evolutionary Computation (CEC 2004), pp. 478–485. IEEE Press (2004)
4. Koza, J.: Genetic programming II: automatic discovery of reusable programs. MIT Press, Cambridge (1994)
5. Hornby, G.: Generative representations for evolutionary design automation. Ph.D. thesis, Brandeis University, Dept. of Computer Science (2003)
6. Luerssen, M., Powers, D.: Evolving encapsulated programs as shared grammars. Genetic Programming and Evolvable Machines 9(3), 203–228 (2008)
7. O'Neill, M., Ryan, C.: Grammatical evolution. IEEE Transactions on Evolutionary Computation 5(4), 349–358 (2001)
8. Michie, D.: Memo functions and machine learning. Nature 218, 19–22 (1968)
9. Suwannik, W., Chongstitvatana, P.: On-line evolution of robot arm control programs for visual-reaching tasks using memoized functions. ECTI Transactions on Electrical Engineering, Electronics, and Communications 4(2), 145–155 (2006)
10. Sipper, M., Azaria, Y., Hauptman, A., Shichel, Y.: Designing an evolutionary strategizing machine for game playing and beyond. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 37(4), 583–593 (2007)
11. COGENT: Concurrent Grammar Exploration and Transformation, http://code.google.com/p/cogent
12. Boost Pool Library, http://www.boost.org/doc/libs
13. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 182–197 (2002)
14. GEVA: Grammatical Evolution in Java, http://ncra.ucd.ie/Site/GEVA.html