Program distribution estimation with grammar models

Y. Shan, R. I. McKay, H. A. Abbass, D. Essam
School of ITEE, University College, University of New South Wales, Australian Defence Force Academy, Canberra, ACT 2600, Australia
Email: {shanyin,rim,abbass,daryl}@cs.adfa.edu.au

Abstract. This research extends conventional Estimation of Distribution Algorithms (EDA) to the Genetic Programming (GP) domain. We propose a framework to estimate the distribution of solutions in tree form. The core of this framework is a grammar model. In this research, we show, both theoretically and experimentally, that a grammar model has many of the properties we need for estimating the distribution of tree-form solutions. We report one of our implementations of this framework. An experimental study confirms the relevance of this framework to problem solving.

1. Introduction

There has been growing interest in Estimation of Distribution Algorithms (EDA) (Bosman and Thierens, 1999; Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan, 2002). The basic idea is to estimate the distribution of good solutions. EDA mainly uses a linear string representation, resembling an individual in Genetic Algorithms (GA) (Goldberg, 1989), because conventional EDA research is closely related to GA. Tree representation is rarely considered in EDA research. However, there is an important class of problems for which tree representation is suitable, so the limited consideration of tree representation in the EDA literature is a significant gap in the research. In this paper, we propose a framework, called Program Distribution Estimation with Grammar Model (PRODIGY), which extends conventional EDA to a more complex tree representation. Since EDA research is heavily motivated by Evolutionary Computation (EC) approaches, in particular Genetic Algorithms, it is easier to understand it from a GA perspective. One common theory for interpreting GA behaviour is that, through genetic operators, especially crossover, building blocks, which are high-quality partial solutions, are discovered and combined to form better solutions (Holland, 1975; Goldberg, 1989). Therefore, there is reason to believe that if we can build a model which can explicitly identify, preserve and promote these building blocks, we can improve GA performance. This has been experimentally verified (Bosman and Thierens, 1999; Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan, 2002). EDA can also be understood as estimation of the distribution of good solutions, as its name suggests. Most EDA research is based on a GA-like linear string representation because of the close relation between EDA and GA. There is only a very limited amount of work on more complex tree representations, resembling Genetic Programming (GP) (Cramer, 1985; Koza, 1992). This is a significant gap in EDA and GP research, because EDA not only has the potential to improve GP performance on some problems, but also provides a new perspective for understanding GP.

Our work aims to help fill this gap. For simplicity, we refer to the idea of applying EDA approaches to estimate the distribution of GP-style tree-form solutions as EDAGP. In this research, we propose a framework to estimate the distribution of solutions in tree form. We call this framework Program Distribution Estimation with Grammar Model (PRODIGY). The basic algorithm of PRODIGY is consistent with the EDA algorithm. The core of this framework is a grammar model.

Figure 1. Example of Grammar Guided Genetic Programming: a) context-free grammar for the symbolic regression problem; b) the generation of x(x+x)

In this research, we show, both theoretically and experimentally, that a grammar model has many of the properties we need for estimating the distribution of tree-form solutions. This paper is organized as follows. Background information is given in Section 2. Properties that a probabilistic model should have for EDAGP are discussed in Section 3. The PRODIGY framework is proposed in Section 4. Our implementation and experiments are reported in Sections 5 and 6. Section 7 discusses related work. Conclusions are given in the last section.

2. Background

2.1 GP and Grammar Guided GP

Genetic Programming (GP) (Cramer, 1985; Koza, 1992) is an Evolutionary Computation (EC) approach. GP can be most readily understood by comparison with Genetic Algorithms (GA). Rather than evolving a linear string as GA does, genetic programming evolves computer programs, which are usually tree structures. The genetic operators, such as crossover, are refined to act upon subtrees, so as to ensure that the child is syntactically correct. Grammar Guided Genetic Programming (GGGP) (Whigham, 1995) is a genetic programming system with a grammar constraint. All kinds of grammars can be used to describe the constraint; however, we mainly focus on Context-free Grammars (CFG). The formal definition of a CFG will be given later in Section 4.1. GGGP provides a systematic way to handle typing. In this respect, it has a more formal theoretical basis than strongly typed GP (Montana, 1993). More importantly, GGGP can constrain the search space so that only grammatically correct individuals are generated. This also requires that genetic operators, such as mutation and crossover, respect the grammar constraint, i.e., after applying a genetic operator, the offspring must still be grammatically correct.

For example, a common CFG grammar for the symbolic regression problem is illustrated in Fig. 1.a. An individual which may be derived from this grammar is illustrated in Fig. 1.b. The numbers in parentheses in Fig. 1.b are the relative locations of symbols, which will be discussed later.

2.2 Estimation of Distribution Algorithms

The basic idea of EDA is to estimate the probability distribution of optimal solutions (Bosman and Thierens, 1999; Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan, 2002). A formal description of EDA can be found in (Bosman and Thierens, 1999). Assume Z is the vector of variables we are interested in. P^T(Z) is the probability distribution of individuals whose fitnesses are greater than some threshold T (without loss of generality, we assume we are dealing with a maximization problem). If we knew P^{T_opt}(Z) for the optimal fitness T_opt, we could find a solution by simply drawing a sample from this distribution. However, usually we do not know this distribution. Because we have no prior information on this distribution, we start from a uniform distribution. In the commonest form of the algorithm, we generate a population P with n individuals and then select a set of good individuals G from P. Since G contains only selected individuals, it represents the part of the search space that is worth further investigation. We now estimate a probabilistic model M(ζ, θ) from G, where ζ is the structure of the probabilistic model M and θ is the associated parameter vector. With this model M, we can obtain an approximation of the true distribution P^T(Z). To further explore the search space, we sample the distribution represented by M, and the new samples are then reintegrated into population P by some replacement mechanism. This starts a new iteration. The first work in the field of EDAGP, applying EDA to GP-style tree-form solutions, was Probabilistic Incremental Program Evolution (PIPE) (Salustowicz and Schmidhuber, 1997). The probability model employed in PIPE is a prototype tree: a complete full tree with maximum arity. Each node of this prototype tree has a probability table indicating the probability of each symbol. The prototype tree enumerates all possible positions of tree nodes, given the depth limit and the maximum arity of the symbols. Hence, what PIPE tries to learn is the correct symbol at a particular location, regardless of its surrounding symbols. We call each node in the prototype tree an absolute position. In these terms, the prototype tree of PIPE is an absolute-position-dependent model.
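As a concrete illustration of an absolute-position-dependent model, the sketch below builds a PIPE-style prototype tree in Python, in which every node position carries its own independent symbol table and sampling ignores the surrounding symbols. The symbol set, arities, depth limit and helper names are illustrative assumptions, not taken from PIPE or from this paper.

```python
import random

SYMBOLS = ["+", "*", "x", "1.0"]              # toy primitive set, not from the paper
ARITY = {"+": 2, "*": 2, "x": 0, "1.0": 0}
MAX_DEPTH = 3
MAX_ARITY = max(ARITY.values())

def uniform_table():
    return {s: 1.0 / len(SYMBOLS) for s in SYMBOLS}

def build_prototype_tree(depth=0, path=()):
    """Enumerate every absolute position up to MAX_DEPTH; each holds its own table."""
    node = {"path": path, "table": uniform_table(), "children": []}
    if depth < MAX_DEPTH:
        node["children"] = [build_prototype_tree(depth + 1, path + (b,))
                            for b in range(MAX_ARITY)]
    return node

def sample_program(node, depth=0):
    """Pick a symbol at each visited position independently of surrounding symbols."""
    table = node["table"]
    if depth == MAX_DEPTH:                    # force a terminal at the depth limit
        table = {s: p for s, p in table.items() if ARITY[s] == 0}
    symbols = list(table)
    sym = random.choices(symbols, weights=[table[s] for s in symbols])[0]
    return (sym, [sample_program(node["children"][b], depth + 1)
                  for b in range(ARITY[sym])])

tree = build_prototype_tree()
print(sample_program(tree))
```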

3. Wisdom from GP research in searching for a model

Because the basic idea of EDA is to approximate the true distribution using a model M, it is vital to choose an appropriate model M. Consequently, conventional EDA research, which usually employs a GA-style linear string to encode individuals, focuses heavily on finding a suitable probabilistic model M. The GA literature supports the belief that there are dependencies between genes (also known as linkage). In the EDA literature, many different kinds of probabilistic models have been proposed to represent linkage, ranging from methods that assume the genes in the chromosome are independent, through others that take into account pairwise interactions, to methods that can accurately model even very complex problem structures with highly overlapping multivariate building blocks. There is only very limited research on EDAGP, i.e. applying EDA ideas to GP-style tree representation. However, tree representation provides an alternative way to encode solutions. It is more flexible, which is important for the wide class of problems that have previously been studied with GP.

Therefore, research in this direction is worth pursuing. One of the major difficulties of EDAGP lies in finding an appropriate probabilistic model. We may take lessons from GP research in this respect. Based on current GP research, we are convinced that a good model for EDAGP should have the following properties.

3.1 Internal hierarchical structure

The tree representation has an internal structure, an intrinsic hierarchical structure. Our model needs to be able to represent this structure: for example, when sampling the model, i.e. when generating individuals from the model, a valid tree structure should be guaranteed.

3.2 Locality of dependence

In conventional EDA, the model employed usually does not assume specific dependencies between adjacent loci, which is a reasonable assumption for a GA representation. However, in a tree structure dependence exhibits strong locality, and this is the primary dependence we should consider in a tree representation. Dependencies (relationships) between parent and child nodes are expected to be stronger than those among other nodes. Another perspective on this locality of dependence comes from semantics. When evolving a linear structure, as conventional EDA does, we usually assume the semantics is attached to the locus. For example, when using GA to solve a travelling salesman problem, one common encoding is that each locus represents the step at which one specific city is visited. However, in a tree representation, the meaning of a node has to be interpreted in its surrounding context. For example, in the Artificial Ant problem (Koza, 1992), one of the standard GP benchmark problems, the node move would have entirely different effects depending on the current position of the ant. The model chosen for evolving tree structures has to be able to represent this strong local dependence, as well as dependence on a larger scale.

1. Generate a population P randomly.
2. Select a set of fitter individuals G.
3. Estimate a SCFG model M over G.
4. Sample the SCFG model M to obtain a set of new individuals G'.
5. Incorporate G' into population P.
6. If the termination condition is not satisfied, go to 2.

Figure 2. High level algorithm of PRODIGY

3.3 Position independence

Position independence is in fact very closely related to locality of dependence. It emphasizes that, in a GP tree representation, the absolute position of a symbol usually does not play a large role. Because of locality of dependence, a substructure may occur at various locations in different GP trees and still make an identical or similar contribution to the overall fitness. This belief is well reflected in the various GP schema theories. Usually, a schema is defined as a subtree of some kind; the absolute position of the subtree is not considered. The quality of a schema is determined by its own structure, not just by where it is. Position independence is also related to introns in GP research. When evolving solutions in GP, it is common that some parts of an individual do not contribute to the fitness, and thus can be removed when evaluating fitness.

This phenomenon of the evolution of introns is another cause of position independence. A subtree fragment can be moved further down the tree by inserting introns in front of it, yet the function of the subtree does not change.

3.4 Modularity

In a tree representation, it is common that building blocks, which are relatively independent subsolutions, may need to be shared among tree branches and individuals. Therefore, one building block may occur multiple times, either in one individual or across the population. This has been validated by numerous studies, such as Automatically Defined Functions (ADF) (Koza, 1992), the Genetic Library Builder (GLiB) (Angeline and Pollack, 1994) and Adaptive Representation (AR) (Rosca and Ballard, 1994). By identifying and reusing common structures, or building blocks, which occur frequently in fitter individuals, these studies reported improvements over conventional genetic programming.

3.5 Non-fixed complexity

Individual trees have no fixed complexity. Even if we do impose some limits, such as a maximum depth or a maximum number of nodes, individual complexity varies significantly from individual to individual and from generation to generation. A fixed model, which resembles the models in conventional EDA in having a fixed size in terms of the number of variables, cannot reflect this variation.

4. Framework of PRODIGY

Having discussed the properties that a probabilistic model should have for EDAGP, we now propose our framework, called Program Distribution Estimation with Grammar Model (PRODIGY). The high-level algorithm of PRODIGY can be found in Fig. 2. It is consistent in its essentials with the EDA algorithm.
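The following minimal Python sketch spells out the loop of Fig. 2 under the assumption of abstract helper functions; the helper names and the simple truncation replacement used for step 5 are illustrative choices rather than the exact procedure of any particular implementation.

```python
# Minimal sketch of the loop in Fig. 2. The helpers passed in
# (random_individual, fitness, learn_scfg, sample_scfg) are placeholders for the
# grammar-specific machinery of Sections 4-5, and the replacement scheme in
# step 5 is one simple choice among several the framework allows.
def prodigy(pop_size, n_select, n_samples, n_generations,
            random_individual, fitness, learn_scfg, sample_scfg):
    population = [random_individual() for _ in range(pop_size)]          # step 1
    best = max(population, key=fitness)
    for _ in range(n_generations):                                       # step 6
        ranked = sorted(population, key=fitness, reverse=True)
        good = ranked[:n_select]                                         # step 2
        model = learn_scfg(good)                                         # step 3
        offspring = [sample_scfg(model) for _ in range(n_samples)]       # step 4
        population = (offspring + ranked)[:pop_size]                     # step 5
        best = max([best] + offspring, key=fitness)
    return best
```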

4.1 Grammar model

The key component of the PRODIGY framework is a Stochastic Context-free Grammar (SCFG) model. We argue that a grammar model, in particular the SCFG model, has all the properties for EDAGP that we discussed in Section 3. A detailed discussion of how a grammar model embodies these properties is left to Section 4.2.

A Stochastic Context-free Grammar (SCFG) M consists of:
1) a set of nonterminal symbols N;
2) a set of terminal symbols (or alphabet) Σ;
3) a start nonterminal S ∈ N;
4) a set of productions or rules R, of the form X → λ, where X ∈ N and λ ∈ (N ∪ Σ)*; X is called the left-hand side (LHS) of the production, and λ the right-hand side (RHS);
5) production probabilities P(r) for all r ∈ R; the probability attached to each production indicates the probability of its being chosen.

If the probability component in the above definition is removed (i.e., a uniform distribution is imposed on the productions), a SCFG becomes a Context-free Grammar (CFG). CFGs are commonly used in GGGP to impose grammar constraints.

A simple example of a CFG can be found in Fig. 1.a. As can be seen, in a CFG it is possible that one nonterminal can be rewritten in different ways; for example, nonterminal "exp" can be rewritten using rule 1, 2 or 3. In a CFG, all of these rules have equal probability of being chosen, and therefore there is no bias. In a SCFG, the added probability component makes it possible to bias towards some rules rather than others.
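As a concrete illustration, a SCFG of this kind might be written down in Python as follows; the rules are a typical symbolic-regression grammar in the spirit of Fig. 1.a rather than the exact grammar of that figure, and the probabilities are arbitrary.

```python
# A hedged sketch of a SCFG in the spirit of Fig. 1.a (the figure itself is not
# reproduced here). Each nonterminal maps to a list of (right-hand side,
# probability) pairs; the probabilities for a given LHS sum to 1. Setting them
# all equal within an LHS recovers a plain CFG as used in GGGP.
SCFG = {
    "exp": [(("exp", "op", "exp"), 0.4),
            (("(", "exp", "op", "exp", ")"), 0.2),
            (("x",), 0.4)],
    "op":  [(("+",), 0.25), (("-",), 0.25), (("*",), 0.25), (("/",), 0.25)],
}
START = "exp"
```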

4.2 Relevance of the grammar model

As foreshadowed in Section 3, in this section we look at the proposed desirable criteria for an EDAGP model, to see to what extent a SCFG meets them.

Grammars were invented to represent the internal hierarchical structure of natural language. The SCFG model intrinsically respects tree structure, which is a basic requirement of GP. Therefore, it can represent GP-style tree structure naturally; no extra constraint needs to be introduced.

SCFG grammars can model locality of dependence very well. The probability attached to each SCFG production rule can be regarded as a conditional probability, describing dependencies between parents and children. As mentioned before, this dependence is the primary dependence in tree structure.

A position-dependent model, such as PIPE or other prototype-tree based work, has to learn building blocks at different positions separately. The SCFG model does not assume any position dependence. The common structures (building blocks) among individuals, even if they are not at the same position, can be represented and preserved by the grammar model. It may be easier to illustrate this point with an example. Suppose we have selected the two individuals illustrated in Fig. 3.b. Between them, one subtree occurs at three positions A, B and C. If a model is position dependent, the subtrees at positions A and C would be considered different because of their different positions in the tree, although they are obviously identical subtrees. Therefore, a position-dependent model cannot learn anything from the duplication of the subtrees at positions A and C, while a grammar model such as that in Fig. 3.a can reflect this common structure very well. The common structure occurs only once in this grammar, represented through rules 2, 4 and 6. This confirms that the grammar can represent position-independent building blocks. Note that the probability component of the SCFG is omitted in Fig. 3.a because it is not relevant to this specific discussion.

The structural constraint imposed by a SCFG is flexible. In effect, a SCFG is a constrained probabilistic graphical model: common structure can be shared while the individual generated is guaranteed to be a valid tree structure. This suggests that a SCFG can represent the modularity in tree representation very well, and it makes the SCFG model essentially different from models which employ a rigid tree structure. With a SCFG, a common structure can be used multiple times in an individual, yet it does not have to be relearned every time. Consider the example used above. The common structure, which is represented by rules 2, 4 and 6, occurs only once in the grammar model illustrated in Fig. 3.a, but it occurs twice in the right-hand individual in Fig. 3.b. This shows that in a SCFG, common structure can be shared both within an individual and, of course, across the population. It may be easier to understand this from another perspective: if we take the symbols in this grammar as nodes and the rules as edges, it is clear that the grammar is actually a graph, and thus its components can be shared while each individual generated is still guaranteed to be a valid tree.


This example clearly demonstrates that a common structure which is repeated at different positions can be represented by a grammar model. The common structure represented by the grammar can easily be shared among tree branches, either within one individual or across the population. Furthermore, this makes learning more efficient, because we do not need to relearn the common structure at different positions. For example, in Fig. 3.b the subtrees at A, B and C are identical (although at different locations) and therefore they all contribute to learning rules 2, 4 and 6. Thus the total number of training cases needed is reduced, which improves learning efficiency. Another point worth mentioning is that the SCFG does not presume a particular shape for building blocks, and in theory it can represent all kinds of building blocks previously studied in GP research.

Figure 3. Example of position independence and modularity: a) a trivial grammar model; b) position independence and modularity

Lastly, tree individuals have no fixed complexity. With an appropriate learning method, the grammar model may vary in size so as to reflect the superior individuals (training samples) and generalize them to a certain extent.

4.3 Grammar model learning and sampling

A SCFG can be understood as a CFG with probabilities attached. Therefore, it has two parts: the structure (the CFG part) and the probabilities. To learn a SCFG model, we have to learn both parts. First, we have to decide the structure of the SCFG, including the number of symbols and the structure of the rules. Second, given this structure, the probability attached to each production rule must be estimated. The PRODIGY framework is not bundled with a specific SCFG model learning method; any SCFG learning method can be used. In this framework section, we do not describe any particular learning method; one of our implementations is discussed in the next section. Sampling a grammar model, i.e. generating individuals from a SCFG grammar model, is very similar to generating individuals from a CFG, except that in a SCFG each rule is chosen according to its probability.
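The following sketch illustrates this sampling procedure, assuming a grammar dictionary of the form shown in the example of Section 4.1; the depth cap is an added safety guard for the illustration, not part of the framework (our implementation controls depth through the conditioned rules of Section 5).

```python
# Minimal sketch of SCFG sampling (Section 4.3): rewrite nonterminals recursively,
# choosing each production according to its probability. `grammar` is a dictionary
# of the form sketched earlier; the depth cap is an illustrative guard against
# unbounded recursion.
import random

def sample(grammar, symbol, depth=0, max_depth=8):
    if symbol not in grammar:                            # terminal symbol: leaf
        return symbol
    if depth >= max_depth:                               # fall back to shortest RHS
        rhs = min(grammar[symbol], key=lambda rule: len(rule[0]))[0]
    else:
        rhs_options, probs = zip(*grammar[symbol])
        rhs = random.choices(rhs_options, weights=probs)[0]
    return (symbol, [sample(grammar, s, depth + 1, max_depth) for s in rhs])

# e.g. sample(SCFG, START) with the grammar sketched earlier
```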

5. Implementation

In this section, we present one of our implementations of PRODIGY, which represents a significant extension to our earlier method, Program Evolution with Explicit Learning (PEEL) (Shan et al., 2003). We chose a specialized SCFG model for this implementation.

5.1 Grammar model

Our grammar model is a specialized stochastic parametric Lindenmayer system (L-system) (Prusinkiewicz and Lindenmayer, 1990). The reason we use this specialized SCFG is to balance the expressiveness of the grammar model against learning efficiency. This L-system is equivalent to a standard SCFG with two extra conditions on the LHS. The two conditions we introduce are depth and location, so the modified form of a production rule is X(d, l) → λ (P). The first parameter, the depth d, is an integer indicating that, when generating individuals, this rule may be applied only at level d. The second parameter is a location l, indicating the relative location at which this rule may be applied. P is the attached probability. From now on, matching the LHS does not mean merely matching the LHS symbol X, but also matching these two conditions.

Figure 4. Initial rules of depth 0

The location information is encoded as a relative coordinate in the tree, somewhat similar to the approach of (D'haeseleer, 1994). The position of each node in a rooted tree can be uniquely identified by specifying the path from the root to that node. A node's position can therefore be described by a tuple of n coordinates T = (b1, b2, …, bn), where n is the depth of the node in the tree and bi indicates which branch to choose at level i. For example, in Fig. 1.b, to reach the bottom rightmost node, from the root (with coordinate 0) we choose the 2nd, 2nd and 0th branches in turn. This gives us 0220. For any given problem, we have a primitive grammar to specify the search space, just as in Grammar Guided Genetic Programming (GGGP). The initial grammar of our implementation is very similar to this grammar, except that the set of rules is duplicated at each depth level. For example, rules 1, 2 and 3 in Fig. 1.a are changed into the rules illustrated in Fig. 4 to generate individuals at depth 1. Other rules at this level, and rules at other levels/depths, are defined in a similar way. The first condition, d = 1, indicates that these rules can only be used to generate the 1st level (d = 1) of the tree. The probabilities are initialized uniformly, and are then subject to update by the probability learning method. All locations are initialized to the value #, meaning "don't care" or "match any value". When necessary, the grammar is refined by the structure learning method, and location information is incrementally added.
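The sketch below illustrates such conditioned productions and the coordinate scheme; since the exact location-matching mechanism is not fully spelled out here, treating the relative location as a suffix of the node's coordinate path is an assumption made only for this illustration.

```python
# Sketch of the conditioned productions X(d, l) -> lambda (P) of Section 5.1.
# The Rule fields mirror the textual description; matching the relative location
# as a suffix of the node's coordinate path is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class Rule:
    lhs: str        # nonterminal symbol X
    depth: int      # rule may be applied only at this level d
    location: str   # relative location condition; '#' means "don't care"
    rhs: tuple      # right-hand side
    prob: float     # attached probability P

def node_coordinate(branch_choices):
    """Root is '0'; append the branch chosen at each level, e.g. (2, 2, 0) -> '0220'."""
    return "0" + "".join(str(b) for b in branch_choices)

def matches(rule, symbol, depth, coordinate):
    """LHS matching: the symbol and both conditions must match."""
    if rule.lhs != symbol or rule.depth != depth:
        return False
    return rule.location == "#" or coordinate.endswith(rule.location)

# a depth-1 copy of an initial rule with location '#' (cf. Fig. 4)
r = Rule("exp", 1, "#", ("exp", "op", "exp"), 1.0 / 3)
print(matches(r, "exp", 1, node_coordinate((2,))))    # True
```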

5.2 Algorithm

The high-level algorithm is sketched in Fig. 2; we need only add some details here. First, we set the size of the selected individual set G to 2: G contains the best individual in the current generation and the best individual found so far, also called the elite. Second, the population P is not maintained. Therefore, step 5 in Fig. 2 is simplified to replacing population P with G'.

5.3 Grammar model learning

SCFG grammar model learning has two parts: learning the structure and learning the parameters (the probabilities attached to the rules). The basic idea of probability learning is to increase the probabilities of the production rules which are used to generate good individuals.

We want to increase probabilities according to an individual id. Assume the rules involved in deriving individual id are R^id. The probability of each rule P(r_i), r_i ∈ R^id, is updated as

P(r_i) = P(r_i) + (1 − P(r_i)) × lr
inc(r_i) = inc(r_i) + (1 − P(r_i))

where lr is the pre-defined learning rate. P(r_i) is then renormalized among rules with identical LHS (including both the LHS symbol and the two conditions). inc(r_i) records the increment, which will be used later in structure learning.

The initial grammar will usually be minimal, representing the search space at a very coarse level. Then, when necessary, rules are added to focus the search at a finer level. This is grammar structure learning. The way we add new rules is by splitting existing rules, adding more context information. Once more context information has been added, one production can be split into several. The key idea is to split a production rule by using more contextual information, which is partially motivated by (Bockhorst and Craven, 2001). Two steps are needed in this splitting: 1) detecting the most overloaded left-hand side (LHS), and 2) splitting the most overloaded LHS.
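A sketch of this probability update, assuming rule objects that carry mutable prob and inc fields (as in the Rule sketch of Section 5.1); the grouping used for renormalisation follows the description above.

```python
# Sketch of the probability-learning update: every rule used in deriving a good
# individual is reinforced with learning rate lr, the increment is recorded in an
# `inc` field for later structure learning, and probabilities are renormalised
# within each group sharing the same LHS symbol and conditions. Rule objects are
# assumed to carry mutable `prob` and `inc` fields.
def reinforce(all_rules, used_rules, lr):
    for r in used_rules:
        gap = 1.0 - r.prob
        r.prob += gap * lr          # P(ri) <- P(ri) + (1 - P(ri)) * lr
        r.inc += gap                # inc(ri) <- inc(ri) + (1 - P(ri))
    groups = {}
    for r in all_rules:             # renormalise per (LHS symbol, depth, location)
        groups.setdefault((r.lhs, r.depth, r.location), []).append(r)
    for group in groups.values():
        total = sum(r.prob for r in group)
        for r in group:
            r.prob /= total
```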

Figure 5. Rule split.

The overloaded LHSs are those which are both highly influential and badly converged. Specifically, we calculate the burden of each LHS in the grammar; the LHS with the largest burden is the most overloaded LHS. The burden of LHS L is defined as:

The idea behind this measure is that if one LHS L keeps being rewritten using different rules, it suggests that there is a great degree of confusion about which rule should be chosen given LHS L. To reduce this confusion, we need to add more contextual information, which is represented by the relative location. By adding more context information, rules become distinguishable. Once the context information has been added, one production can be split into several. For example, the first rule in Fig. 4 can be split into three rules by adding more relative location information according to the grammar in Fig. 1.a, as illustrated in Fig. 5. Initially, the new rules each have the same probability as the original one. However, after further learning, i.e. probability learning, these three rules may acquire different probabilities. More specifically, in this example the first rule in Fig. 4 was originally applied with equal probability anywhere at depth 1 as long as the LHS symbol "exp" was matched, no matter what preceded it; in other words, it could not recognize its surrounding context. By adding relative context, this one rule becomes three, so that at different relative locations it may have different probabilities. This means that it is now possible to recognize the context to a certain extent. Note that the location information is added incrementally: each time, only the minimum location information is added. In this case, with the minimum relative location added, the rule now depends on its immediate parent. If this is not enough, in the next iteration of learning more relative location information may be added, so that the rule can recognize not only its immediate parent but also its more distant ancestors.
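A sketch of the splitting operation itself; deriving the candidate locations from the grammar is elided here, and the helper simply copies the rule once per supplied location, each copy inheriting the original probability.

```python
# Hedged sketch of the split in Fig. 5: the rule is copied once per candidate
# relative location, and every copy inherits the original probability, so that
# subsequent probability learning can drive the copies apart. How the candidate
# locations are derived from the grammar is left to the caller.
from dataclasses import replace

def split_rule(rule, candidate_locations):
    """`rule` is any dataclass with a `location` field (e.g. the Rule sketch above)."""
    return [replace(rule, location=loc) for loc in candidate_locations]
```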

As can be seen, our implementation is an incremental learning approach. It starts from a minimal initial L-system, a special kind of SCFG in this context, which describes the search space very roughly. Then, when necessary, rules are added to the L-system to focus the search at a finer level.

5.4 Grammar model mutation

A primary exploration mechanism is mutation. We want to explore the search space surrounding individuals, particularly good individuals. Therefore, we consider individuals in defining grammar mutation. Assume we want to perform mutation with respect to an individual id. R^id is the set of all rules involved in deriving id, and Q^id is the set of all rules which have an LHS identical to that of any rule in R^id. We mutate the rules in R^id ∪ Q^id. The mutation probability is pm, and nr is the total number of rules in R^id ∪ Q^id. There is a risk that the number of rules mutated may increase too fast as the size of individuals increases, so we make pm depend inversely on nr: pm = cm/nr, where cm is a pre-defined mutation rate. In this paper, the purpose of mutation is to allow the search to escape from the areas of previously learned knowledge, and thus explore different parts of the search space. Consequently, our mutation forces converged LHSs to partially lose their convergence, thus encouraging exploration. If a rule r_i is mutated, its probability P(r_i) is changed to

P(r_i) = P(r_i) + mi × (1 − P(r_i))

where mi is a pre-defined mutation intensity. P(r_i) is renormalized after mutation.
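A sketch of this mutation step, assuming the same mutable rule objects as in the learning sketch; renormalisation within each LHS group is performed afterwards as described above.

```python
# Sketch of grammar-model mutation (Section 5.4): each rule in R^id U Q^id is
# mutated with probability pm = cm / nr, and a mutated rule has its probability
# pushed towards 1 by the mutation intensity mi; probabilities are then
# renormalised per LHS group, e.g. with the grouping used in the learning sketch.
import random

def mutate(rules, cm, mi):
    nr = len(rules)                         # rules = R^id U Q^id
    pm = cm / nr                            # mutation probability per rule
    for r in rules:
        if random.random() < pm:
            r.prob += mi * (1.0 - r.prob)   # P(ri) <- P(ri) + mi * (1 - P(ri))
```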

5.5 Grammar model sampling

Grammar model sampling, i.e. generating individuals from the grammar, is basically consistent with SCFG sampling as described in Section 4.3. There is only one minor difference, in matching the LHS: in this implementation, matching the LHS includes matching both the LHS symbol and the two conditions on the left-hand side.

Table 1. Comparison of PRODIGY and GGGP

6. Empirical study

6.1 Function regression

Function regression is often used in the GP literature for evaluating the performance of an algorithm. In this work, the function to be approximated is adopted from (Salustowicz and Schmidhuber, 1997). The function is:

This complicated function was chosen to prevent the algorithm from simply guessing it. The following results compare PRODIGY and Grammar Guided Genetic Programming (GGGP). The results of PIPE on the same problem, from (Salustowicz and Schmidhuber, 1997), are also briefly mentioned. However, it is almost impossible to make a fair comparison with PIPE, because the results given in (Salustowicz and Schmidhuber, 1997) are in diagram format only, and in any case PIPE has ephemeral random constants while PRODIGY does not.

The fitness cases were sampled at 101 equidistant points in the interval [0, 10], and 100,000 program evaluations were allowed for both PRODIGY and GGGP, the same as in (Salustowicz and Schmidhuber, 1997). The other settings for PRODIGY were: population size 10, maximum depth 15, mutation rate cm = 0.4, and mutation intensity mi = 0.2. Structure learning happened every 10 generations, and the number of elitists was 1. The settings for GGGP were: population size 2000, maximum depth 15, crossover rate 0.9, mutation rate 0.1, and tournament size 3. Thirty runs were generated for both PRODIGY and GGGP. The errors (sum of absolute errors) of the best individual in each run are summarized in Table 1.

As can be seen from Table 1, when considering the mean and median statistics, PRODIGY obtains much better results than GGGP, although both the standard deviation and the range of results for PRODIGY are larger than for GGGP. We conducted a two-tailed Student's t-test to see whether the difference of means between these two systems is statistically significant, and found that it is, at p < 0.01. We also present the size, in terms of the number of terminal nodes (i.e. GP symbols; nonterminal grammar symbols are not counted), in Table 1. The difference in size is not statistically significant. It is interesting that PRODIGY found better solutions than GGGP did, yet the sizes of these solutions were similar. This experiment suggests that PRODIGY has stronger parsimony pressure than GGGP, but this pressure did not appear to hinder its ability to explore the search space. Comparing PRODIGY with the results of PIPE and GP in (Salustowicz and Schmidhuber, 1997), we find that PRODIGY is no worse than GP and PIPE in terms of the mean fitness of the best individuals, even though we use a smaller tree depth and have no ephemeral constants in PRODIGY. Overall, in terms of general accuracy on this specific problem, PRODIGY has roughly similar performance to PIPE.
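A sketch of the error measure used here, the sum of absolute errors over 101 equidistant points in [0, 10]; candidate and target are placeholders for the evolved program and the benchmark function, which is not reproduced in this text.

```python
# Sketch of the fitness used in Section 6.1: the sum of absolute errors over 101
# equidistant points in [0, 10]. `candidate` and `target` stand for the evolved
# program and the benchmark function respectively (both supplied by the caller).
def regression_error(candidate, target, n_points=101, lo=0.0, hi=10.0):
    xs = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]
    return sum(abs(candidate(x) - target(x)) for x in xs)
```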

6.2 Time series prediction

For this problem, we used the sunspot benchmark series for the years 1700-1979 (the data can be obtained from ftp://ftp.santafe.edu/pub/TimeSeries/data/). It contains 280 data points, divided into three subsets: the data points from the years 1700-1920 are used for training, while those from 1921-1955 and 1956-1979 are used for testing. The task of prediction is to map data points from the lag space (x_{t-δ}, …, x_{t-1}) to an estimate of the future value x_t, where δ is the order of the model, which is set to 5 in this work, and t is the time index.

Table 2. Comparison of PRODIGY and GP on sunspot prediction

The accuracy of the model is measured by the Average Relative Variance (ARV) (Weigend et al., 1992). Given a data set S, the ARV is defined as

ARV(S) = (1 / (σ² N)) Σ_{t ∈ S} (target_t − prediction_t)²

where target_t, or x_t, are the actual values, prediction_t, or x̂_t, are the predicted values, N is the total number of data points, and σ² is the variance of the total dataset (i.e. from the years 1700-1979). It is set to 0.041056, as in (Jäske, 1996).

Twenty-seven runs were generated using the same settings as in the function regression experiment, except that the number of generations was set to 600,000 to make the runs comparable with the GP runs in (Jäske, 1996), and the grammar structure learning interval was increased from 10 to 50 to reduce computation time. The results are summarized in Table 2. The GP results, which are statistics over ten runs, are adopted from (Jäske, 1996), a widely cited comparison work for GP on the sunspot data. The first row gives the statistics of all PRODIGY runs. The second and third rows are the top 80% of PRODIGY runs and of GP runs respectively, measured by training accuracy; the fourth row is a two-tailed Student's t-test of these two rows. The fifth and sixth rows are the accuracies of the top 60% of PRODIGY runs and of GP runs respectively, again followed by a two-tailed Student's t-test. The rightmost column, size, is the number of nodes in the best individuals. The reason we report the top 80% and 60% of runs is that these are the only comparison data reported in (Jäske, 1996).

As can be seen from Table 2, the mean ARV of PRODIGY on the training dataset and the two testing datasets is in each case smaller (i.e. better) than that of its GP counterpart. Most of the differences are statistically significant, as suggested by the p-values from our t-tests. Therefore, on this problem our implementation of PRODIGY, in general, outperforms conventional GP. We also notice that PRODIGY generalizes no worse than the GP system. Note that the GP in (Jäske, 1996) was carefully designed to alleviate overfitting, while neither overfitting control nor explicit parsimony pressure was used in PRODIGY.
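For concreteness, the ARV computation can be sketched as follows, with the total-dataset variance fixed to 0.041056 as above (following Jäske, 1996).

```python
# Sketch of the ARV measure of Section 6.2: mean squared prediction error divided
# by the variance of the total dataset.
def arv(targets, predictions, variance=0.041056):
    n = len(targets)
    return sum((x - p) ** 2 for x, p in zip(targets, predictions)) / (variance * n)
```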

7. Related work

The first work in this area was PIPE (Salustowicz and Schmidhuber, 1997), which we discussed earlier. PIPE differs from our system in two main respects. First, the probability at each node is independent of that at other nodes, while in our SCFG the model probability is a conditional probability with respect to the parent. Second, the probability sits on the tree structure and thus no common structure can be shared, i.e. PIPE is absolute position dependent. Subsequent to PIPE, a number of other authors have based distributional GP approaches on the PIPE prototype tree. Estimation of Distribution Programming (EDP) (Yanai and Iba, 2003) is an extension of PIPE. The PIPE prototype tree is still used, but the probability at each node is replaced by a conditional probability with respect to its immediate neighbouring nodes. Extended Compact Genetic Programming (ECGP) (Sastry and Goldberg, 2003) is a direct application of ECGA (Harik, 1999) to tree representation. The PIPE prototype tree is used but, instead of considering each node as an independent random variable, it estimates a probabilistic model, the Marginal Product Model (MPM). The MPM is formed as a product of marginal distributions on a partition of the nodes of the PIPE prototype tree; therefore it can represent the probability distribution over more than one node at a time. All these non-grammar EDAGP works are based on the prototype tree proposed in PIPE. In the original PIPE, the probability of each node does not depend on any neighbouring node. The extensions made in EDP and ECGP consider dependencies among nodes, in the form of either conditional or joint probabilities. The PIPE prototype tree does not have a palpable problem handling individuals with varying complexity.

However, we cannot see an obvious way for methods based on the prototype tree to handle position independence and modularity. AntTAG (Abbass et al., 2002) was an early attempt at using a grammar as a probabilistic model for EDAGP. In antTAG, the grammar structure is fixed, and only the attached probabilities are re-estimated in every iteration. This is like our implementation, but without structure learning. Due to the limited expressiveness of a fixed grammar, the search cannot go very far. We recently became aware of interesting work by Ratle and Sebag (Ratle and Sebag, 2001), which preceded our earlier work (Shan et al., 2003). This work is similar to our implementation in that an SCFG grammar is used for EDAGP, and both works use depth as an additional constraint. However, in (Ratle and Sebag, 2001) the grammar structure is fixed, so we expect it to suffer from the same problem as antTAG, namely bounded expressiveness. Subsequent to our work (Shan et al., 2003), Tanev (Tanev, 2004) and Bosman (Bosman and de Jong, 2004) independently proposed similar ideas, namely using a grammar for program evolution. Tanev added a fixed amount of learned context to the grammar, to make the grammar more specific so that some parts of the search space could be investigated more intensively. Bosman started with a minimal SCFG and then, by adding more production rules, made the grammar more specific.

8. Conclusion and future research

Estimation of the distribution of tree-form solutions, EDAGP for short, is an interesting research direction because it not only has the potential to improve GP performance, but also provides a new perspective for understanding GP. It is vital to find a proper model for this kind of research. Based on current GP research, we identified some important properties that a model should have for this purpose. A framework for EDAGP, called PRODIGY, was proposed. The core of this framework is a Stochastic Context-free Grammar (SCFG) model. We argued that the SCFG, a probabilistic model with a flexible structure constraint, has all the properties we identified as important for EDAGP. We presented an implementation of PRODIGY, and an experimental study confirmed the relevance of PRODIGY to problem solving.

Some issues are worth further investigation. PRODIGY is a general framework for estimating distributions of programs, and many kinds of grammar learning methods can be plugged in; it is worth investigating the pros and cons of these grammar learning methods in the context of PRODIGY. We have done some further research in this direction (Shan et al., 2004), but due to space constraints will present it in a future paper. Because knowledge of good solutions is learned and encoded in the grammar, extraction and study of the knowledge embedded in the learned grammar is an interesting research direction. Conversely, for real-world problems, domain experts usually have some level of background knowledge, and the grammar model used in PRODIGY provides a mechanism to incorporate it. By incorporating domain-dependent knowledge into PRODIGY, we would hope for superior performance.


Finally, PRODIGY has a built-in Occam's razor, which causes small individuals to be preferred. It is well known in machine learning that less complicated models generalize better than more complicated ones. Hence, we expect PRODIGY to perform particularly well when learning from noisy data.

References

Abbass, H. A., Hoai, N. X., and McKay, R. I. (2002). AntTAG: A new method to compose computer programs using colonies of ants. In The IEEE Congress on Evolutionary Computation, pages 1654-1659.

Angeline, P. J. and Pollack, J. B. (1994). Coevolving high level representations. In Langton, C. G., editor, Artificial Life III, volume XVII, pages 55-71, Santa Fe, New Mexico. Addison-Wesley.

Bockhorst, J. and Craven, M. (2001). Refining the structure of a stochastic context-free grammar. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001).

Bosman, P. and Thierens, D. (1999). An algorithmic framework for density estimation based evolutionary algorithms. Technical Report UU-CS-1999-46, Utrecht University.

Bosman, P. A. N. and de Jong, E. D. (2004). Grammar transformations in an EDA for genetic programming. In Special session: OBUPM - Optimization by Building and Using Probabilistic Models, GECCO, Seattle, Washington, USA.

Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. In Grefenstette, J. J., editor, Proceedings of an International Conference on Genetic Algorithms and the Applications, pages 183-187, Carnegie Mellon University, Pittsburgh, PA, USA.

D'haeseleer, P. (1994). Context preserving crossover in genetic programming. In Proceedings of the 1994 IEEE World Congress on Computational Intelligence, volume 1, pages 256-261, Orlando, Florida, USA. IEEE Press.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.

Harik, G. (1999). Linkage learning via probabilistic modeling in the ECGA. Technical Report IlliGAL Report No. 99010, University of Illinois at Urbana-Champaign.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.

Jäske, H. (1996). One step ahead prediction of sunspots with genetic programming. In Proceedings of the Second Nordic Workshop on Genetic Algorithms and their Applications (2NWGA), Vaasa, Finland.

Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.

Larrañaga, P. and Lozano, J. A. (2001). Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Montana, D. J. (1993). Strongly typed genetic programming. BBN Technical Report #7866, Bolt Beranek and Newman, Inc., 10 Moulton Street, Cambridge, MA 02138, USA.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature, PPSN IV, pages 178-187. Springer.

Pelikan, M. (2002). Bayesian optimization algorithm: From single level to hierarchy. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, IL. Also IlliGAL Report No. 2002023.

Prusinkiewicz, P. and Lindenmayer, A. (1990). The Algorithmic Beauty of Plants. Springer.

Ratle, A. and Sebag, M. (2001). Avoiding the bloat with probabilistic grammar guided genetic programming. In Collet, P., Fonlupt, C., Hao, J.-K., Lutton, E., and Schoenauer, M., editors, Artificial Evolution, 5th International Conference, Evolution Artificielle, EA 2001, volume 2310 of LNCS, pages 255-266, Le Creusot, France. Springer-Verlag.

Rosca, J. P. and Ballard, D. H. (1994). Genetic programming with adaptive representations. Technical Report TR 489, University of Rochester, Computer Science Department, Rochester, NY, USA.

Salustowicz, R. P. and Schmidhuber, J. (1997). Probabilistic incremental program evolution. Evolutionary Computation, 5(2):123-141.

Sastry, K. and Goldberg, D. E. (2003). Probabilistic model building and competent genetic programming. In Riolo, R. L. and Worzel, B., editors, Genetic Programming Theory and Practice, chapter 13, pages 205-220. Kluwer.

Shan, Y., McKay, R., Baxter, R., Abbass, H., Essam, D., and Nguyen, H. (2004). Grammar model based program evolution. In Proceedings of the Congress on Evolutionary Computation, Portland, USA. IEEE.

Shan, Y., McKay, R. I., Abbass, H. A., and Essam, D. (2003). Program evolution with explicit learning: a new framework for program automatic synthesis. In Proceedings of the 2003 Congress on Evolutionary Computation, Canberra, Australia.

Tanev, I. (2004). Implications of incorporating learning probabilistic context-sensitive grammar in genetic programming on evolvability of adaptive locomotion gaits of Snakebot. In Proceedings of GECCO 2004, Seattle, Washington, USA.

Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. (1992). Predicting sunspots and exchange rates with connectionist networks. In Nonlinear Modeling and Forecasting, pages 395-432. Addison-Wesley.

Whigham, P. A. (1995). Grammatically-based genetic programming. In Rosca, J. P., editor, Proceedings of the Workshop on Genetic Programming: From Theory to Real World Applications, pages 33-41, Tahoe City, California, USA.

Yanai, K. and Iba, H. (2003). Estimation of distribution programming based on Bayesian networks. In Proceedings of the Congress on Evolutionary Computation, pages 1618-1625, Canberra, Australia.

