Using Evolvable Regressors to Partition Data

Joseph A. Brown
Computer Science, University of Guelph
Guelph, Ontario, Canada N1G 2W1
[email protected]

Daniel Ashlock
Mathematics and Statistics, University of Guelph
Guelph, Ontario, Canada N1G 2W1
[email protected]
Abstract

This manuscript examines permitting multiple populations of evolvable regressors to compete to be the best model for the largest number of data points. Competition between populations enables a natural process of specialization that implicitly partitions the data. This partitioning technique uses function-stack based regressors and has the ability to discover the natural number of clusters in a data set via a process of sub-population collapse.
1 Introduction
Partitioning of data aims to break a data set into logical divisions based on a classification of its members. Using regressors as partitioning agents allows the partitions to be made via a model which is not specified a priori. Such partitioning without prior knowledge of the structure of the data set can inform the choice of which specific partitioning model would be useful for future data sets. It can also discover relations between variables which would impair standard techniques such as K-means [2, 6]. The only requirement of the system presented here is that the operations allowed in the regressors are sufficient to represent the data. Koza [4] showed how Genetic Programming (GP), a method of automatic equation or program generation, can be used to provide symbolic multiple regression via empirical discovery. Through this method, data sets with many dependent values can be described by equations of the independent variables, learned only from sampled data. The equations in GP are normally modeled as parse trees using a set of functional primitives, providing a single mathematical expression which models the entire data set. This study uses a variant of GP in conjunction with a multiple worlds model. Multiple populations are used. During fitness evaluation these populations are shuffled, and groups with one member of each population are evaluated to obtain a partial regression or partitioning regression. The groupings of one member of each population are the worlds. The intent is that each population will specialize in a subset of the data. The fitness of a given population member, or regressor, is gauged by the number of points
for which its error is the smallest in comparison to regressors from other populations. Evolution has many properties which make it an excellent technique for providing a partial regression. Here we exploit the natural idea of specialization in competition for resources as presented by Darwin. Specialization over the supply of food acts as a natural biological analogy to the goal of partitioning data. Darwin's finches [5] in the Galápagos Islands are an example of this process in his work On the Origin of Species [3]. These birds were found to be very close in most phenotypic traits; the major difference was the beak, which had specialized in order to allow different food types to be fully exploited. When one species of finch appeared on an island, it developed an intermediate beak type and exploited both food sources moderately efficiently. When two finch species appeared on an island, each developed a specialized beak type that permitted it to exploit a single food source efficiently. The separate populations used in this study are analogous to the finches; they do not breed with one another and so function as distinct species. Beak specialization in Darwin's finches is analogous to specialization of a population of regressors on a subset of the data being modeled. If there is the ability to specialize, we will see the equations diverge, each fitting a subset of the data and holding it for itself. If there is no need for specialization, then the regressors will be quite similar in their choice of modeled data. This will either lead to a fight between regressors to overfit a part of the data or, preferably, to one of the groups modeling a set of points which is inconsequential: a sub-population collapse. Sub-population collapse is a novel method for determining the required number of partial regressors.
This is not an extinction event for a species, which is impossible under an evolutionary computation model with a fixed number of members in each population, but it is an indication that the species has no place in the data-modeling ecology.
2 Methods

2.1 Function Stacks
This study uses a form of linear Genetic Programming, based on Cartesian Genetic Programming [7], known as Function Stacks [1]. The normal parse-tree structure is replaced by a directed acyclic graph. Each node takes as arguments either a value from the data set or the output of a node with a lower index. The backward links between calculations avoid re-performing a calculation that has already been made. The motivation for using this structure is that the data structure is linear and of a fixed size, removing the issue of program bloat. In addition, standard methods of crossover and mutation can be used without the modifications required for trees.
Name  Arity  Definition
neg   1      negates X
sel   1      scales X by the ephemeral constant
sqt   1      square root of X
sqr   1      squares X
sin   1      sine of X
cos   1      cosine of X
add   2      adds X and Y
sub   2      subtracts Y from X
mul   2      multiplies X with Y
dup   2      divides X by Y; if Y = 0 then 0 is the result
max   2      maximum of X and Y
min   2      minimum of X and Y
wav   2      weighted average of X and Y with the weight given by the ephemeral constant

Table 1: Operations Used in the Nodes of the Function Stacks
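As an illustrative sketch, a function stack built from operations like those in Table 1 can be evaluated in linear order; the node encoding and operator subset below are our own assumptions, not the authors' implementation:

```python
import math

# A node is (op, arg1, arg2, const). An argument is either ("x", i), input
# variable i, or ("n", j), the output of an earlier node j (j < this node's
# index), which makes the stack a directed acyclic graph.
def eval_stack(nodes, x):
    out = []
    def val(a):
        kind, idx = a
        return x[idx] if kind == "x" else out[idx]
    for op, a, b, c in nodes:
        X = val(a)
        Y = val(b) if b is not None else 0.0
        if op == "add":
            r = X + Y
        elif op == "mul":
            r = X * Y
        elif op == "dup":            # protected division: result is 0 when Y == 0
            r = X / Y if Y != 0 else 0.0
        elif op == "sel":            # scale X by the ephemeral constant
            r = c * X
        elif op == "sin":
            r = math.sin(X)
        else:
            raise ValueError(op)
        out.append(r)
    return out[-1]                   # the last node is the regressor's output

# f(x) = 2*x0 + x1 encoded as a two-node stack
stack = [("sel", ("x", 0), None, 2.0),
         ("add", ("n", 0), ("x", 1), 0.0)]
print(eval_stack(stack, [3.0, 4.0]))  # 10.0
```

Because every link points to a lower-indexed node, a single left-to-right pass evaluates the whole graph without recursion, and each shared sub-result is computed only once.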
The model of evolution used is tournament selection, which takes four members of the population, orders them by fitness, and replaces the bottom two with replicates of the top two subjected to crossover and mutation. The crossover is a two-point crossover which exchanges nodes. The mutation is chosen from three types probabilistically: mutate the operator on a node, change a single link between nodes, or change the ephemeral constant. The rate of change to the ephemeral constant was held steady at 40%, while the operator and link mutation rates were set to 20%/40%, 30%/30%, and 40%/20%. The process of evolution lasts for 1000 generations, which has been found to be more than enough for the population to converge to a steady state. The population size was varied in intervals of 20 members between 20 and 100. To allow use of the normal distribution in statistical tests, the number of replicates was set to 30. The number of nodes in each function stack was set to 20, though a function stack may not use all of the nodes provided. Table 1 shows the operations used within each function stack node in this study. These operations are good at providing a linear regressor. They obey the closure property due to the use of a protected division; therefore, every regressor created by a function stack is a well-defined function f : R^n → R.
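The size-four tournament described above can be sketched as follows; the flat-list genome and the identity placeholder for mutation are illustrative assumptions only:

```python
import random

def two_point_crossover(a, b):
    # Exchange the stretch of nodes between two cut points.
    i, j = sorted(random.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def tournament_step(pop, fitness, mutate):
    # Pick four distinct members, order them by fitness, and replace the
    # worst two with mutated crossovers of the best two.
    idx = random.sample(range(len(pop)), 4)
    idx.sort(key=lambda i: fitness(pop[i]), reverse=True)
    c1, c2 = two_point_crossover(pop[idx[0]], pop[idx[1]])
    pop[idx[2]], pop[idx[3]] = mutate(c1), mutate(c2)

# Toy usage: genomes are lists of integers, fitness is their sum,
# and mutation is a no-op stand-in for the paper's three mutation types.
random.seed(0)
pop = [[random.randint(0, 9) for _ in range(20)] for _ in range(20)]
for _ in range(100):
    tournament_step(pop, fitness=sum, mutate=lambda g: list(g))
```

Because only the two losers of each tournament are overwritten, the best member of the population can never be lost, giving the algorithm an implicit elitism.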
2.2 Multiple Worlds
The multiple worlds model is used in situations where agents with distinct but interacting roles must be evolved. Each population specializes in one of the roles. Here the roles are modeling different subsets of the data. If one data model is adequate then the system will produce a single population of high fitness individuals and several populations of low fitness individuals. If there is a natural partition of the data then different populations will specialize in modeling the distinct subsets of the data that make up the partition. This specialization is achieved by the choice of fitness function. In fitness evaluation, the populations are shuffled. The corresponding (first, second, ..., last) members of each population are grouped to form collections of regressors with one from each population. The error of each regressor on each data point is computed. The fitness of a regressor is the number of points on which it has the lowest error within its group.
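The fitness evaluation just described can be sketched as follows; the representation of regressors and the `predict` callback are our own assumptions, used only to make the competition-for-points mechanic concrete:

```python
import random

def world_fitness(populations, data, predict):
    """Shuffle each population, group the corresponding members into worlds,
    and award each regressor one fitness point per data point on which it has
    the lowest error within its world. `predict(regressor, x)` is assumed to
    return the model's output; each data point is a pair (x, y)."""
    for pop in populations:
        random.shuffle(pop)
    fitness = [[0] * len(pop) for pop in populations]
    for slot in range(len(populations[0])):           # one world per slot
        group = [pop[slot] for pop in populations]
        for x, y in data:
            errors = [abs(predict(r, x) - y) for r in group]
            winner = errors.index(min(errors))        # lowest error claims the point
            fitness[winner][slot] += 1
    return fitness

# Toy usage: two one-member populations of constant models competing
# for three points; the constant 0.0 is nearest to two of them.
pops = [[0.0], [10.0]]
data = [(None, 1.0), (None, 9.0), (None, 2.0)]
print(world_fitness(pops, data, predict=lambda r, x: r))  # [[2], [1]]
```

Note that fitness is purely relative: a regressor with a large absolute error can still score well if the other populations' members in its world are worse on those points, which is what drives the specialization.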
2.3 Data Sets
The data sets each sampled 100 points from each of three classes for a total of 300 data points. The first data set, DATAONE, is points sampled from three skew lines in six dimensions. DATATWO is points sampled from three skew lines in six dimensions. DATATHREE is points sampled from three concentric spheres in three dimensions. DATATHREE is a negative test case for the method: it is a data set for which the available operators are insufficient to model it successfully. The number of worlds was set to 6 whereas the data sets provided 3 classes of data. This provides a basis to see whether the system will undergo sub-population collapse or whether it will be prone to a fight to overfit the data.
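A skew-line data set of the DATAONE kind can be generated along these lines; the particular lines, the parameter range, and the absence of noise are illustrative assumptions, not the authors' construction:

```python
import random

def sample_skew_lines(lines, n_per_line, dim=6):
    """Sample points p + t*d along each line (p, d), with t uniform in
    [0, 1]. Each line contributes one labelled class."""
    data = []
    for label, (p, d) in enumerate(lines):
        for _ in range(n_per_line):
            t = random.uniform(0.0, 1.0)
            point = [p[i] + t * d[i] for i in range(dim)]
            data.append((point, label))
    return data

# Three pairwise skew lines in six dimensions (hypothetical choices).
lines = [([0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0]),
         ([5, 5, 5, 5, 5, 5], [0, 1, 0, 0, 1, 0]),
         ([0, 5, 0, 5, 0, 5], [0, 0, 1, 0, 0, 1])]
data = sample_skew_lines(lines, 100)   # 300 labelled points in three classes
```

The class labels are kept only for scoring the induced partition afterwards; the regressors themselves never see them.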
3 Results and Discussion
We use the Rand index [8] as our measure of the quality of the partitions resulting from running the evolutionary algorithm. It produces a real value in the interval [0, 1], where an exact match of the induced and actual partitions scores 1. The Rand index is calculated as follows. Let D be a set of data and let P and Q be two partitions of this data into classes. The Rand index comparing P and Q is the fraction of pairs from D that are either in the same class in both P and Q or in distinct classes in both P and Q.

The regressors in all cases are highly resistant to changes in the population size and mutation rates. Visible sub-population collapse events happened in DATAONE and DATATWO. The Rand index scores were in the 0.8 range, with the best replicate scoring 0.9 for DATAONE and 0.85 for DATATWO, as shown in Figures 1 and 2. Some of the produced regressors were modeling fewer than 10 of the data points, showing that the number of classes selected to model the data was too large, as expected. Even where the collapse did not happen within these data sets, the regressors created still kept like classes together; each of the three classes was usually modeled by two of the six regressors. While this is not the optimal situation, the removal of such duplicate regressors, in terms of the created equation, as a post-processing step would improve the Rand index score.

Figure 1: DATAONE results with a 95% confidence interval about the mean and best value, sorted by mean from smallest to largest. The data labels are read as P, the number in the population, and M, the mutation rates for operator, link, and constant changes.

DATATHREE did not have a prevalence of such events, reaching a steady optimum just above a Rand index of 0.6, which is only slightly better than a completely random division (Figure 3). The concentric spheres were given as a difficult set to model, as the operations provided are insufficient to fully model the data; they are meant for producing lines. In this instance the Rand index also suffers, as the regressors attempt to model the spheres by an approximation of multiple lines. As the Rand index tests the partitioning, even a set of lines which is a fair approximation of the sphere
would gain a low score, as the lines produced to model a single sphere would not share the same class. The addition of operations more suited to spheres, such as distance to a given point, would allow sufficient modeling of this data set.

Figure 2: DATATWO results with a 95% confidence interval about the mean and best value, sorted by mean from smallest to largest. The data labels are read as P, the number in the population, and M, the mutation rates for operator, link, and constant changes.
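The Rand index used as the quality measure can be sketched directly from its pairwise definition; this naive O(n^2) implementation, with partitions given as label lists over the same data, is our own illustration:

```python
from itertools import combinations

def rand_index(p, q):
    """Rand index of two partitions of the same data, each given as a list
    of class labels: the fraction of pairs of items that are grouped the
    same way (together in both, or apart in both)."""
    pairs = list(combinations(range(len(p)), 2))
    agree = sum((p[i] == p[j]) == (q[i] == q[j]) for i, j in pairs)
    return agree / len(pairs)

# The index only compares co-membership, so relabelling the classes
# does not change the score.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because only co-membership of pairs matters, the induced partition need not use the same class names as the true one, which is exactly what is needed when scoring an unsupervised partitioning.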
4 Conclusion
The experiments presented demonstrate that the system tested in this study can handle data sets based on a set of lines or planes in a large number of dimensions. The modeling is based on the natural idea of specialization of traits in a closed population to naturally partition data. Evolutionary methods allow for the modeling of such a phenomenon, permitting a partial regression or partitioning regression via a process of empirical discovery.
Figure 3: DATATHREE results with a 95% confidence interval about the mean and best value, sorted by mean from smallest to largest. The data labels are read as P, the number in the population, and M, the mutation rates for operator, link, and constant changes.

Sub-population collapse occurs when the number of regressors set by the user is larger than the number required. This is similar to a trait being fully removed from an ecosystem during evolution. However, in its current form it is not sufficient to automatically reduce the number of models where it greatly exceeds the necessary number of functions. In this case, multiple functions evolve which are essentially modeling the same set of data. While this still gives a good accounting of the data, it should be avoided, as it overfits sections of the data with similar lines. Actions will be taken to reduce and remove this effect in the future. One such technique would be to remove a world if the sub-population collapse effect is pronounced, returning its data points to the surviving world groups to model, which in turn may cause further sub-population collapse events. This would truly make sub-population collapse an extinction, as the world model is
in effect a species which is being removed due to poor fitness within the environment.
Acknowledgements

The authors thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for its support of this research.
References

[1] Daniel Ashlock. Training function stacks to play the iterated prisoner's dilemma. In Proceedings of the 2006 IEEE Symposium on Computational Intelligence and Games, pages 111–118, May 2006.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[3] Charles Darwin. On the Origin of Species: by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. Sterling, New York, illustrated edition, 2008.
[4] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
[5] David Lack. Darwin's Finches. Cambridge University Press, 1947.
[6] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.
[7] Julian F. Miller and Peter Thomson. Cartesian genetic programming. In Proceedings of the European Conference on Genetic Programming, pages 121–132. Springer-Verlag, 2000.
[8] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, December 1971.