AUTOMATIC SELECTION OF FEATURES FOR CLASSIFICATION USING GENETIC PROGRAMMING

Jamie Sherrah, Robert E. Bogner and Abdesselam Bouzerdoum
e-mail: {jsherrah|bogner|bouzerda}@eleceng.adelaide.edu.au
The University of Adelaide, South Australia 5005.
CSSIP: The Cooperative Research Centre for Sensor, Signal and Information Processing, SPRI Building, The Levels, South Australia 5095.
Abstract

Classifier design often involves the hand-selection of features, a process which relies on human experience and heuristics. We present the Evolutionary Pre-processor, a system which automatically extracts features for a range of classification problems. The Evolutionary Pre-processor uses Genetic Programming to allow useful features to emerge from the data, simulating the innovative work of the human designer. The Evolutionary Pre-processor improved the classification performance of a Linear Machine on two real-world problems. Although these problems are intuitively difficult to solve, the Evolutionary Pre-processor was able to generate complex feature sets. The classification results are comparable with those achieved by other classifiers.
1 Introduction
Much of Pattern Classification research seeks to construct general classification algorithms which are broadly applicable; examples are Multi-Layer Perceptrons (MLPs) and Decision Trees. Often these algorithms are dogmatically applied to arbitrary problems, and parameters are empirically selected. Experience has shown, however, that pattern classifiers are more successful when explicit knowledge is incorporated into their construction. The primary manifestation of explicit knowledge in Pattern Classification solutions is the selection of features. Given a set of d measurements of an object, x = [x1, x2, ..., xd], a transformation y = f(x) is selected such that the g features, y1, y2, ..., yg, bear more discriminatory information than the original measurements. The selection of features is often ad hoc, and proceeds through a combination of human experience, domain knowledge and trial and error. Selecting features by hand is problematic for the following reasons: the procedure must be manually performed for every new problem which is examined; there is little theory to guide the selection, except for the designer's intuitive ideas about which features are important [1]; for a difficult problem, there may be no intuitive understanding of the data; and the optimal number of features is not known.

This paper presents the Evolutionary Pre-processor (EPrep), a tool for the intelligent machine-selection of features. The difficulties of human feature-selection are addressed by applying standard techniques to every problem. The user is not required to understand the data. Instead, EPrep learns which combinations of the data are useful for classification. EPrep is especially useful for problems in which the measurements are intuitively difficult to combine.
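As a small, invented illustration (not from the paper) of why a transformation y = f(x) can bear more discriminatory information than the raw measurements: below, two classes are not linearly separable in the measurement space [x1, x2], but a single hand-crafted feature, the squared radius, separates them with one threshold.

```python
import math
import random

random.seed(0)

# Invented data: class 0 clusters near the origin, class 1 lies on a
# noisy ring of radius 3.  No linear boundary on [x1, x2] separates them.
inner = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(100)]
outer = []
for _ in range(100):
    a = random.uniform(0, 2 * math.pi)
    outer.append((3 * math.cos(a) + random.gauss(0, 0.2),
                  3 * math.sin(a) + random.gauss(0, 0.2)))

def f(x):
    """Feature transform y = f(x): one feature, the squared radius."""
    return x[0] ** 2 + x[1] ** 2

# A threshold on the single feature y1 = x1^2 + x2^2 now classifies the
# two classes almost perfectly, which is the leverage a good feature gives.
threshold = 4.0
correct = (sum(f(x) < threshold for x in inner)
           + sum(f(x) >= threshold for x in outer))
accuracy = correct / 200
```

The point is that the classifier downstream of f can stay trivially simple; EPrep's goal is to find such transformations automatically rather than by hand.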
2 Genetic Programming
Genetic Programming (GP), a variant of Genetic Algorithms, is a broadly-applicable search technique based on the Darwinian theory of natural selection [2]. A Genetic Program maintains a population of solutions to a problem. Each solution, or individual, is the parse tree for a functional expression. Each internal node of a tree is a function whose arguments are the children of the node. The leaf nodes, called terminals, are inputs to the functional expression. Each individual in the population has a fitness value which quantifies its ability to solve the given problem. The set of functions and terminals, and the fitness function, must be chosen according to the problem at hand.

The population evolves over a number of generations through the application of genetic operators. The GP begins with a population of randomly-generated individuals. Each generation, the old population is transformed to a new one using the primary genetic operators, reproduction and crossover. A genetic operator is chosen with some pre-specified probability, then individuals are selected stochastically from the old population, in proportion to their fitness, to be operands to the genetic operator. The results of the operation, called offspring, are placed in the new population. Because of the stochastic nature of GP, a number of runs must be made to obtain reliable results.

The reproduction operator copies an individual into the new population. Crossover takes two parent individuals, randomly selects a crossover point in each individual, and swaps the sub-trees below the crossover points to form two new offspring. Secondary operators, whose effects are not fundamental to the success of the GP, can be applied to individuals with some probability before they are placed in the new population. The mutation operator randomly selects a node in the individual and replaces the sub-tree below it with a new, randomly-generated sub-tree. Mutation allows large jumps in the search space, which makes the GP less susceptible to local optima. The editing operator is applied to every individual with some specified generational frequency. Editing replaces identities in the individuals with simpler statements. For example, functions with only constant inputs are replaced by the constant result of the function evaluation.

Although it may appear that the GP is performing a random search, the re-combination of partially-fit individuals to form potentially-fitter offspring enables the algorithm to intelligently search those regions of the search space with highest pay-off. The basic assumption for this behaviour is empirical credit assignment: the ability of a useful sub-tree in an individual to contribute to that individual's overall fitness [3]. Empirical credit assignment enables the GP to discover useful building blocks which proliferate in the population through crossover.

For the genetic operators to result in valid functional expressions, the set of functions and terminals must adhere to the closure property: each function must be able to take as its input any value or type returned by another function or terminal. Therefore some functions must be protected, e.g. 1/0 should return a sensible value.

3 Evolutionary Pre-processor

The Evolutionary Pre-processor is an interactive tool for the automatic selection of features. The problem addressed by EPrep is: given a data set containing measurement vectors and associated class labels, and given some classification algorithm which results in an error rate E, find a pre-processor which transforms the data such that the new error rate E' is significantly less than E. The pre-processors manipulated by EPrep are represented as functional expressions by the GP. The terminal set consists of the input measurements, x1, x2, ..., xd, and a set of basic constants, {0, 1, 1/2, 2, 1/3, 3, pi, e}. Each function takes as its input a real vector of arbitrary length l. The length of the output vector depends on the function. The function set consists of the arithmetic operations {+, *, pow, Y}. The + function outputs the sum of its inputs, and * outputs the product of its inputs. pow raises its first (l - 1) inputs to the power of its last input, producing a vector of length (l - 1). If the pow function has only one input, the result is the square of this input. The function Y is an output point. Each individual must have an output point at its root node. The input vector of a Y function is copied to its output, and is also placed in the feature set. Duplicate and constant features are removed after pre-processing by forming a checksum for each of the features.

A Generalised Linear Machine (GLIM) [1] was chosen as the classification algorithm used by EPrep. A GLIM consists of a single layer of winner-take-all neurons, one for each class. The GLIM was selected for the following reasons: it is computationally simple; it is insensitive to the initial conditions of the weights; non-linearly transforming the data and using the Linear Machine is equivalent to using a more complex classifier on the original data; and the GP is supposed to do all the hard work by selecting the features, so a simple classifier suffices.

Genetic Programming belongs to a class of algorithms which exhibit emergent intelligence: through the interaction of the problem solver and the task description, problem-specific information emerges automatically [3]. The task description is specified via the fitness function, and is therefore an integral part of the search algorithm. Through empirical credit assignment, problem characteristics which may be too subtle for the engineer to detect emerge naturally. This principle is exploited by EPrep to automatically extract features for a general classification problem. Note that EPrep is expected to be most useful for problems with disparate measurements. EPrep is not designed to solve problems involving structured data such as images or time series. Data samples are treated as points in d-dimensional space; for images, the spatial relationships between pixels are lost. Given the current limitations on computational power, it is unlikely that EPrep can "reinvent the wheel" by learning these inter-pixel relationships.

EPrep begins by reading the data from a file, and dividing it into training and test data. The test data are placed aside as a hold-out set; only the training data are used by EPrep. The training data are further divided into a training set and a validation set. The fitness of each individual is calculated as follows:

1. pre-process the training data using the individual;
2. use the training set to train the classifier;
3. classify the validation set using the trained classifier, and obtain the correct classification rate r_i = 100 * n_correct / n_validation (%);
4. use v-fold cross-validation, so that the validation set is rotated to form each of the v unique partitions of the training and validation sets, and repeat steps 2 and 3;
5. the fitness of the individual is the v-fold mean correct classification percentage, r = (1/v) * sum_{i=1}^{v} r_i.

The result of a run is the best individual encountered. When all runs are finished, the best-of-all-runs individual is chosen as EPrep's result for the problem.
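The five-step fitness calculation above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the names `individual`, `train` and `classify` are placeholders, where `individual(x)` applies the candidate pre-processor to one sample and `train`/`classify` stand in for the Generalised Linear Machine.

```python
def fitness(individual, X, t, v, train, classify):
    """v-fold cross-validated correct-classification rate (%) for one
    evolved pre-processor, following steps 1-5 in the text.
    X: list of measurement vectors, t: class labels, v: number of folds."""
    Z = [individual(x) for x in X]                 # step 1: pre-process
    folds = [list(range(i, len(Z), v)) for i in range(v)]
    rates = []
    for i in range(v):                             # step 4: rotate the folds
        val = folds[i]
        trn = [j for k in range(v) if k != i for j in folds[k]]
        model = train([Z[j] for j in trn], [t[j] for j in trn])  # step 2
        pred = [classify(model, Z[j]) for j in val]              # step 3
        correct = sum(p == t[j] for p, j in zip(pred, val))
        rates.append(100.0 * correct / len(val))   # r_i
    return sum(rates) / v                          # step 5: r = (1/v) * sum r_i
```

Rotating the validation fold means every training sample is scored exactly once, which makes the fitness less sensitive to any one lucky partition.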
4 Experimental Results

EPrep was run on two real-world data sets to determine whether it can evolve pre-processors to significantly improve classification performance. The two problems, diabetes and glass, were taken from the Proben1 benchmark database [4]. diabetes is a yes/no diagnosis of a group of Pima Indians. glass is taken from forensic samples of different types of glass. These data sets were not used to construct EPrep. The details of the problems are shown in Table 1. For both data sets, the first 50% of samples were used as the training set, the next 25% as the validation set and the last 25% as the test set.

Problem   samples   d   classes
diabetes  768       8   2
glass     214       9   6

Table 1: The two problems used to test EPrep.

The same GP parameters were used to solve both problems. A population of size 100 was evolved for 100 generations each run, and 10 runs were performed. The maximum tree depth was restricted to 12, and the maximum number of children for any node was limited to 3. Therefore the largest allowed tree contains 1 + 3^11 = 177,148 nodes. Each time the classifier was trained, the weights were initialised to zero. The reproduction, crossover, mutation and editing operators were used. Editing was performed every second generation.

To determine the success of EPrep, we compared the classification errors on the test set before and after pre-processing. Before EPrep began, the training data were used to train the classifier and the test set was classified to obtain the raw classification error rate E = 100 * n_incorrect / n_test (%). After EPrep finished, the individual with the highest score was used to pre-process all of the data. Using the pre-processed data, the training samples were again used to train the classifier, and the test set was classified to obtain the error rate after pre-processing, E'. The values of E and E' for both problems can be compared in Table 2. Using McNemar's significance test as described in [5], we found that E' < E with a very high confidence level for both problems. The statistical significance level of the test is shown for each problem in Table 2; this value is the probability that there is no real improvement, and E' < E by chance.

Problem   Raw E    EPrep E'   Significance
diabetes  36.5 %   21.9 %     0.00005
glass     72.2 %   48.1 %     0.00120

Table 2: Test set error rates before and after pre-processing.

The fittest individual produced by EPrep for the diabetes data set is shown in Figure 1. The features generated by this pre-processor are:

y1 = -2(x6 + e)(x2^2 + x2)
y2 = x7 + x1
y3 = (-2(x6 + e)(x2^2 + x2) - 3)^2

Figure 1: Best individual for the diabetes data.

The features selected by EPrep are intuitively difficult to interpret. This shows EPrep can find combinations of the input measurements which would not occur to a person. It is also instructive to note that, of the original eight data measurements, EPrep only uses four: [x1, x2, x6, x7]. These quantities are [Number of times pregnant, Plasma glucose concentration, Body mass index, Diabetes pedigree function]. EPrep found that the other four measurements are redundant, a piece of information which would be useful to a physician. Although the sizes of the trees were not explicitly encouraged to be parsimonious, EPrep evolved a tree which was much smaller than the maximum allowed size, and halved the number of features required. Recalling that smaller architectures generalise better, it appears EPrep discovered this rule of thumb for itself.

EPrep's result for the glass problem is shown in Figure 2. Although it appears the dimensionality of the feature space has been increased from 9 to 10, it has actually been reduced to 8, because x1 and x8 are included twice. Although some of the features are the same as the original measurements, others are complicated combinations of the data. Previous work has found little correlation between the data and the sample class for x6 and x7 [4], and yet these measurements are still used in the pre-processor. This does not necessarily mean x6 and x7 are useful, since the GP individuals often contain superfluous sub-expressions. This excess material could be removed by penalising large numbers of features and nodes through the fitness function.

These experiments were not intended to compare EPrep's classification performance with other algorithms. Nevertheless, comparable classification rates were achieved for these data sets using MLPs in [4]. For instance, an MLP classified the diabetes data set with mean test-set error rates over 60 runs of 24.57%, 25.91% and 23.06%, corresponding to three different random partitions. EPrep's mean error over 10 runs was 26.771%. Similarly, the MLP obtained error rates of 39.03%, 55.60% and 59.25% on three different permutations of the glass data set, while EPrep had an average test set error of 52.41%. These results must be compared with caution, because different permutations of the data were used by EPrep.

The success of empirical credit assignment in EPrep may be attributable to the output point functions. These allow useful sub-trees to contribute discriminatory features without having to be connected to the root node. The effect of output point functions on empirical credit assignment is an avenue for further investigation. Future work on EPrep will include the implementation of larger function and terminal sets, and testing on a broader range of classification problems.
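The McNemar comparison reported in Table 2 pairs the two classifiers' decisions on the same test samples. Reference [5] describes the exact procedure used; the sketch below assumes the standard one-sided exact (binomial) form of the test, which may differ in detail from that paper.

```python
import math

def mcnemar_p(n01, n10):
    """One-sided exact McNemar test.  n01 counts test samples the raw
    classifier got right but the pre-processed one got wrong; n10 counts
    the reverse.  Under the null hypothesis of no real improvement, each
    disagreement falls either way with probability 1/2, so the p-value is
    the binomial tail P(X >= n10) with X ~ Bin(n01 + n10, 1/2)."""
    n = n01 + n10
    return sum(math.comb(n, k) for k in range(n10, n + 1)) / 2 ** n
```

For example, if pre-processing fixed five previously misclassified test samples and broke none, mcnemar_p(0, 5) = 1/32, about 0.031. Only the disagreements matter; samples both classifiers handle identically carry no evidence either way.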
Figure 2: Best individual for the glass data.

5 Conclusion

A new, easy-to-use tool for feature selection has been presented. Using Genetic Programming, EPrep has been able to simulate human innovation: by combining useful building blocks through crossover, GP is able to produce increasingly better solutions. The system can discover complex functions which would not occur to a human designer: although EPrep is not aided by a priori knowledge, it is not biased by it either. EPrep is a step in the direction of machines which are able to make intelligent decisions, and also provides a standard methodology for the selection of features for classification. EPrep's advantages over existing methods are: the size and the architecture of the solution are determined automatically; the pre-processors provide more information about the problem than other classifiers; the features evolved are generally economical; and EPrep works with the original data, with no pre-processing required.

6 Acknowledgments

The Genetic Program was implemented using Matthew Wall's GAlib with tree genomes [6].

References

[1] N. J. Nilsson, The Mathematical Foundations of Learning Machines. Morgan Kaufmann Publishers, 1993.
[2] J. Koza, Genetic Programming. MIT Press, 1992.
[3] P. J. Angeline, "Genetic Programming and Emergent Intelligence," in Advances in Genetic Programming (K. E. Kinnear, Jr., ed.), ch. 4, pp. 75-97, MIT Press, 1994.
[4] L. Prechelt, "PROBEN1: A set of benchmarks and benchmarking rules for neural network training algorithms," Tech. Rep. 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, Sept. 1994. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.Z on ftp.ira.uka.de.
[5] A. Feelders and W. Verkooijen, "Which Method Learns Most From the Data?," Tech. Rep., University of Twente, Department of Computer Science, the Netherlands, January 1995. Anonymous FTP: /pub/doc/pareto/aistats95.ps.Z on ftp.cs.utwente.nl.
[6] M. B. Wall, "GAlib: A C++ Genetic Algorithm Library (ver. 2.4)." http://lancet.mit.edu/ga/, September 1995. Copyright 1994-95 Massachusetts Institute of Technology.