Genetic Algorithms For Classification and Feature Extraction

Min Pei,(1,2) Erik D. Goodman,(2) William F. Punch III(3) and Ying Ding(2)

(1) Beijing Union University, Beijing, China
(2) Case Center for Computer-Aided Engineering and Manufacturing
(3) Intelligent Systems Laboratory, Department of Computer Science

Michigan State University
Genetic Algorithms Research and Applications Group (GARAGe)
112 Engineering Building, East Lansing, MI 48824
Tel: (517)-353-4973, Fax: (517)-355-7516
e-mail: [email protected]

Abstract

In this paper we summarize our research on classification and feature extraction for high-dimensionality patterns using genetic algorithms. We have developed two GA-based approaches, both utilizing a feedback linkage between feature evaluation and classification. That is, we carry out feature extraction (with dimensionality reduction) and classifier design simultaneously, through "genetic learning and evolution." These approaches combine a GA with two different decision rules: the K-Nearest-Neighbor rule and a production rule. We show the effectiveness of these approaches on a series of artificial test data and on real-world biological data.

Keywords: genetic algorithms, classification, feature extraction, K Nearest Neighbor rule, production rule.
I. Introduction

In a decision-theoretic or statistical approach to pattern recognition, the classification or description of data is based on the set of data features used. Therefore, feature selection and extraction are crucial in optimizing performance, and they strongly affect classifier design. Defining appropriate features often requires interaction with experts in the application area. In practice, there is much noise and redundancy in most high-dimensionality, complex patterns. Therefore, it is sometimes difficult even for experts to determine a minimum or optimum feature set. The "curse of dimensionality" is a persistent problem in both statistical pattern recognition (SPR) and Artificial Neural Network (ANN) techniques. Researchers have discovered that many learning procedures lack the property of "scaling" -- i.e., these procedures either fail or produce unsatisfactory results when applied to problems of larger size [1]. To address this scaling problem, we have developed two approaches, both based on genetic algorithms (GA's). The basic operation of these approaches utilizes a feedback linkage between feature evaluation and classification. That is, we carry out feature extraction (with dimensionality reduction) and classifier design simultaneously, through "genetic learning and evolution." The objective of these approaches is to find a reduced subset of the original N features such that useful class discriminatory information is included and redundant class information and/or noise is excluded. We take the
following general approach. The data's original feature space is transformed into a new feature space with fewer features that (potentially) offer better separation of the pattern classes, which, in turn, improves the performance of the decision-making classifier. The criterion for optimality of the feature subset selected is usually the probability of misclassification. Since the number of different subsets of the N available features can be very large, exhaustive search is computationally infeasible and other methods must be examined. A number of heuristic techniques have been used, but it is not clear under what circumstances any one heuristic should be applied, as each has its good and bad points. Genetic algorithms are good candidates for this task, since GAs are most useful in multiclass, high-dimensionality problems where heuristic knowledge is sparse or incomplete. In order to apply a GA to classification/discrimination tasks, we combine the GA with two different decision rules: the K Nearest Neighbor (KNN) rule and a production rule.

1) GA/KNN hybrid approach -- a genetic algorithm using a K Nearest Neighbor decision rule. Here the GA maintains a population of weight vectors w*, where the size of each w* is the size of the data pattern x for each example. Each w* from the GA is multiplied with every sample's data pattern vector x, yielding a new feature vector y for the given data. The KNN algorithm then classifies the entire y-transformed data set. The results of the classification of a known "training" sample set are fed back as part of the fitness function, to guide the genetic learning/search for the best transformation. The GA result is the "best" vector w* under the (potentially conflicting) constraints of the evaluation function. For example, the "best" w* might be both parsimonious (using the fewest features) and accurate (giving the fewest misclassifications). Thus, the GA searches for an optimal transformation from pattern space to feature space that improves the performance of the KNN classifier [2].

2) GA/RULE approach -- a genetic algorithm combined with a production rule system. This approach is based on the general structure of a classifier system, and focuses on the classification of binary feature patterns. A single, integrated rule format is used. The inductive learning procedure of the GA evolves a rule classifier using a known "training" sample set. The result of training is a small set of "best" classification rules which classify the training set more accurately than other rule sets. By examining these "good" rules, one can determine the features which are most useful for classification. This provides information useful to domain experts for deriving new hypotheses.

The two methods described above were applied successfully to a series of artificial test data and to three high-dimensionality (96 or more features) biological data sets containing multiple classes. The results achieved indicate potential for application in many other fields. The methods can also be further developed by combining a GA with other classifiers, such as neural networks, to solve a fairly broad class of complex pattern recognition problems.
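To make the feedback linkage concrete, the sketch below (an illustration only, not the authors' implementation) shows a generic generational GA whose fitness function would be supplied by the classifier under design -- for example, one minus the training-set error rate of the KNN or rule classifier described above; all names and parameter values here are assumptions.

    import random

    def evolve(fitness_fn, n_genes, pop_size=50, generations=200,
               crossover_rate=0.8, mutation_rate=0.01):
        """Generic generational GA over real-valued gene vectors in [0, 1].
        fitness_fn maps a gene vector to a score to be maximized; in the
        approaches described here it would be 1 - (classifier error rate)
        measured on the training set."""
        pop = [[random.random() for _ in range(n_genes)] for _ in range(pop_size)]
        for _ in range(generations):
            scores = [fitness_fn(ind) for ind in pop]
            new_pop = []
            while len(new_pop) < pop_size:
                # Binary-tournament selection of two parents.
                i, j = random.sample(range(pop_size), 2)
                p1 = pop[i] if scores[i] >= scores[j] else pop[j]
                i, j = random.sample(range(pop_size), 2)
                p2 = pop[i] if scores[i] >= scores[j] else pop[j]
                c1, c2 = p1[:], p2[:]
                if random.random() < crossover_rate:     # one-point crossover
                    cut = random.randrange(1, n_genes)
                    c1[cut:], c2[cut:] = c2[cut:], c1[cut:]
                for child in (c1, c2):
                    for g in range(n_genes):             # pointwise mutation
                        if random.random() < mutation_rate:
                            child[g] = random.random()
                    new_pop.append(child)
            pop = new_pop[:pop_size]
        return max(pop, key=fitness_fn)

    # Toy use: a fitness function that simply drives every gene toward 0.5.
    best = evolve(lambda w: -sum((x - 0.5) ** 2 for x in w), n_genes=10,
                  generations=50)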
In the following sections, we introduce the concepts of feature selection and feature extraction and give a brief review of related research; discuss the basic model underlying our approaches; present the GA/KNN approach in more detail, together with the results of several experiments; discuss the GA/RULE approach and its test results; and finally offer some brief conclusions and areas for future study.
II. Feature Selection and Extraction

An excellent overview of classical feature selection and extraction is presented in the book by Devijver and Kittler [17]. An overview of more recent research is given by Siedlecki and Sklansky [5].
Pattern recognition typically consists of three stages and four spaces [3] [4]. First, the physical world is sensed by some transducer system, which maps raw data from the raw measurement space into the pattern space. The physical world can be represented by a continuum of raw data and is essentially infinite in dimensionality. The transducer describes a representation of that world with N scalar values, where N is typically quite large. Second, the pattern data can be converted by feature extraction into appropriate features which represent the patterns to be used in recognition, thus defining the feature space. The feature space dimensionality is postulated to be much smaller than the dimensionality of the pattern space. Third, a classifier or decision rule is used to separate the patterns into different classes. The process is summarized in Figure 1.

Figure 1. Pattern recognition and classification system: a transducer maps the (infinite-dimensional) raw data space of the physical world into an N-dimensional pattern space; a feature extractor maps the pattern space into an n-dimensional feature space; and a classifier maps the feature space into a classification space of K classes.

Feature selection and extraction are crucial in optimizing performance, and strongly affect classifier design. How many features, and what kind of features, should be used in classifier design can be a difficult problem. It is true that if the number of training samples is infinite, that is, the class-conditional density functions are completely known, then one cannot improve the performance (decrease the classification error) of a Bayesian classifier by eliminating a feature. This property is called monotonicity. However, in practice, because of the limited number of training samples and various anomalies in them, that assumption in the design of Bayes classifiers is almost never valid [5] [6]. Most practical pattern recognition systems employ a non-Bayesian decision rule, because the optimal Bayesian approach requires knowledge of the a priori densities and its complexity precludes the development of real-time recognition systems. Classifier performance based on a small number of training samples improves up to a point, then deteriorates as additional features are added to the classifier design. This peaking behavior in practical classifiers is due to the "non-optimal" use of added information, which tends to outweigh the advantage of the extra information [7]. The phenomenon is often referred to as the "curse of dimensionality." In practical classifier design, the removal of redundant and irrelevant features, which reduces the incidence of relevant features being masked by irrelevant ones, can increase the reliability of the classifier. This is especially important for high-dimensionality pattern recognition. Thus, defining the feature space for a problem is difficult. The use of too few features, or of non-representative features, can make the classification difficult, if not impossible; too many features not only can cause the "curse of dimensionality" but also increase the cost of data acquisition and the computational complexity. Therefore, for computational, performance, and cost reasons, extensive research has taken place on the development of efficient and reliable methods of feature selection and extraction in the design of pattern classifiers.

Feature Selection is the task of finding the "best" subset of size d of features from the initial N
features in the data pattern space. The criterion used to define the best subset is usually the probability of misclassification. Feature Extraction defines a transformation from pattern space to feature space such that the new feature set gives better separation of the pattern classes while reducing dimensionality. Feature extraction is thus a superset of feature selection: feature selection is the special case of feature extraction in which the transformation is the identity. The terms "feature selection" and "feature extraction" have sometimes been used interchangeably in the literature. The objective of feature selection is typically to minimize the overall cost of measurement acquisition. In feature extraction, the objective is to find the transformation, usually linear, which helps to eliminate irrelevant information and redundancy. Our approaches address both.

There are two common problems in feature selection and feature extraction, namely the choice of the "criterion function" and of the "search procedure". Early work on feature selection was mostly "parametric," in the sense that the criterion function was based on probabilistic measures of class separability such as the Bayes error probability, divergence, the Bhattacharyya distance, and entropies [5] [7]. The criteria used by these methods satisfy the monotonicity property, which permits the use of efficient computational techniques. However, their common shortcoming is that these criterion functions require a priori knowledge of the class distributions. It seems, therefore, that a promising and legitimate way of selecting features is to use the error rate of the classifier being designed. It is well known, for example, that the best features for a quadratic classifier may be worthless for a linear classifier; this also matches our intuitive understanding of design policy, although it has some theoretical drawbacks [5] [8] [9].

An exhaustive search procedure that determined the best subset of d features from the N available features would require examining all possible subsets of size d. The number of different subsets of size d among N available features is C(N, d) = N!/(d!(N-d)!), which can be very large, and the search problem is NP-hard, so exhaustive search is computationally infeasible. Several heuristics have been proposed to cut down the amount of computation, but these do not guarantee success. In 1963, the sequential backward elimination method, which uses the divergence distance as the criterion function, was introduced by Marill and Green [10]. The search method is suboptimal, and the knowledge of the probability distributions required by the criterion function has limited its use. In 1971, the sequential forward selection method, in which the criterion function uses a direct error estimation, was introduced by Whitney. This approach uses a greedy heuristic to build up a feature set: it first picks the best single feature, then picks the best pair of features subject to the constraint that the pair contain the best single feature, then the best triplet, and so on (a sketch of this procedure is given below). The procedure adds just one feature to the set at every stage of the search, and there is no way a feature can ever be removed from the feature set once selected. The subsets generated by this approach are therefore nested, and features that are powerful at early stages of the search can never be replaced by more powerful combinations of features.
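The following is a minimal sketch of Whitney-style sequential forward selection, written as an illustration only; the error estimator error_rate (for example, a leave-one-out KNN error) is an assumed, user-supplied callable.

    def sequential_forward_selection(n_features, d, error_rate):
        """Greedy forward selection: grow the feature subset one feature at a
        time, each time adding the feature that minimizes the estimated error.
        'error_rate' maps a tuple of feature indices to an estimated
        classification error (for example, a leave-one-out KNN error)."""
        selected = []
        remaining = set(range(n_features))
        while len(selected) < d and remaining:
            best_f = min(remaining,
                         key=lambda f: error_rate(tuple(selected + [f])))
            selected.append(best_f)   # once added, a feature is never removed,
            remaining.remove(best_f)  # so the generated subsets are nested
        return selected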
Obviously, since the two best features do not have to be the two individually best features, as was pointed out by Cover [11], this is also a suboptimal approach, which even has the potential to select the worst possible set of features [12]. In 1976, a similar technique was introduced by Stearns, in which a plus-p-take-away-r search was used instead of Whitney's sequential forward selection. Here, feature sets are evaluated using the probability of error as the criterion with respect to a particular classifier. Although the new algorithm avoids the nesting
property of the sequentially generated subsets and considers feature interactions in evaluating a set of features, it is still a suboptimal method [13] [14].

In 1977, a breakthrough came with the branch and bound algorithm proposed by Narendra and Fukunaga. This procedure makes use of a parametric criterion function J (such as the Bhattacharyya distance or the divergence) which is monotonic with respect to the feature subsets; i.e., the criterion function J used to evaluate feature subsets changes monotonically over a sequence of nested feature subsets {F1, ..., Fk}. That is,

F1 ⊂ F2 ⊂ ... ⊂ Fk  ⇒  J(F1) ≤ J(F2) ≤ ... ≤ J(Fk).

The branch and bound algorithm guarantees the selection of an optimal feature subset if the monotonicity condition is satisfied. A shortcoming of this approach is that such criteria are not necessarily related to the performance (or the error) of the classifier, particularly when the data are far from Gaussian and the class regions overlap [15] [16]. In 1985, the concept of approximate monotonicity was introduced by Foroutan and Sklansky. In their work they used the error rate of piecewise linear classifiers as a criterion function that is only mildly nonmonotonic. Consequently, they successfully used a modified branch and bound search, based on the classifier's error rate, to find the optimal feature subset [16]. The branch and bound approach gains search efficiency by pruning infeasible regions of the search space. It is estimated that branch and bound search reduces the time spent by exhaustive search by over 99% in typical cases. Even so, while branch and bound search is feasible for feature spaces of up to about 20 dimensions, it rapidly becomes impractical as this dimensionality is exceeded [16]. Thus large-scale, high-dimensionality feature selection remained a problem.

Feature extraction is a process of mapping the original measurements into more effective features. These mappings can be either linear or nonlinear. Only a few nonlinear methods have been reported in the literature, so we focus here on linear transformations. There are two widely used feature extraction methods: the principal component method and discriminant analysis. The principal component method uses the eigenvectors of the sample covariance matrix to define a linear transformation in which the features are uncorrelated and the mean-square error is minimized. This method may not be very good for classification purposes because it does not utilize the available category information. Discriminant analysis finds a projection into a feature space having fewer dimensions than the original pattern space; it defines a linear transformation in terms of the eigenvectors of SW^-1 SB such that the between-group scatter (SB) is maximized while the within-group scatter (SW) is held constant. All of the above methods are hard to apply effectively to high-dimensionality problems because of their high computational complexity [6].

Recently there have been surprising successes in attacking NP-complete problems in computer science using genetic algorithms. In 1989, Siedlecki and Sklansky [18] [19] began work on large-scale feature selection using genetic algorithms. Their results indicated that genetic algorithms are a powerful technique for optimal feature selection and are insensitive to nonmonotonicity of subset inclusion.
In their approach, the goal was to find "the smallest or least costly subset of features for which the classifier's performance does not deteriorate below a certain specified level." Essentially, they did this by constructing a GA chromosome consisting of a binary encoding whose length in bits equalled the number of features: if a bit was one, then the corresponding feature was included in the subset; otherwise it was not. The error rate of a classifier was used to measure performance, and a subset was defined as feasible if the classifier's error rate was below a so-called feasibility threshold. In this form, feature selection was a constrained optimization
problem. They found their approach to be as good as exhaustive search and better than Stearns' (p,q) search, branch and bound search, and others. The approach is much better than other heuristic methods on large feature sets (20 features or more), where the other methods begin to show combinatorial explosion and therefore poorer performance.

In this paper, we present research on methods for automatic feature selection and extraction using genetic algorithms carried out at the Genetic Algorithms Research and Applications Group (GARAGe, http://isl.cps.msu.edu/GA) of Michigan State University over the last four years. Our approach was inspired by the work first reported by Siedlecki and Sklansky [18] [19] and is partly similar to work independently pursued by Kelly and Davis [20].
III. Feature Extraction Using a GA

There are three versions of the problem of feature selection and feature extraction in the design of a pattern classifier, each version addressing a specific objective and leading to a distinct type of optimization. In one version, the objective is to find the smallest subset of features, or "optimal transformation," for which the error rate (or perhaps some other measure of performance) is below a given threshold. This version leads to a constrained combinatorial optimization task in which the error rate serves as a constraint and the number of features is the primary search criterion. The feature selection work of Siedlecki and Sklansky is an example of this version. In the second version of the problem, we seek an optimal transformation or subset that yields the lowest classification error. This version leads to an unconstrained combinatorial optimization in which the error rate is the search criterion for the optimal feature subset or optimal transformation. Our GA/KNN hybrid addresses this version. In the third version of the problem, the objective is to find an optimal transformation or feature subset that yields both the lowest error and the smallest subset. This version leads to a combinatorial multiobjective optimization problem. The GA/RULE approach addresses this version.

• Formal Definition of Feature Extraction

We can define the set of N features (types of measurements) representing a pattern in the pattern space as a vector x,

x = [x1, x2, ..., xN].

The goal of feature selection is to find the best subset of features y that optimizes a criterion function J(.), ideally the probability of misclassification. In other words, y = [y1, y2, ..., yn] satisfies

J(y) = min over all y' of J(y').

The goal of feature extraction is to yield a new feature vector of lower dimension by defining a mapping function M(.) such that y = M(x) optimizes the criterion function J(.). The result of applying M is to create y such that |y| ≤ |x| while increasing the separation of the pattern classes in the feature space defined by y. In other words, y = M(x) satisfies

J{M(x)} = min over all M' of J{M'(x)}.

In general, the map M(x) could be any vector function of x, linear or non-linear, but the
majority of existing methods of feature extraction restrict themselves to linear mappings, i.e., y = W^t x, where W is an (N × n) matrix of coefficients [17].

• Genetic Algorithms

In 1975, Holland [21] described a methodology for studying natural adaptive systems and designing artificial adaptive systems. It is now frequently used as an optimization method, based on an analogy to the process of natural selection in biology. The biological basis of the adaptation process is evolution from one generation to the next, based on the elimination of weak elements and the retention of optimal and near-optimal elements ("survival of the fittest"). References [21] [22] contain a theoretical analysis of a class of adaptive systems in which the space of structural modifications is represented by sequences (strings) of symbols chosen from some alphabet (usually a binary alphabet). Searching this representation space is performed using so-called "Genetic Algorithms" (GA's). The genetic algorithm is now widely recognized as an effective search paradigm in artificial intelligence, image processing, VLSI circuit layout, optimization of bridge structures, solution of non-linear equations, correction of test data with functional grouping, and many other areas.

In a genetic algorithm approach, a solution (i.e., a point in the search space) is called a "chromosome" or string [21] [22]. A GA approach requires a population of chromosomes (strings), each representing a candidate combination of features from the solution set, and a cost function (called an evaluation or fitness function) F(.), which calculates the fitness of each chromosome. The algorithm manipulates a finite set of chromosomes (the population), based loosely on the mechanism of evolution. In each generation, chromosomes are subjected to certain operators, such as crossover, inversion and mutation, which are analogous to processes occurring in natural reproduction. Crossover of two chromosomes produces a pair of offspring chromosomes which are syntheses of the traits of their parents. Inversion in a chromosome produces a mirror-image reflection of a subset of the positions on the chromosome. Mutation of a chromosome produces a nearly identical chromosome with only local alterations of some regions of the chromosome. The optimization process is performed in cycles called generations. During each generation, a set of new chromosomes is created using crossover, inversion, mutation and other operators. Since the population size is fixed, only the best chromosomes are allowed to survive to the next cycle of reproduction. (There is considerable variation among implementations of the genetic algorithm in the strictness with which the principle of "survival of the fittest" is applied. In some systems, fitness affects only the probability of survival, whereas in others, only the N most fit individuals are allowed to survive at each generation.) The crossover rate usually assumes quite a high value (on the order of 80%), while the mutation rate is small (typically 1-15%) for efficient search [23]. The cycle repeats until the population "converges" -- that is, all the solutions are reasonably similar and further exploration seems fruitless -- or until the answer is "good enough". Note that the number of generations required can be anywhere from hundreds to millions.

• Basic Concept of the Approach

In the design of automatic pattern classifiers, feature selection and extraction are crucial in optimizing performance, and strongly affect classifier design.
Ideally, the problem of feature selection and extraction on the one hand and the classifier design on the other should never be considered independently. Yet, for practical considerations, most researchers make the simplifying assumption that the feature selection/extraction stage and the classification stage are independent.
However, the ultimate goal is correct classification, and the intermediate step of feature extraction and dimensionality reduction is, in a sense, subservient to that goal rather than an end in itself. It would be better to couple feature selection/extraction with effective classification techniques. This implies some sort of classification-decision feedback mechanism to modify or adapt the feature extractor. Our research follows this direction. We have developed two approaches which both adopt this idea and are based on genetic algorithms (GA's). The basic operation of these approaches utilizes a feedback linkage between feature evaluation and classification. That is, we carry out feature extraction (with dimensionality reduction) and classifier design simultaneously, through "genetic learning and evolution," as shown in Figure 2. The objective of this approach is to find a reduced subset of the original N features such that useful class discriminatory information is included and redundant class information and/or noise is excluded. We take the following general approach. The data's original feature space is selected or transformed into a new feature space with fewer features that (potentially) offer better separation of the pattern classes, which, in turn, improves the performance of the decision-making classifier. The criterion for optimality of the feature subset selected is usually the probability of misclassification.

Figure 2. Feature extractor and classifier with feedback learning: the N-dimensional pattern space is mapped by the feature extractor into an n-dimensional feature space and then by the classifier into a classification space of K classes; genetic learning and evolution close the feedback loop from the classification results back to the feature extractor.

We combined the GA with two different approaches: the K Nearest Neighbor decision rule and a production decision rule. GA/KNN uses GA learning and evolution to find a best transformation through the evaluation of a KNN decision rule. GA/RULE uses GA learning and evolution to find a set of rule classifiers from which an optimal feature set can be drawn. We introduce these two approaches in detail in the following sections.
IV. GA/KNN Hybrid Approach

In the GA/KNN hybrid approach, as mentioned above, a genetic algorithm using a K Nearest Neighbor decision rule searches for a "best" diagonal matrix [W], or weight vector w* -- i.e., a "best" transformation from pattern space to feature space for improving the performance of the KNN classifier [2] -- under the potentially conflicting constraints of the evaluation function. The overall structure of the GA/KNN hybrid approach is shown in Figure 3.

• Different Transformations Using a GA

Based on the above discussion of a GA approach to feature extraction, we formally describe
the problem by defining the mapping function M(.) and the criterion function J(.).

Figure 3. The structure of the GA/KNN approach: an initial population of random weight vectors w* is evaluated by applying y = M(x) = Wx to the training data set and classifying the result with the KNN rule; if the stopping criterion is not satisfied, GA evolution (selection, crossover, mutation) produces a new population, and the cycle repeats until a best set of weights w* is obtained.

We define w*, the weight vector, to be a vector of length N = |x| (or, equivalently, an N×N diagonal matrix [W]) generated by the GA, which multiplies each pattern vector x to yield the new feature vector y. That is:

(1) linear transformation:

y = M(x) = Wx, where x = [x1, x2, ..., xN], W = diag(w1, w2, ..., wN), or equivalently w* = [w1, w2, ..., wN].
• wi ∈ {0,1}, 1 ≤ i ≤ N. When we use only binary values for w*, we interpret them as follows: if the ith weight component is one, then the ith feature is preserved in the feature space; otherwise, the feature is discarded from the feature set. Thus we perform feature selection to define an optimal subset and (potentially) reduce the dimensionality of the original data pattern. Siedlecki and Sklansky's chromosome can be viewed as an example of such a 0/1 weighting of feature importance.

• wi ∈ [a, b], e.g., wi ∈ [0.0, 10.0], 1 ≤ i ≤ N. We can also do feature extraction by allowing the values of the weight components to range over an interval [a, b], such as 0.0 to 10.0 (presuming that the features are first "normalized" to some standard range). That amounts to searching for a relative weighting of
features that gives optimal performance on classification of the known samples. Thus we are selecting features in a linearly transformed space. Those weight components that move toward 0 indicate that their corresponding features are not important for the discrimination task; essentially, those features "drop out" of the feature space and are no longer considered. Any feature whose weight moves toward the maximum indicates that the classification process is sensitive to changes in that feature: that feature's dimension is elongated, which probably reflects the fact that an increased separation between the classes gives better resolution. The resulting weights thus indicate the usefulness of a particular feature, sometimes called its discriminatory power. Usually, only feature selection can reduce the overall cost of measurement acquisition, since feature extraction in general utilizes all the information in the pattern representation vector x to yield a feature vector y of lower dimension; in the case above, however, the simple diagonal transformation certainly can reduce the cost of measurement.

(2) non-linear transformation, for example:

y = M(x) = Wx*, where x* = [x1, x2, ..., xN, (xi·xj)1, (xn·xm)2, ..., (xr·xs)k], W = diag(w1, w2, ..., wN+k), or equivalently w* = [w1, w2, ..., wN, wN+1, ..., wN+k].
The linear transformation above can mainly be used on data patterns whose classes are linearly separable after scaling or other linear operations. However, linear transformations are not necessarily sufficient for effective feature selection and extraction, so an extended form of the above linear transformation has been proposed. We investigated the use of non-linear transformations to address the feature extraction and subsequent classification of data patterns that have correlations between different feature fields. This approach is based on a priori knowledge of possible forms of the relationship among the features and the use of a GA to discover which of those forms best captures the relationship (a sketch of such an augmented transformation is given below). The various combinations (xi·xj)1, (xn·xm)2, ..., (xr·xs)k in x* are obviously problem-dependent. In our experiments we tested an example data pattern which consists of 15 features, no single one of which is discriminatory by itself, but which contains two combinations of features that are discriminatory. The search problem is NP-hard; even for moderate numbers of features (say, between 15 and 30), it has ordinarily had to be solved with the aid of suboptimal heuristic methods, which by definition do not ensure that the result is optimal. Genetic algorithms are good candidates for this task, since GAs search from a population, not a single point, and discover new solutions by speculating on many combinations of the best partial solutions (called building blocks) contained within the current population. GAs are most useful in multiclass, high-dimensionality problems in which guarantees on the performance of heuristics are sparse or inadequate; GA's are considered to be global search methods.
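As an illustration only (not the authors' code), the following sketch builds the augmented vector x* from a list of a priori candidate feature pairs and applies a diagonal weighting; the pair list and the weights are assumed inputs.

    def augment_and_weight(x, pairs, weights):
        """Non-linear transformation y = W x*, where x* appends the products of
        a priori chosen feature pairs to the original pattern vector x.
        x       : list of N feature values
        pairs   : list of k (i, j) index pairs whose products form new features
        weights : list of N + k weight components (the diagonal of W)"""
        x_star = list(x) + [x[i] * x[j] for (i, j) in pairs]
        assert len(weights) == len(x_star), "need one weight per augmented feature"
        return [w * v for w, v in zip(weights, x_star)]

    # Example: 4 original features, products of fields (0,1) and (2,3) appended.
    y = augment_and_weight([0.2, 0.5, 0.9, 0.4], pairs=[(0, 1), (2, 3)],
                           weights=[1, 1, 1, 1, 10, 10])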
• Several Problems Encountered in Running GA's

1. Chromosome Encoding. For a typical feature selection problem, wi ∈ {0,1}, 1 ≤ i ≤ N, the chromosome encoding requires a single bit for each wi, a component of the weight vector w*. For a typical feature extraction problem, wi ∈ [0, 10], 1 ≤ i ≤ N, the weight's resolution is determined by the number of bits used; for example, 8 bits per feature weight gives 256 values between 0 and 10. For our experiments, the weights were mapped exponentially to the real range [0, 10). This mapping was done to allow "fine tuning" below 1.0, and to allow mutation to have a similar effect throughout the range of weights. Also, if a component wi of the weight vector fell below a set threshold value, then wi was effectively set to 0 for the evaluation. This is an empirical modification, based on the observation that such weights were indeed quite low and apparently not contributing to the classification, though this is again problem-dependent.

2. Normalization of Training Datasets. Since the KNN evaluation rule is heavily affected by scaling, we scaled all features to a unit standard range. Generally, the training data are prenormalized to some range such as [0, 1] so that each feature has equal weight at the outset. The best result can then be interpreted in terms of the relative importance of the features to the classification.

3. Evaluation Criteria -- Fitness Functions. In our approach we define a classifier as a function from pattern space to feature space and then to classification space, P → F → C, i.e., G: X × W → Y → K, where X is the pattern space, W is a set of weights (parameters) of the classifier, Y is the feature space and K is a set of class labels. We assume our classifier is trainable with respect to a criterion function J: W → R, where R is the set of real numbers; that is, we can find a weight vector w* ∈ W such that J(w*) is a minimum. In [5], G(., w*) denotes such an optimally trained classifier. A classifier is called a scalable classifier if, by imposing a certain value on the parameter vector w, we can satisfy the following condition: for any two pattern vectors that differ only in the i-th feature, x = [x1, x2, ..., xi, ..., xN] and x' = [x1, x2, ..., xi', ..., xN] with xi ≠ xi' (i.e., the i-th feature is effectively discarded), G(x, w) = G(x', w). In fact, many known classifiers, including linear, piecewise linear and quadratic classifiers, are scalable, and a KNN classifier can be transformed to satisfy the definition of a scalable classifier as well. In an appendix of their paper [5], Siedlecki and Sklansky proved that if a classifier is scalable, then its error rate satisfies the monotonicity condition, provided an optimum training procedure is used to minimize the error rate of the classifier. For the GA/KNN experiments, we used a GA to train a KNN scalable classifier and obtain a vector of optimal parameter weights (the weight vector) w*. Some of these parameters are responsible for amplifying or reducing the influence of each feature (i.e., they weight the features in the KNN classifier). Hence, by comparing these parameters, we can deduce the discriminatory power of each feature. The evaluation criterion function J(.) we used is the error rate of the classifier being designed, i.e., the performance of the KNN in properly classifying exemplars into pattern classes. Each weight vector w*, or its corresponding diagonal matrix [W], is used to create the new feature vector y as described above.
The KNN is then run in this new feature space and the number of misclassifications is noted. The evaluation J(w*) of a weight vector is thus obtained by applying the KNN rule to the whole set of feature vectors y. The w* that gives the fewest misclassifications
is considered the "optimal" transformation of the feature space. The probability of misclassification is estimated from the error rate on actual test data for a finite set of training samples. Since the KNN rule is known to react strongly to undersampling, we used a leave-one-out strategy for testing small problems. The leave-one-out estimate of the error rate is nearly unbiased, but it has a large variance. Recall that the reliability of the error rate estimate is a function of the number of samples: the larger the number of samples, the greater the reliability (the smaller the variance) of the estimate [6] [7].

We have experimented with a number of fitness functions based on the above criterion function. The simplest is shown in equation (1):

Fitness = J(w*) = (TotPats - CorrectPats) / TotPats     (1)

where TotPats is the total number of data patterns to be examined and CorrectPats is the number of patterns correctly classified in the transformed space with a feature subset of reduced dimensionality. While this function gives reasonable performance, it does not generate a weight vector that gives maximum class separation. Instead, it only gives a vector that optimizes separation based on the threshold for rating each exemplar as a member of one class or another. Equation (2) takes both into account, giving a better optimization:

Fitness = J(w*) + T = γ (TotPats - CorrectPats) / TotPats + δ (nmin / K) / TotPats     (2)

where nmin is the cardinality of the near-neighbor minority set (i.e., the number of nearest neighbors that were not used in the subsequent determination of class) and K is the number of nearest neighbors. This gives credit not only for having examples correctly classified by a majority of the nearest neighbors, but also for driving the size of the majority toward the highest value possible (K). The constants γ and δ are used to tune the behavior of the algorithm; they are currently set to 2 and 0.2 respectively.
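A minimal sketch (not the authors' implementation) of evaluating one chromosome under equation (2): 8-bit genes are decoded exponentially into weights in [0, 10), small weights are zeroed, the training patterns are scaled by the weights, and a leave-one-out KNN vote supplies both the error term and the minority-neighbor term. The exact exponential mapping, the threshold value, and the summation of nmin over the training patterns are assumptions made for illustration.

    from collections import Counter
    import math

    def decode_weights(bits, n_features, bits_per_gene=8, wmax=10.0, threshold=0.1):
        """Decode a 0/1 chromosome into feature weights in [0, wmax), mapped
        exponentially so values below 1.0 can be finely tuned; weights under
        'threshold' are zeroed, effectively dropping the feature."""
        weights = []
        for f in range(n_features):
            gene = bits[f * bits_per_gene:(f + 1) * bits_per_gene]
            frac = int("".join(str(b) for b in gene), 2) / float(2 ** bits_per_gene)
            w = (wmax + 1.0) ** frac - 1.0      # assumed exponential mapping
            weights.append(0.0 if w < threshold else w)
        return weights

    def fitness(bits, patterns, labels, k=5, gamma=2.0, delta=0.2):
        """Equation (2), lower is better: gamma*(errors/TotPats) +
        delta*(nmin/K)/TotPats, with nmin summed over the leave-one-out
        evaluation of every training pattern."""
        w = decode_weights(bits, len(patterns[0]))
        y = [[wi * xi for wi, xi in zip(w, x)] for x in patterns]
        errors, minority = 0, 0
        for i, yi in enumerate(y):              # leave-one-out: skip pattern i
            dists = sorted((math.dist(yi, yj), labels[j])
                           for j, yj in enumerate(y) if j != i)
            votes = Counter(label for _, label in dists[:k])
            predicted, majority = votes.most_common(1)[0]
            if predicted != labels[i]:
                errors += 1
            minority += k - majority            # neighbors not in the majority
        tot = len(patterns)
        return gamma * errors / tot + delta * (minority / k) / tot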
V. Test Results of the GA/KNN Hybrid Approach

1. Artificially Generated Datasets

We used Equation (2) as the fitness function and a population in which each string used 8 bits to represent each weight component wi of the weight vector w*. The actual length of w* (the chromosome) depended on the number of features to be encoded. All work reported in this section was done using either the GAucsd 1.4 program [24] or modifications to that program, and a KNN written locally, with K = 5. Our first work used several artificially generated data sets, of which Table 1 is typical. The table represents a template from which we could generate as many examples as we needed. The expression {0.0 - 1.0} represents uniform random values generated in the range 0.0 to 1.0. An expression such as 0.5+5*[f3]^2 is a generation expression, where [f3] means that the present value of the f3 field is used in the calculation. Thus fields f1 and f7-f15 represent uniform random noise generated in the range {0.0 - 1.0}, while f2-f6 are fields that can be used to distinguish the eight classes A-H, where fields f4-f6 make use of the value generated for f3. Note that using uniform random values up to contiguous field boundaries (in contrast to normally distributed values with a specified mean and variance) yields a dataset much harder for a KNN to classify than (for example) for
an algorithm based on discrete thresholds. The encoding string (chromosome) length for the GA for this data set was 15×1 = 15 bits for feature selection and 15×8 = 120 bits for feature extraction. The population size used was 50. The GA was run with a crossover rate of 0.60~0.80 and a mutation rate of 0.001.

class   field 1   field 2   field 3   field 4    field 5       field 6         fields 7~15
A       0.0~1.0   0.0~1.0   0.0~0.1   0.2+[f3]   0.5*[f3]^2    0.5+5*[f3]^2    0.0~1.0
B       0.0~1.0   0.0~1.0   0.0~0.1   0.2+[f3]   0.5*[f3]^2    1.0-25*[f3]^2   0.0~1.0
C       0.0~1.0   0.0~1.0   0.1~0.2   0.3-[f3]   0.5*[f3]^2    0.5+5*[f3]^2    0.0~1.0
D       0.0~1.0   0.0~1.0   0.1~0.2   0.3-[f3]   0.5*[f3]^2    1.0-25*[f3]^2   0.0~1.0
E       0.0~1.0   0.1~0.2   0.0~0.1   0.2+[f3]   0.5*[f3]^2    0.5+5*[f3]^2    0.0~1.0
F       0.0~1.0   0.1~0.2   0.0~0.1   0.2+[f3]   0.5*[f3]^2    1.0-25*[f3]^2   0.0~1.0
G       0.0~1.0   0.1~0.2   0.1~0.2   0.3-[f3]   0.5*[f3]^2    0.5+5*[f3]^2    0.0~1.0
H       0.0~1.0   0.1~0.2   0.1~0.2   0.3-[f3]   0.5*[f3]^2    1.0-25*[f3]^2   0.0~1.0

Table 1: Artificially generated dataset (generation template)

# of unknown patterns   Errors, normalized KNN   Errors, 0/1 weighted KNN   Errors, 0~10 weighted KNN   Errors, 0~10 weighted KNN after selection
1200                    35.93%                   6.4%                       4.13%                       3.0%

Type of w*     f1      f2      f3      f4      f5      f6      f7-f15
wi ∈ {0,1}     0       1       1       1       0       1       0
wi ∈ [0, 10]   0.000   5.154   4.656   7.622   0.000   4.248   0.000

Table 2: Results on the artificially generated dataset

The GA was "trained" on a set of 30 examples of each of the 8 pattern classes, for a total of 240 examples. The GA was run on these 240 exemplars until the population converged. The "best" w* was then evaluated on a set of 1200 new examples generated from the same template. Two typical such w*, for the {0,1} and [0, 10] ranges, are shown in Table 2. Note that the irrelevant features were removed (set to 0) by both feature selection and feature extraction. The performance of the discovered w* on classification of the unknown set is also shown in Table 2. The percentages listed are averages of 5 runs and compare a number of variations on the basic approach. Note that in classifying the training data, the KNN evaluation function was modified so as not to use the point being tested as its own "neighbor". This "leave-one-out" strategy was used to make the training data resemble the test data more closely, since a test datum will not have itself as a near neighbor during KNN evaluation. Note the distinct improvement of the transformed space over the standard and normalized spaces using a KNN alone. Further note that the results of selection (wi ∈ {0,1} weighting) and extraction (wi ∈ [0, 10] weighting) are quite similar. The results of the 0-10 weighting would have been better, but they appeared to settle at a local minimum. This was likely
due to the small signal/noise ratio of the artificially generated, uniformly random dataset. This is supported by an experiment in which we ran the 0/1 weighting as a feature-selection filter, followed by the 0-10 weighting, and obtained a much better result (3% error). We have also compared our approach with Whitney's method (estimation of the k-NN classification error by the leave-one-out method, with sequential forward selection) on this artificially generated data set. The results of that method are quite poor: the classification error rate was 60% for the 4-class data set and 95% for the 8-class data set. This suboptimal method is essentially useless in this case.

2. A Real-World Problem -- Biological Pattern Classification

Having established the usefulness of GA feature selection and extraction on various test data, we explored the classification of several real-world datasets exemplifying high-dimensionality biological data. Researchers in the Center for Microbial Ecology (CME) of Michigan State University have sampled various agricultural environments for study. Their first experiments used the Biolog test as the discriminator. Biolog consists of a plate of 96 wells, with a different substrate in each well. These substrates (various sugars, amino acids and other nutrients) are assimilated by some microbes and not by others. If the microbial sample processes the substrate in a well, that well changes color, which can be recorded photometrically. Thus large numbers of samples can be processed and characterized based on the substrates they can assimilate. Each microbial sample was tested on the 96 features provided by Biolog (for some experiments, extra taxonomic data were also used); the value of each feature is either 0 or 1. Using the Biolog test suite to generate data, three test sets were used to show the effectiveness of our approach.
Rhizosphere Data Set
Soil samples were taken from three environments found in agriculture: 1) near the roots of a crop (rhizosphere), 2) away from the influence of the roots (non-rhizosphere), or 3) in a fallow field (crop residue). There are 300 samples in total, 100 samples per area.
2,4-D Data Set
Soil samples were collected from a site that is contaminated with 2,4-D (2,4-dichlorophenoxyacetic acid), a pesticide. There are 3 classes, based on three genetically similar microbial isolates which show the ability to degrade 2,4-D. There are a total of 232 samples.
Chlorinated Organic Data Set
Water samples were collected from the wastewater treatment output of a bleached kraft mill and from the river that supplies the mill. There are 4 classes: river, mill clarifier, mill lagoon, and mill pond -- totalling 168 samples.

The questions to be asked using these three experiments were twofold:
1. Classification. Whether the samples from different environments could be distinguished (into the 3 environments for the rhizosphere data, 3 classes for the 2,4-D data, and 4 locations for the chlorinated organic data).
2. Identification. Which of the available features are most important for the discrimination, and which act primarily as "noise" -- that is, as non-contributing features.
There are several characteristics of this type of problem:
1. High dimensionality. The feature space is quite large and therefore computationally expensive for traditional approaches.
2. Noise. The data are very noisy; the microstructure of each sample is extremely heterogeneous.
3. An understanding of the relative importance of the features for discrimination is important to researchers trying to distinguish the different bacterial communities.

We applied our GA/KNN hybrid method to the three types of biological data sets described above. Each data element was a 96-bit or 117-bit vector, in which a positive result was represented by a 1 and a negative result by a 0. In other words, the data elements form dichotomous measurement spaces, or binary pattern spaces. When doing feature selection, the chromosome was, correspondingly, a 96- or 117-bit binary vector, indicating whether each feature was used or not. When we used a GA to search for a best transformation weight vector, the string (chromosome) was considerably longer: 8×96 = 768 or 8×117 = 936 bits. Compared with the large number of features, the available training data were limited, since there were 300 (rhizosphere), 241 (2,4-D), and 168 (chlorinated organic) samples in the three test data sets, respectively. For each dataset, all samples were used in leave-one-out training; then 10 randomly selected subsets of the training data were tested and the results averaged. The classification results of these experiments are listed in Tables 3, 4 and 5, respectively.
            number of patterns   Errors, with original KNN   Errors, with 0/1 weighted KNN   Errors, with 0-10 weighted KNN
test        60                   27.72%                      35.00%                          20.10%
training    300                  29.00%                      19.58%                          18.00%

Table 3: Results on rhizosphere dataset

            number of patterns   Errors, with original KNN   Errors, with 0/1 weighted KNN   Errors, with 0-10 weighted KNN
test        10% (10 times)       8.30%                       3.20%                           2.00%
training    232                  6.64%                       1.66%                           0.83%

Table 4: Results on 2,4-D data
            number of patterns   Errors, with original KNN   Errors, with 0/1 weighted KNN   Errors, with 0-10 weighted KNN
test        10% (10 times)       36.76%                      46.44%                          29.53%
training    232                  53.57%                      52.38%                          23.21%

Table 5: Results on chlorinated organic dataset

The results show that the classification error rate of the ordinary KNN was high for these complex data patterns. The ordinary KNN is sensitive to noisy data and does not rate the relative importance of individual features for discrimination. The performance of the GA/KNN hybrid approach was significantly better than that of the unmodified KNN. However, the 0/1 selection approach no longer performed adequately on the large data sets with considerable noise, such as the rhizosphere data and the chlorinated organic data, although it remained effective for the 2,4-D data. While the error rates were still relatively high, the results were deemed good by the domain researchers.

3. Finding a Nearly Optimum Feature Set

For the artificially generated dataset of Section V.1, we found either the minimum or a nearly minimum optimum feature set. However, for the real-world classification problems just discussed, it is not clear whether the GA/KNN solution is indeed the global optimum. We also noted in repeated runs of the GA/KNN on our biological datasets that the sets of weights found were not identical from run to run. This may reflect the fact that the minimum feature set required to distinguish the classes is not unique. For example, different subsets of the Biolog features for the rhizosphere problem (from as few as 10 features to as many as 50 of the original 96) could discriminate at the same error level. We addressed these problems by further exploring the 2,4-D data using a number of variations on the original approach. The first of these was to perform 0/1 selection, then perform extraction on the reduced data set (called Selector-Distributor), in a two-step process (a minimal sketch of the procedure is given below). The first step, feature selection using wi ∈ {0,1}, eliminates features which are either noisy or contain insufficient or redundant information; on the 2,4-D dataset, this first step reduced the number of features from 96 to 31, and the error rate using selection alone was 0.83%. In the second step, we used feature extraction in the range [0, 10] on the reduced set. For a practical speedup of the GA, we set weight components less than 1 to zero. We ran the algorithm twice, using the wi ∈ [0, 10] range on a 31-feature set (based on the first selection result) and then on a 23-feature set (based on a second selection run). Both runs achieved an error rate of 0.24%, as shown in Figure 6. The best weight vector is shown in Table 6. We then examined the stepwise contribution of each element of the weight vector, in the order of importance indicated by the GA/KNN: we plotted the error rate using w1; then w1 and w2; then w1, w2 and w3; and so on. This experiment is a practical example of the theorem in [5]. The result is shown in Figure 4.
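A minimal sketch of the two-step Selector-Distributor idea under stated assumptions: run_ga is any GA driver with the interface of the evolve sketch given in the Introduction (it returns the best gene vector, each gene in [0, 1]), and loo_error is an assumed leave-one-out KNN error estimator; neither is the authors' code.

    def selector_distributor(patterns, labels, run_ga, loo_error, threshold=1.0):
        """Step 1: evolve a 0/1 feature mask and keep only the selected columns.
        Step 2: evolve [0, 10] weights on the reduced set, zeroing weights that
        fall below 'threshold', as described in the text."""

        def project(rows, columns):
            return [[row[c] for c in columns] for row in rows]

        def weigh(rows, weights):
            return [[w * x for w, x in zip(weights, row)] for row in rows]

        n = len(patterns[0])
        # Step 1: 0/1 selection (a gene >= 0.5 means "keep the feature").
        mask = run_ga(lambda g: 1.0 - loo_error(
            project(patterns, [i for i in range(n) if g[i] >= 0.5]), labels), n)
        keep = [i for i in range(n) if mask[i] >= 0.5]
        reduced = project(patterns, keep)

        # Step 2: [0, 10] extraction on the reduced set, small weights zeroed.
        def decode(g):
            return [10.0 * gi if 10.0 * gi >= threshold else 0.0 for gi in g]

        best = run_ga(lambda g: 1.0 - loo_error(weigh(reduced, decode(g)), labels),
                      len(keep))
        return keep, decode(best)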
feature #   42      86      53      66      56      63      19      21      27      64      71      82
Weight      1.551   2.077   0.000   3.690   2.781   0.000   1.648   4.297   0.000   0.000   3.113   0.000

feature #   89      51      14      22      11      93      78      6       81      91      9
Weight      1.673   0.000   0.000   2.411   0.000   3.646   4.248   4.150   2.106   1.217   4.103

Table 6: Best weight vector for the 2,4-D data set
Figure 4. 2,4-D important-feature analysis: classification error rate versus number of features, with the non-zero-weighted features added one at a time in their order of importance.

From the above results we can draw a preliminary conclusion: feature selection and extraction as performed by the GA/KNN method can be used to find nearly minimum, nearly optimal feature sets for discriminating at least some types of high-dimensionality, noisy data. We note several issues concerning the use of GA/KNN.

(1) Classifiability of High-Dimensionality Data. Not all high-dimensionality datasets are readily classifiable, or classifiable with high accuracy. One cause of such difficulty is that the data classes have overlapping features or lack significant information for discrimination among the classes. Even if a particular high-dimensionality dataset is classifiable, then in order to obtain a good set of weights we need to pay attention to the quality and quantity of the training samples given. The sample
(training) dataset needs to be typical, and it must also be large enough to allow for effective training.

(2) The Need for Speed: Parallel Processing. One of the main problems encountered in using a GA/KNN approach for classifying a high-dimensionality dataset is the time cost. On a sequential machine running the GA/KNN on the Biolog dataset with 96 features, for example, it took about 14 days to run the GA to convergence (about 29,000 generations) on a SPARC I processor with a population size of 50 and 225 training samples. On a SPARC 20 it runs much faster, but the time is still significant. However, a GA can easily be implemented as a parallel algorithm in a number of ways. For the selection and extraction applications discussed here, most of the computational time is spent evaluating the chromosomes, so we can distribute the evaluation of the individuals in the population to several nodes (processors). Such a use of multiple processors results in nearly linear speedup -- that is, using k processors (up to the number of individuals in the population) reduces the wall-clock time of a run by nearly a factor of k. On a GP1000 (Butterfly architecture using Motorola 68020 processors), with one processor to evaluate each string in the population and one master node, we could do 46,000 generations in about a day and a half. On a TC2000 (Butterfly with Motorola 88000 processors), using only 6 nodes (so at most 9 strings to evaluate per processor), we could do 39,000 generations in a day and a half; with 11 nodes, the time was reduced by 40% (linear speedup would reduce it by 45.6%). One of these experiments is shown in Table 7.

Machine                           Training data size   Population size   Number of generations   Number of processors   Time (dd:hh:mm)
Sparc 4/390 (timeshared server)   240                  50                29000                   1                      14:19:02
GP1000                            240                  50                46000                   51                     1:14:47
TC2000                            240                  50                39000                   11                     1:00:58

Table 7: Speeds in various configurations (including parallel)

This approach to GA parallel processing is sometimes called "micro-grain" parallelism [31] and is diagrammed in Figure 5; a minimal sketch follows the figure caption. In contrast to most other parallel GA techniques, in this method the calculations performed are exactly the same as if they were done on a single processor. Please see [31] for a discussion of various approaches to GA parallel processing.

Figure 5. Parallel processing structure for the GA: the master node sets up the initial population, performs crossover and mutation, and distributes all the structures to the slave nodes; the slave nodes calculate the fitness value of every structure; the structures with high fitness are saved, and the cycle repeats until the trials are complete.
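The following is a minimal sketch of micro-grain parallelism using Python's multiprocessing module (an illustration, not the authors' Butterfly implementation): the master process keeps the genetic operations, while a pool of worker processes evaluates the fitness of every chromosome of a generation in parallel; the fitness function is assumed to be a picklable, module-level callable such as the GA/KNN evaluation sketched earlier.

    from multiprocessing import Pool

    def evaluate_population(pop, fitness, n_workers=8):
        """Master/slave ("micro-grain") evaluation: distribute the chromosomes
        of one generation across worker processes and collect their fitness
        values. The genetic operators stay in the master process, so the
        results are identical to a single-processor run."""
        with Pool(processes=n_workers) as pool:
            return pool.map(fitness, pop)

    # Example use with a toy fitness function (module-level, so it can be pickled).
    def toy_fitness(chromosome):
        return sum(chromosome)

    if __name__ == "__main__":
        population = [[0, 1, 1, 0, 1], [1, 1, 1, 1, 0], [0, 0, 1, 0, 0]]
        print(evaluate_population(population, toy_fitness, n_workers=2))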
4. Exploiting Additional Non-Independence of Fields Using Non-Linear Transformation

Having tested the approach on independent features, we examined the search capabilities of GA feature extraction using non-linear transformations on correlated data. We investigated ways to use the correlations between fields to improve classification. Of course, K nearest neighbor methods are already sensitive to some inter-field correlations, but not in the sharpest sense. If we know the form of some possibly significant inter-field correlations in advance, we can use that form of non-linear transformation to search for fields with those sorts of correlations. Not only would we do dimensionality reduction, but we would also create new, meaningful features for effective classification. For example, we wondered whether, given a dataset in which a correlation between two otherwise non-discriminatory fields was important, the GA could discover the usefulness of that pairing in our scheme. An artificially generated dataset to test this hypothesis is shown in Table 8. Clearly, what we could not do was to create a field for every combination of pairs of fields, because for high-dimensionality data sets the size of the chromosome would be very large
(n genes for the features and n(n-1) for the pairs, for example). Instead, we coded some special fields, termed "index fields", into the genome, representing pairs of field indices (instead of weights), plus additional weight-component fields to apply to these index-pair-generated terms. In this way, the GA can select the pair of fields to test for correlation under the predetermined transformation, and then weight the significance of that transformation, just as it does for all the other fields. The evaluation function was modified so that the values in the fields pointed to by the two indices are passed through some pre-determined function (in this case, simple multiplication). This essentially creates the product of those two fields as a new attribute of the individual to be classified. The GA can then search for a weight component for that product field. Only a tiny subset of all possible pairs of fields is examined by the GA at any point in time, but whenever crossover or mutation alters an index pair, the GA essentially begins to search for a weight component for that new pair. For the runs discussed here, only two weight values were possible for each index pair, represented by a single bit in the index part of the chromosome: the weight component was set to either 10.0 or 0, depending on whether the weight-component bit was on or off. We have conducted other experiments in which the GA searches through a larger weight space, but this simpler approach seems to work very well, at least for these artificially generated test data. To allow a "selector of pairs" capability, if the GA chooses either index to be 0, then that index pair is not used in the evaluation function, thus allowing the GA to select no index-pair terms at all. The search strongly discourages the use of the same index twice in a pair (e.g., 3,3), or the repetition of a pair in a string (e.g., 2,5; 5,2), by assigning a low fitness to any string containing such a pair. (A sketch of this index-pair decoding is given below.)
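A minimal sketch, under assumptions, of how the index-pair portion of a chromosome could be decoded and applied during evaluation; the layout (weight genes followed by index-pair genes, each pair carrying one on/off weight bit) follows the description above, but the concrete encoding details are hypothetical, not the authors' code.

    def decode_index_pairs(index_genes):
        """index_genes: list of (i, j, bit) triples taken from the tail of the
        chromosome. i and j are 1-based field indices (0 means the pair is
        unused) and bit switches the pair's weight between 0.0 and 10.0."""
        pairs, seen, penalized = [], set(), False
        for i, j, bit in index_genes:
            if i == 0 or j == 0:          # pair explicitly switched off
                continue
            key = frozenset((i, j))
            if i == j or key in seen:     # self-pair or repeated pair:
                penalized = True          # the string gets a low fitness
                continue
            seen.add(key)
            pairs.append((i - 1, j - 1, 10.0 if bit else 0.0))
        return pairs, penalized

    def transform(x, weights, index_genes):
        """y = W x*: weight the original fields, then append weighted products
        of the fields named by the valid index pairs."""
        pairs, penalized = decode_index_pairs(index_genes)
        y = [w * v for w, v in zip(weights, x)]
        y += [w * x[i] * x[j] for i, j, w in pairs]
        return y, penalized

    # Example: 15 fields; index pairs (1,2) and (3,4) on, third pair unused.
    x = [0.1 * k for k in range(1, 16)]
    y, penalized = transform(x, [1.0] * 15, [(1, 2, 1), (3, 4, 1), (0, 7, 1)])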
class   feature 1~15   hidden feature 1         hidden feature 2
1       0.0~1.0        0.42 < (1)*(2) < 0.44    0.54 < (3)*(4) < 0.56
2       0.0~1.0        0.42 < (1)*(2) < 0.44    0.58 < (3)*(4) < 0.60
3       0.0~1.0        0.42 < (1)*(2) < 0.44    0.62 < (3)*(4) < 0.64
4       0.0~1.0        0.42 < (1)*(2) < 0.44    0.66 < (3)*(4) < 0.68
5       0.0~1.0        0.46 < (1)*(2) < 0.48    0.54 < (3)*(4) < 0.56
6       0.0~1.0        0.46 < (1)*(2) < 0.48    0.58 < (3)*(4) < 0.60
7       0.0~1.0        0.46 < (1)*(2) < 0.48    0.62 < (3)*(4) < 0.64
8       0.0~1.0        0.46 < (1)*(2) < 0.48    0.66 < (3)*(4) < 0.68

Table 8: An artificially generated test template using hidden features.
Algorithm                                     population size   trials    error rate, training set   error rate, 1200 test set
Non-linear transformation GA/KNN with index   50                25,600    0.0%                        0.58%
GA/KNN without index                          50                36,000    6.5%                        22.67%
pure KNN                                      -                 -         39.25%                      86.75%

field          weight component and index
field 1-15     0.0
field 16       10.0
field 17       10.0
field 18       0.0
index pair 1   1, 2
index pair 2   3, 4
index pair 3   5, 6
Table 9: Results, hidden feature data set using nonlinear transformation (index feature)

The example data sets we created using the template shown in Table 8 contain individual fields which are not by themselves significant for classification, but which, taken as pairs and multiplied together, completely separate the classes. The data consist of 15 uniform random values in the range 0.0-1.0, and two "hidden features" consisting of the product of f1 and f2 or of f3 and f4; that is, the data were sieved to make the products significant, but no products actually appear among the data values. The goal was to see whether the GA, by searching simultaneously for index pairs and for weight components on those pairs, would form a nonlinear transformation and work effectively. We modified the evaluation function as described, and allowed 3 index pairs at the end of each individual string of the population. We trained the algorithm on 10 examples from each of the 8 classes (80 total), then tested the trained algorithm on 1200 newly generated test samples. The results of running this modification are shown in Table 9. The GA indeed found the "hidden features" and could therefore perform the classification. The pure KNN alone had an 88% error rate on the unknowns. The nonlinear transformation index algorithm found the appropriate indices in all of several runs and got the error rate down to 0.0% by finding the two index pairs (1,2) and (3,4), setting their weight components to 10, and dropping all other weight components to a low level.
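One way to realize the "sieving" described above is rejection sampling: draw uniform vectors and keep only those whose products fall in the class ranges of Table 8. The sketch below is our own reconstruction of such a generator, not the original code; the ranges come from Table 8.

# Rejection-sampling sketch for the Table 8 template: draw 15 uniform values
# and keep a vector only if the products f1*f2 and f3*f4 fall in the ranges
# assigned to the requested class. The generator is our reconstruction.
import random

P12 = [(0.42, 0.44), (0.46, 0.48)]                               # ranges for f1*f2
P34 = [(0.54, 0.56), (0.58, 0.60), (0.62, 0.64), (0.66, 0.68)]   # ranges for f3*f4

def class_ranges(label):                    # label in 1..8, following Table 8
    a, b = divmod(label - 1, 4)
    return P12[a], P34[b]

def sample(label):
    (lo12, hi12), (lo34, hi34) = class_ranges(label)
    while True:                             # "sieve": reject until both products fit
        x = [random.random() for _ in range(15)]
        if lo12 < x[0] * x[1] < hi12 and lo34 < x[2] * x[3] < hi34:
            return x

# 10 training examples per class, 80 in total, as in the experiment above.
training = [(sample(c), c) for c in range(1, 9) for _ in range(10)]
print(len(training), training[0][1])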
VI. GA/RULE Approach

As previously mentioned, the computational cost of the GA/KNN method is very high, and it requires parallel or distributed processing to be attractive for large, high-dimensionality problems. Part of that cost is inherent in feature selection and extraction for high-dimensionality data patterns; the rest comes from evaluating the whole population with the KNN algorithm, which is computationally intensive and time-consuming. Since some of the high-dimensionality patterns in real-world applications use binary feature
patterns (such as the Biolog biological data sets), we examined whether we could decrease computational cost without sacrificing performance by directly generating rules -- i.e., by using a genetic algorithm combined with a production (rule) system. This approach is based on the general structure of classifier systems, and focuses on the classification of binary feature patterns. This work utilizes a genetic algorithm that evolves classification "rules" for both classification and feature extraction of high-dimensionality binary feature patterns. This inductive learning method is simpler to implement than the previous GA/KNN hybrid system, and requires substantially fewer computation cycles to achieve answers of similar quality. We have named our system the HDBPCS (High Dimensionality Binary Pattern Classification System) and have applied it to the three biological data sets used previously, with encouraging results as before. The method is significantly more accurate than the classical KNN method, and significantly faster than our previous hybrid method, for the classification of binary feature patterns. Our work builds on that of Wilson [25], Bonelli [26], Riolo [27], Oliver [28], Sedbrook et al. [29], and Liepins [30], all of whom studied systems for inductive learning of classification.

6.1 General Structure of the GA/RULE Approach

Binary feature classification was selected for study because it is both a common problem and easy to represent in the HDBPCS. As with other inductive learning approaches, this algorithm uses a collection of labeled "training" data to drive learning. What is characteristic of the approach is the "evolution" of classifiers/rules to classify the training data, and the subsequent testing of the accuracy of those classifiers on data not used in classifier training.

Figure 6: The structure of the GA/RULE approach (initialize a population of random rules; evaluate each rule as a decision rule, with or without weights, against the training data set; if the stopping criterion is not satisfied, evolve a new population via GA selection, crossover, and mutation; the output is the best set of rules and an optimal feature subset)
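As a rough, self-contained sketch of the control flow in Figure 6, the skeleton below initializes a population of random {0, 1, #} rules and alternates evaluation with selection, crossover, and mutation. The operator details and the truncation selection used here are simplified stand-ins chosen for brevity, and all names are ours, not the original implementation; the actual operators and fitness function are described in the sections that follow.

# Skeleton of the GA/RULE (HDBPCS) loop shown in Figure 6: initialize random
# rules, evaluate them, and evolve with selection, one-point crossover,
# mutation, and elitism. The fitness function is a placeholder.
import random

ALPHABET = "01#"

def random_rule(k, n, probs=(0.3, 0.3, 0.4)):
    # Condition part: k class-feature vectors of length n over {0, 1, #}.
    return [random.choices(ALPHABET, weights=probs, k=n) for _ in range(k)]

def crossover(a, b):
    # One-point crossover on the flattened condition part.
    flat_a = [g for vec in a for g in vec]
    flat_b = [g for vec in b for g in vec]
    cut = random.randrange(1, len(flat_a))
    child = flat_a[:cut] + flat_b[cut:]
    n = len(a[0])
    return [child[i * n:(i + 1) * n] for i in range(len(a))]

def mutate(rule, rate=0.005):
    # Standard gene-level mutation restricted to the condition part.
    return [[random.choice(ALPHABET) if random.random() < rate else g
             for g in vec] for vec in rule]

def evolve(fitness, k, n, pop_size=200, generations=200):
    pop = [random_rule(k, n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite, parents = pop[0], pop[:pop_size // 2]   # elitism + truncation selection
        pop = [elite] + [mutate(crossover(random.choice(parents), random.choice(parents)))
                         for _ in range(pop_size - 1)]
    return max(pop, key=fitness)

# Toy fitness just to exercise the loop: prefer 1s in the first class vector.
best = evolve(lambda r: r[0].count("1"), k=3, n=10, pop_size=40, generations=30)
print(best[0])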
The results of this process are twofold: 1) "good" rules are created for classifying "unknown" data, and 2) domain researchers are shown which features are important for the classification, based on the features used by the rules. Our HDBPCS, like most machine learning systems, performs supervised learning. The characteristics of the HDBPCS genetic algorithm are described in the following sections and in Figure 6. The objective of the GA/RULE approach is to find an optimal rule classifier that yields both the lowest error and the smallest feature set. The basic tenet of this approach is also to utilize a feedback linkage between feature evaluation and classification, but in a way much different from GA/KNN: GA/RULE directly manipulates a rule representation which is used for classification, rather than creating a transformation for use with the KNN rule.

6.2 Rule (Chromosome) Structure

Suppose there are k possible pattern classes ω1, ω2, ⋅ ⋅ ⋅ , ωk. A classifier is a production rule which is used to make a decision assigning a pattern x to one of the classes ωj, j = 1, 2, ⋅ ⋅ ⋅ , k. The condition part of the rule in HDBPCS is a string which consists of k class-feature vectors, where k is the number of pattern classes. Each class-feature vector consists of N elements, where N is the number of features used for classification. Each class-feature vector plays the role of a classification predicate, examining the input data pattern; taken together, the class-feature vectors determine which features are used in this rule's classification decision. The action part of the classifier is a class variable ω which indicates into which class the rule places an input pattern, based on matching against the feature vectors. A classifier rule therefore takes the following form:
classifier ::= condition : action. A specific form of the rule is:
(y11, ..., y1i, ..., y1N), ..., (yj1, ..., yji, ..., yjN), ..., (yk1, ..., yki, ..., ykN) : ω,
where i = 1, 2, ⋅ ⋅ ⋅ , N; j = 1, 2, ⋅ ⋅ ⋅ , k, and where ω is a variable whose value can be one of the k classes. The condition is a simple pattern-matching device. The alphabet of the class-feature vector consists of 0, 1, and a don't-care character, i.e., yji ∈ {0, 1, #}.

6.3 Training Data

Each training sample consists of a training vector of size N (again, N indicating the number of features being used), which is the feature vector x below, together with a known classification label:

x = [x1, x2, ⋅ ⋅ ⋅ , xi, ⋅ ⋅ ⋅ , xN], i = 1, 2, ⋅ ⋅ ⋅ , N, where xi ∈ {0, 1} and N is the number of features.

6.4 Evaluation of the Decision Rule Classifier

Each rule in the population is evaluated by matching it against the training data set. In this procedure, every class-feature vector of the rule's condition is compared with the training vector at every position (i.e., every component). A 0 in the class-feature vector matches a 0 in the training vector, a 1 matches a 1, and a # don't-care character matches either 0 or 1. Thus, for a training set
with three classes, the training vector would be compared with the three class-feature vectors in each rule. The number of matching features in each class-feature vector of the rule's condition is counted, and the rule's class-feature vector with the highest number of matches determines the classifier's message -- the class assigned to the sample. Since the class of each training sample is already known, this classification can then be judged correct or not. Based on the accuracy of classification, the decision rule can be directly rewarded or punished, and based on this rule "strength", the GA can evolve new rules.

Let us now discuss the use of classifier rules as pattern discriminators. A given pattern x of unknown class can, in principle, be assigned to any of the k classes; thus, we have k possible decisions. Let R(x) be one of the above rule classifiers -- i.e., a function of x that tells us which decision to make for every possible pattern x. For example, R(x) = ωj denotes the decision to assign pattern x to class ωj. During the classification step, class membership is determined by comparing k discrimination functions g1(x), g2(x), ..., gk(x), each computed from the matching functions m(xi, yji) for the input pattern under consideration. The discrimination function and matching function for the two cases are defined as follows.

For Feature Selection:

gj(x) = Σ_{i=1}^{N} m(xi, yji),  where  m(xi, yji) = 1 if (xi = yji) or (yji = #), and m(xi, yji) = 0 if (xi ≠ yji) and (yji ≠ #).

For Feature Extraction:

gj(x) = Σ_{i=1}^{N} wi · m(xi, yji),

with the same matching function m, where W = diag(w1, w2, ..., wN) (or equivalently w* = [w1, w2, ..., wN]) and x = [x1, x2, ⋅ ⋅ ⋅ , xN].
It is convenient to assume that the discriminant function gj(x) takes scalar values and that the pattern x belongs to the j-th class if and only if

gj(x) > gi(x), for i, j = 1, 2, …, k, i ≠ j.

This is equivalent to R(x) = ωj when gj(x) = max {gi(x), i = 1, 2, …, k}. The decision boundary in N-dimensional feature space between pattern classes ωj and ωi is defined by the equation

gj(x) - gi(x) = 0,

which represents the decision surface. When a pattern x falls on this surface, it is ambiguous which classification to make; it would be advantageous to allow the classifier to withhold a decision and reject the pattern.
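The sketch below is our own illustration of the matching and discriminant functions just defined, with variable names of our choosing: it scores each class-feature vector against a binary pattern (optionally weighted, for the feature-extraction case) and treats a tie between the best-scoring vectors as a rejection.

# Discriminant functions for a rule: count matching positions per class
# vector ('#' matches anything); with feature-extraction weights w_i the
# matches are weighted. The pattern is assigned to the class whose vector
# scores highest; a tie (pattern on a decision border) is reported as None.
def match(x_i, y_ji):
    return 1 if y_ji == "#" or str(x_i) == y_ji else 0

def g(x, class_vector, weights=None):
    if weights is None:                      # feature selection: unweighted count
        return sum(match(xi, yi) for xi, yi in zip(x, class_vector))
    return sum(w * match(xi, yi) for w, xi, yi in zip(weights, x, class_vector))

def classify(x, rule, weights=None):
    scores = [g(x, vec, weights) for vec in rule]
    best = max(scores)
    if scores.count(best) > 1:               # ambiguous: withhold the decision
        return None
    return scores.index(best) + 1            # class label 1..k

rule = [list("1#0#"), list("0#1#"), list("##11")]   # k = 3, N = 4
print(classify([1, 0, 0, 1], rule))                  # -> 1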
6.5 Evolution of the Rule Classifier Using the GA

Running HDBPCS yields a small set of "best" classification rules which classify the training set more accurately than other rule sets. By examining those "good" rules, one can determine the features which are most useful for classification. To keep the number of features used to a minimum, one must provide a mechanism that fosters not only accuracy of classification but also minimum cardinality of the feature set used. The problem then becomes a multiobjective (multicriterion) optimization problem. We accomplish this by adding another term to the fitness function and by choosing different proportions of wild cards (#) in the rule classifiers of the initial population.

•Fitness Function

Each rule's fitness is based on its classification of the known samples. The form of the fitness function is as follows:
Fitness = CorrectPats/TotPats + α ∗ n_don'tcare/TotPats,

where TotPats is the total number of patterns (training samples) examined, CorrectPats is the number of patterns correctly classified by the rule, and n_don'tcare is the number of features that are invalid for classification (see below). The constant α is used to balance the two terms of the fitness function, and its value is determined on a problem-specific basis. This fitness function guides the GA to search for the best rule sets as a multiobjective optimization problem.

•Genetic Operators

Crossover was standard one-point crossover, and mutation was standard bit modification; both operations were performed only on the condition part of the classifiers/rules. Elitism was used to keep the best solution in the population, and the entire population (except for the best solution) is replaced each generation.

•Determining Invalid Features

To keep the complexity of the rules generated to a minimum, an optimization step is applied to remove redundant features:
y11  y12  ...  y1N
y21  y22  ...  y2N
...
yk1  yk2  ...  ykN   ⇒ ω
For each rule, we arrange the class-feature vectors as the matrix above, and then determine, for each position (column), whether every class vector has the same value (1 or 0) there, or differs only through a don't-care (#). If so, the n_don'tcare variable is incremented, since that feature is useless for classification. Formally, for column i: if, for all pairs (ymi, yni) with m, n = 1, 2, . . ., k and m ≠ n, either (ymi = yni), or (ymi ≠ yni) and (ymi = # or yni = #), then the i-th feature is invalid for classification; otherwise it is valid.
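The following is a compact sketch of the invalid-feature test and the fitness function above, written as our own reconstruction: helper names are ours, α is illustrative, and any classifier (for example, the classify function sketched earlier) can be passed in.

# Sketch of the fitness computation: classification accuracy on the training
# set plus an alpha-weighted bonus for every column (feature) that is invalid,
# i.e. contributes nothing to discrimination across the k class vectors.
def column_invalid(rule, i):
    # Invalid if no pair of class vectors disagrees with two concrete values:
    # ignoring '#', at most one distinct value appears in column i.
    values = [vec[i] for vec in rule]
    distinct = {v for v in values if v != "#"}
    return len(distinct) <= 1

def fitness(rule, training, classify, alpha=0.1):
    correct = sum(1 for x, label in training if classify(x, rule) == label)
    n_dontcare = sum(column_invalid(rule, i) for i in range(len(rule[0])))
    total = len(training)
    return correct / total + alpha * n_dontcare / total

# Tiny example with a stand-in classifier that always predicts class 1.
rule = [list("1#0"), list("0#1")]
data = [([1, 0, 0], 1), ([0, 1, 1], 2)]
print(fitness(rule, data, classify=lambda x, r: 1, alpha=0.1))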
From this operation on the condition part of the rule, we can reduce the feature subset directly. In the end, a nearly optimal feature subset can be drawn from the best rule while keeping classification performance at a certain level.

6.6 Test Results

We applied the HDBPCS to the three biology datasets described in Section 2, using feature selection. The parameters used were those that gave the best results over a number of runs. The initial populations were created randomly, with the value at each vector position generated with probability 0.3 for a 1, 0.3 for a 0, and 0.4 for a #. The population size was 200, the selection rate was 0.6, the crossover rate was 0.6, and the mutation rate was 0.005. In several runs, the HDBPCS system discovered a reasonable classification rule for each data set. The HDBPCS outperformed a standard KNN method in every case, and its performance approached that of the hybrid GA/KNN method in most cases. It is most important to note, however, that the computation time required for the hybrid GA/KNN method running under similar circumstances was about 14 days (if run on a single SPARC I processor), whereas the HDBPCS required only 3 hours on a somewhat faster processor. The average accuracy results of six runs and the number of valid features after selection are shown in Table 10 below.
                     KNN                GA/KNN             HDBPCS             Number of   Number of
                     correctness rate   correctness rate   correctness rate   features    valid features
Rhizosphere data
  Training           71.00%             82.00%             80.00%             96          23
  Test               68.00%             79.90%             70.00%
2,4-D data
  Training           93.36%             99.17%             98.61%             96          22
  Test               91.70%             98.00%             96.00%
Chlorinated data
  Training           46.43%             76.79%             79.86%             117         60
  Test               63.24%             70.47%             66.67%
Table 10: Classification results using the HDBPCS system

The information shown in Table 10 is particularly important to biological researchers, as it provides information that can be used to derive new hypotheses. Furthermore, by reducing the set of features required for discrimination, HDBPCS gives domain researchers information on possible foci for future research efforts. In particular, for our biology examples, it was shown that HDBPCS can indeed distinguish between different bacterial communities -- e.g., 2,4-D degraders -- on the basis of which substrates are being used, as measured by a standardized substrate-uptake test. This may help to determine what substrates are available in each of the communities, and may have ramifications in determining which bacteria are suitable for bioremediation of particular environments. Finally, the errors of classification in each class are not uniform across the data sets examined. This information may also provide some hints for research in the domain being studied.
The same problems of classifiability of high-dimensionality data sets that existed for the GA/KNN approach exist here. The quality of the data sets, whether or not they have overlapping features in some classes, and the lack of significant information for discrimination among the classes -- all these factors affect classification accuracy. The quantity of training samples, or the ratio of training sample size to dimensionality, is a crucial factor in the design of a classification or recognition system. The sample (training) dataset needs to be representative, and it must also be large enough to allow effective training; otherwise, 'false' rules will be induced, based on a few examples which do not span the space of possible class membership.
VII. Conclusion

In this paper we have shown that genetic algorithms can play an important role in classification and feature extraction for high-dimensionality, multiclass data patterns. The basic method of our two approaches is a classification--feedback--feature extraction technique: an automatic pattern recognition method which uses feedback information from the classifier being designed to change the decision space. The GA/KNN approach does this by transforming that space and reducing its dimensionality, while the GA/RULE approach does so by modifying the decision rule through inductive learning and evolution, in both cases to improve the performance of the classifier. The results achieved by these approaches indicate potential applications in many other fields. They can also be further developed by combining a GA with other classifiers to solve a fairly broad class of complex pattern recognition problems.
Acknowledgments

This work was supported in part by Michigan State University's Center for Microbial Ecology and the Beijing Natural Science Foundation of China.

References:

[1] A. K. Jain, Artificial Neural Networks and Statistical Pattern Recognition (editor's preface), Elsevier Science Publishers B.V., 1991.
[2] W. F. Punch, E. D. Goodman, M. Pei, et al., Further Research on Feature Selection and Classification Using Genetic Algorithms, Proc. Fifth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1993, p. 557.
[3] H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition, John Wiley & Sons, New York, 1972.
[4] J. M. Zurada, Introduction to Artificial Neural Systems, West Publishing Co., 1992.
[5] W. Siedlecki and J. Sklansky, On Automatic Feature Selection, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 2, No. 2, 1988, 197-220.
[6] A. K. Jain and B. Chandrasekaran, Dimensionality and Sample Size Considerations in Pattern Recognition Practice, Handbook of Statistics, Vol. 2, P. R. Krishnaiah and L. N. Kanal, eds., North-Holland, Amsterdam, 1982, 835-855.
[7] A. K. Jain, Pattern Recognition, International Encyclopedia of Robotics Applications and Automation, Wiley & Sons, 1988, 1052-1063.
[8] C. Y. Chang, Dynamic Programming as Applied to Feature Subset Selection in a Pattern Recognition System, IEEE Trans. Sys. Man Cybern. 3, 1973, 166-171.
[9] M. Ben-Bassat, Use of Distance Measures, Information Measures, and Error Bounds in Feature Evaluation, Handbook of Statistics, Vol. 2, P. R. Krishnaiah and L. N. Kanal, eds., North-Holland, Amsterdam, 1982, 773-791.
[10] T. Marill and D. M. Green, On the Effectiveness of Receptors in Recognition Systems, IEEE
Trans. Inf. Theory, 9, 1963, 11-17.
[11] T. M. Cover, The Best Two Independent Measurements Are Not the Two Best, IEEE Trans. Sys. Man Cybern. 4, 1974, 116-117.
[12] T. M. Cover and J. M. Van Campenhout, On the Possible Orderings in the Measurement Selection Problem, IEEE Trans. Sys. Man Cybern. 7, 1977, 651-661.
[13] A. Whitney, A Direct Method of Nonparametric Measurement Selection, IEEE Trans. Computers, Vol. C-20, 1971, 1100-1103.
[14] S. D. Stearns, On Selecting Features for Pattern Classifiers, Proc. 3rd Int. Conf. Pattern Recognition, Coronado, CA, 1976, 71-75.
[15] P. M. Narendra and K. Fukunaga, A Branch and Bound Algorithm for Feature Subset Selection, IEEE Trans. Computers 26, 1977, 917-922.
[16] I. Foroutan and J. Sklansky, Feature Selection for Automatic Classification of Non-Gaussian Data, IEEE Trans. Sys. Man Cybern. 17, 1987, 187-198.
[17] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, London, 1982.
[18] W. Siedlecki and J. Sklansky, A Note on Genetic Algorithms for Large-Scale Feature Selection, Pattern Recognition Letters, 10, 1989, 335-347.
[19] W. Siedlecki and J. Sklansky, Constrained Genetic Optimization via Dynamic Reward-Penalty Balancing and its Use in Pattern Recognition, Proc. Third Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1989, 141-150.
[20] J. D. Kelly and L. Davis, Hybridizing the Genetic Algorithm and the K Nearest Neighbors Classification Algorithm, Proc. Fourth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1991, 377-383.
[21] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
[22] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[23] K. De Jong, Genetic Algorithms: A 10 Year Perspective, Proc. First International Conf. Parallel Problem Solving from Nature, 1990, 169-177.
[24] N. Schraudolph and J. Grefenstette, A User's Guide to GAucsd 1.4, Technical Report CS92-249, CSE Dept., UC San Diego, La Jolla, CA 92093-0114, 1992.
[25] S. Wilson, Classifier Systems and the Animat Problem, Machine Learning 2: 199-228, Kluwer Academic Publishers, Boston, 1987.
[26] P. Bonelli and A. Parodi, An Efficient Classifier System and its Experimental Comparison with Two Representative Learning Methods on Three Medical Domains, Proc. Fourth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1991, p. 288.
[27] R. L. Riolo, Modeling Simple Human Category Learning with a Classifier System, Proc. Fourth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1991, p. 324.
[28] J. Oliver, Discovering Individual Rules: An Application of Genetic Algorithms, Proc. Fifth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1993, p. 216.
[29] T. A. Sedbrook, H. Wright, and R. Wright, Application of a Genetic Classifier for Patient Triage, Proc. Fourth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1991, p. 334.
[30] G. E. Liepins and L. A. Wang, Classifier System Learning of Boolean Concepts, Proc. Fourth Inter. Conf. Genetic Algorithms and their Applications (ICGA), 1991, p. 318.
[31] S.-C. Lin, W. F. Punch, and E. D. Goodman, Coarse-Grain Parallel Genetic Algorithms: Categorization and New Approach, Sixth IEEE SPDP, pp. 28-37, October 1994.