GRACCE: A GENETIC ENVIRONMENT FOR DATA MINING
Robert E. Marmelstein and Gary B. Lamont
Department of Electrical and Computer Engineering, Graduate School of Engineering
Air Force Institute of Technology, Wright-Patterson AFB, Dayton, OH 45433-7765
Email: [email protected] and [email protected]
ABSTRACT: Data mining is the automated search for interesting and useful relationships between attributes in databases. In this regard, the rules used by classifiers are inherently interesting because they distinguish between similar-looking data of differing classes. In this paper, we introduce the Genetic Rule and Classifier Construction Environment (GRaCCE) as a means for extracting classification rules from data. GRaCCE uses a multi-stage, Genetic Algorithm (GA)-based approach to first reduce the feature set and then locate class homogeneous regions within the data. Classification rules are subsequently generated from these regions. The utility of GRaCCE is demonstrated by comparing it to a decision tree induction technique on several benchmark data sets. We show that GRaCCE achieves competitive classification accuracy while producing a more compact decision rule set.
[Figure 1: two scatter plots of the same data over features X1 and X2, with o = Class 1 and x = Class 2; panel (A) shows the decision tree partitions, panel (B) shows the actual partition.]
Figure 1. Decision trees can produce rule sets that are more complex than necessary.
1 Introduction

A classifier provides a mapping from a set of input variables x_1, ..., x_d to an output variable y whose value represents the class label ω_1, ..., ω_m (Bishop, 1995). While these mappings can be extremely complex, they are also a potentially important source of useful information. Approximating these mappings with rules that are both accurate and understandable is a fundamental data mining problem. Decision tree algorithms, such as CART (Breiman, 1984) and C4.5 (Quinlan, 1993), are frequently employed to extract classification rules from data. These algorithms can quickly produce accurate rules that have the advantage of being relatively understandable. The disadvantage, as illustrated by Figure 1 (and by the sketch following this list), is that the induced rules can be overly complex. This situation is due to a number of factors, including:

1. The greedy nature of the induction algorithm.
2. Simplifying assumptions about the rule structure (e.g., single-variable splits).
3. The use of a "one size fits all" partitioning metric (such as information gain (Quinlan, 1993)).
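To make this point concrete, the following sketch (in Python, and not part of the original GRaCCE or CART tooling) fits an axis-parallel decision tree to data that a single oblique line separates perfectly, in the spirit of Figure 1; the tree needs several leaves, and hence several rules, to approximate a boundary that one linear condition captures exactly. The use of scikit-learn's DecisionTreeClassifier and the synthetic boundary x2 > x1 are illustrative assumptions.

# Sketch: axis-parallel splits vs. a single oblique boundary (cf. Figure 1).
# Assumes numpy and scikit-learn; the boundary x2 > x1 stands in for the
# oblique partition of Figure 1(B) and is not taken from the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = (X[:, 1] > X[:, 0]).astype(int)        # one oblique rule separates the classes

tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print("training accuracy:", tree.score(X, y))
print("leaves (rules) needed by axis-parallel splits:", tree.get_n_leaves())
# A single linear condition (x2 - x1 > 0) classifies the same data perfectly,
# illustrating why induced rule sets can be more complex than necessary.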
The Genetic Rule and Classifier Construction Environment (GRaCCE) is a proposed alternative to these approaches. GRaCCE harnesses the power of genetic search to mine classification rules from data. GRaCCE accomplishes this by generating linear decision boundaries based on a simplified data set. A search is then performed for combinations of these partitions that isolate class homogeneous (CH) regions in the data. These regions are refined further and used to generate a final rule set for classifying the data. In this paper, we show how GRaCCE operates on a synthetic problem. We also compare its performance to the CART algorithm for several benchmark data sets.
2 The GRaCCE Algorithm
The GRaCCE algorithm has five distinct phases: feature selection, partition generation, data set approximation, region identification, and region refinement. Each of these phases is discussed in the sections that follow. For the purpose of this article, we assume the reader has a general familiarity with GAs; if not, Goldberg (Goldberg, 1989) provides a comprehensive reference on this subject.
2.1 Feature Selection Phase
In this phase of the algorithm, we reduce the dimensionality of the data set containing d features. GRaCCE offers a number of feature selection algorithms, including a deterministic forward search, a GA-based search, and a hybrid of the two. Starting with Sklansky (Sklansky, 1989), a number of researchers have demonstrated the effectiveness of GA-based feature selection. In our implementation, the fitness of each feature set is based on the accuracy achieved by a k-Nearest Neighbor (kNN) classifier (Bishop, 1995) using the features enabled in the GA chromosome. The hybrid method uses the results of a partial forward search to find the best m features (where m < d).
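A minimal sketch of the GA-based selection step is given below; it is illustrative only. It assumes a binary chromosome that masks features and scores each mask by the cross-validated accuracy of a k-nearest-neighbor classifier, as described above, but the GA operators, population size, mutation rate, and k are placeholder choices rather than GRaCCE's actual settings.

# Sketch of GA-based feature selection with a kNN-accuracy fitness function.
# All parameters (population size, generations, mutation rate, k) are
# illustrative placeholders, not the values used by GRaCCE.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, k=3):
    """Fitness of a binary feature mask: cross-validated kNN accuracy."""
    if not mask.any():
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(knn, X[:, mask], y, cv=5).mean()

def ga_feature_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, d)).astype(bool)
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        # binary tournament selection of parents
        idx = [max(rng.choice(pop_size, 2), key=lambda i: scores[i])
               for _ in range(pop_size)]
        parents = pop[idx]
        # single-point crossover on consecutive pairs
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = int(rng.integers(1, d))
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # bit-flip mutation
        children ^= rng.random(children.shape) < p_mut
        pop = children
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()]            # best feature mask found

A call such as ga_feature_select(X, y) returns a boolean mask over the d original features; the kNN fitness mirrors the wrapper-style evaluation described above.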
Figure 6. Initial Template Search Algorithm
Figure 7. Partitions selected for Class 3
At the end of this phase, we possess a simplified version of the regions and their respective decision boundaries that were derived in the previous section. Because these decision boundaries separate each region from the rest of the data space, we can treat each region as a rule for classifying the data and each of its partitions as a condition of that rule.
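One way to make this concrete is sketched below: each region is stored as a conjunction of linear conditions (half-spaces defined by its partitions), and a sample is assigned the label of the first region whose conditions it satisfies. The data structures, the w.x + b >= 0 form of the conditions, and the fallback to a default class are illustrative assumptions rather than GRaCCE's internal representation.

# Sketch: class-homogeneous regions used as classification rules.
# A rule is a conjunction of half-space conditions w.x + b >= 0; a sample
# receives the label of the first region whose conditions all hold.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Condition:
    w: np.ndarray          # normal vector of the bounding partition
    b: float               # offset of the bounding partition

    def holds(self, x: np.ndarray) -> bool:
        return float(np.dot(self.w, x)) + self.b >= 0.0

@dataclass
class RegionRule:
    conditions: List[Condition]   # partitions bounding the CH region
    label: int                    # class label of the region

def classify(x, rules, default_label=-1):
    for rule in rules:
        if all(c.holds(x) for c in rule.conditions):
            return rule.label
    return default_label           # no region contains the sample

# Example: one rule for the half-plane x2 - x1 >= 0, labelled class 1.
rules = [RegionRule([Condition(np.array([-1.0, 1.0]), 0.0)], label=1)]
print(classify(np.array([0.2, 0.8]), rules))   # -> 1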
3 Test Methodology

In order to gauge the relative utility of GRaCCE, we compare its performance against the CART decision tree induction algorithm (as implemented in the LinkNet pattern recognition testbed (Lippmann, 1995)). Since both systems produce rules as output, a comparison between them is far more meaningful than if GRaCCE were evaluated against a black-box type of classifier (such as a neural network). A decision tree is a classifier that has a hierarchical (tree) structure. According to Quinlan (Quinlan, 1993), a decision tree is made up of two kinds of components. The first is a leaf, which indicates a class. The second is a decision node, which contains a criterion for separating the data into individual classes. Feature vectors are classified by starting at the root node and evaluating each decision node (taking the appropriate path) until a leaf, indicating the class of the sample, is encountered. In the CART algorithm, each decision node's partitioning criterion can be expressed as either a single feature or a linear combination of features; we evaluate both of these modes against GRaCCE.

The systems were evaluated using the data sets described in Table 1. Most of these data sets are frequently cited benchmarks available from the UC-Irvine Machine Learning Repository; the FLIR data set is an unclassified data set used to perform target recognition of SCUD missiles. Each data set was cross-validated against each classifier over a series of five runs. For each run, the classifier was trained with 80% of the data; the remaining data was used for testing.
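For reference, the evaluation protocol described above (five independent runs, each training on 80% of the data and testing on the remaining 20%, with the mean test-set error reported) can be sketched as follows. The use of scikit-learn's ShuffleSplit and of a generic fit/score classifier are assumptions for illustration; they are not the CART implementation used in the original study.

# Sketch of the evaluation protocol: five runs with an 80%/20% train/test
# split, reporting the mean test-set error. Any classifier exposing
# fit/score can be passed in; DecisionTreeClassifier is only an example.
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

def mean_test_error(X, y, make_classifier, runs=5, test_size=0.2, seed=0):
    splitter = ShuffleSplit(n_splits=runs, test_size=test_size, random_state=seed)
    errors = []
    for train_idx, test_idx in splitter.split(X):
        clf = make_classifier().fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(errors))

# e.g. mean_test_error(X, y, DecisionTreeClassifier)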
Table 1. Data Set Descriptions

  Name       Classes   Features (Original/Reduced)   Samples   Description
  Iris       3         4/2                           150       Fisher's Iris Database
  Cancer     2         9/6                           699       Wisconsin Breast Cancer Database
  Wine       3         13/3                          178       Chemical Analysis of Wine
  Glass      6         9/4                           214       Attributes of Crime Scene Glass
  Diabetes   2         8/5                           768       Pima Indians Diabetes Database
  FLIR       2         6/2                           1000      SCUD Missile Forward Looking Infra-Red (FLIR) Database
4 Interpretation of Results

The results of our evaluation are shown in Tables 2 to 7. The results in Tables 2 to 4 were gathered using the original features of each data set; the remaining tables reflect usage of the selected features only (the third column of Table 1 indicates the degree of feature set reduction). In this paper, we interpret the results with regard to classification accuracy and rule set complexity [1]. In most cases, GRaCCE and CART were roughly equivalent with regard to classification accuracy. With the original feature set, each method had one case where it proved to be statistically more accurate than the other; neither method had an advantage in accuracy using the reduced feature set. These results are significant because CART executes in a matter of minutes while GRaCCE can take several hours to fully cross-validate a single data set. Therefore, if one is chiefly concerned with classification accuracy, GRaCCE is not the preferred method.

In the area of rule set complexity, however, GRaCCE was the clear winner. For each data set, GRaCCE classified the data using a smaller number of rules and fewer average conditions per rule. These results indicate that GRaCCE found rules with better coverage than those derived by CART while maintaining a comparable level of accuracy. In several cases, such as Cancer and FLIR, GRaCCE improved on CART's performance by a better than 2-to-1 margin. While not presented here, similar results were found relative to the C4.5 algorithm.

We attribute the difference in rule set size to the quality of the decision boundaries generated by GRaCCE. Decision tree algorithms tend to partition data based on statistical metrics computed for a given data subset. Although this splitting criterion is effective, it may prove arbitrary for a given data set and result in rule sets that are much more complicated than necessary (recall Figure 1). In contrast, GRaCCE forms CH regions from partitions that separate class boundary points. Once they are selected, GRaCCE adjusts these partitions to better reflect the shape of the regions they separate (refer to Figure 5). By utilizing these natural boundaries, GRaCCE derives a more streamlined, and potentially more useful, rule set than CART.
[1] Cases where one method proved superior, using a difference of means test at the 0.05 level of significance, are marked by a "*" in the appropriate table.
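A difference-of-means test of the kind referred to in the footnote can be sketched as follows; a two-sample (Welch) t-test on the per-run error rates is assumed here, since the exact form of the test is not specified in the text.

# Sketch: flag a statistically significant difference between two methods'
# per-run error rates at the 0.05 level. The Welch t-test is an assumption;
# the paper states only that a difference of means test was used.
from scipy import stats

def significantly_different(errors_a, errors_b, alpha=0.05):
    t_stat, p_value = stats.ttest_ind(errors_a, errors_b, equal_var=False)
    return p_value < alpha

# e.g. significantly_different(gracce_errors, cart_errors) with the five
# per-run error rates of each method.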
Table 2. GRaCCE - Original Feature Set, Test Set Results (µ)

  Data Set   GRaCCE Error   Rules   Conds/Rule
  Iris       0.019          3.0*    1.4*
  Cancer     0.037*         2.0*    1.1*
  Wine       0.104          3.8     1.9
  Glass      0.387          11.8    2.7*
  Diabetes   0.284          8.2*    4.0
  FLIR       0.292          10.2*   2.2*
Table 3. CART (Linear) - Original Feature Set, Test Set Results (µ)

  Data Set   CART Error   Rules   Conds/Rule
  Iris       0.060        4.4     2.1
  Cancer     0.057        6.0     2.6
  Wine       0.101        4.0     2.0
  Glass      0.410        12.2    3.6
  Diabetes   0.277        18.0    4.2
  FLIR       0.227*       22.8    4.5
5 Summary and Future Work

In this paper we have presented a comprehensive overview of the data mining algorithm employed by GRaCCE. In addition, we demonstrated GRaCCE's effectiveness with respect to the CART decision tree algorithm on several benchmark data sets.
Table 4. CART (Single) - Original Feature Set, Test Set Results (µ)

  Data Set   CART Error   Rules   Conds/Rule
  Iris       0.073        5.0     2.3
  Cancer     0.064        7.6     2.9
  Wine       0.129        4.2     2.1
  Glass      0.393        13.6    3.8
  Diabetes   0.230*       17.8    4.2
  FLIR       0.239        23.6    4.6

Table 5. GRaCCE - Reduced Feature Set, Test Set Results (µ)

  Data Set   GRaCCE Error   Rules   Conds/Rule
  Iris       0.041          3.0*    1.3*
  Cancer     0.041          2.4*    1.4*
  Wine       0.097          3.8     1.8
  Glass      0.375          6.8*    1.8*
  Diabetes   0.251          9.2*    3.4*
  FLIR       0.272          8.6*    2.6*

Table 6. CART (Linear) - Reduced Feature Set, Test Set Results (µ)

  Data Set   CART Error   Rules   Conds/Rule
  Iris       0.033        4.8     2.3
  Cancer     0.053        5.8     2.5
  Wine       0.101        4.2     2.1
  Glass      0.360        13.2    3.7
  Diabetes   0.276        17.0    4.1
  FLIR       0.265        23.2    4.5

Table 7. CART (Single) - Reduced Feature Set, Test Set Results (µ)

  Data Set   CART Error   Rules   Conds/Rule
  Iris       0.067        5.0     2.3
  Cancer     0.062        8.0     3.0
  Wine       0.107        4.4     2.1
  Glass      0.332        13.2    3.7
  Diabetes   0.264        18.4    4.2
  FLIR       0.259        25.2    4.7
Although GRaCCE's classification accuracy was comparable to that of the other methods, it did yield a significantly more compact rule set for nearly every data set tested. These findings are extremely important since the principle of Occam's razor tells us we should always prefer the simplest model that fits the data. Because these results are empirical in nature, there is no guarantee that GRaCCE will duplicate them for every data set. Indeed, GRaCCE is most effective when mining rules from data sets where the classes are multi-modal and Gaussian in nature. However, our results to date are promising enough for us to continue GRaCCE's development. Our current work focuses on developing techniques to further simplify the rule set produced by GRaCCE. In addition, we seek to accelerate the execution of GRaCCE on large, high-dimensional data sets by parallelizing the algorithm.
REFERENCES

Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
Larry J. Eshelman. The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. In Foundations of Genetic Algorithms, pages 265-283. Morgan Kaufmann, 1991.
David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
Chulhee Lee and David Landgrebe. Feature extraction based on decision boundaries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4), April 1993.
Richard Lippmann and Linda Kukolich. LinkNet Users Guide. MIT Lincoln Laboratory, 1995.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
J. Sklansky and W. Siedlecki. A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10:335-347, December 1989.
A. E. Smith and D. M. Tate. Genetic optimization using a penalty function. In Proceedings of the Fifth International Conference on Genetic Algorithms (ICGA-93), 1993.