Decomposition of Heterogeneous Classification Problems
C. Apte, S.J. Hong, J. Hosking, J. Lepre, E. Pednault, and B. Rosen
Intelligent Data Analysis, 1998
This is an expanded version of an earlier paper with the same title in the proceedings of IDA'97, August 1997.
Decomposition of heterogeneous classification problems
Chidanand Apte, Se June Hong, Jonathan R. M. Hosking, Jorge Lepre, Edwin P. D. Pednault, and Barry K. Rosen
IBM T. J. Watson Research Center, Yorktown Heights, N.Y., U.S.A.

Abstract
In some classification problems the feature space is heterogeneous in that the best features on which to base the classification are different in different parts of the feature space. In some other problems the classes can be divided into subsets such that distinguishing one subset of classes from another and classifying examples within the subsets require very different decision rules, involving different sets of features. In such heterogeneous problems, many modeling techniques (including decision trees, rules, and neural networks) evaluate the performance of alternative decision rules by averaging over the entire problem space, and are prone to generating a model that is suboptimal in any of the regions or subproblems. Better overall models can be obtained by splitting the problem appropriately and modeling each subproblem separately. This paper presents a new measure to determine the degree of dissimilarity between the decision surfaces of two given problems, and suggests a way to search for a strategic splitting of the feature space that identifies regions with different characteristics. We illustrate the concept using a multiplexor problem, and apply the method to a DNA classification problem.

Key words: contextual merit, decision trees, entropy, feature merit measures, Gini impurity.
1 Introduction
Many classification problems contain a mixture of rather dissimilar subproblems. This can occur when the features important for classification are different in distinct regions in the feature space, or when the decision rules that distinguish one group of classes from another are very different from those that separate the classes within each group. We use the terms feature-space heterogeneity and class heterogeneity to denote these two situations.

Feature-space heterogeneity is exhibited, for example, by some medical diagnosis problems: diagnosis may require quite different models for different sexes, for the relevant sets of symptoms can be very different. Here different regions in the feature space, split along the sex feature, exhibit distinct decision characteristics.
As an instance of class heterogeneity, consider the problem of modeling severity levels of automobile insurance accounts. The payout amount is modeled as one of three severity classes: high, medium and low. However, the vast majority of examples have no claims (considered as having "no accident"). Assigning these to a fourth class, "none," and generating one classification model for all four classes may be unwise. Whether a driver has an accident in a given time period may depend principally on the kind of driver and the driving area, while the severity given that an accident happened may depend more on the kind of vehicle, its net value and the cost of its parts. Here one model would be appropriate for distinguishing "no accident" from "accident," but a quite different model would be required to classify the severity given that an accident occurred.

A logical first step towards solving a classification problem is to search for evidence of heterogeneity, and, if any is found, to try to decompose the problem into its constituent subproblems. This decomposition approach is potentially highly beneficial, because most widely used modeling techniques (including decision trees, rules, and neural networks) rely on measures that are computed over all the features and all the examples at hand, and are inevitably diffused by an averaging effect over the entire problem.

How then does one detect and separate out heterogeneity in a given classification problem? We answer this question in the following three sections. In Section 2 we present a new measure that reflects the degree of dissimilarity between the decision boundaries for two given subproblems. It is based on measures of feature merit, of which we give several examples in Section 3. In Section 4 we propose a tree strategy for identifying suitable regions in feature space. Once the problem is properly separated, each subproblem can be tackled by its own most appropriate model.
This may even involve different model families: e.g., one subproblem may be modeled by a neural network and the other by a Bayesian classifier.

The well-known multiplexor function is an ideal example of feature-space heterogeneity. Under different control variable settings the output depends on entirely different input variables. From the classification point of view, the decision surfaces in each of the regions represented by the control variable settings are orthogonal to each other. In Section 5 we apply our methods to several classification problems based on the multiplexor function. For comparison, Section 6 illustrates the performance of two standard methods, C4.5 and CART, on those problems. In Section 7 we apply our methods to a DNA classification problem from the Statlog [7] collection. Conclusions and some further discussion are given in Section 8.
2 A measure of dissimilarity
Given a classification problem and a proposed decomposition of it into two subproblems, we wish to find how dissimilar the decision surfaces are in the two subproblems. Differences in class probability distribution, or class probability profile, may give an indirect indication. We argue that a more direct indication comes from comparing the profiles of importance of the features of the two problems. When many features display widely different importance in the two subproblems the decision surfaces of the two subproblems must be quite distinct, although the converse may not hold in general.

To determine the degree to which the importance of features varies between two subproblems, we make use of the angle between the two vectors of the feature importance values in each subproblem. Let the importance measures of one subproblem be denoted by the vector of merits, $M_{a1}, M_{a2}, \ldots, M_{af}$, and of the other by $M_{b1}, M_{b2}, \ldots, M_{bf}$, where $f$ is the number of features and $M$ denotes a measure of feature importance or feature merit, discussed further below and in Section 3. The angle formed by the two vectors in the $f$-dimensional space is

\[
\frac{2}{\pi} \arccos \left( \frac{\sum_{i=1}^{f} M_{ai} M_{bi}}{\left( \sum_{i=1}^{f} M_{ai}^2 \right)^{1/2} \left( \sum_{i=1}^{f} M_{bi}^2 \right)^{1/2}} \right), \tag{1}
\]

which we will call the Importance Profile Angle (IPA); strictly, (1) defines a normalized IPA that takes values between 0 and 1. Linear scaling of the vector does not change the angle, and hence the importance profile depends only on the relative magnitudes of the features' importance. The definition of IPA assumes that feature merits are always positive. It is natural to expect that the merit of a feature that does not improve the classificatory ability of the model is zero, though this is not a requirement of the definition in (1). Some feature merit measures are described in Section 3, including Gini gain, information gain, and Hong's [5] contextual merit. IPA can be applied to continuous or discrete-valued features, or both, provided that the chosen merit measure can be computed for the features.

As a result of splitting the universe along some feature values, two kinds of degeneracy can arise: either the class variable or some of the features may become constant within one of the subproblems induced by the split. A feature that is constant within a subproblem is of no use for classification and should logically be given zero merit. The angle between two vectors that have zeroes in identical positions is the same as the angle between the two vectors with these zero elements deleted; thus the IPA is unaffected by deletion of features whose values are constant within each subproblem.

The situation in which the class variable is constant within a subproblem is not quite so straightforward.
In such a subproblem perfect classification is obtained by assigning each example to their common class. Knowledge of the feature values cannot improve this classification, so the merit of each feature must logically be zero. The feature merit vector is therefore identically zero and the angle between it and the feature merit vector for another subproblem cannot be computed. It might be argued that in this case the IPA should be set to 1, its maximum value, so that this split would be regarded as the best of the candidates. The underlying strategy would be that one should take the offered opportunity of correctly classifying a subset of the examples, while continuing to look for homogeneous decompositions of the nontrivial leg of the split. Instead, we prefer to set the IPA to zero, its minimum value; this causes the "constant-class" split to be ignored. We do so because at this stage of the analysis we are not seeking perfect classification of examples, and to do so may mask a broader pattern of heterogeneity. The split can always be found at a later stage of the analysis when a regular tree classifier is run on the homogeneous subproblems.
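The angle in (1), together with the zero-vector convention just described, can be sketched in a few lines of Python. This is an illustration rather than the authors' code: the function name `importance_profile_angle` is hypothetical, and the normalizing factor 2/pi is an assumption chosen so that non-negative merit vectors give values between 0 and 1, as the text states.

```python
import math
from typing import Sequence


def importance_profile_angle(merits_a: Sequence[float],
                             merits_b: Sequence[float]) -> float:
    """Normalized angle between two non-negative merit vectors, as in (1)."""
    if len(merits_a) != len(merits_b):
        raise ValueError("merit vectors must have the same length")
    norm_a = math.sqrt(sum(m * m for m in merits_a))
    norm_b = math.sqrt(sum(m * m for m in merits_b))
    # Convention from the text: an all-zero merit vector (constant class)
    # leaves the angle undefined, so the IPA is set to 0 and the split ignored.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    cosine = sum(a * b for a, b in zip(merits_a, merits_b)) / (norm_a * norm_b)
    # Guard against rounding slightly outside [-1, 1] before calling acos.
    cosine = max(-1.0, min(1.0, cosine))
    # Assumption: 2/pi maps the range [0, pi/2] of the angle onto [0, 1].
    return 2.0 / math.pi * math.acos(cosine)
```

Orthogonal merit vectors such as `[1, 0]` and `[0, 1]` give a value of about 1, proportional vectors give a value of about 0, and adding a feature with zero merit in both subproblems leaves the result unchanged.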
3 Feature merit measures
The importance, or merit, of a feature is often measured by the difference between an impurity measure of the classes and the resulting impurity of the classes given the fact that the feature value is known. Two of the most frequently used impurity measures are the Gini index, used in CART [1], and entropy, used in C4.5 [2]. We now define these impurity measures and feature merit measures for categorical features.

Denote by $I(p_1, \ldots, p_m)$ the impurity of a set of probabilities $p_1, p_2, \ldots, p_m$, $p_1 + p_2 + \cdots + p_m = 1$. An impurity measure should satisfy $I(p_1, \ldots, p_m) = 0$ whenever $p_i = 1$ for some $i$, and should be maximized when $p_i = 1/m$ for all $i$. Reasonable forms for $I(p_1, \ldots, p_m)$ include the Gini measure

\[
G(p_1, \ldots, p_m) = 1 - \sum_i p_i^2
\]

and the entropy measure

\[
H(p_1, \ldots, p_m) = - \sum_i p_i \log p_i .
\]
In a categorical classification problem, there are $N$ examples over which the dependent variable ("class") $\mathcal{C}$ takes $C$ distinct values with frequencies $b_i$, $i = 1, \ldots, C$. We consider an explanatory variable ("feature") $\mathcal{F}$ that takes $F$ distinct values with frequencies $a_j$, $j = 1, \ldots, F$. The conjunction of class $i$ and feature value $j$ occurs with frequency $x_{ij}$. The impurity of the class variable is denoted by

\[
I(\mathcal{C}) = I(b_1/N, \ldots, b_C/N) .
\]

For those examples corresponding to a particular value of the feature, the impurity is denoted by

\[
I(\mathcal{C} \mid \mathcal{F} = j) = I(x_{1j}/a_j, \ldots, x_{Cj}/a_j) .
\]

When feature value $j$ occurs with probability $p_j$ the average impurity of the class variable, given the feature, is

\[
I(\mathcal{C} \mid \mathcal{F}) = \sum_{j=1}^{F} p_j \, I(x_{1j}/a_j, \ldots, x_{Cj}/a_j) . \tag{2}
\]
For the set of $N$ examples the proportion of occurrences of feature value $j$ is $a_j/N$, and using this value as $p_j$ in (2) yields

\[
I(\mathcal{C} \mid \mathcal{F}) = \sum_{j=1}^{F} \frac{a_j}{N} \, I(x_{1j}/a_j, \ldots, x_{Cj}/a_j) . \tag{3}
\]
This is the impurity that remains in the class variable after the information present in the feature variable has been used. When using Gini's measure of impurity, (3) becomes

\[
G(\mathcal{C} \mid \mathcal{F}) = 1 - \sum_{j=1}^{F} \frac{a_j}{N} \sum_{i=1}^{C} \left( \frac{x_{ij}}{a_j} \right)^2
= 1 - \frac{1}{N} \sum_{j=1}^{F} \sum_{i=1}^{C} \frac{x_{ij}^2}{a_j} . \tag{4}
\]

The best feature is the one that achieves the lowest value of the Gini index (4), or, equivalently, the highest value of the "Gini gain" $G(\mathcal{C}) - G(\mathcal{C} \mid \mathcal{F})$.
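When using the entropy measure of impurity, (3) becomes

\[
H(\mathcal{C} \mid \mathcal{F}) = - \sum_{j=1}^{F} \frac{a_j}{N} \sum_{i=1}^{C} \frac{x_{ij}}{a_j} \log \frac{x_{ij}}{a_j}
= \frac{1}{N} \left\{ \sum_{j=1}^{F} a_j \log a_j - \sum_{j=1}^{F} \sum_{i=1}^{C} x_{ij} \log x_{ij} \right\} . \tag{5}
\]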
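As an illustration of (3) and the gains built on it, the following sketch computes the gain of a feature from a table of counts for any impurity function. The names `gini`, `entropy` and `gain` are hypothetical, and the base-2 logarithm is an assumption, since the text does not state the base.

```python
import math
from typing import Callable, Sequence

# counts[i][j] holds x_ij: the number of examples in class i with feature value j.
Table = Sequence[Sequence[int]]


def gini(probs: Sequence[float]) -> float:
    return 1.0 - sum(p * p for p in probs)


def entropy(probs: Sequence[float]) -> float:
    # Assumption: base-2 logarithm. Terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def gain(counts: Table, impurity: Callable[[Sequence[float]], float]) -> float:
    """I(C) - I(C|F), with I(C|F) computed as in (3)."""
    n = sum(sum(row) for row in counts)
    class_totals = [sum(row) for row in counts]         # b_i
    value_totals = [sum(col) for col in zip(*counts)]   # a_j
    before = impurity([b / n for b in class_totals])
    after = 0.0
    for j, a_j in enumerate(value_totals):
        if a_j == 0:
            continue  # a feature value that never occurs carries no weight
        after += (a_j / n) * impurity([row[j] / a_j for row in counts])
    return before - after
```

With `counts = [[5, 0], [0, 5]]` the feature determines the class, so `gain(counts, gini)` is 0.5 and `gain(counts, entropy)` is 1 bit; with `counts = [[2, 2], [3, 3]]` the feature is independent of the class and both gains are 0 up to rounding.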
The best feature is the one that achieves the lowest value of the entropy (5), or, equivalently, the highest value of the "information gain" $H(\mathcal{C}) - H(\mathcal{C} \mid \mathcal{F})$.

Feature merit measures based on Gini and entropy measures are greedy, or myopic [3], in that they reflect the correlation between a single feature and the class, disregarding other features. To overcome this problem, new merit measures for features have recently been developed. They take into account the presence of other features that may interact in imparting information about the class. They include the RELIEF measure developed by Kira and Rendell [4] and its follow-on RELIEFF developed by Kononenko et al. [3], and the "contextual merit" (CM) developed by Hong [5]. These new measures require more computation than the two myopic varieties, but are more robust in general.

We now describe CM in more detail. CM assigns merit to a feature taking into account the degree to which other features are capable of discriminating between the same examples as the given feature. As an extreme instance, if two examples in different classes differ in only one feature, then that feature is particularly valuable (if it were dropped from the set of features, there would be no way of distinguishing the examples) and is assigned additional merit.

To define contextual merit, first define the distance $d_{rs}^{(k)}$ between the values $z_{kr}$ and $z_{ks}$ taken by feature $k$ for examples $r$ and $s$. If the feature is symbolic, taking only a discrete set of values, define

\[
d_{rs}^{(k)} = \begin{cases} 0 & \text{if } z_{kr} = z_{ks}, \\ 1 & \text{otherwise.} \end{cases}
\]

If the feature is numeric, set a threshold $t_k$ (Hong [5] recommends that it be one-half the range of values of feature $k$) and define

\[
d_{rs}^{(k)} = \min( |z_{kr} - z_{ks}| / t_k , \, 1 ) .
\]

The distance between examples $r$ and $s$ is now defined to be

\[
D_{rs} = \sum_{k=1}^{N_f} d_{rs}^{(k)} ,
\]

$N_f$ being the number of features. The merit of feature $f$ is now defined as

\[
M_f = \sum_{r=1}^{N} \sum_{s \in C(r)} w_{rs}^{(f)} d_{rs}^{(f)} ,
\]

where $N$ is the number of examples, $C(r)$ is the set of examples not in the same class as example $r$, and $w_{rs}^{(f)}$ is a weight function chosen so that examples that are close together, i.e. that differ in only a few of their features, are given greater influence in determining each feature's merit. Hong [5] used weights $w_{rs}^{(f)} = 1/D_{rs}^2$ if $s$ is one of the $k$ nearest neighbors to $r$, in terms of $D_{rs}$, in the set $C(r)$, and $w_{rs}^{(f)} = 0$ otherwise; the number of nearest neighbors used by Hong was the logarithm (to base 2) of the number of examples in the set $C(r)$.

An IPA can be defined for any measure of feature importance. Depending on the original merit function used, we speak of Gini gain IPA, information gain IPA, CM IPA, etc.
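The definition of contextual merit can be sketched for symbolic features as follows. This is a minimal illustration under stated assumptions, not Hong's implementation: the function name `contextual_merit` is hypothetical, the number of neighbors is taken as the base-2 logarithm of the size of C(r) rounded up (the text does not say how to round), and pairs at distance zero are skipped to avoid dividing by zero.

```python
import math
from typing import List, Sequence


def contextual_merit(examples: Sequence[Sequence[int]],
                     classes: Sequence[int]) -> List[float]:
    """Contextual merit of each symbolic feature (0/1 component distances)."""
    n_features = len(examples[0])
    merits = [0.0] * n_features
    for x_r, c_r in zip(examples, classes):
        # Build C(r): examples not in the class of r, each with its distance D_rs.
        counter = []
        for x_s, c_s in zip(examples, classes):
            if c_s != c_r:
                diffs = [int(a != b) for a, b in zip(x_r, x_s)]
                counter.append((sum(diffs), diffs))
        if not counter:
            continue
        # Assumption: round log2 |C(r)| up and use at least one neighbor.
        k = max(1, math.ceil(math.log2(len(counter))))
        counter.sort(key=lambda pair: pair[0])
        for distance, diffs in counter[:k]:
            if distance == 0:
                continue  # Assumption: skip identical examples in different classes.
            weight = 1.0 / distance ** 2
            for f in range(n_features):
                merits[f] += weight * diffs[f]
    return merits
```

For the four examples `(0,0), (0,1), (1,0), (1,1)` with the class equal to the first feature, each example's nearest counter-example differs only in the first feature, so the merits are `[4.0, 0.0]`.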
4 Strategic splitting of the feature space
Now that we have an effective means to tell whether two classification subproblems are similar, what remains is to generate candidate split regions in the feature space so that their dissimilarity can be measured. This is a difficult problem since one naturally wants to avoid exhaustive enumeration, and yet a reasonable set of candidates is desired. We suggest using the tree paradigm that splits along the values of a chosen "best" feature. We first consider cases in which all features are categorical and have binary values; we then discuss ways to generalize the idea to multiple-valued and numeric features.

To use the IPA statistic in practice, for each feature F one divides the feature space into two subregions corresponding to the different values of F and computes the IPA value using (1). If the largest of these values exceeds a suitable threshold, this is an indication of heterogeneity: the feature space, and hence the training set, is split according to the values of the feature that gives the largest IPA value and analysis proceeds separately on the two subproblems thereby defined. The splitting process may proceed recursively until some stopping criterion is met. The computation at a node is analogous to a one-level look-ahead scheme in conventional tree building.

We are still investigating what the practical stopping criterion should be. Since the goal is to generate subproblems to be modeled separately, this strategy tree would not be deep in general. In fact, one should guard against fragmentation which renders the distinct regions too small to generate a reliable model from: this is a perennial problem facing all tree-based modeling techniques. The stopping criterion should probably involve a threshold on the IPA value and a threshold on the number of examples in the subregions.
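One step of this procedure for binary features might look like the sketch below. All names (`ipa`, `gini_gain_merits`, `best_ipa_split`) and the parameters `threshold` and `min_examples` are hypothetical; Gini gain is used as the merit only to keep the example short, and the normalizing factor 2/pi in `ipa` is an assumption consistent with values between 0 and 1.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Example = Sequence[int]
MeritFn = Callable[[List[Example], List[int]], List[float]]


def ipa(merits_a: Sequence[float], merits_b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(m * m for m in merits_a))
    norm_b = math.sqrt(sum(m * m for m in merits_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # constant-class subproblem: the split is ignored
    cosine = sum(a * b for a, b in zip(merits_a, merits_b)) / (norm_a * norm_b)
    return 2.0 / math.pi * math.acos(max(-1.0, min(1.0, cosine)))


def gini(labels: List[int]) -> float:
    return 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))


def gini_gain_merits(examples: List[Example], classes: List[int]) -> List[float]:
    merits = []
    for f in range(len(examples[0])):
        after = 0.0
        for value in set(x[f] for x in examples):
            subset = [c for x, c in zip(examples, classes) if x[f] == value]
            after += len(subset) / len(classes) * gini(subset)
        merits.append(gini(classes) - after)
    return merits


def best_ipa_split(examples: List[Example], classes: List[int], merit_fn: MeritFn,
                   threshold: float, min_examples: int = 1) -> Optional[Tuple[int, float]]:
    """Return (feature index, IPA) of the best binary split, or None to stop."""
    best = None
    for f in range(len(examples[0])):
        sides = []
        for value in (0, 1):
            idx = [i for i, x in enumerate(examples) if x[f] == value]
            sides.append(([examples[i] for i in idx], [classes[i] for i in idx]))
        # Guard against fragmentation: both subregions need enough examples.
        if min(len(side[1]) for side in sides) < max(1, min_examples):
            continue
        angle = ipa(merit_fn(*sides[0]), merit_fn(*sides[1]))
        if best is None or angle > best[1]:
            best = (f, angle)
    if best is not None and best[1] > threshold:
        return best
    return None
```

On an exhaustively enumerated 2-way multiplexor with features (X, Y1, Y2) and class Y1 when X = 0, Y2 otherwise, splitting on X gives orthogonal merit vectors and an IPA of about 1, while splitting on either signal gives an IPA of about 0, so the control feature is selected.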
IPA is a useful measure so long as the problem can be decomposed into subproblems that are of lower effective dimension in that their decision surfaces involve fewer features than the original problem. At some point, reduction in effective dimension is no longer possible, and the objective changes to finding the best models within each lower-dimensional subspace. At this point the choice of which feature to split on should be based on other criteria such as entropy gain or contextual merit. The stopping criterion for the IPA method therefore needs to be sensitive to lack of reduction in effective dimension.

When a feature has multiple values, an IPA can be computed for each pair of the values, by considering splits between these two values and ignoring all examples in which the feature takes any other value. Values can then be merged recursively by grouping together the pair of values with the lowest IPA and regenerating the importance measure vector for the just-grouped values versus the rest, until the angles between each group are "acceptably" large. This approach can generate more than two groups, which is an added flexibility. The smallest of the angles between the final groups would be used as the IPA of the feature when deciding which feature to split on.

For numeric features, the split candidates can be generated from an initial discretization by any of the methods described in [6] or by the method of [5], based on contextual merit. However, this method may not be completely satisfactory: discretization methods use global analysis, which may not be a good approach when heterogeneity is present. Further study and experiments are needed before these ideas can be routinely used in practice.
5 Lessons from the multiplexor problem
As was mentioned above, the multiplexor is an ideal example of feature space heterogeneity. Our basic example is the 4-way multiplexor, which has two binary control inputs, X1 and X2, four signal inputs, Y1 through Y4, a number of irrelevant inputs (we use two irrelevant inputs, R1 and R2) and an output, or class, C, defined by

\[
C = \begin{cases}
Y_1 & \text{if } X_1 = 0 \text{ and } X_2 = 0, \\
Y_2 & \text{if } X_1 = 0 \text{ and } X_2 = 1, \\
Y_3 & \text{if } X_1 = 1 \text{ and } X_2 = 0, \\
Y_4 & \text{if } X_1 = 1 \text{ and } X_2 = 1.
\end{cases} \tag{6}
\]

To be more realistic, we have devised the following variations of the 4-way multiplexor as classification problems (we have also carried out similar experiments on variations of 2-way and 8-way multiplexors and confirmed the similar behavior of IPA measures). They are progressively more "difficult" and approach more closely what some real problems might be.

Case 1: 4-way multiplexor with two random inputs and even distribution of feature values. Generate a 1000 × 8 random binary array, the values 0 and 1 being equally likely to occur. Generate the class variable according to (6).

Case 2: 4-way multiplexor with two random inputs, even distribution of feature values, and 5% noise. Take Case 1 and flip 50 randomly chosen class bits, thereby injecting 5% noise.

Case 3: 4-way multiplexor with two random inputs and uneven distribution of feature values. Generate a 1000 × 8 random binary array, the values 0 and 1 occurring with probabilities 2/3 and 1/3 respectively. Generate the class variable according to (6), except that the assignment C = Y2 is replaced by C = 1 - Y2 and C = Y4 is replaced by C = 1 - Y4; i.e., in these two cases the signal value is flipped when it is copied to the class. The uneven distribution of 0 and 1 feature values is intended to simulate more realistic practical situations. Flipping some of the class values is done to achieve an even distribution of 0s and 1s in the class values; it does not change the multiplexor nature of the problem.

Case 4: 4-way multiplexor with two random inputs, uneven distribution of feature values, and 5% noise. Take Case 3 and flip 50 randomly chosen class bits, thereby injecting 5% noise.
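The four cases can be generated along the following lines. This is a sketch only: the function name `make_multiplexor`, its parameters and the seed are hypothetical, and the random numbers will not reproduce the particular data sets used in the paper.

```python
import random
from typing import List, Tuple


def make_multiplexor(n: int = 1000, p_one: float = 0.5,
                     flip_even_signals: bool = False,
                     n_noise: int = 0, seed: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Rows are [X1, X2, Y1, Y2, Y3, Y4, R1, R2]; the class follows (6)."""
    rng = random.Random(seed)
    rows = [[int(rng.random() < p_one) for _ in range(8)] for _ in range(n)]
    classes = []
    for x1, x2, y1, y2, y3, y4, _r1, _r2 in rows:
        selected = 2 * x1 + x2                 # 0..3 selects Y1..Y4
        value = (y1, y2, y3, y4)[selected]
        if flip_even_signals and selected in (1, 3):
            value = 1 - value                  # Cases 3 and 4: C = 1 - Y2, C = 1 - Y4
        classes.append(value)
    for i in rng.sample(range(n), n_noise):
        classes[i] = 1 - classes[i]            # flip randomly chosen class bits
    return rows, classes


case1 = make_multiplexor()
case2 = make_multiplexor(n_noise=50)
case3 = make_multiplexor(p_one=1 / 3, flip_even_signals=True)
case4 = make_multiplexor(p_one=1 / 3, flip_even_signals=True, n_noise=50)
```

Because the noise indices are drawn after the feature array, calling the function with the same seed with and without `n_noise=50` yields identical features and classes that differ in exactly 50 positions.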
[Figure 1 plot: "Gap" (vertical axis, 0.0 to 1.0) against case number (horizontal axis, 1 to 4), with one curve each for Gini gain IPA, Information gain IPA, CM IPA, Gini IPA and Entropy IPA.]
Figure 1: Plots of the "Gap" values from Table 1 for each of the five IPA measures.

The desirable decision tree for all of these cases should have exactly three levels of decision nodes, the first two levels consisting of the two control features in either order, and then the four signal features in the last level. There should be exactly seven decision nodes. These trees cannot be further pruned. Tree generation should start by splitting on one of the control features. This splits the problem into two subproblems each of which is a 2-way multiplexor problem; thus at the second level of the tree the other control feature should be selected for splitting. Once the top two levels are constructed correctly, any reasonable tree-generation method should be able to complete the third level nodes and stop.

In Table 1 we assess the ability of different feature importance measures and IPAs to identify which feature should be split on at the top-level node. Tabulated gain and IPA values are rounded to 3 decimal places. In a multiplexor problem, the most effective split into subproblems is a split on a value of a control feature. A good measure of feature importance should therefore attach high importance to the control features and lower importance to signal and irrelevant features. Each "Gap" value in Table 1 is the difference between the smallest IPA value for either of the control features and the largest IPA value of any of the other features. A large "Gap" value means that the IPA measure performs well and indicates a clear preference for initially splitting on a control feature. The "Gap" values for each IPA are plotted in Figure 1.

Here are some observations from these experiments.
1. The myopic gain measures, Gini gain and information gain, are unable to identify appropriate variables to split on. In all four cases they choose a signal variable rather than one of the control variables for the first split.
2. The IPAs based on myopic gain measures, Gini gain IPA and information gain IPA, show good discrimination between control and signal features (a high "Gap" value) in Cases 1 and 2, but their performance is much poorer in the more difficult Cases 3 and 4.
3. The IPAs based on impurity measures, Gini IPA and Entropy IPA, are able to distinguish the control features. However, their resolution is much poorer than those of IPAs based on gain measures: the size of the "Gap" is rather small in each of Cases 1-4.
Table 1: Results for multiplexor classification problems.
Tabulated values are merit and IPA measures for each of the eight feature variables. Numbers of 0s and 1s in the 1000 examples are also given, both for the feature variables and for the class variable C. Starred values are the smallest of the IPA values for the control inputs and the largest of the IPA values for the signal and irrelevant inputs. "Gap" is the difference between the two starred values in the row.

Case 1
Measure                   X1     X2     Y1     Y2     Y3     Y4     R1     R2      C    Gap
Number of 0s             510    510    510    508    505    502    488    473    507
Number of 1s             490    490    490    492    495    498    512    527    493
Gini gain               .000   .000   .052   .029   .032   .027   .000   .000
Gini gain IPA           .986*  .993   .220*  .080   .125   .189   .071   .142          .766
Gini IPA                .129*  .139   .013   .006   .012   .014*  .004   .008          .115
Information gain        .000   .000   .076   .043   .045   .040   .000   .000
Information gain IPA    .986*  .993   .224*  .080   .128   .192   .072   .143          .762
Entropy IPA             .096*  .103   .010   .005   .009   .011*  .003   .006          .085
CM                      1349   1406   1051    959    954    933    471    483
CM IPA                  .496*  .509   .035   .057*  .037   .032   .045   .049          .439

Case 2
Measure                   X1     X2     Y1     Y2     Y3     Y4     R1     R2      C    Gap
Number of 0s             510    510    510    508    505    502    488    473    511
Number of 1s             490    490    490    492    495    498    512    527    489
Gini gain               .001   .000   .046   .027   .024   .021   .000   .001
Gini gain IPA           .976*  .991   .299*  .205   .217   .274   .172   .244          .677
Gini IPA                .102*  .115   .014   .011   .016*  .016   .008   .012          .086
Information gain        .001   .000   .067   .038   .035   .030   .000   .000
Information gain IPA    .976*  .992   .303*  .206   .220   .277   .173   .245          .673
Entropy IPA             .076*  .086   .011   .009   .013*  .012   .006   .009          .063
CM                      1306   1371   1103   1042   1005    959    602    607
CM IPA                  .371*  .407   .047   .055   .019   .025   .067*  .031          .304

Case 3
Measure                   X1     X2     Y1     Y2     Y3     Y4     R1     R2      C    Gap
Number of 0s             656    656    660    654    663    646    659    681    473
Number of 1s             344    344    340    346    337    354    341    319    527
Gini gain               .002   .057   .081   .024   .029   .011   .002   .002
Gini gain IPA           .947*  .997   .815*  .664   .649   .604   .150   .233          .132
Gini IPA                .144*  .174   .098*  .076   .064   .057   .014   .020          .046
Information gain        .000   .082   .110   .033   .040   .014   .000   .000
Information gain IPA    .954*  .998   .817*  .675   .653   .615   .150   .236          .137
Entropy IPA             .113*  .139   .080*  .061   .050   .044   .010   .015          .033
CM                      1389   1685   1452   1017    874    633    439    435
CM IPA                  .509*  .601   .267   .268*  .141   .179   .087   .060          .241

Case 4
Measure                   X1     X2     Y1     Y2     Y3     Y4     R1     R2      C    Gap
Number of 0s             656    656    660    654    663    646    659    681    469
Number of 1s             344    344    340    346    337    354    341    319    531
Gini gain               .002   .045   .069   .016   .020   .010   .002   .002
Gini gain IPA           .966*  .992   .819*  .617   .598   .551   .136   .271          .147
Gini IPA                .120*  .138   .079*  .059   .048   .041   .010   .018          .041
Information gain        .000   .063   .099   .021   .026   .012   .000   .000
Information gain IPA    .969*  .993   .821*  .622   .599   .555   .137   .274          .148
Entropy IPA             .092*  .106   .063*  .045   .036   .031   .007   .013          .029
CM                      1397   1619   1429   1052    934    716    585    580
CM IPA                  .410*  .488   .218   .220*  .135   .115   .085   .053          .190
4. CM IPA shows the most consistent ability to discriminate between control and signal variables across all four Cases.
5. The CM-based measures are the most robust in the presence of heterogeneity. In all four cases CM and CM IPA choose the control variable X2 for the first split. This split separates the problem into two subproblems, each of which is a 2-way multiplexor with control variable X1, and in each subproblem CM and CM IPA identify X1 as having the highest merit. However, CM IPA performs better than CM: from Table 1 it can be seen that in all four Cases CM IPA gives both control variables higher merit than any of the other variables, whereas in Cases 3 and 4 CM does not do so.

These results are in agreement with theoretical considerations. We expect CM IPA to outperform Gini gain IPA and information gain IPA, because the vector of contextual merits provides an indication of the effective dimensionality of a problem (the number of features that contribute to classification) by estimating the importance of each variable to solving the problem. The information-gain and Gini-gain vectors, on the other hand, measure the one-step improvements in the degrees of fit to the data if splits are performed on the corresponding variables. One-step improvements in the degrees of fit are only indirectly related to the dimensionality of a problem. Moreover, because these measures are myopic, they may miss interactions between features that are important to efficient classification.
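The "Gap" statistic used in these observations is simple to compute from one row of Table 1. The sketch below uses a hypothetical helper `gap` and the CM IPA values for Case 1 as tabulated.

```python
from typing import Mapping, Sequence


def gap(ipa_by_feature: Mapping[str, float], control: Sequence[str]) -> float:
    """Smallest IPA among the control features minus the largest IPA among the rest."""
    smallest_control = min(ipa_by_feature[name] for name in control)
    largest_other = max(value for name, value in ipa_by_feature.items()
                        if name not in control)
    return smallest_control - largest_other


# CM IPA values for Case 1, copied from Table 1.
cm_ipa_case1 = {"X1": 0.496, "X2": 0.509, "Y1": 0.035, "Y2": 0.057,
                "Y3": 0.037, "Y4": 0.032, "R1": 0.045, "R2": 0.049}
print(round(gap(cm_ipa_case1, ["X1", "X2"]), 3))  # 0.439
```

A positive gap means that every control feature has a larger IPA than every signal or irrelevant feature, which is the condition for the first split to fall on a control feature.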
6 Comparison with C4.5 and CART
To illustrate the performance of standard decision tree algorithms in multiplexor problems, we applied C4.5 and CART, with default parameters. All pruned trees had either 5 or 6 levels with the number of decision nodes ranging from 15 to 50. None of the trees achieved the optimum misclassification rate (0 in Cases 1 and 3, 5% in Cases 2 and 4). Table 2 gives the number of levels, number of nodes, and the top three levels for the trees generated by C4.5 and CART, and also for the optimal ("Best") tree. The tree is shown by scanning each level left to right. It is noteworthy that neither C4.5 nor CART succeeds in identifying the control variables as being the best to split on: in each case the first split is on Y1, a signal variable, and only nine of the 32 paths to third-level nodes contain splits on both X1 and X2.

Curiously, the "measure of importance" values output by the CART program and defined in [1, p. 147] do consistently give the highest importance to one of the control features. These values are given in Table 3. Although CART has not generated optimal trees for these multiplexor problems, it has at least given some indication that its trees are not optimal.
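A small calculation suggests why a gain-based first split lands on a signal variable. The sketch below is an illustration added here, not an experiment from the paper: it enumerates all 64 equally likely settings of the six relevant inputs of an idealized, noise-free 4-way multiplexor and computes the Gini gain of each input; the helper names are hypothetical.

```python
from itertools import product
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    n = len(labels)
    return 1.0 - sum((list(labels).count(c) / n) ** 2 for c in set(labels))


def gini_gain(values: Sequence[int], labels: Sequence[int]) -> float:
    after = 0.0
    for v in set(values):
        subset = [c for x, c in zip(values, labels) if x == v]
        after += len(subset) / len(labels) * gini(subset)
    return gini(labels) - after


# All 64 settings of (X1, X2, Y1, Y2, Y3, Y4), each counted once.
rows = list(product((0, 1), repeat=6))
# The class copies the signal selected by the control inputs, as in (6).
labels = [row[2 + 2 * row[0] + row[1]] for row in rows]
names = ["X1", "X2", "Y1", "Y2", "Y3", "Y4"]
gains = {name: gini_gain([row[k] for row in rows], labels)
         for k, name in enumerate(names)}
```

In this idealized setting each control input has Gini gain 0 and each signal input has Gini gain 1/32, about .031, which is in line with the Case 1 sample values in Table 1 (.000 for X1 and X2, and .027 to .052 for Y1 through Y4). A criterion that looks at one feature at a time therefore prefers a signal input.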
Table 2: Summary of the trees generated by C4.5 and CART for Cases 1-4.

Case   Method   Levels   Nodes   Top 3 levels of tree (level 1 | level 2 | level 3)
1      C4.5     5        43      Y1 | X2 Y3 | Y3 Y4 X1 X2
2      C4.5     6        50      Y1 | Y2 Y3 | X1 Y4 X1 X2
3      C4.5     5        35      Y1 | X2 X1 | X1 Y2 X2 Y3
4      C4.5     6        35      Y1 | X2 X1 | X1 Y2 Y2 Y3
1      CART     6        29      Y1 | Y4 Y3 | Y3 X2 X1 X2
2      CART     6        25      Y1 | Y2 Y3 | X1 Y4 X1 X2
3      CART     5        18      Y1 | X1 X2 | X2 Y3 Y2 X1
4      CART     5        15      Y1 | X1 X2 | Y2 Y3 Y2 X1
1-4    Best     3        7       X1 | X2 X2 | Y1 Y2 Y3 Y4

Table 3: CART's "measure of importance" of features for Cases 1-4.

Case    X1    X2    Y1    Y2    Y3    Y4    R1    R2
1      100    75    32    79    36    53     2     2
2      100    92    33    45    46    70     7     4
3       65   100    57    57    70    39     0     1
4       58   100    56    33    61    26     0     1

7 A DNA classification problem
We applied the strategy developed here to a real example from the Statlog data [7, sec. 9.5.3]. In Statlog's DNA problem, each example is a sequence of 60 DNA nucleotides, each represented by one of the four values A, C, G or T; the aim is to identify the midpoint of the sequence as being an exon-intron boundary, an intron-exon boundary, or neither. This is therefore a classification problem with 60 four-valued features, which we denote by F1 through F60, and three classes, denoted by EI, IE and NB. There are 2000 training examples and 1186 test examples.

On this problem, C4.5, using a four-way split at each node in the tree, produced a pruned decision tree with 7.6% error rate on the test set. In terms of error rate, C4.5 ranked 10th among the 22 different techniques reported in [7]. C4.5 uses information gain (specifically, the information-gain ratio, which is the gain divided by the entropy of the split proportions) as its criterion for choosing splits. To maintain compatibility with C4.5, all of our IPA calculations also used information-gain IPA as the criterion for splitting.

First we searched for feature-space heterogeneity. Since all features have four values, we computed the information-gain IPA for all seven possible partitions of the values for each feature. The highest IPA values for the training set are shown in Table 4. Among the 977 training examples for which F31=G, the best IPA split was produced by F32=T with an IPA value of .573. Among the 1023 examples for which F31≠G, the best IPA split was produced by F30=G with an IPA value of .709. For two of the candidate splits at the second level, the IPA could not be computed because all of the examples on one side of the split were in the same class. These splits were (F31≠G, F29=G) and (F31≠G, F30=G). As noted above, we assign an IPA value of zero to these splits.
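The count of seven candidate partitions for a four-valued feature can be checked by enumeration. The helper `binary_partitions` below is hypothetical and only lists the candidate splits; computing the IPA for each one would follow (1).

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

Partition = Tuple[FrozenSet[str], FrozenSet[str]]


def binary_partitions(values: Sequence[str]) -> List[Partition]:
    """All ways to split a set of values into two non-empty groups."""
    first, rest = values[0], values[1:]
    partitions = []
    # Keep the first value on the left side so each partition is listed once;
    # the left side never takes every remaining value, so the right is non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = frozenset((first,) + extra)
            right = frozenset(values) - left
            partitions.append((left, right))
    return partitions


for left, right in binary_partitions("ACGT"):
    print("".join(sorted(left)) + "|" + "".join(sorted(right)))
```

For the nucleotide values A, C, G, T this lists four one-versus-three splits and three two-versus-two splits, seven in all, including the splits written ACT|G and AG|CT in Table 4.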
We stopped the strategic splitting at two levels arbitrarily and submitted the four sub-problems to C4.5, to complete the rest of the decision tree by the standard C4.5 method. One of these subtrees, for the combination (F31≠G, F30≠G), was, after pruning, a null tree that assigned all of its examples to a single class. Error rates for the resulting trees are given in Table 5. The top three levels of our tree and the "C4.5 tree" obtained by applying C4.5 to the entire training set are shown in Figure 2. Introducing two levels of strategic splitting reduced the error rate from 7.6% to 5.8%, and the number of nodes in the tree from 129 to 87. The strategic-splitting tree would rank 4th among the 22 techniques considered in [7].

Table 4: IPA values for feature-based subproblems of the DNA data.

Feature   Partition   Info gain IPA
F31       ACT|G       .593
F32       ACG|T       .568
F35       ACT|G       .547
F30       ACG|T       .503
F33       AG|CT       .497

Table 5: Error rates for feature-based subproblems of the DNA data.

Subproblem      Condition        Size of        Size of    Tree size           Errors on    First test
                                 training set   test set   (number of nodes)   test set     by C4.5
1               F31=G, F32=T     588            377        53                  38           F35
2               F31=G, F32≠T     389            235        17                  11           F29
3               F31≠G, F30=G     441            245        13                  20           F29
4               F31≠G, F30≠G     582            329        1                   0            none
Complete tree                    2000           1186       87                  69 (5.8%)
C4.5 tree                        2000           1186       129                 90 (7.6%)

In a second look at the DNA problem, we searched for class heterogeneity. The original three-class problem can be subdivided in three different ways into a pair of two-class problems. Class heterogeneity is measured by the IPA between the vectors of feature merits for the two two-class problems. The IPA values for the various problem pairs are given in Table 6. The largest IPA value of .889 is very large, and means that the importance vector for classifying EI versus the other two classes is almost orthogonal to the importance vector for classifying IE versus NB.

The four-way trees produced by C4.5 for the two subproblems are summarized in Table 7. There were 38 test set examples that were classified as class EI instead of "rest", but all of these examples were classified correctly in the "IE vs. NB" tree. The total error count of the combined tree is therefore still 73. Once again, the initial strategic split has led to a combined tree that, compared to applying C4.5 to the entire data, is smaller and has a lower error rate.

Table 8 shows the features used in the top levels by the tree generated by C4.5 for the entire data set and for the two subproblems "EI vs. rest" and "IE vs. NB".
Features are listed in order of their proximity to the root node of the tree. It can be seen that the two-class subproblems make use of several features that do not appear in the top levels of the tree for the full problem. Furthermore, there is little overlap between the features appearing in the trees for the two-class subproblems; this reflects the high value of IPA between the feature merit vectors for these two subproblems, and illustrates the degree of class heterogeneity present in the problem.
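The aggregate figures quoted for Tables 5 and 7 can be checked by simple arithmetic. The following is only a bookkeeping sketch: the numbers are transcribed from the tables, and the count of three extra nodes for the two levels of strategic splitting (one root split plus two second-level splits) is our assumption about how the complete-tree size of 87 is obtained.

```python
# Table 5 rows: (training size, test size, tree nodes, test errors).
table5 = [
    (588, 377, 53, 38),
    (389, 235, 17, 11),
    (441, 245, 13, 20),
    (582, 329, 1, 0),
]
# Table 7 rows: (tree nodes, test errors) for "EI vs. rest" and "IE vs. NB".
table7 = [(37, 42), (25, 31)]

TEST_SIZE = 1186
STRATEGIC_NODES = 3  # assumption: one root split plus two second-level splits


def error_rate(errors, n=TEST_SIZE):
    """Percentage error rate on the full test set."""
    return 100.0 * errors / n


feature_errors = sum(row[3] for row in table5)
feature_nodes = sum(row[2] for row in table5) + STRATEGIC_NODES
class_errors = sum(errors for _, errors in table7)
class_nodes = sum(nodes for nodes, _ in table7)

print(feature_nodes, feature_errors, round(error_rate(feature_errors), 1))
print(class_nodes, class_errors, round(error_rate(class_errors), 1))
print(round(error_rate(90), 1))  # plain C4.5 tree, 90 errors
```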
Figure 2: First three levels of complete tree (left) and C4.5 tree (right). The complete tree is generated by two levels of strategic splitting, followed by C4.5 on the resulting subproblems. The C4.5 tree is generated by C4.5 on the full problem.

Table 6: IPA values for class-based subproblems of the DNA data.

Problem sequence               IPA value
EI vs. rest, then IE vs. NB    .889
IE vs. rest, then EI vs. NB    .795
NB vs. rest, then IE vs. EI    .551

Table 7: Error rates for class-based subproblems of the DNA data.

                Size of    Size of   Tree size    Errors
Subproblem      training   test      (number      on test
                set        set       of nodes)    set
EI vs. rest     2000       1186      37           42
IE vs. NB       1536       883       25           31
Complete tree   2000       1186      62           73 (6.2%)

Table 8: Features used in trees for different class-based subproblems of the DNA data.

Problem sequence   Features
EI vs. rest        F32, F31, F35, F33, F30, F50, F21
IE vs. NB          F29, F30, F28, F21, F35, F14
Full problem       F30, F32, F35, F29, F31, F28, F32
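The class-heterogeneity measurement reported in Table 6 compares the feature-merit vectors of two two-class subproblems. A minimal sketch of that idea is given here, under stated assumptions: merits are multiway information gains, the angle is scaled by π/2 so that 0 means parallel and 1 means orthogonal (this scaling is our assumption, not taken from the text), and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of a multiway split on one nominal feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def merit_vector(rows, labels):
    """One information-gain merit per feature column."""
    return [info_gain([r[j] for r in rows], labels) for j in range(len(rows[0]))]


def profile_angle(u, v):
    """Angle between two merit vectors, scaled to [0, 1] (assumed scaling)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return math.acos(max(-1.0, min(1.0, dot / norm))) / (math.pi / 2)


def class_split_angle(rows, labels, first_class):
    """Angle for 'first_class vs. rest, then the remaining classes'."""
    m1 = merit_vector(rows, [y == first_class for y in labels])
    rest = [(r, y) for r, y in zip(rows, labels) if y != first_class]
    m2 = merit_vector([r for r, _ in rest], [y for _, y in rest])
    return profile_angle(m1, m2)


# Toy data: feature 0 separates EI from the rest, feature 1 separates IE from NB.
rows = [("G", "A"), ("G", "C"), ("A", "A"), ("A", "C"), ("C", "A"), ("C", "C")]
labels = ["EI", "EI", "IE", "NB", "IE", "NB"]
print(class_split_angle(rows, labels, "EI"))
```

In the toy data the two subproblems rely on disjoint features, so the two merit vectors are orthogonal and the scaled angle is at its maximum.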
Although we have demonstrated an improvement over the results given by C4.5 using a four-way split at each node, no comparable improvements were obtained when a binary tree, with two-way splits at each node, was used in C4.5. With C4.5 restricted to binary trees, error rates on the test set were 5.8% for C4.5 alone and 5.9% for the combination of two levels of strategic splitting of the feature space followed by C4.5 on each subproblem. In this case, therefore, strategic splitting adds nothing to the classification ability of C4.5. One possible explanation is that the components of our importance vector are based on four-way splitting of each feature. One could instead use the best two-way splitting information gain for each feature. Another possibility is that the decomposition did not go far enough.
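The alternative suggested above, using the best two-way information gain for each feature, could be computed by brute force for a four-valued nominal feature, since there are only seven two-way partitions of {A, C, G, T}. The sketch below is illustrative only; the function names are hypothetical and exhaustive enumeration is our choice, not a method described in the paper.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def binary_partitions(values):
    """Yield each unordered split of the observed values into two non-empty blocks."""
    vals = sorted(set(values))
    if len(vals) < 2:
        return
    first, rest = vals[0], vals[1:]
    # Keeping the first value in the left block produces each split exactly once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            yield left, frozenset(vals) - left


def best_two_way_gain(values, labels):
    """Return (gain, (left, right)) for the best two-way grouping of feature values."""
    n = len(labels)
    base = entropy(labels)
    best = (0.0, None)
    for left, right in binary_partitions(values):
        in_left = [y for v, y in zip(values, labels) if v in left]
        in_right = [y for v, y in zip(values, labels) if v not in left]
        gain = (base
                - len(in_left) / n * entropy(in_left)
                - len(in_right) / n * entropy(in_right))
        if best[1] is None or gain > best[0]:
            best = (gain, (left, right))
    return best


# Toy feature in which only the value G is informative, as in an "ACT|G" partition.
values = list("ACGT" * 2)
labels = [1 if v == "G" else 0 for v in values]
gain, (left, right) = best_two_way_gain(values, labels)
print(gain, sorted(left), sorted(right))
```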
8 Concluding remarks

Standard measures of the importance of features in classification problems can be misleading when the problem is really a mixture of distinct subproblems with different decision characteristics. To address this problem we have defined importance profile angles (IPAs), which directly measure the extent to which the relation between the class and feature variables is different in different parts of the feature space. IPAs can be used to decompose a classification problem into more nearly homogeneous subproblems. This initial decomposition is analogous to the branching of a decision tree, but any classification method, not just decision trees, can subsequently be used on the homogeneous subproblems. IPAs may be based on any of the conventional measures of the importance or merit of features. Though some details of the implementation of IPAs remain open, particularly for numeric and multiple-valued features, the initial results are promising.

In several variants of the multiplexor problem, IPAs identified the features that led to the most efficient decomposition of the classification problem into subproblems, whereas measures of feature merit based on conventional impurity measures were unsuccessful. The IPA derived from contextual merit had the most consistent overall performance. In the DNA problem from the Statlog data, use of two levels of initial splitting based on information gain IPA led to improvements in the decision trees generated by C4.5.

Another approach to decomposition is the twoing strategy of CART [1, pp. 104ff.]. In twoing, the decrease in impurity (originally, Gini gain) is computed for each feature and for each two-way grouping of the classes into (C1, C2) during the tree generation at each node. We quote from CART [1, p. 105]:

    The idea is then, at every node, to select that conglomeration of classes into two superclasses so that considered as a two-class problem, the greatest decrease in node impurity is realized.

    This approach to the problem has one significant advantage: It gives "strategic" splits and informs the user of the class similarities. At each node, it sorts the classes into those two groups which in some sense are most dissimilar and outputs to the user the optimal grouping C1, C2 as well as the best split S. The word strategic is used in the sense that near the top of the tree, this criterion attempts to group together large numbers of classes that are similar in some characteristic.
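The twoing idea quoted above can be illustrated by a brute-force search over superclass groupings and two-way splits of a single nominal feature, scoring each pair by the decrease in two-class Gini impurity. This is a sketch for small problems only: the exhaustive enumeration and the function names are our assumptions, and it is not the procedure implemented in CART.

```python
from itertools import combinations


def two_block_partitions(items):
    """Yield each unordered split of the distinct items into two non-empty blocks."""
    vals = sorted(set(items))
    if len(vals) < 2:
        return
    first, rest = vals[0], vals[1:]
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            yield left, frozenset(vals) - left


def gini(flags):
    """Gini impurity of a two-class sample given as booleans."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return 2.0 * p * (1.0 - p)


def twoing_exhaustive(values, labels):
    """Return (decrease, (C1, C2), (left, right)) with the largest impurity decrease."""
    n = len(labels)
    best = None
    for c1, c2 in two_block_partitions(labels):
        # Treat membership in superclass C1 as the positive class.
        flags = [y in c1 for y in labels]
        for left, right in two_block_partitions(values):
            fl = [f for v, f in zip(values, flags) if v in left]
            fr = [f for v, f in zip(values, flags) if v not in left]
            decrease = gini(flags) - len(fl) / n * gini(fl) - len(fr) / n * gini(fr)
            if best is None or decrease > best[0]:
                best = (decrease, (c1, c2), (left, right))
    return best


# Toy node: values A and C carry class EI, G carries IE, T carries NB.
values = "AACCGGTT"
labels = ["EI", "EI", "EI", "EI", "IE", "IE", "NB", "NB"]
decrease, (c1, c2), (left, right) = twoing_exhaustive(values, labels)
print(decrease, sorted(c1), sorted(c2), sorted(left), sorted(right))
```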
Twoing can be effective when the groups of classes that it forms define subproblems that are more easily solved than the original problem. However, it does not address the question of whether different sets of features are needed to classify examples within the subgroups of classes, and it is this question that is the essence of class heterogeneity as we have defined it.

A measure of feature importance introduced by Fayyad and Irani [8] calculates the angle between the class probability profiles of the descendants of a two-way split of examples according to the candidate feature values. This measure is closely related to the twoing idea in that it attempts to identify the feature that best separates the data into two groups of predominantly different classes. In this case, however, the emphasis is on separating the class statistics instead of the classes themselves. This subtle difference makes the Fayyad-Irani measure less costly to compute.

Like the IPA measure, the Fayyad-Irani measure is expressed as the angle between two vectors. However, the meanings of the vectors and, hence, the measures are quite different. In Fayyad and Irani's case, the class statistics are being separated and termination occurs when these statistics cannot be further refined. In the case of IPA, the importances of the features are being separated and termination occurs when the classification problem cannot be further decomposed.

In this paper we have not addressed the issue of normalizing importance measures to compensate for the biases that particular measures have for certain features (for instance, those that take a large number of distinct values). The C4.5 method normalizes the information gain by the entropy of the feature. Hong et al. [9] have defined an alternative normalization scheme that also applies to the contextual merit. Feature merits should be properly normalized in practice. However, the ideas in this paper have been illustrated on examples in which all features have similar distributions of their values, so that the issue of normalization becomes moot.
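The normalization attributed to C4.5 above, dividing the information gain by the entropy of the feature, can be sketched as follows. This shows only the ratio itself, with hypothetical function names; any further heuristics that C4.5 applies when choosing a test are deliberately left out.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (bits) of a list of labels or feature values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def info_gain(values, labels):
    """Information gain of a multiway split on one nominal feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the feature's own values."""
    split_info = entropy(values)
    if split_info == 0.0:  # constant feature: no split is possible
        return 0.0
    return info_gain(values, labels) / split_info


labels = [0, 0, 0, 0, 1, 1, 1, 1]
binary_feature = ["a"] * 4 + ["b"] * 4   # two values, perfectly predictive
id_feature = list(range(8))              # eight distinct values, one per example

print(info_gain(binary_feature, labels), info_gain(id_feature, labels))
print(gain_ratio(binary_feature, labels), gain_ratio(id_feature, labels))
```

Both toy features have the same raw information gain, but the many-valued one is penalized by the ratio, which is the bias the normalization is meant to counter.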
References

[1] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., Classification and Regression Trees, Wadsworth, Monterey, Calif., 1984.
[2] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, Calif., 1993.
[3] Kononenko, I., Simec, E., and Robnik, M., Overcoming the myopia of inductive learning algorithms with RELIEFF, Applied Intelligence, 7, 39-55, 1997.
[4] Kira, K., and Rendell, L., The feature selection problem: traditional methods and a new algorithm, in Proceedings of AAAI-92, 129-134, 1992.
[5] Hong, S. J., Use of contextual information for feature ranking and discretization, IEEE Transactions on Knowledge and Data Engineering, 9, 718-730, 1997.
[6] Dougherty, J., Kohavi, R., and Sahami, M., Supervised and unsupervised discretization of continuous features, in Proceedings of ML-95, 1995.
[7] Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (Eds.), Machine Learning, Neural and Statistical Classification, Ellis Horwood, Hemel Hempstead, U.K., 1994.
[8] Fayyad, U., and Irani, K., The attribute selection problem in decision tree generation, in Proceedings of AAAI-92, 104-110, 1992.
[9] Hong, S. J., Hosking, J. R. M., and Winograd, S., Use of randomization to normalize feature merits, in Information, Statistics and Induction in Science (D. L. Dowe, K. B. Korb, and J. J. Oliver, eds.), World Scientific, Singapore, 10-19, 1996.