Use of Contextual Information for Feature Ranking and Discretization

S.J. Hong

IEEE Transactions on Knowledge and Data Engineering, 1997

Se June Hong, IBM Research Division, T.J. Watson Research Center, Yorktown Heights, NY 10598, [email protected]

Abstract

Deriving classification rules or decision trees from examples is an important problem. When there are too many features, discarding weak features before the derivation process is highly desirable. When there are numeric features, they need to be discretized for the rule generation. We present a new approach to these problems. Traditional techniques make use of feature merits based on either the information theoretic or statistical correlation between each feature and the class. We instead assign merits to features by finding each feature's "obligation" to the class discrimination in the context of other features. The merits are then used to rank the features, select a feature subset, and to discretize the numeric variables. Experience with benchmark example sets demonstrates that the new approach is a powerful alternative to the traditional methods. This paper concludes by posing some new technical issues that arise from this approach.

Keywords

Feature Analysis, Classification Modeling, Discretization

1 Introduction

When a given example data set for classification contains too many features, a practical need arises to select a relevant subset of features for generating a model for the class. The rules should not include tests on insignificant features, and most of the state of the art techniques [1, 2, 3] succeed in not using insignificant features in the generated rules. However, it is advantageous from a computational point of view to begin with fewer feature variables in the generation algorithm, even though there may be a potential degradation of the rules from the absence of some discarded features. Many real problems, especially those arising from text categorization data [4] and manufacturing process data [5], may contain hundreds to thousands of feature variables. The presence of many extra features can also "fool" the modeling algorithms into generating inferior models, a well known problem for artificial neural networks.

Most decision tree algorithms produce the cut points for numeric features as a byproduct of the process of recursively picking the best tests. The collection of these cut points amounts to a discretization. Many rule generation algorithms deal with symbolic intervals defined by the cut points, which in turn might have been induced by some tree generation process. Once a numeric feature variable is discretized, a rule may test whether the value of the feature belongs to any proper subset of the now categorical values. One of the important applications of the new feature merits presented here is to use them for such discretization.

This paper does not address the problem of finding new or derived features other than the ones explicitly given at the outset. A linear or other functional combination of several numeric features may very well be a better feature (see for instance [6]) or even a good discriminator by itself. We rely on the domain experts having provided such useful derived features as part of the example data and deal exclusively with the given feature set. In other words, this paper is focused on discriminating (locally in a test) only by hyperplanes orthogonal to the axes of features in the multidimensional feature space where examples reside. We regard the problem of missing values as a separate issue and assume all feature values are specified for all examples. The noisy data problem is indirectly but quite effectively handled by the inherent robustness of our approach, and we do not address the problem of overfitting in a direct way in this paper. The techniques we describe here have been implemented in the RAMP (Rule Abstraction for Modeling and Prediction) system [7]. We summarize some benchmarking experiments and real problem experience using the RAMP system in a later section. The closest approaches to what is presented here are RELIEF [8] and its more recent follow-on RELIEFF [9]. We will briefly contrast our approach to these in the concluding section.

2 Common characteristics of traditional decision tree generation

Determining the relative importance of features is at the heart of all practical decision tree generation algorithms. An exception to this is when some enumerative scheme is used to generate an absolute minimum expression tree [10], but any such process must pay an exponential computational cost in terms of the number of features. Aside from decision trees, determining a most relevant subset of features for the classification task would be important on its own. (See for instance an analysis of a minimum feature set approach in [11], feature scoring based on structural indices [12], and the use of the discernibility criteria in the Rough Set approach [13].)

A set of examples (usually a subset of the original example set in the tree generation process) generally contains examples that belong to many classes. If a subset contains examples of only one class, no more tests need be applied to the decision path that led to the separation of the subset from its superset. The degree of uncertainty about the classes of examples in a set is often called its class impurity. Let $P_i = n_i/N$, where $n_i$ is the number of examples whose class is $i$ and $N = \sum_i n_i$ is the total number of examples in the given set. The information theoretic measure for the impurity is the entropy function on the probabilities, $H(class) = -\sum_i P_i \log P_i$, which is used in [14] and in ID3/C4.5 [2] and its variants. A statistical measure often used is the Gini index, $G(class) = 1 - \sum_i P_i^2$, which is used by CART [15] and its variants. For binary trees, Fayyad and Irani [16] utilize a new measure based on the degree of orthogonality of class profiles on the two daughter subsets induced by the given test. For VLSI logic synthesis a minimal decision lattice has been pursued as well [17].

When a given set of examples is tested on a variable/feature, it is partitioned into subsets such that each subset contains only the examples of a unique value of the variable tested. Each of these subsets has its own impurity, whose weighted average (weighted by the size of the subsets) reflects the improvement due to the variable. The "best" variable to test for the given set of examples is then the one that produces the smallest average impurity, in most of the traditional approaches. An alternative way of selecting the "best" variable is to use some kind of correlation measure between the variable and the class. In the information theoretic approach this takes the form of the information (gain) from the variable to the class, defined as $I(class : variable) = H(class) - H(class|variable)$. $H(class)$ is common to all variables, for it is the impurity of the set being partitioned. $H(class|variable)$ is exactly the weighted impurity of the partitioned subsets induced by the variable, which makes the information correlation approach equivalent to the simple minimal impurity scheme unless the information (gain) is further normalized as in [2].
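As a concrete reference for these measures, the following sketch (our own illustrative code, not from the paper; all names are ours) computes the entropy and Gini impurities of a class distribution and the information gain of a candidate variable:

import math
from collections import Counter

def entropy(labels):
    # H(class) = -sum_i P_i log2 P_i over the class distribution of `labels`.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # G(class) = 1 - sum_i P_i^2 over the class distribution of `labels`.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # I(class : variable) = H(class) - H(class | variable), where the conditional
    # entropy is the size-weighted average impurity of the subsets induced by
    # each value of the variable.
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    h_cond = sum(len(part) / n * entropy(part) for part in subsets.values())
    return entropy(labels) - h_cond

if __name__ == "__main__":
    xs = ["a", "a", "b", "b", "b", "a"]     # feature values
    ys = [0, 0, 1, 1, 0, 1]                 # class labels
    print(entropy(ys), gini(ys), information_gain(xs, ys))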

All decision tree generation techniques start with the entire training set of examples at the root node. The "best" feature is selected as the test, and the resulting partition of the examples recursively induces the new "best" feature selection for each subset, until the subset is "pure" to the desired degree. A minimal feature subset can be obtained by a slight variation [18]. At each level of the tree, all the partitioned subsets (i.e. the nodes at the current level) are tested for a common "best" feature, instead of independently processing each subset. When all subsets are "pure" at the bottom, all the features that appear on the tree usually form a reasonably minimal feature set. (Of course, the tree induced by this process would not be as minimal as the tree of fully recursive tree generators.) All of these approaches, whether the "best" variable is selected from the viewpoint of minimal resulting impurity or maximal correlation to the classes, rely on the single variable's effect on the class distinguishability at each stage of the tree generation. This is problematic especially at the nodes near the root, because the context of other features is ignored. One, two, or even more levels of look ahead does ameliorate this problem somewhat, at the cost of a steeply increasing computational burden. When rules are generated from a tree, it has been observed [19, 20] that the tests performed near the root node are often superfluous or weak and are consequently removed in a rule.

3 A pathological case: EXOR

Ranking features by their own correlations to the classes is generally not effective when there is a strong feature interaction in discriminating the class. Symmetric functions are typical of strong feature interactions and they are not well handled by traditional techniques. These include the important class of majority functions and m-out-of-n functions, which are often cited as predominant in medical diagnosis problems. The Exclusive-Or (Ex-Or, parity) function represents, in some sense, an extreme case of symmetric functions. An Ex-Or function on n binary variables, $X_1, X_2, ..., X_n$, takes on the output value of 1 if an odd number of the variables are 1, and 0 otherwise. Consider the truth table of this function with $2^n$ rows and $n + 1$ columns. For each row the value in the last column (the class value) is 1 if and only if there is an odd number of 1s in the first n bits of the row. Here the values 0 and 1 are categorical. It is readily observed that the information conveyed by each of the feature variables, $X_i$, is zero. The statistical correlation between a variable and the class is also identically zero. This is true even if one considered any proper subset of the variables as a tuple and computed the information or correlation associated with it and the class. Only when all the variables are considered together is there perfect information and perfect correlation.

Now consider an expanded Ex-Or function that takes on extra random variables, constructed as follows. First replicate each row of the truth table k times. Then, insert a $k \cdot 2^n$ by m array of random bits between the first n columns and the last column. Let us call such an example set EXOR(n, m, $k \cdot 2^n$). Each row of this new table can now be interpreted as an n + m binary feature example, with its class designation in the last position. For large values of m and k, the minimum set of rules induced from this example set should correspond to the binary Ex-Or function on the original n variables, or the rules for EXOR(n, 0, $2^n$). And these rules should use only the original n variables in their expression. Likewise, a good decision tree should test only on the original n variables, resulting in a full n level binary tree. Most rule generation algorithms except R-MINI [21] and SWAP1 [3] seldom generate the minimal set of rules for EXOR cases. While the original Ex-Or variables (base variables) have zero correlation to the class, the random variables in general have non-zero correlation to the class. Therefore, all traditional decision tree algorithms fail to develop the desirable tree, i.e. the random variables tend to dominate the tests near the root, and the relevant variables tend to be tested near the leaves of the tree, resulting in an unnecessarily large tree. Any attempt to use a traditional decision tree approach to select a reduced relevant feature set would fail, for the same reason, on cases like EXOR(n, m, $k \cdot 2^n$). Feature interaction is a well recognized problem that is exemplified by the EXOR cases, although some would argue that, in most "real" problems, nature is kind and does not present EXOR-like problems.

4 A new contextual merit function for features

In general, one feature does not distinguish classes by itself; it does so in combination with other features. Therefore, it is desirable to obtain the feature's correlation to the class in the context of other features. We seek a merit function that captures this contextual correlation implicitly, since enumerating all possible contexts is impractical. Consider a set of N examples with $N_s$ symbolic, or categorical, features and $N_n$ numeric features. We will assume that such an example set is represented by an N by $N_f$ ($= N_s + N_n$) array of feature values along with a length N vector of class labels, called C. The class label is always symbolic. Further assume for now that there are only two classes, and that the examples are grouped by the class labels, $C_1$ and $C_2$, containing $N_1$ and $N_2$ examples, respectively, so that $N_1 + N_2 = N$. Associated with each example $e_i$ is an $N_f$-tuple feature vector, $e_i = (x_{1i}, x_{2i}, ..., x_{ki}, ..., x_{N_f i})$, which is a row entry in the example data array as shown in Figure 1. The value of a feature $X_i$ is denoted by $x_i$.

[Figure 1: Examples. The example data as an N by $N_f$ array with symbolic features $X_1, ..., X_{N_s}$, numeric features $X_{N_s+1}, ..., X_{N_f}$, and a class column; the first $N_1$ rows belong to class $C_1$ and the remaining $N_2$ rows to class $C_2$, with $N_1 + N_2 = N$ total examples and $N_s + N_n = N_f$ total features.]

We now define a distance measure $D_{ij}$ between two examples, similar to many such measures in the literature, but with a slight modification:

$$D_{ij} = \sum_{k=1}^{N_f} d^{(k)}_{ij} \qquad (1)$$

where for a symbolic feature $X_k$, its component distance $d^{(k)}_{ij}$ is

$$d^{(k)}_{ij} = \begin{cases} 0 & \text{if } x_{ki} = x_{kj} \\ 1 & \text{if } x_{ki} \neq x_{kj} \end{cases} \qquad (2)$$

and for a numeric feature $X_k$,

$$d^{(k)}_{ij} = \min\left(|x_{ki} - x_{kj}|/t_k,\ 1\right) \qquad (3)$$

The value $t_k$ is a feature dependent threshold, designed to capture the notion that if the difference is "big enough" the two values are likely to be different when discretized, hence the component distance of 1. However, if the difference is less than the threshold, its distance is a fraction as shown above. This formulation coincides with the probability of an interval of length $|x_{ki} - x_{kj}|$ having its ends in two distinct bins, if the range of the feature were cut equally into consecutive bins of size $t_k$. We will discuss how we determine the $t_k$ values in a later section. The distance between two examples is traditionally computed by formulae similar to (1-3) in pattern recognition, learning techniques, and case based reasoning. However, we feel that the subtle departure of employing a ramp function as in (3) makes this a more effective measure of distance. Usually, the value difference is normalized (divided) by the magnitude of the range of the variable as the component distance, which is a straight line rather than a ramp. The new distance function is shown in Figure 2.

[Figure 2: The ramp function. The component distance grows linearly with the value difference and saturates at 1 beyond the ramp threshold $t_k$.]
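For concreteness, a minimal sketch of the distance of equations (1)-(3), with the ramp component for numeric features (our illustrative code; the feature-type flags and threshold list are assumptions about how the data might be represented):

def component_distance(a, b, is_numeric, t):
    # d^(k): ramp min(|a - b| / t_k, 1) for a numeric feature,
    # 0/1 match-mismatch for a symbolic one (eqs. 2 and 3).
    if is_numeric:
        return min(abs(a - b) / t, 1.0)
    return 0.0 if a == b else 1.0

def example_distance(ei, ej, numeric_flags, thresholds):
    # D_ij: sum of the component distances over all N_f features (eq. 1).
    return sum(component_distance(a, b, numeric_flags[k], thresholds[k])
               for k, (a, b) in enumerate(zip(ei, ej)))

# Example: one symbolic and one numeric feature with ramp threshold 5.0.
print(example_distance(("red", 3.0), ("blue", 4.5), [False, True], [None, 5.0]))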

We remark here that the 0, 1 values used as the component distance for symbolic features in (2) can be changed to take on any value between 0 and 1. This is useful when certain symbolic values of a feature are closer to each other than some of the other values. For instance, if color were a feature, the distance between yellow and yellow-green, or between magenta and pink, may be justifiably less than that between red and blue, which would be 1.

Before introducing a formal definition of the new merit function, let us consider an example set that has only categorical features. Suppose an example in class $C_1$ and a counter example in class $C_2$ differ in only two features, say, $X_2$ and $X_7$. The distance between the two examples is 2. These two features are then the sole contributors in distinguishing the classes of the two examples, and hence are "responsible" or "obliged". We might say that the magnitude of the obligation is the inverse of the distance (1/2), which should be shared equally by $X_2$ and $X_7$, i.e. shared in proportion to their component distances (1/4 each in this case). This is the key observation. From the pairs of examples in different classes, we obtain merit contributions to the features, so that they reflect the feature's share of the obligation to separate the classes in the context of all other features. We accomplish this without explicitly enumerating the contexts themselves.

For each example in $C_1$, we compute the distances to all the counter examples in $C_2$, and retain for merit considerations only some fraction of the counter examples that are nearest to the example. The counter examples that are far from the example will only contribute noise to the process, and they would be distinguished from the example easily anyway. This philosophy is similar to that of K-nearest neighbor schemes. We shall retain contributions from the $K = \log_2 N_2$ nearest counter examples. (This is an arbitrary choice, but justified by extensive experiments.) Each of the counter examples chosen for the example contributes its $1/D$, distributed to the features in proportion to their component distance $d^{(k)}$, i.e. $d^{(k)}/D^2$, to the respective merit values of the features. (The i and j indices of D and $d^{(k)}$ are dropped for convenience here.) The quantity $1/D$ can be viewed as the degree of independence of the features constituting the distance in the given context, or the strength of the context. (Any strictly decreasing function of D would model the situation; see the concluding section for more discussion on this.) The contextual merit $m_k$ of a feature $X_k$ is defined as

$$m_k = \sum_{i=1}^{N} \sum_{j \in \bar{C}(i)} w_{ij}\, d^{(k)}_{ij} \qquad (4)$$

where $w_{ij} = 1/D_{ij}^2$ if j is one of the K-nearest counter examples to i, and $w_{ij} = 0$ otherwise. We use $K = \log_2 |\bar{C}(i)|$, where $\bar{C}(i)$ is the set of all examples not in the class of i. This definition is general for any number of classes and symmetric for all classes. To compute $m_k$ in practice, we can use the following approximation to save computation:

$$m_k = \sum_{i \in C_1} \sum_{j \in C_2} w_{ij}\, d^{(k)}_{ij} \qquad (5)$$

where $w_{ij} = 1/D_{ij}^2$ if j is one of the $K = \log_2 N_2$ nearest neighbors of i among all j in $C_2$. For a two class situation with about the same number of examples in each, the result of (4) is close to double that of (5). We give the contextual merit algorithm to compute $M = (m_1, m_2, ..., m_{N_f})$ according to (5) here. Most of our practical experiments made use of this shortcut without significant differences. In the algorithm below, we denote the contribution to M from each example $e_i$ in the class $C_1$ as an $N_f$-tuple vector $M_i$, and that from i and j as $M_{ij}$, likewise.

Contextual Merit Algorithm, CM
CM0) Initialize M to be an all-0 vector.
CM1) Fix the numeric threshold $t_k$ for each numeric feature (usually set to 1/2 the magnitude of the range).
CM2) For each example $e_i$ of class $C_1$, do
  CM2.0) Initialize $M_i$ to be an all-0 vector.
  CM2.1) Compute the distances to all examples of $C_2$ and let the $\log_2 N_2$ (at least 1) nearest-distance counter examples be the BESTSET.
  CM2.2) For each counter example $e_j$ in the BESTSET, do
    CM2.2.1) Compute $M_{ij}$: $m^{(k)}_{ij} = d^{(k)}_{ij}/D_{ij}^2$ for all k components.
    CM2.2.2) Update $M_i$: $M_i = M_i + M_{ij}$.
  CM2.3) Update M: $M = M + M_i$.
CM3) Return M.
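The CM algorithm can be sketched compactly as follows (our illustrative Python, not the RAMP implementation; it recomputes all pairwise distances naively and assumes per-feature numeric flags and ramp thresholds as in the earlier distance sketch):

import math

def _component(a, b, is_numeric, t):
    # Ramp distance for a numeric feature, 0/1 mismatch for a symbolic one.
    return min(abs(a - b) / t, 1.0) if is_numeric else (0.0 if a == b else 1.0)

def contextual_merits(class1, class2, numeric_flags, thresholds):
    # Contextual merits M = (m_1, ..., m_Nf) per equation (5): for each e_i in C1,
    # the K = log2 |C2| nearest counter examples in C2 each contribute
    # d^(k)_ij / D_ij^2 to the merit of feature k.
    merits = [0.0] * len(numeric_flags)
    K = max(1, int(math.log2(len(class2))))
    for ei in class1:
        scored = []
        for ej in class2:
            comps = [_component(a, b, numeric_flags[k], thresholds[k])
                     for k, (a, b) in enumerate(zip(ei, ej))]
            scored.append((sum(comps), comps))
        # BESTSET: the K nearest counter examples (zero-distance conflicts skipped).
        for D, comps in sorted(scored, key=lambda s: s[0])[:K]:
            if D > 0:
                for k, d in enumerate(comps):
                    merits[k] += d / (D * D)
    return merits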

Let us return to the EXOR case and observe the behavior of the algorithm. Although this is an all categorical feature case, it exhibits many salient characteristics of the algorithm.

5 Contextual merits of the features of EXOR cases

Any merit measure based on each single feature's correlation to the class generally fails to assign a higher merit value to the base variables than to the random variables. The contextual merit of the CM algorithm does. We shall demonstrate this with a more relaxed EXOR case than the one defined earlier as EXOR(n, m, $k \cdot 2^n$), in which the truth table entries of the base case, EXOR(n, 0, $2^n$), are repeated exactly k times. For an M sufficiently greater than $2^n$, let us now redefine EXOR(n, m, M), constructed as follows. Let $k' = 2 \cdot \lceil M/2^n \rceil$. First, construct the earlier EXOR(n, m, $k' \cdot 2^n$). The number of examples in this table is at least twice M. A random subset of M entries from this table is now defined as EXOR(n, m, M), which contains M examples in which the entries of the base table are not replicated exactly the same number of times. The optimal rules and decision trees derived from this EXOR should be identical to those of the base case, as discussed earlier.

The first example case is an EXOR(4,10,1000). The CM algorithm produces the following merit values for the first four base variables and the remaining ten random variables (delineated by a "‖"). Since we are interested in the relative importance, the values shown are decimal shifted and rounded so that the maximum is a three digit number:

M = (184 194 176 176 ‖ 89 87 93 90 89 87 89 92 93 87)

The merits of the base variables are indeed greater than those of the random variables. For the purpose of comparing the merit values of base versus random variables, let us define the Distinguishing Ratio (DR) as the ratio between the least base variable merit and the greatest random variable merit. DR = 176/93 = 1.89 for this example case. We say that the CM succeeds if DR is greater than 1 and fails otherwise. Here is the DR value summary of some more EXOR example cases, denoted as (n, m, M) below. Recall that each new construction of an EXOR example set uses random fillings. Multiple DR values shown below are from different such EXOR constructions with the same parameter set.

(3, 10, 3200): 4.68, 4.84
(3, 10, 1600): 3.1, 3.58, 3.41
(3, 10, 800): 2.59, 2.56
(3, 10, 400): 1.95
(3, 15, 400): 1.35
(3, 15, 100): 1.03
(3, 16, 100): fail
(3, 20, 400): 1.15
(3, 10, 200): 1.46
(3, 20, 200): 1.11
(3, 10, 100): 1.16
(3, 20, 100): fail
(3, 10, 50): fail
(4, 10, 1000): 1.89, 1.97, 1.74, 1.75, 1.92
(4, 10, 200): 1.16, 1.28
(4, 10, 150): 1, 1.01, fail
(5, 10, 1000): 1.37, 1.28, 1.34
(5, 10, 500): 1.14, 1.04
(5, 10, 400): fail, 1.06
(5, 10, 350): 1.02, fail
(6, 10, 1000): 1.09
(6, 10, 500): fail
(7, 10, 1000): fail

The DR values are quite similar despite the random variations in the different invocations of EXOR(n, m, M). It is also seen that the DR increases with the number of examples, M, and decreases with n more sensitively than it does with m. The parameters at the boundary of failures are meaningful in that they give some concrete values for the number of examples needed for CM to succeed, which we believe is related to the number of examples needed to learn EXOR functions based on the VC dimensions [22] in the presence of random variables.

What happens when two features are highly correlated? Let us take an EXOR(4,10,1000) which produces the following merits and DR:

M = (192 187 184 184 ‖ 86 94 92 90 89 95 96 90 91 93)
DR = 184/96 = 1.92

We now replace the first random variable $X_5$ with a copy of the last base variable $X_4$, so that the modified example data now has two identical columns (variables); $X_4 = X_5$. The merits of this modified case are:

M = (274 270 282 32 ‖ 32 100 102 92 95 101 101 105 100 98)

The contextual merits of $X_4$ and $X_5$ are now the least, for each of these variables contributes little to the class separation in the presence of other variables that include a copy of itself. This effect is a consequence of the CM algorithm: duplicated variables contribute to the distance as many times as they are duplicated, and so share the merit contribution in portions that are that much smaller. Of course, as soon as one of these variables is discarded, the merit of the remaining copy of the base variable improves and the new DR becomes that of a typical EXOR(4,9,1000).

The next example case illustrates an interesting property of the contextual merit. First, we take an EXOR(5,10,1000) which produces the following merits and DR:

M = (125 126 118 125 124 ‖ 86 86 92 81 89 81 76 83 78 79)
DR = 118/92 = 1.28

Let us now replace the first random variable $X_6$ with the Ex-Or of the first two columns ($X_1$ and $X_2$), and replace the next random variable $X_7$ with the Ex-Or of the remaining three base variable columns ($X_3$, $X_4$ and $X_5$). The class can be represented by a simple Ex-Or function of the two new variables due to the associativity of the Ex-Or function. This modified example case produces the following merits:

M = (89 88 82 81 84 ‖ 174 241 ‖ 53 50 58 51 48 53 53 50)
DR = 81/58 = 1.4 (of the original base variables)
DR' = 174/58 = 3 (of the two new variables)

This example case illustrates several points. First, the DR value for the original variables did not decrease, because the new variables introduced are not directly correlated with any of the base variables. In fact, the DR value increased from 1.28 in the original EXOR case to 1.4 in the modified case, primarily due to the fact that the classes can now be distinguished by a smaller number of variables that include the two new ones (e.g. $X_1$, $X_2$ and $X_7$ can now represent the class function). Second, the DR value for the two new variables is greater than that of the original base variables. Third, the merits of $X_6$ and $X_7$ are close to the sum of the merits of their constituent base variables! This is due to a general property of the contextual merit for categorical features. Suppose that a feature X is classification-equivalent to a t-tuple of other features, $(Y_1, Y_2, ..., Y_t)$, where all the features are categorical; that is, whenever the t-tuple values are different for two examples in different classes, so are the values of X, and vice versa. Since contributions to the contextual merit come from pairs of different class examples, whenever the $1/D^2$ contribution is added to the merit of X, only those features among the t-tuple that are actually individually different in the pair will receive the same contribution. Therefore, the merit of X is no less than the maximum merit of the $Y_i$s, and obviously, it is bounded from above by the sum of the merits of the $Y_i$s. We state this as a formal property of the contextual merit:

Property: If a categorical feature X is classification-equivalent to t other categorical features, $(Y_1, Y_2, ..., Y_t)$, present in the data, the contextual merit m(X) satisfies

$$\max_i m(Y_i) \;\le\; m(X) \;\le\; \sum_i m(Y_i) \qquad (6)$$

The m(X) value approaches the upper bound as the number of examples increases, which was the case in the modified EXOR case just illustrated. This is due to the fact that only the K-nearest counter examples are considered in the merit contribution. Among the K-nearest, the partial distance among the t-tuple would be close to 1 when there is a sufficient number of examples. That is, usually only one of the $Y_i$s would receive the merit contribution each time X is credited. A similar argument holds for numeric features; however, the property holds only as a tendency, since the component distance of a numeric feature is not a crisp 1 or 0.
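For concreteness, here is one way the EXOR(n, m, M) example sets and the DR statistic of this section could be produced (our sketch; the paper prescribes only the construction, not an implementation):

import math
import random

def exor_examples(n, m, M, seed=0):
    # EXOR(n, m, M): M rows of n base bits and m random bits, with the class bit
    # (last position) equal to the parity of the n base bits. Rows are sampled
    # from a table in which every base pattern appears k' = 2*ceil(M/2^n) times.
    rng = random.Random(seed)
    k_prime = 2 * math.ceil(M / 2 ** n)
    table = []
    for pattern in range(2 ** n):
        base = [(pattern >> b) & 1 for b in range(n)]
        for _ in range(k_prime):
            table.append(base + [rng.randint(0, 1) for _ in range(m)] + [sum(base) % 2])
    return rng.sample(table, M)

def distinguishing_ratio(merits, n):
    # DR: least base-variable merit over greatest random-variable merit.
    return min(merits[:n]) / max(merits[n:])

Splitting such a set by its class label and feeding the two groups to a CM implementation, such as the earlier sketch, yields merit vectors whose distinguishing_ratio behaves like the DR values tabulated above.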

6 Considerations for numeric features

Earlier we introduced the ramp function threshold $t_k$ for a numeric feature $X_k$ in defining its component distance. If a numeric feature is to be discretized at all, it must have at least two intervals within the range. Therefore, an initial estimate of half of the size of the range would be reasonable, i.e., if the values of a feature $X_k$ range as $x_{min} \le x_k \le x_{max}$, $t_k$ is estimated as $(x_{max} - x_{min})/2$. We have used initial threshold values of 1/3, 1/2 and the full range in many experiments. The smaller values were better than the traditional full range choice. This led us to use 1/2 the range as the initial default value in our system. (See section 8 for how the threshold values change during the discretization process.) When the domain knowledge indicates that the values of $X_k$ should be more or less equally partitioned into size t intervals, and small value differences less than a certain threshold $e_k$ are meaningless (due to, say, the accuracy of measurements), then it would be natural to use $t_k = t$ and modify the ramp function so that the ramp begins at $e_k$.

Let us now examine a few cases of numeric feature problems derived from the EXOR cases. We will make use of a randomizing transformation on the EXOR cases so that they become numeric cases, through a process RANNUM(R, EXOR). RANNUM substitutes a categorical feature value "0" with a random number drawn from $0 \ldots (R-1)$, and "1" with a random number drawn from $R \ldots (2R-1)$, except for the class labels. In the following example cases, we use the threshold value $t = (2R-1)/2$. ($t = (2R-1)/3$ produces similar results here.) The following is a list of DR values for some EXOR(n, m, M) cases, denoted as E, and their numericized versions, RANNUM(R, E), denoted as (R, E).

E = (3, 10, 3200): 4.84    (10, E): 1.19
E = (3, 10, 1600): 3.58    (10, E): 1.14
E = (3, 10, 800): 2.56     (10, E): 1.12, 1.11 (RANNUM twice)
E = (3, 10, 400): 1.95     (10, E): 1.05
E = (3, 10, 200): 1.46     (10, E): 0.94 (fail)

E = (3, 10, 1600): 3.41    (2, E): 1.24; (3, E): 1.19; (5, E): 1.16; (100, E): 1.16 (converges with R)

We observe that CM succeeds in the numericized EXOR cases, but less effectively. (Again, recall that the traditional measures fail on all of these cases.) Different RANNUMed versions of the same case produce similar DR values, as shown for the EXOR(3,10,800) case. An increased randomizing range R lowers the DR value, as shown in the second set of examples above, but it converges rapidly.

In general, the resolution of the contextual merits is degraded for numeric features, which is mainly due to the way the component distances are computed. The component distance of a numeric feature is a fraction between 0 and 1 (seldom 0). Hence, a positive merit contribution accrues almost every time the step CM2.2.2 is invoked. The net effect is that the merit of a numeric feature is increased (inflation) in comparison to that of a categorical feature, and the relative contrast of the merits is diminished (smoothing). We call this the Ramp Effect. The inflation of merit values for numeric features is in part due to another reason: there is, in general, a greater variety of distinct values in a numeric feature. The greater the variety of values, the greater the frequency of large component distances, and the more contributions there are to the feature's merit (see CM2.2.1). This phenomenon is also true for a categorical feature. Many have observed this bias favoring large variety features in the traditional merits (or demerits), such as Gini and entropy based measures, as well. We call this the Variety Effect. That is, the more variety of values a feature has, the larger its contextual merit tends to be. C4.5 [2], for example, employs a normalization step in an attempt to compensate for this effect. There the information gain of the feature is divided by the entropy of the feature, which tends to increase with the variety of the feature values.

These effects merely diminish the effectiveness of contextual merits when all features are numeric; for mixed categorical and numeric features, the relative merits of features may become meaningless unless these effects can be countered. There are several ways. One way to overcome these effects is to make use of external reference knowledge when it is available. That is, if it is known that a categorical feature X and a numeric feature Y are of comparable relevance, and the CM generated merits are m(X) and m(Y) respectively, we can calibrate the merits of numeric features by multiplying them by m(X)/m(Y). This technique requires that we know at least one reference pair of features and that the varieties of values in the numeric features be comparable. We illustrate this with another example case. First, let us take an EXOR(3,10,1600) and numericize $X_3$ through $X_{13}$ by the RANNUM process with R = 5. The first two variables are still categorical. The merits and DR value are:

M = (83 83 203 ‖ 121 133 130 132 130 128 130 132 131 132)
DR = 83/133 = 0.62 (fail)

As expected, the numerical variables attain inflated merit values. Of the numerical variables, however, the base variable $X_3$ still has greater merit than the random variables. Suppose we are given the fact that $X_2$ and $X_3$ are of comparable relevance to the class. We can now multiply the merits of the numerical features by $m(X_2)/m(X_3) = 83/203$ to obtain the calibrated merits and DR (in rounded numbers):

M' = (83 83 83 ‖ 49 54 53 54 53 52 53 54 54 54)
DR' = 83/54 = 1.54

The calibration using external knowledge succeeds in handling the mixed features with the new DR of 1.54, which is still only about half that of a typical all categorical EXOR(3,10,1600). We can attribute this to the smoothing effect, which cannot be corrected this way. Another way to overcome the problem is to first discretize the numeric features and then find the contextual merits of the resulting all categorical data. We discuss in section 8 a new discretization algorithm based on our contextual feature analysis. We still need to counter the ramp effect and variety effect during the discretization process, but it will be seen that a heuristic countermeasure scheme we introduce there is adequate for the task. We have recently found a more powerful technique to neutralize these effects [23]. It is based on computing, for each feature, the expected contextual merit of a random feature, in its place, that has the same value distribution as the original feature it replaces, and then comparing the original merit of the feature to the random counterpart. The merit of such a randomized feature is subjected to the same biases of the merit computation and therefore serves as a normalizing value, in the same manner as the feature's entropy is used in C4.5.
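A sketch of the RANNUM numericization used in these experiments (our illustrative code; the paper specifies only the two value ranges, and drawing integers uniformly from them is our assumption):

import random

def rannum(examples, R, seed=0):
    # Replace each categorical value 0 with a random number in 0..R-1 and each 1
    # with a random number in R..2R-1, leaving the class label (last position) as is.
    rng = random.Random(seed)
    out = []
    for row in examples:
        feats = [rng.randint(0, R - 1) if v == 0 else rng.randint(R, 2 * R - 1)
                 for v in row[:-1]]
        out.append(feats + [row[-1]])
    return out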

7 Considerations for application of contextual merit

We have shown EXOR cases where CM "failed" for basically having too small a number of examples for the given number of features. When these same cases were subjected to R-MINI as they were, they all produced the correct EXOR rules using only the base variables. It indicates that even for these cases, the base variables are the most important variables. We understand this as a limit on the resolution of CM generated merits when there are too many irrelevant features and not enough examples.

Contextual merits capture the relative importance of features in distinguishing the classes in the context of other features. The notion of context is important here. We have shown that highly correlated features mutually weaken their importance even when any one of them, in the absence of the others, is a strong feature. For feature selection, then, one should not choose features from the top in the order of their contextual merits until enough are selected. The correct strategy is to discard the feature with the least merit, regenerate the merits, and repeat the process until the necessary number of features remains. (Here, "enough" and "necessary" mean that the feature set is adequate for distinguishing the classes. What is adequate depends on the domain, ranging from strict value differences in the case of noiseless data to tolerating some preset amount of conflicts.) Referring back to the EXOR cases, eventually all random variables will be discarded and only the base variables will remain. In fact, the "fail" cases shown before will recover by this strategy as well. Since the worst merit of the base variables of these cases was not the least (not shown in the data), discarding the least merit variable from such an EXOR(n, m, M) case is equivalent to a new case EXOR(n, m-1, M), which yields an improved DR value. When there are highly correlated but important features, the last one will not be discarded.

We have just described the core process by which a minimal set of relevant features can be extracted. The process requires repeated CM invocation, which is a practical problem. When there are a large number of features, say in the hundreds, one can discard a reasonable chunk of least merit features at a time. However, a separate correlation analysis should be done on the set of least merit features, to ensure that highly correlated features, if any, are not altogether discarded in a given round. We use the normalized contextual merits [23] for this purpose in our RAMP system [7]. We have observed real cases where some features that were returned from the discard candidates in a round eventually survived to the final feature set. We note that discarding one feature at a time is similar to the way the rough set approach arrives at the minimal feature sets (called reducts [13]). Our method does not search for all reducts, and continues to reduce the number of features until a "desired" number of features remain. The desired number of features may be more than the necessary minimal set to maintain a reasonable distinguishability, depending on the problem case, especially when the data is highly noisy.

The EXOR cases are instructive only when the number of examples M is sufficient so that the minimal function is still an Ex-Or function of the base variables. When M is too small for an EXOR(n, m, M) case, one cannot insist that the data should represent an Ex-Or function of the base variables any more. All the cases we presented have sufficient M so that this is not a problem. However, we do not yet know how to determine from the data whether the number of examples is sufficient to correctly model it in the presence of noise and irrelevant features. The VC dimension theory is a start, and a study of the EXOR family of functions may provide some clues to understanding this difficult problem.

Returning to the CM algorithm, we see that it is class order dependent. For the EXOR family, there is hardly any difference whether we take as $C_1$ the odd examples or the even examples. When the sizes of the two classes, $N_1$ and $N_2$, are comparable, either direction produces similar results. When there is a large disparity, results may differ in significant ways, and we take the smaller class to be the class $C_1$. If desired, one can compute the merits using the first definition given by (4), at the cost of about doubling the computation. When there are more than two classes, one can always treat it as a series of two class problems, or compute the merits according to (4). In practice, we have successfully used the same CM algorithm (according to (5)) as follows: simply accumulate the merits, first from $C_1$ versus the rest, then, discarding $C_1$, from $C_2$ versus the rest, and so on, until the last two classes are processed.

When the data does not contain highly correlated features, the contextual merit can serve as an alternative mechanism to select the current test node in a decision tree generation. Although we have not pursued this yet, we expect this approach will yield improved trees in many cases, similar to the experience reported in [9] using the RELIEFF based merits for trees. In the lower levels of the tree, however, one may revert to the traditional method of test feature selection because the decision path to the current node effectively defines the context. When numeric variables are involved, a proper countermeasure should be used against the ramp effect and variety effect. We suggest that the tree generation involving numeric features be carried out after they have been discretized so that the problem can be handled as an all categorical case, which was shown effective in [24].
The contextual merits also suggest the key dimensions to be examined when one performs projection pursuit [25]. We believe it would be more fruitful to study the projections of combinations of the higher merit feature dimensions than those of the lower merit ones.

The complexity of the CM algorithm is basically of order $N_f \cdot N^2$. When the number of examples is large, one can use a random subset of examples to compute the merits. The resolution of relative significance may degrade due to sampling, but it is quite robust until the example size becomes too small, as seen in the EXOR cases. Our experience with many large example sets confirms that a sampling strategy is practically effective. First, take a reasonable subset and compute the merits. Compute the merits again using a larger subset (say, up to double the size), and compare the new merits to the old. If the new merits are more or less in the same order and the merit values more or less in the same proportion, use the new merits. Else, repeat the process with a larger subset.
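The elimination strategy of this section can be sketched as follows (our code; compute_merits stands for any CM implementation, such as the earlier sketch, and the per-round correlation check on the discard candidates is omitted):

def select_features(class1, class2, numeric_flags, thresholds, n_keep, compute_merits):
    # Repeatedly discard the feature with the least contextual merit and
    # recompute the merits on the reduced data until n_keep features remain.
    keep = list(range(len(numeric_flags)))
    while len(keep) > n_keep:
        merits = compute_merits(
            [[row[k] for k in keep] for row in class1],
            [[row[k] for k in keep] for row in class2],
            [numeric_flags[k] for k in keep],
            [thresholds[k] for k in keep])
        keep.pop(merits.index(min(merits)))   # drop the current least-merit feature
    return keep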

8 Optimal discretization of numeric features

We now present another important application of contextual merit, namely discretization of numeric features. Our approach seeks multiple "optimal" cut points that would maximize the contextual merit of the discretized feature. All the discretization techniques discussed by Dougherty, Kohavi and Sahami [24], as well as the ChiMerge scheme of Kerber [26], consider the given feature's correlation to the class in isolation. Even so, the experiments reported in [24] demonstrate that pre-discretization is often better than the common practice of selecting one cut point at a time during the tree generation process.

Recall the CM algorithm step CM2.1, where the BESTSET of nearest counter examples is chosen. We claimed that the current example $e_i$ and a counter example $e_j$ in this set had contextual strength $1/D_{ij}$. This was distributed in proportion to the component distances $d^{(k)}_{ij}$ as incremental merit contributions. When a numeric feature $X_k$ is discretized, the new component distance becomes 1 if and only if there is a cut point between the pair of values, $x_{ki}$ and $x_{kj}$, and 0 otherwise. The $D_{ij}$ value changes relatively little by the discretization of $X_k$ since it is the sum of all component distances. Therefore, the incremental merit contribution to the discretized $X_k$, due to $e_i$ and $e_j$, would be approximately $1/D_{ij}^2$ if and only if there is a cut point between $x_{ki}$ and $x_{kj}$. Since the same holds for each numeric feature individually, we make use of a common data structure, SPAN, which is a list of triplets $(i, j, 1/D_{ij}^2)$ for all $e_j$s in the BESTSET of $e_i$, for all $e_i$. This is obtained by simply inserting one step, CM2.2.0, into the CM algorithm: "Append the triplet $(i, j, 1/D_{ij}^2)$ to SPAN." The SPAN list thus obtained is used to discretize all the numeric features. It contains $N_1 \cdot \log_2 N_2$ (order $N \log N$) entries.

To discretize $X_k$, we develop a specialization, $SPAN_k$, by substituting the actual values $x_{ki}$ and $x_{kj}$, in ascending order, for the i and j indices in SPAN, and removing those entries where the pair have the same value. An entry in $SPAN_k$ is an interval on the value line of $X_k$, specified by its beginning and ending points and a weight value, which is the approximate merit contribution should the interval be cut. For a given set of cut points, the sum of the weights of all intervals that are cut by any of the points in the set approximates the merit of the discretized $X_k$, which we call the cut (merit) score of the set. We wish to find a set of c cut points that maximizes the cut score for a given c. This is a well known problem called interval covering (IC). One has a choice of dynamic programming algorithms with varying complexity [27, 28] to achieve this.

To maximize the cut scores through interval covering, one can simplify the intervals of $SPAN_k$ to reduce the computation. First, order all the end points of intervals in $SPAN_k$. (Some beginning/ending points may coincide.) Now, all the consecutive pure beginning points can be replaced with the greatest among them. Likewise, all the consecutive pure ending points can be replaced with the least among them. All intervals that become the same as a result are merged into one with the sum of their weights. This reduces the number of distinct end points and intervals. The simplified end points form alternating beginning/ending points on the value line of $X_k$. An interval may stretch over many such values, and an ending point may coincide with the beginning point of the next pair. (Only the ordinal rank values of these end points matter in the interval covering process.) The cut point candidates are taken as the mid points between these beginning/ending pairs, which is sufficient for the interval covering. Let the candidate set thus obtained be $CC = (u_1, u_2, ..., u_K)$. The maximum number of cuts one can profitably make is K, and the maximum cut score achievable is just the sum of all the weights, which we denote as S. The cut points can actually be any value between the beginning/ending pairs, and one can refine the cut point values once the cut point set is chosen by the interval covering algorithm.
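To make these data structures concrete, the following sketch builds SPAN_k from the SPAN triplets and scores a candidate cut set (our code; it skips the end-point simplification, which only reduces the amount of work and does not change the scores):

def specialize_span(span, feature_values):
    # SPAN_k: replace the (i, j) indices of each SPAN triplet (i, j, 1/D_ij^2)
    # with the feature's two values in ascending order, dropping tied pairs.
    span_k = []
    for i, j, w in span:
        lo, hi = sorted((feature_values[i], feature_values[j]))
        if lo != hi:
            span_k.append((lo, hi, w))
    return span_k

def cut_score(span_k, cuts):
    # Cut (merit) score: total weight of the intervals cut by any of the cut points.
    return sum(w for lo, hi, w in span_k if any(lo < c < hi for c in cuts))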

We give an outline of how the dynamic program proceeds for the interval covering problem and refer those who may be interested in implementation to [27, 28]. Let us define a cut solution for a given number of cuts, c, and a given upper limit for the largest cut point it may have, $u_t$, for $1 \le c \le t \le K$. A cut solution C(c, t) represents $(c, t, s, (v_1, v_2, ..., v_c))$. The quantity s is the best cut score achieved by the best set of cut points $(v_1, v_2, ..., v_c)$, where $v_i < v_{i+1}$, $v_c \le u_t$, and $v_i \in CC$. The IC algorithm first finds all one cut solutions, C(1, 1), C(1, 2), ..., C(1, K). To reduce computation, one needs to keep only the lexically smallest cut point set solutions when there is a tie in the scores, as t increases for any given c. The solutions for C(c, t) are obtained dynamically from the previously generated and kept solutions among C(c-1, t'), for all t' < t. We find the best additional cut point $v_c \le u_t$ for each of these earlier solutions and select the best cut score solution among them. The final solution for the given c is then C(c, K). The IC algorithm should be implemented in such a way that the solution after each increasing number of cuts, c, can be examined and stopped from further solutions as desired. A straightforward implementation of the IC algorithm would take on the order of $cK^2N$ computations to generate all solutions up to c cuts, where N is the number of intervals in the simplified $SPAN_k$.

The scores attained by increasingly many cut points in the course of IC typically exhibit the behavior shown in Figure 3. The curve resembles an "efficiency curve" that grows relatively fast to a knee and slowly reaches its maximum value, S, at the maximum of K cuts, as the number of cuts increases. (The knee is generally sharper for more relevant features.) It is obviously a monotone increasing function, and convex as well. One would naturally select the number of cuts somewhere around the knee of such a curve, depending on the variable's relative contextual merit (on the high side if the merit is high). In general, the more cuts for a given numeric feature, the more the danger of overfitting to potentially noisy data. On the other hand, too small a number of cuts would cause a loss of information. For an oracle that generates the best rules/trees from an all categorical data set, too many cuts are tolerable, but too few cuts would deprive it of the information necessary to produce the best overall description. Therefore, the knee area of the curve is of interest as an engineering tradeoff. The number of cuts can be determined either manually in an interactive IC, where the progress of the increasing score is monitored and IC terminated at a knee point, or by an automatic knee point selection mechanism.

[Figure 3: Cut score curve and determining the number of cuts. The cut score rises steeply to a knee and approaches its maximum S at the maximum of K cuts; a reference line from the origin to ($\alpha \cdot K$, S), for low or high $\alpha$, determines where the knee point is selected.]
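A straightforward, unoptimized sketch of the interval covering dynamic program outlined above (our code; it returns only the best score per number of cuts, using the SPAN_k triplets and sorted candidate cut points described earlier):

def interval_covering_scores(span_k, candidates, max_cuts):
    # For c = 1..max_cuts, the best total weight of intervals (lo, hi, w) cut by
    # some set of c points from `candidates` (ascending). best[t] is the best
    # score of a solution whose largest cut is candidates[t].
    K = len(candidates)
    if K == 0:
        return []
    max_cuts = min(max_cuts, K)

    def gain(tp, t):
        # Weight of intervals cut by candidates[t] that cannot contain any cut of
        # a solution whose largest cut is candidates[tp] (tp = -1: no prior cut).
        prev = candidates[tp] if tp >= 0 else float("-inf")
        cut = candidates[t]
        return sum(w for lo, hi, w in span_k if prev <= lo < cut < hi)

    best = [gain(-1, t) for t in range(K)]            # one-cut solutions C(1, t)
    scores = [max(best)]
    for _ in range(2, max_cuts + 1):
        best = [max((best[tp] + gain(tp, t) for tp in range(t)),
                    default=float("-inf")) for t in range(K)]
        scores.append(max(best))
    return scores[:max_cuts]                          # scores[c-1]: best score with c cuts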

To automate the process, we found it effective to use an ad hoc method based on a simple heuristic. The idea is to make use of the convexity of the score curve. To do this we first mark a point at $(\alpha \cdot K, S)$ on the maximum score line of the curve, where $\alpha$ is computed to reflect the relative merit of the feature. We draw a conceptual line from the origin to this point and find the point of the curve that lies highest above this line. (The height of a point $(x, y)$ above the line connecting the origin to $(x_0, y_0)$ is proportional to $x_0 y - y_0 x$.) The convenience of convexity is that we can stop developing the curve as soon as the next point on the curve has a smaller height, and select the previous point as the "desired" knee point. This is illustrated in Figure 3.

The parameter $\alpha$ has two components, $\alpha = \alpha_1 \cdot \beta$, where $\beta$ is chosen as a countermeasure for the ramp effect in the following way. Let the sum of all the categorical features' contextual merits be $M_s$, and that of all the numeric features $M_n$. The merit of the given numeric feature $X_k$ is $m_k$. Let $M'$ be the inflation adjusted (grossly) average merit of all the features,

$$M' = (M_n + \alpha_2 \cdot M_s)/N_f$$

We say the relative significance of $X_k$, adjusted (grossly) for the smoothing of the ramp effect, is $\beta$, where

$$\beta = (m_k/M')^{\alpha_3}$$

The $\alpha_1$ parameter controls the preference for a larger or smaller number of cuts. The role of $\alpha_3$ is to sharpen the contrast against smoothing. We choose these parameters for the problem and not for each feature during the process. (In principle, we consider the feature discretization process as a series of interactive experiments. One can experiment with the parameters to ultimately find the "best" minimal rule sets.) For the many problems, both benchmark and real, to which we have applied this technique, results have been rather robust over these parameter settings. Based on our experience, the following range of parameters is generally recommended. (The default settings within our RAMP system are shown in parentheses; these were used exclusively for all the benchmark cases that are reported in the R-MINI paper [21].)

$7 \le \alpha_1 \le 15$ (default 15)
$1.5 \le \alpha_2 \le 2$ (default 1.5)
$1 \le \alpha_3 \le 2$ (default 2)

After one round in which all the numeric features are discretized in this manner, they are restored back to the numeric values one at a time and re-discretized in refining rounds. The cut points for a feature can be refined by re-discretization now that the other features are discretized. To re-discretize $X_k$, we update its ramp threshold $t_k$ to be the average interval size of the previous discretization, as an approximation to the new interval size. The cut point values generally converge after a few rounds of refinement. In the RAMP system, one sets the number of refining rounds, $n_r$ (default 3). We summarize the Numeric Feature Discretization (NFD) algorithm:

Numeric Feature Discretization algorithm: NFD
NFD1) Set $t_k$ = 1/2 the range size for each numeric feature.
NFD2) Perform CM to obtain the contextual merits and SPAN.
NFD3) Set the $\alpha_i$ parameters (default (15, 1.5, 2)).
NFD4) Set the number of rounds, $n_r$ (default 3).
NFD5) Initial round: Do for each numeric feature $X_k$:
  NFD5.1) Specialize SPAN to $SPAN_k$ and simplify.
  NFD5.2) Perform IC: Monitor the progress of the merit increases as the number of cuts increases. Terminate IC either by the manual or the automatic method based on the $\alpha_i$ parameters. Return the best cuts for the chosen c.
NFD6) Return the cuts for all numeric features as ROUND0 CUTS.
NFD7) Using the cuts, discretize all numeric features, obtaining an all categorical representation of the problem.
NFD8) Refining rounds: Do for $n_r$ rounds:
  NFD8.1) Do for each numeric variable $X_k$:
    NFD8.1.1) Restore the $X_k$ values to the original numeric ones.
    NFD8.1.2) Recompute $t_k$ as the average interval size defined by the previous cuts for $X_k$.
    NFD8.1.3) Perform CM, specialize SPAN to $SPAN_k$, and simplify.
    NFD8.1.4) Perform IC for $X_k$.
    NFD8.1.5) Discretize $X_k$ using the cuts. (all categorical again)
  NFD8.2) Return ROUNDi CUTS.
NFD9) Return all ROUNDi CUTS.
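The automatic termination used in step NFD5.2 can be sketched as follows (our code; scores[c-1] is the cut score achieved with c cuts, for example as produced by the interval covering sketch above, S is the total SPAN_k weight, K the number of candidate cuts, and alpha reflects the feature's relative merit as described above):

def knee_number_of_cuts(scores, S, K, alpha):
    # Height of the curve point (c, score) above the line from the origin to
    # (alpha*K, S) is proportional to alpha*K*score - S*c. By convexity of the
    # score curve, stop as soon as the height drops and keep the previous point.
    best_c = 1
    best_height = alpha * K * scores[0] - S * 1
    for c in range(2, len(scores) + 1):
        height = alpha * K * scores[c - 1] - S * c
        if height < best_height:
            break
        best_c, best_height = c, height
    return best_c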

The NFD algorithm above is the default mode for discretization in our RAMP system. For some applications, we have used the following variation. We manually intervene after ROUND1 and examine the number of cuts per feature in ROUND0 and ROUND1, along with the merit values at the end of these rounds. For each feature, the number of cuts from these two rounds may vary (usually decreasing slightly). We interpolate a number between the two depending on the relative positions of the feature's merit from the two rounds. We then hold the number of cuts for the subsequent rounds fixed at the number we determined, which usually helps in the convergence of the cut values.

From the results of NFD, one has a choice of using different sets of cuts for the numeric features, although, in our experience, the last round results have most often been the best. One of the criteria for choosing a particular set of cuts is the number of conflicting examples due to the discretization. The conflicting examples can be removed from the rule/tree generation, or assigned to the class that has the majority of the group that conflicts, depending on the application. One can also assign the entire group that is in conflict (due to having the same categorized feature values but different class labels) to the safest risk class when appropriate. In general, the greater the number of cuts, the fewer conflicts are generated by discretization. Also, when the problem has many features, few conflicts are generated. One would normally prefer the discretization that produces the fewest conflicts. However, if the data is known to be highly noisy, one might prefer a discretization that produces more conflicting examples, which can then be removed.

9 Discussions on the discretization algorithm

The key idea here is to maximize the merit of the discretized feature by finding the optimal cut points through an interval covering algorithm. The main weakness of the algorithm lies in determining the best number of cuts. We have shown how this can be accomplished automatically through a heuristic method. Controlling the number of cuts is not an issue in traditional methods, since the discretization is passively determined by the rule/tree generation process. Our method can generate better or more meaningful cut points, but the choice of the number of cuts remains heuristic. We advocate that the number of cuts be determined by the location of the "knee" of the score curve. This has the immediate benefit that overfitting to noisy data is prevented.

All the artificially numericized cases shown earlier discretize correctly for the base variables, each with one of the cut points at the critical value, $(2R-1)/2$ (R being the RANNUM parameter), when the number of cuts was either manually fixed to be 1 or automatically chosen with a small $\alpha_1$ value. When the number of cuts for the base variables becomes two due to a large $\alpha_1$ setting, the cuts may miss the critical value, and the resultant discretized problem would not lead to the desired EXOR rules. However, if the number of cuts is further increased, the correct cut point is usually included in the set. As long as the critical cut point is included, R-MINI finds the EXOR solution. Therefore, when the optimal number of cuts is very small, missing it by 1 is worse than missing it by a few more. In our experiments, a few tries with different choices of cuts (manual or different parameter settings) always produced satisfactory discretization in these numericized EXOR cases. However, it is clear that a further study is in order. We mention here that self discretizing conventional techniques, including SWAP1, fail to produce correct EXOR rules for any of the numericized EXOR cases.

Linear scaling of numeric values does not affect the contextual feature analysis. However, non-linear but order preserving scaling does. In one extreme, all the numeric values of a feature can be replaced with their ordinal rank numbers, obtained from ordering all unique values of the feature, as a preprocess. Our experiments used the following generic normalization scheme with favorable outcomes. Let $x_1, x_2, ...$ be the unique values of a numeric feature in ascending order. Let their successive differences be $d_1, d_2, ...$. We let $d'_i = d_i^{1/p}$ for some small p, say, 2. The normalized values are $x'_i$, where $x'_1 = 0$ and $x'_i = x'_{i-1} + d'_i$. One can see that as p increases, the resulting $x'_i$ approach the rank numbers of the extreme case. (When one does this, the cut points should be mapped back to the unnormalized values in a straightforward manner.) This process tends to even out the distribution of the values. The same rationale that makes medians, quartiles, etc. meaningful in statistics applies to this kind of normalization scheme that tends toward rank numbers.

The refining round, NFD8, is computationally expensive, for it requires an invocation of CM and IC per numeric feature. More efficient strategies for accomplishing the refinement are being explored. If it is known from the domain knowledge that a certain cut point (or points) is significant for a feature yet additional cut points are sought, our approach can simply honor this as follows. First, the average interval size defined by the given cut point(s) can be used as a better initial ramp threshold, $t_k$, for the feature's component distance. Second, the intervals that are cut by the given points are removed from $SPAN_k$ for the IC process, which then produces only the additional cut points.
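The order preserving normalization just described, which pushes the values toward their rank numbers as p grows, can be sketched as (our code):

def normalize_values(values, p=2.0):
    # Map the ascending unique values x_1, x_2, ... to x'_1 = 0 and
    # x'_i = x'_{i-1} + d_i^(1/p), where d_i is the i-th successive difference.
    xs = sorted(set(values))
    out = [0.0]
    for prev, cur in zip(xs, xs[1:]):
        out.append(out[-1] + (cur - prev) ** (1.0 / p))
    return out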

10 Application experience

The contextual feature analysis and discretization algorithms described in this paper are incorporated into our data abstraction system called RAMP (Rule Abstraction for Modeling and Prediction) [7], which also contains our minimal rule induction algorithm, R-MINI. The RAMP system is used to reduce the number of features (feature selection), discretize numeric features, generate minimal sets of rules, determine the rule application strategy, apply rule based regression if it is to model a numerical outcome, and evaluate prediction accuracy. The RAMP approach has been used on many real applications including fraud detection, profitability classification of customers, chemical bi-metallic salt typing, coin sorting, hand written character recognition and equity return prediction [29]. The number of features ranged from tens to several hundreds, and the number of training examples ranged from several hundreds to 45,000. For these real cases, the resultant accuracy of the RAMP generated rules was either comparable to or higher than that obtained from SWAP1, C4.5, or CART, in some cases significantly so.

We processed through RAMP 8 benchmark cases from StatLog [30], which compares the error rates of 23 different techniques on 21 cases (for detailed results, see [21]). All but one of these cases have numeric features. These were discretized by the technique described here, using default parameter values exclusively, before rules were generated by R-MINI. (These rules are minimal, complete, and consistent.) We applied these rules without pruning, but with a simple voting scheme for classification. The RAMP results were among the top 5 in the StatLog ranking in five of these cases. For one of the cases, SHUTTLE, the RAMP error rate was 0.0011, which ranked 8th from the top. Of the two cases where RAMP did not do well, DIABETES and VEHICLE, the DIABETES case included missing values that were given the value 0 (a fact confirmed by the data owner). This is a peculiar kind of noise, and perhaps it is not an appropriate benchmark case. The RAMP error rate for the VEHICLE case was 0.289 (rank 15). We suspect that the decision surface for this case is not conducive to modeling by rules whose tests are only orthogonal to the feature axes.

The following experiment on this case is illustrative. The VEHICLE case has 18 numeric features, 846 examples, and 4 classes. We generated a Backprop neural network with 5 hidden nodes (trained on all examples for 2500 epochs, using IBM Neural Network Utility version 3.1.1). From the trained network, the hidden-node input values (the weighted sums before the nonlinear transformation) were extracted for all the examples. We appended these five derived feature values to the original 18 features and used our contextual merit to select a feature set of 18 out of the 23. It included all of the hidden-node features, discarding original features 7, 9, 11, 12, and 14. This new case underwent a 9-fold cross-validation through RAMP. The new error rate was 0.19, now ranking 4th. The 9-fold error rate for Backprop in StatLog was 0.207. Clearly, the contextual-merit-based feature selection and discretization were effective.
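As an illustration of the derived-feature step in the experiment above, the sketch below trains a small single-hidden-layer network, extracts the hidden-node input values (the weighted sums before the nonlinearity), and appends them to the original features. It uses scikit-learn's MLPClassifier as a stand-in for the neural network package named in the text; the merit-based selection of the top 18 features is only indicated by a comment, since the contextual-merit code is not reproduced here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def hidden_node_features(X, y, n_hidden=5, epochs=2500, seed=0):
    """Train a single-hidden-layer backprop network and return the hidden-node
    input values (weighted sums before the activation) as derived features."""
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                        max_iter=epochs, random_state=seed)
    net.fit(X, y)
    # Pre-activation inputs to the hidden layer: X W + b
    return X @ net.coefs_[0] + net.intercepts_[0]

# Hypothetical usage: X is the 846 x 18 VEHICLE feature matrix, y the 4 class labels.
# derived = hidden_node_features(X, y)      # shape (846, 5)
# X_aug = np.hstack([X, derived])           # 23 candidate features
# Rank the columns of X_aug by contextual merit (not shown) and keep the top 18.
```

The design point worth noting is that the derived features are nonlinear combinations of the originals, so rules whose tests are orthogonal to the augmented axes can approximate an oblique decision surface.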

11 Conclusion and challenges We have presented a new approach to feature evaluation, one which is based on the features' contextual significance in distinguishing the classes. The main idea is to assign the contextual merit based on the component distance of a feature weighted by the degree of similarity (a function of the total distance) between examples in different classes. The technique captures a given feature's contribution to the class separation in the context of the other features without enumerating distinct contexts. The net effect is akin to a full look-ahead in a decision tree approach. Furthermore, we use the contextual merits for obtaining "optimal" cut points for discretizing numeric features. The complexity of our contextual analyses is linear in the number of features and quadratic in the number of examples. We have made use of a generic class of pathological problems derived from EXOR functions as the main vehicle for demonstrating the method. Perhaps the most telling of these example cases is the one where partial EXORs of the base variables are shown to have commensurately higher contextual merits than the original base variables. We proposed using a ramp function as a better model for the component distances of numeric features. We identified the Ramp Effect and the Variety Effect that affect the relative merits of numeric features and presented heuristic countermeasures. A more recent normalization technique based on randomized feature merits is reported in [23]. For discretization, we presented an effective heuristic method to determine a reasonable number of cuts near the knee of the cut merit score curve, which effectively avoids overfitting to noisy data. We described a new value scaling scheme that evens out the value distribution of a numeric feature and, in our experience, often gives better results. While the new approach is demonstrably a powerful alternative to traditional approaches, many unanswered issues surface because of it. Here are some issues for further investigation.

1) When there are not enough examples, the EXOR(n, m, M) cases are "failed" by CM. As M decreases further, the example set may fail to represent the base EXOR function at all. An analytic expression for a lower bound on the number of examples, M(n, m), such that given that many examples CM would "succeed", can be derived. This would give some concrete cases toward learnability results in the presence of random/noisy features.
2) For each example in C_1, CM takes K = log2(N_2) of the nearest counter-examples. What is the best K for the K-nearest? Should the criterion for the nearest ones be based on the distance itself?
3) We used 1/D as the contextual strength to be distributed among the distance-contributing features in CM and SPAN (for discretization). We report here that 1/2^D, or 1/a^D for a even greater than 2, also works well. In fact, the EXOR results are generally sharper with a 1/a^D scheme. A theoretical understanding of contextual strength needs to be pursued.
4) CM uses every example in C_1 against C_2. Should only some selected examples be considered?
5) Countermeasures against the ramp effect and the variety effect should be better understood.
6) Although the number of cuts can be determined by our knee-picking scheme (with several threshold-combination experiments, if necessary), further work needs to be done. We are pursuing a less ad hoc method to determine the number of cuts, based on the normalization technique of [23].
7) The cut-refining rounds (NFD8) are computationally very costly. A more efficient refining technique would greatly improve the run time of NFD.
8) We have used the contextual merits of features only in a relative sense. The absolute magnitude of the merits should be better understood and made use of.

We mentioned earlier that the RELIEF algorithm proposed by Kira and Rendell [8] is the closest approach to ours. Briefly, this is how RELIEF works. For a randomly chosen example, find one nearest example in the same class and one in the counter class. (The distance function is similar to (1)-(3), except without a ramp threshold.) Their "Relevance" variable, corresponding to our merit, is initially an all-zero vector of length N_f. The squared component distances to these two closest examples are componentwise subtracted from (or added to) the Relevance vector, depending on whether the closest example was in the same (or a different) class. This process is repeated m (a parameter) times, and those features whose Relevance weights thus computed are above a certain threshold, another parameter determined by statistical considerations, are selected. The RELIEF algorithm is said to be inspired by instance-based learning, and the method is justified on statistical grounds. Our contextual analysis was motivated by a desire to capture a feature's ability to discriminate each example from its nearby counter-examples in the context of the other features present. While the two approaches have some superficial similarities and many differences in detail, the most important difference stems from the fact that the cornerstone of our approach is the use of the contextual strength 1/D. This also allows us to naturally apply the same concept to discretization. During the revision of this paper we came upon a more recent follow-on work, RELIEFF, by Kononenko [9]. The main departure from the original RELIEF is that, for each chosen example, the contributions from not just the nearest but the K nearest same-class and K nearest other-class examples are summed together.
While this is an improvement on the original RELIEF, the strength of the context is still not utilized, which remains the main contrast with our approach.
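To make the contrast with RELIEF concrete, here is a minimal sketch of a contextual-merit-style computation in the spirit summarized above: for each example, the K = log2(N_2) nearest counter-class examples contribute their per-feature component distances, weighted by the contextual strength 1/D. The simplified ramp, the looping over both classes, and the guard against D = 0 are our own assumptions; this is not a faithful reproduction of CM as defined earlier in the paper.

```python
import numpy as np

def contextual_merits(X, y, thresholds):
    """Sketch of a contextual-merit-style feature ranking.

    X: (N, Nf) matrix of numeric features (assumed already normalized).
    y: (N,) class labels (a two-class setting is assumed for simplicity).
    thresholds: (Nf,) ramp thresholds t_k; the component distance used here
                is min(|difference| / t_k, 1), a simplified ramp.
    Returns an (Nf,) vector of merits (meaningful only in a relative sense).
    """
    merits = np.zeros(X.shape[1])
    for c in np.unique(y):
        same, other = X[y == c], X[y != c]
        K = max(1, int(np.log2(len(other))))          # K = log2(N_2) nearest counter-examples
        for ex in same:
            comp = np.minimum(np.abs(other - ex) / thresholds, 1.0)  # ramped component distances
            D = comp.sum(axis=1)                      # total distances to all counter-examples
            for j in np.argsort(D)[:K]:               # the K nearest counter-examples
                weight = 1.0 / max(D[j], 1e-9)        # contextual strength 1/D (guarded)
                merits += weight * comp[j]            # distribute strength over contributing features
    return merits
```

Contrast this with RELIEF: there, a single nearest hit and a single nearest miss contribute fixed-sign squared component distances, with no weighting by how close the counter-example actually is.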

12 Acknowledgements I am greatly indebted to my colleagues, Don Coppersmith, Claude Greengard, Jonathan Hosking, Marshall Schor, Madhu Sudan, Shmuel Winograd, and the members of the Data Abstraction Research group in the Mathematical Sciences Department at the Thomas J. Watson Lab. They provided many useful comments on the ideas and on this manuscript. Chid Apte helped me throughout the preparation of this manuscript. I thank Sandy Wityak for the figures. The members of the Data Abstraction Research Group (Chid Apte, Edna Grossman, Jorge Lepre, Seema Prasad, and Barry Rosen) are gratefully acknowledged for having made it possible to carry out many application experiments with the RAMP system, as well as for their thoughtful comments to improve this paper.

References

[1] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning, 3:261-283, 1989.
[2] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[3] S. Weiss and N. Indurkhya. Optimized Rule Induction. IEEE EXPERT, 8(6):61-69, December 1993.
[4] C. Apte, F. Damerau, and S. Weiss. Automated Learning of Decision Rules for Text Categorization. Technical Report RC 18879, IBM T.J. Watson Research Center, 1994. Also appears in ACM Transactions on Office Information Systems, July 1995.
[5] S. Weiss, C. Apte, and G. Grout. Predicting Defects in Disk Drive Manufacturing: A Case Study in High-Dimensional Classification. In Proceedings of IEEE CAIA-93, pages 212-218, 1993.
[6] S. Murthy, S. Kasif, S. Salzberg, and R. Beigel. OC1: Randomized Induction of Oblique Decision Trees. In Proceedings of AAAI-93, pages 322-327, 1993.
[7] C. Apte, S. Hong, S. Prasad, and B. Rosen. RAMP: Rule Abstraction for Modeling and Prediction. IBM Tech. Report RC-20271, 1995.
[8] K. Kira and L. Rendell. The Feature Selection Problem: Traditional Methods and a New Algorithm. In Proceedings of AAAI-92, pages 129-134, 1992.
[9] I. Kononenko, E. Simec, and M. Robnik. Overcoming the Myopia of Inductive Learning Algorithms with RELIEFF. To appear in Applied Intelligence, 1995.
[10] P. Agarwal and P. Raghavan. On Building Small Decision Trees for Geometric Classification. Private communication.
[11] H. Almuallim and T. Dietterich. Learning with Many Irrelevant Features. In Proceedings of AAAI-91, pages 547-558, 1991.
[12] M. Kudo and M. Shimbo. Feature Selection Based on the Structural Indices of Categories. Pattern Recognition, 26(6):891-902, 1993.
[13] N. Shan, W. Ziarko, H. Hamilton, and N. Cercone. Using Rough Sets as Tools for Knowledge Discovery. In Proceedings of KDD-95, pages 263-268, 1995.
[14] C. Hartmann, P. Varshney, K. Mehrotra, and C.L. Gerberich. Application of Information Theory to the Construction of Decision Trees. IEEE Trans. on Information Theory, July 1982.
[15] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.
[16] U. Fayyad and K. Irani. The Attribute Selection Problem in Decision Tree Generation. In Proceedings of AAAI-92, pages 104-110, 1992.
[17] R. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Trans. on Computers, August 1986.
[18] S. Hong. Developing Classification Rules (Trees) from Examples. Tutorial Notes of IEEE CAIA-92 and 93.
[19] S. Weiss and C. Kulikowski. Computer Systems That Learn. Morgan Kaufmann, 1991.
[20] J. Quinlan. Simplifying Decision Trees. International Journal of Man-Machine Studies, 27:221-234, 1987.
[21] S. Hong. R-MINI: An Iterative Approach for Generating Minimal Rules from Examples. IEEE TKDE, this issue, 1996. An earlier description of R-MINI appears in IBM Tech. Report RC-19664, 1994, and in Proceedings of PRICAI-94.
[22] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis Dimension. JACM, 36:929-965, 1989.
[23] S. Hong, J. Hosking, and S. Winograd. Use of Randomization to Normalize Feature Merits. IBM Tech. Report RC-20072, 1995.
[24] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization of Continuous Features. In Proceedings of ML-95, 1995.
[25] J. Friedman. Exploratory Projection Pursuit. Journal of the American Statistical Association, pages 249-266, March 1987.
[26] R. Kerber. ChiMerge: Discretization of Numeric Attributes. In Proceedings of AAAI-92, pages 123-128, 1992.
[27] A. Aggarwal and T. Tokuyama. Consecutive Interval Query and Dynamic Programming on Intervals. In Proceedings of the 4th Int. Symp. on Algorithms and Computation, December 1993.
[28] A. Asano. Dynamic Programming on Intervals. In Proceedings of the 2nd Int. Symp. on Algorithms, Lecture Notes in Computer Science, volume 557, pages 199-207. Springer-Verlag, 1991.
[29] C. Apte and S. Hong. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery. AAAI Press, 1995.
[30] D. Michie, D. Spiegelhalter, and C. Taylor (eds.). Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
