Best Subset Feature Selection for Massive Mixed-Type Problems

Eugene Tuv1, Alexander Borisov2, and Kari Torkkola3

1 Intel, Analysis and Control Technology, Chandler, AZ, USA ([email protected])
2 Intel, Analysis and Control Technology, N.Novgorod, Russia ([email protected])
3 Motorola, Intelligent Systems Lab, Tempe, AZ, USA ([email protected])
Abstract. We address the problem of identifying a non-redundant subset of important variables. All modern feature selection approaches, including filters, wrappers, and embedded methods, experience problems in very general settings with massive mixed-type data and with complex relationships between the inputs and the target. We propose an efficient ensemble-based approach that measures statistical independence between a target and a potentially very large number of inputs, including any meaningful order of interactions between them, removes redundancies from the relevant variables, and finally ranks the variables in the identified minimum feature set. Experiments with synthetic data illustrate the sensitivity and the selectivity of the method, whereas the scalability of the method is demonstrated with a real car sensor database.
E. Corchado et al. (Eds.): IDEAL 2006, LNCS 4224, pp. 1048–1056, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Ensembles of decision trees have proven to be very efficient and versatile tools in classification and regression problems [2,4]. In addition, the structure of the trees can be used as a basis for variable selection methods. We have presented such a variable selection method, which amends the ensemble with the concept of artificial contrast variables (ACE) [10]. The contribution of this paper is to extend ACE to the removal of redundant variables. This is an important problem in several domains. It may be expensive to observe all features at once. In medical diagnostics, the smallest subset of tests for a reliable diagnosis is usually desirable. Similarly, in engineering problems concerned with a set of sensors, it is often necessary to design the lowest-cost (smallest) set of sensors that accomplishes a particular task. The generalization capability of a learner typically improves with a smaller set of parameters. Even with regularizing learners, removal of redundant features has shown improvement in domains such as cancer diagnosis from mass spectra or text classification [5]. A smaller model, in this case a smaller subset of relevant features, also makes it easier to interpret the structure and the characteristics of the underlying domain. DNA microarray gene expression analysis is an example: as a reduced set of genes is chosen, their biological relationship with the target diseases becomes more explicit, and these important genes then provide new scientific knowledge of the disease domain.

We first describe how ensembles of trees can produce a variable masking measure that forms the basis for eliminating redundant variables. Next we introduce the idea of artificial contrasts, which is at the core of the proposed algorithm for best subset feature selection. Experimentation with artificial as well as real data sets demonstrates the performance of the method.
2 Tree Ensemble Methods in Feature Ranking
In this paper we address the problem of feature filtering, i.e., the removal of irrelevant and redundant inputs, in very general supervised settings: the target variable could be numeric or categorical, the input space could have variables of mixed type with non-randomly missing values, the underlying X–Y relationship could be very complex and multivariate, and the data could be massive in both dimensions (tens of thousands of variables, millions of observations). Ensembles of unstable but very fast and flexible base learners such as trees can address most of these challenges when equipped with embedded feature weighting [1]. They have proved to be very effective for variable ranking in problems with up to a hundred thousand predictors [1,7]. A more comprehensive overview of feature selection with ensembles is given in [9]. Random Forest (RF) and MART are two distinguished representatives of tree ensembles. Random Forest extends the "random subspace" method [6]: it grows a forest of random trees on bagged samples, showing excellent results comparable with the best-known classifiers [2]. MART is a sequential ensemble that fits a sequence of shallow trees using a gradient boosting approach [4].

2.1 Feature Masking
Decision trees can handle missing values gracefully using so-called surrogate splits. The surrogate splits, however, can also be used to detect feature masking. We now describe a novel masking measure, assuming that the reader is familiar with basic decision trees such as CART [3]. The predictive association of a surrogate variable xs with the best splitter x∗ at a tree node t is defined through the probability that xs predicts the action of x∗ correctly, estimated as p(xs, x∗) = pL(xs, x∗) + pR(xs, x∗), where pL(xs, x∗) and pR(xs, x∗) are the estimated probabilities that both xs and x∗ send a case in t to the left (right). The predictive measure of association d(x∗|xs) between xs and x∗ is defined as

d(x∗|xs) = [min(pL, pR) − (1 − p(xs, x∗))] / min(pL, pR)    (1)
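A minimal numeric sketch of this association measure, assuming simple threshold splits and that both variables observe the same node cases (the split points and data here are hypothetical, for illustration only):

```python
def predictive_association(x_best, x_surr, split_best, split_surr):
    """Predictive measure of association d(x*|xs) between the primary
    splitter x* and a surrogate xs at one tree node (Eq. 1)."""
    left_best = [v <= split_best for v in x_best]   # action of x*
    left_surr = [v <= split_surr for v in x_surr]   # action of xs
    n = len(x_best)
    p_left = sum(left_best) / n                     # pL: share sent left by x*
    p_right = 1.0 - p_left                          # pR
    # p(xs, x*): probability that the surrogate reproduces the primary split
    p_agree = sum(a == b for a, b in zip(left_best, left_surr)) / n
    naive_err = min(p_left, p_right)                # error of the "naive" rule
    return (naive_err - (1.0 - p_agree)) / naive_err
```

A perfect surrogate yields d = 1, while a surrogate no better than the naive max(pL, pR) rule yields d ≤ 0 and is discarded.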
where pL, pR are the proportions of cases sent to the left (or right) by x∗. It measures the relative reduction in error from using xs to predict x∗ (error 1 − p(xs, x∗)) compared with the "naive" rule that matches the more frequent action max(pL, pR) (error min(pL, pR)). d(x∗|xs) can take values in the range (−∞, 1]. If d(x∗|xs) < 0, then xs is disregarded as a surrogate for x∗; otherwise we say that x∗ masks xs. To express masking in terms of model impact, we define the masking metric for a pair of variables i, j as

Mij = Σ_{t∈T, s=si} w(si, t) d(si|sj)    (2)

where w(si, t) is the decrease in impurity [3] due to the actual split on the variable si, and the summation runs over those tree-ensemble nodes t where the primary split s was made on the variable si. Here we take into account both how well variable j "mimics" the predictive action of the primary splitter i and the contribution of the actual split on variable i to the model.

2.2 Ranking Features
The main idea in this work relies on the following reasonable assumption: a stable feature ranking method, such as an ensemble of trees, that measures the relative relevance of an input to a target variable Y will assign a significantly higher score to a relevant variable Xi than to an artificial variable created independently of Y from the same distribution as Xi. The same applies to the masking measure: we compare the masking of all variables by a list of selected relevant variables, and consider as real masking only those masking values that are statistically higher than the masking of noise variables by the selected variables. To select the minimal subset, the algorithm drops all masked (in a statistical sense) variables from the relevant variable list at every residual iteration. We now present an algorithm for best subset feature selection (BSFS) based on this idea. It is similar to the iterative procedure described in [10], but extends it in order to eliminate redundant features, and it encapsulates a new masking-metric estimation scheme.
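The accumulation in Eq. (2) can be sketched as follows, assuming a hypothetical per-node summary of an already-grown ensemble (the tuple layout is an illustrative assumption, not the paper's data structure):

```python
def masking_metric(nodes):
    """Masking matrix entry M_ij (Eq. 2): sum of w(s_i, t) * d(s_i|s_j)
    over nodes t whose primary split is on variable i.

    `nodes` is a list of tuples (primary_var, impurity_decrease,
    {surrogate_var: d}), one per ensemble node."""
    M = {}
    for i, w, surrogates in nodes:
        for j, d in surrogates.items():
            if d >= 0:  # negative d: xs rejected as a surrogate for x*
                M[(i, j)] = M.get((i, j), 0.0) + w * d
    return M
```

Each node contributes its impurity decrease, weighted by how faithfully the surrogate mimics the primary split, so a variable that shadows an important splitter accumulates a large masking value.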
3 The Algorithm: The Best Subset Feature Selection
Our best subset selection method is a combination of the following steps:

A) Estimating variable importance using a random forest of fixed depth, such as 3–6 levels. We re-estimate split weights using out-of-bag (OOB) samples, because this gives a more accurate and unbiased estimate of variable importance in each tree and filters out noise variables.
B) Comparing variable importance against artificially constructed noise variables using a formal statistical test.
C) Building a masking matrix for the selected important variables, and selecting statistically significant masking values using a formal test (here a series of short MART ensembles is used [4]).
D) Removing the masked variables from the important variable list.
E) Iteratively removing the effect of the identified important variables, to allow the detection of less important variables (because trees, and parallel ensembles of trees, are not well suited for additive models).

Steps C) and D) are novel and differ from ACE [10].

A. Split Weight Re-estimation. We propose a modified scheme for calculating the split weight and for selecting the best split in each node of a tree. The idea is to use the training samples to find the best split point for each variable, and then to use the OOB samples, which were not used in building the tree, to select the best split variable in a node. The split weight used for variable importance estimation is also calculated on the OOB samples.

B. Selecting Important Features. To determine a cut-off point for the importance scores, we need a contrast variable that is known to be truly independent of the target. By comparing variable importance to this contrast (or to several of them), one can use a statistical test to determine which variables are truly important. The artificial contrast variables are obtained by randomly permuting the values of the original M variables across the N examples. Generating contrasts from unrelated distributions, such as Gaussian or uniform, is not sufficient, because the values of the original variables may exhibit special structure. The trees in an ensemble are then broken into R short independent series of equal size L = 10–50, where each series is trained on a different but fixed permutation of the contrast variables. For each series, the importances are computed for all variables, including the artificial contrasts. Using such series is important when the number of variables is large or when the trees are shallow, because some (even important) features can be absent from a single tree. To gain statistical significance, the importance score of every variable is compared to a percentile of the importance scores of the M contrasts (we used the 75th percentile).
A statistical test (Student's t-test) is evaluated to compare the scores over all R series. Variables scoring significantly higher than the contrasts are selected as relevant.

C. Estimation of Masking Between Features. Next we calculate the masking matrix for the selected important features; let their number be m. We build a set of R independent short MART models [4], each with L = 10–50 trees, and calculate all surrogates for all variables (including the contrasts) in all nodes of every tree. Note that the surrogate scores and the split weights are calculated on the OOB sample, as in step A. For each pair of variables i, j ∈ 1, . . . , 2m and for each ensemble r = 1, . . . , R we compute the masking measure M^r_ij as the sum of the per-node masking measure Mij(t) over all trees t in the ensemble. Let M^r_{i,α} be the α-quantile of M^r_ij over the contrasts j = m+1, . . . , 2m. Then a square masking matrix M∗_ij is filled as follows: we set M∗_ij = 1 if the hypothesis M^r_ij − M^r_{i,α} > 0, r = 1, . . . , R, is accepted at a given significance level p (we used p = 0.05); otherwise we set M∗_ij = 0. We say variable j is masked by variable i if M∗_ij = 1.

D. Removal of Masked Features. After building the masking matrix, the masked variables are removed from the important variable list L as follows. Let L∗ be an initially empty list of non-masked important features. We sort the items
in L by an importance measure calculated as the sum of the importances over the ensemble built in step E (or in step B for the initial pass). First, we move the variable in L that has maximum importance from L to L∗. Next, all variables masked by this variable (in terms of the matrix M∗_ij) are removed from L. This procedure is repeated for the remaining variables in L until it is empty.

E. Removing Effects of Identified Important Variables. After a subset of relevant variables has been discovered in step B, we remove their effects on the target variable. To accomplish this, the target is predicted using only the identified important variables, and the prediction residuals become the new target. We return to step A and continue iterating until no variables remain with scores significantly higher than those of the contrasts. It is important that step A uses all variables to build the ensemble, not excluding the identified important ones. This step is even more important here than in the ACE algorithm, since it ensures that the algorithm will recover partially masked variables that still have a unique effect on the target. To accommodate step E in classification problems we adopted the multi-class logistic regression approach described in [4]. Note that the computational complexity of our method is of the same order as the maximal complexity of the RF and MART models (usually MART is more complex, as it requires computing all surrogate splits at every tree node), but it can be significantly faster for datasets with a large number of cases, since the trees in RF are built only to a limited depth.
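Three pieces of the algorithm above can be sketched compactly: contrast generation by column permutation (step B), selection against the 75th percentile of contrast scores, and the greedy removal of masked variables (step D). The t-like threshold and the row-of-lists data layout are simplifying assumptions; the paper uses a full Student's t-test over R series.

```python
import random
from statistics import mean, stdev

def add_contrasts(rows, rng=None):
    """Append artificial contrasts: a permuted copy of every column,
    independent of the target by construction (step B)."""
    rng = rng or random.Random(0)
    contrasts = []
    for col in zip(*rows):
        c = list(col)
        rng.shuffle(c)
        contrasts.append(c)
    return [list(r) + [c[i] for c in contrasts] for i, r in enumerate(rows)]

def select_relevant(imp_series, n_real, t_threshold=2.0):
    """imp_series: one importance vector per short tree series, covering
    real variables (first n_real entries) then their contrasts.  A variable
    is kept if its score exceeds the 75th percentile of the contrast scores
    consistently across series (simplified t-like test)."""
    kept = []
    for v in range(n_real):
        diffs = []
        for series in imp_series:
            contrasts = sorted(series[n_real:])
            q75 = contrasts[(3 * len(contrasts)) // 4]   # ~75th percentile
            diffs.append(series[v] - q75)
        m = mean(diffs)
        s = stdev(diffs) if len(diffs) > 1 else 0.0
        t = m / (s / len(diffs) ** 0.5) if s > 0 else (float("inf") if m > 0 else 0.0)
        if t > t_threshold:
            kept.append(v)
    return kept

def drop_masked(relevant, importance, masks):
    """Greedy removal of masked variables (step D): repeatedly promote the
    most important remaining variable and discard everything it masks.
    `masks` maps a variable to the set of variables it masks (M*_ij = 1)."""
    pending = sorted(relevant, key=lambda v: -importance[v])
    kept = []
    while pending:
        best = pending.pop(0)
        kept.append(best)
        pending = [v for v in pending if v not in masks.get(best, set())]
    return kept
```

Note that `drop_masked` never discards a variable before a more important masker has been promoted, which mirrors the sorted traversal of L described above.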
4 Experiments
We first describe experiments with the proposed method using synthetic data sets, followed by a real example. Synthetic data sets allow one to vary systematically domain characteristics of interest, such as the number of relevant and irrelevant attributes, the amount of noise, and the complexity of the target concept. The relevance of the method is demonstrated using a real-life car sensor dataset.

4.1 Synthetic Data
To illustrate the sensitivity and the selectivity of the method, we simulated a dataset that conceptually should be challenging for our method but optimal for standard stepwise selection methods. The generated data had 203 input variables and one numeric response. x1, ..., x100 are highly correlated with one another, and all are reasonably predictive of the response (R² ≈ 0.5). a, b, and c are independent variables that are much weaker predictors (R² ≈ 0.1). y1, ..., y100 are i.i.d. N(0, 1) noise variables. The actual response variable was generated as z = x1 + a + b + c + ε, where ε ∼ N(0, 1). Tables 1 and 2 compare stepwise best-subset forward-backward selection, with significance level 0.05 to enter/leave, against the proposed method.
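The data-generating process just described can be sketched as follows; the correlation mechanism (a shared latent factor with mixing weight 0.95) and the sample size are illustrative assumptions, since the paper does not specify how the x's were made to correlate:

```python
import random

def make_synthetic(n=1000, rng=None):
    """Generate data following Section 4.1: x1..x100 highly correlated
    and predictive, a/b/c weak independent predictors, y1..y100 pure
    noise, and target z = x1 + a + b + c + eps with eps ~ N(0, 1)."""
    rng = rng or random.Random(1)
    rows, targets = [], []
    for _ in range(n):
        base = rng.gauss(0, 1)  # shared factor making the x's correlated
        xs = [0.95 * base + 0.05 * rng.gauss(0, 1) for _ in range(100)]
        a, b, c = (rng.gauss(0, 1) for _ in range(3))
        ys = [rng.gauss(0, 1) for _ in range(100)]  # pure noise block
        z = xs[0] + a + b + c + rng.gauss(0, 1)
        rows.append(xs + [a, b, c] + ys)
        targets.append(z)
    return rows, targets
```

Each row has 203 inputs (100 correlated x's, then a, b, c, then 100 noise y's), matching the layout described above.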
Table 1. Ranked list of important variables (synthetic case) found by standard stepwise best-subset forward-backward selection. Five of the important (but correlated) variables head the list. These are followed by eight noise variables before the weak (but important) predictors are discovered.

x1 x5 x37 x77 x93 y11 y13 y19 y25 y27 y67 y68 y74 a b c
Table 2. Ranked list of important variables (synthetic case) found by the proposed method, together with their importances. Only one of the correlated variables appears in the list; the rest have been pruned as redundant. It is followed by the weak predictors. None of the noise variables have been picked.

variable:    x13    a      b     c
importance:  100%   10.6%  9.3%  5.9%
Note that even though this case is ideal for standard stepwise best subset selection, noise may be picked before weak but relevant predictors, whereas the proposed method is immune to independent noise. The main point of the paper, the elimination of redundant variables, is also clearly demonstrated.

4.2 Realistic Data
The real-world task is driver activity classification. Sensor data is collected in a driving simulator, a commercial product with a set of sensors that, at the behavioral level, simulate a rich set of current and future onboard sensors. This set consists of a radar for locating other traffic, a GPS system for position information, a camera system for lane positioning and lane marking, and a mapping database for road names, directions, locations of points of interest, etc. Thus, sensors are available that would be hard or expensive to arrange in a real vehicle. There is also a complete car status system for determining the state of engine parameters and driving controls (transmission gear selection, steering angle, brake and accelerator pedals, turn signal, window and seat belt status, etc.). The simulator setup also has several video cameras, microphones, and infrared eye tracking sensors to record all driver actions during the drive, synchronized with all the sensor output and simulator tracking variables. Altogether there are 425 separate variables describing an extensive scope of driving data: information about the auto, the driver, the environment, and associated conditions. The 29 driver activity classes in this study are related to maneuvering the vehicle with varying degrees of required attention [8]. The classes are not mutually exclusive; an instant in time can be labeled simultaneously as "TurningRight" and "Starting", for example. Thus, we performed variable selection separately for each of the 29 classes. We compare variable selection without redundancy elimination (ACE) to variable selection with redundancy elimination (BSFS). The current database consists of 629375 data records (7.5 hours of driving) with 109 variables (this study excluded driver activity tracking variables). Of these variables, 82 were continuous and 27 were categorical.
[Figure 1: left panel, no redundancy elimination; right panel, with redundancy elimination.]
Fig. 1. Variable importances for each driver activity class. 109 variables on the y-axis, 29 classes on the x-axis. The lightness of the square is proportional to the variable importance for the detection of the class. Thus the set of light squares in a column indicate the set of variables necessary for the detection of that particular class.
A visual overview of the results is presented in Fig. 1 (see caption for legend). The elimination of redundancies between important variables is clearly visible as a drastic reduction of the number of "light" squares in each column corresponding to each class.

Table 3. Variable importances for class "TurningLeft" without (ACE) and with (BSFS) redundancy elimination. See text for discussion. Top entries: steeringWheel (BSFS 100.0, ACE 100.0), LongAccel (75.9, 74.2), lateralAcceleration (72.2, 72.2). [Remaining rows illegible in the source.]

We present a detailed view of one of the columns, class "TurningLeft", in Table 3. The three most important variables chosen by ACE have been retained as such even after the redundancy elimination ("steeringWheel", "LongAccel", "lateralAcceleration"). Of the following three variables on the list, "steeringWheel_abs" and "lateralAcceleration_abs" are absolute values of two of the three most important variables, and "acceleration" is a discretized version of "LongAccel". All three have been completely (and correctly) eliminated as redundant variables. The variable "steeringManeuver" is a "virtual" sensor calculated as a discretized nonlinear function of "steeringWheel", "speed", and "accelerator". We can see that "steeringManeuver" has been almost completely eliminated as a redundant variable and that its source variables have all been retained as important. These redundancy eliminations verify that the method works as designed. The rest of the dependencies and their eliminations are subject to further analysis.
5 Discussion
We have presented an extension to feature selection with artificial contrasts that also removes redundant variables from the result. Although some classifiers, such as Random Forest, thrive on redundant information, there are application domains in which the minimal, most economical set of variables is needed. As an example, we presented a sensor selection problem in the automotive domain. Because in real engineering applications the inclusion of a sensor incurs a cost, and the costs vary from one sensor to another, an interesting thread of future work would be to optimize the cost of a sensor set for a given fixed and required performance level. Medical diagnostics would also fall under the same scheme.
References

1. A. Borisov, V. Eruhimov, and E. Tuv. Dynamic soft feature selection for tree-based ensembles. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, New York, 2005.
2. L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
3. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press, 1984.
4. J. H. Friedman. Greedy function approximation: a gradient boosting machine. Technical report, Dept. of Statistics, Stanford University, 1999.
5. E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proc. ICML'04, 2004.
6. T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
7. K. Torkkola and E. Tuv. Ensembles of regularized least squares classifiers for high-dimensional problems. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, 2005.
8. K. Torkkola, M. Gardner, C. Wood, C. Schreiner, N. Massey, B. Leivian, J. Summers, and S. Venkatesan. Toward modeling and classification of naturalistic driving. In Proceedings of the 2005 IEEE Intelligent Vehicles Symposium, pages 638–643, Las Vegas, NV, USA, June 2005.
9. E. Tuv. Feature selection and ensemble learning. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, New York, 2005.
10. E. Tuv, A. Borisov, and K. Torkkola. Feature selection using ensemble based ranking against artificial contrasts. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2006), Vancouver, Canada, July 2006.