Playing the Game of Feature Selection
Shay Cohen, School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
[email protected]
Gideon Dror, Department of Computer Sciences, The Academic College of Tel-Aviv-Yaffo, Tel-Aviv 64044, Israel
[email protected]
Eytan Ruppin, School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
[email protected]
Abstract

We present and examine a novel Contribution-Selection algorithm (CSA) for feature selection, based on the Multi-perturbation Shapley Analysis. The algorithm combines the filter and wrapper approaches in a multi-phasic manner to estimate the features' usefulness and select them accordingly, using either forward selection or backward elimination. An empirical comparison with several feature selection methods shows that the CSA with backward elimination obtains better results on several datasets.
1 Introduction

Feature selection refers to the problem of selecting input variables (the features) that are relevant for predicting a target value for each instance in the dataset. The potential benefit of feature selection is three-fold: enhancing the predictors' performance, producing cost-effective predictors by reducing storage requirements and improving speed, and providing a better understanding of the process that generated the data. Feature selection is a search problem, with each state in the search space specifying a subset of features. Exhaustive search is usually intractable, making efficient methods for exploring the search space necessary. Feature selection methods are often divided into two main groups [4]: filter methods and subset selection methods. Filters rank each feature according to some measure and then select the features with the highest ranking values. Subset selection methods include two types of algorithms: embedded algorithms, which select the features through the process of generating the predictor, and wrapper algorithms, which treat the induction algorithm as a black box and interact with it in order to search for an appropriate feature set using search algorithms such as hill climbing. In this paper, we recast the problem of feature selection in the context of coalitional games, a notion from Game Theory. This perspective yields a greedy algorithm for feature selection, the Contribution-Selection algorithm, which combines both the filter and wrapper approaches. It uses an induction algorithm as a black box in order to rank the features based on the Multi-perturbation Shapley Analysis (MSA, see [6]). The MSA aims at quantifying the contribution (importance) of features based on their Shapley Value [11] in the "game" of generating an accurate prediction. That is, the features are the "players", and the "game" value is obtained by estimating the accuracy of the classifier using subsets of features.
We denote by $(X, Y)$ the distribution from which the dataset instances are drawn, where $X = (X_1, \ldots, X_n)$ is the vector of features and $Y$ is a discrete target value (class) for $X$. Three sets containing i.i.d. sampled instances of the form $\{x^k, y^k\}$ are available: Train, Validation and Test, representing the training set, validation set and test set respectively. Given an induction algorithm and a set $S \subseteq \{1, \ldots, n\}$, $f_S(x)$ stands for a predictor constructed from the training set using the induction algorithm, after its input variables have been narrowed down to the ones in $S$. Namely, $f_S(x)$ labels each instance of the form $(x_{i_1}, \ldots, x_{i_{|S|}})$, $i_j \in S$, $1 \le j \le |S|$, with a value in the domain of $Y$. The task of feature selection is to choose a set $S$ of the input variables, aiming at generating a predictor while maximizing its performance (e.g. its accuracy level or area under the ROC curve).
2 Classification as a coalitional game

2.1 Estimating features' contribution using the MSA

Cooperative Game Theory introduces the concept of "coalitional games", in which a set of players is associated with a real function (the "payoff"), denoting the payoff achieved by different sub-coalitions in a game. It further pursues the question of representing the contribution of each player to the game using "semi-values", which assign each player a real value denoting its contribution to achieving a high payoff. Borrowing these concepts for our aim of estimating the contribution of the features to the task of generating a predictor, the players represent the features of a dataset (a set $N = \{1, 2, \ldots, n\}$), and the payoff is represented by a real-valued function $v(S)$, $S \subseteq N$, which measures the performance of a predictor generated using the set of features $S$. The contribution value calculation is based on the Shapley Value¹ [11], which is defined as

$$\varphi_i(N, v) = \frac{1}{n!} \sum_{\pi \in \Pi} \Delta_i(S_i(\pi)) \qquad (1)$$

where $i \in N$, $\Delta_i(S) = v(S \cup \{i\}) - v(S)$, $\Pi$ is the set of permutations over $N$, and $S_i(\pi) = \{k \mid \pi(k) < \pi(i)\}$ is the set of players preceding $i$ in the permutation $\pi$.
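For concreteness, with two players $N = \{1, 2\}$ there are only the permutations $(1, 2)$ and $(2, 1)$, and equation (1) reduces to averaging the marginal contribution of feature 1 when it is added first and when it is added second:

$$\varphi_1(N, v) = \tfrac{1}{2}\big[v(\{1\}) - v(\emptyset)\big] + \tfrac{1}{2}\big[v(\{1, 2\}) - v(\{2\})\big].$$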
The Shapley Value requires calculating the payoff function for all possible subsets of players, which is impractical in our case. In developing the Multi-perturbation Shapley Analysis [6], the authors presented a robust method to estimate the Shapley Value² with an unbiased estimator, uniformly sampling only a subset of the permutations in $\Pi$. Still, that estimator considers both large and small feature sets when calculating the contribution values, whereas our feature selection algorithm uses the Shapley Value heuristically, to estimate the contribution of a feature in order to select a relatively small subset of features that achieve a high payoff together; considering large subsets is thus a pitfall. We therefore limit ourselves to calculating the contribution values from bounded-size permutations sampled from the whole set of players, with $d$ being a bound on the permutation size.
¹ The Shapley Value is a representative of a broader family of semi-values, and is used for its axiomatic qualities [6].
² The estimation in [6] is used in a different context, to analyze the functional contribution in artificial and biological networks; however, the method for estimating the Shapley Value is valid for our purpose as well.
Contribution-Selection-Algorithm(F, ε)
1. selected := ∅
2. for each f ∈ F \ selected
   2.1. C_f := contribution(f, selected)
3. if max_f {C_f} > ε
   3.1. selected := selected ∪ selection({C_f})
   3.2. goto 2
   else
   3.3. return selected
Figure 1: The Contribution-Selection algorithm. The figure describes the forward selection version of the algorithm, which accepts as input the set of features F and a contribution threshold ε, and outputs a set of selected features. In the backward elimination version, the selection sub-routine is replaced with an elimination sub-routine and the halting criterion is changed accordingly.
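For concreteness, below is a minimal Python sketch of the forward-selection loop of Figure 1. The names csa_forward, eps and s are our illustrative choices (the pseudocode leaves them abstract), and contribution stands for the Shapley-based routine of Section 2.1, sketched after equation (2) below.

```python
def csa_forward(features, contribution, eps, s=1):
    """A sketch of the forward-selection CSA of Figure 1.

    features     -- iterable of candidate feature indices
    contribution -- callable (feature, selected) -> estimated contribution value
    eps          -- contribution threshold for halting
    s            -- number of features selected per phase
    """
    selected = set()
    while True:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            return selected
        # Step 2: estimate each remaining feature's contribution,
        # given the features selected so far.
        scores = {f: contribution(f, selected) for f in remaining}
        # Step 3: halt when no feature contributes more than eps.
        if max(scores.values()) <= eps:
            return selected
        # Step 3.1: select the s highest-scoring features and repeat.
        selected.update(sorted(scores, key=scores.get, reverse=True)[:s])
```

The backward elimination version would start from the full feature set and drop the e lowest-scoring features per phase instead.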
The bounded estimated Shapley Value becomes

$$\varphi_i(v) = \frac{1}{|\Pi_d|} \sum_{\pi \in \Pi_d} \Delta_i(S_i(\pi)) \qquad (2)$$
where $\Pi_d$ is the set of sampled permutations over subsets of size $d$. The use of bounded subsets, coupled with the sampling-based Shapley Value estimation, yields an efficient and robust way to estimate the contribution of a feature to the task of classification.

2.2 The Contribution-Selection algorithm

The Contribution-Selection algorithm (CSA), described in detail in Figure 1, is greedy in nature, and can adopt either a forward selection or a backward elimination approach. Using the sub-routine contribution, it ranks each of the features according to its contribution value, and then either selects the s features with the highest contribution values in forward selection (using the sub-routine selection), or eliminates the e features with the lowest contribution values in backward elimination (using elimination). It repeats these phases several times, each time calculating the contribution values of the remaining features given those already selected (or eliminated) and selecting (or eliminating) new features, until no remaining feature exceeds the contribution threshold in forward selection (or falls below it in backward elimination).
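As an illustration, the following sketch estimates the bounded contribution value of equation (2) for one feature under forward selection. The uniform sampling scheme over size-d subsets and the payoff signature are our assumptions for this sketch, not a prescription from the paper; the backward elimination variant modifies the payoff as noted in the text below.

```python
import random

def contribution(feature, selected, candidates, payoff, d=20, t=1500):
    """Estimate the bounded Shapley value of `feature` (equation (2)):
    the average marginal payoff gain of `feature` over t sampled
    permutations of feature subsets of size at most d."""
    others = [f for f in candidates if f != feature and f not in selected]
    total = 0.0
    for _ in range(t):
        # Sample a size-d subset containing `feature` and permute it.
        perm = random.sample(others, min(d - 1, len(others))) + [feature]
        random.shuffle(perm)
        preceding = set(perm[:perm.index(feature)])  # S_i(pi)
        # Delta_i(S) = v(S u {i}) - v(S)
        total += payoff(preceding | {feature}) - payoff(preceding)
    return total / t
```

In practice one would cache payoff evaluations, since each call trains a predictor; the parallelization remark in Section 4 applies exactly here, as the t sampled permutations are independent.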
The main idea of the algorithm is that the contribution sub-routine returns a contribution value for each feature according to how much it assists in improving the performance of a predictor generated by a specific induction algorithm, using the contribution value calculation described in Section 2.1. Using the notation of Section 2.1 and assuming one optimizes the accuracy level of the predictor, the contribution sub-routine for forward selection computes the contribution values with the following payoff function $v(S)$:

1. $S := S \cup selected$
2. Generate a predictor $f_S(x)$ from the training set Train
3. Evaluate the instances in Validation using $f_S(x)$
4. Return the accuracy level, defined as $v(S) = \frac{|\{x \mid f_S(x) = y,\ (x, y) \in Validation\}|}{|Validation|}$
Table 1: Description of datasets used.

Name           # Classes   Classes Dist.    # Features   Train Size   Test Size
Reuters1       3           0.34/0.47/0.19   1579         145          145
Reuters2       3           0.21/0.38/0.41   1587         164          164
Arrhythmia     2           0.56/0.44        278          280          140
Internet Ads   2           0.86/0.14        1558         2200         800
Dexter         2           0.5/0.5          20000        300          300
The case $S = \emptyset$ is an end case, handled by returning the number of instances in the largest class divided by the total number of instances (the accuracy of a predictor which always selects the most frequent class). Backward elimination is quite similar, except that the payoff is calculated only for the set of features sampled in the permutation, since features are eliminated rather than selected. The maximal permutation size $d$ plays an important role in determining the contribution values of the different features, and should be chosen so that different combinations of interacting features are inspected.
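To make steps 1-4 and the $S = \emptyset$ end case concrete, here is one possible payoff implementation for forward selection. Scikit-learn's clone and accuracy_score stand in for the induction algorithm and the accuracy evaluation, and every name here is our illustrative choice rather than the paper's code.

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def make_payoff(learner, X_train, y_train, X_val, y_val, selected):
    """Build v(S): validation accuracy of a predictor trained on the
    features S u selected (step 1); arrays are assumed to be numpy."""
    # Most frequent class in the training set, for the v(empty set) end case.
    majority = Counter(y_train).most_common(1)[0][0]

    def payoff(S):
        feats = sorted(set(S) | set(selected))
        if not feats:
            # End case: accuracy of a predictor that always outputs
            # the most frequent class.
            return float(np.mean(y_val == majority))
        model = clone(learner).fit(X_train[:, feats], y_train)   # step 2
        preds = model.predict(X_val[:, feats])                   # step 3
        return accuracy_score(y_val, preds)                      # step 4
    return payoff
```

Binding this payoff and the candidate set into the contribution routine of Section 2.1 (e.g. with functools.partial) yields the callable expected by the selection loop of Figure 1; the m-fold cross-validation variant used in Section 3 would average this accuracy over folds instead of using a single validation set.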
3 Results

3.1 The data and benchmark algorithms

To test the CSA empirically, we ran a number of experiments on the following real-world datasets (Table 1): the Reuters1 and Reuters2 datasets, both constructed following [7] using the Reuters-21578 document collection [10]; the Arrhythmia database from the UCI repository [9]; the Internet Advertisements database from the UCI repository [1, 8], collected for research on identifying advertisements in web pages; and the Dexter dataset from the NIPS 2003 workshop on feature selection³ [3]. For each of the datasets, an induction algorithm L was selected, as described in Table 2. Seven different algorithms were then compared on these datasets: the induction algorithm L without feature selection; linear SVM using the SVMlight package [5]⁴; filtering using mutual information and the Pearson correlation coefficient⁵; random forests feature selection [2]; the CSA with forward selection and d = 1 (d denoting the bound on permutation size), which mimics a common wrapper technique (with d = 1 the contribution of a feature reduces to its marginal gain given the selected set, i.e. standard greedy forward selection); and the CSA with forward selection and with backward elimination. After performing feature selection with the five latter algorithms (mutual information, Pearson correlation, random forests and the CSA variants), the induction algorithm L (Table 2) was used for classification. The CSA variants also used the algorithm L to perform the feature selection itself. The number of features selected with the filter methods and random forests was chosen to optimize the estimated accuracy level on the test set. To avoid overfitting on the validation set used for calculating the payoff with CSA, we used m-fold cross-validation instead of a single Validation set.
3.2 Feature selection and classification results

Table 3 summarizes the predictors' performance on the test set and the number of features selected in each of the experiments, using the CSA and the various benchmark algorithms for feature selection:

³ The Dexter dataset was constructed from the documents on corporate acquisitions in the Reuters collection. We binarized the data and removed features that appeared fewer than 3 times.
⁴ Datasets that had more than two classes were split into several binary classification problems.
⁵ We binned continuous domains to estimate the mutual information.
Table 2: The parameters and the predictor used with CSA for each dataset. s is the number of features selected in forward selection, e is the number of features eliminated in backward elimination, d is the permutation size and t is the number of permutations sampled to estimate the contribution values.

Dataset        Induction Alg. (L)   s (Fwd.)   e (Bwd.)   d    t
Reuters1       Naive Bayes          1          100        20   1500
Reuters2       Naive Bayes          1          100        20   1800
Arrhythmia     C4.5                 1          50         20   500
Internet Ads   1NN                  1          100        20   1500
Dexter         C4.5                 50         50         12   3500
The Reuters1 dataset. Feature selection using random forests did best, yielding an accuracy level of 100% with 30 features. Not far behind is CSA in its backward elimination version (98.6%). For another comparison, [7] report that the Markov Blanket algorithm yields approximately 600 selected features with accuracy levels of 95% to 96%.

The Reuters2 dataset. CSA with backward elimination did best, yielding an accuracy level of 93% with 109 features. For comparison, the Markov Blanket algorithm yields accuracy levels of 89% to 93% [7].

The Arrhythmia dataset. This dataset is considered a difficult one. CSA with backward elimination did best, yielding an accuracy level of 84% with 21 features. Forward selection with a higher depth value (d = 20) did better than with d = 1, implying that one should consider many features concomitantly to perform good feature selection for this dataset. For comparison, the grafting algorithm [9] yields an accuracy level of 75%.
The Internet Ads dataset. All the algorithms performed approximately the same, with accuracy levels of 94% to 96% and CSA slightly outperforming the others. Interestingly, with d = 1 the algorithm did not select any feature: the 1NN algorithm had neighbors from both classes, leading to arbitrary selection, and the predictor's performance remained constant. However, with higher depth values, the simple 1NN algorithm was enhanced to the point of outperforming classifiers such as SVM.
Dexter. For the Dexter dataset, we used the algorithm L (C4.5 decision trees) only for the process of feature selection, and linear SVM to perform the actual prediction on the selected features, because C4.5 did not give satisfying accuracy levels for any of the feature selection algorithms. To bridge the difference between these two predictors, we performed m-fold cross-validation on the dataset, in a way similar to the one used for optimizing the filter methods, to decide at which phase the CSA generalized best. The simple mutual information feature selection performed best, followed closely by CSA with backward elimination and random forests, implying that Dexter contains many relevant features, each contributing to the task of classification without interacting much with the others.
In summary, in 3 out of the 5 datasets, CSA with backward elimination achieved the best results.

3.3 A closer inspection of the results

The MSA, intent on capturing correctly the contribution of elements to a task, enables us to portray a prior picture of the distribution of the contribution values of the features (Figure 2).
Table 3: Comparison of accuracy levels and number of features selected in the different datasets. Upper table: No FS (no feature selection), SVM (linear SVM without feature selection), Corr (feature selection using Pearson correlation), MI (feature selection using mutual information), RF (feature selection using random forests). Bottom table: d = 1 (CSA with forward selection and d = 1), and Fwd./Bwd. (CSA with forward selection/backward elimination with parameters from Table 2). Accuracy levels are calculated by counting the number of misclassified instances. The numbers of features selected are given in brackets.

Dataset        No FS    SVM      Corr.         MI             RF
Reuters1       84.1%    94.4%    90.3% (20)    94.4% (20)     100% (30)
Reuters2       81.1%    91.4%    88.4% (20)    90.2% (5)      87.2% (21)
Arrhythmia     76.4%    80%      71.4% (20)    70% (20)       80% (40)
Internet Ads   94.7%    93.5%    94.2% (15)    95.75% (70)    95.6% (10)
Dexter         92.6%    92.6%    94% (230)     92.6% (1240)   93.3% (800)

Dataset        d = 1        Fwd.           Bwd.
Reuters1       92.4% (7)    96.5% (10)     98.6% (51)
Reuters2       91.4% (5)    90.1% (14)     93.2% (109)
Arrhythmia     70% (5)      74.2% (28)     84.2% (21)
Internet Ads   -            95.6% (8)      96.1% (158)
Dexter         80% (10)     92.6% (100)    93.3% (717)
[Figure 2 appears here: a log-log plot of the contribution-value (CV) distributions; x-axis: log CV, y-axis: log CV frequency; curves: Arrhythmia (slope = -1.28) and Dexter (slope = -1.20).]

Figure 2: Power-law distribution of the contribution values. This log-log plot of the distribution of the contribution values (absolute value) in the first phase for Arrhythmia and Dexter, prior to making any feature selection, demonstrates a power-law behavior. The behavior is similar for the rest of the datasets, with different slope values.
[Figure 3 appears here: two panels, (A) forward selection and (B) backward elimination on Arrhythmia; x-axes: Number of Selected Features (A) and Number of Eliminated Features (B); y-axes: Accuracy / CVs; curves: Accuracy on Validation, Accuracy on Test, and the scaled contribution values of the selected/eliminated features (CV x 10 in A, Average CV x 100 in B).]

Figure 3: Prediction accuracy and feature significance during forward selection (A) and backward elimination (B) for the Arrhythmia dataset. Both figures show how the performance of the C4.5 predictor improves on the validation set as the algorithm selects (eliminates) new features, while the contribution values of the selected features decrease (increase). The backward elimination also generalizes better on the test set through the algorithm's progress. The behavior for the other datasets is similar.
That distribution follows a scale-free power law, implying that large contribution values (in absolute value) are very rare while small ones are quite common, and justifying quantitatively the need for feature selection on these datasets. The behavior of the algorithm is demonstrated in Figure 3: after the forward selection algorithm identifies the significant features in the first few phases, there is a sharp decrease in the contribution values of the features selected in the following phases, while with backward elimination there is a gradual and rather stable increase in the contribution values of the eliminated features. Figures 2 and 3 also help explain why backward elimination outperforms several feature selection methods, including forward selection. Due to the high dimensionality of the datasets, a feature that assists in prediction could do so merely by coincidence, and yet be selected at the expense of other, truly informative features. Forward selection is penalized severely in such a case: among the few significant features, some will not be chosen. Backward elimination, however, always keeps the significant features in the non-eliminated set; a feature that truly enhances the predictor's generalization will do so for the validation set as well, and will not be eliminated. This leads to a more stable generalization behavior for backward elimination on the test set through the algorithm's progress (Figure 3).
4 Final notes

The Contribution-Selection algorithm presented in this paper views the task of feature selection in the context of coalitional games. It uses a wrapper technique combined with a novel ranking method, based on the Shapley contribution values of features to classification accuracy. The CSA works in an iterative manner, each time selecting (eliminating) new features while taking into account the features that were selected (eliminated) so far. Due to the extensive use of the classification algorithm, feature selection by CSA usually cannot be carried out with computationally intensive learners. This restriction can be mitigated by parallelizing the computation of payoffs for different permutations, an advantage not
shared by wrapper algorithms that use search methods such as hill climbing. Additionally, the number of permutations t sampled during each phase need not be constant, and the algorithm can be sped up significantly by gradually decreasing the number of sampled permutations as the number of candidate features for selection or elimination decreases along the algorithm's progress. Clearly, the restriction on selecting the learning algorithm for CSA does not apply to the prediction task once feature selection is completed, as demonstrated in Section 3.2 with the Dexter dataset. The CSA was tested on several datasets, and the results show that the algorithm can improve the performance of the classifier and successfully compete with state-of-the-art feature selection methods, especially in cases where the features interact with each other. This demonstrates the value of applying Game Theory concepts to feature selection, a potential which should be further explored.
References

[1] C.L. Blake and C.J. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/mlearn/MLRepository.html, 1998, University of California, Irvine, Dept. of Information and Computer Sciences.
[2] L. Breiman, Random forests, Machine Learning 45(1) (2001), 5-32.
[3] I. Guyon, Design of experiments for the NIPS 2003 variable selection benchmark, http://clopinet.com/isabelle/Projects/NIPS2003/, 2003.
[4] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, JMLR, Special Issue on Variable and Feature Selection 3 (2003), 1157-1182.
[5] T. Joachims, Making large-scale SVM learning practical, Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press (1999).
[6] A. Keinan, B. Sandbank, C. Hilgetag, I. Meilijson, and E. Ruppin, Fair attribution of functional contribution in artificial and biological networks, Neural Computation 16 (2004), no. 9.
[7] D. Koller and M. Sahami, Toward optimal feature selection, Proceedings of the 13th International Conference on Machine Learning (ML) (1996), 284-292.
[8] N. Kushmerick, Learning to remove internet advertisements, 3rd International Conference on Autonomous Agents (1999).
[9] S. Perkins, K. Lacker, and J. Theiler, Grafting: Fast, incremental feature selection by gradient descent in function space, JMLR 3 (2003), 1333-1356.
[10] Reuters, Reuters collection. Distributed for research purposes by David Lewis (1997).
[11] L.S. Shapley, A value for n-person games, Contributions to the Theory of Games (H.W. Kuhn and A.W. Tucker, eds.), Annals of Mathematics Studies 28, vol. II, Princeton University Press, Princeton, 1953, pp. 307-317.